How AI Will Shine a Light on Dark Data

Can AI make it possible to tap into dark data like never before? What are the risks in doing so?

May 9, 2023


AI (Artificial Intelligence) has the potential to shine a light on dark data by analyzing and interpreting vast amounts of unstructured data that were previously difficult or impossible to analyze with traditional methods. But buyer beware: not all data is created equal, warns Brian Platz, Fluree CEO & co-founder.

ChatGPT has brought new attention to the ability of generative AI to contextualize and order the Internet into simple summaries and easy answers. It’s also highlighted some of the dangers of relying too heavily on data you cannot see. Forums like Reddit quickly pushed back against AI-generated responses prone to errors. The fundamental problem is that existing tools like ChatGPT were trained on potentially untrustworthy data that was never vetted for accuracy, bias, quality, or meaning. We need to start thinking about how AI can help automate the process of turning this dark data into trusted, linked data. 

Transforming Dark Data

Today, considerable expertise is required to vet data due to legacy data management processes. We must develop new workflows and tools for understanding, cleaning, contextualizing and linking data. This is where new AI tools could help enterprises automate the processes of turning dark data locked into different applications and departmental silos into trusted linked data. We are still in the early days of these new AI-powered workflows. 

Enterprises today spend excessive time finding and ordering data when they create a new app, report, or decision engine. Leading enterprise vendors provide ERP, CRM, or transaction processing systems that organize data for a narrowly defined purpose. However, these tools require considerable additional heavy lifting around integration, metadata management, and data cleansing before they can support innovative new use cases or business models. And much of the quality and integrity of this enterprise data is still in question: a recent HFS Research survey found that 75% of executives don't trust their data.

The idea of big data took hold over the last couple of decades as enterprises explored ways to make sense of ever-growing data stores. Enterprises built data warehouses when they knew upfront how to structure data into fixed formats. Data lakes came along as a way to aggregate data that could be repurposed after the fact, but enterprises soon discovered that this, too, required a lot of work to structure, clean, and understand the data. And the data scientists and analysts who need access to data rarely need vast amounts of it; in most cases, they need access to a minimal, specific slice.

Today, the enterprise data industry is in the same place as in the early days of the World Wide Web when people had to manually curate links to other pages. Google quickly eclipsed giants like Yahoo! and Excite with a better way of automating the process of indexing and prioritizing information. Companies that figure out new ways of turning dark data into trusted linked data more efficiently with AI may see similar gains in the next wave of the Internet. 


From Linked Pages To Linked Data

It is common knowledge that Tim Berners-Lee introduced the web in the early 1990s, providing the infrastructure for finding information online. It's important to note, though, that the web's success was built on prior efforts to link knowledge. Doug Engelbart created the oN-Line System (NLS) in the early 1960s, but its steep learning curve limited its use to a few experts. Berners-Lee made his first effort to connect documents with an app called ENQUIRE while working as a temporary contractor at CERN in the early 1980s.

When he returned to work at CERN in 1984, he realized that considerable manual effort was required to keep links up to date. The introduction of HTML allowed publishers to structure information in a way that reduced the burden on everyone else. This improved structure provided the proper foundation for the Internet to grow into what it is today. 

However, he also envisioned the Internet as a connected semantic web where users could both read and write data. His early browser ran on powerful NeXT computers. The Mosaic browser that finally drove widespread adoption was designed for less capable machines and did not support linked data or identity, so it could only be used to read data, not write it. Linked data makes it easier to understand how data is connected, while identity is required to follow the data trail back to its source.

Since then, Berners-Lee has been pushing for new standards and tools for linking data to make it more valuable and accessible as part of the semantic web. Early examples include how a Google search for a movie can organize related information into cards showing nearby theaters where it is playing, its Rotten Tomatoes rating, its length, its actors, and a quick summary. Google Search has reported numerous cases where companies saw increases in traffic or time spent after adding structured data to their websites. For example, Nestlé increased click-throughs by 82% after adding structured data markup, while Rakuten found users spent 1.5x more time on pages with structured data.
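The structured data behind those movie cards is typically schema.org markup embedded in a page as JSON-LD. The sketch below shows what a minimal record of that kind might look like; all the values (title, rating, actor) are illustrative placeholders, not taken from a real listing.

```python
import json

# A minimal, illustrative schema.org "Movie" record -- the kind of
# structured data search engines parse to build rich result cards.
# Every value here is a placeholder.
movie = {
    "@context": "https://schema.org",
    "@type": "Movie",
    "name": "Example Movie",
    "duration": "PT2H10M",  # ISO 8601 duration: 2 hours 10 minutes
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "93",
        "bestRating": "100",
        "ratingCount": "250",
    },
    "actor": [{"@type": "Person", "name": "Jane Doe"}],
    "description": "A short plot summary.",
}

# Embedded in a page's HTML as a JSON-LD script block so crawlers can read it.
json_ld = '<script type="application/ld+json">%s</script>' % json.dumps(movie)
print(json_ld)
```

The point of the exercise is that once the data carries its own labels ("this string is a duration, this number is a rating"), a machine can aggregate it across sites without human curation.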

Automating Linked Data

Linked data tools work well for well-defined entities like movies, recipes, and restaurants, but they are harder to apply in other domains, such as tracking a customer journey across channels or linking supply chain and IoT data streams with third-party sources. This is where the next generation of AI augmentation will help shine a light on dark data. There are a couple of elements to this.

Running AI algorithms on existing data sets can provide a solid first pass at organizing data from many sources. One approach might be to explore ways to automate the FAIR Guiding Principles for scientific data management and stewardship, introduced in Scientific Data in 2016. The term refers to making data Findable, Accessible, Interoperable, and Reusable.
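As a rough illustration of what automating such a checklist could mean, the sketch below scores a dataset's metadata record against the four FAIR principles. The field names (identifier, access_url, schema, license) are assumptions chosen for this example, not part of the FAIR specification itself.

```python
# Hypothetical, simplified FAIR checklist: report which of the four
# principles a dataset's metadata record appears to satisfy.
# Field names are illustrative assumptions, not a FAIR standard.

def fair_report(record: dict) -> dict:
    return {
        "findable": bool(record.get("identifier")),    # persistent ID, e.g. a DOI
        "accessible": bool(record.get("access_url")),  # retrievable via a standard protocol
        "interoperable": bool(record.get("schema")),   # described with a shared vocabulary
        "reusable": bool(record.get("license")),       # clear usage license
    }

report = fair_report({
    "identifier": "doi:10.1234/example",
    "access_url": "https://data.example.com/sales.csv",
    "schema": "https://schema.org/Dataset",
    # no "license" field -> not yet reusable
})
print(report)
```

A rules-based pass like this only checks that metadata exists; the harder AI-assisted step is generating trustworthy values for the missing fields.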

But it's also essential to track the identity associated with the data. For example, you are more likely to trust a product review posted by Consumer Reports than one posted by Frankie456 on Amazon. Similarly, when making a loan decision, it would be helpful to follow a chain of data securely linked to banks, credit card reports, and government agencies.

Identities do not just apply to humans. The same identity infrastructure can connect data pulled directly from sensors in a supply chain to track where and how products were harvested, produced, and shipped. For example, Fluree and Sinisana Technologies recently showed how identity can improve verification and traceability for Halal food across supply chains. Sinisana has already seen cost savings of more than 50% compared to past manual approaches to managing data and its associated identities.
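A toy sketch of the underlying idea: each supply-chain step signs its record with its own key (standing in for a real digital-identity credential), and each record links to the previous one by hash, so the whole trail can be verified back to its source. The keys, step names, and record fields are all illustrative; production systems use proper credential and ledger infrastructure rather than shared secrets.

```python
import hashlib
import hmac
import json

# Illustrative identity keys; real systems would use per-party credentials.
KEYS = {"farm": b"farm-secret", "processor": b"proc-secret"}

def append_record(chain: list, identity: str, payload: dict) -> None:
    """Sign a record with the identity's key and link it to the previous record."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps({"identity": identity, "payload": payload,
                       "prev": prev_hash}, sort_keys=True)
    chain.append({
        "body": body,
        "signature": hmac.new(KEYS[identity], body.encode(), hashlib.sha256).hexdigest(),
        "hash": hashlib.sha256(body.encode()).hexdigest(),
    })

def verify(chain: list) -> bool:
    """Check every signature and every hash link from the first record on."""
    prev_hash = "0" * 64
    for rec in chain:
        body = json.loads(rec["body"])
        if body["prev"] != prev_hash:
            return False  # broken link in the chain
        expected = hmac.new(KEYS[body["identity"]], rec["body"].encode(),
                            hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, rec["signature"]):
            return False  # record was altered or signed by the wrong party
        prev_hash = rec["hash"]
    return True

chain: list = []
append_record(chain, "farm", {"event": "harvested", "lot": "A1"})
append_record(chain, "processor", {"event": "packaged", "lot": "A1"})
print(verify(chain))  # True
```

Altering any record after the fact breaks either its signature or the hash link to the next record, which is what makes the data trail auditable.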

A Connected Data Ecosystem

Finding ways to automate this process will help enterprises gain the same advantages Google saw with automated approaches to ranking pages. Automating the process of linking data connected to verified identities will open new opportunities for efficiently using more data for different use cases. We need to find better ways to transform data, develop processes for labeling it, and prioritize data quality efforts for subject matter experts to review. 

In the short run, this will provide a way to automate the data cleansing and labeling needed to move siloed data into a connected data ecosystem. Today, expensive data scientists and engineers must write custom code to process data, and those scripts break and must be fixed each time the data gets updated. AI automation can help enterprises create linked data sets at scale. Over time, it will provide a path to bring dark data into the light, making it more accessible both within the enterprise and across the decentralized web.


Brian Platz
Brian Platz is the co-founder and CEO of Fluree, PBC, a North Carolina-based public benefit corporation focused on transforming data security, ownership, and access with a scalable blockchain graph database. Platz was an entrepreneur and executive throughout the early internet days and the SaaS boom, having founded the popular A List Apart web development community along with a host of successful SaaS companies. Prior to establishing Fluree, he co-founded SilkRoad Technology, which grew to over 2,000 customers and 500 employees in 12 global offices.