The Challenge Of Getting Clean Data

AI systems are only as good as the data that you feed them
Image Credit: dirkcuys

The person with the CIO job has a problem on their hands. The world has changed and a new breed of technology has rolled into the IT department: artificial intelligence (AI). The entire company is getting excited about the importance of information technology and the possibilities that AI can bring: processing large volumes of data, accurate forecasts, and improved automation. However, the CIO knows that there is a catch: AI systems are only as good as the data that you feed to them. It turns out that most of us have a lot of “dirty data”.


The Problem With Data

So what’s the big deal with our data? Data is just data, right? Well, actually no. It is the CIO’s job to manage the company’s data properly so that both errors and bias can be avoided. This matters more than ever because your data is what you will be feeding to your AI applications. CIOs are looking to their AI applications to do a lot of wonderful things. These include diagnosing problems with the company’s production systems, automating processes, and making customer service better. However, AI won’t be able to do any of these things if you have not fed it high-quality data.

So what’s the problem? Well, it turns out that the data that the IT department has is, more often than not, stored in many different formats. We also tend to store our data in multiple data centers, often based on where and how the data was produced or collected. Just to make things a bit more difficult, it is not unusual for us to have multiple copies of the same data. The problem that CIOs are facing is that if they make the mistake of feeding their AI applications dirty data that is not complete, not current, not consistent, or, even worse, not accurate, then the company’s AI systems may make decisions that are either biased or erroneous.
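To make those categories of dirty data a bit more concrete, here is a minimal sketch of an automated check for three of them: incomplete records, stale records, and duplicate copies. The field names (`sku`, `name`, `weight_kg`, `updated`) and the one-year freshness window are assumptions made up for illustration, not something a particular product or library requires. Accuracy is left out on purpose, since checking it usually needs a trusted reference to compare against.

```python
from datetime import date, timedelta

# Hypothetical field names; a real data set would define its own schema.
REQUIRED_FIELDS = ("sku", "name", "weight_kg", "updated")


def find_quality_issues(records, today, max_age_days=365):
    """Return a list of (record_index, issue) pairs for a list of dict records."""
    issues = []
    seen_skus = {}
    cutoff = today - timedelta(days=max_age_days)
    for index, record in enumerate(records):
        # Completeness: every required field must be present and non-empty.
        for field in REQUIRED_FIELDS:
            if record.get(field) in (None, ""):
                issues.append((index, f"missing {field}"))
        # Currency: flag records that have not been updated recently.
        updated = record.get("updated")
        if isinstance(updated, date) and updated < cutoff:
            issues.append((index, "stale"))
        # Duplicates: the same key showing up more than once.
        sku = record.get("sku")
        if sku in seen_skus:
            issues.append((index, f"duplicate of record {seen_skus[sku]}"))
        elif sku not in (None, ""):
            seen_skus[sku] = index
    return issues


records = [
    {"sku": "A1", "name": "Bolt", "weight_kg": 0.01, "updated": date(2024, 5, 1)},
    {"sku": "A1", "name": "Bolt", "weight_kg": 0.01, "updated": date(2020, 1, 1)},
    {"sku": "B2", "name": "", "weight_kg": 0.5, "updated": date(2024, 4, 1)},
]
issues = find_quality_issues(records, today=date(2024, 6, 1))
```

A report like this does not fix anything by itself, but it does tell you how dirty a data set is before you feed it to an AI application.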

Many companies have some form of an internal digital library that is used by their employees. Often these libraries have listings and descriptions of different things. Employees will often download data from these libraries in order to complete their work assignments. The key to making a library like this a success is to make sure that its users are able to quickly find the information that they need. In most IT departments, a manual data cleaning process is required to make sure that the data in the library is correct. The problem is that often none of the data coming into the library is standardized. Specifications, spellings, weights, sizes, etc. can all be described in many different but similar ways.
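As a small illustration of what "many different but similar ways" looks like in practice, consider weights. The sketch below converts a handful of free-form weight strings to kilograms. The list of unit spellings is an assumption for the example; a real library would need a much longer list, and anything the function does not recognize is returned as `None` so that a person can review it.

```python
import re

# Conversion factors to kilograms. The unit spellings are illustrative only.
UNIT_TO_KG = {
    "kg": 1.0, "kgs": 1.0, "kilogram": 1.0, "kilograms": 1.0,
    "g": 0.001, "gram": 0.001, "grams": 0.001,
    "lb": 0.45359237, "lbs": 0.45359237,
    "pound": 0.45359237, "pounds": 0.45359237,
}

# A number, optional spaces, a unit word, and an optional trailing period.
WEIGHT_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\.?\s*$", re.IGNORECASE)


def normalize_weight(text):
    """Convert a free-form weight such as '2500 g' to kilograms, or return None."""
    match = WEIGHT_PATTERN.match(text)
    if match is None:
        return None
    value = float(match.group(1))
    factor = UNIT_TO_KG.get(match.group(2).lower())
    if factor is None:
        return None  # Unknown unit: leave it for manual review.
    return round(value * factor, 6)


weights = [normalize_weight(w) for w in ("2.5 kg", "2500g", "5 LBS.", "heavy")]
```

Here "2.5 kg" and "2500g" end up as the same value, which is exactly the kind of standardization that a manual cleaning process has to do by hand today.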


Solving Our Data Problem

The problem that CIOs are facing is that the source of each data stream is messy. Each source creates its own data, and the sources don’t seem to use the same set of standards. This results in multiple layers of data, and the level of detail can keep increasing as the amount of data increases. CIOs need to understand that they have a significant issue on their hands. The manual process for cleaning this data can only go on for so long. When the library reaches a given size, new ways of dealing with the data will have to be implemented.

What CIOs need are automated tools that will be able to process the data that is to be added to their libraries. The goal of these applications will be to replace and automate the human tasks that have been used to clean the data up. The new applications will be required to help the IT department collect their data. Once it has been collected, the software can then process it in order to create standardized data and then prepare it to be added to the digital library. One of the biggest benefits of software like this is that it can transform the data based on how users search it in order to allow users to more easily find what they are looking for.
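The collect, standardize, and prepare flow described above can be sketched as a short chain of cleaning steps that every incoming record passes through. This is only a rough outline under stated assumptions: the step functions, the `name` field, and the `search_terms` field are all hypothetical names invented for this example, and a commercial data cleansing tool would do far more.

```python
def strip_whitespace(record):
    """Trim stray spaces from every text value."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in record.items()
    }


def lowercase_keys(record):
    """Make field names consistent across sources ('SKU' and 'sku' become 'sku')."""
    return {key.lower(): value for key, value in record.items()}


def add_search_terms(record):
    """Add lowercase keywords so users can find the entry by any word in its name."""
    cleaned = dict(record)
    words = str(record.get("name", "")).lower().split()
    cleaned["search_terms"] = sorted(set(words))
    return cleaned


# The order matters: keys are lowercased before 'name' is looked up.
CLEANING_STEPS = [strip_whitespace, lowercase_keys, add_search_terms]


def clean_stream(records, steps=CLEANING_STEPS):
    """Yield each incoming record after running it through every step in order."""
    for record in records:
        for step in steps:
            record = step(record)
        yield record


incoming = [{"Name": "  Hex Bolt ", "SKU": "a1 "}]
cleaned = list(clean_stream(incoming))
```

Keeping each step as its own small function means that a new rule can be added to the list without touching the rest of the process.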

Just cleaning up data is not enough. As long as CIOs are going to make the investment in software that will process the streams of data that are being added to their digital libraries, they need to do more. The ultimate goal of this process is to allow the applications that are processing the data to be able to spot relationships. These relationships can include things such as misspelled words that refer to the same item. The applications can also find abbreviated words that might sound similar when spoken by humans. Adding this kind of relationship data to the library can be very valuable. Additionally, with a little bit of luck most of these data cleansing applications will not require any coding on the part of the IT department.
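For readers who are curious about what spotting these relationships might look like under the hood, here is one possible sketch. It tries spelling similarity first, using `difflib.get_close_matches` from the Python standard library, and then falls back to a sound-alike comparison based on a Soundex code. The list of canonical terms and the 0.8 cutoff are assumptions picked for the example, and this is not a claim about how any particular data cleansing product works.

```python
import difflib

# Soundex digit for each coded consonant; vowels, h, w, and y are not coded.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex code of a word, or '' if it has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        code = SOUNDEX_CODES.get(c, "")
        if code and code != previous:
            result += code
        if c not in "hw":  # h and w do not separate letters with the same code
            previous = code
    return (result + "000")[:4]


def match_term(term, canonical_terms, cutoff=0.8):
    """Return the canonical term that `term` most likely refers to, or None."""
    lowered = {c.lower(): c for c in canonical_terms}
    # Step 1: catch misspellings by spelling similarity.
    close = difflib.get_close_matches(term.lower(), list(lowered), n=1, cutoff=cutoff)
    if close:
        return lowered[close[0]]
    # Step 2: catch abbreviations that sound like a known term.
    target = soundex(term)
    for canonical in canonical_terms:
        if target and soundex(canonical) == target:
            return canonical
    return None


materials = ["stainless", "aluminum", "titanium"]
matches = {t: match_term(t, materials) for t in ("stainles", "stnls", "copper")}
```

In this example both the misspelling "stainles" and the abbreviation "stnls" are linked to "stainless", while "copper" is left unmatched. Soundex is a fairly blunt instrument and will produce false matches on a large vocabulary, so treat matches like these as suggestions to review and not as final answers.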


What All Of This Means For You

There is no question that CIOs are currently sitting on a mountain of data. However, all of that data is not going to do them any good on its own. They need to find ways to process it so that the company can use what the data is trying to tell them. One of the best ways to process large quantities of data is by using AI applications to sift through it. However, CIOs need to be aware that they are facing a significant problem. Their AI applications can only function correctly if they are fed good data. Much of the data that CIOs currently have is dirty data.

The challenge that CIOs have on their hands is that AI applications can do a lot of different wonderful things. They can automate parts of the company and they can make very accurate forecasts. However, they can’t do any of these things if they have been fed dirty data. If the data that they get is bad, then their results will contain errors and bias. The problem with our data is that it is being stored in multiple formats, in different data centers, and we often have multiple copies of the same data. Most companies maintain a digital library that they feed with their data streams. However, they have to use a manual process to clean their data. The problem is a result of the fact that each data stream has its own way of organizing its data. Once the company’s digital library gets big enough, a manual cleaning process is no longer practical. CIOs need automated tools that can be used to process the data streams that are being added to their libraries. These tools can standardize the data and find relationships between pieces of data.

The job of the CIO is to support the rest of the company. If we do our job well, we can help the company to run smoothly and automate many of its processes. One of the best ways to accomplish all of this is by using all of the data that we have collected to help the company to make good decisions. However, a lot of the data that we have is unusable because it is dirty: it is in multiple formats and stored in many different locations. With the proper applications, we can clean up our data and then feed it into our AI applications. Once we’ve done this, we can sit back and allow our AI applications to tell us how to make our company run more smoothly.


– Dr. Jim Anderson Blue Elephant Consulting –
Your Source For Real World IT Department Leadership Skills™


Question For You: Should CIOs create a production line to have their data scrubbed and cleaned before being processed?


