Stop data pollution from turning your company’s data lake into a swamp

This article was contributed by Kevin Campbell, CEO of Syniti

Today, every organization is a data organization. Whether you work for a tech company in Silicon Valley, an established manufacturer, a legacy financial services firm, or even a government agency, your company is collecting, storing, and aiming to use more data than ever before.

Globally, we are in the middle of a data explosion right now; the total global volume of enterprise data is projected to double from 1,005 to 2,025 terabytes between 2020 and 2022. It’s no wonder that many organizations are playing a game of perpetual catch-up, lacking the knowledge and tools to effectively manage the data they’re collecting so it’s actually useful.

To handle this data deluge, many enterprises turn to data lakes, instead of a standard data warehouse. In theory, data lakes give businesses the upper hand in terms of scalability, flexibility, and integration with technologies like IoT. However, rather than a pristine data lake, many organizations end up with something more like a stagnant data swamp, full of murky data pollution. So, what can you do to prevent the swamp and take full advantage of your data?

1. Pick the most important company data…and get (nearly) everyone to agree

I have seven kids, so as a dad, of course, I love all my kids the same. The same isn’t true for data. Stop treating all of your company’s data as if it has the same level of importance. Trust me, it doesn’t.

You need to decide — along with some key stakeholders — what data is the most important to your organization and its goals. You can’t possibly cover all your data, and dumping all of it into the data lake is the quickest way to create a swamp. So, come up with the data that’s driving the company and delivering wider business value – driving efficiencies, enhancing the customer experience, informing product development – and designate those to be your KPIs and success metrics.

Once you’ve got those key success metrics and the most important data, make sure you socialize it with key stakeholders, so you have that buy-in. Here are some questions to ask:

  • What are our KPIs?
  • What are the metrics that we will measure?
  • Do we understand the formulas for calculating them?
  • What rules are required for how data gets pulled into these metrics?
  • What systems does our data reside in?

Think about creating a data charter that clearly states the above so that everyone can refer back to it and to help ground your overall data strategy.
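One way to keep a charter honest is to record it in a structured form so that unanswered questions are easy to spot. The sketch below is only an illustration of that idea: the `Metric` and `DataCharter` classes, their field names, and the sample metrics are all hypothetical, chosen to mirror the questions listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Metric:
    """One KPI in the charter (all field names are illustrative assumptions)."""
    name: str
    formula: str                      # how the metric is calculated, in plain words
    source_systems: List[str]         # systems the underlying data resides in
    inclusion_rules: List[str] = field(default_factory=list)  # rules for what data counts


@dataclass
class DataCharter:
    metrics: List[Metric]

    def gaps(self) -> Dict[str, List[str]]:
        """Return, per metric, the charter questions that are still unanswered."""
        problems: Dict[str, List[str]] = {}
        for metric in self.metrics:
            missing = []
            if not metric.formula:
                missing.append("formula")
            if not metric.source_systems:
                missing.append("source_systems")
            if not metric.inclusion_rules:
                missing.append("inclusion_rules")
            if missing:
                problems[metric.name] = missing
        return problems


# Hypothetical example: one fully described metric, one still undefined.
charter = DataCharter(metrics=[
    Metric("churn_rate", "customers lost / customers at start of period",
           ["crm"], ["active accounts only"]),
    Metric("nps", "", [], []),
])
print(charter.gaps())
```

Stakeholders can review the reported gaps together, which gives the buy-in conversation something concrete to work from.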

2. Know thy data

So, you’ve picked the most important, business-critical data, and you’ve gotten agreement on it from key folks in your organization. What’s next? To paraphrase some wise Greek philosopher, you need to know thy data – how is it created? Where is it entered? How is it being maintained?

Take stock of where your company’s important data is coming from, and how and where it’s entered into your systems. From there, ensure the data that you’re storing is accurate; effective and regular cleansing will suppress or modify data that are incorrect, incomplete, irrelevant, or improperly formatted. Make sure you include processes for getting rid of duplicates and merging various datasets. Deduplication may not be the sexiest thing in data, but it’s one of the most important – and, done well, it can save you a ton of money and resources.
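As a rough illustration of cleansing and deduplication, the sketch below normalizes a few customer fields and merges records that share an email address. The field names, the choice of email as the matching key, and the sample records are assumptions made for this example; real matching rules depend on your data and are usually more involved.

```python
import re


def normalize(record):
    """Return a cleaned copy of a record (field names are hypothetical)."""
    return {
        "name": re.sub(r"\s+", " ", record.get("name", "")).strip(),
        "email": record.get("email", "").strip().lower(),
        "phone": re.sub(r"\D", "", record.get("phone", "")),  # keep digits only
    }


def deduplicate(records):
    """Merge records sharing a normalized email; set aside records with no email."""
    merged = {}
    rejected = []
    for raw in records:
        rec = normalize(raw)
        key = rec["email"]
        if not key:
            rejected.append(raw)      # incomplete: route to manual review
            continue
        if key not in merged:
            merged[key] = rec
            continue
        # Duplicate found: fill any blank fields from the later record.
        for field_name, value in rec.items():
            if not merged[key][field_name] and value:
                merged[key][field_name] = value
    return list(merged.values()), rejected


# Hypothetical sample data.
records = [
    {"name": " Ada  Lovelace ", "email": "Ada@Example.com ", "phone": ""},
    {"name": "Ada Lovelace", "email": "ada@example.com", "phone": "+1 (555) 010-0000"},
    {"name": "No Email", "email": "", "phone": "123"},
]
clean, rejected = deduplicate(records)
print(clean)
print(rejected)
```

Here the two Ada Lovelace rows collapse into one record that keeps the phone number from the second row, while the row with no email is set aside rather than silently dropped.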

Given the variety of databases, file formats, and structures involved, this is going to take time and work, but don’t overlook this step. It’s crucial to remove internal silos and create truly valuable data. Proper maintenance and point-of-entry implementations that keep duplicate records and bad addresses out are non-negotiable. Without these, your lake will become a swamp again before you know it. Organizations make this mistake far too often.
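A point-of-entry check can be as simple as refusing to store a record until it passes a few rules. The sketch below is a minimal, hypothetical example: the `CustomerStore` class and its rules (a crude email shape check, a required postal code, no duplicate emails) are assumptions for illustration, and a real system would likely rely on a proper address verification service.

```python
class CustomerStore:
    """In-memory store that validates records before accepting them (illustrative only)."""

    def __init__(self):
        self._by_email = {}

    def __len__(self):
        return len(self._by_email)

    def add(self, name, email, postal_code):
        """Try to add a customer; return a list of problems (empty means accepted)."""
        errors = []
        key = email.strip().lower()
        # Very rough shape check, not full email validation.
        if "@" not in key or key.startswith("@") or key.endswith("@"):
            errors.append("invalid email")
        if not postal_code.strip():
            errors.append("missing postal code")
        if key in self._by_email:
            errors.append("duplicate record")
        if errors:
            return errors
        self._by_email[key] = {"name": name.strip(), "postal_code": postal_code.strip()}
        return []


store = CustomerStore()
print(store.add("Ada", "ada@example.com", "SW1A 1AA"))      # accepted
print(store.add("Ada L.", " ADA@example.com", "SW1A 1AA"))  # same email, different casing
print(store.add("Bob", "bob-at-example", ""))               # bad email and no postal code
```

Rejecting bad records at the door is cheaper than cleaning them out of the lake later, which is the point of doing this at entry rather than downstream.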

3. Governance is critical for company data

I know. Governance is often seen as controlling, slow and limiting. But in reality, it helps assign authority and control over data assets, so that data is consistent and can be used across an organization.

For many businesses, customer success is one of the most essential KPIs. Truly understanding the entire customer lifecycle means going all the way back to the first marketing contact. Who creates and establishes that customer record?

Without proper governance, we could have multiple numbers for the same customer, which dilutes the information we have, prevents us from making smart data-driven decisions, and potentially mucks up our ability to deliver a great customer experience.

Good governance should also support compliance with any regulation that affects your organization, whether it’s HIPAA, GDPR, CCPA, POPI, LGPD, or beyond.

That data charter referenced earlier can serve as the cornerstone of your governance strategy. As a data program continues, it’s easy to lose sight of your initial goals. Make sure you regularly refer back to the charter, so that those goals remain top-of-mind for all stakeholders. Equally, it’s important not to be too rigid, so if your organization’s requirements change, adjust your data charter accordingly.

Last but not least, transparency is crucial. Internally, this means clear communication between all stakeholders, allowing different departments to impart their knowledge, whilst driving transparency and accountability for maintaining data quality.

Externally, it’s imperative to be completely transparent about what customer and prospect data your company is collecting. The most obvious reason for this is to avoid falling foul of regulators – Google, WhatsApp, and CaixaBank have all received multi-million-euro fines for violating GDPR transparency clauses. It’s just not worth it.

The more data, the better? Not necessarily

More data isn’t always better. Companies should be cautious about collecting and storing data for which they have limited tangible use. Not only does this present security, privacy, and compliance risks, storing and managing such data also represents an unnecessary expense. Instead, focus on data that has value and utility – you probably have more than enough of it already!

Clean, usable, and valuable data has the potential to foster new business growth, streamline operations, enhance customer relationships and boost agility. Who wouldn’t want that?

For more than three decades, Kevin Campbell has been passionately driving innovation and growth at global Fortune 500 and start-up organizations. Currently, he serves as the CEO of Syniti.
