How Can Synthetic Data Impact Data Privacy in the New World of AI

Unlock AI’s potential with synthetic data: Privacy-compliant, unbiased, and revolutionary for training.

February 12, 2024

In the era of advancing AI and growing data privacy concerns, Steve Harris, CEO of Mindtech, explores the transformative potential of synthetic data. Discover how it addresses privacy challenges, minimizes bias, and revolutionizes AI training, offering a promising path forward for the evolving landscape.

Data Privacy Week is an international effort to spread awareness of data privacy for individuals and organizations. As AI’s development continues to rocket and regulations try to keep up, the week’s arrival couldn’t be better timed.

The use of copyrighted material to train generative AI models like ChatGPT has come under increasing scrutiny over the last year. This has resulted in lawsuits by companies such as Getty Images and the New York Times against AI firms over such practices, which are sure to be talking points this year. In response, OpenAI has stated it can’t train its models without such content.

Collecting data in the real world can bring about further problems, such as needing to attain model releases and consent when filming in public spaces. Moreover, government regulations and legislative processes like the EU AI Act further complicate real-world data collection. How companies interpret this ethical landscape can vary case by case – there is no universal understanding of how to approach it. 

Ultimately, how AI is regulated is for governments and policymakers to decide. However, the rise of synthetic data provides a clear route to tackling the many dilemmas in collecting real-world data. 

Overcoming the Privacy Challenge With Synthetic Data

With these legal battles and ethical considerations affecting the availability of real-world data, the industry is facing a need for alternative ways of using data to train AI models. Not only is it dealing with privacy concerns, but this lack of diverse and representative real-world training data can exacerbate bias and discrimination. Therefore, we can expect a steep increase in interest in synthetic images and training data for the coming year. 

By its nature, synthetic data is privacy-compliant and enables rapid and cost-effective production without relying solely on real-world data. It also plays a key part in testing AI models, particularly for activities such as ID verification, allowing developers to assess data for any incorrect details. As a result, sectors dependent on generative AI technologies like ChatGPT are particularly set to benefit from synthetic data. 

But how exactly does it do this? 

As its name implies, synthetic data and images are artificially created by computer systems to structurally and statistically mirror real-world data. This happens using advanced statistical modeling techniques and algorithms to reflect real-world patterns, structures, and environments without relying on genuine data. Crucially, it removes the need to reveal or transmit personally identifiable information (PII). As a result, developers can generate large volumes of highly accurate and diverse datasets, including data that could be too costly or unattainable in the real world. 
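
As a rough illustration of the idea of statistically mirroring real data, the sketch below fits simple summary statistics to a handful of records and then samples new rows from them. It is a deliberately minimal example and not how any particular vendor generates synthetic data: the records, the column meanings (age and annual income), and the function names `fit_marginals` and `sample_synthetic` are all invented for illustration, and treating each column as an independent Gaussian ignores the correlations that production tools try to preserve.

```python
import random
import statistics


def fit_marginals(rows):
    """Return a (mean, standard deviation) pair for each numeric column."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows, one independent Gaussian per column.

    Assumption: columns are independent and roughly normal. This keeps
    the sketch short but discards any correlation between columns.
    """
    rng = random.Random(seed)
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params) for _ in range(n)]


# Hypothetical "real" records of (age, annual income); the values are invented.
real_rows = [
    (34, 52000.0),
    (45, 61000.0),
    (29, 48000.0),
    (52, 75000.0),
    (41, 58000.0),
]

params = fit_marginals(real_rows)
synthetic_rows = sample_synthetic(params, n=1000)
```

Only the fitted statistics are passed to the sampler, so no original record is copied into the synthetic set. That alone is not a formal privacy guarantee, since statistics fitted to very small or unusual groups can still leak information.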

As synthetic data replaces the need for massive amounts of real data while maintaining privacy, the opportunities for its use are widespread. 

Can You Tell the Difference?

While opportunities may be abundant, persuading companies and governments of the credibility of synthetic data is still a challenge. They need a clear overview of what it can do well. At the heart of this challenge is finding ways of convincing stakeholders to welcome new techniques instead of clinging to current processes. If organizations are only familiar with past versions of synthetic data, for instance, they may not be aware of its newfound abilities.

Synthetic data has improved massively – even in the last 18 months, considerable advancements in its realism have occurred. The continual improvement in the specifics of synthetic images suggests that some synthetic data may be indistinguishable from real-world images this year or next.

There are certain use cases, such as using 2D faces for ID verifications, where there is no visible difference between real and synthetic data. Visual fidelity has been pushed through advances in the processing power of the GPUs used for rendering, including wider adoption of techniques such as ray-tracing by the underlying software made feasible by the GPU hardware accelerators. 

While this visual fidelity is apparent in individual items, achieving overall scene complexity, which involves many synthetic visualizations, could be a few years away. While activities such as animation and predicting human actions are feasible in a photorealistic manner today, the processing power needed to create the required volumes of data may prove uneconomic in the near term.

Use Cases

1. Construction and digital humans

Data from the real world is often inherently biased. This is because the data used to train models is largely gathered from across the internet, reflecting biases present in society and the socio-economic groups prevalent in the social media spaces used to gather this data. Data scientists have turned to synthetic data and ‘Digital Humans’ to combat these biases.

With Digital Humans, data scientists can vary elements of ‘Digital DNA,’ such as ethnicity, size, and clothing, and mix the results with real-world data to create more representative and diverse datasets. Of course, this also protects image rights and avoids the PII exposure that could come from using images and footage of people in the real world.

Mindtech worked with a construction company that wanted to develop autonomous site vehicles. The company wanted to enhance these vehicles’ safety and accrue a broader range of data to train them. As a result, it used synthetic data to create diverse datasets to train these vehicles to identify people on site regardless of size, shape, sex, ethnicity, or clothing, so that the vehicles could stop if someone were blocking their way.

As there are countless combinations of lighting, weather, objects, people’s movements, and further factors at play, using synthetic data meant the company could test a far greater range of scenarios than traditional data in a safe and controlled environment.
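
To make the idea of covering many combinations concrete, here is a minimal sketch of how a scenario grid could be enumerated before any images are rendered. The factor names and values in `FACTORS`, and the helper names `all_scenarios` and `sample_scenarios`, are hypothetical and chosen only for illustration; a real synthetic data platform would expose its own parameters.

```python
import itertools
import random

# Hypothetical scenario factors; the names and values are assumptions.
FACTORS = {
    "lighting": ["dawn", "midday", "dusk", "night"],
    "weather": ["clear", "rain", "fog"],
    "clothing": ["hi-vis vest", "dark jacket", "rain coat"],
    "body_size": ["small", "medium", "large"],
}


def all_scenarios(factors):
    """Return every combination of factor values as a list of dicts."""
    keys = list(factors)
    return [dict(zip(keys, combo)) for combo in itertools.product(*factors.values())]


def sample_scenarios(factors, n, seed=0):
    """Pick n distinct scenarios at random when the full grid is too large to render."""
    rng = random.Random(seed)
    return rng.sample(all_scenarios(factors), n)


scenarios = all_scenarios(FACTORS)        # the full grid of combinations
subset = sample_scenarios(FACTORS, n=10)  # a smaller batch to render first
```

Because the full grid contains every value of every factor equally often, a dataset rendered from it is balanced across those factors by construction, which is much harder to arrange when filming on a real site.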

2. Identity document recognition

Synthetic data is also being used to help train identity document recognition systems. For these systems to work well, they generally require huge amounts of diverse and accurate training data, which is incredibly difficult to obtain in the current environment. By training on synthetic data that covers corner cases, developers can create a robust model that performs well when faced with varying visual factors and conditions such as lighting and image distortions.

When training such systems, using real identity documents presents a substantial risk of exposing an individual’s PII. With synthetic data, developers can produce identical documents that put no personal data at risk. It bolsters data privacy and allows developers to generate a wide range of identity documents, such as passports and driving licenses, worldwide. This means the system can be trained to accurately recognize documents from any region. 
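
As a simple sketch of the principle, the snippet below generates the text fields of made-up identity records that could then be rendered onto document templates. Everything in it is an assumption made for illustration: the field names, the placeholder name lists, the nine-character document number, and the helper `synthetic_id_record` do not correspond to any real document standard or product.

```python
import random
import string
from datetime import date, timedelta

# Placeholder name lists; none of these refer to real people.
GIVEN_NAMES = ["Alex", "Sam", "Jordan", "Taylor"]
FAMILY_NAMES = ["Example", "Sample", "Placeholder", "Specimen"]
DOCUMENT_TYPES = ["passport", "driving_license"]


def synthetic_id_record(rng):
    """Build one invented identity record containing no real personal data."""
    birth = date(1950, 1, 1) + timedelta(days=rng.randrange(0, 20000))
    number = "".join(rng.choices(string.ascii_uppercase + string.digits, k=9))
    return {
        "document_type": rng.choice(DOCUMENT_TYPES),
        "given_name": rng.choice(GIVEN_NAMES),
        "family_name": rng.choice(FAMILY_NAMES),
        "date_of_birth": birth.isoformat(),
        "document_number": number,  # hypothetical format, not a real standard
    }


rng = random.Random(42)
records = [synthetic_id_record(rng) for _ in range(5)]
```

A full pipeline would also vary layout, fonts, lighting, and distortion when rendering each record as an image, which is where the corner cases mentioned above come from.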

Synthetic Data as the Cornerstone for AI Development

It’s worth stressing here that humans have a vital part to play in overseeing the creation and validation of synthetic data. Data scientists can ensure that it accurately reflects the real world, can adjust datasets to mitigate bias and improve accuracy, and can assess any privacy implications that may be in play.

Real-world data is also still vital in building synthetic data and training AI models. However, synthetic data can massively reduce reliance on its use and remove any exposure of PII and copyright issues. It can reflect the real world more accurately than real-world data alone. 

Synthetic data provides a way forward for tackling the industry’s tall task of finding privacy-compliant solutions for the new world of AI. 

How can synthetic data reshape AI? Let us know on Facebook, X, and LinkedIn. We’d love to hear from you!

Image Source: Shutterstock

Steve Harris
Steve Harris, CEO of Mindtech, has over 30 years of experience in the technology market sector and holds a master’s in Microprocessor Engineering from Manchester University. He has previously been instrumental in creating several European start-up organisations, with a proven track record of success in building strategic relationships and strong revenue streams with tier one companies worldwide. Prior to his current role, he worked in a number of senior sales and business development positions at leading technology companies such as Imagination Technologies, Gemstar, Liberate, and Sun Microsystems, allowing him to bring a wealth of insight and expertise to Mindtech.