
Taking the world by simulation: The rise of synthetic data in AI



Would you trust AI that has been trained on synthetic data, as opposed to real-world data? You may not know it, but you probably already do — and that’s fine, according to the findings of a newly released survey.

The scarcity of high-quality, domain-specific datasets for testing and training AI applications has left teams scrambling for alternatives. Most in-house approaches require teams to collect, compile, and annotate their own DIY data — further compounding the potential for biases, inadequate edge-case performance (i.e. poor generalization), and privacy violations.

However, a saving grace appears to already be at hand: advances in synthetic data. This computer-generated, realistic data intrinsically offers solutions to practically every item on the list of mission-critical problems teams currently face.

That’s the gist of the introduction to “Synthetic Data: Key to Production-Ready AI in 2022.” The survey’s findings are based on responses from people working in the computer vision industry, but they are of broader interest: first, because a broad spectrum of markets depends on computer vision, including extended reality, robotics, smart vehicles, and manufacturing; and second, because the approach of generating synthetic data for AI applications could be generalized beyond computer vision.


Lack of data kills AI projects

Datagen, a company that specializes in simulated synthetic data, recently commissioned Wakefield Research to conduct an online survey of 300 computer vision professionals to better understand how they obtain and use AI/ML training data for computer vision systems and applications, and how those choices impact their projects.

The reason why people turn to synthetic data for AI applications is clear. Training machine learning models requires high-quality data, which is not easy to come by. That seems like a universally shared experience.

Ninety-nine percent of survey respondents reported having had an ML project completely canceled due to insufficient training data, and 100% of respondents reported experiencing project delays as a result of insufficient training data.

What is less clear is how synthetic data can help. Gil Elbaz, Datagen CTO and cofounder, can relate to that. When he first started using synthetic data back in 2015, as part of his second degree at the Technion (Israel Institute of Technology), his focus was on computer vision and 3D data using deep learning.

Elbaz was surprised to see synthetic data working: “It seemed like a hack, like something that shouldn’t work but works anyway. It was very, very counter-intuitive,” he said.

Having seen that in practice, however, Elbaz and his cofounder Ofir Chakon felt that there was an opportunity there. In computer vision, like in other AI application areas, data has to be annotated to be used to train machine learning algorithms. That is a very labor-intensive, bias- and error-prone process.

“You go out, capture pictures of people and things at large scale, and then send it to manual annotation companies. This is not scalable, and it doesn’t make sense. We focused on how to solve this problem with a technological approach that will scale to the needs of this growing industry,” Elbaz said.

Datagen started out operating in garage mode, generating data through simulation. By simulating the real world, the team was able to create data to train AI to understand the real world. Convincing people that this works was an uphill battle, but today Elbaz feels vindicated.

According to survey findings, 96% of teams report using synthetic data in some proportion for training computer vision models. Interestingly, 81% report using synthetic data in proportions equal to or greater than those of manually collected data.

Synthetic data, Elbaz noted, can mean a lot of things. Datagen’s focus is on so-called simulated synthetic data. This is a subset of synthetic data focused on 3D simulations of the real world. Virtual images captured within that 3D simulation are used to create visual data that’s fully labeled, which can then be used to train models.
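As a toy illustration of why this yields fully labeled data, consider the sketch below (plain Python with numpy; Datagen’s actual pipeline renders photorealistic 3D scenes, which this does not attempt). Because the scene is constructed programmatically, the bounding-box annotation is known by construction rather than added by human annotators afterward.

import numpy as np

rng = np.random.default_rng(0)

def render_synthetic_sample(height=128, width=128):
    # Place a rectangular "object" at a random but known location; because we
    # placed it ourselves, the bounding-box label is exact by construction.
    image = np.zeros((height, width, 3), dtype=np.uint8)
    w, h = rng.integers(16, 48, size=2)
    x, y = rng.integers(0, width - w), rng.integers(0, height - h)
    image[y:y + h, x:x + w] = rng.integers(0, 256, size=3)  # the "object"
    label = {"bbox": (int(x), int(y), int(w), int(h)), "class": "object"}
    return image, label

# A fully labeled dataset, with no manual annotation step.
dataset = [render_synthetic_sample() for _ in range(1000)]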

Simulated synthetic data to the rescue

The reason this works in practice is twofold, Elbaz said. The first is that AI really is data-centric.

“Let’s say we have a neural network to detect a dog in an image, for instance. So it takes in 100GB of dog images. It then outputs a very specific output. It outputs a bounding box where the dog is in the image. It’s like a function that maps the image to a specific bounding box,” he said.

“The neural networks themselves only weigh a few megabytes, and they’re actually compressing hundreds of gigabytes of visual information and extracting from it only what’s needed. And so if you look at it like that, then the neural networks themselves are the less interesting part. The interesting part is actually the data.”
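As a rough illustration of that point, the toy bounding-box regressor below (Python, assuming PyTorch; the architecture is invented for illustration and is not any particular production model) maps an image tensor to four box coordinates, and its weights amount to well under a megabyte, however many gigabytes of images such a model might be trained on.

import torch
import torch.nn as nn

# A deliberately tiny "detector": pixels in, one bounding box out.
detector = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 4),  # (x, y, width, height) of the predicted box
)

image = torch.rand(1, 3, 224, 224)   # one RGB image
box = detector(image)                # tensor of shape (1, 4)

n_params = sum(p.numel() for p in detector.parameters())
print(f"{n_params:,} parameters, roughly {n_params * 4 / 1e6:.2f} MB as float32")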

So the question is, how do we create data that can represent the real world in the best way? This, Elbaz claims, is best done by generating simulated synthetic data, rather than by using generative techniques like GANs (generative adversarial networks).

Using GANs is one way of going about it, but it’s very hard to create new information by just training an algorithm on a certain dataset and then using that model to generate more data, according to Elbaz. It doesn’t work well because the output is bounded by the information already contained in the original dataset.
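To make the contrast concrete, the GAN approach Elbaz is referring to looks roughly like the sketch below. This is a generic toy example in Python, assuming PyTorch is installed; it is not Datagen’s method, and the data and network sizes are made up for illustration. The generator learns to mimic samples drawn from an existing dataset, which is also why it cannot add information beyond what that dataset already contains.

import torch
import torch.nn as nn

# "Real" data: 10,000 two-dimensional points standing in for real-world samples.
real_data = torch.randn(10_000, 2) * 0.5 + torch.tensor([2.0, -1.0])

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))  # noise -> synthetic point
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))  # point -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = real_data[torch.randint(0, len(real_data), (128,))]
    fake = G(torch.randn(128, 8))
    # Discriminator learns to separate real from generated points.
    d_loss = bce(D(real), torch.ones(128, 1)) + bce(D(fake.detach()), torch.zeros(128, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator learns to fool the discriminator.
    g_loss = bce(D(fake), torch.ones(128, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# G(torch.randn(n, 8)) now yields synthetic points that resemble real_data,
# but it cannot contain more information than real_data itself.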

What Datagen is doing — and what companies like Tesla are doing too — is creating a simulation with a focus on understanding humans and environments. Instead of collecting videos of people doing things, they’re collecting information that’s disentangled from the real world and is of high quality. It’s an elaborate process that includes collecting high-quality scans and motion capture data from the real world.

Then the company scans objects and models procedural environments, creating decoupled pieces of information from the real world. The magic is in connecting it all at scale and providing it to the user in a controllable, simple fashion. Elbaz described the process as a combination of directorial work and simulation of real-world dynamics via models and environments such as game engines.

It’s an elaborate process, but apparently it works. And it’s especially valuable for edge cases that are hard to come by otherwise, such as extreme scenarios in autonomous driving. Being able to get data for those edge cases is very important.

The million-dollar question, however, is whether generating synthetic data could be generalized beyond computer vision. Every AI application domain is data-hungry and would benefit from additional, high-quality data representative of the real world.

In addressing this question, Elbaz referred to unstructured and structured data separately. Unstructured data, like images or audio signals, can be simulated for the most part. Text, which is considered semi-structured, and structured data such as tabular data or medical records are a different matter. But there, too, Elbaz noted, we see a lot of innovation.

Many startups are focusing on tabular data, mostly because of privacy: using real tabular data raises privacy concerns. This is why we see work on simulating data from an existing pool of data, without expanding the amount of information it contains. Synthetic tabular data are used to create a privacy-compliance layer on top of existing data.
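In its simplest form, that workflow can be sketched as follows. This is a toy Python example using numpy and pandas with made-up columns; production tools model the joint structure of the columns far more carefully, for example with copulas or GANs. The idea is to fit a generative model to the real table, sample look-alike rows, and share those instead of the originals.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Illustrative stand-in for a real customer table (the columns are made up).
real = pd.DataFrame({
    "age": rng.normal(40, 12, 5000).clip(18, 90),
    "basket_value": rng.lognormal(3.5, 0.6, 5000),
    "segment": rng.choice(["new", "returning", "vip"], 5000, p=[0.5, 0.4, 0.1]),
})

def fit_and_sample(df: pd.DataFrame, n: int) -> pd.DataFrame:
    # Per-column model: Gaussian for numeric columns, empirical frequencies for the rest.
    # (Real tools also capture correlations between columns; this toy version does not.)
    out = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            out[col] = rng.normal(df[col].mean(), df[col].std(), n)
        else:
            freqs = df[col].value_counts(normalize=True)
            out[col] = rng.choice(freqs.index, n, p=freqs.values)
    return pd.DataFrame(out)

shareable = fit_and_sample(real, 5000)  # no row corresponds to an actual customer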

Synthetic data can be shared with data scientists around the world so that they can start training models and creating insights, without actually accessing the underlying real-world data. Elbaz believes that this practice will become more widespread, for example in scenarios like training personal assistants, because it removes the risk of using personally identifiable data.

Addressing bias and privacy

Another interesting side effect of using synthetic data that Elbaz identified was removing bias and achieving higher annotation quality. In manually annotated data, bias creeps in, whether it’s due to different views among annotators or the inability to effectively annotate ambiguous data. In synthetic data generated via simulation, this is not an issue, as the data comes out perfectly and consistently pre-annotated.

In addition to computer vision, Datagen aims to expand this approach to audio, as the guiding principles are similar. Besides surrogate synthetic data for privacy, and video and audio data that can be generated via simulation, is there a chance we can ever see synthetic data used in scenarios such as ecommerce?

Elbaz believes this could be a very interesting use case, one that an entire company could be created around. Both tabular data and unstructured behavioral data would have to be combined — things like how consumers are moving the mouse and what they’re doing on the screen. But there is an enormous amount of shopper behavior information, and it should be possible to simulate interactions on ecommerce sites.

This could be beneficial for the product people optimizing ecommerce sites, and it could also be used to train predictive models. In that scenario, one would need to proceed with caution, as the ecommerce use case more closely resembles the GAN-generated data approach; it’s closer to structured synthetic data than to unstructured.

“I think that you’re not going to be creating new information. What you can do is make sure that there’s a privacy compliant version of the Black Friday data, for instance. The goal there would be for the data to represent the real-world data in the best way possible, without ruining the privacy of the customers. And then you can delete the real data at a certain point. So you would have a replacement for the real data, without having to track customers in a borderline ethical way,” Elbaz said.

The bottom line is that while synthetic data can be very useful in certain scenarios and are seeing increased adoption, their limitations should also be clear.
