Little-known Secrets About Synthetic Data

For many people, synthetic data is synonymous with simulations, mock or fake data. The reality is very different. The purpose of this article is to explain what synthetic data is really about. I discuss potential applications, benefits, limitations, and some little-known facts that synthetic data vendors don’t want you to know, mostly because they are unaware of implementation flaws. I also address how to correct these flaws and generate synthetic data the right way.

Introduction

Synthetic data is used more and more to augment real-life datasets, enriching them and allowing black-box systems to correctly classify observations or predict values that are well outside the training and validation sets. In addition, it helps understand decisions made by obscure systems such as deep neural networks, contributing to the development of explainable AI. It also helps with unbalanced data, for instance in fraud detection. Finally, since synthetic data is not directly linked to real people or transactions, it offers protection against data leakage. Synthetic data also contributes to eliminating algorithm biases and privacy issues, and more generally, to increased security.

Types of Synthetic Data

Tabular, transactional data is attracting more and more attention and money, both from companies that need it and from startups receiving VC funding to develop solutions. Banks, insurance companies, supply chains and the healthcare industry are top consumers. It helps rebalance datasets by adding generated transactions to under-represented segments, such as minorities or fraudulent transactions. It also helps with data privacy and security issues.

Computer vision problems such as face recognition or automated driving also benefit from it. It helps generate atypical observations not present in your training set, to boost the pattern recognition capabilities of your algorithms. An example is automated driving under intense glare or white-out conditions. For the same reasons, speech recognition, chatbots (ChatGPT) and natural language processing benefit from synthetic data, for instance to recognize accents. The term “natural language generation” is popular in this context.

Figure: Storm and terrain generation: morphing (top) and evolutionary process (bottom)

Other applications include time series generation and agent-based modeling. An example of the latter is simulating how a virus spreads over time in a country, using evolutionary processes. The above picture features a different case: terrain and storm generation. AI-generated art and copyright-free images for authors and publishers are other potential applications.

Not so Well-known Applications

Besides mimicking real data, synthetization is used for a number of purposes. Indeed, the term was coined in the context of imputation: creating artificial values when the data is missing. This is a very difficult problem, as missing observations are typically different from the observed ones. Synthetic data is also used to test and benchmark algorithms, detect when they work and when they fail, and understand which observations or features contribute to a specific decision, such as denying a loan to people in a minority group. It contributes to making black boxes more interpretable and to identifying the source of biases.

Finally, model-free statistical techniques to build confidence regions or intervals benefit from synthetic data. Typically, statisticians use bootstrapping methods in this context. However, using synthetic data allows you to remove biases and limitations inherent to these methods. In the same context, sensitivity analysis, which assesses how your algorithm reacts to noise artificially introduced in the data, could use synthetic data rather than artificial noise. The goal in the end is to design algorithms or models that are less sensitive to noise, thus reducing overfitting.

Typical Flaw and how to Fix it

When the goal is to faithfully replicate a real dataset, the quality of the representation, measured as the “distance” between the real and synthesized data, is usually maximal when the two datasets are exactly identical, since the distance is then zero. The Hellinger distance is popular in this context. Such a metric favors overfitting and penalizes atypical observations.
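To make this concrete, here is a minimal sketch of a Hellinger distance computed on one feature, using only the Python standard library. The histogram-based estimate, the function name `hellinger` and the `bins` parameter are assumptions made for illustration; vendors and libraries may compute the metric differently.

```python
import math
import random


def hellinger(sample_a, sample_b, bins=10):
    """Hellinger distance between two 1-D samples, estimated on shared histogram bins."""
    lo = min(min(sample_a), min(sample_b))
    hi = max(max(sample_a), max(sample_b))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are identical

    def histogram(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)  # put the maximum in the last bin
            counts[idx] += 1
        return [c / len(sample) for c in counts]

    p, q = histogram(sample_a), histogram(sample_b)
    total = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(total / 2)


rng = random.Random(0)
real = [rng.gauss(0, 1) for _ in range(1000)]
copy_of_real = list(real)                          # "synthetic" data that merely copies the real data
shifted = [rng.gauss(3, 1) for _ in range(1000)]   # new, atypical observations

print(hellinger(real, copy_of_real))  # an exact copy gets the best possible score
print(hellinger(real, shifted))       # data from unseen regions is scored as poor
```

An exact copy of the real data gets a distance of zero, while data covering regions absent from the real dataset gets a large distance. Used on its own, this kind of metric therefore rewards replication rather than useful novelty.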

Such a metric makes it hard to generate new, unusual fraud cases or to create new people in minority groups who are truly different from those already in the dataset. As a result, it propagates biases to the synthesized data, defeating its purpose. This is compounded by the fact that many implementations cannot generate data outside the observed range: see this article for an example based on the insurance dataset. There I compare copula-generated synthetic data with that obtained from a vendor (Mostly.ai). Both show the same issue. Other examples are discussed in my recent presentation, here.

One way to fix this is to add uncorrelated white noise to your observations. You can also fine-tune the amount of noise. Then measure the quality of the synthetic data based on statistical summaries (correlations and so on) rather than on Hellinger-like metrics. Even better, check how your synthetic data, when blended with real data in the training set, improves your predictions on the validation set. Finally, methods based on empirical quantiles (such as copulas) need to extrapolate rather than interpolate these quantiles to generate data outside the range observed in the real data.
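The sketch below illustrates two of these ideas on a single feature: extrapolating empirical quantiles beyond the observed range, and adding a tunable amount of white noise. It is not the method used in the article mentioned above. The linear tail extrapolation, the plotting positions `(i + 0.5) / n`, and the names `quantile`, `add_white_noise` and `sigma` are assumptions chosen for simplicity.

```python
import random


def quantile(sorted_obs, u, extrapolate=False):
    """Empirical quantile at level u, with optional linear extrapolation in the tails.

    Order statistic i (0-based) sits at level (i + 0.5) / n. Between these levels
    we interpolate linearly. Outside them we either clamp to the observed min/max
    or extend the line through the two outermost observations.
    """
    n = len(sorted_obs)
    pos = u * n - 0.5  # fractional index of the requested level
    if pos < 0:
        if not extrapolate:
            return sorted_obs[0]
        slope = sorted_obs[1] - sorted_obs[0]
        return sorted_obs[0] + pos * slope  # pos is negative: below the minimum
    if pos > n - 1:
        if not extrapolate:
            return sorted_obs[-1]
        slope = sorted_obs[-1] - sorted_obs[-2]
        return sorted_obs[-1] + (pos - (n - 1)) * slope  # above the maximum
    i = int(pos)
    if i == n - 1:
        return sorted_obs[-1]
    return sorted_obs[i] + (pos - i) * (sorted_obs[i + 1] - sorted_obs[i])


def add_white_noise(values, sigma, rng):
    """Add independent Gaussian noise with standard deviation sigma to each value."""
    return [x + rng.gauss(0.0, sigma) for x in values]


rng = random.Random(42)
real = sorted(rng.expovariate(1.0) for _ in range(200))
levels = [rng.random() for _ in range(5000)]

clamped = [quantile(real, u) for u in levels]                     # never leaves the observed range
extended = [quantile(real, u, extrapolate=True) for u in levels]  # may exceed the observed range
noisy = add_white_noise(extended, sigma=0.05, rng=rng)            # sigma is the amount of noise to tune

print(max(real), max(clamped), max(extended), max(noisy))
```

Linear extrapolation only reaches slightly beyond the observed extremes; heavier tails would need a parametric tail model. Extrapolated or noisy values can also violate domain constraints, for instance by turning negative for a positive feature, so a real implementation would need to enforce them.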

To generate better quality data for specific groups such as minorities, treat each group as a separate dataset: use a separate copula for each group. You can also detect groups with decision tree methods. Then using a different copula or generative adversarial network (GAN) for each group is not unlike using boosted trees in traditional supervised learning.
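As an illustration of the per-group approach, here is a minimal sketch that fits one copula per group, using only the Python standard library. It assumes a Gaussian copula, exactly two numeric features, group labels that are already known, and at least two records per group. All function names are hypothetical, and the marginals use plain empirical quantiles without extrapolation to keep the example short.

```python
import math
import random
from collections import defaultdict
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(column):
    """Map a column to standard normal scores through its ranks."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def empirical_quantile(sorted_col, u):
    """Nearest-rank empirical quantile (no extrapolation in this sketch)."""
    return sorted_col[min(int(u * len(sorted_col)), len(sorted_col) - 1)]


def fit_copula(rows):
    """Fit a two-feature Gaussian copula: one correlation plus two sorted marginals."""
    xs, ys = [r[0] for r in rows], [r[1] for r in rows]
    rho = pearson(normal_scores(xs), normal_scores(ys))
    return rho, sorted(xs), sorted(ys)


def sample_copula(model, n, rng):
    """Draw n synthetic (x, y) pairs from a fitted model."""
    rho, xs, ys = model
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(max(0.0, 1 - rho * rho)) * z2  # impose the fitted correlation
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        out.append((empirical_quantile(xs, u1), empirical_quantile(ys, u2)))
    return out


def synthesize_by_group(records, n_per_group, rng):
    """records: iterable of (group_label, x, y). One copula is fitted per group."""
    groups = defaultdict(list)
    for label, x, y in records:
        groups[label].append((x, y))
    return {label: sample_copula(fit_copula(rows), n_per_group, rng)
            for label, rows in groups.items()}


rng = random.Random(1)
records = []
for _ in range(500):  # large group, positive relationship between the features
    x = rng.gauss(50, 10)
    records.append(("majority", x, x + rng.gauss(0, 5)))
for _ in range(40):   # small group, opposite relationship
    x = rng.gauss(20, 3)
    records.append(("minority", x, 100 - 2 * x + rng.gauss(0, 2)))

synthetic = synthesize_by_group(records, 200, rng)
for label, rows in synthetic.items():
    print(label, round(pearson([r[0] for r in rows], [r[1] for r in rows]), 2))
```

A single copula fitted to the pooled data would blend the two relationships and largely ignore the small group. Fitting each group separately preserves its own correlation structure, and you can generate as many records as needed for the under-represented group.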

About the Author

Vincent Granville is a pioneering data scientist and machine learning expert, founder of MLTechniques.com and co-founder of Data Science Central (acquired by TechTarget in 2020), former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, InfoSpace. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).

Vincent has published in the Journal of Number Theory, the Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of “Intuitive Machine Learning and Explainable AI”, available here. He lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory.