Home » Business Topics » AI Ethics

New Book: Synthetic Data – Generation and Applications

Synthetic data is used more and more to augment real-life datasets. It enriches them and allow black-box systems to correctly classify observations or predict values that are well outside of training and validation sets. In addition, it helps understand decisions from obscure systems such as deep neural networks. Thus, it contributes to the development of explainable AI. It also helps with unbalanced data, for instance in fraud detection. Finally, since synthetic data is not directly linked to real people or transactions, it offers protection against data leakage. Synthetic data also contributes to eliminating algorithm biases and privacy issues. More generally, it increases security.

This book is the culmination of years of research on the topic, by the author. Not only it integrates all the material from his previous book “Intuitive Machine Learning and explainable AI”, but it also contains all but the most advanced math from his book on stochastic simulations. The author also added more recent advances with applications to terrain generation (with animated data), synthetic universes and experimental math. The latter is an infinite source of synthetic data to build and benchmark new machine learning techniques. Conversely mathematics benefits from these techniques to uncover new insights on the most famous math problems. Chapter 14 on the Riemann Hypothesis illustrates this point, with new state-of-the-art research results on the subject.

New Book: Synthetic Data – Generation and Applications
Terrain generation, evolution and morphing (video frame, chapter 11)

What You Will Learn From The Book

Topics cover computer vision, natural language processing, tabular data, time series, geospatial and sound data, supervised classification, clustering, generative models, nearest neighbors and collision graphs, data-driven inference, prediction (all regression techniques fit into a single, easy-to-understand method), deep neural networks, modeling without response (unsupervised regression such as circle or curve fitting), constrained optimization, and more. The author introduces a simple alternative to XGBoost, one of the most efficient ensemble methods; it is applied to an NLP problem — categorizing and ranking articles and blog posts to predict future performance. When needed, modern or new statistical learning techniques are introduced: dual confidence regions, new test of independence, parametric bootstrap, Rayleigh test, distribution-free logistic regression, proxy estimation and minimum contrast estimators, as well as a new prime test for strong pseudo-random number generators.

About 15% of the content is Python code with documentation. The code is also on GitHub, spreading across multiple top-level folders. It unifies separate, independent pieces of code for the first time. It also constitutes a solid introduction to scientific computing. This book will teach you how to design rich, useful synthetic data to apply in various contexts. You will also learn the mathematics, methodology, and principles behind the scene. Finally, you will learn how to produce high quality, impactful data animations (data videos).

About the Author


Publication date: December 2022. Author and publisher: Vincent Granville, Ph.D., founder of private machine learning research lab, MLTechniques.com.

The book is available in PDF format (235 pages) with numerous, high-quality color illustrations and clickable links to fundamental concepts described on Wikipedia, if you ever need a refresher on the basics. You can view it for instance in the Chrome browser: press Ctrl-O and select the book. Access all the navigation features and follow the links in the book, with one click. To view the table of contents, list of figures and tables, bibliography, glossary and index, follow this link.

To purchase the book, follow this link.