In the last two years, I published five machine learning and AI books, including one on synthetic data with Elsevier. This represents over 800 pages of compact, state-of-the-art material. The new addition features my most recent advances: the problems that I encountered with generative adversarial networks, and how I overcame them with new techniques. The direction is toward less training data, yet better results and increased speed, which translates into significant cost savings, especially in the context of synthetic data. Better evaluation metrics and new loss functions with very fast implementations contribute to this success.
This book covers optimization techniques pertaining to machine learning and generative AI. The emphasis is on producing better synthetic data with faster methods, some not even involving neural networks. I describe NoGAN for tabular data in detail, with full Python code, along with several case studies in healthcare, insurance, cybersecurity, education, and telecom. This low-cost technique is a game changer: it runs 1000x faster than generative adversarial networks (GANs) while consistently producing better results, and it yields replicable output and supports auto-tuning. I also discuss how to fix failing GANs without abandoning deep learning, and how to make them replicable.
Better Evaluation Metrics
Many evaluation metrics fail to detect defects in synthesized data. Not because they are bad metrics, but because of partial implementations: due to their complexity, the full multivariate versions of these metrics are absent from vendor solutions. In this book, I describe how to implement one, the multivariate Kolmogorov-Smirnov distance (KS), with numerous illustrated examples. KS relies on the joint empirical distributions attached to the datasets. This evaluation metric works in any dimension, on both categorical and numerical features. The book features open-source Python libraries, both for NoGAN and KS.
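To fix ideas, here is a minimal sketch of a multivariate KS distance based on joint empirical distributions: for each point, compute the fraction of rows in each dataset that it dominates componentwise, and take the largest absolute difference over the pooled sample. The function names `joint_ecdf` and `multivariate_ks` are my own illustration, not the API of the book's library, and this brute-force version is O(n²·d).

```python
import numpy as np

def joint_ecdf(points, data):
    # Joint empirical CDF: for each query point, the fraction of
    # data rows that are <= the point in every coordinate.
    return np.array([np.mean(np.all(data <= p, axis=1)) for p in points])

def multivariate_ks(real, synth):
    # Evaluate both joint ECDFs on the pooled sample and take
    # the supremum of the absolute difference (KS statistic).
    grid = np.vstack([real, synth])
    return np.max(np.abs(joint_ecdf(grid, real) - joint_ecdf(grid, synth)))

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 3))
good = rng.normal(size=(500, 3))          # same distribution: small KS
bad = rng.normal(loc=0.8, size=(500, 3))  # shifted distribution: larger KS

ks_good = multivariate_ks(real, good)
ks_bad = multivariate_ks(real, bad)
print(ks_good < ks_bad)
```

Categorical features could be handled in this sketch by encoding categories as integer codes before comparison; the book's implementation may take a different route.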
Better Performance With or Without Deep Learning
Then, I discuss a very different synthesizer, namely NoGAN2. It relies on resampling, model-free hierarchical methods, auto-tuning, and explainable AI. Interestingly, it minimizes a particular loss function, again without gradient descent. While not based on neural networks, it nevertheless shares many similarities with GANs. Thus you can use it as a sandbox to quickly test various features and hyperparameters before adding the ones that work best to a GAN. Even though NoGAN and NoGAN2 don't use traditional optimization, gradient descent is the topic of the first chapter. Applied to data rather than mathematical functions, it requires no assumption of differentiability, no learning parameter, and essentially no math. The second chapter introduces a generic class of regression methods covering all existing ones and more, whether or not your data has a response, for supervised or unsupervised learning. I use gradient descent in this case.
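For readers new to the topic, here is a minimal, generic sketch of gradient descent minimizing a mean-squared-error loss for linear regression. This is my own illustration of the textbook algorithm (with a learning rate), not the book's data-driven, derivative-free variant:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.1 * rng.normal(size=200)  # noisy linear response

w = np.zeros(2)   # initial guess
lr = 0.1          # learning rate (step size)
for _ in range(500):
    # Gradient of the mean squared error (1/n) * ||Xw - y||^2
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= lr * grad  # step against the gradient

print(np.round(w, 1))  # close to [2.0, -1.0]
```

Each iteration moves the weights a small step against the loss gradient; after enough steps the estimate converges to the least-squares solution.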
One chapter focuses on NLP, discussing an efficient technique to process large amounts of text data: hidden decision trees, which present some similarities with XGBoost. Indeed, I use a similar technique in NoGAN. Then I discuss other GenAI methods and various optimization techniques, in particular feature clustering, data thinning, smart grid search, and more. Finally, I explain how to use exact multivariate interpolation to synthesize time series and geospatial data. There is also a chapter on agent-based modeling to synthesize complex systems.
The methods are accompanied by enterprise-grade Python code, also available on GitHub. Chapters are mostly independent from each other, allowing you to read them in any order. The style is very compact and suitable for business professionals with little time. Jargon and arcane theory are absent, replaced by simple English to make the material accessible to non-experts and to help you discover topics usually kept out of reach of beginners. While state-of-the-art research is presented in all chapters, the prerequisites are minimal: an analytic professional background, or a first course in calculus and linear algebra.
Getting Your Copy
All my books are available as PDF documents on my e-Store, here. Currently, the only one available in print – “Synthetic Data and GenAI” – is published by Elsevier. The new 200-page book “Statistical Optimization for GenAI and ML” was published in October 2023. The table of contents is on GitHub, here. I am currently working on my next book, “Practical AI & Machine Learning Projects and Datasets”.
Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by TechTarget in 2020), Chief AI Scientist at MLTechniques.com, former VC-funded executive, author, and patent owner (one patent related to LLMs). Vincent's past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET.
Vincent is also a former post-doc at Cambridge University and at the National Institute of Statistical Sciences (NISS). He published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is the author of multiple books, including “Synthetic Data and Generative AI” (Elsevier, 2024). Vincent lives in Washington state and enjoys doing research on stochastic processes, dynamical systems, experimental math, and probabilistic number theory. He recently launched a GenAI certification program, offering state-of-the-art, enterprise-grade projects to participants.