30 Python Libraries that I Often Use - DataScienceCentral.com

This list covers well-known as well as specialized libraries that I use frequently. Applications include GenAI, data animations, LLMs, synthetic data generation and evaluation, ML optimization, scientific computing, statistics, web crawling, APIs, SQL, and more. I also mention my own libraries, and issues that I faced with standard ones. In several cases, such as sound generation, I did not use any library. I also include some functions that I call regularly. In many places, I explain why I had to create my own home-made versions.

Synthetic Data

SDV is the most popular library to generate tabular synthetic data. The Faker library is integrated into it. Another one is CTGAN. I was disappointed by the results: poor evaluation, poor synthesis. Thus, I created my own: Genai-Evaluation and NoGAN-Synthesizer. As its name implies, NoGAN does not rely on neural networks. Thus, it is much faster, yet delivers better results.

Natural Language Processing

NLTK is well-known. The Stopwords module proved useless in my case, as I need a separate stopword list for each of the specialized LLM components in my multi-LLM architecture (xLLM): depending on the sub-LLM, some basic words such as “even” must not be treated as stopwords. Thus, I need to create lists of words that cannot be stopwords. I experienced similar problems with Autocorrect and its Speller module. The same goes for Singularize from the Pattern library.

I generate all n-grams, compute cosine similarity, create embeddings, and deal with accented characters with just a few lines of home-made code, although some libraries also handle these tasks. That said, the above libraries are useful to many. Even for me, with some workarounds, they are useful.
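To give an idea of how little code these tasks need, here is a minimal sketch using only the standard library. It is not my production code: the function names (strip_accents, all_ngrams, cosine_similarity) and the two sample strings are made up for illustration, and the similarity is computed on raw n-gram counts rather than on embeddings.

```python
import math
import unicodedata
from collections import Counter


def strip_accents(text):
    """Replace accented characters with their unaccented base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def all_ngrams(tokens, max_n=3):
    """Return every contiguous n-gram (as a tuple) for n = 1..max_n."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[key] * counts_b[key] for key in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample texts: one accented, one not.
doc_a = strip_accents("Données synthétiques générées").lower().split()
doc_b = strip_accents("donnees synthetiques evaluees").lower().split()

vec_a = Counter(all_ngrams(doc_a, max_n=2))
vec_b = Counter(all_ngrams(doc_b, max_n=2))
similarity = cosine_similarity(vec_a, vec_b)
```

Each text yields five n-grams (three single words, two pairs), of which three are shared, so the similarity here is 3/5.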

Web Crawling

I am happy with Requests. You can even access password-protected content. Be careful about getting blocked by the websites that you crawl! I haven’t tested BeautifulSoup yet. If it can recursively retrieve specific navigation features (related pages, similar pages, indexes), which vary from website to website, I will definitely try it. I doubt that it will answer all my questions, as I try to reconstruct the underlying taxonomy of each website or repository that I crawl. Still, it could be handy.
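As an illustration of the two points above (protected content and not getting blocked), here is a small sketch with Requests. It assumes the site uses HTTP Basic authentication, which is only one of several possible schemes (many sites use a login form instead). The helper names, the user agent string, and the one-second delay are placeholders, not recommendations.

```python
import time

import requests


def make_session(username, password, user_agent="example-crawler/0.1"):
    """Create a session that sends HTTP Basic credentials with every request."""
    session = requests.Session()
    session.auth = (username, password)  # assumes the site uses HTTP Basic auth
    session.headers.update({"User-Agent": user_agent})
    return session


def polite_get(session, url, delay_s=1.0, timeout_s=10.0):
    """Pause before each request, fetch the page, and raise on HTTP errors."""
    time.sleep(delay_s)  # a fixed pause lowers the risk of being blocked
    response = session.get(url, timeout=timeout_s)
    response.raise_for_status()
    return response.text
```

A typical call would be polite_get(make_session("user", "secret"), url), repeated over the list of pages to crawl.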

Computer Vision

Of course, OpenCV is the most popular library. But I have been happy with Pillow. I also produce a lot of data videos, featuring a process evolving over time, or a sequence of training sets or parameters changing slightly from one video frame to the next. The goal is to show 500 charts in a 1-minute video, as it is more compelling than displaying hundreds of them in separate images. It is also great for model comparison, where multiple sub-videos run in parallel within a single video: one per model. I accomplish this very easily with the MoviePy library. Soundtracks are just WAV files, so no special library is really needed. As for standard visualizations, I like Matplotlib. But Plotly is more sophisticated and better for special needs.
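To show why no special library is needed for soundtracks, here is a minimal sketch that writes a sine tone to a WAV file with the standard wave and struct modules. The function name write_tone, the 440 Hz frequency, and the output file name are illustrative choices, not part of my actual soundtrack code.

```python
import math
import os
import struct
import tempfile
import wave


def write_tone(path, frequency_hz=440.0, duration_s=1.0, sample_rate=44100, volume=0.5):
    """Write a mono 16-bit PCM sine tone to a WAV file; return the frame count."""
    n_frames = int(duration_s * sample_rate)
    frames = bytearray()
    for i in range(n_frames):
        sample = volume * math.sin(2.0 * math.pi * frequency_hz * i / sample_rate)
        frames += struct.pack("<h", int(sample * 32767))  # little-endian int16
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)           # mono
        wav_file.setsampwidth(2)           # 2 bytes per sample = 16 bits
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return n_frames


# Write a quarter-second test tone to the system temporary directory.
tone_path = os.path.join(tempfile.gettempdir(), "tone_a440.wav")
n_written = write_tone(tone_path, duration_s=0.25)
```

Changing the frequency from one note to the next, for instance as a function of the data shown in each video frame, is enough to produce a simple soundtrack.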

Deep Neural Networks

Like many programmers, my experience is with TensorFlow and Keras. Training can be very slow unless you switch to GPU, hyperparameters are anything but easy to tune, and results vary wildly from one dataset to another. They lead to non-replicable results unless you use a seed for each sub-module relying on random numbers. In addition, DNNs require a lot of pre- and post-processing (transformers, decoders, and so on). One of the issues is that the loss function is not a good proxy for the output quality. Disclaimer: I mostly used GANs in the context of synthetic tabular data generation; this may be the type of data where DNNs do worst. In the end, that’s why I created NoGAN: it solves all these issues and runs much faster.
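The seeding issue can be sketched as follows: each source of randomness has its own generator and needs its own seed. The helper name seed_everything is hypothetical. The TensorFlow call is wrapped in a try block so that the snippet also runs where TensorFlow is not installed. Even with all seeds set, some GPU operations may remain non-deterministic.

```python
import random

import numpy as np


def seed_everything(seed):
    """Seed each random number source separately; they do not share state."""
    random.seed(seed)         # Python's built-in generator
    np.random.seed(seed)      # NumPy's legacy global generator
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)  # TensorFlow's global seed (also used by Keras)
    except ImportError:
        pass  # TensorFlow not installed: nothing more to seed


seed_everything(42)
first_run = (random.random(), float(np.random.rand()))

seed_everything(42)
second_run = (random.random(), float(np.random.rand()))
# With identical seeds, both runs produce the same pair of numbers.
```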

Statistics and Machine Learning

Statsmodels is a popular library. SciPy, NumPy, and Scikit-learn also offer many statistical functions, while Seaborn focuses on visualizations. I work a lot with time series and geospatial data, with my own algorithms. For the latter, on occasion I used PyKrige (kriging to compare with my techniques) and OSMnx to add maps to geospatial data.

I created a very generic regression technique called cloud regression, with Lagrange multipliers for regularization. Still, I have played quite a bit with the curve_fit function in SciPy's optimize module. Indeed, I was about to create my own version until I realized that you can set bounds on the parameters of the target function. For random forests (classification), I use Scikit-learn.
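Here is a short sketch of the bounds feature in curve_fit. The target function (a saturating exponential), the simulated data, and the bound values are all made up for illustration; only the curve_fit call with its p0 and bounds arguments reflects the actual SciPy interface.

```python
import numpy as np
from scipy.optimize import curve_fit


def saturating_growth(x, amplitude, rate):
    """Hypothetical target function: amplitude * (1 - exp(-rate * x))."""
    return amplitude * (1.0 - np.exp(-rate * x))


# Simulated observations: true parameters (2.0, 1.3) plus small Gaussian noise.
rng = np.random.default_rng(0)
x_data = np.linspace(0.0, 5.0, 50)
y_data = saturating_growth(x_data, 2.0, 1.3) + rng.normal(0.0, 0.02, x_data.size)

# Bounds are given as (lower, upper), one entry per parameter, in order.
lower_bounds = [0.0, 0.0]
upper_bounds = [10.0, 5.0]

params, covariance = curve_fit(
    saturating_growth,
    x_data,
    y_data,
    p0=[1.0, 1.0],                       # initial guess, inside the bounds
    bounds=(lower_bounds, upper_bounds),
)
amplitude_hat, rate_hat = params
```

The initial guess p0 must lie within the bounds, otherwise curve_fit raises an error.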

Miscellaneous

To create Web APIs, I use Streamlit. I also run automated SQL queries created on the fly with Python code. So far, I have done it with Pandas. Of course, I import NumPy in most of my programs, sometimes just to call the random module. Standard functions such as quantiles do not sample outside the observation range and are univariate, so I wrote my own. Likewise for gradient descent (done with neural networks these days): I have my own version, although it is based on the gradient operator available in NumPy. I take care of stochastic descent on my own, but libraries exist.
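To illustrate gradient descent built on NumPy's gradient operator, here is a simplified sketch, not my actual version. It assumes the function is known only through its values on a regular 2D grid, and it reads the gradient at the grid node nearest to the current position (an interpolation would be more accurate). The function name, the test surface, and the learning rate are illustrative.

```python
import numpy as np


def grid_gradient_descent(values, x_axis, y_axis, start, learning_rate=0.1, n_steps=200):
    """Descend a surface sampled on a grid, with values[i, j] = f(x_axis[i], y_axis[j])."""
    # np.gradient returns one array of partial derivatives per array axis.
    grad_x, grad_y = np.gradient(values, x_axis, y_axis)
    x, y = start
    for _ in range(n_steps):
        # Look up the gradient at the grid node nearest to the current point.
        i = int(np.abs(x_axis - x).argmin())
        j = int(np.abs(y_axis - y).argmin())
        # Step against the gradient, staying inside the sampled domain.
        x = float(np.clip(x - learning_rate * grad_x[i, j], x_axis[0], x_axis[-1]))
        y = float(np.clip(y - learning_rate * grad_y[i, j], y_axis[0], y_axis[-1]))
    return x, y


# Illustrative surface with its minimum at (1.0, -0.5).
x_axis = np.linspace(-3.0, 3.0, 121)
y_axis = np.linspace(-3.0, 3.0, 121)
grid_x, grid_y = np.meshgrid(x_axis, y_axis, indexing="ij")
values = (grid_x - 1.0) ** 2 + 2.0 * (grid_y + 0.5) ** 2

minimum = grid_gradient_descent(values, x_axis, y_axis, start=(-2.0, 2.0))
```

Because of the nearest-node lookup, the result is accurate only up to about half the grid spacing.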

To design faster vector search, I needed among other things to perform some interpolated binary search, but eventually settled for my own probabilistic search. On occasion, I have used Re for regular expressions. Finally, I used Gmpy2 and MPmath for scientific computing. They allow you to work with integers with trillions of digits in any base, complex numbers, and special mathematical functions (Bessel, Riemann Zeta, and so on). They are especially useful in cryptography and number theory.
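For reference, the textbook interpolation search mentioned above can be sketched as follows. This is the standard technique on a sorted list of numbers, not the probabilistic search that I ended up using. The function name and the sample data are illustrative.

```python
def interpolation_search(sorted_values, target):
    """Return an index of target in a sorted numeric list, or -1 if absent."""
    low, high = 0, len(sorted_values) - 1
    while low <= high and sorted_values[low] <= target <= sorted_values[high]:
        span = sorted_values[high] - sorted_values[low]
        if span == 0:
            # All remaining values are equal, and the loop condition
            # guarantees that they are equal to the target.
            return low
        # Guess the position by linear interpolation between the two ends,
        # instead of always probing the midpoint as binary search does.
        offset = int((target - sorted_values[low]) * (high - low) / span)
        pos = min(low + offset, high)
        if sorted_values[pos] == target:
            return pos
        if sorted_values[pos] < target:
            low = pos + 1
        else:
            high = pos - 1
    return -1


data = [2, 3, 5, 8, 13, 21, 34, 55, 89]
found_at = interpolation_search(data, 21)
missing = interpolation_search(data, 4)
```

On evenly spread values, the interpolated guess lands close to the target in far fewer probes than plain binary search; on very skewed values it can be slower.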

To learn more about my home-made functions, contact me. All are open-source, free, and well documented.

Author


Vincent Granville is a pioneering GenAI scientist and machine learning expert, co-founder of Data Science Central (acquired by a publicly traded company in 2020), Chief AI Scientist at MLTechniques.com and GenAItechLab.com, former VC-funded executive, author, and patent owner (one patent relates to LLMs). Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, and CNET. Follow Vincent on LinkedIn.