Here I want to present my new book on advanced algorithms for data-intensive applications named "*Probabilistic Data Structures and Algorithms in Big Data Applications*" (ISBN: 9783748190486). The detailed information about the book you can find at its webpage and below I give you some introduction to the topic this book is about.

There is no doubt that a lot of data creates unbelievable opportunities to learn and opens new horizons for business. However, this also implies lots of technical challenges. I guess, everyone heard about **Volume**, **Velocity**, and **Variety**, *The Three V’s of Big Data*. These dimensions imply such problems as efficient membership querying, cardinality estimation, frequency and rank computation for data streams, and similarity calculation.

Many of these problems can be solved with parallel computation including usage of the MapReduce paradigm and such popular frameworks for distributed processing as Apache Hadoop, Apache Spark, Apache Flink, Apache Storm, and others.

However, at some point, the amount of data (or its frequency, or variety, or all together) rises so high that we cannot scale up anymore our system, regardless the amount of money and resources we have. Thus, we have to agree to lost some data in order to be able to continue learning. There are two main approaches in this case - Sampling and Hashing.

Sampling

- Process subset of elements
- Discard rest
- Use existing setup and algorithms
- Interpolate results

Hashing

- Process all elements
- Compression with information loss
- Use advanced algorithms
- Approximate results

This book is about **Probabilistic Data Structures and Algorithms** (PDSA) that based on ** hashing** and heavily use various pattern observation techniques to provide approximate results even under heavy input compression that lead to information loss. Such data structures have a strong theory behind and provide reliable ways to estimate possible errors which fully compensated for by extremely low memory requirements, constant query time, and scaling.

They are not necessarily new. Probably, everybody knows about the *Bloom filter* data structure, designed in the 70s, it efficiently solves the problem of performing membership queries (a task to decide whether some element belongs to the dataset or not) in a constant time without requirements to store all elements. This is an example of a probabilistic data structure, but there are much more that have been designed for various tasks in many domains.

Such data structures already implemented in many popular and well-known products such as Google Big Query, Amazon Redshift, Redis, Cassandra, CouchDB, Elasticsearch, and many others. For instance, you can use `**approx_count_distinct**` command in Spark SQL or `**PFCOUNT**` in Redis that actually powered by the *HyperLogLog* algorithm, another example of PDSA.

This book consists of six chapters, each preceded by an introduction and followed by a brief summary and bibliography for further reading relating to that chapter. Every chapter is dedicated to one particular problem in Big Data applications, it starts with an in-depth explanation of the problem and follows by introducing data structures and algorithms that can be used to solve it efficiently.

The first chapter gives a brief overview of popular hash functions and hash tables that are widely used in probabilistic data structures. Chapter 2 is devoted to approximate membership queries, the most well-known use case of such structures. In chapter 3 data structures that help to estimate the number of unique elements are discussed. Chapters 4 and 5 are dedicated to important frequency- and rank-related metrics computations in streaming applications. Chapter 6 consists of data structures and algorithms to solve similarity problems, particularly — the nearest neighbor search.

In case if you want to reach me, connect at Twitter @gakhov or visit the webpage of the book at https://pdsa.gakhov.com

© 2020 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Upcoming DSC Webinar**

- Optimization and The NFL’s Toughest Scheduling Problem - June 23

At first glance, the NFL’s scheduling problem seems simple: 5 people have 12 weeks to schedule 256 games over the course of a 17-week season. The scenarios are potentially well into the quadrillions. In this latest Data Science Central webinar, you will learn how the NFL began using Gurobi’s mathematical optimization solver to tackle this complex scheduling problem. Register today.

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Upcoming DSC Webinar**

- Optimization and The NFL’s Toughest Scheduling Problem - June 23

At first glance, the NFL’s scheduling problem seems simple: 5 people have 12 weeks to schedule 256 games over the course of a 17-week season. The scenarios are potentially well into the quadrillions. In this latest Data Science Central webinar, you will learn how the NFL began using Gurobi’s mathematical optimization solver to tackle this complex scheduling problem. Register today.

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central