Recently, I finished my book on probabilistic data structures and noticed that even though they are used in almost every well-known product that works with data (Apache Spark, Elasticsearch, Redis, CouchDB, Cassandra, ...), not many developers and software engineers are actually aware of them. Yet their underlying principles are very simple and amazingly smart.
The problem is that they are treated as too "advanced" and as having a very specific use case: handling real "big data", not just the data that crashes your MS Excel. There are not many books about them, the original publications are hard to read, and you are unlikely to find them in basic university or online courses. From my point of view, though, they are also great material for the Big Data interview process, when you want to find really smart and experienced employees.
Let me explain them a bit further to spark your interest in learning them.
When dealing with huge data, the first thing that comes to mind is parallel processing: allocate many independent workers and use, for instance, MapReduce. At some point, as our system handles more and more data, we may realize that we cannot scale our infrastructure any further: either it becomes too big and unmanageable, or too slow, or we simply have no money to pay for new servers, or all of the above.
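The MapReduce idea can be sketched in a few lines. This is a toy single-process illustration (not Hadoop or Spark): each hypothetical "worker" counts words in its own chunk independently (the map step), and the partial counts are then merged (the reduce step).

```python
from collections import Counter
from functools import reduce

def map_chunk(chunk: str) -> Counter:
    """Map step: count words in one chunk of the data."""
    return Counter(chunk.split())

def merge(left: Counter, right: Counter) -> Counter:
    """Reduce step: combine partial counts from two workers."""
    return left + right

chunks = ["a b a", "b c", "a c c"]            # stand-in for data split across workers
partials = [map_chunk(c) for c in chunks]     # each map could run in parallel
total = reduce(merge, partials, Counter())
print(total)  # Counter({'a': 3, 'c': 3, 'b': 2})
```

The point is that the map step is embarrassingly parallel, which is exactly why it stops helping once the cluster itself becomes the bottleneck.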
At this point, the usual way to go is sampling: process only a subset of the data, skip the rest, and hope that the subset is representative enough to extrapolate the results to the whole dataset. The advantage of this approach is that we keep our system and the same algorithms; nothing changes except the amount of data.
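As a minimal sketch of this idea, here is a simple random sample used to estimate the mean of a dataset; the numbers and the 1% sampling rate are illustrative, not a recommendation.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

data = list(range(1_000_000))           # stand-in for a dataset too big to process fully
sample = random.sample(data, k=10_000)  # 1% simple random sample

estimate = sum(sample) / len(sample)
exact = sum(data) / len(data)
print(estimate, exact)  # close, but not identical
```

The estimate is usually close to the true value, but there is no hard guarantee for any single run; that is precisely the trade-off sampling accepts.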
But what if we cannot ignore any data? Imagine we need to count the number of unique elements, find the most frequent items, or simply remember every element so we can check new elements for existence later. In such cases, and this is already an "advanced" decision that is not used so often, we can use hashing: compute a short representation of our elements, making them smaller. Pure hashing alone doesn't help much, though, since the problem often lies in algorithms that require at least linear memory, many passes through the data, or polynomial time to complete.
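To make the "short representation" idea concrete, here is a small sketch that maps arbitrarily long elements to fixed-size fingerprints (the choice of SHA-1 truncated to 32 bits is an illustrative assumption, not a requirement).

```python
import hashlib

def fingerprint(item: str, nbytes: int = 4) -> int:
    """Map an item to a short fixed-size integer fingerprint."""
    digest = hashlib.sha1(item.encode()).digest()
    return int.from_bytes(digest[:nbytes], "big")

# A very long element compresses to a 32-bit number:
long_url = "https://example.com/" + "a" * 1000
print(fingerprint(long_url))                             # short, deterministic
print(fingerprint(long_url) == fingerprint(long_url))    # True: same input, same fingerprint
```

Storage and comparisons no longer depend on the element's length, at the cost of rare collisions; but as the text notes, this alone does not fix an algorithm that still needs linear memory or multiple passes.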
If you have come this far, you are ready to learn probabilistic data structures and algorithms, which are optimized to use fixed or sublinear memory and constant execution time, and have many other interesting features. They are based on hashing and don't give you an exact answer, but they can guarantee an acceptable error.
Let me name some examples here: Bloom filter, Quotient filter, Count-Min Sketch, HyperLogLog, t-digest, MinHash, and many others.
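To show how simple the base principles really are, here is a minimal Bloom filter sketch: an m-bit array and k hash functions. The sizes and the SHA-256-based hashing are assumptions chosen for brevity; a production filter would size m and k from the expected number of elements and the target error rate. A Bloom filter can return false positives, but never false negatives.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions."""

    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k = m, k
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive k hash positions by salting one hash function with the index i.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: str) -> bool:
        # True means "probably present"; False means "definitely absent".
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("apple")
print("apple" in bf)   # True: an added element is always found
print("banana" in bf)  # very likely False; rare false positives are possible
```

The whole structure fits in m bits regardless of how many (or how long) the elements are, which is exactly the fixed-memory property described above.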