In the beginning (about 2008) there was Hadoop. More accurately, when people speak about Hadoop they are really talking about a group of programs built around the Hadoop Distributed File System (HDFS), which needs other applications (the Apache Software Foundation calls them projects) such as YARN, MapReduce, Hive, Pig, and others to rise to the level of the first Big Data database. Today, Hadoop 2.0 is the foundation for most offerings in this category.
Key-value stores (KVs) are the simplest of the NoSQL types, consisting only of a unique key and a bucket containing whatever data you wish to store there. The values in the buckets do not need to be consistent or follow any schema (they are schema-less).
The content of a bucket can be literally anything you like, but applications around unstructured or semi-structured data are the most common. KVs can be used to store large blocks of unstructured data (e.g. customer service logs, the stored backup image of a smartphone, weblogs, the Gettysburg Address, anything), and buckets can hold quite large entries, including BLOBs (Binary Large Objects). To read a value you need to know both the key and the bucket, as the sketch below shows.
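Here is a minimal sketch of that model, using an in-memory Python class as a stand-in for a real key-value store. The class, method names, and sample keys are illustrative assumptions, not any particular product's API, but real KVs (Riak, Redis, DynamoDB, and others) expose a similar put/get interface:

```python
# Minimal in-memory sketch of the key-value model described above.
# Names and keys are illustrative, not a real client API.

class KVStore:
    def __init__(self):
        self._buckets = {}  # bucket name -> {key -> value}

    def put(self, bucket, key, value):
        # Values are schema-less: any object or raw bytes can go in.
        self._buckets.setdefault(bucket, {})[key] = value

    def get(self, bucket, key):
        # Reads require knowing both the bucket and the key.
        return self._buckets[bucket][key]

store = KVStore()

# Heterogeneous values in the same bucket -- no shared schema.
store.put("docs", "gettysburg", "Four score and seven years ago...")
store.put("docs", "call-log-001", {"customer": 42, "notes": ["reset password"]})
store.put("backups", "phone-1234", b"\x00\x01\x02")  # a binary blob

print(store.get("docs", "gettysburg"))
```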
KVs are row-based systems designed to return the data for an entire bucket (interpreted as a row or record) in as few operations as possible. Essentially all KVs run in batch mode and are therefore used for analytic or caching projects rather than transactional applications.
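To make that single-operation read concrete, here is a small illustration. A plain Python dict stands in for one bucket, and the order record and its fields are invented for the example: one write stores the whole denormalized record, and one read returns it, with no joins.

```python
# A plain dict stands in for one bucket of a KV store; the record and
# its fields are invented for illustration.
orders = {}

# One write stores the entire denormalized record as a single value.
orders["order:1001"] = {
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "lines": [
        {"sku": "A-17", "qty": 2, "price": 9.99},
        {"sku": "B-03", "qty": 1, "price": 24.50},
    ],
}

# One read returns the whole record -- no joins across tables.
record = orders["order:1001"]
print(record["customer"]["name"], len(record["lines"]))
```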
Particular Opportunities and Project Characteristics
Rapidly ingesting large volumes of unstructured and semi-structured text and data. Text analysis and customer sentiment analysis were among the earliest and most widely adopted project types for KVs. Examples include:
Text and document data from inside or outside your company.
Call center logs.
Social media feeds.
Web logs and click data.
Bits of web pages.
Real-time data collection such as point-of-sale data or factory control systems.
Complex objects that were expensive to join in a relational database, stored whole to reduce latency.
High ingest rates lend themselves to the “Velocity” element of big data, where constant streams of data must be captured at speed.
Applications with many small, continuous reads and writes that may be volatile (see also document stores for even greater capability).
Creating ever-growing datasets that are rarely accessed (caching).
Retrieving data from an entire bucket, such as the contact information in a rolodex system or the product information in an online shopping system.
Where write performance is your highest priority (see the ingest sketch after this list).
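As an illustration of that write-first pattern, the sketch below simulates capturing a point-of-sale event stream. The key scheme (timestamp plus a random suffix) and the in-memory dict are assumptions for the example, not any particular product's API, but the access pattern matches how a write-optimized KV is used:

```python
import time
import uuid

# In-memory stand-in for a write-optimized KV bucket; a real store
# would persist these writes, but the access pattern is the same.
events = {}

def ingest(event):
    # Unique, roughly time-ordered key: no contention between writers,
    # and the dataset simply grows with each new event.
    key = f"{time.time():.6f}-{uuid.uuid4().hex[:8]}"
    events[key] = event
    return key

# Simulated point-of-sale stream.
for register in (1, 2, 3):
    ingest({"register": register, "sku": "A-17", "qty": 1})

print(f"{len(events)} events captured")
```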
July 23, 2014
Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2014, all rights reserved.
About the author: Bill Vorhies is President & Chief Data Scientist of Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at:
This original blog can be viewed at:
All nine lessons can be downloaded as a White Paper at: