Lesson 7: Column Oriented Databases (aka Big Table or Wide Column)

Summary:  Column Oriented DBs excel at OLAP and are efficient at partial updates.

Many folks believe that Hadoop is the original NOSQL database and that it was the first available commercially, in 2008.  But Hadoop, and the HBase store that runs on it, grew out of research papers published by Google a few years earlier, among them the paper describing their proprietary Big Table NOSQL store.  Big Table was the inspiration for Column Oriented DBs (CODBs), though the current crop of offerings pushes far beyond its Google roots.

Unlike the row-based systems (Key Value and Document Oriented DBs), these are, as the name implies, oriented to storing data in columns.  Where Document Oriented DBs excel at OLTP (online transaction processing), Column Oriented DBs excel at OLAP (online analytical processing).

Characteristics

Data are stored in cells grouped in columns as opposed to rows.  Columns are grouped in Column Families and each can contain an essentially unlimited number of columns.  Each storage block contains data from only one column.
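The layout described above can be sketched in a few lines of Python.  This is only an illustration built on plain dictionaries, not how any particular product stores data; the sample records, the function name to_column_store, and the column family names "personal" and "job" are all assumptions made for the example.

```python
# Three records as a row store would hold them: one complete record per entry.
# (Hypothetical sample data.)
rows = [
    {"id": 1, "last_name": "Smith", "level": "X", "salary": 50000},
    {"id": 2, "last_name": "Jones", "level": "Y", "salary": 60000},
    {"id": 3, "last_name": "Smith", "level": "X", "salary": 70000},
]


def to_column_store(records, key="id"):
    """Pivot row records into {column_name: {row_key: value}}."""
    columns = {}
    for record in records:
        row_key = record[key]
        for name, value in record.items():
            if name == key:
                continue  # the row key identifies the cell; it is not a column
            columns.setdefault(name, {})[row_key] = value
    return columns


columns = to_column_store(rows)

# Columns grouped into (hypothetical) column families.
column_families = {
    "personal": {"last_name": columns["last_name"]},
    "job": {"level": columns["level"], "salary": columns["salary"]},
}

print(columns["last_name"])
```

Each inner dictionary plays the role of one column: all of its values can be read without touching the other columns.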

Data can be sparse, that is, not all cells need to be filled, and because each column holds values of the same kind, the cell-to-column organization allows for greater compression of data on the disks.  Compression reduces query time since fewer read actions are required.
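As a rough sketch of both ideas, a sparse column can simply omit its empty cells, and a column with many repeated values compresses well under run-length encoding.  Run-length encoding is used here only as one simple example of column compression; the source does not say which schemes real products use, and the sample values and function names are hypothetical.

```python
from itertools import groupby

# A sparse column: only the filled cells are stored, keyed by row id.
middle_name = {2: "Ann", 7: "Lee"}   # rows 1, 3, 4, 5, 6 have no value
print(middle_name.get(3))            # an empty cell simply has no entry


def rle_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# A column of similar values, as it might sit in one storage block.
level_column = ["X", "X", "X", "Y", "Y", "Z", "Z", "Z", "Z"]
encoded = rle_encode(level_column)
print(encoded)
```

Nine stored values shrink to three pairs, so a scan of this column reads less data.  A row store interleaves these values with names, salaries and so on, which breaks up the runs.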

Moreover, CODBs would be selected where queries are likely to look at similar data items on many different records; for example, "find all the people with the last name Smith" can be answered by reading a single column.  Other operations, like counting the number of matching records or performing math over a set of data (e.g. find the average salary of all employees at level X), can be much faster.  Since these data elements reside in single columns they can be retrieved very quickly.  The CODB might be able to retrieve a single data item from all records in a single operation, in contrast to row-based systems, where each row would need to be read and the data items extracted.  Compared to RDBMS this speed increase can be in the range of 5X to 100X.  Consequently CODBs are the go-to solution for OLAP applications.
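The two example queries can be sketched against both layouts.  This is a toy in-memory comparison meant to show which data each query has to touch, not a benchmark; the sample records and function names are assumptions.

```python
from statistics import mean

# Row layout: every query walks whole records. (Hypothetical sample data.)
rows = [
    {"last_name": "Smith", "level": "X", "salary": 50000},
    {"last_name": "Jones", "level": "Y", "salary": 60000},
    {"last_name": "Smith", "level": "X", "salary": 70000},
]

# Column layout: one list per column, aligned by position.
columns = {
    "last_name": ["Smith", "Jones", "Smith"],
    "level": ["X", "Y", "X"],
    "salary": [50000, 60000, 70000],
}


def smiths_row_store(records):
    """Read every full record, then pull out the one field of interest."""
    return [i for i, record in enumerate(records) if record["last_name"] == "Smith"]


def smiths_column_store(cols):
    """Read only the last_name column; the other columns are never touched."""
    return [i for i, name in enumerate(cols["last_name"]) if name == "Smith"]


def avg_salary_at_level(cols, level):
    """Aggregate over two columns (level and salary) out of however many exist."""
    return mean(
        salary
        for lvl, salary in zip(cols["level"], cols["salary"])
        if lvl == level
    )


print(smiths_column_store(columns))
print(avg_salary_at_level(columns, "X"))
```

Both layouts return the same answers.  The difference is that the column versions read one or two columns, while the row version must pass over every field of every record.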

Advantages

  • Good horizontal scaling.  High availability.
  • Supports semi-structured data (as do Document DBs).
  • Compression allows efficient access to data stored on hard disks reducing seek time and latency. Only fully in-memory databases offer faster access.
  • Efficient when an aggregate value needs to be computed over many rows but where the number of columns queried is significantly smaller than the whole.
  • Particularly efficient when updating all the values in a column at once as the new column can be written efficiently without touching other data columns.
  • Strong for OLAP applications which call for a smaller number of highly complex queries over large quantities of data (terabytes or greater).
  • Can handle logging continuous streams of data that may not require consistency guarantees (see lesson 2) as they can accommodate high volumes of writes over the distributed architecture.
  • Compared to RDBMS, 5X to 100X faster query performance and 5X to 10X less disk space due to compression.
  • Can span multiple data centers.
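The point about updating all the values in a column at once can be illustrated with a small sketch.  It assumes a toy store held as a dictionary of lists and a hypothetical helper named replace_column; real products manage files and blocks rather than Python objects.

```python
# Toy column store: one list per column. (Hypothetical sample data.)
columns = {
    "last_name": ["Smith", "Jones", "Smith"],
    "salary": [50000, 60000, 70000],
}


def replace_column(store, name, new_values):
    """Return a store in which one column is rewritten and the rest are reused."""
    if len(new_values) != len(store[name]):
        raise ValueError("new column must have one value per row")
    updated = dict(store)             # shallow copy: other columns are shared as-is
    updated[name] = list(new_values)  # only this column is written
    return updated


# Give everyone a 1000 raise by writing a fresh salary column.
updated = replace_column(columns, "salary", [s + 1000 for s in columns["salary"]])

print(updated["salary"])
print(updated["last_name"] is columns["last_name"])
```

The last_name column in the updated store is the very same object as before, which mirrors the claim that the other data columns are not touched.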

Disadvantages

  • Not optimum for transactional applications.
  • More complex data model (compared to KVs and DODBs).
  • Unsuited for highly interconnected (graph) data.

Particular Opportunities and Project Characteristics

  • Best when writing more than you read (e.g. logging).  Writes faster than it reads so real time data analytics is a strength.
  • Good at near real time data analytics where queries are relatively simple.  More complex queries may have more latency and may best be run in batch mode.
  • Applications requiring near real time random read and write access to data.  For example, large scale web messaging databases.
  • Any application requiring ad hoc queries which may include aggregate calculations over large numbers of similar data items.
  • Good for creating recommenders (in lieu of graph databases).

Representative Vendors (not a recommendation): Hadoop / HBase, Cassandra, Accumulo, Cloudera, MapR, and many others.

July 23, 2014

Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2014, all rights reserved.

 

About the author:  Bill Vorhies is President & Chief Data Scientist  of Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

[email protected]

This original blog can be viewed at:

http://data-magnum.com/lesson-7-column-oriented-databases-aka-big-t...

All nine lessons can be downloaded as a White Paper at:

http://data-magnum.com/resources/white-papers/

Tags: Accumulo, Cassandra, Cloudera, HBase, OLAP, column oriented databases

Comment by Samson Sani Nzevela on September 17, 2014 at 12:21am

I enjoyed this very useful lesson. Thank you for sharing.
