
Summary:  Yes, it’s a real phrase, and it’s the secret to picking the right NoSQL database.

You can drop this phrase at your next Big Data tech meeting:  “Polyglot Persistence”.  Yes, it’s a real thing.  Polyglot means speaking many languages, but in Big Data it means picking the right NoSQL DB for the right application.

If you’re already deep into Big Data then you’ve probably figured this out intuitively, but if you’re just getting started you may not yet have realized that there is no ‘one best choice’ for all cases.  In fact, the major takeaway here is that you’re going to want several different types of NoSQL (Key Value, Document, Columnar, and Graph), and in some cases even more than one database of the same type, depending on your actual use case.

The early adopters already figured this out.  Going back a couple of years, in addition to RDBMS, Disney was using Cassandra, Hadoop, and MongoDB.  Netflix was using Cassandra, HBase, and SimpleDB.  Twitter was using Cassandra, FlockDB, HBase, and MySQL.  Mendeley was using HBase, MongoDB, Solr, and Voldemort.

Running multiple NoSQL DBs is the rule, not the exception.  So how do you begin to focus in on the NoSQL type best suited for your project?  Here’s what Polyglot Persistence might look like today:

| Functionality | Considerations | DB Type |
|---|---|---|
| User Sessions | Rapid access for reads and writes.  No need to be durable. | Key-Value |
| Financial Data | Needs transactional updates.  Tabular structure fits data. | RDBMS |
| POS Data | Depends on size and rate of ingest.  Lots of writes, infrequent reads mostly for analytics. | RDBMS (if modest), Key-Value or Document (if ingest very high), or Column (if analytics is key) |
| Shopping Cart | High availability across multiple locations.  Can merge inconsistent writes. | Document (maybe Key-Value) |
| Recommendations | Rapidly traverse links between friends, product purchases, and ratings. | Graph (Column if simple) |
| Product Catalog | Lots of reads, infrequent writes.  Products make natural aggregates. | Document |
| Reporting | SQL interfaces well with reporting tools. | RDBMS, Column |
| Analytics | Large-scale analytics on a large cluster. | Column |
| User activity logs, CSR logs, Social Media analysis | High volume of writes on multiple nodes. | Key-Value or Document |
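
To make the table above concrete, here is a minimal Python sketch that encodes a few of its considerations as rules and suggests a store type per workload.  The `Workload` fields, the `suggest_store` function, the rule order, and the fallback to RDBMS are all assumptions made for illustration; they simplify the table (for example, where it lists two options only the first is returned) and are no substitute for testing against your real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of one application function."""
    name: str
    needs_transactions: bool = False     # e.g. financial data
    link_traversal: bool = False         # e.g. recommendations
    large_scale_analytics: bool = False  # e.g. cluster-wide analytics
    natural_aggregates: bool = False     # e.g. product catalog
    write_heavy: bool = False            # e.g. activity logs
    durable: bool = True                 # user sessions can set this to False


def suggest_store(w: Workload) -> str:
    """Return one candidate DB type, loosely following the table.

    The rule order is an illustrative assumption, not a prescription.
    """
    if w.needs_transactions:
        return "RDBMS"
    if w.link_traversal:
        return "Graph"
    if w.large_scale_analytics:
        return "Column"
    if w.natural_aggregates:
        return "Document"
    if w.write_heavy or not w.durable:
        return "Key-Value"
    return "RDBMS"  # assumed fallback when no rule applies


workloads = [
    Workload("user sessions", durable=False),
    Workload("financial data", needs_transactions=True),
    Workload("recommendations", link_traversal=True),
    Workload("product catalog", natural_aggregates=True),
    Workload("analytics", large_scale_analytics=True),
    Workload("activity logs", write_heavy=True),
]

# One application, several store types: the polyglot persistence idea.
plan = {w.name: suggest_store(w) for w in workloads}
for name, store in plan.items():
    print(f"{name}: {store}")
```

Even this toy application ends up with five different store types, which is the point of the table: the workload, not a blanket preference, drives the choice.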

 

Where to start?  The answer is, as always: test, test, test.
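
One way to begin testing is a small harness that runs the same read and write mix against each candidate store and records the timings.  The sketch below is only an illustration: `InMemoryStore` is a hypothetical stand-in backed by a Python dict, and `run_trial` is an assumed helper name.  In practice you would wrap each real database client in the same `put`/`get` interface and use data volumes that resemble your own.

```python
import time


class InMemoryStore:
    """Hypothetical stand-in for a real database client."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def run_trial(store, n_writes, n_reads):
    """Time n_writes puts followed by n_reads gets against one store."""
    start = time.perf_counter()
    for i in range(n_writes):
        store.put(f"k{i}", {"n": i})
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    for i in range(n_reads):
        # Cycle over the keys written above.
        store.get(f"k{i % max(n_writes, 1)}")
    read_s = time.perf_counter() - start

    return {"write_s": write_s, "read_s": read_s}


# Two workload shapes from the table: write-heavy logs, read-heavy catalog.
candidates = {"in-memory stand-in": InMemoryStore}
results = {}
for label, make_store in candidates.items():
    results[label] = {
        "write_heavy": run_trial(make_store(), n_writes=10_000, n_reads=100),
        "read_heavy": run_trial(make_store(), n_writes=100, n_reads=10_000),
    }

for label, trials in results.items():
    for shape, timing in trials.items():
        print(label, shape, timing)
```

Timings from a single-process loop like this say nothing about clustering, durability, or network latency, so treat them as a first filter rather than a verdict.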

The good news is that acquiring a NoSQL DB is relatively cost effective if you’re going for a Hadoop distribution.  These are generally priced in the low to mid five figures, so much less on an annual basis than the cost of one software engineer.  Of course, you will need qualified staff with these skills, and that may mean one person or many depending on your needs.

And if you use one of the major Hadoop distributors, then you’ll most often get at least Key Value, Column, and Graph as part of the same distribution package, with SQL-friendly Drill and Spark as well.

(Based approximately on MapR's distribution)

Still, you need to pick a project to start with and that can be whatever has the highest business value.  But as your experience with NoSQL grows, plan on having several NoSQL DB types to facilitate your business objectives.

 

May 7, 2015

Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2015, all rights reserved.

 

About the author:  Bill Vorhies is President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

[email protected]

The original blog can be seen at:

http://data-magnum.com/polyglot-persistence/


Tags: NoSQL, hadoop distribution, polyglot persistence
