Subscribe to DSC Newsletter

Lesson 2: NoSQL Databases are Good for Everything – Except Maybe this One Thing

Summary:  In general, it is true that NOSQL databases can do everything that RDBMS can do.  And almost always when data is ‘big’ they can do it faster and cheaper.  There is one exception where you’ll need to pay close attention.

In a technical discussion we would launch into the details about how RDBMS have been designed around the principles of ACID, Atomicity, Consistency, Isolation, and Durability.  NOSQL databases however have been designed around the principles of BASE, Basically Available, Soft State, Eventually Consistent.

We promised this would not get overly technical, so here’s what you need to know.  NOSQL databases achieve their speed and low cost by spreading the data over many different servers and creating several copies of the data on different servers (also known as MPP, massive parallel processing).  Your IT staff will specify how many replications of the data are to occur and typical numbers would be three or four.  Your data (and its replicated copies) may be spread over those same three or four servers or across hundreds or even thousands of servers if your project is really big.

Here’s the core of the issue.  When data arrives in the system the NOSQL controller decides where to store it first.  There may be a momentary delay ranging from milliseconds to minutes depending on volume before the data is the same in all three or four duplicated locations.  If one of your nodes is off line or requires manual intervention, the delay could be hours, but this would be rare.

So it is POSSIBLE, if not probable that in that millisecond-to-several-second window, if the data is queried the user MIGHT NOT see the same data in each location.  In NOSQL terminology, this is ‘eventual consistency’ (though eventual can be a very short period of time).  In the RDBMS by contrast, the data is stored only once and promises to be immediately and consistently correct.

The primary business cases where you are most likely to encounter a challenge are in financial transactions or inventory (stock availability). 

Take the common example in which a deposit is made to fund A followed immediately by a transfer from fund A to fund B.  Since NOSQL relies on distributing the processing these two separate transactions may arrive and take action on fund A and fund B at different times depending on which of the three or four copies is being read.  The eventual consistency feature of NOSQL will ensure that all the points of storage agree, but there may be that lapse of a few milliseconds to a few minutes before all the reads from the system will agree.  If your customer immediately looks up his balance or initiates a third transaction, it could conceivably show different balances depending on which node is read at exactly which instant.

Lest you think this is an abstract problem, here’s one you probably read about.  Understanding the time lag of eventual consistency, a skilled hacker with an account balance of say $500 could write a program to debit his account for $500, and in fact send that command 20,000 times in a single second actually withdrawing $10,000,000.  The account balance will eventually become consistent (perhaps only a few seconds or minutes later) and show the correct balance of ($9,999,500).  This is exactly what happened, at web scale, to the cash that used to reside in the now defunct Bitcoin Exchange.

Remember though that the market is changing quickly and many players are stepping up to the need for atomicity and immediate consistency in their NOSQL offerings.  Among NOSQL offerings MongoDB, MarkLogic, and Splice Machine claim immediate consistency.  Among NewSQL offerings Aerospike, and FoundationDB are two who specifically claim to solve this problem.  There are probably more in each category and it won’t be long before there are many more.  If you’re dealing with financial transactions or anything else that sounds suspiciously like our example, be sure to probe this carefully.

Second, from a business perspective, this window of potential disagreement in the data is quite short.  Common sense tells us that for most applications we will never tell a customer “sorry, I can’t take your money right now because our data isn’t consistent”.  Take the order and if necessary correct and apologize later.  Unless you have a specific financial exposure as in the example, NOSQL and NewSQL databases are delivering as promised.

 

July 23, 2014

Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2014, all rights reserved.

 

About the author:  Bill Vorhies is President & Chief Data Scientist  of Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

[email protected]

This original blog can be viewed at:

http://data-magnum.com/lesson-2-nosql-databases-are-good-for-everything-except-maybe-this-one-thing/

All nine lessons can be seen as a White Paper available at:

http://data-magnum.com/resources/white-papers/

Views: 1730

Tags: ACID, BASE, consistency, eventual

Comment

You need to be a member of Data Science Central to add comments!

Join Data Science Central

Comment by Lucky Balaraman on September 14, 2014 at 8:46pm

@Wlliam,

Thanks much for your painstaking, thoughtful reply. I now understand the phenomenon to the extent my knowledge of distributed storage systems enables me.

Best,

Lucky

Comment by William Vorhies on September 12, 2014 at 8:04am

Hmmmm.  Always good to check our sources and I accepted this explanation from a single source.  So I delved a little deeper and found some disagreement and some consensus all of which points out that this explanation may be over simplified but is still an exploitable flaw.  There's way too much source material to detail here.  If you're so inclined, Google 'what happened to the cash in Mt. Gox" and 'transaction malleability'.

So a couple of qualifiers.  There does not seem to be agreement on exactly what happened at Mt Gox including a theory that the US Government performed the hack and stole the money??  However, two themes emerge that support this weakness in eventual consistency.  First, Mt. Gox used a first generation NoSQL DB. It appears to be MongoDB and like Key-Values at the time was riddled with many problems including this one.  Here's an explanation from http://aphyr.com/posts/284-call-me-maybe-mongodb.

"In some failover situations primaries will have accepted write operations that have not replicated to the secondaries after a failover occurs. This case is rare and typically occurs as a result of a network partition with replication lag. When this member (the former primary) rejoins the replica set and attempts to continue replication as a secondary the former primary must revert these operations or “roll back” these operations to maintain database consistency across the replica set."

A second contributor to the flaw in this early version software was probably something called 'transaction malleability'.  If you Google that you'll find a Wikipedia page that explains in some detail how such a hack could be used either in conjunction with or in place of the eventual consistency flaw.  Across several sources there is disagreement on which of these, or if either of these is actually at fault.  Or was it the US Government?

Finally, there's an excellent Wired article from March of this year that goes on to point out that Mt.Gox used neither versioning software nor a test environment when producing it's production code.  Holy Cow Batman!

Eventual consistency is still a bugaboo that NoSQL vendors are addressing and one that has driven NewSQL from the beginning since NewSQL DBs started with ACID and presumably never had this problem.

Anyhow, thanks for the question and getting me to go back to do some more checking.  Hope this helps.

Comment by Lucky Balaraman on September 11, 2014 at 7:44pm

Re 20k requests in a second: Each of the (say) 4 locations initially contains $500. The first request hits the first location and debits $500. The second request comes in and goes to (say) the second location, which has not been updated with the first debit, so the second request goes through and the second location is debited $500. Also, if the second request had gone to the first location, the request would have returned a 'no funds' status because the first location was empty.

Extrapolating this process, not more than 4 requests out of the 20k would have succeeded.

What am I missing here?

Videos

  • Add Videos
  • View All

© 2019   Data Science Central ®   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service