Hadoop vs. NoSQL vs. SQL vs. NewSQL By Example

Although mainframe hierarchical databases are still very much alive today, relational databases (RDBMS, queried with SQL) have dominated the database market, and they have done a lot of good. They are the reason the money we deposit doesn't go to someone else's account, our airline reservation guarantees us a seat on the plane, and we are not blamed for something we didn't do. An RDBMS owes this data integrity to its adherence to the ACID principles (atomicity, consistency, isolation, and durability). RDBMS technology dates back to the 1970s.
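To make the bank-deposit point concrete, here is a minimal sketch of atomicity using Python's built-in sqlite3 module. The table, the account names, and the helper functions are hypothetical and exist only for illustration. The transfer credits the receiver first and then debits the sender; if the debit would overdraw the account, the whole transaction is rolled back, so the credit disappears as well.

```python
import sqlite3


def make_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; True on success."""
    try:
        # `with conn` commits if the block succeeds and rolls back
        # every statement in it if an exception is raised.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # The CHECK constraint rejects a negative balance here.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


bank = make_bank()
print(transfer(bank, "alice", "bob", 30), balances(bank))
print(transfer(bank, "alice", "bob", 500), balances(bank))
```

The second transfer fails, and the balances it prints are the same as after the first one: the money was neither lost nor duplicated.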

So what changed? Web technology started the revolution. Today, many people shop on Amazon. RDBMS was not designed to handle the number of transactions that take place on Amazon every second. The primary constraining factor was RDBMS’ schema.

NoSQL databases offered an alternative by eliminating schemas at the expense of relaxing ACID principles. Some NoSQL vendors have made great strides toward mitigating the resulting consistency problem; their approach is called eventual consistency, in which copies of the data may disagree briefly but converge once updates stop. As for NewSQL, the idea was: why not create a new RDBMS without the traditional RDBMS's shortcomings, using modern programming languages and technology? That is how some of the NewSQL vendors came to life. Other NewSQL companies created augmented solutions for MySQL.
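The idea of eventual consistency can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the Replica class, the sync function, and the "last write wins" rule based on a timestamp are hypothetical simplifications, and real NoSQL products use their own, more elaborate replication and conflict-resolution mechanisms.

```python
class Replica:
    """A toy replica that stores key -> (timestamp, value)."""

    def __init__(self):
        self.store = {}

    def write(self, key, value, timestamp):
        """Accept the write only if it is newer than what we hold."""
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        """Return the locally known value, which may be stale."""
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions, keeping the newest of each."""
    for key, (timestamp, value) in list(a.store.items()):
        b.write(key, value, timestamp)
    for key, (timestamp, value) in list(b.store.items()):
        a.write(key, value, timestamp)


# Two replicas receive different updates for the same key.
east, west = Replica(), Replica()
east.write("high_score", 10, timestamp=1)
west.write("high_score", 25, timestamp=2)

# Before synchronization, readers can see different answers.
print(east.read("high_score"), west.read("high_score"))

# After synchronization, both replicas agree on the newest write.
sync(east, west)
print(east.read("high_score"), west.read("high_score"))
```

The window between the two print statements is the price paid for relaxing ACID: a reader may briefly see an old value, which is acceptable for a game score but not for a bank balance.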

Hadoop is a different animal altogether. It's a file system and not a database. Hadoop's roots are in internet search engines. Although Hadoop's associates (HBase, MapReduce, Hive, Pig, ZooKeeper) have turned it into a mighty database, at its core Hadoop is a scalable, inexpensive, fault-tolerant distributed file system. Hadoop's specialty at this point in time is batch processing, which makes it suitable for data analytics.

Now let's start with our example. My imaginary video game company recently put its most popular game online, after ten years in business shipping games to retailers around the globe. Our customer information is currently stored in a SQL Server database, and we have been happy with it. However, since players started playing the game online, the database has not been able to keep up and users are experiencing delays. As our user base grows rapidly, we spend money buying more and more hardware and software, but to no avail. Losing customers is our primary concern. Where do we go from here?

We decide to run our online game application on NoSQL and NewSQL simultaneously by segmenting our online user base. Our objective is to find the optimal solution. The IT department selects Couchbase for NoSQL (document-oriented, like MongoDB) and VoltDB for NewSQL.

Couchbase is open source, has an integrated caching mechanism, and can automatically spread data across multiple nodes. VoltDB is an ACID-compliant RDBMS that is fault tolerant, scales horizontally, and has a shared-nothing, in-memory architecture. In the end, both systems are able to deliver. I won't go into the intricacies of each solution because this is only an example, and comparing these technologies in the real world will require testing, benchmarking, and in-depth analysis.
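Spreading data across nodes and scaling horizontally both rest on the same basic idea: each record is assigned to exactly one node based on its key. The sketch below shows plain hash partitioning. The function names and player keys are hypothetical, and this is not a description of how Couchbase or VoltDB assign data internally; real systems use more elaborate schemes so that adding a node does not move most of the data.

```python
import hashlib


def node_for(key, node_count):
    """Map a key to a node index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def partition(keys, node_count):
    """Group keys by the node that owns them."""
    cluster = {node: [] for node in range(node_count)}
    for key in keys:
        cluster[node_for(key, node_count)].append(key)
    return cluster


players = ["player-%d" % i for i in range(10)]
for node, owned in partition(players, 3).items():
    print(node, owned)
```

Because no two nodes own the same key, each node can serve its own players without coordinating with the others, which is what "shared-nothing" refers to.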

Now that the online operations are running smoothly, we want to analyze our data to find out where we should expand our territory. Which countries are the most suitable for marketing our products? To answer that, we need to merge the SQL Server customer data warehouse with the data from the online gaming database and run analytical reports. That's where Hadoop comes in. We configure a Hadoop system and merge the data from the two data sources. Next, we use Hadoop's MapReduce in conjunction with the open source R programming language to generate the analytics reports.
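The MapReduce step can be pictured with a small in-process sketch. This is not Hadoop code and does not use R; it is plain Python that mimics the map, shuffle, and reduce phases on two tiny hypothetical datasets standing in for the data warehouse and the online gaming database. The record fields are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(records):
    """Emit a (country, 1) pair for every record."""
    for record in records:
        yield record["country"], 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_country(*sources):
    """Merge several record sources and count records per country."""
    merged = chain.from_iterable(sources)
    return reduce_phase(shuffle(map_phase(merged)))


# Hypothetical extracts from the two data sources.
warehouse = [
    {"customer_id": 1, "country": "US"},
    {"customer_id": 2, "country": "DE"},
]
online = [
    {"player_id": "a", "country": "US"},
    {"player_id": "b", "country": "JP"},
    {"player_id": "c", "country": "US"},
]

print(count_by_country(warehouse, online))
```

On a real cluster the map and reduce functions run in parallel on many machines over files stored in Hadoop's file system, but the division of labor is the same.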

Comment by Priyanka Jain on August 20, 2015 at 10:00pm

Hi Fari Payandeh,

Big data has great potential to benefit organizations in any industry, anywhere in the world. It is useful for decision-making and can help improve an organization's financial position, and to achieve this, organizations are adopting technology with ever higher performance.

Thanks!

Priyanka jain - OLAP on Hadoop 

Comment by Riya Saxena on April 16, 2015 at 11:44pm

Great post! Thanks for sharing! Each of these technologies is closely associated with big data, so there's overlap in terms of what they are designed to do. For example, they're great for managing large and rapidly growing data sets, and they're great for handling a variety of data formats, even if those formats change over time. More at www.youtube.com/watch?v=1jMR4cHBwZE

Comment by Glinca Vitalie on September 24, 2013 at 5:02am

Thank you, Fari, for the suggestions and the productive discussion.

Anyway, I will study VoltDB more deeply, because speed is not the least important requirement for a contemporary DBMS. As for data integrity, since my project is in the prototyping phase, yes, I will stay with a traditional RDBMS. But in my future "NewSQL" I will not use something like B+ trees for indexes. That is another story; if you want, you are welcome to ask me.

Comment by Fari Payandeh on September 24, 2013 at 1:18am

Hi Glinca,

Not a problem. I know what you are asking now. If your concern is indexes, VoltDB uses tree indexes, and I'd think they are the same as or similar to RDBMS B+ tree indexes. The problem with VoltDB is that it doesn't enforce data integrity at the database layer because, unlike a traditional RDBMS, it doesn't support referential integrity as part of its engine. The responsibility for ensuring data integrity has been shifted to developers, and this is in my view a problem with VoltDB. So, if that is part of your requirements, then I would suggest that you stay with an RDBMS like MySQL, SQL Server, Oracle, ... because data integrity is guaranteed as long as you configure it correctly (establishing foreign key to primary key relationships).

Comment by Glinca Vitalie on September 23, 2013 at 7:16pm

Good day Fari.

Sorry for my bad English. I'm not a specialist in the SQL server domain, but I am now working on a prototype of ... let's call it "NewSQL". I use MS Access as the prototyping platform (all I need from an RDBMS right now is a good data integrity definition). When you create a relationship with enforced data integrity, Access creates a new index on both sides of each new relationship. That's why data integrity is tied to indexes. I think every RDBMS must have a similar mechanism.

An index usually helps you select data, especially with complex SQL (many joins and sorts). But at the same time, each index reduces performance when you add new records, because for each new record the RDBMS engine must rebuild every index of the affected table. That's why the constraining factor for the performance of an old RDBMS is the schema. And for me it is very important to know: has the new RDBMS removed the data integrity features along with the "shortcomings"?

Sorry for the long message; it's all I can manage for now. Thank you for your attention. Vit.

Comment by Fari Payandeh on September 23, 2013 at 12:36pm

Glinca,

I studied VoltDB's architecture and I can engage now, but I need to know what you mean by "your OldSQL and in VoltDB and in special the same number and type of indexes?" In other words, what did you read about indexes that made you pose the question?

Comment by Fari Payandeh on September 22, 2013 at 6:05am

Glinca,

I don't know the architecture of these systems well enough to answer the question. My objective was to give a high level overview of these technologies because there seems to be a lot of confusion out there. I need to learn a lot more...  my to-do list keeps growing...

Comment by Glinca Vitalie on September 20, 2013 at 11:22pm

Thank you, Fari, for the answer. I am trying to understand what makes VoltDB so fast vs. OldSQL's. I found some explanation at http://voltdb.com/why-voltdb-so-fast/. But what is most interesting for me: did you use the same data integrity definition in both cases, your OldSQL and in VoltDB and in special the same number and type of indexes?

Comment by Fari Payandeh on September 20, 2013 at 12:49pm

Glinca,

You are correct. I just didn't want to appear to be advertising for VoltDB, but that's my personal preference.

Comment by Glinca Vitalie on September 20, 2013 at 7:25am

And what will you choose in the end? If both systems have the same performance, but VoltDB gives you the possibility to manage more complex data structures with a good integrity mechanism, then the choice is obvious?

© 2018 Data Science Central