
What are the differences? Which database system would you recommend, depending on the size and complexity of your data?



Replies to This Discussion

Vincent, the main criterion I would use is the purpose rather than size or complexity; these databases cover different models - structured or unstructured data, columnar, key-value, and so on.

I recently attended an Infobright event and they gave a good overview of these big data databases.

Mario 

Thanks Mario, this is helpful.

Hi Vincent,

A very simplistic overview here - you can connect offline if you want a more detailed analysis. My creds: I have been a MongoDB user/developer for over two years and a CouchDB user/developer for over three. I was also VP of BizDev for CouchOne in 2010-11, before its merger with Membase, so I spent a lot of time doing technical analysis of the offerings in the market. Recently I created a pilot HBase cluster for an advertising company. I do not have direct experience with Cassandra but had to evaluate it against HBase for that advertising project.

I've been in the SQL world since 1991 via Ingres, Sybase, Oracle.

Having said that, here goes.

HBase and Cassandra are for very high data volumes and data rates - they require at least three servers (if not more) in a cluster just to get going, and they are appropriate, in terms of overhead versus value, for data inflows of GBs/day or more.

(For development you can run in pseudo-distributed or single-server mode, but this doesn't give you much beyond the ability to play with the APIs on your laptop.)

Also, with HBase and Cassandra the learning curve is high, the number of trained people is low, and the ecosystem in terms of development tools and integration with reporting tools is very thin - you have to build a lot of this yourself. However, things are improving as time goes by. Re: Cassandra - it was spawned as an open source project by Facebook, but even FB appears to be moving to HBase now. Older polls on the net showed Cassandra was dominant, but that has changed in the last two years or so.

HBase and Cassandra are both written in Java and require some non-trivial Java experience. They are both Apache projects.

Re: HBase and Cassandra, they both require a lot more effort in data modeling as there is no inherent data format like JSON built in - essentially they are distributed KV stores that provide a bare API. CouchBase and MongoDB can also be thought of as distributed KV stores, but they provide a syntactic layer of JSON on top, so collections of KV pairs can be stored as what they both call a "document".
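
To make that contrast concrete, here is a rough Python sketch; the hosts, table, collection, and field names are made up, and it assumes the happybase (HBase via Thrift) and pymongo client libraries:

    import happybase                                 # HBase client (via the Thrift gateway)
    from pymongo import MongoClient

    # HBase: a bare KV/columnar API - you pick column qualifiers and encode bytes yourself.
    hb = happybase.Connection('hbase-host')          # hypothetical host
    users = hb.table('users')                        # hypothetical table
    users.put(b'user:1001', {b'info:name': b'Alice', b'info:city': b'Austin'})
    row = users.row(b'user:1001')                    # dict of raw byte values

    # MongoDB: the same data goes in as a JSON-style document, no manual encoding.
    db = MongoClient('mongodb-host').appdb           # hypothetical host/database
    db.users.insert_one({'_id': 1001, 'name': 'Alice', 'city': 'Austin'})
    doc = db.users.find_one({'_id': 1001})           # comes back as a dict/JSON document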

Since HBase is built on top of HDFS (Hadoop Distributed File System), integrating with Hadoop for long-running, batch-mode operations is easier.

MongoDB and CouchBase are much more developer friendly - MongoDB especially so, being well integrated into the "LAMP" stack and having a much broader developer ecosystem, particularly in the Ruby on Rails community.

MongoDB can be used via a SaaS offering from MongoLab, and CouchDB via IrisCouch.

CouchBase is based on the open source Apache project CouchDB and adds enterprise-y features plus infrastructure for monitoring etc. on top of it. Also, CouchOne the company merged with Membase the company in early 2011, and CouchBase, the result, has a caching layer in front that allows millisecond responses for webapps such as social games.

CouchBase is also developing tools for mobile apps that use CouchBase as a back end. The recent acquisition of OMGPOP, a mobile game developer, by FB highlighted the fact that the backend can be deployed on a single server at the low end and then expanded to clusters as your data grows. With MongoDB there are non-trivial transition issues when going from one server to many at that very initial stage. However, the latest version of CouchBase claims that servers can be added transparently and clusters expanded horizontally with almost zero overhead. This was supposedly the reason OMGPOP was able to keep up with massive growth in its user base in a very short time.

Aside from all the above, both MongoDB and CouchBase use JSON as an on-the-wire format (requests and responses are JSON) and also for querying.

CouchBase has a REST API; MongoDB has a console-based query client like the MySQL or PostgreSQL client, but queries are formulated in a JSON syntax. Both have MapReduce built in, though they differ in implementation and application: Mongo uses it for sums/counts etc., while CouchBase uses it to create automatically updated indexes.
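
As a rough illustration of the two styles (the host, database, and field names below are invented): in MongoDB the query itself is a JSON-like document and counts go through the aggregation pipeline, while in CouchDB/CouchBase a map function stored in a design document becomes an index the server keeps up to date.

    import requests
    from pymongo import MongoClient

    db = MongoClient('mongodb-host').appdb          # hypothetical host/database

    # MongoDB: the query is itself JSON; counts/sums use the aggregation pipeline.
    shipped = db.orders.find({'status': 'shipped', 'total': {'$gt': 100}})
    counts = db.orders.aggregate([{'$group': {'_id': '$status', 'n': {'$sum': 1}}}])

    # CouchDB: a map function in a design document becomes a server-maintained,
    # automatically updated index (a "view").
    design = {'views': {'by_status': {
        'map': "function(doc) { if (doc.status) emit(doc.status, 1); }",
        'reduce': '_count'}}}
    requests.put('http://couch-host:5984/appdb/_design/orders', json=design)
    counts_by_status = requests.get(
        'http://couch-host:5984/appdb/_design/orders/_view/by_status',
        params={'group': 'true'}).json()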

Other than all this, CouchBase has master-master replication, which is unique and differentiating if that is a specific need for you (e.g. for disconnected-mode operation and sync); otherwise it's just noise.
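
For a flavor of what that looks like at the Apache CouchDB level (node and database names below are made up): replication is an HTTP POST to the _replicate endpoint, and running it in both directions between two nodes gives the master-master behavior.

    import requests

    # Bidirectional (master-master) replication between two CouchDB nodes;
    # 'continuous': True asks the server to keep streaming changes.
    for source, target in [('http://couch-a:5984', 'http://couch-b:5984'),
                           ('http://couch-b:5984', 'http://couch-a:5984')]:
        requests.post('http://couch-a:5984/_replicate',     # any node can coordinate
                      json={'source': source + '/appdb',    # hypothetical database
                            'target': target + '/appdb',
                            'continuous': True})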

As far as actually making a recommendation, I'd like to learn more about the app/domain to be able to tell.

You can provide that here or, if it's somewhat confidential, contact me via Twitter @nitin or nborwankar on the Google mail system, and we can take it from there.

Good Luck,

Nitin Borwankar

A few points:

  • HBase & Cassandra are columnar/table oriented
  • Mongo & Couch are document oriented
  • Some use cases fall clearly into one or the other; the rest are not that clear
  • A hierarchical, text-oriented use case (blogs, comments, even a product catalogue) is better suited to document-oriented DBs
  • Shopping carts and other table-oriented applications are more suited to columnar DBs
  • Of course, it is not cut and dried - Facebook uses HBase for its messaging system
  • If you are a Hadoop shop, HBase is better suited, as you already have HDFS et al in your infrastructure and you are working on HA/DR et al with those technologies
  • If you want a highly distributed data store, Cassandra is a good choice (e.g. Netflix)
  • I haven't seen highly distributed HBase clusters (Facebook uses a pod architecture), but I am sure it can be done
  • If you have complex queries, Mongo has a good use case (see the sketch after this list)
  • If you are new to NoSQL, Mongo is a good compromise; but as your applications mature and evolve, Mongo might not satisfy all use cases and it may be time to add Cassandra/HBase to your infrastructure
  • I have some notes and use cases in my OSCON talk, though it is a little dated
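
To give a feel for the "complex queries" point above, here is a hedged pymongo sketch (the host, collection, and field names are invented): you can filter on nested fields and arrays, project, sort, and limit in one call, which is awkward to express against a bare KV API.

    from pymongo import MongoClient

    posts = MongoClient('mongodb-host').blog.posts   # hypothetical host/collection

    # Filter on nested fields and arrays, project a subset, sort, and limit - one call.
    cursor = posts.find(
        {'author.country': 'US',                     # nested document field
         'tags': {'$in': ['nosql', 'hbase']},        # array membership
         'comments.5': {'$exists': True}},           # i.e. at least six comments
        {'title': 1, 'author.name': 1, '_id': 0},    # projection
    ).sort('published_at', -1).limit(10)

    for post in cursor:
        print(post['title'])
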
I am probably biased, but why are we just talking about these? The amount of data and how much I had to spend on the project would control which DB I used. If it was under 1 TB of structured data, a standard Postgres DB on a good box would do fine. If I had 5 TB of data, I would look into Greenplum single node on a large box with two 12-core Opterons and a nice striped RAID configuration. If money was no issue and I had petabytes of both structured and unstructured data, I would go for an MPP Greenplum HD array with SSDs and lots of RAM so I could do in-database analytics almost in real time. I have seen a 2-petabyte array doing ad hoc cubes in real time.

The only reason is that they were listed together as desirable knowledge for a data scientist position advertised on a job board.

Hi Brian,

IMHO, PG and GPM are two different use cases. PG is usable as an operational data store; GPM is a data warehouse, not meant to be used as a backend for a webapp.

Nitin Borwankar

I know that is how Greenplum is being sold, as a data warehouse, but with a scan rate of 246 GB/s per rack of structured data, many webapps and dashboards have been tied to it for near-real-time analytics. I know of many companies that are using it behind a webapp. But I understand this post was mostly about a job posting.

Hi Vincent, those are great NoSQL solutions and the guys have already made good comments about them here. I have been using Cassandra and Mongo in different projects. My 2 cents: maybe you should also consider the CAP theorem as part of your decision.

http://en.wikipedia.org/wiki/CAP_theorem

Hi Brian,

I hear what you're saying. It's not just data transfer rate but concurrency and read vs. read-write workloads. Webapps have high concurrency and many small writes intermingled with small-to-large reads. Greenplum is optimized for load-once, read-many workloads, which is very different. I think it may support multiple concurrent reads alongside a long-running query workload, but the many random writes generated by a webapp are not well handled by data warehouse architectures. I have used Greenplum and actually like it a lot, so this is not an anti-Greenplum bias - it applies to other similar products like Aster Data as well.

Cheers,

Nitin

It also depends on what is going to be looked at. A CEP (complex event processing) engine may be the answer in some scenarios.

The hybrid of HDFS/Columnar/Cubes and CEP could be much more interesting down the road.

http://blog.pluralsight.com/2012/03/23/meet-the-author-richard-sero...
