What are the differences? Which database system would you recommend, depending on the size and complexity of your data?
Permalink Reply by Mario A Vinasco on May 30, 2012 at 7:09am Vincent, the main criterion I would use is the purpose rather than size or complexity; each of these databases is good for a particular kind of data or access pattern: structured or unstructured, columnar, key-value, etc.
I recently attended an Infobright event and they had a good overview of these big data databases.
Mario
Permalink Reply by Vincent Granville on May 30, 2012 at 8:29am Thanks Mario, this is helpful.
Permalink Reply by Nitin Borwankar on May 30, 2012 at 12:20pm Hi Vincent,
A very simplistic overview here - you can connect offline if you want more detailed analysis. My creds here - I have been a MongoDB user/developer for over two years and CouchDB for over three. I was also VP of BizDev for CouchOne in 2010-11 before merger with Membase. So I spent a lot of time doing technical analysis of the offerings in the market. Recently I created a pilot HBase cluster for an advertising company. I do not have direct experience with Cassandra but had to evaluate it vs HBase for the advt project.
I've been in the SQL world since 1991 via Ingres, Sybase, Oracle.
Having said that, here goes.
HBase and Cassandra are for very large data sizes and high data rates - they require at least three servers in a cluster just to get going and are appropriate, in terms of overhead vs value, for data inflows of GBs/day or more.
(For development you can run in pseudo-distributed mode or single-server mode, but this doesn't give you much aside from the ability to play with the APIs on your laptop.)
Also, with HBase and Cassandra the learning curve is steep, the number of trained people is low, and the ecosystem of development tools and integrations with reporting tools is very thin - you have to build a lot of this stuff yourself. However, things are improving here as time goes by. Re: Cassandra - it was spawned as an Open Source project by Facebook, but even FB appears to be moving to HBase now. Older polls on the net showed Cassandra was dominant, but that has changed in the last 2 years or so.
HBase and Cassandra are both written in Java and require some non-trivial Java experience. They are both Apache projects.
Re: HBase and Cassandra - they both require a lot more effort in data modeling, as there is no inherent data format like JSON built in - essentially they are distributed KV stores which provide a bare API. CouchBase and MongoDB can also be thought of as distributed KV stores, but they provide a syntactic layer of JSON on top, so collections of KV pairs can be stored as what they both call a "document".
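To make the modeling difference concrete, here is a minimal sketch in plain Python. It does not use any HBase, Cassandra, MongoDB or CouchBase client; the user record, the `flatten` helper and the dotted column names are all made up for illustration, and a real column-family design would involve more decisions than this.

```python
import json

# Hypothetical user record, used only for illustration.
user_doc = {
    "_id": "user:42",
    "name": "Ada",
    "emails": ["ada@example.com", "ada@work.example"],
    "address": {"city": "London", "country": "UK"},
}


def flatten(doc, prefix=""):
    """Flatten a nested dict into {column: scalar} cells.

    Handles nested dicts and lists of scalars only. This is roughly the
    layout work you do by hand when the store gives you a bare KV API.
    """
    cells = {}
    for key, value in doc.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            cells.update(flatten(value, prefix=column + "."))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                cells[f"{column}.{i}"] = item
        else:
            cells[column] = value
    return cells


# Document-store style: one key, one JSON value, structure kept as-is.
document_store = {user_doc["_id"]: json.dumps(user_doc)}

# Bare KV style: row key -> {column: value}; the column naming is your design.
body = {k: v for k, v in user_doc.items() if k != "_id"}
kv_store = {user_doc["_id"]: flatten(body)}

print(sorted(kv_store["user:42"]))
```

In the document style the nesting travels with the data; in the bare KV style, conventions such as `address.city` or `emails.0` exist only in your application code, which is where the extra modeling effort goes.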
Since HBase is built on top of HDFS (Hadoop Distributed File System), integrating with Hadoop for long running "batch-mode" operations is easier.
MongoDB and CouchBase are much more developer friendly. MongoDB is much more so, being well integrated into the "LAMP" stack and having a much broader developer ecosystem, especially amongst the Ruby on Rails community.
MongoDB can be used via a SaaS offering from MongoLab, CouchDB via one from IrisCouch.
CouchBase is based on the Open Source Apache project CouchDB and adds enterprise-y features and infrastructure for monitoring etc. to the Apache project. Also, CouchOne (the company) merged with Membase (the company) in early 2011, and CouchBase, the result, has a caching layer in front that allows millisecond responses for webapps such as social games.
CouchBase is also developing tools for mobile apps that use CouchBase as a back-end. The recent acquisition of OMGPOP, a mobile game developer, by Zynga threw light on the fact that the backend can be deployed on a single server at the low end and then expanded to clusters as your data grows. There are non-trivial transition-related issues when going from one to many servers in the very initial stage for MongoDB. However, the latest version of CouchBase claims that servers can be transparently added and clusters expanded horizontally with almost zero overhead. This was supposed to be the reason OMGPOP was able to keep up with massive growth in its user base in a very short time.
Aside from all the above both MongoDB and CouchBase use JSON as an on-the-wire format (Requests and Responses are JSON), and also for querying.
CouchBase has a REST API; MongoDB has a console-based query client like the MySQL client or the PostgreSQL client, but queries are formulated in a JSON syntax. Both have Map Reduce built in, but they differ in their implementation and application. Mongo uses it for sums/counts etc. CouchBase uses it to create automatically updated indexes.
Other than all this, CouchBase has master-master replication, which is fairly unusual and differentiating if that is a specific need for you (e.g. for disconnected mode operation and sync); otherwise it's just noise.
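For anyone who hasn't seen the map/reduce style applied to sums and counts, here is a minimal sketch in plain Python. It is not the MongoDB or CouchBase API (both take JavaScript map and reduce functions); the documents and the names `map_fn`, `reduce_fn` and `map_reduce` are hypothetical and only show the shape of the computation.

```python
from collections import defaultdict

# Hypothetical JSON-like documents, e.g. scores from a social game.
docs = [
    {"player": "ann", "score": 10},
    {"player": "bob", "score": 7},
    {"player": "ann", "score": 5},
]


def map_fn(doc):
    """Emit (key, value) pairs for a single document."""
    yield doc["player"], doc["score"]


def reduce_fn(key, values):
    """Combine every value emitted for one key into a summary."""
    return {"count": len(values), "sum": sum(values)}


def map_reduce(documents, mapper, reducer):
    """Group the mapper's output by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


totals = map_reduce(docs, map_fn, reduce_fn)
print(totals)
# {'ann': {'count': 2, 'sum': 15}, 'bob': {'count': 1, 'sum': 7}}
```

Run once over the whole collection, this is the sums/counts use case. If you instead keep the grouped output around and update it as documents change, you get something closer to the automatically updated index use case.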
As far as actually making a recommendation I'd like to learn more about the app/domain to be able to tell.
You can provide that here or if it's somewhat confidential contact me via Twitter @nitin or nborwankar on the google mail system and we can take it from there.
Good Luck,
Nitin Borwankar
A few points:
Permalink Reply by Brian France on May 30, 2012 at 12:40pm
Permalink Reply by Vincent Granville on May 30, 2012 at 1:56pm The only reason is that they were listed together as desirable knowledge for a data scientist position advertised on a job board.
Brian France said:
I am probably biased, but why are we just talking about these? The amount of data and how much I had to spend on the project would control which DB I used. If it was under 1 TB of structured data, a standard Postgres DB on a good box would do fine. If I had 5 TB of data, I would look into a single-node Greenplum on a large box with two 12-core Opterons and a nice striped RAID configuration. If money was no issue and I had petabytes of both structured and unstructured data, I would go for an MPP Greenplum HD array with SSDs and lots of RAM so I could do in-database analytics in almost real time. I have seen a 2 petabyte array doing ad hoc cubes in real time.
Permalink Reply by Nitin Borwankar on May 30, 2012 at 2:03pm Hi Brian,
IMHO, PG and GPM are two different use cases. PG is usable as an operational data store; GPM is a data warehouse, not meant to be used as a backend for a webapp.
Nitin Borwankar
Permalink Reply by Brian France on May 30, 2012 at 2:15pm I know that is how Greenplum is being sold, as a data warehouse, but with a scan rate of 246gb/s per rack of structured data, many webapps and dashboards have been tied to it for near real time analytics. I know of many companies that are using it behind a webapp use case. But I understand this post was mostly about a job posting.
Permalink Reply by Marcelo Mayworm on May 30, 2012 at 5:56pm Hi Vincent, those are great NoSQL solutions and the guys already made good comments about them here .. I have been using Cassandra and Mongo in different projects... my 2 cents: Maybe you should consider the CAP theorem as part of your decision.
Permalink Reply by Nitin Borwankar on May 30, 2012 at 7:44pm Hi Brian,
I hear what you're saying. It's not just data transfer rate but also concurrency and read vs read-write workloads. Webapps have high concurrency and many small writes intermingled with small-to-large reads. Greenplum is optimized for load-once, read-many workloads, which is very different. I think it might support multiple concurrent reads with a long-running query workload, but the many random writes generated by a webapp are not well handled by data warehouse architectures. I have used Greenplum and actually like it a lot, so this is not an anti-Greenplum bias - it applies to other similar products like Aster Data as well.
Cheers,
Nitin
Check this out... http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
It also depends on what is going to be looked at. In some scenarios a CEP (complex event processing) engine may be the answer.
A hybrid of HDFS/columnar/cubes and CEP could be much more interesting down the road.
http://blog.pluralsight.com/2012/03/23/meet-the-author-richard-sero...
© 2013 Data Science Central
