Summary: Unless you have special needs, Document Oriented DBs are your most likely default choice.
Second in popularity in the business world, behind Key-Value Stores, are Document Oriented Databases. Here an entire document is treated as a record. While these can accommodate completely unstructured text, they excel at semi-structured text, that is, text that has been encoded according to a known structure such as XML, YAML, JSON, PDF, email, or even MS Office formats.
The hidden strength of DODBs is that they are a collection of key-value collections. That is, within a bucket similar to those in key-value stores, there is an additional level of key-value indexing that allows much more efficient queries. If you have several big data projects and none calls out for a specialty database type like graph, the DODB is likely to be your go-to default.
Stored elements are called documents. The data model is a collection of documents, and each document is a collection of key-value pairs, allowing indexing within the bucket.
On the simplest level, thanks to the built-in structures of semi-structured document types like XML, JSON, or even common email formats or MS Word docs (documents with tagged elements), the secondary index within the bucket can be easily inferred. If you have copied a number of invoices into a bucket, the tagged elements make it easy to know that such-and-such a line is the address, another is the amount due, and so forth. In this mode DODBs are great for raw document searches such as patent search, litigation support, legal precedent search, search of scientific papers and experimental data, email compliance searches, or simply retrieving knowledge on a particular topic hidden among a forest of internally or externally prepared reports and documents.
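As a minimal sketch of this idea (the field names and invoice data are invented for illustration), tagged JSON documents let a secondary index be inferred directly from the fields, with no formal schema:

```python
import json

# Hypothetical invoices copied into a bucket as semi-structured JSON.
raw_invoices = [
    '{"invoice_id": "INV-001", "address": "12 Elm St", "amount_due": 250.00}',
    '{"invoice_id": "INV-002", "address": "9 Oak Ave", "amount_due": 1200.50}',
]

# Because the documents are tagged, any field can serve as a
# secondary index without declaring a schema up front.
bucket = {}
amount_index = {}
for raw in raw_invoices:
    doc = json.loads(raw)
    bucket[doc["invoice_id"]] = doc          # primary key -> document
    amount_index[doc["invoice_id"]] = doc["amount_due"]

# Retrieve one tagged element of a specific invoice directly by key.
print(bucket["INV-002"]["address"])  # prints 9 Oak Ave
```

A real DODB builds and maintains these indexes for you; the sketch only shows why the tagged structure makes that inference straightforward.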
However, the secondary index formed by the key values within the bucket is much more powerful than this. It easily accommodates adding data sources to an existing logical grouping without the need to change a formal schema. It is a great tool for combining data from many different incompatible database sources, addressing the 'Variety' aspect of big data. And the additional level of indexing makes partial record updates efficient, so DODBs do well at OLTP (online transaction processing) applications.
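A toy sketch of this schema-less upgrade path (the bucket contents and field names are invented; real DODBs such as MongoDB expose comparable partial-update operations server-side): one document can absorb a field from a new data source without touching any other document or any schema.

```python
# In-memory stand-in for a document bucket.
bucket = {
    "cust-1": {"name": "Acme Corp", "balance": 500},
    "cust-2": {"name": "Globex", "balance": 1200},
}

def partial_update(bucket, key, fields):
    """Merge new fields into one document; nothing else changes."""
    bucket[key].update(fields)

# Fold in data from a new, previously incompatible source.
partial_update(bucket, "cust-1", {"twitter_handle": "@acme"})

print(sorted(bucket["cust-1"]))  # ['balance', 'name', 'twitter_handle']
# cust-2 is untouched and simply lacks the new field -- no schema
# migration, no NULL columns, no downtime.
```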
Advantages
- Highly fault tolerant – always available.
- Schema-less design offers an easier upgrade path for changing data requirements, especially with key-value indexing within the data bucket.
- Efficient at retrieving information about a particular object with a minimum of disk operations. For example, returning a contact record in a Rolodex application.
- No requirement for SQL queries, indexes, triggers, stored procedures, temporary tables, forms, views, or the other technical overheads of RDBMS. DBA administration is minimized.
- Easy integration of diverse data sources.
- Easy upgrade path when a logical data schema is fluid or may change frequently.
- Fast, easy scalability.
- Provides good performance on very large databases. Low latency.
- Very flexible. Operate over a wide variety of access patterns and data types.
- Good support for complex queries.
- Good at lots of small continuous reads and writes (OLTP).
- Most efficient when many data items in a single row are required at the same time and when the row size is relatively small since the entire row can be read in a single seek.
- Most efficient when writing a new row if all the row data is provided at the same time so the entire row can be written with a single seek.
- Capable of handling stored procedures, moving behavior closer to the data so the data doesn't need to first be moved over a network.
Limitations
- Row-based systems (Key-Value and Document Oriented DBs) are not efficient at performing operations that apply to the entire data set as opposed to specific records. For example, a search for all employees earning between $40K and $60K would require reading every row in the data set, causing extensive disk operations and slowing response.
- Query model limited to keys and indexes.
- Unsuited for interconnected data (such as graph data).
- Some vendors rely on MapReduce for larger queries which typically run in batch mode. Simpler queries may run in near real time.
- Eventually consistent data model (see Lesson 2), though some vendors are challenging this limitation.
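The salary-range limitation above can be shown in a few lines (the employee data is invented): a key lookup touches one document, but a range query over a non-indexed field forces a scan of every document, which is what makes it expensive on disk.

```python
# Hypothetical employee documents keyed by ID.
employees = {
    "e1": {"name": "Ann", "salary": 45000},
    "e2": {"name": "Bob", "salary": 75000},
    "e3": {"name": "Cy",  "salary": 52000},
}

# Key lookup: one seek, touches a single document.
assert employees["e2"]["name"] == "Bob"

# Range query on a field with no secondary index: every
# document in the bucket must be read and tested.
matches = [d["name"] for d in employees.values()
           if 40000 <= d["salary"] <= 60000]
print(sorted(matches))  # ['Ann', 'Cy']
```

Defining a secondary index on salary (or handing the work to a MapReduce job, as some vendors do) is the usual workaround.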
Particular Opportunities and Project Characteristics
- Use for accumulating, and occasionally changing, data on which pre-defined queries are to be run, and places where versioning is important. For example, CRM and CMS systems.
- Offers maximum flexibility in modifying or adding data sources that would require time-consuming changes to the data schema in an RDBMS.
- Very effective at marrying diverse data sources into a single repository.
- Many vendors support multi-site deployments with master-master replication.
- Use for dynamic queries.
- Use where there is a preference for defined indexes in place of complex MapReduce functions.
- Use for most things you would do with an RDBMS like MySQL when predefined columns are not desired.
- Use where low latency is required, like ad targeting, or for highly concurrent web apps like online gaming.
- Excellent for building SCRUD applications (search, create, read/retrieve, update/modify, and delete).
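As a toy illustration of the SCRUD pattern (the class and method names are invented; a real DODB provides these as server operations, and search would normally hit an index rather than scan):

```python
class DocumentBucket:
    """Minimal in-memory document store sketching SCRUD."""
    def __init__(self):
        self._docs = {}
    def create(self, key, doc):
        self._docs[key] = dict(doc)
    def read(self, key):
        return self._docs.get(key)
    def update(self, key, fields):          # partial (modify) update
        self._docs[key].update(fields)
    def delete(self, key):
        self._docs.pop(key, None)
    def search(self, field, value):         # scan-based search
        return [k for k, d in self._docs.items() if d.get(field) == value]

b = DocumentBucket()
b.create("c1", {"name": "Ann", "city": "Oslo"})   # create
b.update("c1", {"city": "Bergen"})                # update/modify
assert b.read("c1")["city"] == "Bergen"           # read/retrieve
assert b.search("city", "Bergen") == ["c1"]       # search
b.delete("c1")                                    # delete
assert b.read("c1") is None
```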
- Where both Key Value and Document Databases are indicated, maintain a bias toward a Document Database.
Representative Vendors (not a recommendation): MongoDB, Couchbase, RavenDB, MarkLogic Server, and many others.
July 23, 2014
Bill Vorhies, President & Chief Data Scientist – Data-Magnum - © 2014, all rights reserved.
About the author: Bill Vorhies is President & Chief Data Scientist of Data-Magnum and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at:
This original blog can be viewed at:
All nine lessons can be downloaded as a White Paper at: