A Quick Guide on How to Prevail in the Graph Database Arena

Introduction

There are endless discussions in the database arena about which DBMS is best suited for operational or data warehousing analytics, which one is the most efficient for online transaction processing, and which one is suitable for semantic integration. Graph databases have recently been growing in popularity, especially in the enterprise space, and that adds to the headache both for vendors that try to differentiate themselves from the competition and for clients that are uncertain how to embrace this database technology.

Definition of Graph Databases

Recently Bloor published a report on Graph and RDF Databases. The author, Philip Howard, claims that “the difference between a true graph product and a triple store is that the former supports index free adjacency (which means you can traverse a graph without needing an index) and the latter doesn’t”. By contrast, Claudius Weinberger, CEO of ArangoDB, argues that this is not a fundamental criterion for what counts as a graph database. In a post titled “Index Free Adjacency or Hybrid Indexes for Graph Databases” he proposes that the definition of a graph database remain


a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data independent of the way the data is stored internally.

Claudius Weinberger

Indeed, in the same Bloor report a distinction between native and non-native graph databases is made based on their engine. 
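
The distinction between the two kinds of engine can be illustrated with a minimal Python sketch. It is not taken from the Bloor report or from any product: the `Node` class, the two traversal functions and the sample edges are hypothetical names chosen for illustration. The first traversal follows direct object references, which is the idea behind index-free adjacency; the second answers every hop by consulting an index built over a flat edge list.

```python
from collections import defaultdict


class Node:
    """A vertex that holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.out = []  # direct object references, so no index lookup is needed

    def link(self, other):
        self.out.append(other)


def reachable_native(start):
    """Traverse by following object references (index-free adjacency)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node.name in seen:
            continue
        seen.add(node.name)
        stack.extend(node.out)
    return seen


def reachable_indexed(start, edges):
    """Traverse a flat edge list, consulting an index at every hop."""
    index = defaultdict(list)  # source name -> list of target names
    for src, dst in edges:
        index[src].append(dst)
    seen, stack = set(), [start]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(index[name])
    return seen


# Hypothetical sample data: the same graph stored in both ways.
edges = [("a", "b"), ("b", "c"), ("a", "d")]
nodes = {name: Node(name) for pair in edges for name in pair}
for src, dst in edges:
    nodes[src].link(nodes[dst])

print(reachable_native(nodes["a"]) == reachable_indexed("a", edges))
```

Both functions return the same set of vertices, which is the point Weinberger makes: the two engines differ in how a hop is resolved, not in what the graph means.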

In my opinion, a definition that avoids any reference to the semantics of nodes and edges or to their internal structure is preferable. A definition that fails to follow this guideline unavoidably favors specific implementations, e.g. Property Graph Databases or Triple Stores, and can easily make you myopic to other types that are based on different models, e.g. hypergraph databases, or on different data storage paradigms, e.g. key-value stores. Therefore, I propose that we adopt a vendor-neutral definition, such as the following one, which does not exclude any future type of graph database.

A Graph Database is a database that uses a graph topology, i.e. vertices and edges, to manage information at the conceptual level independent of the logical and physical implementation of the graph data structure.

Athanassios I. Hatzis, 28th February 2017

 

Many-to-many Relationships

In another recently published Spotlight paper by Bloor, “All about graphs: a primer”, the author discusses the Graph data model and highlights the representational differences of a many-to-many relationship including those of bipartite, hypergraph and associative graphs. He observes that

unlike other new database approaches, graphs cannot easily be subsumed by the leading relational database vendors because the architectural constraints of graphs do not fit easily within the relational paradigm.

Philip Howard

He mentions that the two main variants on entity relationships are labeled property graphs and subject-predicate-object triples. In practice, although the idea of relationships (associations) between entities is at the heart of Peter Chen’s Entity-Relationship model (Fig. 2 and Fig. 3), there are subtle differences in how the various graph databases implement it. A. Hatzis, in a series of hands-on posts on associative data modeling, attempts to cut through the information glut on this topic with a thorough examination of graph data models.
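
To make these representational differences concrete, here is a minimal Python sketch of one many-to-many relationship (students enrolled in courses, each enrolment carrying a grade) written in three ways. The data and all names are hypothetical, and the triple encoding shown, which turns each enrolment into a node of its own, is only one common way to attach a property to a relationship in a triple model.

```python
# Labeled property graph: each edge is a record with a label and its own properties.
property_edges = [
    {"from": "alice", "label": "ENROLLED_IN", "to": "math", "grade": "A"},
    {"from": "alice", "label": "ENROLLED_IN", "to": "physics", "grade": "B"},
    {"from": "bob", "label": "ENROLLED_IN", "to": "math", "grade": "C"},
]

# Subject-predicate-object triples: the enrolment becomes a node so that it can
# carry the grade (assumed encoding, for illustration only).
triples = []
for i, edge in enumerate(property_edges):
    enrolment = f"enrolment{i}"
    triples += [
        (enrolment, "student", edge["from"]),
        (enrolment, "course", edge["to"]),
        (enrolment, "grade", edge["grade"]),
    ]

# Hypergraph: one hyperedge connects all participants of the relationship at once.
hyperedges = {
    f"enrolment{i}": {"student": e["from"], "course": e["to"], "grade": e["grade"]}
    for i, e in enumerate(property_edges)
}


def students_property_graph(course):
    return sorted(e["from"] for e in property_edges if e["to"] == course)


def students_triples(course):
    enrolments = {s for s, p, o in triples if p == "course" and o == course}
    return sorted(o for s, p, o in triples if p == "student" and s in enrolments)


def students_hypergraph(course):
    return sorted(h["student"] for h in hyperedges.values() if h["course"] == course)


print(students_property_graph("math"))
print(students_triples("math"))
print(students_hypergraph("math"))
```

All three functions answer the same question, but the triple version needs two passes over the data because the relationship has been split into several statements.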

Multi-model Database Engine

The graph engine and the type of data model are critical factors for any graph database. It is therefore not surprising that many vendors have started marketing their DBMS as multi-model. We have long and extensive experience with two such products, OrientDB and InterSystems Caché. The former supports Graph, Document, Key/Value, and Object models; the latter is an object database with relational access, integrated support for JSON documents, and a multidimensional key-value storage mechanism that can easily be extended to cover the Graph data model. Generally speaking, we have reasons to believe that multi-model DBMS will dominate the database market. Currently OrientDB has become a leading player among graph databases, and InterSystems Caché is one of the best operational DBMS according to the Gartner Magic Quadrant report.

Physical versus Logical Perspective

Not only is a multi-model database flexible in its logical schema, but it also has a unified storage architecture. Although the developer should hardly ever need access to the physical implementation details of the storage engine, an API for direct use of the engine is desirable and beneficial for many reasons. Most importantly, this kind of architecture allows someone to build a customized database management system. In theory, the ANSI/SPARC three-level architecture (external, conceptual/logical and physical) is an effort to allow these three perspectives to be relatively independent of each other, but in practice the front-end of a DBMS is most often strongly dependent on the back-end storage data model.

A loose coupling can be achieved with associative/multidimensional arrays. Whatever their physical implementation, i.e. hash tables or trees, this abstract data type lets you model all four NoSQL database types (Key/Value, Tabular/Columnar, Document, Graph). We are of the opinion that associative/multidimensional arrays will eventually prevail in the world of databases. There is already strong competition for their best physical implementation, and sparse, column-family stores have proven to be very popular (HBase, Hypertable, BigTable, InterSystems Caché).
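
As a rough illustration of this claim, the following Python sketch uses nested dictionaries as a stand-in for a sparse multidimensional associative array and stores one example of each of the four NoSQL shapes in it. The `tree` helper, the subscript layout and the sample data are assumptions made for this sketch; they do not reproduce the storage format of any of the products named above.

```python
from collections import defaultdict


def tree():
    """Sparse multidimensional associative array: subscripts are created on demand."""
    return defaultdict(tree)


store = tree()

# Key/Value: a single subscript maps to a value.
store["kv"]["session:42"] = "alice"

# Tabular/Columnar: row key, column family, column.
store["table"]["row1"]["info"]["name"] = "Alice"
store["table"]["row1"]["info"]["city"] = "Athens"

# Document: nested subscripts mirror the nesting of the document.
store["doc"]["person1"]["name"] = "Alice"
store["doc"]["person1"]["address"]["city"] = "Athens"

# Graph: source vertex, edge label, target vertex; the value holds edge properties.
store["graph"]["alice"]["knows"]["bob"] = {"since": 2015}
store["graph"]["alice"]["knows"]["carol"] = {"since": 2016}


def neighbours(vertex, label):
    """Return the targets of all edges with the given label that leave the vertex."""
    return sorted(store["graph"][vertex][label])


print(neighbours("alice", "knows"))
print(dict(store["table"]["row1"]["info"]))
```

The application code above only sees subscripts; whether they are backed by a hash table, a B-tree or something else is hidden, which is the loose coupling the paragraph argues for.
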

There are other properties that are crucial for operational database management systems, such as ACID transactions, a distributed data architecture, and scalability. Whether we are talking about multi-model or single-model graph databases, there is a tendency to use them for online transaction processing, so these properties are worth having. And again, in terms of architectural design, there is always the problem of how to achieve a loose coupling between the physical structures of a database and the application logic.

Conceptual Framework

That brings us to the question of what kind of logical/conceptual data model architecture to use. Our R3DM/S3DM framework is based on the powerful theory of the semiotic triangle. We use numerical vectors (signs) to encode the abstract things in our mind (signified) to which the sign refers, e.g. Person, name, Car, model. We associate these with data containers, the forms that the sign takes for the storage of data values (signifier), i.e. primitive data types (see also Signified and Signifier). This trilateral principle of our framework permits a uniform treatment of the semantics, syntax and storage of information based on a symbolic representation. In this way we define a fundamental, atomic information resource unit (AIR). Those units, in turn, can easily be shaped to form any tabular, hierarchical, or graph data structure in a unified way. For example, study this R3DM hypergraph representation of the QlikView associative model. Data granularity can also be closely related to the definition of a fundamental unit of processing.

With this single primitive construct (AIR) as a building block, we have implemented seven type systems for the upper-level management of any DBMS. These are:

   SYSTEM               SHORTNAME
1. SYS_Dataset          DSS
2. SYS_DomainModel      DMS
3. SYS_EntityType       ETS
4. SYS_AttributeType    ATS
5. SYS_ValueType        VTS
6. SYS_LinkType         LTS
7. SYS_Database         DBS

We characterize Datasets, Domain Models (schemas), Entities, Attributes, etc. as information resources; values are information realizations; and the AIR units that represent everything are called information representations, or simply references. The current implementation phase has been completed on top of OrientDB, and a forthcoming article will present the R3DM/S3DM architecture in detail. In the past, the Freebase collaborative knowledge graph also had a type system that was built on primitive constructs.
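
Since the detailed architecture is left for a forthcoming article, the following Python sketch is only a guess at how such a unit might look, not the R3DM/S3DM implementation. The seven system names and short names come from the list above; the `AIR` class, its fields and the layout of the numerical key are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TypeSystem(Enum):
    """The seven type systems listed above, keyed by their short names."""
    DSS = "SYS_Dataset"
    DMS = "SYS_DomainModel"
    ETS = "SYS_EntityType"
    ATS = "SYS_AttributeType"
    VTS = "SYS_ValueType"
    LTS = "SYS_LinkType"
    DBS = "SYS_Database"


@dataclass(frozen=True)
class AIR:
    """Hypothetical atomic information resource unit (field layout is assumed)."""
    sign: tuple                # numerical vector that acts as the reference
    signified: str             # the abstract thing referred to, e.g. "Person"
    signifier: Optional[type]  # container for data values, e.g. a primitive type
    system: TypeSystem         # the type system the unit belongs to


# Assumed key layout: (entity number,) for entities and
# (entity number, attribute number) for their attributes.
units = [
    AIR(sign=(1,), signified="Person", signifier=None, system=TypeSystem.ETS),
    AIR(sign=(1, 1), signified="name", signifier=str, system=TypeSystem.ATS),
    AIR(sign=(2,), signified="Car", signifier=None, system=TypeSystem.ETS),
    AIR(sign=(2, 1), signified="model", signifier=str, system=TypeSystem.ATS),
]

# Every unit is looked up by its numerical reference alone.
registry = {unit.sign: unit for unit in units}

print(registry[(1, 1)].signified, registry[(1, 1)].system.value)
```

In this sketch the numerical key is the only thing other units would need to hold in order to refer to "Person" or "name", which is one way to read the term "references" used above.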


Query Language

Yet another decisive criterion in databases is the query language. For the RDF directed, labeled graph data format and the RDF stores built on it, e.g. OpenLink Virtuoso, AllegroGraph and Ontotext GraphDB, the SPARQL query language is the standard way to retrieve data. By contrast, the query languages of property graph databases vary a lot. There are SQL-like query languages such as those of OrientDB and ArangoDB, Neo4j uses its own declarative graph query language, Cypher, and there is also the open-source graph traversal language Gremlin.
Another approach is that of GraphQL, which is similar to the Freebase MQL query language. Queries are shaped in a hierarchical, JSON-like format, with patterns that follow the schema of the graph database.
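
The idea of a query that is itself a hierarchical pattern can be sketched in a few lines of Python. This is a toy query-by-example matcher written for this article, not an implementation of GraphQL or MQL; the `records` data, the `match` and `query` functions and the convention that a null value means "fill this in" are assumptions of the sketch.

```python
import json

# A tiny graph keyed by record id; "knows" holds the ids of linked records.
records = {
    "p1": {"type": "Person", "name": "Alice", "knows": ["p2", "p3"]},
    "p2": {"type": "Person", "name": "Bob", "knows": []},
    "p3": {"type": "Person", "name": "Carol", "knows": ["p2"]},
}


def match(record, pattern):
    """Return the pattern filled in from the record, or None if it does not match."""
    result = {}
    for key, wanted in pattern.items():
        value = record.get(key)
        if wanted is None:              # blank slot: fill in the stored value
            result[key] = value
        elif isinstance(wanted, list):  # nested pattern: follow the links
            linked = (match(records[i], wanted[0]) for i in value or [])
            result[key] = [m for m in linked if m is not None]
        elif value != wanted:           # constraint not satisfied
            return None
        else:
            result[key] = value
    return result


def query(pattern):
    """Apply the pattern to every record and keep the ones that match."""
    return [m for r in records.values() if (m := match(r, pattern)) is not None]


pattern = json.loads('{"type": "Person", "name": "Alice", "knows": [{"name": null}]}')
print(json.dumps(query(pattern)))
```

The answer has the same shape as the question, with the blank slots filled in, which is what makes this style of query easy to read against the schema.
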
We have developed a functional RESTful API that can serve as a prototype for a uniform, universal data language. Commands and their parameters can become simpler and more efficient if we take into account the hierarchical relationship of the Server, Database, Class, Property and Record containers. There are five sets of commands, for getting, updating, deleting, adding and linking information. The current implementation is built with the Wolfram Language, and we will give more details in a forthcoming article in which we analyze the R3DM/S3DM architecture.
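
The actual API is written in the Wolfram Language and has not been published here, so the following Python sketch only illustrates the general idea of commands that inherit their containers from a hierarchical context. The URL layout, the mapping of the five command sets to HTTP methods, the `/links` suffix and the `request_for` function are all hypothetical.

```python
# Container hierarchy as listed in the text, from outermost to innermost.
LEVELS = ("server", "database", "class", "property", "record")

# Assumed mapping of the five command sets to HTTP methods.
VERBS = {"get": "GET", "add": "POST", "update": "PUT", "delete": "DELETE", "link": "POST"}


def request_for(command, context, **target):
    """Build (HTTP method, path), inheriting unspecified containers from the context."""
    if command not in VERBS:
        raise ValueError(f"unknown command: {command}")
    merged = {**context, **target}
    unknown = set(merged) - set(LEVELS)
    if unknown:
        raise ValueError(f"unknown containers: {sorted(unknown)}")
    # Emit the containers in hierarchical order, skipping the ones not given.
    path = "".join(f"/{level}/{merged[level]}" for level in LEVELS if level in merged)
    if command == "link":
        path += "/links"  # hypothetical sub-resource for link commands
    return VERBS[command], path


# Once the context is set, each command only names what differs from it.
context = {"server": "local", "database": "demo", "class": "Person"}
print(request_for("get", context, record=7))
print(request_for("delete", context, record=7))
```

Because the server, database and class are taken from the context, each command carries a single parameter, which is the kind of simplification the paragraph describes.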

Business Analytics

Last but not least, there is an emerging need for databases that can function as both analytic and operational systems. In particular, the modern data warehouse should unify all of a client’s transactional databases as well as integrate other external data sources that enable data cleansing, validation and enhancement. Moreover, for quick and smart business analytics the interface should be both user-friendly and functionally powerful. We are aware of one such player in this market segment, with a technology that possesses features similar to those of our R3DM/S3DM framework.

Epilogue

Make no mistake, relational databases are the past of computer database technology. Graph databases are the present and the future. This quick review of what we consider important criteria for graph database technology products might leave the reader with more perplexity than satisfaction. This is our perspective; we wanted to share some of our knowledge with experts and chief technology officers in this field so that we could discuss the matter in more detail with them. The future will show on how many of these discussion topics we were right.

Comment

Comment by Athanassios Hatzis on March 20, 2017 at 10:27pm

Jon, thank you for your nice comments. You are right, RDBMS is well established, and judging from the decade-long history of the RDF/OWL standard, so full of promises, and of the triple-store DBMS based on it, I find a lot of truth in your arguments! I am not convinced that the RDF standard and SPARQL are the BEST standard for graph databases, not to mention that there IS NOT a standard model or query language for property graph databases. And at the DB engine level, i.e. indexing, I think we can manage more efficient solutions that are based on SINGLE INSTANCE stores and numerical vector keys for object reference.

That said, although in theory the model should be completely independent of the implementation, it does influence the architecture and operations of the DBMS. Regarding your one-size-fits-all argument, again you are right; from my point of view, in the future the bet with DBMS will be on how well you integrate and how fast you can drive ON THE FLY analytics from OLTP systems. There is one system in my mind that has proven to be most successful in the business market. Qlikview partially fits my description, but it is an in-memory, read-only data store for OLAP. Isn't it possible to do the same for OLTP systems?

I would love to see your comments in my last article on the associative modeling demystified series where I describe R3DM/S3DM implemented in OrientDB.

Comment by Jon Reade on March 20, 2017 at 7:04am

An informative article, thank you. But that's quite a bold claim in the Epilogue. Much as I appreciate the usefulness of graph databases, and their ability to process relationships between business entities, and the business need to do this, there are a number of obstacles. First, there is no standard (or even close to standard) query language for graph databases. This is a major factor in the adoption of any new platform in the business world, as lack of standardised, widely understood, transferable skills will keep recruitment costs high enough to exclude a platform as a replacement solution, as established businesses (i.e. the ones with lots of money to throw around) are mostly conservative beasts. Then there's the other problem of cost: Why rip out something that is widely used, well understood and with widespread skills, and replace it at massive cost with something that does not share those attributes? Business is fundamentally averse to this kind of unnecessary cost, even if developers love the "shiny" aspect of a new toolset. Finally, there's the "horses for courses" argument: Graph databases are wonderful at what they do, just as relational databases are wonderful at what they do. Try to make them bend to tasks they were not designed for, and performance suffers and development and maintenance times, and costs, stretch to the horizon and beyond.

Don't get me wrong, I'm not anti-graph database at all. Graph databases are great for certain tasks - LinkedIn, Facebook etc could not operate without them. I personally love Neo4J. However, I'm not convinced by the argument that they're a panacea for all problems, and unfortunately, history tells us that the wild evangelism of "silver bullet solutions" usually turns out to be anything but. It's a well trodden path.

I am sure graph databases will continue to find their place and carve out a very special niche, just in the same way relational databases have for transaction processing applications, which form the hub of most businesses. Why? Because businesses fundamentally, intrinsically deal with transactions. But for balance, I'm not convinced that relational will be displaced any time soon in most businesses, nor that graph databases will be that bullet that no one's looking for.

But one thing I think we can agree upon is that we're living in very interesting times and technologies like this offer solutions and performance to some problems that are impractical using other database technologies. I'm looking forward to seeing graph databases mature in capability, and using them more myself, they're great tools for the right problem.

© 2017 Data Science Central