A Quick Guide on How to Prevail in the Graph Database Arena

Introduction

There are endless discussions on the databases arena about which DBMS is best suited for operational or data warehousing analytics, which one is the most efficient for online transaction processing, or which one is suitable for semantic integration. Recently graph databases are growing in popularity, especially in the enterprise space, and perhaps that adds more headache on those vendors that try to differentiate from competition and on those clients that are completely uncertain how to embrace this database technology.

Definition of Graph Databases

Recently Bloor published a report about Graph and RDF Databases. The author, Philip Howard, claims that “the difference between a true graph product and a triple store is that the former supports index free adjacency (which means you can traverse a graph without needing an index) and the latter doesn’t”. On the contrary Weinberger, CEO of ArrangoDB, argues that this is not a fundamental criterion on what is a graph database. In a post titled “Index Free Adjacency or Hybrid Indexes for Graph Databases” he proposes that the definition of graph database remains

a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data independent of the way the data is stored internally.

Claudius Weinberger

Indeed, in the same Bloor report a distinction between native and non-native graph databases is made based on their engine.

In my opinion, any definition that avoids any reference to the semantics of nodes and edges or their internal structure is preferable. Failing to follow this guideline, it is unavoidable to favor specific implementations, e.g. Property Graph Databases or Triple Stores, and you may easily become myopic to other types that are based on different models, e.g. hypergraph databases, or different data storage paradigms, e.g. key-value stores. Therefore, I propose we adopt a vendor neutral definition, such as the following one, which cannot exclude any future type of graph database.

A Graph Database is a database that uses a graph topology, i.e. vertices and edges, to manage information at the conceptual level independent of the logical and physical implementation of the graph data structure.

Athanassios I. Hatzis, 28th February 2017

Many-to-many Relationships

In another recently published Spotlight paper by Bloor, “All about graphs: a primer”, the author discusses the Graph data model and highlights the representational differences of a many-to-many relationship including those of bipartite, hypergraph and associative graphs. He observes that

unlike other new database approaches, graphs cannot easily be subsumed by the leading relational database vendors because the architectural constraints of graphs do not fit easily within the relational paradigm.

Philip Howard

He mentions that the two main variants on entity relationships are labeled property graphs and subject-predicate-object triples. In practice, although the idea of relationships (associations) between entities is at the heart of Peter Chen’s Entity-Relationship model, Fig.2 and Fig.3, there are subtle dissimilarities in its implementation on various graph databases. A. Hatzis, in a series of posts on associative data modeling, that is written with a hands-on practice style, attempts to clear the information glut of this topic with a thorough examination of graph data models.

Multi-model Database Engine

The graph engine and the type of data model are critical factors for any graph database. Therefore it is not strange that many vendors have started marketing their DBMS as a multi-model. We have extensive and long experience with two such products, OrientDB and Intersystems Cache. The former supports Graph, Document, Key/Value, and Object models, the latter is an object database with relational access, integrated support for JSON documents and a multidimensional key-value storage mechanism that can be easily extended to cover Graph data model. Generally speaking, we have reasons to believe that multi-model DBMS will dominate the database market. Currently OrientDB has become a leading player in the graph databases and Intersystems Cache is one of the best operational DBMS according to Magic Quadrant report.

Physical versus Logical Perspective

Not only has a multi-model database been flexible with its logical schema, but it also has a unified storage data architecture. Although the developer should hardly need access to the physical implementation details of the storage engine, an API for direct use of the engine is desirable and beneficial for many reasons. Most important, this kind of architecture allows someone to build a customized database management system. In theory, ANSI/SPARC three level architecture (external, conceptual/logical and physical) is an effort to allow these three perspectives to be relatively independent of each other, but in practice the front-end of a DBMS is most often strongly dependent on the back-end storage data model.

A loose coupling can be achieved with associative/multidimensional arrays. No matter what is their physical implementation, i.e. hash tables or trees, based on this abstract data type you can model all four NoSQL database types, (Key/Value, Tabular/Columnar, Document, Graph). For one reason or another, we are of the opinion that associative/multidimensional arrays will eventually prevail in the world of databases. There is already strong competition for their best physical implementation and sparse, column-family store, databases have proven to be very popular (HBase, Hypertable, BigTable, Intersystems Cache).
There are other properties that are crucial for operational database management systems such as ACID transactions, distributed data architecture, and scalability. Whether we are talking for a multi-model or single model graph databases, there is a tendency to use them for on-line transaction processing therefore these properties are worth having. And again in terms of architectural design there is always the problem of how to achieve a loose coupling between the physical structures of a database and the application logic.

Conceptual Framework

With that said it brings us to the question on what kind of logical/conceptual data model architecture to use. Our R3DM/S3DM framework is based on the powerful theory of the semiotic triangle. We use numerical vectors (signs), to encode abstract things in our mind (signified) to which the sign refers, e.g. Person, name, Car, model. We associate these with data containers-forms that the sign takes for the storage of data values (signifier), i.e. primitive data types (see also Signified and Signifier). This trilateral principle of our framework permits a uniform treatment of semantics, syntax and storage of information based on a symbolic representation. This way we define a fundamental, atomic information resource unit, (AIR). Those units, in turn, can be easily shaped to form any tabular, hierarchical, or graph data structure in a unified way. For example, study this R3DM hypergraph representation of Qlikview associative model. Data granularity can be also deeply connected and related to the definition of a fundamental unit of processing.
Based on this single primitive construct as a building block, (AIR), we have implemented seven type systems for an upper level management of any DBMS. These are:

SYSTEM	SHORTNAME
1. SYS_Dataset	DSS
2. SYS_DomainModel	DMS
3. SYS_EntityType	ETS
4. SYS_AttributeType	ATS
5. SYS_ValueType	VTS
6. SYS_LinkType	LTS
7. SYS_Database	DBS

We characterize Datasets, Domain Models (schemas), Entities, Attributes, etc, as information resources, values are information realization and our AIR units that represent everything are called information representations or simply references. Our current implementation phase has been completed on top of OrientDB and a forthcoming article will present R3DM/S3DM architecture in detail. In the past, Freebase collaborative knowledge graph had a type system that was built on primitive constructs.

Query Language

Yet another decisive norm in databases is the query language. With RDF directed, labeled graph data format and with RDF store databases respectively, e.g. OpenLink Virtuoso, AllegroGraph and Ontotext GraphDB, SPARQL query language is a standard way to retrieve data. On the contrary the query language of property graph databases varies a lot. There are similar to SQL APIs such as those of OrientDB and ArrangoDB, Neo4J is using its own Cypher declarative graph query language and there is also the Gremlin open-source graph programming language.
Another approach is that of GraphQL which is similar to Freebase MQL query language. Queries are shaped in JSON hierarchical format with patterns that follow the schema of the graph database.
We have developed a functional RESTful API that can be served as a prototype for a uniform, universal treatment of data language. Commands and their parameters can become more efficient and they can be simplified if we take on account the hierarchical relationship of Server, Database, Class, Property and Record containers. There are five sets of commands for getting, updating, deleting, adding and linking information. Current implementation is built with Wolfram Language and we will expose more details in a forthcoming article where we analyze R3DM/S3DM architecture.

Business Analytics

Last but not least, there is an emerging need for databases that can function as both analytic and operational. In particular, the modern data warehouse should unify all client’s transactional databases as well as integrate other external data sources that enable data cleansing, validation and enhancement. Not only that, but for quick and smart business analytics the interface should be both user friendly and functionally powerful. We are aware of such a player in this market segment with a technology that possess similar features to our R3DM/S3DM framework.

Epilogue

Make no mistake, relational databases are the past of computer database technology. Graph databases are the present and the future. This quick review on what we considered important criteria for graph database related technology products might leave the reader in more perplexity than satisfaction. This is our perspective, we wanted to share some of our knowledge with experts and chief technology persons on this field so that we could discuss the matter in more detail with them. The future will show in how many of these discussion topics we were right.