Originally posted on Data Science Central
Things, not Strings
Entity-centric views on enterprise information and all kinds of data sources give a more meaningful picture of all sorts of business objects. This method of information processing is as relevant to customers, citizens, or patients as it is to knowledge workers like lawyers, doctors, or researchers. People do not actually search for documents; they search for facts and other chunks of information that they can combine to answer concrete questions.
Strings, or names for things, are not the same as the things they refer to. Still, these two aspects of an entity are regularly mixed up, which feeds a Babylonian confusion of languages. Any search term can refer to different things, which is why Google has rolled out its own knowledge graph to help organize information on the web at large scale.
Semantic graphs can form the backbone of any information architecture, not only on the web. They also enable entity-centric views on enterprise information and data. Such graphs of things contain information about business objects (such as products, suppliers, employees, locations, research topics, …), their different names, and their relations to each other. Information about entities can be found in structured (relational databases), semi-structured (XML), and unstructured (text) data objects. However, people are interested not in the containers but in the entities themselves, so these entities need to be extracted and organized in a reasonable way.
Machines and algorithms use semantic graphs to retrieve not only the business objects themselves but also the relations between them, even if those relations are not explicitly stated. As a result, ‘knowledge lenses’ are delivered that help users better understand the meaning of business objects in a specific context.
Personalization of information
The ability to view entities or business objects in different ways, depending on context, is key for many knowledge workers. For example, a drug has regulatory aspects and a therapeutic character, and it means something else again to product managers or salespeople. Users benefit quickly when they are confronted only with those aspects of an entity that are relevant in a given situation. This kind of personalized information processing depends heavily on a semantic layer on top of the data layer, especially when information is stored in various forms and scattered across different repositories.
The meaning of content assets and the interest profiles of users are modelled with the very same methodology. In both cases semantic graphs are used, and the linking of various types of business objects works the same way.
Recommender engines based on semantic graphs can link similar or related content items and documents in a highly precise manner. The same algorithms help to link users to content assets or products. This approach is the basis for ‘push services’ that try to ‘understand’ users’ needs in a sophisticated way.
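To make the idea concrete, here is a deliberately simplified sketch in plain Python. It assumes that content assets and user profiles have already been annotated with concepts from a semantic graph, and it scores matches by concept overlap (Jaccard similarity). The concept identifiers, item names, and the `recommend` helper are hypothetical, and a real engine would typically also exploit the relations between concepts rather than overlap alone.

```python
# Hypothetical concept annotations for content assets and one user interest profile.
content = {
    "article-1": {"c:diabetes", "c:insulin", "c:regulation"},
    "article-2": {"c:insulin", "c:dosage"},
    "article-3": {"c:marketing", "c:pricing"},
}
user_profile = {"c:insulin", "c:dosage", "c:diabetes"}


def jaccard(a, b):
    """Share of concepts two sets have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(profile, items, top_n=2):
    """Rank items by concept overlap with the profile; drop items with no overlap."""
    scored = [(jaccard(profile, concepts), item_id)
              for item_id, concepts in items.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [item_id for score, item_id in ranked[:top_n] if score > 0]


# The same function links a user to content ...
print(recommend(user_profile, content))

# ... and one content asset to similar content assets.
others = {k: v for k, v in content.items() if k != "article-1"}
print(recommend(content["article-1"], others))
```

Because profiles and content are described with the same concepts, one scoring function serves both the user-to-content and the content-to-content case.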
‘Not only MetaData’ Architecture
Together with the data and content layer and its corresponding metadata, this approach unfolds into a four-layered information architecture as depicted here.
Following the NoSQL paradigm, which stands for ‘Not only SQL’, one could call this content architecture ‘Not only Metadata’, or ‘NoMeDa’ for short. It stresses the importance of the semantic layer on top of all kinds of data. Semantics is no longer buried in data silos but linked to the metadata of the underlying data assets. It therefore helps to ‘harmonize’ different metadata schemes and vocabularies, and it makes the semantics of metadata, and of data in general, explicitly available. While metadata is most often stored per data source, and metadata sets are therefore not linked to each other, the semantic layer is no longer embedded in individual databases. It reflects the shared understanding of a certain domain, and through its graph-like structure it can directly support several complex tasks in information management:
- Knowledge discovery, search and analytics
- Information and data linking
- Recommendation and personalization of information
- Data visualization
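As a rough illustration of what ‘harmonizing’ metadata through a semantic layer can mean, the following plain-Python sketch maps source-specific metadata values to shared concepts. The two sources, their field names, the `geo:` identifiers, and the `location_concept` helper are all assumptions made for this example, not part of any particular product.

```python
# Hypothetical metadata records from two sources that use different
# field names and different local vocabularies for the same thing.
crm_record = {"id": "crm-17", "region": "AT"}
dms_record = {"id": "dms-03", "country": "Österreich"}

# Hypothetical semantic layer: which field carries the location in each
# source, and which shared concept each local value denotes.
field_for_location = {"crm": "region", "dms": "country"}
value_to_concept = {
    ("crm", "AT"): "geo:austria",
    ("dms", "Österreich"): "geo:austria",
}


def location_concept(source, record):
    """Resolve a source-specific value to a shared concept, or None if unmapped."""
    value = record.get(field_for_location[source])
    return value_to_concept.get((source, value))


print(location_concept("crm", crm_record))
print(location_concept("dms", dms_record))
```

Once both records resolve to the same concept, they can be searched, linked, and aggregated together even though their original metadata never matched.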
Graph-based Data Modelling
Graph-based semantic models resemble the way human beings tend to construct their own models of the world. Any person, not only a subject matter expert, organizes information by at least the following six principles:
- Draw a distinction between all kinds of things: ‘This thing is not that thing’
- Give things names: ‘This thing is my dog Goofy’ (some might call it Dippy Dawg, but it’s still the same thing)
- Categorize things: ‘This thing is a dog but not a cat’
- Create general facts and relate categories to each other: ‘Dogs don’t like cats’
- Create specific facts and relate things to each other: ‘Goofy is a friend of Donald’, ‘Donald is the uncle of Huey, Dewey, and Louie’, etc.
- Use various languages for this; e.g. the above-mentioned fact in German is ‘Donald ist der Onkel von Tick, Trick und Track’ (remember: the thing called ‘Huey’ is the same thing as the thing called ‘Tick’; only the name or label for this thing differs between languages).
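The six principles above can be sketched in a few lines of plain Python, with facts written as (subject, predicate, object) triples and names kept separate from the things they denote. The `ex:` identifiers and predicate names are made up for illustration; `rdf:type` is borrowed from RDF only as a familiar label, and no RDF library is used.

```python
# Facts as (subject, predicate, object) triples. Identifiers are illustrative.
triples = {
    # Distinguish and categorize things
    ("ex:goofy", "rdf:type", "ex:Dog"),
    ("ex:donald", "rdf:type", "ex:Duck"),
    ("ex:huey", "rdf:type", "ex:Duck"),
    # A general fact relating categories
    ("ex:Dog", "ex:dislikes", "ex:Cat"),
    # Specific facts relating things
    ("ex:goofy", "ex:friendOf", "ex:donald"),
    ("ex:donald", "ex:uncleOf", "ex:huey"),
}

# Names are attached to things, per language; one thing may have several names.
labels = {
    "ex:goofy": {"en": ["Goofy", "Dippy Dawg"]},
    "ex:donald": {"en": ["Donald"], "de": ["Donald"]},
    "ex:huey": {"en": ["Huey"], "de": ["Tick"]},
}


def things_named(name):
    """Return every thing that carries the given name in any language."""
    return {thing
            for thing, by_lang in labels.items()
            for names in by_lang.values()
            if name in names}


def objects(subject, predicate):
    """Return all objects related to the subject by the predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


print(things_named("Tick") == things_named("Huey"))
print(objects("ex:donald", "ex:uncleOf"))
```

Looking up ‘Tick’ and ‘Huey’ leads to the same identifier, which is the ‘things, not strings’ idea in miniature.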
These fundamental principles for the organization of information are well reflected by semantic knowledge graphs. The same information could be stored as XML, or in a relational database, but it’s more efficient to use graph databases instead for the following reasons:
- The way people think fits well with information that is modelled and stored as graphs; little or no translation is necessary.
- Graphs serve as a universal meta-language to link information from structured and unstructured data.
- Graphs open the door to better-aligned data management across larger organizations.
- Graph-based semantic models can also be understood by subject matter experts, who are actually the experts in a certain domain.
- The search capabilities provided by graphs let you discover previously unknown links and non-obvious patterns, giving you new insights into your data.
- For semantic graph databases, there is a standardized query language called SPARQL that allows you to explore data.
- In contrast to traditional ways to query databases where knowledge about the database schema/content is necessary, SPARQL allows you to ask “tell me what is there”.
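The ‘tell me what is there’ style of querying can be imitated in plain Python with a wildcard pattern match over triples. This is only a loose analogy to a SPARQL query with variables in every position, not SPARQL itself; the sample data and the `match` and `describe_dataset` helpers are hypothetical.

```python
# Hypothetical sample data as (subject, predicate, object) triples.
triples = [
    ("ex:acme", "rdf:type", "ex:Supplier"),
    ("ex:acme", "ex:locatedIn", "ex:vienna"),
    ("ex:widget", "rdf:type", "ex:Product"),
    ("ex:widget", "ex:suppliedBy", "ex:acme"),
]


def match(s=None, p=None, o=None):
    """Return all triples compatible with the pattern; None matches anything."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def describe_dataset():
    """Explore without prior schema knowledge: which relations and classes occur?"""
    predicates = sorted({p for _, p, _ in match()})
    classes = sorted({o for _, _, o in match(p="rdf:type")})
    return predicates, classes


predicates, classes = describe_dataset()
print(predicates)
print(classes)
```

Nothing in `describe_dataset` names a table or column in advance; the structure is discovered from the data, which is the point the list item above makes.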
Making the semantics of data and metadata explicit is even more powerful when based on standards. A framework for this purpose has evolved over the past 15 years at W3C, the World Wide Web Consortium. Although this stack of standards was initially designed for the World Wide Web, many enterprises have been adopting it for Enterprise Information Management. They now benefit from being able to integrate and link data from internal and external sources at relatively low cost.
At the base of all those standards, the Resource Description Framework (RDF) serves as a ‘lingua franca’ to express all kinds of facts that can involve virtually any kind of category or entity, and also all kinds of relations. RDF can be used to describe the semantics of unstructured text, XML documents, or even relational databases. The Simple Knowledge Organization System (SKOS) is based on RDF. SKOS is widely used to describe taxonomies and other types of controlled vocabularies. SPARQL can be used to traverse and make queries over graphs based on RDF or standard schemes like SKOS.
With SPARQL, far more complex queries can be executed than with most other database query languages. For instance, hierarchies can be traversed and aggregated recursively: a geographical taxonomy can then be used to find all documents containing places in a certain region although the region itself is not mentioned explicitly.
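The geographical example can be sketched in plain Python as a transitive walk over a taxonomy. In SPARQL 1.1 this kind of traversal is typically written with a property path; the sketch below only mimics the logic, and the taxonomy, document names, and helper functions are invented for illustration.

```python
# Hypothetical geographic taxonomy: concept -> directly narrower concepts.
narrower = {
    "Europe": ["Austria", "Germany"],
    "Austria": ["Vienna", "Graz"],
    "Germany": ["Berlin"],
}

# Hypothetical annotations: document -> places it mentions.
doc_places = {
    "report.pdf": {"Vienna"},
    "memo.txt": {"Berlin", "Tokyo"},
    "notes.md": {"Tokyo"},
}


def descendants(concept):
    """Collect the concept and everything below it in the taxonomy."""
    found = {concept}
    stack = [concept]
    while stack:
        for child in narrower.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def documents_in(region):
    """Find documents mentioning any place inside the region."""
    places = descendants(region)
    return sorted(doc for doc, mentioned in doc_places.items()
                  if mentioned & places)


print(documents_in("Europe"))
```

Neither matching document mentions ‘Europe’; they are found because the taxonomy knows that Vienna and Berlin lie within it.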
Standards-based semantics also helps to make use of existing knowledge graphs. Many government organisations have published high-quality taxonomies and semantic graphs using semantic web standards. These can easily be picked up and extended with your own data and specific knowledge.
Semantic Knowledge Graphs will grow with your needs!
Standards-based semantics provides yet another advantage: it is becoming increasingly easy to hire skilled people who have worked with standards like RDF, SKOS or SPARQL before. Even so, experienced knowledge engineers and data scientists are a comparatively rare species. It is therefore crucial to grow both graphs and modelling skills over time. Starting with SKOS and then extending an enterprise knowledge graph by introducing more schemes and by mapping to other vocabularies and datasets is a well-established agile procedure model.
A graph-based semantic layer in an enterprise can be expanded step by step, just like any other network. As with a street network, start with the main roads, then introduce more and more connecting roads, and classify streets, places, and intersections with an increasingly fine-grained classification system. It all comes down to an evolving semantic graph that will serve more and more as a map of your data, content and knowledge assets.
Semantic Knowledge Graphs and your Content Architecture
It’s a matter of fact that semantics serves as a kind of glue between unstructured and structured information and as a foundation layer for data integration efforts. But even for enterprises dealing mainly with documents and text-based assets, semantic knowledge graphs will do a great job.
Semantic graphs extend the functionality of a traditional search index. They don’t simply annotate documents and store occurrences of terms and phrases; they introduce concept-based indexing in contrast to term-based approaches. Remember: semantics helps to identify the things behind the strings. The same applies to concept-based search over content repositories: documents get linked to the semantic layer, and therefore the knowledge graph can be used not only for typical retrieval but also to classify, aggregate, filter, and traverse the content of documents.
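The difference between term-based and concept-based indexing can be shown with a small plain-Python sketch. It assumes a tiny controlled vocabulary in which each concept has several labels, and it uses naive whole-phrase matching; real annotators are considerably more sophisticated. The concept identifiers, labels, and documents are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical mini-vocabulary: concept id -> labels (synonyms) in lower case.
concept_labels = {
    "c:jaguar_animal": ["jaguar cat", "panthera onca"],
    "c:jaguar_car": ["jaguar cars", "jaguar xf"],
    "c:nyc": ["new york city", "nyc", "big apple"],
}


def annotate(text):
    """Return the concepts whose labels occur in the text as whole phrases."""
    lowered = text.lower()
    return {cid
            for cid, labels in concept_labels.items()
            if any(re.search(r"\b" + re.escape(label) + r"\b", lowered)
                   for label in labels)}


def build_concept_index(docs):
    """Map each concept to the documents in which it was recognized."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for cid in annotate(text):
            index[cid].add(doc_id)
    return index


docs = {
    "d1": "Flights to NYC are cheaper in winter.",
    "d2": "The Big Apple hosts the conference.",
    "d3": "Panthera onca lives in the rainforest.",
}
index = build_concept_index(docs)
print(sorted(index["c:nyc"]))
```

A term-based index would treat ‘NYC’ and ‘Big Apple’ as unrelated strings; the concept index files both documents under the same thing.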
PoolParty combines Machine Learning with Human Intelligence
Semantic knowledge graphs have the potential to innovate data and information management in any organisation. Besides questions around integrability, it is crucial to develop strategies to create and sustain the semantic layer efficiently.
The semantic technologies that can be used for this endeavour cover a broad spectrum, ranging from manual to fully automated approaches. The promise of deriving high-quality semantic graphs from documents fully automatically has not been fulfilled to date. On the other hand, handcrafted semantics is error-prone, incomplete, and too expensive. The best solution often lies in a combination of different approaches. PoolParty combines Machine Learning with Human Intelligence: extensive corpus analysis and corpus learning support taxonomists, knowledge engineers and subject matter experts in the maintenance and quality assurance of semantic knowledge graphs and controlled vocabularies. As a result, enterprise knowledge graphs are more complete, up to date, and consistently used.
“An Enterprise without a Semantic Layer is like a Country without a Map.”