How many times a day do we utter, or hear someone else utter, the phrase “Google it”? It’s hard to imagine that a phrase so ubiquitous and universally understood has been around for less than two decades. The word “Google” has become synonymous with online search, and the reason is simple: Google yields the most relevant, comprehensive results, quickly. Essentially, it has changed the way we find and interact with content and information.
We’ve seen the cultural effect Google has had on search and discovery on a broad level, but consider the implications for online media and publishing organizations interested in honing these same powerful search and discovery capabilities, a process referred to as dynamic semantic publishing. The results can be transformative, but many companies in the space still struggle with harnessing the technology.
Let’s take a deeper dive into what semantic publishing is, how it works and, most importantly, why it matters.
The idea of dynamic semantic publishing can often be a difficult concept to grasp because its use is not readily apparent to viewers. Rather, it’s a process centered on the curation, enrichment and analysis of text and how it’s organized even before users interact with it.
Semantic publishing includes a number of techniques and tactics, including semantic markup and recommendations. Through these techniques, computers are able to understand the structure, meaning and context of massive amounts of information in the form of words and phrases.
At this point, you may be thinking, “This sounds a lot like tagging.” As news organizations and publishers began taking their content online, tagging was the basic process used to categorize information. Basically, when you type a term into a site’s search box, the results returned will contain that word. However, dynamic semantic publishing goes well beyond simple content tagging.
At the heart of this solution are three core semantic technologies: text mining, a semantic database and a recommendation engine. Text mining is used to analyze content, extract new facts and generate metadata that enrich the text with links to the knowledge base. The semantic database stores pre-existing knowledge, such as thesauri and lists of people, organizations and geographic information. It also stores the newly found knowledge and the metadata delivered by the text mining process. The recommendation engine delivers personalized contextual results based on behavior, search history and text that has been interlinked with related information.
The text mining process runs continuously behind the scenes. Sometimes this process is referred to as “semantic annotation”. In essence, it’s a text-processing pipeline: articles are analyzed, sentences are split up, and entities are identified and classified in real time. The pipeline often uses related facts from other sources that have already been loaded into the semantic database. These Linked Open Data sources help resolve identities that are the same but referred to differently. During the annotation process, relationships between entities are discovered and stored, such as the relationships between people and where they work, live and travel. All of the results, known as semantic triples or “RDF statements”, are indexed in a high-performance triplestore (graph database engine) for search, analysis and discovery purposes.
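To make the annotation pipeline more concrete, here is a minimal sketch in Python. It is not how any production text mining system works: the gazetteer, the "ex:" identifiers, the ex:mentions and ex:associatedWith predicates and the rule that a person and an organization named in the same sentence are related are all assumptions made for illustration, and a plain Python set stands in for the triplestore.

```python
import re

# Hypothetical gazetteer standing in for knowledge already loaded into the
# semantic database. Note that two surface forms map to the same identifier.
GAZETTEER = {
    "Jem Rayfield": ("ex:JemRayfield", "Person"),
    "Financial Times": ("ex:FinancialTimes", "Organization"),
    "FT": ("ex:FinancialTimes", "Organization"),
    "London": ("ex:London", "Place"),
}

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def find_entities(sentence):
    """Return (identifier, type) pairs for gazetteer labels found in the sentence."""
    found = []
    for label, (uri, etype) in GAZETTEER.items():
        if re.search(r"\b" + re.escape(label) + r"\b", sentence):
            found.append((uri, etype))
    return found

def annotate(doc_id, text):
    """Produce a set of (subject, predicate, object) triples for one document."""
    triples = set()
    for sentence in split_sentences(text):
        entities = find_entities(sentence)
        for uri, etype in entities:
            triples.add((doc_id, "ex:mentions", uri))
            triples.add((uri, "rdf:type", "ex:" + etype))
        # Illustrative heuristic only: a person and an organization that
        # appear in the same sentence are assumed to be associated.
        people = [u for u, t in entities if t == "Person"]
        orgs = [u for u, t in entities if t == "Organization"]
        for person in people:
            for org in orgs:
                triples.add((person, "ex:associatedWith", org))
    return triples

article = ("Jem Rayfield leads architecture at the Financial Times. "
           "The FT is based in London.")
for triple in sorted(annotate("ex:article1", article)):
    print(triple)
```

Because “Financial Times” and “FT” resolve to the same identifier, both sentences contribute facts about a single organization, which is the kind of identity resolution the paragraph above describes.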
This knowledge base is extended with key terms and related concepts, all of which are linked to the original articles or documents. Oftentimes the pipeline encounters entities that require disambiguation. This is crucial to avoid confusion between Athens, Georgia and Athens, Greece, for example, or between the Federal Bureau of Investigation (FBI) and the Federation of British Industries. The final result is richly described content, all of which is interlinked and stored in the semantic repository.
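One simple way to picture disambiguation is to score each candidate entity by how many of its known context cues appear near the mention. The sketch below does exactly that for the Athens example. The cue words and "ex:" identifiers are invented for illustration, and real systems rely on far richer evidence, so treat this as a toy under those assumptions.

```python
import re

# Hypothetical context cues for each candidate meaning of an ambiguous name.
CANDIDATES = {
    "Athens": {
        "ex:Athens_Greece": {"greece", "acropolis", "parthenon", "aegean"},
        "ex:Athens_Georgia": {"georgia", "university", "bulldogs", "usa"},
    },
}

def disambiguate(mention, context):
    """Pick the candidate whose cue words overlap most with the context.

    Returns None when no cue word is present, so the mention can be
    left for an editor to resolve instead of being guessed.
    """
    words = set(re.findall(r"[a-z]+", context.lower()))
    scores = {uri: len(words & cues) for uri, cues in CANDIDATES[mention].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate("Athens", "The University of Georgia campus in Athens hosted the game"))
print(disambiguate("Athens", "Tourists climbed the Acropolis in Athens"))
print(disambiguate("Athens", "Athens was busy today"))
```

Returning None rather than a guess is a deliberate choice in this sketch: an unresolved mention is easier to correct than a confidently wrong link.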
To ensure the text mining algorithms are kept up to date, editorial feedback is collected, allowing machine learning to automatically adapt and retrain the algorithms. This way the knowledge base becomes smarter as publishers prepare to deliver on the promise of personally targeted content.
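The feedback loop can be pictured with a deliberately small stand-in for “retraining”: each editorial correction adds the surrounding words as evidence for the correct entity, and later predictions use that accumulated evidence. The class name, method names and "ex:" identifiers below are hypothetical, and simple word counting is swapped in here for whatever machine learning method a real system would use.

```python
import re
from collections import Counter, defaultdict

class FeedbackDisambiguator:
    """Toy model that learns context words for each entity from editor corrections."""

    def __init__(self):
        # Maps entity identifier -> counts of words seen in confirmed contexts.
        self.cues = defaultdict(Counter)

    @staticmethod
    def _words(text):
        return re.findall(r"[a-z]+", text.lower())

    def predict(self, candidates, context):
        """Return the candidate with the most learned evidence, or None if there is none."""
        words = self._words(context)
        scores = {uri: sum(self.cues[uri][w] for w in words) for uri in candidates}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None

    def record_correction(self, correct_uri, context):
        """Store an editor's decision so that future predictions can use it."""
        self.cues[correct_uri].update(self._words(context))

model = FeedbackDisambiguator()
candidates = ["ex:FBI_US", "ex:FBI_UK"]

# Before any feedback the model has no evidence and declines to guess.
print(model.predict(candidates, "The FBI opened an investigation"))

# An editor confirms which entity was meant; the model adapts.
model.record_correction("ex:FBI_US", "The FBI opened an investigation in Washington")
print(model.predict(candidates, "FBI agents began an investigation"))
```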
When this data is combined with web visitor profiles and user search history, the recommendation engine takes over. Massive amounts of data are analyzed to determine the most likely news articles of interest. This magical blend of profiles, history, structured entity descriptions, classified facts, relationships and enhanced knowledge delivers a wonderful user experience. Visitors are automatically delivered highly relevant news.
In a recent article entitled “Financial Times Builds Its Publishing Infrastructure”, Jennifer Zaino spoke with Jem Rayfield, the Head of Solution Architecture at the FT. She writes:
“Among other semantically-infused capabilities available now are recommended reads, based on the concept of semantic fingerprints, Rayfield explains. That is, the Financial Times leverages a concept extraction mechanism to understand what published stories cover and annotates them with identifiers found within the content, which is matched against users’ reading habits to identify what they’d likely want to peruse next. A similar approach semantically matches readers to ads. ‘Semantic advertising is good in terms of getting very targeted ads to the right people and profiles so that we can charge more for ads,’ he says. ‘And serving the right content to the right people gets them to click to read more content so people stay on the site longer.’”
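The idea of matching annotated stories against reading habits can be sketched in a few lines of Python. This is not the Financial Times implementation, which the article does not detail. It assumes that each story is already annotated with concept identifiers (the "ex:" names and headlines below are invented), builds a reader profile by counting the concepts in stories already read, and ranks unread stories by cosine similarity to that profile.

```python
import math
from collections import Counter

def fingerprint(concept_lists):
    """Build a reader profile by counting concepts across the stories read."""
    return Counter(c for concepts in concept_lists for c in concepts)

def cosine(a, b):
    """Cosine similarity between two concept-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def recommend(history, candidates, top_n=2):
    """Rank candidate stories by similarity to the reader's profile."""
    profile = fingerprint(history)
    scored = [(cosine(profile, Counter(concepts)), title)
              for title, concepts in candidates.items()]
    scored.sort(reverse=True)
    # Drop stories that share no concept with the reader's history.
    return [title for score, title in scored[:top_n] if score > 0]

# Concepts attached to stories this reader has already opened.
history = [["ex:ECB", "ex:InterestRates"], ["ex:ECB", "ex:Euro"]]

# Unread stories and their concept annotations.
candidates = {
    "ECB holds rates": ["ex:ECB", "ex:InterestRates"],
    "Premier League roundup": ["ex:Football"],
    "Euro slides": ["ex:Euro", "ex:Dollar"],
}

print(recommend(history, candidates))
```

The same matching could in principle be pointed at ads instead of stories, which is the parallel Rayfield draws in the quote.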
Clearly, solutions like this need to scale. At the same time that hundreds of queries per second are taking place to serve requests on your website, authors are also enriching new content, which is then committed to the database and available for the next search. As they write prose, they are prompted with related news and facts they can use in authoring. This instant feedback directly correlates with author productivity. If anyone sees something misclassified, they can correct it and commit the change to the underlying triplestore, leading to a smarter knowledge base that can drive recommendations.
Let’s look at another real example.
For the 2010 World Cup, the BBC was looking for a new way to manage its web content, including text, video, images and data that encompassed 32 teams, eight groups and 776 individual players. As with many publishing organizations, there was simply too much content and too few journalists to create and manage it all.
The BBC implemented a Dynamic Semantic Publishing framework to accurately deliver large volumes of timely content about the matches, groups, teams and players without increasing the amount of costly manual editorial intervention.
Semantic publishing is dramatically changing the way we consume information. It automates the process of organizing and deciding what content goes where on the web, so that news and media publishers can quickly and accurately manage content, create more and deliver a personalized user experience. While the BBC is just one example, it becomes easy to see the far-reaching implications of online media and publishing companies embracing the same approach.
Specifically, semantic publishing can drive real business results, including: increased productivity by enhancing content authoring, editorial and delivery phases; expanded knowledge base and content offerings; personalized content recommendations to coincide with readers’ interests; and extended content life and repurposing to directly impact the bottom line.
As digital publishing companies house an ever-increasing amount of digital information, those that embrace semantic technology will win on so many levels. News aggregators can categorize and assemble related content faster. Researchers are able to pinpoint exactly what they are searching for at record speeds. Decision makers can accurately assess performance with visibility into the timing and volume of content read. Writers are informed in real time and yield more content. Web site visitors get recommendations they never thought were possible and advertisers benefit from increased response.
There is nothing precluding this very same technology from being applied elsewhere. Online educators can use it to personalize content. Scientific publishers can standardize and classify content using a common language while also delivering relevant information. Pharma companies can index knowledge extracted from complex biomedical research. Healthcare providers can analyze and integrate dozens of data sources useful in diagnosis and treatment. Government agencies can search corpora of intelligence useful in mission-critical applications. To learn more about this technology, visit www.ontotext.com.
Tony Agresta is the Managing Director of Ontotext USA. Ontotext was established in 2000 to address challenges in semantic technology using text mining and graph databases.