This post was originally published on our Text Analysis blog, as part of a Text Analysis 101 series.
We recently added a feature to our API that allows users to classify text according to their own labels. This unsupervised method of classification relies on Explicit Semantic Analysis to determine how closely a piece of text matches a label or tag.
This method of classification provides greater flexibility when classifying text and doesn't rely on a particular taxonomy to understand and categorize a piece of text.
Explicit Semantic Analysis (ESA) works at the level of meaning rather than on the surface form vocabulary of a word or document. ESA represents the meaning of a piece of text as a combination of the concepts found in the text. It is used in document classification, semantic relatedness calculation (i.e., how similar in meaning two words or pieces of text are to each other), and information retrieval.
In document classification, for example, documents are tagged to make them easier to manage and sort. Tagging a document with keywords makes it easier to find. However, keyword tagging alone has its limitations: a search that uses different words with a similar meaning may not uncover relevant documents. Classifying text semantically, i.e., representing the document as concepts and lowering the dependence on specific keywords, can greatly improve a machine's understanding of text.
Wikipedia is a large and diverse knowledge base where each article can be considered a distinct concept. In Wikipedia-based ESA, a concept is generated for each article. Each concept is then represented as a vector of the words that occur in the article, weighted by their tf-idf score.
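To make the tf-idf weighting concrete, here is a minimal Python sketch. The three short "articles" are hypothetical stand-ins for Wikipedia articles, and the function name `tfidf_concept_vectors` and the particular tf-idf variant (relative term frequency multiplied by the log of inverse document frequency) are assumptions made for illustration, not a description of our implementation.

```python
import math
from collections import Counter

# Hypothetical stand-ins for Wikipedia articles: one concept per article.
articles = {
    "Mammal": "cat dog whale mammal fur milk",
    "Pet": "cat dog pet owner home",
    "Ocean": "whale ocean water fish salt",
}


def tfidf_concept_vectors(articles):
    """Return {concept: {word: tf-idf weight}} for a dict of article texts."""
    tokenized = {name: text.lower().split() for name, text in articles.items()}
    n_docs = len(tokenized)

    # Document frequency: the number of articles that contain each word.
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))

    vectors = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        vectors[name] = {
            # tf = relative frequency in the article, idf = log(N / df).
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return vectors


concept_vectors = tfidf_concept_vectors(articles)
```

In this toy corpus, "fur" appears in only one article while "cat" appears in two, so "fur" receives the higher weight within the "Mammal" concept.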
The meaning of any given word can then be represented as a vector of that word’s relatedness, or “association weighting”, to the Wikipedia-based concepts.
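A trivial example might be the word “cat”, which would carry a high association weighting for a concept such as “Pet” and a weighting close to zero for an unrelated concept such as “Ocean”.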
By comparing two word vectors using cosine similarity, we can get a numerical value for the semantic relatedness of the words, i.e., we can quantify how similar the words are to each other based on their association weighting to the various concepts.
Note: In Text Analysis a vector is simply a numerical representation of a word or document. It is easier for algorithms to work with numbers than with characters. Additionally, vectors can be plotted graphically, and the “distance” between them is a visual representation of how closely related words and documents are in meaning.
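The comparison can be sketched in a few lines of Python. The concept names and association weights below are invented for illustration and are not taken from a real Wikipedia-based model; `cosine_similarity` is simply the standard formula, the dot product divided by the product of the vector lengths.

```python
import math

# Hypothetical association weights of each word to three concepts,
# in the order: Mammal, Pet, Ocean.
word_vectors = {
    "cat": [0.8, 0.9, 0.0],
    "dog": [0.7, 0.9, 0.0],
    "whale": [0.6, 0.0, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction, so treat it as unrelated.
    return dot / (norm_a * norm_b)


cat_dog = cosine_similarity(word_vectors["cat"], word_vectors["dog"])
cat_whale = cosine_similarity(word_vectors["cat"], word_vectors["whale"])
print(f"cat vs dog:   {cat_dog:.3f}")
print(f"cat vs whale: {cat_whale:.3f}")
```

With these made-up weights, "cat" scores much closer to "dog" than to "whale", because the first two words share strong associations with the same concepts.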
Larger documents are represented as a combination of the individual word vectors derived from the words within the document. The resulting document vectors are known as “concept” vectors.
Graphically, we can represent a concept vector as the centroid of the word vectors it is composed of, i.e., the center or average position of those vectors.
So, to compare how similar two phrases are, we can create their concept vectors from their constituent word vectors and then compare the two, again using cosine similarity.
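As a rough sketch of the phrase comparison, the Python below averages word vectors into a centroid and compares the results. The word vectors are hypothetical, and the helper names `centroid` and `concept_vector`, along with the choice to skip words that have no vector, are assumptions made for the example.

```python
import math

# Hypothetical association weights, in the order: Mammal, Pet, Ocean.
word_vectors = {
    "cat": [0.8, 0.9, 0.0],
    "dog": [0.7, 0.9, 0.0],
    "kitten": [0.7, 0.8, 0.0],
    "whale": [0.6, 0.0, 0.9],
    "fish": [0.1, 0.2, 0.9],
}


def centroid(vectors):
    """Average position of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def concept_vector(phrase):
    """Centroid of the vectors of the known words in a phrase."""
    known = [word_vectors[w] for w in phrase.lower().split() if w in word_vectors]
    if not known:
        raise ValueError(f"no known words in phrase: {phrase!r}")
    return centroid(known)


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


pets = concept_vector("cat dog")
similar = cosine_similarity(pets, concept_vector("kitten"))
different = cosine_similarity(pets, concept_vector("whale fish"))
```

Here `similar` comes out far higher than `different`, which matches the intuition that "cat dog" is closer in meaning to "kitten" than to "whale fish".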
This functionality is particularly useful when you want to classify a document, but you don't want to use a known taxonomy. It allows you to specify a proprietary taxonomy on the fly on which to base the classification. You provide the text to be classified along with the candidate labels, and ESA determines which label is most closely related to your piece of text.
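Putting the pieces together, classification against your own labels might look like the sketch below. This is not our API or its implementation: the `classify` function, the concept names and every weight are hypothetical, and a real system would derive the word vectors from Wikipedia rather than from a hand-written table.

```python
import math

# Hypothetical association weights, in the order: Football, Election, Cooking.
word_vectors = {
    "goal": [0.9, 0.1, 0.0],
    "match": [0.8, 0.2, 0.0],
    "vote": [0.0, 0.9, 0.0],
    "recipe": [0.0, 0.0, 0.9],
    "sport": [0.9, 0.0, 0.1],
    "politics": [0.1, 0.9, 0.0],
    "food": [0.0, 0.0, 0.9],
}


def concept_vector(text):
    """Centroid of the vectors of the known words in a text or label."""
    known = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not known:
        raise ValueError(f"no known words in: {text!r}")
    return [sum(column) / len(known) for column in zip(*known)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, labels):
    """Return (label, score) pairs sorted from most to least related."""
    text_vector = concept_vector(text)
    scores = [
        (label, cosine_similarity(text_vector, concept_vector(label)))
        for label in labels
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranked = classify("late goal wins the match", ["sport", "politics", "food"])
best_label = ranked[0][0]
print(best_label)  # prints "sport" with the weights above
```

Because the labels are turned into concept vectors in the same way as the text, any set of labels can be supplied at call time, which is what makes an on-the-fly taxonomy possible.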
ESA operates at the level of concepts and meaning rather than just the surface form vocabulary. As such, it can improve the accuracy of document classification, information retrieval and semantic relatedness calculation.
If you would like to know more about this topic, check out this excellent blog post from Christopher Olah and this very accessible research paper from Egozi, Markovitch and Gabrilovich, both of which I referred to heavily when researching this blog post.
Keep an eye out for more in our “Text Analysis 101” series. You can sign up for a free API account here.