Measuring Semantic Relatedness using the Distance and the Shortest Common Ancestor and Outcast Detection with WordNet Digraph in Python
The following problem appeared as an assignment in the Algorithms course (COS 226) at Princeton University, taught by Prof. Sedgewick. The description of the problem is taken from the assignment itself. In the assignment the implementation is supposed to be in Java; in this article a Python implementation will be described instead. Rather than using nltk, this implementation is going to be from scratch.
The following two data files will be used to create the WordNet digraph. The files are in comma-separated values (CSV) format: each line contains a sequence of fields, separated by commas.
List of synsets: each line of the synsets file contains the information for one synset: the first field is the synset id, the second field is the synset itself, and the third field is its gloss. The individual nouns that constitute a synset are separated by spaces; if a noun contains more than one word, the underscore character connects the words (and not the space character). For example, line 36 means that the synset { AND_circuit, AND_gate } has an id number of 36 and its gloss is a circuit in a computer that fires only when all of its inputs fire.
List of hypernyms: each line of the hypernyms file contains a synset id followed by the ids of its hypernyms. For example, line 36 means that synset 36 (AND_circuit AND_gate) has 42338 (gate logic_gate) as its only hypernym. Line 34 means that synset 34 (AIDS acquired_immune_deficiency_syndrome) has two hypernyms: 47569 (immunodeficiency) and 56099 (infectious_disease).
Implement an immutable data type WordNet with the following API:
Performance requirements
isNoun() must run in time logarithmic (or better) in the number of nouns. distance() and sca() must make exactly one call to the length() and ancestor() methods in ShortestCommonAncestor, respectively.
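A minimal Python sketch of such a WordNet data type might look like the following. The CSV-parsing details and the internal data structures here are assumptions, not the assignment's reference implementation; distance() and sca() are left as stubs because they are required to delegate to ShortestCommonAncestor:

```python
from collections import defaultdict

class WordNet:
    # Sketch only: the field layout follows the CSV format described above;
    # the internal dictionaries are implementation assumptions.
    def __init__(self, synset_lines, hypernym_lines):
        self._synset = {}              # synset id -> list of nouns in the synset
        self._ids = defaultdict(list)  # noun -> ids of the synsets containing it
        for line in synset_lines:
            sid, synset, _gloss = line.split(',', 2)  # gloss may contain commas
            self._synset[int(sid)] = synset.split()
            for noun in synset.split():
                self._ids[noun].append(int(sid))
        self._adj = defaultdict(list)  # synset id -> hypernym ids (digraph edges)
        for line in hypernym_lines:
            fields = [int(f) for f in line.split(',')]
            self._adj[fields[0]].extend(fields[1:])

    def nouns(self):
        return iter(self._ids)

    def isNoun(self, word):
        # hash lookup: O(1) on average, within the logarithmic-or-better bound
        return word in self._ids

    def distance(self, noun_a, noun_b):
        # must make exactly one call to ShortestCommonAncestor.length()
        raise NotImplementedError

    def sca(self, noun_a, noun_b):
        # must make exactly one call to ShortestCommonAncestor.ancestor()
        raise NotImplementedError
```

Keeping a noun-to-synset-ids map in a dictionary is what makes isNoun() cheap; the hypernym adjacency lists are what the shortest-common-ancestor searches below traverse.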
Implement an immutable data type ShortestCommonAncestor with the following API:
Basic performance requirements
The data type must use space proportional to E + V, where E and V are the number of edges and vertices in the digraph, respectively. All methods and the constructor must take time proportional to E + V (or better).
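Within those bounds, the core computation can be sketched with two breadth-first searches, one from each source set, and a scan over the vertices reached by both. This is a sketch, assuming the digraph is given as a dict mapping each synset id to the list of its hypernym ids:

```python
from collections import deque

class ShortestCommonAncestor:
    # Sketch: BFS-based shortest common ancestor; the dict-of-lists
    # digraph representation is an assumption.
    def __init__(self, digraph):
        self._G = digraph

    def _bfs_distances(self, sources):
        # One BFS from the whole source set: O(E + V) time and space.
        dist = {}
        queue = deque()
        for s in sources:
            dist[s] = 0
            queue.append(s)
        while queue:
            v = queue.popleft()
            for w in self._G.get(v, []):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        return dist

    def _best(self, sources_v, sources_w):
        dv = self._bfs_distances(sources_v)
        dw = self._bfs_distances(sources_w)
        # The SCA minimizes the summed distance over common ancestors.
        return min((dv[x] + dw[x], x) for x in dv if x in dw)

    def length(self, v, w):
        return self._best([v], [w])[0]

    def ancestor(self, v, w):
        return self._best([v], [w])[1]

    # Subset versions operate on sets of vertices; WordNet's distance()
    # and sca() would call these with the synset-id sets of the two nouns.
    def lengthSubset(self, subset_a, subset_b):
        return self._best(subset_a, subset_b)[0]

    def ancestorSubset(self, subset_a, subset_b):
        return self._best(subset_a, subset_b)[1]
```

Running both BFS passes to completion keeps the code simple while staying within O(E + V); a more careful implementation could stop early once no shorter ancestral path is possible.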
Semantic relatedness refers to the degree to which two concepts are related. Measuring semantic relatedness is a challenging problem. For example, most people would consider George W. Bush and John F. Kennedy (two U.S. presidents) to be more closely related than George W. Bush and chimpanzee (two primates). It might not be clear whether George W. Bush and Eric Arthur Blair are more related than two arbitrary people. However, both George W. Bush and Eric Arthur Blair (a.k.a. George Orwell) are famous communicators and, therefore, closely related.
Let’s define the semantic relatedness of two WordNet nouns x and y as follows: distance(x, y) is the minimum length of any ancestral path between any synset v containing x and any synset w containing y, i.e., the shortest path that climbs from each noun up to a common ancestor in the hypernym digraph.
This is the notion of distance that we need to use to implement the distance() and sca() methods in the WordNet data type.
apple and potato (distance 5 in the Wordnet Digraph, as shown below)
As can be seen, the noun entity is the root of the Wordnet DAG.
beer and diaper (distance 13 in the Wordnet Digraph)
beer and milk (distance 4 in the Wordnet Digraph, with SCA as the drink synset), as expected, since they are semantically closer to each other.
bread and butter (distance 3 in the Wordnet Digraph, as shown below)
cancer and AIDS (distance 6 in the Wordnet Digraph, with SCA as disease as shown below; the BFS-computed distances and the target distance between the nouns are also shown)
car and vehicle (distance 2 in the Wordnet Digraph, with SCA as vehicle as shown below)
cat and dog (distance 4 in the Wordnet Digraph, with SCA as carnivore as shown below)
cat and milk (distance 7 in the Wordnet Digraph, with SCA as substance as shown below, here cat is identified as Arabian tea)
Einstein and Newton (distance 2 in the Wordnet Digraph, with SCA as physicist as shown below)
Leibniz and Newton (distance 2 in the Wordnet Digraph, with SCA as mathematician)
Gandhi and Mandela (distance 2 in the Wordnet Digraph, with SCA as national_leader synset)
laptop and internet (distance 11 in the Wordnet Digraph, with SCA as instrumentation synset)
school and office (distance 5 in the Wordnet Digraph, with SCA as construction synset as shown below)
bed and table (distance 3 in the Wordnet Digraph, with SCA as furniture synset as shown below)
Tagore and Einstein (distance 4 in the Wordnet Digraph, with SCA as intellectual synset as shown below)
Tagore and Gandhi (distance 8 in the Wordnet Digraph, with SCA as person synset as shown below)
Tagore and Shelley (distance 2 in the Wordnet Digraph, with SCA as author as shown below)
text and mining (distance 12 in the Wordnet Digraph, with SCA as abstraction synset as shown below)
milk and water (distance 3 in the Wordnet Digraph, with SCA as food, as shown below)
Given a list of WordNet nouns x1, x2, …, xn, which noun is the least related to the others? To identify an outcast, compute the sum of the distances between each noun and every other one:
di = distance(xi, x1) + distance(xi, x2) + … + distance(xi, xn)
and return a noun xt for which dt is maximum. Note that distance(xi, xi) = 0, so it will not contribute to the sum.
Implement an immutable data type Outcast
with the following API:
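A sketch of the Outcast type, assuming a WordNet-like object that exposes distance(noun_a, noun_b) as specified above:

```python
class Outcast:
    # Sketch: wordnet is assumed to expose distance(noun_a, noun_b)
    # as in the WordNet API described earlier.
    def __init__(self, wordnet):
        self._wn = wordnet

    def outcast(self, nouns):
        # d_i = sum of distances from noun i to every other noun;
        # distance(x, x) = 0, so including it in the sum is harmless.
        # The outcast is the noun that maximizes d_i.
        return max(nouns,
                   key=lambda x: sum(self._wn.distance(x, y) for y in nouns))
```

This makes O(n^2) distance() calls for n nouns, which is acceptable for the short noun lists used in the experiments below.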
As expected, potato is the outcast in the first list of nouns shown below: it has the maximum total distance from the rest of the nouns, all of which are fruits, whereas potato is not. This can be seen both in the WordNet distance heatmap and in the sum-of-distances plot that follow.
Again, as expected, table is the outcast in the next list of nouns: all of the others are mammals, but table is not, as the corresponding heatmap and sum-of-distances plot show.
Finally, as expected, bed is the outcast in the last list of nouns: all of the others are drinks, but bed is not, as the corresponding heatmap and sum-of-distances plot show.