
The Word2Vec Algorithm

This article is an excerpt from “Natural Language Processing and Computational Linguistics” published by Packt.


Introduction

Arguably the most important application of machine learning in text analysis, the Word2Vec algorithm is both a fascinating and very useful tool. As the name suggests, it creates a vector representation of words based on the corpus we are using. But the magic of Word2Vec is in how it manages to capture the semantics of a word in a vector. The papers Efficient Estimation of Word Representations in Vector Space [1] [Mikolov et al. 2013], Distributed Representations of Words and Phrases and their Compositionality [2] [Mikolov et al. 2013], and Linguistic Regularities in Continuous Space Word Representations [3] [Mikolov et al. 2013] lay the foundations for Word2Vec and describe its uses.

We’ve mentioned that these word vectors help represent the semantics of words – what exactly does this mean? For starters, it means we can use vector reasoning on these words. One of the most famous examples is from Mikolov’s paper: if we use V(word) to represent the vector representation of a word and compute V(King) – V(Man) + V(Woman), the resulting vector is closest to V(Queen). It is easy to see why this is remarkable – our intuitive understanding of these words is reflected in their learned vector representations!

This adds more of a punch to our text analysis pipelines – having an intuitive semantic representation of words as vectors (and, by extension, documents – but we’ll get to that later) will come in handy more than once.

Finding word-pair relationships is one such interesting use – if we define a relationship between two words, such as France : Paris, we can use the appropriate vector difference to identify other similar relationships – Italy : Rome and Japan : Tokyo are two such examples found using Word2Vec. We can continue to play with these vectors like any other vectors – by adding two vectors, we can attempt to get what we would consider the addition of two words. For example, V(Vietnam) + V(Capital) is closest to the vector representation of V(Hanoi).
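Once a model has been trained (we do exactly that with gensim later in this article), these relationships can be queried directly. The following is a minimal sketch, assuming a trained gensim model object named model and a lowercased vocabulary:

    # France : Paris :: Italy : ?  ->  V(Paris) - V(France) + V(Italy) is closest to V(Rome)
    print(model.wv.most_similar(positive=['paris', 'italy'], negative=['france'], topn=1))

    # V(Vietnam) + V(Capital)  ->  closest to V(Hanoi)
    print(model.wv.most_similar(positive=['vietnam', 'capital'], topn=1))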

How exactly does this technique result in such an understanding of words? Word2Vec works by understanding context – in particular, what kinds of words tend to appear around a given word? We choose a sliding window size, and based on this window size we attempt to identify the conditional probability of observing the output word given the surrounding words. For example, if the sentence is The personal nature of text data always adds an extra bit of motivation, and it also likely means we are aware of the nature of the data, and what kind of results to expect., and our target word is the word in bold, motivation, we try to figure out the odds of finding the word motivation if the context is always adds an extra bit of on the left side of the window and and it also likely means on the right. Of course, this is just an illustrative example – the exact training procedure requires us to choose a window size and the number of dimensions, among other details.
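To make the notion of a context window concrete, here is a small illustrative sketch (not part of Word2Vec’s actual training code) that collects the surrounding words for each target word in a tokenized sentence, given a window size:

    def context_pairs(tokens, window=2):
        # For each position, pair the target word with the words that fall
        # inside the window on its left and on its right.
        pairs = []
        for i, target in enumerate(tokens):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            pairs.append((left + right, target))
        return pairs

    sentence = "the personal nature of text data always adds an extra bit of motivation".split()
    for context, target in context_pairs(sentence):
        print(target, '<-', context)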

There are two main methods to perform Word2Vec training: the Continuous Bag of Words (CBOW) model and the Skip Gram model. The underlying architecture of these models is described in the original research papers, and both methods involve understanding the context we talked about before. The papers written by Mikolov et al. provide further details of the training process, and since the code is public, it means we actually know what’s going on under the hood!

This blog post [4] (Word2Vec Tutorial – The Skip-Gram Model) by Chris McCormick explains some of the mathematical intuition behind the skip-gram word2vec model, and this post [5] (The amazing power of word vectors) by Adrian Colyer talks about some of the things we can do with word2vec. These links are useful if you wish to dig a little deeper into the mathematical details of Word2Vec. This resources page [6] contains theory and code resources for Word2Vec and is also useful in case you wish to look up the original material or other implementation details.

While Word2Vec remains the most popular word vector implementation, it is neither the first attempt at word embeddings nor the last – we will discuss some of the other word embedding techniques in the last section of this chapter. Right now, let’s jump into using these word vectors ourselves.

Gensim comes to our assistance again, with arguably the most reliable open source implementation of the algorithm, and we will explore how to use it.

Using Word2Vec with Gensim

While the original C code [7] released by Google does an impressive job, gensim’s implementation is a case where an open source implementation is more efficient than the original.

The gensim implementation was coded up back in 2013, around the time the original algorithm was released – this blog post by Radim Řehůřek [8] chronicles some of the thoughts and problems encountered in implementing it for gensim, and is worth reading if you would like to know the process of coding word2vec in Python. The interactive web tutorial [9] involving word2vec is quite fun and illustrates some of the examples of word2vec we previously talked about. It is worth looking at if you’re interested in running gensim word2vec code online, and it can also serve as a quick tutorial on using word2vec in gensim.

We will now get into actually training our own Word2Vec model. The first step, as with all the other gensim models we have used, involves importing the appropriate model.

   from gensim.models import word2vec

At this point, it is important to go through the documentation for the word2vec class, as well as the KeyedVectors class, both of which we will use a lot. From the documentation page, we list the parameters for the word2vec.Word2Vec class (a short example combining a few of these parameters follows the list).

  1. sg: Defines the training algorithm. By default (sg=0), CBOW is used. Otherwise (sg=1), skip-gram is employed.
  2. size: This is the dimensionality of the feature vectors.
  3. window: This is the maximum distance between the current and predicted word within a sentence.
  4. alpha: This is the initial learning rate (will linearly drop to min_alpha as training progresses).
  5. seed: For the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread, to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires the use of the PYTHONHASHSEED environment variable to control hash randomization.)
  6. min_count: Ignore all words with a total frequency lower than this.
  7. max_vocab_size: Limit RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit (default).
  8. sample: Threshold for configuring which higher-frequency words are randomly downsampled; default is 1e-3, the useful range is (0, 1e-5).
  9. workers: Use this many worker threads to train the model (=faster training with multicore machines).
  10. hs: If 1, hierarchical softmax will be used for model training. If set to 0 (default), and negative is non-zero, negative sampling will be used.
  11. negative: If > 0, negative sampling will be used, the int for negative specifies how many noise words should be drawn (usually between 5-20). The default is 5. If set to 0, no negative sampling is used.
  12. cbow_mean: If 0, use the sum of the context word vectors. If 1 (default), use the mean. Only applies when cbow is used.
  13. hashfxn: Hash function to use to randomly initialize weights, for increased training reproducibility. The default is Python’s rudimentary built-in hash function.
  14. iter: Number of iterations (epochs) over the corpus. The default is 5.
  15. trim_rule: Vocabulary trimming rule specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT. Note: The rule, if given, is only used to prune vocabulary during build_vocab() and is not stored as part of the model.
  16. sorted_vocab: If 1 (default), sort the vocabulary by descending frequency before assigning word indexes.
  17. batch_words: Target size (in words) for batches of examples passed to worker threads (and thus cython routines). The default is 10000. (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)
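To tie a few of these parameters together, here is an illustrative (not prescriptive) configuration – the specific values, and the sentences variable, are placeholders rather than recommendations:

    # Hypothetical settings, for illustration only: skip-gram with negative
    # sampling, 300-dimensional vectors, a 5-word window, words occurring
    # fewer than 5 times dropped, 4 worker threads and 5 epochs.
    model = word2vec.Word2Vec(sentences, sg=1, size=300, window=5,
                              min_count=5, negative=5, workers=4, iter=5)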

 

We won’t be using or exploring all of these parameters in our examples, but it’s still important to have an idea of them – fine-tuning your model relies heavily on them. When training our model, we can use our own corpus or a more generic one – since we don’t wish to train on a particular topic or domain, we will use the Text8 corpus [10], which contains textual data extracted from Wikipedia. Be sure to download the data first – we do this by finding the link text8.zip under the Experimental Procedure section.

We will be more or less following the Jupyter notebook attached at the end of this chapter, which can also be found at the following link [11].

sentences = word2vec.Text8Corpus('text8')

model = word2vec.Word2Vec(sentences, size=200, hs=1)

Our model will use hierarchical softmax for training and will have 200 features. This means it has a hierarchical output and uses the softmax function in its final layers. The softmax function is a generalization of the logistic function that squashes a K-dimensional vector z of arbitrary real values to a K-dimensional vector of real values, where each entry is in the range (0, 1), and all the entries add up to 1. We don’t need to understand the mathematical foundation at this point, but if interested, links 1-3 go into more details about this.
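As a quick illustration of that squashing behaviour, here is a minimal numpy sketch of the softmax function (this is not gensim’s internal code):

    import numpy as np

    def softmax(z):
        # Subtract the maximum for numerical stability; the result is unchanged.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    z = np.array([2.0, 1.0, 0.1])
    print(softmax(z))        # every entry lies in (0, 1)
    print(softmax(z).sum())  # and the entries sum to 1.0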

Printing our model tells us this:

   print(model)

 -> Word2Vec(vocab=71290, size=200, alpha=0.025)

Now that we have our trained model, let’s give the famous King – Man + Woman example a try:

    model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)[0]

Here, we are adding king and woman (they are positive parameters) and subtracting man (it is a negative parameter), and picking only the top result from the returned list of (word, similarity) tuples.

-> (u'queen')

And voila! As we expected, queen is the closest word vector when we search for the words most similar to woman and king but dissimilar to man. Note that since this is a probabilistic training process, there is a slight chance you might get a different word – but one still relevant to the context of these words. For example, words like throne or empire might come up.

We can also use the most_similar_cosmul method – the gensim documentation [12] describes this as being slightly different from the traditional similarity function, instead using an implementation described by Omer Levy and Yoav Goldberg in their paper [13], Linguistic Regularities in Sparse and Explicit Word Representations. Positive words still contribute positively towards the similarity and negative words negatively, but with less susceptibility to one large distance dominating the calculation. For example:

model.wv.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])

 

-> [(u'queen', 0.8473771810531616),
 (u'matilda', 0.8126628994941711),
 (u'throne', 0.8048466444015503),
 (u'prince', 0.8044915795326233),
 (u'empress', 0.803791880607605),
 (u'consort', 0.8026778697967529),
 (u'dowager', 0.7984940409660339),
 (u'princess', 0.7976254224777222),
 (u'heir', 0.7949869632720947),
 (u'monarch', 0.7940317392349243)]

If we wish to look up the vector representation of a word, all we need to do is:

   model.wv['computer']


We won’t display the output here, but we can expect to see a 200-dimension array, which is what we specified as our size.
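A quick sanity check, using the model trained above, is to inspect the shape of the returned numpy array:

    vector = model.wv['computer']
    print(vector.shape)  # expect (200,), matching the size we chose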

If we wish to save our model to disk and re-use it again, we can do this using the save and load functionalities. This is particularly useful – we can save and re-train models, or further train on models adapted to a certain domain.

   model.save("text8_model")

   model = word2vec.Word2Vec.load("text8_model")

The magic of gensim is that it doesn’t just give us the ability to train a model – as we have seen so far, its API means we don’t have to worry much about the mathematical workings and can instead focus on using the full potential of these word vectors. Let us check out some other nifty functionalities the word2vec model offers:

Using word vectors, we can identify which word in a list is the farthest away from the other words. Gensim implements this functionality with the doesnt_match method, which we illustrate:

model.wv.doesnt_match("breakfast cereal dinner lunch".split())

-> 'cereal'

As expected, the one word which didn’t match the others on the list is picked out – here, it is cereal. We can also use the model to understand how similar or different words are in a corpus –

model.wv.similarity('woman', 'man')

-> 0.6416034158543054

 

model.wv.similarity('woman', 'cereal')

-> 0.04408454181286298

model.wv.distance('man', 'woman')

-> 0.35839658414569464

The results are quite self-explanatory in this case, and as expected, the words woman and cereal are not similar. Here, distance is merely 1 - similarity.
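We can check this relationship directly with the same model:

    # distance(a, b) is simply 1 - similarity(a, b)
    print(1 - model.wv.similarity('man', 'woman'))
    print(model.wv.distance('man', 'woman'))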

We can continue training our Word2Vec model using the train method – just remember to explicitly pass an epochs argument, as this is the suggested way to avoid common mistakes around the number of training passes the model performs. This Gensim notebook tutorial [14] walks one through how to perform online training with word2vec. Briefly, it requires building a new vocabulary and then running the train function again, as sketched below.
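A rough sketch of what that looks like, assuming more_sentences is a new iterable of tokenized sentences (the variable name is just a placeholder):

    # Add the new words to the existing vocabulary, then train further,
    # passing the number of examples and epochs explicitly.
    model.build_vocab(more_sentences, update=True)
    model.train(more_sentences, total_examples=model.corpus_count, epochs=5)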

Once we’re done training our model, it is recommended to use only the model’s keyed vectors. You might have noticed that so far we’ve been using the keyed vectors (which is simply a gensim class to store vectors) to perform most of our tasks – model.wv represents this. To free up some RAM, we can run:

     word_vectors = model.wv

     del model

We can now perform all the tasks we did before using the word vectors. Keep in mind that this applies not just to Word2Vec but to all word embeddings.

To evaluate how well our model has done, we can test it on the datasets that are included when we install gensim.

     model.wv.evaluate_word_pairs(os.path.join(module_path, 'test_data', 'wordsim353.tsv'))

-> ((0.6230957719715976, 3.90029813472169e-39),
 SpearmanrResult(correlation=0.645315618985209, pvalue=1.0038208415351643e-42),
 0.56657223796034)

Here, to make sure we find our file, we have to specify the module path – this is the path to the gensim/test folder, which is where the test files live (one way to construct it is sketched after the next snippet). We can also test our model on finding word pairs and relationships by running the following code.

model.wv.accuracy(os.path.join(module_path, 'test_data', 'questions-words.txt'))
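One way to construct module_path, assuming gensim is importable from the current environment (this is just one possible approach):

    import os
    import gensim

    # Point module_path at the gensim/test folder inside the installed package.
    module_path = os.path.join(os.path.dirname(gensim.__file__), 'test')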

In our examples so far, we used a model which we trained ourselves – this can be quite a time-consuming exercise, and it is handy to know how to load pre-trained vector models. Gensim provides an easy interface to load, for example, the original Google News trained word2vec model (you can download this file from link [9]).

     from gensim.models import KeyedVectors

     # load the google word2vec model

     filename = 'GoogleNews-vectors-negative300.bin'

     model = KeyedVectors.load_word2vec_format(filename, binary=True)

Our model now uses 300-dimensional word vectors, and we can run all the previous code examples again – the results won’t be too different, but we can expect a more sophisticated model.

Gensim also allows similar interfaces to download models using other word embeddings – we’ll go over this in the last section. We’re now equipped to train models, load models, and use these word embeddings to conduct experiments!

 

You have just read an excerpt from Packt’s book Natural Language Processing and Computational Linguistics, authored by Bhargav Srinivasa-Desikan.

If you want to know how to use natural language processing and computational linguistics algorithms to make inferences and gain insights from the data you have, this is the book for you.
