This article describes the problem of learning word representations with a neural network from scratch. The problem appeared as an assignment in the Coursera course Neural Networks for Machine Learning, taught by Prof. Geoffrey Hinton of the University of Toronto in 2012.
We will design a neural net language model that learns to predict the next word given the previous three words. The network consists of a word embedding layer, a logistic hidden layer and a softmax output layer.
The vocabulary contains 250 words and is stored in data.mat:
load data.mat
data.vocab
ans =
{
[1,1] = all
[1,2] = set
[1,3] = just
[1,4] = show
[1,5] = being
[1,6] = money
[1,7] = over
[1,8] = both
[1,9] = years
[1,10] = four
[1,11] = through
[1,12] = during
[1,13] = go
[1,14] = still
[1,15] = children
[1,16] = before
[1,17] = police
[1,18] = office
[1,19] = million
[1,20] = also
.
.
[1,246] = so
[1,247] = time
[1,248] = five
[1,249] = the
[1,250] = left
}
The first few lines of the raw sentences file:
No , he says now .
And what did he do ?
The money ‘s there .
That was less than a year ago .
But he made only the first .
There ‘s still time for them to do it .
But he should nt have .
They have to come down to the people .
I do nt know where that is .
No , I would nt .
Who Will It Be ?
And no , I was not the one .
You could do a Where are they now ?
There ‘s no place like it that I know of .
Be here now , and so on .
It ‘s not you or him , it ‘s both of you .
So it ‘s not going to get in my way .
When it ‘s time to go , it ‘s time to go .
No one ‘s going to do any of it for us .
Well , I want more .
Will they make it ?
Who to take into school or not take into school ?
But it ‘s about to get one just the same .
We all have it .

load data.mat
[train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);
% 3-gram features for a training data tuple
train_x(:,13,14)
%ans =
%46
%58
%32
data.vocab{train_x(:,13,14)}
%ans = now
%ans = where
%ans = do
% target for the same data tuple from the training dataset
train_t(:,13,14)
%ans = 91
data.vocab{train_t(:,13,14)}
%ans = we
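The tuples above pair three input word indices with one target index. As a rough illustration of how such tuples could be built from a raw sentence, here is a minimal sketch. It assumes a toy six-word vocabulary, lowercased tokens and a flat 2-D layout; none of this is taken from the course's `load_data`, which returns 3-D arrays split into mini-batches.

```octave
% Toy vocabulary (an assumption for illustration; the real one has 250 words).
vocab = {'no', ',', 'he', 'says', 'now', '.'};
sentence = 'No , he says now .';

% Lowercase the sentence and split it on spaces to get the tokens.
tokens = strsplit(lower(sentence), ' ');

% Map each token to its index in the vocabulary.
ids = zeros(1, numel(tokens));
for n = 1:numel(tokens)
  ids(n) = find(strcmp(vocab, tokens{n}));
end

% Slide a window of 4 words over the sentence:
% the first 3 words are the input, the 4th word is the target.
num_examples = numel(ids) - 3;
inputs  = zeros(3, num_examples);
targets = zeros(1, num_examples);
for n = 1:num_examples
  inputs(:, n) = ids(n:n+2)';
  targets(n)   = ids(n+3);
end

disp(inputs);   % one column per training example
disp(targets);  % index of the word that follows each column
```

Each column of `inputs` plays the role of `train_x(:,i,j)` and the matching entry of `targets` plays the role of `train_t(:,i,j)`.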

function [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
  fprop(input_batch, word_embedding_weights, embed_to_hid_weights, ...
        hid_to_output_weights, hid_bias, output_bias)
% This method forward propagates through a neural network.
% Inputs:
%   input_batch: The input data as a matrix of size numwords X batchsize where,
%     numwords is the number of words, batchsize is the number of data points.
%     So, if input_batch(i, j) = k then the ith word in data point j is word
%     index k of the vocabulary.
%
%   word_embedding_weights: Word embedding as a matrix of size
%     vocab_size X numhid1, where vocab_size is the size of the vocabulary and
%     numhid1 is the dimensionality of the embedding space.
%
%   embed_to_hid_weights: Weights between the word embedding layer and hidden
%     layer as a matrix of size numhid1*numwords X numhid2, numhid2 is the
%     number of hidden units.
%
%   hid_to_output_weights: Weights between the hidden layer and output softmax
%     unit as a matrix of size numhid2 X vocab_size
%
%   hid_bias: Bias of the hidden layer as a matrix of size numhid2 X 1.
%
%   output_bias: Bias of the output layer as a matrix of size vocab_size X 1.
%
% Outputs:
%   embedding_layer_state: State of units in the embedding layer as a matrix of
%     size numhid1*numwords X batchsize
%
%   hidden_layer_state: State of units in the hidden layer as a matrix of size
%     numhid2 X batchsize
%
%   output_layer_state: State of units in the output layer as a matrix of size
%     vocab_size X batchsize
%

[numwords, batchsize] = size(input_batch);
[vocab_size, numhid1] = size(word_embedding_weights);
numhid2 = size(embed_to_hid_weights, 2);

%% COMPUTE STATE OF WORD EMBEDDING LAYER.
% Look up the input word indices in the word_embedding_weights matrix.
embedding_layer_state = reshape(...
  word_embedding_weights(reshape(input_batch, 1, []), :)', ...
  numhid1 * numwords, []);

%% COMPUTE STATE OF HIDDEN LAYER.
% Compute inputs to hidden units.
inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
  repmat(hid_bias, 1, batchsize);

% Apply logistic activation function.
hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units));

%% COMPUTE STATE OF OUTPUT LAYER.
% Compute inputs to softmax.
inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + ...
  repmat(output_bias, 1, batchsize);

% Subtract maximum.
% Remember that adding or subtracting the same constant from each input to a
% softmax unit does not affect the outputs. Here we are subtracting maximum to
% make all inputs <= 0. This prevents overflows when computing their
% exponents.
inputs_to_softmax = inputs_to_softmax ...
  - repmat(max(inputs_to_softmax), vocab_size, 1);

% Compute exp.
output_layer_state = exp(inputs_to_softmax);

% Normalize to get probability distribution.
output_layer_state = output_layer_state ./ repmat(...
  sum(output_layer_state, 1), vocab_size, 1);
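To get a feel for the shapes involved, the forward pass above can be traced on a toy problem. The sketch below repeats the same three steps inline; the tiny layer sizes, the hand-picked embedding matrix and the random weights are all assumptions chosen for illustration, not values from the assignment. It also checks the claim that shifting every softmax input by the same constant leaves the outputs unchanged.

```octave
% Hypothetical tiny sizes, chosen only so the matrices are easy to inspect.
vocab_size = 5; numhid1 = 2; numhid2 = 3; numwords = 3; batchsize = 4;

% Row k of the embedding matrix is [k, k+5]/10, so lookups are easy to verify.
word_embedding_weights = reshape(1:vocab_size*numhid1, vocab_size, numhid1) / 10;
embed_to_hid_weights   = 0.1 * randn(numhid1 * numwords, numhid2);
hid_to_output_weights  = 0.1 * randn(numhid2, vocab_size);
hid_bias    = zeros(numhid2, 1);
output_bias = zeros(vocab_size, 1);

% numwords x batchsize matrix of word indices (one 3-gram per column).
input_batch = [1 2 3 4;
               2 3 4 5;
               3 4 5 1];

% Embedding layer: stack the three looked-up word vectors for each data point.
embedding_layer_state = reshape( ...
  word_embedding_weights(reshape(input_batch, 1, []), :)', ...
  numhid1 * numwords, []);

% Hidden layer: logistic units.
hidden_in = embed_to_hid_weights' * embedding_layer_state ...
            + repmat(hid_bias, 1, batchsize);
hidden_layer_state = 1 ./ (1 + exp(-hidden_in));

% Output layer: softmax with the column maximum subtracted for stability.
z = hid_to_output_weights' * hidden_layer_state ...
    + repmat(output_bias, 1, batchsize);
z = z - repmat(max(z), vocab_size, 1);
output_layer_state = exp(z) ./ repmat(sum(exp(z), 1), vocab_size, 1);

% Shift every input by 100 and recompute: the softmax output should not change.
shifted = z + 100;
shifted = shifted - repmat(max(shifted), vocab_size, 1);
softmax_shifted = exp(shifted) ./ repmat(sum(exp(shifted), 1), vocab_size, 1);

disp(size(embedding_layer_state));   % numhid1*numwords x batchsize
disp(sum(output_layer_state, 1));    % each column sums to 1
disp(max(abs(output_layer_state(:) - softmax_shifted(:))));
```

The first column of `embedding_layer_state` is simply the embeddings of words 1, 2 and 3 stacked on top of each other, which is what the double `reshape` in `fprop` achieves for a whole batch at once.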
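To generate a piece of pseudorandom text, start from three seed words, predict the most likely candidates for the next word from the last three words, pick one of the candidates at random, append it to the text and slide the 3-gram window forward by one word. Repeat until a period is generated or the text reaches a maximum length.
Here is the code that by default generates the top 3 predictions for each 3-gram sliding window and chooses one of the predicted words randomly: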

function gen_rand_text(words, model, k)
  if nargin < 3
    k = 3;  % by default choose among the top 3 predictions
  end
  probs = [];
  word = '';
  i = 4;
  % Stop after 20 words or as soon as a period is generated.
  while (i < 20 && !strcmp(word, '.'))
    [word, prob] = predict_next_word(words{i-3}, words{i-2}, words{i-1}, model, k);
    words = {words{:}, word};
    probs = [probs; prob];
    i = i + 1;
  end
  fprintf(1, "%s ", words{:});
  fprintf(1, '\n');
  fprintf(1, "%.2f ", round(probs .* 100) ./ 100);
  fprintf(1, '\n');
end
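The body of `predict_next_word` is not shown in this article. As a rough sketch of the selection step it is described as performing, the snippet below takes a made-up probability vector over a toy vocabulary, keeps the k most probable words and picks one of them. Choosing uniformly among the k candidates is an assumption; the actual function may weight the choice differently.

```octave
% Hypothetical output distribution over a 6-word toy vocabulary.
vocab = {'the', 'of', 'to', 'and', 'in', '.'};
probs = [0.40 0.05 0.25 0.10 0.15 0.05];
k = 3;

% Sort the probabilities in descending order and keep the k most likely words.
[sorted_probs, order] = sort(probs, 'descend');
top_idx = order(1:k);

% Choose one of the k candidates uniformly at random (an assumption).
pick = top_idx(randi(k));
word = vocab{pick};
prob = probs(pick);

fprintf('candidates: %s\n', strjoin(vocab(top_idx), ', '));
fprintf('chosen: %s (p = %.2f)\n', word, prob);
```

Restricting the choice to the top k words keeps the generated text plausible, while the random pick stops the generator from producing the same sentence on every run.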
Starting with the words 'i was going', here are some texts that were generated using the model:
Starting with the words 'life in new', here is a piece of text that was generated using the model:
The next code shows the results of a few word-analogy example problems and the solutions found using the distributed representation space. Although the dataset was quite small and the vocabulary had only 250 words, the algorithm found sensible answers for the examples shown.

analogy('year', 'years', 'day', model); % singular-plural relation
%year:years::day:days
%dist_E('year','years')=1.119368, dist_E('day', 'days')= 1.169186
analogy('on', 'off', 'new', model) % antonyms relation
%on:off::new:old
%dist_E('on','off')=2.013958, dist_E('new','old')=2.265665
analogy('use', 'used', 'do', model) % present-past relation
%use:used::do:did
%dist_E('use','used')=2.556175, dist_E('do','did')=2.456098
analogy('he', 'his', 'they', model) % pronoun relations
%he:his::they:their
%dist_E('he','his')=3.824808, dist_E('they','their')=3.825453
analogy('today', 'yesterday', 'now', model)
%today:yesterday::now:then
%dist_E('today','yesterday')=1.045192, dist_E('now','then')=1.220935
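The `analogy` function itself is not listed here. One common way to solve a:b::c:? in an embedding space is to add the offset from a to b onto c and return the nearest remaining word, and the sketch below illustrates that idea. It assumes a hand-made 2-D embedding, and it treats `dist_E` as plain Euclidean distance between embedding vectors; both are assumptions, not the model's learned values or the author's implementation.

```octave
% Hand-made toy embedding, one row per word (not learned values).
words = {'year', 'years', 'day', 'days', 'on'};
E = [1 0;
     1 1;
     3 0;
     3 1;
     0 5];

a = find(strcmp(words, 'year'));
b = find(strcmp(words, 'years'));
c = find(strcmp(words, 'day'));

% Apply the offset from a to b to the word c.
target = E(b, :) - E(a, :) + E(c, :);

% Euclidean distance from every word to the target point.
dists = sqrt(sum((E - repmat(target, size(E, 1), 1)) .^ 2, 2));

% Exclude the three query words, then take the nearest remaining word.
dists([a b c]) = Inf;
[best_dist, d] = min(dists);

fprintf('%s:%s::%s:%s\n', words{a}, words{b}, words{c}, words{d});

% Distances within each pair, comparable to the dist_E values reported above.
dist_ab = norm(E(a, :) - E(b, :));
dist_cd = norm(E(c, :) - E(d, :));
fprintf('dist(a,b) = %f, dist(c,d) = %f\n', dist_ab, dist_cd);
```

When the relation is encoded as a roughly constant offset, the two within-pair distances come out nearly equal, which matches the pattern in the reported `dist_E` values.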