In this article a few simple applications of Markov chain are going to be discussed as a solution to a few text processing problems. These problems appeared as assignments in a few courses, the descriptions are taken straightaway from the courses themselves.
Use a Markov chain to create a statistical model of a piece of English text. Simulate the Markov chain to generate stylized pseudo-random text.
Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications including the genemark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.
Markov model of natural language. Shannon approximated the statistical structure of a piece of text using a simple mathematical model known as a Markov model. A Markov model of order 0 predicts that each letter in the alphabet occurs with a fixed probability. We can fit a Markov model of order 0 to a specific piece of text by counting the number of occurrences of each letter in that text, and using these frequencies as probabilities. For example, if the input text is "gagggagaggcgagaaa", the Markov model of order 0 predicts that each letter is 'a' with probability 7/17, 'c' with probability 1/17, and 'g' with probability 9/17 because these are the fraction of times each letter occurs. The following sequence of letters is a typical example generated from this model:
g a g g c g a g a a g a g a a g a a a g a g a g a g a a a g a g a a g ...
A Markov model of order 0 assumes that each letter is chosen independently. This independence does not coincide with statistical properties of English text because there a high correlation among successive letters in a word or sentence. For example, 'w' is more likely to be followed with 'e' than with 'u', while 'q' is more likely to be followed with 'u' than with 'e'.
We obtain a more refined model by allowing the probability of choosing each successive letter to depend on the preceding letter or letters. A Markov model of order k predicts that each letter occurs with a fixed probability, but that probability can depend on the previous k consecutive letters. Let a k-gram mean any k consecutive letters. Then for example, if the text has 100 occurrences of "th", with 60 occurrences of "the", 25 occurrences of "thi", 10 occurrences of "tha", and 5 occurrences of "tho", the Markov model of order 2 predicts that the next letter following the 2-gram "th" is 'e' with probability 3/5, 'i' with probability 1/4, 'a' with probability 1/10, and 'o' with probability 1/20.
A brute-force solution. Claude Shannon proposed a brute-force scheme to generate text according to a Markov model of order 1:
“ To construct [a Markov model of order 1], for example, one opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc. It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage. ”
Our task is to write a python program to automate this laborious task in a more efficient way — Shannon’s brute-force approach is prohibitively slow when the size of the input text is large.
Markov model data type. Create an immutable data type MarkovModel to represent a Markov model of order k from a given text string. The data type must implement the following API:
To avoid dead ends, treat the input text as a circular string: the last character is considered to precede the first character.
For example, if k = 2 and the text is the 17-character string "gagggagaggcgagaaa", then the salient features of the Markov model are captured in the table below:
frequency of probability that next char next char is kgram freq a c g a c g ---------------------------------------------- aa 2 1 0 1 1/2 0 1/2 ag 5 3 0 2 3/5 0 2/5 cg 1 1 0 0 1 0 0 ga 5 1 0 4 1/5 0 4/5 gc 1 0 0 1 0 0 1 gg 3 1 1 1 1/3 1/3 1/3 ---------------------------------------------- 17 7 1 9
Note that the frequency of "ag" is 5 (and not 4) because we are treating the string as circular.
A Markov chain is a stochastic process where the state change depends on only the current state. For text generation, the current state is a k-gram. The next character is selected at random, using the probabilities from the Markov model. For example, if the current state is "ga" in the Markov model of order 2 discussed above, then the next character is 'a' with probability 1/5 and 'g' with probability 4/5. The next state in the Markov chain is obtained by appending the new character to the end of the k-gram and discarding the first character. A trajectory through the Markov chain is a sequence of such states. Below is a possible trajectory consisting of 9 transitions.
trajectory: ga --> ag --> gg --> gc --> cg --> ga --> ag --> ga --> aa --> ag probability for a: 1/5 3/5 1/3 0 1 1/5 3/51/5 1/2 probability for c: 0 0 1/3 0 0 0 0 0 0 probability for g: 4/5 2/5 1/3 1 0 4/5 2/5 4/5 1/2
Treating the input text as a circular string ensures that the Markov chain never gets stuck in a state with no next characters.
To generate random text from a Markov model of order k, set the initial state to k characters from the input text. Then, simulate a trajectory through the Markov chain by performing T − k transitions, appending the random character selected at each step. For example, if k = 2 and T = 11, the following is a possible trajectory leading to the output gaggcgagaag.
trajectory: ga --> ag --> gg --> gc --> cg --> ga --> ag --> ga --> aa --> ag output: ga g g c g a g a a g
Text generation client. Implement a client program TextGenerator that takes two command-line integers k and T, reads the input text from standard input and builds a Markov model of order k from the input text; then, starting with the k-gram consisting of the first k letters of the input text, prints out T characters generated by simulating a trajectory through the corresponding Markov chain. We may assume that the text has length at least k, and also that T ≥ k.
import numpy as np from collections import defaultdict class MarkovModel: def __init__(self, text, k): ''' create a Markov model of order k from given text Assume that text has length at least k. ''' self.k = k self.tran = defaultdict(float) self.alph = list(set(list(text))) self.kgrams = defaultdict(int) n = len(text) text += text[:k] for i in range(n): self.tran[text[i:i+k],text[i+k]] += 1. self.kgrams[text[i:i+k]] += 1 def order(self): # order k of Markov model return self.k def freq(self, kgram): # number of occurrences of kgram in text assert len(kgram) == self.k # (check if kgram is of length k) return self.kgrams[kgram] def freq2(self, kgram, c): # number of times that character c follows kgram assert len(kgram) == self.k # (check if kgram is of length k) return self.tran[kgram,c] def rand(self, kgram): # random character following given kgram assert len(kgram) == self.k # (check if kgram is of length k. Z = sum([self.tran[kgram, alph] for alph in self.alph]) return np.random.choice(self.alph, 1, p=np.array([self.tran[kgram, alph] for alph in self.alph])/Z) def gen(self, kgram, T): # generate a String of length T characters assert len(kgram) == self.k # by simulating a trajectory through the corresponding str = '' # Markov chain. The first k characters of the newly for _ in range(T): # generated String should be the argument kgram. #print kgram, c # check if kgram is of length k. c = self.rand(kgram) # Assume that T is at least k. kgram = kgram[1:] + c str += c return str
m = MarkovModel('gagggagaggcgagaaa', 2)
generates the following MarkovChain where each state represents a 2-gram.
Microsoft said Tuesday the company would comply with a preliminary ruling by Federal District Court Judge Ronald H. Whyte that Microsoft is no longer able to use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developers Kit for Java.
“We remain confident that once all the facts are presented in the larger case, the court will find Microsoft to be in full compliance with its contract with Sun,” stated Tom Burt, Associate General Counsel for Microsoft Corporation. “We are disappointed with this decision, but we will immediately comply with the Court’s order.”
Microsoft has been in the forefront of helping developers use the Java programming language to write cutting-edge applications. The company has committed significant resources so that Java developers have the option of taking advantage of Windows features when writing software using the Java language. Providing the best tools and programming options will continue to be Microsoft’s goal.
“We will continue to listen to our customers and provide them the tools they need to write great software using the Java language,” added Tod Nielsen, General Manager for Microsoft’s Developer Relations Group/Platform Marketing.
Markov Model learnt
Generated output: random news item, using input as an order 7 model
Microsoft is no longer able to use the Java language,” added Tod Nielsen, General Counsel for Microsoft’s Developers use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developer Relations Group/Platform Marketing.Microsoft to be Microsoft Corporation. “We are disappointed with the Court’s order.”
Microsoft is no longer able to use the Java language. Providing the best tools and provide them the tools and programming options will continue to listen to our customers and provide them the tools and programming option of taking advantage of Windows features when writing software using the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software using the Java programming option of taking advantage of Windows features when writing software Developer Relations Group/Platform Marketing.Microsoft to be in full compliance with its contract with Sun,” stated Tom Burt, Associate General Manager for Microsoft’s goal.
“We will continue to listen to our customers and programming language. Providing the Java language,” added Tod Nielsen, General Manager for Microsoft is no longer able to use the Java language.
Imagine we receive a message where some of the characters have been corrupted by noise. We represent unknown characters by the ~ symbol (we assume we don’t use ~ in our messages). Add a method replaceUnknown that decodes a noisy message by replacing each ~ with the most likely character given our order k Markov model, and conditional on the surrounding text:
def replaceUnknown(corrupted) # replace unknown characters with most probable characters
Assume unknown characters are at least k characters apart and also appear at least k characters away from the start and end of the message. This maximum-likelihood approach doesn’t always get it perfect, but it fixes most of the missing characters correctly.
Here are some details on what it means to find the most likely replacement for each ~. For each unknown character, you should consider all possible replacement characters. We want the replacement character that makes sense not only at the unknown position (given the previous characters) but also when the replacement is used in the context of the k subsequent known characters. For example we expect the unknown character in "was ~he wo" to be 't' and not simply the most likely character in the context of "was ". You can compute the probability of each hypothesis by multiplying the probabilities of generating each of k+1 characters in sequence: the missing one, and the k next ones. The following figure illustrates how we want to consider k+1windows to maximize the log-likelihood:
Using the algorithm described above, here are the results obtained for the following example:
Original : it was the best of times, it was the worst of times.
Noisy : it w~s th~ bes~ of tim~s, i~ was ~he wo~st of~times.
Corrected (k=4): it was the best of times, it was the worst of times.
Corrected (k=2): it was the best of times, in was the wo st of times.
This problem appeared as an assignment in the Cornell course cs1114 .
In this assignment, we shall be implementing an authorship detector which, when
given a large sample size of text to train on, can then guess the author of an unknown
Our first step is to construct the transition matrix representing our Markov chain.
First, we must read the text from a sample file. We shall want to create a sparse array using the scipy csr sparse matrix. Along with the transition matrix, we shall be creating a corresponding vector that contains 2 word frequencies (normalized by the total number of words in the document (including repeated words)).
Once we have our transition matrix, we can calculate the likelihood of an unknown
sample of text. We are given several pieces of literature by various authors,
as well as excerpts from each of the authors as test dataset. Our goal is to identify the authors of each excerpt.
To do so, we shall need to calculate the likelihood of the excerpt occurring in each author’s transition matrix. Recall that each edge in the directed graph that the transition
matrix represents is the probability that the author follows a word with another.
Since we shall be multiplying numerous possibly small probabilities together, our
calculated likelihood will likely be extremely small. Thus, you should compare log(likelihood) instead. Keep in mind the possibility that the author may have used a word he has never used before. Our calculated likelihood should not eliminate an author completely because of this. We shall be imposing a high penalty if a word is missing.
Now that we can compute likelihoods, the next step is to write a routine that takes a set of transition matrices and dictionaries, and a sequence of text, and returns the index of the transition matrix that results in the highest likelihood. You will write this in a function classify text, which takes transition matrices, dictionaries, histograms, and the name of the file containing the test text, and returns a single integer best index. The following figure shows how to detect an author k (A_k) of the test text t_1..t_n using the transition matrix P_k with MLE :
from np import log
return 0 if x <= 0 else log(x)
def compute_text_likelihood(filename, T, dict_rev, histogram, index):
Compute the (log) likelihood L of a given string (in ‘filename’) given
a word transition probability T, dictionary ‘dict’, and histogram
text = word_tokenize(open(filename).read().replace(‘\n’, ‘ ‘).lower())
num_words = len(text)
text = [word for word in text if word in histogram] # keep only the words that are found in the training dataset
ll = log0(histogram[text]) – log0(sum(histogram.values()))
for i in range(1, len(text)):
ll += log0(T[dict_rev[text[i-1]], dict_rev[text[i]]])
return ll + (num_words – num_matches)*penalty
def classify_text(tmatrices, dict_revs, histograms, filename):
Return the index of the most likely author given the transition matrices, dictionaries, and a test file
for i in range(len(tmatrices)):
ll = compute_text_likelihood(filename, tmatrices[i], dict_revs[i], histograms[i], i)
print i, ll
The list of authors whose writings are there in the training dataset:
Author JANE AUSTEN (Texts taken from Emma, Mansfield Park, Pride and Prejudice)
Emma Woodhouse, handsome, clever, and rich, with a comfortable home and
happy disposition, seemed to unite some of the best blessings of
existence; and had lived nearly twenty-one years in the world with very
little to distress or vex her.
She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister’s marriage,
been mistress of his house from a very early period. Her mother had
died too long ago for her to have more than an indistinct remembrance
of her caresses; and her place had been supplied by an excellent woman
as governess, who had fallen little short of a mother in affection.
Sixteen years had Miss Taylor been in Mr. Woodhouse’s family, less as a
governess than a friend, very fond of both daughters, but particularly
of Emma. Between _them_ it was more the intimacy of sisters. Even
before Miss Taylor had ceased to hold the nominal office of governess,
the mildness of her temper had hardly allowed her to impose any
restraint; and the shadow of authority being now long passed away, they
had been living together as friend and friend very mutually attached,
and Emma doing just what she liked; highly esteeming Miss Taylor’s
judgment, but directed chiefly by her own.
The real evils, indeed, of Emma’s situation were the power of having
rather too much her own way, and a disposition to think a little too
well of herself; these were the disadvantages which threatened alloy to
her many enjoyments. The danger, however, was at present so
unperceived, that they did not by any means rank as misfortunes with
Author Lewis Carroll (Texts taken from Alice’s Adventures in Wonderland, Sylvie and Bruno)
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, ‘and what is the use of a book,’ thought Alice ‘without pictures or
So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.
There was nothing so VERY remarkable in that; nor did Alice think it so
VERY much out of the way to hear the Rabbit say to itself, ‘Oh dear!
Oh dear! I shall be late!’ (when she thought it over afterwards, it
occurred to her that she ought to have wondered at this, but at the time
it all seemed quite natural); but when the Rabbit actually TOOK A WATCH
OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,
Alice started to her feet, for it flashed across her mind that she had
never before seen a rabbit with either a waistcoat-pocket, or a watch
to take out of it, and burning with curiosity, she ran across the field
after it, and fortunately was just in time to see it pop down a large
rabbit-hole under the hedge.
In another moment down went Alice after it, never once considering how
in the world she was to get out again.
The rabbit-hole went straight on like a tunnel for some way, and then
dipped suddenly down, so suddenly that Alice had not a moment to think
about stopping herself before she found herself falling down a very deep
Author William Shakespeare (Texts taken from Henry IV Part 1, Romeo and Juliet, Twelfth Night)
So shaken as we are, so wan with care,
Find we a time for frighted peace to pant,
And breathe short-winded accents of new broils
To be commenced in strands afar remote.
No more the thirsty entrance of this soil
Shall daub her lips with her own children’s blood;
No more shall trenching war channel her fields,
Nor bruise her flowerets with the armed hoofs
Of hostile paces: those opposed eyes,
Which, like the meteors of a troubled heaven,
All of one nature, of one substance bred,
Did lately meet in the intestine shock
And furious close of civil butchery,
Shall now, in mutual well-beseeming ranks,
March all one way, and be no more opposed
Against acquaintance, kindred, and allies:
The edge of war, like an ill-sheathed knife,
No more shall cut his master. Therefore, friends,
As far as to the sepulchre of Christ–
Whose soldier now, under whose blessed cross
We are impressed and engaged to fight–
Forthwith a power of English shall we levy,
To chase these pagans in those holy fields
Over whose acres walk’d those blessed feet
Which fourteen hundred years ago were nail’d
For our advantage on the bitter cross.
But this our purpose now is twelvemonth old,
And bootless ’tis to tell you we will go:
Therefore we meet not now.–Then let me hear
Of you, my gentle cousin Westmoreland,
What yesternight our Council did decree
In forwarding this dear expedience.
My liege, this haste was hot in question,
And many limits of the charge set down
But yesternight; when, all athwart, there came
A post from Wales loaden with heavy news;
Whose worst was, that the noble Mortimer,
Leading the men of Herefordshire to fight
Against th’ irregular and wild Glendower,
Was by the rude hands of that Welshman taken;
A thousand of his people butchered,
Upon whose dead corpse’ there was such misuse,
Such beastly, shameless transformation,
By those Welshwomen done, as may not be
Without much shame re-told or spoken of.
Author MARK TWAIN (Texts taken from A Connecticut Yankee in King Arthur’s Court, The Adventures of Huckleberry Finn, The Prince and the Pauper)
I am an American. I was born and reared in Hartford, in the State
of Connecticut–anyway, just over the river, in the country. So
I am a Yankee of the Yankees–and practical; yes, and nearly
barren of sentiment, I suppose–or poetry, in other words. My
father was a blacksmith, my uncle was a horse doctor, and I was
both, along at first. Then I went over to the great arms factory
and learned my real trade; learned all there was to it; learned
to make everything: guns, revolvers, cannon, boilers, engines, all
sorts of labor-saving machinery. Why, I could make anything
a body wanted–anything in the world, it didn’t make any difference
what; and if there wasn’t any quick new-fangled way to make a thing,
I could invent one–and do it as easy as rolling off a log. I became
head superintendent; had a couple of thousand men under me.
Well, a man like that is a man that is full of fight–that goes
without saying. With a couple of thousand rough men under one,
one has plenty of that sort of amusement. I had, anyway. At last
I met my match, and I got my dose. It was during a misunderstanding
conducted with crowbars with a fellow we used to call Hercules.
He laid me out with a crusher alongside the head that made everything
crack, and seemed to spring every joint in my skull and made it
overlap its neighbor. Then the world went out in darkness, and
I didn’t feel anything more, and didn’t know anything at all
–at least for a while.
When I came to again, I was sitting under an oak tree, on the
grass, with a whole beautiful and broad country landscape all
to myself–nearly. Not entirely; for there was a fellow on a horse,
looking down at me–a fellow fresh out of a picture-book. He was
in old-time iron armor from head to heel, with a helmet on his
head the shape of a nail-keg with slits in it; and he had a shield,
and a sword, and a prodigious spear; and his horse had armor on,
too, and a steel horn projecting from his forehead, and gorgeous
red and green silk trappings that hung down all around him like
a bedquilt, nearly to the ground.
“Fair sir, will ye just?” said this fellow.
“Will I which?”
“Will ye try a passage of arms for land or lady or for–”
“What are you giving me?” I said. “Get along back to your circus,
or I’ll report you.”
Now what does this man do but fall back a couple of hundred yards
and then come rushing at me as hard as he could tear, with his
nail-keg bent down nearly to his horse’s neck and his long spear
pointed straight ahead. I saw he meant business, so I was up
the tree when he arrived.
Each of the Markov models learnt from the training texts for each author are used to compute the log-likelihood of the unknown test text, the author with the maximum log-likelihood is chosen to be the likely author of the text.
Against the interest of her own individual comfort, Mrs. Dashwood had
determined that it would be better for Marianne to be any where, at
that time, than at Barton, where every thing within her view would be
bringing back the past in the strongest and most afflicting manner, by
constantly placing Willoughby before her, such as she had always seen
him there. She recommended it to her daughters, therefore, by all
means not to shorten their visit to Mrs. Jennings; the length of which,
though never exactly fixed, had been expected by all to comprise at
least five or six weeks. A variety of occupations, of objects, and of
company, which could not be procured at Barton, would be inevitable
there, and might yet, she hoped, cheat Marianne, at times, into some
interest beyond herself, and even into some amusement, much as the
ideas of both might now be spurned by her.
From all danger of seeing Willoughby again, her mother considered her
to be at least equally safe in town as in the country, since his
acquaintance must now be dropped by all who called themselves her
friends. Design could never bring them in each other’s way: negligence
could never leave them exposed to a surprise; and chance had less in
its favour in the crowd of London than even in the retirement of
Barton, where it might force him before her while paying that visit at
Allenham on his marriage, which Mrs. Dashwood, from foreseeing at first
as a probable event, had brought herself to expect as a certain one.
She had yet another reason for wishing her children to remain where
they were; a letter from her son-in-law had told her that he and his
wife were to be in town before the middle of February, and she judged
it right that they should sometimes see their brother.
Marianne had promised to be guided by her mother’s opinion, and she
submitted to it therefore without opposition, though it proved
perfectly different from what she wished and expected, though she felt
it to be entirely wrong, formed on mistaken grounds, and that by
requiring her longer continuance in London it deprived her of the only
possible alleviation of her wretchedness, the personal sympathy of her
mother, and doomed her to such society and such scenes as must prevent
her ever knowing a moment’s rest.
But it was a matter of great consolation to her, that what brought evil
to herself would bring good to her sister; and Elinor, on the other
hand, suspecting that it would not be in her power to avoid Edward
entirely, comforted herself by thinking, that though their longer stay
would therefore militate against her own happiness, it would be better
for Marianne than an immediate return into Devonshire.
Her carefulness in guarding her sister from ever hearing Willoughby’s
name mentioned, was not thrown away. Marianne, though without knowing
it herself, reaped all its advantage; for neither Mrs. Jennings, nor
Sir John, nor even Mrs. Palmer herself, ever spoke of him before her.
Elinor wished that the same forbearance could have extended towards
herself, but that was impossible, and she was obliged to listen day
after day to the indignation of them all.
log-likelihood values computed for the probable authors
As can be seen from above the maximum likelihood value corresponds to the author 0, i.e., Austen. Hence, the most probable author of the unknown text is Austen.
Then he tossed the marble away pettishly, and stood cogitating. The
truth was, that a superstition of his had failed, here, which he and
all his comrades had always looked upon as infallible. If you buried a
marble with certain necessary incantations, and left it alone a
fortnight, and then opened the place with the incantation he had just
used, you would find that all the marbles you had ever lost had
gathered themselves together there, meantime, no matter how widely they
had been separated. But now, this thing had actually and unquestionably
failed. Tom’s whole structure of faith was shaken to its foundations.
He had many a time heard of this thing succeeding but never of its
failing before. It did not occur to him that he had tried it several
times before, himself, but could never find the hiding-places
afterward. He puzzled over the matter some time, and finally decided
that some witch had interfered and broken the charm. He thought he
would satisfy himself on that point; so he searched around till he
found a small sandy spot with a little funnel-shaped depression in it.
He laid himself down and put his mouth close to this depression and
“Doodle-bug, doodle-bug, tell me what I want to know! Doodle-bug,
doodle-bug, tell me what I want to know!”
The sand began to work, and presently a small black bug appeared for a
second and then darted under again in a fright.
“He dasn’t tell! So it WAS a witch that done it. I just knowed it.”
He well knew the futility of trying to contend against witches, so he
gave up discouraged. But it occurred to him that he might as well have
the marble he had just thrown away, and therefore he went and made a
patient search for it. But he could not find it. Now he went back to
his treasure-house and carefully placed himself just as he had been
standing when he tossed the marble away; then he took another marble
from his pocket and tossed it in the same way, saying:
“Brother, go find your brother!”
He watched where it stopped, and went there and looked. But it must
have fallen short or gone too far; so he tried twice more. The last
repetition was successful. The two marbles lay within a foot of each
log-likelihood values computed for the probable authors
As can be seen from above the maximum likelihood value corresponds to the author 7, i.e., Twain. Hence, the most probable author of the unknown text is Twain. The following figure shows the relevant states corresponding to the Markov model for Twain trained from the training dataset.