This problem appeared as an assignment in the **coursera course Natural Language Processing (by Stanford) in 2012**. The following description of the problem is taken directly from the assignment description.

In this article, a **probabilistic parser** will be built by implementing the **CKY parser**, using the *Manually Annotated Sub-Corpus* (**MASC**) from the American National Corpus (ANC).

First, we need to learn a **PCFG** from the *training trees*. Since the *training set* is hand-parsed, this learning is very easy. We simply need to set:

**P(N_j → ζ) = C(N_j → ζ) / Σ_γ C(N_j → γ)**

where **C(N_j → ζ)** is the *count* observed for that *production* in the data set. While we could consider *smoothing rule rewrite probabilities*, it is sufficient to just work with *un-smoothed* **MLE probabilities** for *rules*. (Doing anything else makes things rather more complex and slow, since every rewrite would have a nonzero probability, so let's get things working with an un-smoothed grammar before considering adding smoothing!)
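The relative-frequency estimate above can be sketched in a few lines of Python. This is an illustrative sketch (the assignment itself uses Java classes such as Grammar and Lexicon); productions are represented here as simple (lhs, rhs) pairs.

```python
from collections import defaultdict

def learn_pcfg(productions):
    """Estimate un-smoothed MLE rule probabilities from a list of
    (lhs, rhs) productions read off hand-parsed training trees."""
    rule_counts = defaultdict(int)   # C(N_j -> zeta)
    lhs_counts = defaultdict(int)    # C(N_j) = sum over all rewrites of N_j
    for lhs, rhs in productions:
        rule_counts[(lhs, rhs)] += 1
        lhs_counts[lhs] += 1
    # P(N_j -> zeta) = C(N_j -> zeta) / sum_gamma C(N_j -> gamma)
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}

probs = learn_pcfg([("S", ("NP", "VP")), ("NP", ("cats",)),
                    ("NP", ("claws",)), ("VP", ("V", "NP"))])
print(probs[("NP", ("cats",))])  # 0.5 (NP rewrites as "cats" 1 time out of 2)
```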

At the very beginning, all the **train** and **test** **trees** are read in. The training trees are going to be used to construct a **PCFG** parser, by learning the **PCFG grammar**. The parser is then used to *predict* trees for the sentences in the **test set**. For each *test sentence*, the parse given by the parser is *evaluated* by comparing the constituents it generates with the constituents in the hand-parsed version. From this, *precision*, *recall* and the **F1 score** are calculated.
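The constituent-based evaluation just described can be sketched as follows. This is an illustrative Python sketch, not the assignment's evaluator: trees are represented as nested tuples (label, child, ...) with string leaves, and a constituent is a (label, start, end) span.

```python
def spans(tree, start=0):
    """Return (set of (label, i, j) constituents, span length) for a
    nested-tuple tree with string leaves."""
    if isinstance(tree, str):          # a terminal word covers one position
        return set(), 1
    out, length = set(), 0
    for child in tree[1:]:
        child_spans, n = spans(child, start + length)
        out |= child_spans
        length += n
    out.add((tree[0], start, start + length))
    return out, length

def prf1(gold, guess):
    """Constituent precision, recall and F1 of a guessed parse
    against the hand-parsed (gold) tree."""
    g, _ = spans(gold)
    p, _ = spans(guess)
    correct = len(g & p)
    precision = correct / len(p)
    recall = correct / len(g)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```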

There are the following basic data structures:

• ling.Tree: CFG tree structures (ling.Trees.PennTreeRenderer)

• Lexicon: *Pre-terminal productions* and probabilities

• Grammar, UnaryRule, BinaryRule: CFG rules and accessors

Most parsers require grammars in which the rules are *at most binary* branching. Hence, first we need to **binarize** the trees and then construct a **Grammar** out of them using **MLE**.
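A lossless binarization can be sketched as below (illustrative Python on nested-tuple trees; the '@'-prefixed intermediate symbol is one common convention and can be collapsed afterwards to recover the original n-ary tree):

```python
def binarize(tree):
    """Binarize a nested-tuple tree (label, child, ...): rules with
    more than two children are folded under intermediate '@'-symbols,
    so every production becomes at most binary branching."""
    if isinstance(tree, str):
        return tree
    label, children = tree[0], [binarize(c) for c in tree[1:]]
    while len(children) > 2:
        # fold the rightmost two children under an intermediate symbol
        children[-2:] = [("@" + label, children[-2], children[-1])]
    return (label,) + tuple(children)
```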

The first job is to build a **CKY parser** using this **PCFG grammar** learnt. Scan through a few of the training trees in the MASC dataset to get a sense of the range of inputs. Something worth noticing is that the grammar has relatively few **non-terminal** symbols but **thousands of rules**, many ternary-branching or longer. Currently there are **38 MASC train** files and **11 test** files.

Once we have a parser working on the treebank, the next task is to improve upon the supplied grammar by adding 1st / 2nd-order **vertical markovization**. This means using parent annotation symbols like NP^S to indicate a subject noun phrase instead of just NP.
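First-order parent annotation is a simple tree transform applied to the training trees before learning the grammar. A minimal sketch (illustrative Python on nested-tuple trees; here pre-terminal POS tags are left unannotated, which is one common choice):

```python
def annotate_parents(tree, parent="TOP"):
    """1st-order vertical Markovization: rewrite each non-terminal to
    carry its parent's label, e.g. a VP under S becomes VP^S."""
    if isinstance(tree, str):
        return tree
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return tree  # pre-terminal: keep the bare POS tag
    new_label = label + "^" + parent
    return (new_label,) + tuple(annotate_parents(c, label) for c in children)

t = ("S", ("NP", "cats"), ("VP", ("V", "scratch"), ("NP", "walls")))
print(annotate_parents(t))
# ('S^TOP', ('NP', 'cats'), ('VP^S', ('V', 'scratch'), ('NP', 'walls')))
```

At decoding time the annotations are stripped from the guessed tree before evaluating against the gold parse.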

The following figure shows the **CKY** algorithm to compute the *best possible* (**most probable**) parse tree for a sentence using the **PCFG grammar** learnt from the training dataset.
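The chart-filling core of the algorithm can be sketched as below. This is an illustrative Python sketch under simplifying assumptions (binary rules only; the unary-rule closure that a full implementation needs is omitted for brevity); cell values are (log-probability, backpointer) pairs from which the best tree can be reconstructed.

```python
import math

def cky(words, lexicon, binary_rules):
    """Probabilistic CKY chart sketch.
    lexicon: word -> list of (tag, prob);
    binary_rules: (B, C) -> list of (A, prob) for rules A -> B C.
    Returns chart[(i, j)]: nonterminal -> (log-prob, backpointer)."""
    n = len(words)
    chart = {}
    # base case: fill the diagonal from the lexicon
    for i, w in enumerate(words):
        chart[(i, i + 1)] = {tag: (math.log(p), w) for tag, p in lexicon[w]}
    # fill longer spans bottom-up, trying every split point k
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):
                for B, (bp, _) in chart[(i, k)].items():
                    for C, (cp, _) in chart[(k, j)].items():
                        for A, p in binary_rules.get((B, C), []):
                            score = math.log(p) + bp + cp
                            if A not in cell or score > cell[A][0]:
                                cell[A] = (score, (k, B, C))
            chart[(i, j)] = cell
    return chart
```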

The following animation (prepared from the lecture slides of the same course) shows how the chart for **CKY** is constructed using *dynamic programming*.

For this assignment we will use your **average F1 score** to evaluate the correctness of the **CKY parser**, although in essence we ought to know from the output on the *development set* (**devtest**) whether the parser is implemented correctly. The following figure shows an example **evaluation:**

(1) First let’s use a **toy minimal training dataset** containing just **3 POS-tagged trees**, and a **dev/test dataset** with a single **test sentence** (with **ground truth**), to start with. The following figure shows all the *training trees*. There are just enough productions in the training set for the test sentence to have an *ambiguity* due to **PP-attachment**.

The following figure shows the **PCFG** learnt from these training trees. Now let’s try to **parse** a single *test sentence* **‘cats scratch walls with claws’** with the **CKY** parser, using the **PCFG** grammar learnt. The following figure shows the manual (**gold**) parse tree and the **best (most probable)** parse tree found by the **CKY** *dynamic programming algorithm*.

(2) Now, let’s use the much **larger training dataset MASC** (with a total of **3595** *annotated training trees*), a few of which are shown in the next figure.

Let’s use all these **3595** *POS-tagged training trees* to learn the **PCFG grammar**.

There are **~10k** *lexicon rules* producing terminals (with *non-zero probabilities*) learnt; some of them are shown below:

There are **~4.5k** *binary rules* (with *non-zero probabilities*) learnt; some of them are shown below:

There are **~3.5k** *unary rules* (with *non-zero probabilities*) learnt; some of them are shown below:

The independence assumptions of a **PCFG** are often too strong. In order to capture the dependency on the parent non-terminal in a tree, the grammar rules can be re-written depending on the *parent non-terminal*, i.e., with **vertical Markovization**.

With vertical Markovization there are **~14k** *lexicon rules* learnt.

The following figure shows the **best parse trees** obtained with **CKY** for a sentence using the **PCFG** learnt *with* and *without* **vertical Markovization** from the **minimal dataset**.

Similarly, using the **MASC dataset**, as can be seen for the following particular *test sentence*, the **CKY** parser performs much better with the **PCFG** learnt from the **vertically Markovized** training trees.

A few more parse trees guessed by the **CKY** parser using the **PCFG** learnt from the **vertically Markovized** training trees are shown below:

The **markdown** file can be found here.
