Currently, many of us are overwhelmed by the mighty power of Deep Learning. We start to forget about humble graphical models. CRF is not as trendy as LSTM, but it is robust, reliable and worth noting.
In this post, you will find a short summary of CRF (aka Conditional Random Fields): what it is, what it is for, and some interesting facts. Enjoy!
Linear-chain CRF is good for various segmentation and sequence tagging tasks:
The time complexity of the training process is quite high:
In practical implementations, the computation time is often even larger due to many other operations such as numerical scaling, smoothing, etc.
The time complexity of the inference process is much better when the Viterbi algorithm is used:
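To make the training cost and the "numerical scaling" remark more concrete, here is a minimal sketch of the forward algorithm in log space, which computes the log of the normalizer Z for one sequence. It assumes the scores are already given as plain nested lists; the names `log_sum_exp`, `log_partition`, `emissions` and `transitions` are illustrative and do not come from any particular library. For a sequence of T tokens and L labels this pass costs O(T·L²), and gradient-based training has to repeat a computation of this kind for every training sequence on every iteration.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emissions, transitions):
    """Forward algorithm in log space for a linear-chain model.

    emissions[t][j]   -- score of label j at position t (T x L)
    transitions[i][j] -- score of moving from label i to label j (L x L)
    Returns log Z, the log of the sum over all L**T label sequences.
    """
    num_labels = len(transitions)
    # alpha[j] = log-sum of scores of all prefixes that end in label j
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][j]
            + log_sum_exp([alpha[i] + transitions[i][j] for i in range(num_labels)])
            for j in range(num_labels)
        ]
    return log_sum_exp(alpha)
```

Working in log space is one common way to avoid underflow on long sequences; rescaling the forward variables at each step is another.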
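As a rough illustration, Viterbi decoding for a linear chain can be sketched as follows. This is a toy version that assumes the per-token label scores and the label-to-label transition scores are already computed and passed in as nested lists; the function and argument names are made up for this example. For T tokens and L labels the double loop does O(T·L²) work.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence and its score.

    emissions[t][j]   -- score of label j at position t (T x L)
    transitions[i][j] -- score of moving from label i to label j (L x L)
    """
    num_labels = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []           # backpointers[t-1][j] = best previous label
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_labels):
            best_i = max(range(num_labels),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label to the start.
    last = max(range(num_labels), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]
```

With strongly negative scores for switching labels, for example `transitions = [[0, -5], [-5, 0]]`, the decoder prefers to stay in one label even when a single token's emission score points the other way.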
There are not many of them.
I think the most suitable is pyStruct.
Also, I can name CRFsuite, a fast implementation of Conditional Random Fields written in C++, but with a decent Python wrapper.
As CRF is a supervised machine learning algorithm, you need a large enough training sample to train it. If you have such a sample and choose features wisely, you may theoretically obtain a quality of around 0.6-0.7 (F1-measure).
Unfortunately, research shows that in reality you are unlikely to get a quality higher than 0.5 (F1-measure). For example, some Indian researchers used CRF to extract key words from medical texts; they had good features and a large enough training sample, but they obtained a quality of no more than 0.4 (F1-measure). That’s sad.
CRF is significantly better at coping with the NER task.
For example, researchers from HSE and SPSU presented a paper in which they obtained an NER quality of about 0.9 (F-measure) on a test set, with a training set of no more than 70 000 examples. On real data they would hardly obtain such quality: Stanford NER shows a quality of no more than 0.81 (F-measure), even though it has perfectly selected training features and was trained on larger corpora (CoNLL, MUC-6, MUC-7 and ACE).
Some Spanish and Russian researchers compared HMM and CRF on the NER task for medical texts using the JNLPBA corpus (18546 sentences with 109588 named entities). They obtained interesting results: HMM had higher recall (+4-7% depending on the type of entity), while CRF had higher precision (+4-13% depending on the type of entity). The average F-measure on cross-validation was about 0.65 for HMM and 0.69 for CRF. The authors suggested that the quality of NER could have been 5-10% higher had they used a hybrid HMM+CRF algorithm.
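To give an idea of what "choosing features" looks like in practice, here is a small sketch that turns each token into a dictionary of features. The one-dict-per-token layout is a common convention among Python CRF wrappers, but that is an assumption here, so check what your library expects. The feature names and the helpers `token_features` and `sentence_features` are invented for this example, not taken from any of the papers mentioned.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i (illustrative features)."""
    word = tokens[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "suffix3": word[-3:],        # crude morphology hint
        "is_title": word.istitle(),  # capitalized like a name
        "is_digit": word.isdigit(),
    }
    # Context features: the neighbouring words, or sentence-boundary flags.
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence_features(tokens):
    """One feature dict per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

Calling `sentence_features(["Conditional", "random", "fields"])` gives three dicts, each pairing the token's own properties with its left and right neighbours.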
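Since the comparison above hinges on precision, recall and F-measure, here is a minimal sketch of how these are usually computed at the entity level. It assumes each entity is represented as a `(start, end, type)` tuple and counts a prediction as correct only on an exact match; the function name and the tuple layout are illustrative, and the sample data in the usage note is a toy example, not taken from any of the papers.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall and F1 with exact-match scoring.

    gold, predicted -- iterables of hashable entities, e.g. (start, end, type)
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For a toy case with four gold entities and three predictions of which two match exactly, precision is 2/3, recall is 1/2 and F1 is 4/7. Because F1 is a harmonic mean, a model with high recall but low precision (or the reverse) is pulled toward the lower of the two numbers.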
Yep, time expressions are also entities, but very specific ones, so I decided to write about them in a separate section.
According to one master's thesis, linear-chain CRF performed very well at extracting time expressions from Russian text. The author manually tagged 2000 sentences (which contained about 500 time expressions), then iteratively tuned parameters and features until he obtained 0.93 (F1-measure) in cross-validation.
Sounds cool, but in a real-world setting I think it would be no more than 0.7-0.75 (F1-measure). That is also very decent, though.
CRF copes with this task very well. Various research papers (just google it!) claim that the results of using CRF for sentiment analysis in various languages are good enough: about 0.7 (F1-measure).
For instance, researchers from HSE claimed that they achieved 0.74 (F1-measure) when performing sentiment analysis on real Twitter messages. They manually tagged 20 000 messages and achieved an average of 0.86 (F1-measure) across all three possible types of sentiment (positive, negative and neutral).
By the way, the neutral class is important for good sentiment analysis (and my experience proves this). See, for example, this research about the importance of neutral class in Sentiment An....
Yet another link about CRF and label bias
CRF for Russian language (in Russian)