Opinion Mining - Sentiment Analysis and Beyond

Introduction

There’s a lot of buzz around the term “Sentiment Analysis” and the various ways of doing it. Great! So you report, with reasonable accuracy, what the sentiment about a particular brand or product is.

Opinion Mining and Sentiment Analysis

After publishing this report, your client comes back to you and says “Hey this is good. Now can you tell me ways in which I can convert the negative sentiments into positive sentiments?” – Sentiment Analysis stops there and we enter the realm of Opinion Mining. Opinion Mining is about having a deeper understanding of the review that was written. Typically, a detailed review will not just have a sentiment attached to it. It will have information and valuable feedback that can help to build the next strategy. Over time, some powerful methods have been developed using Natural Language Processing and computational linguistics to extract these subjective opinions.

Opinion Mining

In this blog we will study the stepping stone to Opinion Mining – grammatically tagging a sentence. It will help us break a sentence down into its underlying grammatical structure – nouns, verbs, adjectives etc. that will help us associate what was said about what. Once we are capable enough to do that, we can extract useful opinions that will help us answer the question posed by our client above.

The Intuition

All throughout this blog we will discuss the underlying theory and some code. If you’re in a hurry, this link will take you to the data and the code.

For the more patient readers, let’s start by looking at a review–

“Wish the unit had a separate online/offline light. When power to the unit is missing, the single red light turns off only when the warning sounds. The warning sound is like a lot of sounds you hear in the house so it isn’t always easy to tell what is happening”

This review about an electronic product goes well beyond sentiment. There is feedback, suggestions and opinions – a storehouse of useful information.

The first sentence’s takeaway is that the unit should have had a separate light. The words wish and the are not as important as unit and separate online/offline light. So, understanding the underlying structure can be key to unlocking useful feedback.

We do that in data science by parts of speech tagging. Let’s look into it more closely in the next section.

POS Tagging

POS Tagging is short for Part-of-Speech Tagging. Our end goal will be – given a sentence, we have to parse it to predict the underlying grammatical structure.

Parse Tree

Example: Fruit flies like a banana.

Fruit – JJ

flies – NN

like – VB

a – DT

banana – NN


POS Tag	Description
DT	Determiner
NN	Noun, singular or mass
VB	Verb, base form
VBD	Verb, past tense
JJ	Adjective
NP	Noun Phrase
VP	Verb Phrase

The above example is a representation of the problem statement. The word “flies” is mostly used as a verb – The airplane flies. But in this context it’s used as a noun. After giving it some thought, you will be forced to conclude that one word can assume many parts of speech depending on the context in which it is used.

So the challenge is to come up with a predictive model that can discriminate between these two usage instances and give us the appropriate tags given a sentence.

More formally –

- We have an input sentence x = x1, x2, x3, …, xn where x1, x2, x3, …, xn are the words in the sentence
- We have a tag sequence y = y1, y2, y3, …, yn where y1, y2, y3, …, yn are grammatical tags (verb, noun, determiner etc.)
- We have to find the most likely tag sequence for x, that is, the y1, …, yn that maximizes the probability: argmax over y1, …, yn of p(x1, …, xn, y1, …, yn)

The quantity being maximized is nothing but the joint probability of all x and all y.

Tagging a sentence can be prohibitively expensive if a brute-force approach is used. Say there is a 20-word sentence and 50 grammatical tags. Each word can be any tag. So, there are 50^20 possibilities! Not very computation friendly.

We will accomplish this with the help of the Hidden Markov Model using the Viterbi Algorithm. It’s a dynamic programming algorithm that reduces this exponential problem to one that is linear in the number of words in a sentence.
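To put numbers on that claim, here is a rough back-of-the-envelope sketch in Python. It assumes the Viterbi algorithm does about one unit of work per (word, current tag, previous tag) combination; the variable names are purely illustrative.

```python
n_words = 20   # length of the sentence from the example above
n_tags = 50    # size of the tag set from the example above

# Brute force: every word can take every tag, so score every full sequence.
brute_force_sequences = n_tags ** n_words

# Viterbi (rough estimate): for each word and each tag, look at each previous tag.
viterbi_steps = n_words * n_tags * n_tags

print("brute force sequences:", brute_force_sequences)
print("approximate Viterbi steps:", viterbi_steps)
```

The first number is on the order of 10^33, while the second is 50,000, which is why the dynamic programming route is practical.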

A few quick definitions to set the context:

Markov Process: The state of the system at time t+1 depends only on the state of the system at time t. In this context, the tag assigned to a word depends on the tag of the word just before it, so the most likely tag sequence can be built up one word at a time. This is also called the best parent approach.
Hidden Markov Model:
- X is the set of all words
- Y is the set of all grammatical tags
- We will define HMM as P(x1,x2,…,xn,y1,y2,…,yn) ∀ x ∈ X and y ∈ Y
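For reference, under the usual first-order (bigram) HMM assumption, the joint probability factorizes into one transition term and one observation term per word. The symbols q (transition probability), e (observation probability) and the start marker are notation assumed here for illustration; they are not defined in the text above.

```latex
p(x_1, \dots, x_n, y_1, \dots, y_n)
  = \prod_{i=1}^{n} q(y_i \mid y_{i-1}) \, e(x_i \mid y_i),
  \qquad y_0 = \langle s \rangle
```

Here q(y_i | y_{i-1}) is the probability of tag y_i following tag y_{i-1}, and e(x_i | y_i) is the probability of seeing word x_i given tag y_i.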

Method

NLTK Downloader

So now that we have access to a healthy data reserve, let’s concentrate on the method – then we will go through some code.

The training phase consists of estimating probabilities from tagged data, including something called the transition matrix. A typical transition matrix will look like this –

Transition Matrix

Here, the number in the 3rd row, 1st column is 0.0008. Reading the row as the previous tag and the column as the current tag, this means – P(NNP|MD) = number of times MD was followed by NNP / total number of MD

Observed Probabilities will look like this –

Observation Probability Matrix

Here, the observation probability of the word “Janet” given the tag NNP (proper noun), P(Janet|NNP), is 0.000032
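As an illustration of how such tables can be estimated, the sketch below counts tag pairs and (tag, word) pairs in a tiny hand-tagged corpus and turns the counts into relative frequencies. The corpus, the `<s>` and `</s>` boundary markers and the function names are all made up for this example (Python 3); real tables come from a large tagged corpus.

```python
from collections import Counter

# Hypothetical hand-tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("Janet", "NNP"), ("will", "MD"), ("back", "VB"), ("the", "DT"), ("bill", "NN")],
    [("the", "DT"), ("bill", "NN"), ("will", "MD"), ("pass", "VB")],
]

tag_count = Counter()    # how often each tag occurs
trans_count = Counter()  # how often tag `prev` is followed by tag `tag`
emit_count = Counter()   # how often `tag` is observed with `word`

for sentence in corpus:
    prev = "<s>"  # assumed start-of-sentence marker
    tag_count[prev] += 1
    for word, tag in sentence:
        trans_count[(prev, tag)] += 1
        emit_count[(tag, word)] += 1
        tag_count[tag] += 1
        prev = tag
    trans_count[(prev, "</s>")] += 1  # assumed end-of-sentence marker

def transition_prob(prev, tag):
    """P(tag | prev) = count(prev followed by tag) / count(prev)."""
    return trans_count[(prev, tag)] / tag_count[prev]

def emission_prob(tag, word):
    """P(word | tag) = count(tag observed with word) / count(tag)."""
    return emit_count[(tag, word)] / tag_count[tag]

print(transition_prob("NN", "MD"))   # NN occurs twice, followed by MD once
print(emission_prob("VB", "back"))   # VB occurs twice, once as "back"
```

Each row of the transition table is then just `transition_prob(prev, tag)` evaluated for a fixed `prev` and every possible `tag`.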

Transition Diagram

The above diagram is the most likely tag sequence of the unknown phrase computed by the Viterbi algorithm.

The Viterbi algorithm computes a probability matrix – grammatical tags on the rows and the words on the columns. Each cell in this matrix is a probability score. This score is the product of the score computed for the previous word, the observed probability of the current word, and the transition probability of going from the grammatical tag of the previous word to the grammatical tag of the current word.

Viterbi Illustration

Diagrammatic representation of the Viterbi probability matrix. We move column by column (each column represents probability scores for a word). Again, each score is a product of three numbers –

- The Viterbi path probability computed for the previous word
- The transition probability, picked from the transition matrix. This will be the probability of being in the current tag given the previous tag.
- The observation probability for the current word being the current tag
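Written as a recurrence, the three factors above combine as follows. The notation is assumed for illustration: v_i(t) is the Viterbi score of tag t for the i-th word, q is the transition probability, and e is the observation probability.

```latex
v_1(t) = q(t \mid \langle s \rangle) \, e(x_1 \mid t)
\qquad
v_i(t) = \max_{t'} \; v_{i-1}(t') \, q(t \mid t') \, e(x_i \mid t), \quad i = 2, \dots, n
```

The maximum over the previous tag t' is what selects the best parent for each cell.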

The Viterbi score for Janet being NNP = observation prob. of Janet given NNP * prob. of the first word being NNP = 0.000032 * 0.2767 ≈ 0.000009.

Again let’s say we want to compute the Viterbi score for the second word “will”. Janet is the previous word. The probability of Janet being anything other than NNP is 0. So, the probability of “will” being MD given Janet is NNP –

Viterbi score of Janet * prob. of NNP followed by MD * probability of “will” being MD = 0.000009 * 0.0110 * 0.308431 ≈ 0.0000000305

Janet only had non-zero scores for one tag so this was easy. Typically, at a particular cell for a current word, we compute Viterbi score for all the possibilities of parent tags and select the best parent that maximizes the probability at that cell. Thus we progress and arrive at the most likely tag sequence.

Pseudo Code for Viterbi Algorithm –
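What follows is a minimal Python sketch of the procedure described above, not a reference implementation. The function name `viterbi` and the dictionary layout are choices made for this illustration. The only probabilities filled in are the four quoted in the text; every other entry is treated as 0, which is a simplifying assumption so that the example stays self-contained.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return (best tag sequence, its probability) for `words` under an HMM."""
    # score[i][t]: probability of the best tag path for words[:i + 1] ending in tag t
    score = [{t: start_p.get(t, 0.0) * emit_p.get((t, words[0]), 0.0) for t in tags}]
    back = [{}]  # back[i][t]: best parent tag of t at position i
    for i in range(1, len(words)):
        score.append({})
        back.append({})
        for t in tags:
            # Pick the best parent: the previous tag that maximizes the path score.
            best_prev = max(tags, key=lambda p: score[i - 1][p] * trans_p.get((p, t), 0.0))
            score[i][t] = (score[i - 1][best_prev]
                           * trans_p.get((best_prev, t), 0.0)
                           * emit_p.get((t, words[i]), 0.0))
            back[i][t] = best_prev
    # Start from the best final tag and follow the back-pointers to the first word.
    last = max(tags, key=lambda t: score[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, score[-1][last]

# Only the values quoted in the text; all missing entries default to 0 (assumption).
tags = ["NNP", "MD", "VB"]
start_p = {"NNP": 0.2767}                      # P(first tag is NNP)
trans_p = {("NNP", "MD"): 0.0110}              # P(MD | NNP)
emit_p = {("NNP", "Janet"): 0.000032,          # P(Janet | NNP)
          ("MD", "will"): 0.308431}            # P(will | MD)

best_path, best_prob = viterbi(["Janet", "will"], tags, start_p, trans_p, emit_p)
print(best_path, best_prob)
```

Because the sketch carries full precision rather than the rounded 0.000009, its final score differs slightly from the hand calculation. Real taggers usually add log-probabilities instead of multiplying, to avoid underflow on long sentences.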

Python Code

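import nltk

# review_data is the list of review texts (strings) loaded from the data set
tagged_reviews = []
for each_review_text in review_data:
    # split the review into word tokens, then tag each token with its part of speech
    text = nltk.word_tokenize(each_review_text)
    tagged_reviews.append(nltk.pos_tag(text))

tagged_reviews[0]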

Python’s nltk package has an off-the-shelf implementation for grammatically tagging a sentence. First, we tokenize (split) the review to get each word and then pass that list of words as an argument to nltk.pos_tag() to get the tagged reviews as the output. The full code for Amazon product reviews along with data is available on my GitHub page.

Opinion Mining/Association

Once the sentence has been grammatically tagged, we can use production rules to mine opinions and extract meaningful feedback that might help us solve business problems. Production rules are very data specific. They depend on the type of data being looked at, on the intent, and on the kind of keywords we want to deal with. A sample production rule appears further below. There are basically two approaches to extracting opinions from grammatically tagged sentences –

Chunking – Here we write a production rule to extract important keywords that can help us in a number of ways – cleaning text by retaining important words, highlighting important words that help us understand the text better etc.
Chinking – This is the opposite of chunking. Here we write production rules to identify and remove phrases which are unimportant. Typical chinking production rules focus on removing conjunctions, determiners and other phrases that amount to stop words.
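To make the idea of chinking concrete, here is a small standard-library sketch that does not use NLTK. It treats the whole sentence as one chunk and then cuts it wherever a token carries an "unimportant" tag. The tag set `CHINK_TAGS`, the function name `chink` and the hand-assigned tags in the example are assumptions for illustration only.

```python
from itertools import groupby

# Tags treated as unimportant here: determiners, conjunctions, prepositions.
CHINK_TAGS = {"DT", "CC", "IN"}

def chink(tagged_tokens, chink_tags=CHINK_TAGS):
    """Drop tokens whose tag is in chink_tags and return the runs of words left over."""
    chunks = []
    # groupby splits the sentence into alternating runs of chinked and kept tokens.
    for is_chink, group in groupby(tagged_tokens, key=lambda wt: wt[1] in chink_tags):
        if not is_chink:
            chunks.append([word for word, _ in group])
    return chunks

# Hand-tagged fragment of the review quoted earlier (tags assigned by hand).
tagged = [("the", "DT"), ("warning", "NN"), ("sound", "NN"), ("is", "VBZ"),
          ("like", "IN"), ("a", "DT"), ("lot", "NN"), ("of", "IN"), ("sounds", "NNS")]

print(chink(tagged))
```

With these tags, the determiners and prepositions are removed and the remaining words come back as three separate runs.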

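# nltk.help.upenn_tagset()  # uncomment to list the available tags
# Tag names are upper case, as produced by nltk.pos_tag; the raw string keeps \$ intact
grammar = r"NP: {<DT|PP|CD>?<JJ|JJR|JJS>*<NN|NNS|PRP|NNP|IN|PRP\$>+<VBD|VBZ|VBN|VBP|IN>*<JJ|RB>*<PRP|NN|NNS>*}"
cp = nltk.RegexpParser(grammar)
results = cp.parse(tagged_reviews[9])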

The variable grammar in the above code is a production rule for catching a particular kind of phrase in a sentence. It is written as a regular expression over POS tags: ? means the phrase can optionally start with a determiner, a preposition or a cardinal number; * means 0 or more adjectives may follow; + means 1 or more nouns, pronouns or prepositions must occur; and so on.
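To show how a regular expression over tags can pick out a phrase, here is a simplified standard-library sketch. It is not NLTK's implementation: the tag sequence is encoded as a string such as `<DT><JJ><NN>` and an ordinary `re` pattern is matched against it. The function name `chunk`, the pattern `NP_PATTERN` and the hand-assigned tags are assumptions for illustration.

```python
import re

def chunk(tagged_tokens, tag_pattern):
    """Return the word groups whose tag sequence matches tag_pattern."""
    # Encode the tags as one string, e.g. "<DT><JJ><NN>", so a regex can scan it.
    tag_string = "".join("<%s>" % tag for _, tag in tagged_tokens)
    chunks = []
    for match in re.finditer(tag_pattern, tag_string):
        if match.start() == match.end():
            continue  # ignore empty matches
        # The number of "<" before a position equals the number of tokens before it.
        first = tag_string.count("<", 0, match.start())
        last = tag_string.count("<", 0, match.end())
        chunks.append([word for word, _ in tagged_tokens[first:last]])
    return chunks

# Optional determiner, any number of adjectives, then one or more nouns.
# Patterns are assumed to be built from whole <TAG> units like these.
NP_PATTERN = r"(<DT>)?(<JJ>)*(<NNS?>)+"

# Hand-tagged fragment of the review quoted earlier (tags assigned by hand).
tagged = [("the", "DT"), ("single", "JJ"), ("red", "JJ"), ("light", "NN"),
          ("turns", "VBZ"), ("off", "RP")]

print(chunk(tagged, NP_PATTERN))
```

The `?`, `*` and `+` quantifiers play the same role here as in the grammar above, just over a much smaller set of tags.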

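for result in results:
    # phrases matched by the grammar come back as subtrees; other tokens stay as (word, tag) tuples
    if isinstance(result, nltk.tree.Tree):
        # keep only the words of the matched phrase
        assoc = [res[0] for res in result]
        if len(assoc) > 2:
            print(assoc)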

Code for exploring the results.

The input review was –

“After going through the reviews, I bought this CF card for my Canon Digital Rebel. So far it has worked fine, though I don't pretend to be an expert digital photographer. When the card is empty, it shows 127 shots available. It seems to read reasonably fast, though it takes a bit longer than I was used to with my point-and-shoot Olympus and its SmartMedia card.”

Output –

[u'this', u'CF', u'card', u'for', u'my', u'Canon', u'Digital', u'Rebel']
[u'it', u'has', u'worked', u'fine']
[u'the', u'card', u'is', u'empty']
[u'127', u'shots', u'available']
[u'its', u'SmartMedia', u'card']

So as output we get chunks of useful information extracted by grammatical tag filtering through our production rule.

End Notes

If we mine opinions for all reviews, common topics will emerge, and the opinions centered on those topics can easily be examined. Imagine a camera where people are talking about the lens, the body, the auto flash etc. Based on those opinions, a roadmap can be drawn up for the product – which aspects to focus on in the next upgrade, and so on.

Do share your comments, and let’s explore this topic further together.

Originally posted here
