Text Mining and Sentiment Analysis – A Primer

Over the years, a crucial part of data-gathering behavior has revolved around what other people think. With the constantly growing popularity and availability of opinion-driven resources such as personal blogs and online review sites, new challenges and opportunities are emerging as people increasingly use such technologies to make decisions. Sentiment analysis, or opinion mining, refers to the use of computational linguistics, text analytics and natural language processing to identify and extract subjective information from source materials.

Sentiment analysis is considered one of the most popular applications of text analytics. Its primary task is to analyze a body of text to understand the opinion it expresses, along with other key factors such as modality and mood. Sentiment analysis usually works better on text with a subjective context than on text with only an objective context. This is because objective text usually states plain facts without expressing any emotion, feeling, or mood, whereas subjective text is usually written by a person expressing moods, emotions, and feelings. Sentiment analysis is widely used, especially as part of social media analysis for any domain, be it a business, a recent movie, or a product launch, to understand how people have received it and what they think of it based on their opinions, or sentiment.


Textual data in the form of unstructured datasets can be classified into two types:

  • Fact-based (objective) versus opinion-based (subjective). Sentiment analysis works best on text that has a subjective context. In general, social media, surveys, and feedback data are all heavily opinionated and express the beliefs, judgements, emotions, and feelings of human beings.
  • Feature/aspect-based analysis involves identifying sentiments or opinions by assessing different aspects of an entity. The entity can be, for example, a digital camera (picture quality), a cellphone (screen) or a bank, among others.

Sentiment analysis can be computed at various levels of text data: the sentence level, the paragraph level or the whole document. Often, sentiment is evaluated either by taking the whole document into consideration or by aggregating the sentiments of individual sentences.

A basic task in sentiment analysis is classifying the polarity of the text in a document. By computing the polarity of the document, sentiment analysis determines whether the document expresses a positive, negative, or neutral sentiment. More advanced analysis goes further and detects complex emotions such as happiness, anger, sadness, and sarcasm, among others.
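As a minimal sketch of this polarity step, the snippet below labels a document by averaging per-sentence scores. The function name `document_polarity`, the score range of -1 to 1 and the 0.1 neutrality threshold are assumptions made for illustration; they are not taken from any particular tool.

```python
from statistics import mean

def document_polarity(sentence_scores, threshold=0.1):
    """Label a document from per-sentence polarity scores in [-1, 1].

    The threshold that separates "neutral" from the other two labels
    is an illustrative assumption, not a standard value.
    """
    if not sentence_scores:
        return "neutral"
    aggregate = mean(sentence_scores)  # aggregate the sentence-level sentiment
    if aggregate > threshold:
        return "positive"
    if aggregate < -threshold:
        return "negative"
    return "neutral"

print(document_polarity([0.8, 0.4, -0.1]))   # mostly positive sentences
print(document_polarity([-0.6, -0.2, 0.1]))  # mostly negative sentences
print(document_polarity([0.3, -0.3]))        # sentences cancel out
```

Averaging is only one possible aggregation; a sum or a majority vote over sentence labels would fit the same description.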

Polarity analysis

Polarity analysis assigns scores to the emotions expressed in the document, marking each as positive or negative. It then assigns a label to the document on the basis of the aggregate score. Two major techniques for sentiment analysis include:

  • Supervised machine learning
  • Unsupervised lexicon-based sentiment analysis

The key idea is to learn the various techniques typically used to tackle sentiment analysis problems through practical and relevant use cases of each.

Supervised machine learning classification algorithms can be used to classify documents by their sentiment. Alternatively, lexicons, which are dictionaries or vocabularies specially constructed for sentiment analysis, can be leveraged to compute sentiment without using any supervised techniques.


  • The case for supervised machine learning models

Sentiment analysis of Internet Movie Database (IMDb) Reviews-

To perform sentiment analysis on movie reviews, suppose one takes 50,000 movie reviews, each labelled with a sentiment polarity that is either positive or negative. A positive label usually represents a movie rated more than six stars by the audience, whereas a negative label represents a rating of less than five stars on IMDb. Because sentiment is subjective, one cannot compute and prove it mathematically, which means one can never get a hundred-percent perfect model.

For a supervised machine learning model, the process is as follows:



The two primary steps in this technique are training the supervised model on the training data and then evaluating its performance on the testing data. Out of 50,000 reviews, say one takes 35,000 as the training set and the remaining 15,000 as the test set. The supervised model learns from past reviews and their corresponding sentiments in order to predict the sentiment of reviews in the test set. The model is built using feature extraction, normalization and a support vector machine algorithm, and can then predict sentiment for new movie reviews from the test dataset.
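The train-then-evaluate workflow above can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline behind the figures in this article: the ten reviews are invented toy data, the features are simple bag-of-words counts, and the linear support vector machine is fitted with a basic sub-gradient descent on the regularized hinge loss. All function names and hyperparameters here are hypothetical, and a real project would normally use an established machine learning library.

```python
import re
from collections import Counter

def featurize(text):
    """Normalize (lowercase, letters only) and return bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def score(weights, bias, features):
    """Linear decision value w . x + b for a bag-of-words feature vector."""
    return sum(weights[tok] * n for tok, n in features.items()) + bias

def train_linear_svm(samples, epochs=20, lr=0.1, reg=0.01):
    """Sub-gradient descent on the L2-regularized hinge loss.

    `samples` is a list of (features, label) pairs, label being +1 or -1.
    """
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for features, label in samples:
            margin = label * score(weights, bias, features)
            for tok in weights:                  # L2 shrinkage of all weights
                weights[tok] *= 1 - lr * reg
            if margin < 1:                       # hinge loss is active
                for tok, n in features.items():
                    weights[tok] += lr * label * n
                bias += lr * label
    return weights, bias

def predict(model, text):
    weights, bias = model
    return 1 if score(weights, bias, featurize(text)) >= 0 else -1

# Invented toy reviews (+1 positive, -1 negative), for illustration only.
reviews = [
    ("a wonderful film with a great cast", 1),
    ("a terrible film with a boring plot", -1),
    ("great story and wonderful acting", 1),
    ("boring story and awful acting", -1),
    ("i loved this great movie", 1),
    ("i hated this awful movie", -1),
    ("loved it, simply wonderful", 1),
    ("a great and wonderful story", 1),
    ("an awful, boring movie", -1),
    ("i loved the cast", 1),
]

split = len(reviews) * 7 // 10   # 70/30, the same ratio as 35,000/15,000
train, test = reviews[:split], reviews[split:]

model = train_linear_svm([(featurize(text), label) for text, label in train])
accuracy = sum(predict(model, text) == label for text, label in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

Accuracy on three toy reviews says nothing about real performance; the point is only the shape of the workflow: normalize, extract features, fit on the training split, then score on held-out reviews.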

For example, for the review “I hope these same film-makers never unite”, the actual labelled sentiment is negative and the predicted sentiment is also negative. Supervised machine learning models are around 80% accurate at predicting sentiment for movie reviews. These models attract interest because they can represent many features, adapt easily to changing inputs, and measure the degree of uncertainty in a classification.

  • The case for Unsupervised lexicon-based Sentiment Analysis


Sentiment Analysis for social media analytics


Applying a lexicon is considered one of the two primary approaches to sentiment analysis. It involves calculating sentiment from the semantic orientation of the words or phrases that occur in the text. This approach uses a dictionary of both positive and negative words, in which a positive or negative value is assigned to every word. In lexicon-based models, a message is treated as a bag of words. Based on this representation, sentiment values are assigned to all positive and negative words within the message. Finally, a combining function, such as an average or a sum, is applied to predict the overall sentiment of the message. Apart from the sentiment value, the local context of a word or phrase, such as intensification or negation, is also taken into consideration.
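A minimal sketch of this bag-of-words scoring follows. The tiny `LEXICON`, its values and the `NEGATIONS` set are hypothetical stand-ins for a real sentiment lexicon. Negation is handled in the simplest possible way, by flipping the sign of a word that directly follows a negation term, and intensification is left out to keep the example short.

```python
import re

# Hypothetical miniature lexicon; the values are illustrative only.
LEXICON = {"good": 60, "love": 80, "great": 70,
           "bad": -60, "hate": -80, "awful": -70}
NEGATIONS = {"not", "never", "no"}

def score_message(text, combine="sum"):
    """Score a message as a bag of words using the lexicon above.

    `combine` selects the combining function: "sum" or "average".
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    values = []
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue                      # words outside the lexicon are ignored
        value = LEXICON[token]
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value                # local context: simple negation
        values.append(value)
    if not values:
        return 0.0
    total = sum(values)
    return total / len(values) if combine == "average" else float(total)

message = "I love this phone but the battery is bad"
print(score_message(message))                     # 80 + (-60) = 20.0
print(score_message(message, combine="average"))  # 20 / 2 = 10.0
print(score_message("This is not good"))          # negated: -60.0
```

The choice between sum and average matters for long messages: a sum grows with the number of sentiment words, while an average stays within the range of the lexicon values.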

Sentiment Lexicon-

Suppose the sentiment lexicon, constructed using SentiWordNet as the baseline, contains 6,300 words, and each word or phrase in the lexicon has been assigned a sentiment value ranging from 100 (most positive) to -100 (most negative). Some positive and negative words, however, also occur with a neutral meaning in a sentence. To address this, for each word s in the lexicon, a conditional probability (K) can be estimated in addition to the assigned sentiment value:

K (positive|s) for positive s
K (negative|s) for negative s

On the basis of a set of labelled data, one estimates, for every positive word, the probability that a random message containing this word is positive, and in the same way estimates the probabilities for negative words. The next question is whether this information helps to handle messages with mixed sentiment. The training dataset was produced on the basis of the absence of emoticons in a message. The conditional probability is calculated from the positive or negative state of the word as presented below:

where #s_K and #s_N represent the number of positive and negative messages in a sample, respectively, that contain the word s. To obtain accurate results, this process is repeated around 100 times and the average probability is then stored in the lexicon.
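The formula itself is not reproduced in the text above. One reading consistent with the definitions of #s_K and #s_N, stated here as an assumption, is K(positive|s) = #s_K / (#s_K + #s_N), with K(negative|s) = 1 - K(positive|s). The sketch below computes that ratio on a handful of invented labelled messages; the function name and the data are hypothetical, and the repeat-and-average step over roughly 100 samples is omitted.

```python
import re
from collections import Counter

def positive_probabilities(labelled_messages):
    """Estimate K(positive|s) for every word s in the labelled messages.

    Assumed formula: #s_K / (#s_K + #s_N), where #s_K and #s_N count the
    positive and negative messages that contain the word s.
    """
    pos_counts, neg_counts = Counter(), Counter()
    for text, label in labelled_messages:
        words = set(re.findall(r"[a-z']+", text.lower()))  # count each message once
        (pos_counts if label == "positive" else neg_counts).update(words)
    return {
        word: pos_counts[word] / (pos_counts[word] + neg_counts[word])
        for word in set(pos_counts) | set(neg_counts)
    }

# Invented labelled messages, for illustration only.
sample = [
    ("good morning, love it", "positive"),
    ("good grief, this is bad", "negative"),
    ("good service", "positive"),
    ("bad service", "negative"),
]

probs = positive_probabilities(sample)
print(probs["good"])     # in 2 of the 3 messages containing it: 2/3
print(probs["bad"])      # only in negative messages: 0.0
print(probs["service"])  # once positive, once negative: 0.5
```

A word such as "good" that also appears in negative or neutral contexts ends up with a probability well below 1, which is the information the lexicon needs to downweight it.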


Conventionally, sentiment analysis approaches and systems looked at words or phrases in isolation. Typically, they assigned negative points to negative words and positive points to positive ones, and then summed up these points. For example, in “I love this car”, the word “love” earns a “+1” ranking, whereas “The tea was really, really bad” not only generates a “-1” ranking due to the word “bad” but also a further “-2” ranking due to the “really, really” phrase. This rules-based type of sentiment analysis demands that the text analysis and parsing rules be crafted manually. Such a model is hard to transfer to other languages, and it does not work well with social media channels like Twitter, which has short, condensed and idiosyncratic sentences. Precision rates for conventional models vary from 40% to 60%, which is acceptable but certainly not outstanding.
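The point-counting scheme in the two example sentences can be sketched as follows. The word lists are hypothetical, and the rule that each intensifier directly before a sentiment word adds one extra point is an assumption chosen to reproduce the “-1” plus “-2” arithmetic above.

```python
import re

# Hypothetical word lists, kept tiny for illustration.
POSITIVE = {"love", "good", "great"}
NEGATIVE = {"bad", "awful", "hate"}
INTENSIFIERS = {"really", "very"}

def rule_based_points(text):
    """Sum +1/-1 per sentiment word, plus one extra point per preceding intensifier."""
    tokens = re.findall(r"[a-z']+", text.lower())
    points, pending = 0, 0
    for token in tokens:
        if token in INTENSIFIERS:
            pending += 1                 # remember the intensifier for the next word
        elif token in POSITIVE:
            points += 1 + pending
            pending = 0
        elif token in NEGATIVE:
            points -= 1 + pending
            pending = 0
        else:
            pending = 0                  # intensifier not followed by a sentiment word
    return points

print(rule_based_points("I love this car"))                # +1
print(rule_based_points("The tea was really, really bad")) # -1 and -2, so -3
```

Every new domain or language needs its own word lists and rules, which is the manual effort the paragraph above refers to.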

This is where applying deep learning to these models becomes imperative. The results of sentiment analysis must be precise to be useful. Many organizations are reaping benefits by implementing deep learning models, simply because of their utility and accuracy.

Sentiment analysis coupled with deep learning does not demand handcrafted attributes or a comprehensive predetermined dictionary; instead, this approach leverages inference to produce its own models. A Long Short-Term Memory (LSTM) network architecture, working in tandem with Recursive Neural Networks (RNNs) and grammatical structures, provides precise measurements of sentiment in texts of any size across different channels.

Deep learning makes sentiment analysis much more effective than conventional methods, improving both accuracy and speed. With deep learning, sentiment analysis results can be as much as 90% accurate.


The number of customers who read and trust online reviews is increasing every day. The web has made it convenient for consumers to find out the experiences and opinions of people who are neither well-known critics nor personal acquaintances, which is surprising.

A survey of around 2,000 American adults, focusing on the growing power of information and how customers use it to make smart choices, found the following:


  • More than 80% have searched for a product online at least once;
  • 20% do online research on a typical day;
  • 80% said that their buying decisions are influenced by online reviews;
  • 60% prefer a 5-star rated item over a 4-star rated product;
  • 32% have submitted a rating of a service, person or product through an online rating system.

The curiosity and reliance upon online recommendations and advice that the above data reveals is one of the reasons behind the rise of interest in systems and approaches that deal with opinions and sentiments as a top priority.




If someone is talking about you, you would certainly want to find proof. Businesses and organizations have no choice, because they need to know what people think about their brand. It is therefore imperative for brands to listen carefully to their customers to learn what is being said about the firm and, more importantly, whether it is positive or negative. The tools mentioned below are helping companies track the sentiments of their customers:




Sentiment analysis, still in its infancy, is steadily growing and becoming popular, with numerous applications. Organizations are looking at sentiment analysis as a primary aid in improving their marketing strategies and measuring sales. To accomplish this, some organizations are developing their own strategies and tools, while others are outsourcing the task to companies specializing in this domain. Top locations to land a dream job in this field include London, England, Berkshire, Birmingham, the South East and India, among others.

Data scientists, data analysts and developers with a certification are well compensated and sought after in today's big data-driven market. Here’s what market trends say:


From gaining practical skills to learning all aspects of a career pursuit, a certification can do a great deal to steer your career in the right direction. To build a career in sentiment analysis as a successful data analyst or data engineer, a professional certification plays an important role, as it equips one with the primary skill sets and knowledge needed to be recognized as a “thought leader”.