Analyzing sentiments is a very subjective exercise. The other day, I was talking to a colleague of mine about how you could mine sentiments, and he made a very interesting comment based on his own experience with the concept:
He runs his own mid-sized software company, which has been doing fairly well, and at one point he tried to analyze the broader sentiment about the brand value of his company in the open market.
He went online, extracted feedback posted on various forums using screen-scraping algorithms and tools, and fed it through an off-the-shelf sentiment analysis tool – HP has one, IDOL (https://www.idolondemand.com/developer/apis/analyzesentiment#overview), and even Microsoft Azure has one now (https://azure.microsoft.com/en-us/documentation/articles/machine-le...).
The overall brand sentiment turned out to be negative, and even more surprising was the fact that it leaned towards the most negative end of the scale. This surprised him because the parameters his HR partners provided were painting a contrasting picture: the attrition rate was low, the employee engagement survey produced positive results, and so on. He then did a deep dive into the feedback content and realized that almost all of the comments were negative, that the people who posted them were all disgruntled employees, and that the happy employees rarely posted any kind of feedback on any social forum. They were too busy with their work and adding more value to the organization.
The comment he made in conclusion was: “It does not matter which complex or super-smart models you choose, or which algorithms can crunch complex text; what matters is how you choose the source of the data set that you feed into the analysis.”
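The anecdote above can be sketched in a few lines: score each piece of scraped feedback and average the scores. The toy lexicon-based scorer below is a stand-in for commercial tools like IDOL or Azure's service, and the word lists and feedback strings are purely illustrative; it shows how a source dominated by disgruntled posters drags the average down regardless of how the scoring itself is done.

```python
# A minimal lexicon-based sentiment scorer -- a toy stand-in for
# commercial tools such as HP IDOL or Azure's sentiment API.
# The word lists below are illustrative, not a real lexicon.
POSITIVE = {"great", "love", "excellent", "happy", "helpful"}
NEGATIVE = {"terrible", "hate", "awful", "disgruntled", "poor"}

def score(text):
    """Return a score in [-1, 1]: +1 if all sentiment words are
    positive, -1 if all are negative, 0 if none are found."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

# Made-up feedback skewed towards unhappy posters, as in the anecdote.
feedback = [
    "I hate the terrible management here",
    "awful place and poor culture",
    "great team and I love the work",
]
overall = sum(score(t) for t in feedback) / len(feedback)
print(round(overall, 2))  # negative overall, driven by the biased source
```

However sophisticated the scorer, the aggregate here is decided by who chose to post, which is exactly the point of the anecdote.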
The screenshots below show the sentiment analysis of a famous banking brand based on Twitter feedback. We have two word cloud images: one displaying the set of positive things said, and the other one, well, the not-so-positive things.
As you can see, the sentiment is fairly positive, with the word cloud of positive tweets containing more data than the word cloud of negative tweets.
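The word frequencies behind clouds like these can be built with a simple counter. The tweets, labels and stop words below are made up for illustration; a denser positive cloud simply reflects a larger positive word count.

```python
from collections import Counter

# Build per-sentiment word frequencies of the kind that feed a word cloud.
# Tweets and sentiment labels here are fabricated examples.
tweets = [
    ("love the new mobile banking app", "positive"),
    ("great customer service today", "positive"),
    ("love the great rewards program", "positive"),
    ("fees are terrible", "negative"),
]

STOP_WORDS = {"the", "a", "are", "new", "today"}

clouds = {"positive": Counter(), "negative": Counter()}
for text, label in tweets:
    clouds[label].update(w for w in text.split() if w not in STOP_WORDS)

# The positive cloud has more words overall, so it renders "denser".
print(clouds["positive"].most_common(3))
print(sum(clouds["positive"].values()) > sum(clouds["negative"].values()))
```

A real word cloud library would take these counts directly as input; the relative sizes of the two clouds come straight from these totals.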
A couple of points we need to remind ourselves of while analyzing these groups of words:
- The most important caveat is the quality of the data. The ability to use high-quality data for analysis comes from examining the initial data and applying data-set-specific filters to weed out low-quality records; the notion of “one size does not fit all” fits ideally into this scenario. While doing the analysis, we should be very clear about the type of data we are looking for, so that we can apply the correct regular expressions and the right set of stop words to achieve the data quality we need.
- The second, but no less important, factor is the quantity of data: we may not be able to derive meaningful results from minuscule data sets. There should be enough volume and variety available so that the model used to derive the analysis results can be trained adequately. We are dealing with machine learning, and if we don't teach the system properly and feed it enough variety and volume of data, the learning algorithm will not perform well.
- There is also subjectivity around the date and time this data was pulled from Twitter for analysis. For example, sentiment would hover around the positive after a charity drive or a football game that the brand has sponsored or run interesting ads during, whereas sentiment would turn adverse after news of a recession or job cuts comes up. Companies doing brand or sentiment analysis exercises should factor this in.
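The first bullet, on data quality, can be sketched as a small cleaning step applied before any scoring. The regular expressions and stop-word list below are illustrative examples of data-set-specific filters, not a universal recipe:

```python
import re

# A sketch of data-set-specific cleaning for tweets before sentiment
# analysis: strip URLs, mentions and hashtags, then remove stop words.
# The patterns and the stop-word list are illustrative, not exhaustive.
STOP_WORDS = {"the", "a", "an", "is", "rt", "to", "of"}

def clean(tweet):
    tweet = tweet.lower()
    tweet = re.sub(r"https?://\S+", "", tweet)   # drop links
    tweet = re.sub(r"[@#]\w+", "", tweet)        # drop mentions/hashtags
    words = re.findall(r"[a-z']+", tweet)
    return [w for w in words if w not in STOP_WORDS]

print(clean("RT @bank: the app is great! https://t.co/xyz #fintech"))
# only the meaningful tokens survive the filters
```

A different data set (forum posts, app-store reviews) would need different filters, which is exactly the “one size does not fit all” point above.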
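The last bullet, on timing, suggests tracking sentiment per day rather than as one aggregate number, so that a spike after a sponsored game or a dip after job-cut news is visible in context. The dates and scores below are made-up examples:

```python
from collections import defaultdict
from datetime import date

# A sketch of averaging sentiment scores per day. Dates and scores
# are fabricated to mimic a sponsored-game spike and a bad-news dip.
scored = [
    (date(2016, 5, 1), 0.8),   # day after a sponsored football game
    (date(2016, 5, 1), 0.6),
    (date(2016, 5, 9), -0.7),  # day job-cut news broke
    (date(2016, 5, 9), -0.5),
]

daily = defaultdict(list)
for day, s in scored:
    daily[day].append(s)

trend = {day: sum(v) / len(v) for day, v in sorted(daily.items())}
for day, avg in trend.items():
    print(day, round(avg, 2))
```

Comparing such daily averages against a calendar of brand events is one simple way to factor the timing effect into the analysis.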