Guest blog post by Dr. Livan Alonso.
Twitter has more than 250 million monthly active users, who post more than 500 million tweets per day. If you follow many people, it isn't feasible to read every tweet from them.
Do you have any idea what people are tweeting about Data Science, Big Data, Business Analytics, Hadoop, Machine Learning, or R programming?
Word clouds are one of the simplest and most intuitive ways of visualizing text data. The clouds give greater prominence to words that appear more frequently in the tweets. Over the last week, we collected all public tweets related to these topics, allowing us to track what people around the world are talking about, and here we present the word cloud for each of them.
Using these word clouds, we are more likely to find answers to questions such as: What is Big Data? What can we do with it? Who uses Big Data? Who tweets about Big Data? What platforms and software products deal with Big Data? Moreover, if we collect tweets over long periods of time, we will be able to generalize and extract definitions of buzzwords and new trends.
Some interesting facts based on the Big Data word cloud and the collected data:
As can be noticed, this word cloud can help us understand the phenomenon that is Big Data, as well as other new trends.
Could you find other interesting facts for the other word clouds? Let us know in the comments.
How was the data collected?
All new tweets related to specific topics (Big Data, Data Science, Hadoop, etc.) were collected by a Python script running in real time on a dedicated Amazon EC2 Linux server, which saves them to a file.
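The filtering-and-saving part of such a script can be sketched as follows. This is only a simplified illustration, not the script used for the post: the live connection to the Twitter streaming API is omitted, and the names `KEYWORDS`, `matches_topic`, and `save_matching_tweets` are ours, not from the original code.

```python
import json

# Hypothetical topic keywords; the actual tracked terms are an assumption.
KEYWORDS = ["big data", "data science", "hadoop", "machine learning"]

def matches_topic(text, keywords=KEYWORDS):
    """Return True if the tweet text mentions any tracked topic."""
    lowered = text.lower()
    return any(kw in lowered for kw in keywords)

def save_matching_tweets(tweets, path):
    """Append tweets that mention a tracked topic to a file,
    one JSON object per line, and return how many were saved."""
    saved = 0
    with open(path, "a") as f:
        for tweet in tweets:
            if matches_topic(tweet.get("text", "")):
                f.write(json.dumps(tweet) + "\n")
                saved += 1
    return saved
```

In the real setup this would be fed by the streaming connection rather than an in-memory list, but the save-to-file step is the same.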
The Natural Language Toolkit (NLTK)1 was used to process the tweet data. Let's walk through some simple things you can do with NLTK, for example tokenizing and tagging a recent @datasciencectrl tweet (commands entered in a Python interactive session):
>>> import nltk
>>> tweet="10 types of regressions for #DataScientists, #Statisticians and other #Analytic practitioners, which one to use? http://ow.ly/zsvs4"
>>> tokens = nltk.word_tokenize(tweet)
>>> tokens
['10', 'types', 'of', 'regressions', 'for', '#', 'DataScientists', ',', '#', 'Statisticians', 'and', 'other', '#', 'Analytic', 'practitioners', ',', 'which', 'one', 'to', 'use', '?', 'http', ':', '//ow.ly/zsvs4']
nltk.pos_tag processes the tweet text and attaches a part-of-speech tag to each word, e.g. CD: cardinal number, NNS: plural noun, NNP: singular proper noun, etc.
>>> tagged = nltk.pos_tag(tokens)
>>> tagged
[('10', 'CD'), ('types', 'NNS'), ('of', 'IN'), ('regressions', 'NNS'), ('for', 'IN'), ('#', '#'), ('DataScientists', 'NNS'), (',', ','), ('#', '#'), ('Statisticians', 'NNS'), ('and', 'CC'), ('other', 'JJ'), ('#', '#'), ('Analytic', 'NNP'), ('practitioners', 'NNS'), (',', ','), ('which', 'WDT'), ('one', 'CD'), ('to', 'TO'), ('use', 'VB'), ('?', '.'), ('http', 'NN'), (':', ':'), ('//ow.ly/zsvs4', '-NONE-')]
Nouns can be printed using the tag information and functions from the regular expression package (https://docs.python.org/2/library/re.html).
>>> import re
>>> for word, tag in tagged:
...     if re.match('N', tag):
...         print(word, tag)
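With the nouns extracted, their frequencies can be counted before plotting. Here is a minimal sketch using only the standard library; the function name `noun_frequencies` is ours, not from the original script:

```python
import re
from collections import Counter

def noun_frequencies(tagged_words, top_n=200):
    """Count how often each noun appears in a list of (word, tag)
    pairs and keep only the top_n most frequent, as done for the
    word clouds (the post keeps the 200 most frequent words)."""
    nouns = [word.lower() for word, tag in tagged_words if re.match('N', tag)]
    return Counter(nouns).most_common(top_n)
```

For example, `noun_frequencies([('Big', 'NNP'), ('Data', 'NNP'), ('data', 'NN'), ('is', 'VBZ')])` counts "data" twice because words are lowercased before counting.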
The processed data was visualized using an R package called wordcloud (http://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf).
The word clouds show the 200 most frequent words in the tweets (less frequent terms were dropped). The font size of each word depends on its frequency of appearance in the tweets. The colors come from the ColorBrewer "Dark2" palette: the frequency range of the top 200 words was split into 8 groups, and words were colored from least to most frequent (see palette below).
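The plots themselves were made with R's wordcloud package, but the binning step described above can be illustrated on its own. The following Python sketch splits a frequency range into 8 equal-width groups; the helper name `color_group` is an assumption, not part of either script:

```python
def color_group(freq, min_freq, max_freq, n_groups=8):
    """Map a word frequency to a palette index in 0..n_groups-1
    by splitting [min_freq, max_freq] into equal-width bins
    (least frequent -> group 0, most frequent -> group n_groups-1)."""
    if max_freq == min_freq:
        # Degenerate range: all words share one frequency, use the last bin.
        return n_groups - 1
    width = (max_freq - min_freq) / n_groups
    group = int((freq - min_freq) / width)
    return min(group, n_groups - 1)  # clamp the top edge into the last bin
```

Each group index would then pick one of the 8 "Dark2" palette colors.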
1. Bird, Steven, Edward Loper, and Ewan Klein (2009). Natural Language Processing with Python. O'Reilly Media Inc.