Word Clouds of Big Data, Data Science and Other Buzz Words

Guest blog post by Dr. Livan Alonso. 

Twitter has more than 250 million monthly active users, who send more than 500 million tweets per day. Anyone who follows many accounts cannot feasibly read every tweet from them.

Do you have any idea what people are tweeting about Data Science, Big Data, Business Analytics, Hadoop, Machine Learning or R programming?

Word clouds are one of the simplest and most intuitive ways of visualizing text data. The clouds give greater prominence to words that appear more frequently in the tweets. Over the last week, we collected all public tweets related to these topics, which lets us track what people around the world are talking about. Here we present the word cloud for each topic.

Using these word clouds, we can start to answer questions such as: What is Big Data? What can we do with it? Who uses Big Data? Who tweets about Big Data? What platforms and software products are dealing with Big Data? Moreover, if we collect tweets over long periods of time, we will be able to generalize and extract definitions of buzzwords and new trends.

Some interesting facts based on the Big Data word cloud and the collected data:

  • The Big Data word cloud is the most heterogeneous of all the clouds analyzed, and it is not centered on a few prominent words.
  • The most prominent word in the Big Data word cloud was Analytics, which suggests that analytics is the main way Big Data is seen to be changing the world.
  • In just one week, more than 55,000 tweets mentioned Big Data. By comparison, there were around 5,600 for Data Science, 1,300 for Business Analytics, 5,400 for Machine Learning, 5,700 for Hadoop and 2,400 for R programming.

As can be seen, this word cloud can help us understand the Big Data phenomenon, and the same approach applies to other new trends.

Can you find interesting facts in the other word clouds? Let us know in the comments.

How was the data collected?

All new tweets related to specific topics (Big Data, Data Science, Hadoop, etc.) were collected using a Python script that runs in real time on our dedicated Amazon EC2 Linux server and saves them to a file.
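The collection script itself is not shown in this post. As a rough sketch of the filter-and-save step only, the code below takes any iterable of tweet texts and appends the ones that mention a topic to a JSON Lines file. The names TOPICS, matching_topics and save_matching are hypothetical, the connection to Twitter is deliberately left out, and "R programming" is omitted because a plain substring match on "r" would be meaningless.

```python
import json

# Hypothetical topic list; the exact search terms used for the post are not given.
TOPICS = ["big data", "data science", "business analytics",
          "machine learning", "hadoop"]


def matching_topics(text, topics=TOPICS):
    """Return the topics mentioned in a tweet's text (case-insensitive)."""
    lowered = text.lower()
    return [topic for topic in topics if topic in lowered]


def save_matching(tweets, path, topics=TOPICS):
    """Append tweets that mention any topic to a JSON Lines file.

    `tweets` is any iterable of tweet texts (assumed here to stand in for
    a live stream). Returns the number of tweets written.
    """
    saved = 0
    with open(path, "a", encoding="utf-8") as out:
        for text in tweets:
            hits = matching_topics(text, topics)
            if hits:
                out.write(json.dumps({"text": text, "topics": hits}) + "\n")
                saved += 1
    return saved
```

Writing one JSON object per line keeps the file appendable while the script runs and easy to read back later, one tweet at a time.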

The Natural Language Toolkit (NLTK)1 was used to process the tweet data. Here are some simple things you can do with NLTK, for example tokenizing and tagging a recent @datasciencectrl tweet (commands entered in the Python interpreter):

>>> import nltk
>>> tweet = "10 types of regressions for #DataScientists, #Statisticians and other #Analytic practitioners, which one to use? http://ow.ly/zsvs4"
>>> tokens = nltk.word_tokenize(tweet)
>>> tokens
['10', 'types', 'of', 'regressions', 'for', '#', 'DataScientists', ',', '#', 'Statisticians', 'and', 'other', '#', 'Analytic', 'practitioners', ',', 'which', 'one', 'to', 'use', '?', 'http', ':', '//ow.ly/zsvs4']
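As the output above shows, word_tokenize separates each "#" from its hashtag and breaks the URL into three pieces. If hashtags and links should stay whole, one option is a small regular-expression tokenizer. The sketch below is an illustration using only the standard library, not the tokenizer used for the word clouds; TWEET_TOKEN and tweet_tokenize are hypothetical names.

```python
import re

# Alternatives are tried left to right, so URLs and hashtags win over plain words.
TWEET_TOKEN = re.compile(
    r"https?://\S+"      # URLs, kept whole
    r"|[#@]\w+"          # hashtags and @mentions, kept whole
    r"|\w+(?:'\w+)?"     # words and numbers, with an optional apostrophe part
    r"|[^\w\s]"          # any other single punctuation character
)


def tweet_tokenize(text):
    """Split a tweet into tokens without breaking hashtags, mentions or URLs."""
    return TWEET_TOKEN.findall(text)


tweet = ("10 types of regressions for #DataScientists, #Statisticians and other "
         "#Analytic practitioners, which one to use? http://ow.ly/zsvs4")
print(tweet_tokenize(tweet))
```

One limitation of this pattern: punctuation typed directly after a URL is absorbed into the URL token.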

 

nltk.pos_tag processes the tokenized tweet and attaches a part-of-speech tag to each token, e.g. CD: cardinal numbers, NNS: plural nouns, NNP: singular proper nouns, etc.

 

>>> tagged = nltk.pos_tag(tokens)
>>> tagged
[('10', 'CD'), ('types', 'NNS'), ('of', 'IN'), ('regressions', 'NNS'), ('for', 'IN'), ('#', '#'), ('DataScientists', 'NNS'), (',', ','), ('#', '#'), ('Statisticians', 'NNS'), ('and', 'CC'), ('other', 'JJ'), ('#', '#'), ('Analytic', 'NNP'), ('practitioners', 'NNS'), (',', ','), ('which', 'WDT'), ('one', 'CD'), ('to', 'TO'), ('use', 'VB'), ('?', '.'), ('http', 'NN'), (':', ':'), ('//ow.ly/zsvs4', '-NONE-')]

 

Nouns can be printed using the tag information and functions from the regular expression package (https://docs.python.org/2/library/re.html).

 

>>> import re
>>> # Noun tags (NN, NNS, NNP, ...) all start with 'N'
>>> for word, tag in tagged:
...     if re.match('N', tag):
...         print(word)
...
types
regressions
DataScientists
Statisticians
Analytic
practitioners
http
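To go from tagged tweets to the input of a word cloud, the words have to be counted across all tweets and the most frequent ones kept. The post does not show this step, so the following is only a sketch: it assumes nouns are the words being counted, that case is ignored, and that the cut-off is 200 words. The name count_nouns is hypothetical.

```python
from collections import Counter


def count_nouns(tagged_tweets, top_n=200):
    """Count nouns across many tagged tweets and return the top_n most frequent.

    `tagged_tweets` is an iterable of lists of (word, tag) pairs, i.e. the
    shape returned by nltk.pos_tag for each tweet.
    """
    counts = Counter()
    for tagged in tagged_tweets:
        for word, tag in tagged:
            if tag.startswith("NN"):          # NN, NNS, NNP, NNPS
                counts[word.lower()] += 1     # assumption: case is ignored
    return counts.most_common(top_n)


# Two tiny pre-tagged tweets, written by hand for illustration.
example = [
    [("types", "NNS"), ("of", "IN"), ("Data", "NNP")],
    [("data", "NN"), ("use", "VB")],
]
print(count_nouns(example))
```

The result is a list of (word, count) pairs sorted from most to least frequent, which can be written to a file and handed to a plotting tool.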

 

The processed data was visualized using an R package called wordcloud (http://cran.r-project.org/web/packages/wordcloud/wordcloud.pdf).

Note

The word clouds show the 200 most frequent words in the tweets (less frequent terms were dropped). The font size of each word depends on how often it appears in the tweets. The word colors come from the ColorBrewer palettes, specifically “Dark2”: the frequency range of the 200 most frequent words was split into 8 groups, and words were colored from least to most frequent (see palette below).
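The mapping from frequency to color and size can be sketched in a few lines. The code below is one plausible reading of the note, not the wordcloud package's own algorithm: it assumes the 8 groups are equal-width bins across the frequency range and that font size scales linearly with frequency. The function names and the size limits 0.5 and 4.0 are assumptions; the hex values are the 8 colors of the ColorBrewer "Dark2" palette.

```python
# The 8 colors of the ColorBrewer "Dark2" palette, in palette order.
DARK2 = ["#1B9E77", "#D95F02", "#7570B3", "#E7298A",
         "#66A61E", "#E6AB02", "#A6761D", "#666666"]


def color_for(freq, min_freq, max_freq, palette=DARK2):
    """Pick a color by splitting [min_freq, max_freq] into equal-width bins."""
    if max_freq == min_freq:
        return palette[-1]
    width = (max_freq - min_freq) / len(palette)
    # The most frequent word would land one past the last bin, so clamp it.
    index = min(int((freq - min_freq) / width), len(palette) - 1)
    return palette[index]


def font_size(freq, min_freq, max_freq, smallest=0.5, largest=4.0):
    """Scale font size linearly between `smallest` and `largest` (assumed limits)."""
    if max_freq == min_freq:
        return largest
    share = (freq - min_freq) / (max_freq - min_freq)
    return smallest + share * (largest - smallest)
```

With equal-width bins, a cloud dominated by a handful of very frequent words will put most words in the first one or two colors; binning by rank instead would spread the colors more evenly.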

Reference

1. Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O’Reilly Media Inc.

Comment

Comment by Tim Bock on November 2, 2017 at 7:29pm

For R users creating a word cloud, there is a great new tool called Displayr: https://www.displayr.com/word-cloud/. The software works in R.

Comment by Livan Alonso on July 29, 2014 at 9:06am

Thanks Amy for your comment.

It would be great to implement the 2-token keywords suggestion. We did not do extensive filtering, but the algorithm and the word clouds can be improved as much as we wish.

We added more information about what colors and sizes represent in the post.

Thanks
