Twitter Weather Radar - Test Data for Language Analytics

By: Nicholas Hartman, Director at CKM Advisors

Today we'd like to share with you some fun charts that have come out of our internal linguistics research efforts. Specifically, we have been studying weather events by analyzing social media traffic from Twitter.

We do not specialize in social media, and most of our data analytics work focuses on the internal operations of leading organizations. Why then would we bother playing around with Twitter data? In short, because it's good practice. Twitter data presents many of the same challenges we face when analyzing the free text streams generated by complex processes. Specifically:

  • High Volume: The analysis represented here is looking at around 1 million tweets a day. In the grand scheme of things, that's not a lot, but we're intentionally running the analysis on a small server. That forces us to write code that rapidly assesses what's relevant to the question we're trying to answer and what's not. In this case the raw tweets were quickly tested live on receipt, with about 90% of them discarded. The remaining 10% were passed on to the analytics code.

  • Messy Language: A lot of text analytics exercises I've seen published use books and news articles as their testing ground. That's fine if you're trying to write code to analyze books or news articles, but most of the world's text is not written in such clean and polished prose. The types of text we encounter (e.g., worklogs from an IT incident management system) are full of slang, incomplete sentences and typos. Our language code needs to be good at determining the messages contained within this messy text.

  • Varying Signal to Noise: The incoming stream of tweets will always contain a certain percentage of data that isn't relevant to the item we're studying. For example, if a band member from One Direction tweets something even tangentially related to what some code is scanning for, the dataset can suddenly be overwhelmed with a lot of off-topic tweets. Real-world data similarly has a lot of unexpected noise.
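The "test live on receipt" step from the High Volume point can be sketched in Python as a cheap substring check on the raw JSON text, so that most lines are discarded before any parsing cost is paid. This is only an illustration: the keyword list and the function names `is_candidate` and `filter_stream` are hypothetical, and the post does not describe the test that was actually used.

```python
import json

# Hypothetical keyword list; the post does not say which terms were scanned for.
WEATHER_HINTS = ("snow", "rain", "sleet", "hail", "flurries")


def is_candidate(raw_line: str) -> bool:
    """Cheap substring test on the raw JSON text, before any parsing.

    Substring matching over-matches (e.g. "training" contains "rain"),
    which is acceptable here: later stages are expected to be stricter.
    """
    lowered = raw_line.lower()
    return any(hint in lowered for hint in WEATHER_HINTS)


def filter_stream(lines):
    """Yield parsed tweets only for lines that pass the cheap test."""
    for line in lines:
        if not is_candidate(line):
            continue  # discard without paying for json.loads
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed or truncated lines
```

Keeping the first test this crude is a deliberate trade-off in the sketch: it lets false positives through, but it is fast enough to run on every incoming line on a small server.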

In this exercise, tweets from Twitter's streaming API (delivered as a JSON stream) were scanned in near real-time for their ability to 1) be pinpointed to a specific location and 2) provide potential details on local weather conditions. The vast majority of tweets passing through our code did not satisfy both of these conditions and were set aside. The tweets that remained were analyzed to determine the type of precipitation being discussed.
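A minimal Python sketch of the two conditions and the precipitation labeling follows. It assumes tweets in the v1.1-style JSON layout, where the `coordinates` field holds a GeoJSON point in [longitude, latitude] order. The word lists and the function names are hypothetical; the actual language code behind the charts is not described in this post and is certainly more robust than keyword matching.

```python
import re

# Hypothetical vocabularies, chosen for illustration only.
SNOW_TERMS = {"snow", "snowing", "snowy", "flurries", "blizzard"}
RAIN_TERMS = {"rain", "raining", "rainy", "drizzle", "downpour"}


def extract_point(tweet):
    """Return (lat, lon) if the tweet carries an exact point, else None."""
    geo = tweet.get("coordinates")
    if not geo or geo.get("type") != "Point":
        return None
    lon, lat = geo["coordinates"]  # GeoJSON order is longitude, latitude
    return lat, lon


def classify_precipitation(text):
    """Return "snow", "rain", or None when absent or ambiguous."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    has_snow = bool(words & SNOW_TERMS)
    has_rain = bool(words & RAIN_TERMS)
    if has_snow and not has_rain:
        return "snow"
    if has_rain and not has_snow:
        return "rain"
    return None  # no weather term, or both kinds mentioned


def analyze(tweet):
    """Apply both conditions; return (lat, lon, label) or None."""
    point = extract_point(tweet)
    if point is None:
        return None  # condition 1 failed: no usable location
    label = classify_precipitation(tweet.get("text", ""))
    if label is None:
        return None  # condition 2 failed: no clear weather detail
    return point[0], point[1], label
```

For example, a tweet reading "It's snowing hard in Philly!" with a point at longitude -75.16, latitude 39.95 would come back as `(39.95, -75.16, "snow")`, ready to be plotted as a blue point.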

The figure at the top of this post shows a summary of the analysis for the afternoon of 14 December 2013. Around this time a major storm system was moving up the eastern seaboard, dumping heavy rain and snow along I-95. Twitter commentary indicating locally snowy conditions is displayed in blue, while commentary indicating rainy conditions is displayed in green. The 'rain/snow' line that extended from New York City down towards Philadelphia and Washington DC is clearly visible. There are some anomalies (like the blue in southern CA and FL), but the snow noise is small relative to the signal coming out of the northeast.

To see additional graphics, including timelapse animations of the T...

Tags: analysis, api, gps, json, language, r, twitter
