Clustering Arabic Tweets to Gauge the Reputation of a Brand in the Middle East

This is a technical post. Its purpose is to test topic modeling techniques with Python on Arabic texts, in order to assess how well the approach used in my previous work (NASDAG.org) carries over to a different language.

The same code can be applied as is to any brand by changing the keywords used when querying the Twitter API.

My approach is the following:

  1. Use the Twitter API to extract up to 500 Arabic tweets using selected keywords related to a brand (I will use Renault, "رينو", in this example)
  2. Save the tweets into a Mongo database
  3. Filter out retweets, as well as Arabic and English stop-words
  4. Tokenize (using words, bigrams and trigrams)
  5. Vectorize (using normalised tf-idf)
  6. Reduce dimensionality
  7. Apply Agglomerative Clustering or Latent Dirichlet Allocation techniques in order to identify relevant topics
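
As a rough illustration of steps 3 to 5, here is a minimal sketch in plain Python. It is not the code used in this work: the stop-word lists are tiny placeholders, the "RT @" prefix test is just one common heuristic for spotting retweets, the sample tweets are made up, and the helper names (`is_retweet`, `tokenize`, `tfidf_vectors`) are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word lists (assumptions, far from complete).
ARABIC_STOP = {"في", "من", "على", "عن", "إلى", "هذا"}
ENGLISH_STOP = {"the", "a", "of", "in", "is", "to"}
STOP_WORDS = ARABIC_STOP | ENGLISH_STOP


def is_retweet(text):
    """Treat tweets starting with 'RT @' as retweets (a heuristic)."""
    return text.lstrip().startswith("RT @")


def tokenize(text, max_n=3):
    """Return words, bigrams and trigrams after removing stop-words."""
    # \w is Unicode-aware for str patterns, so it matches Arabic letters too.
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams


def tfidf_vectors(docs):
    """Return one L2-normalised tf-idf dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf, so terms present in every document keep some weight.
        weights = {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Made-up sample tweets, for illustration only.
tweets = [
    "RT @someone: رينو تطلق سيارة جديدة",
    "رينو تطلق سيارة جديدة في السوق",
    "أسعار رينو في مصر",
]
kept = [t for t in tweets if not is_retweet(t)]   # step 3: drop retweets
docs = [tokenize(t) for t in kept]                # steps 3-4: stop-words, n-grams
vectors = tfidf_vectors(docs)                     # step 5: normalised tf-idf
```

Each entry of `vectors` is a sparse dictionary mapping an n-gram to its weight. In practice these would be turned into a matrix before the dimensionality reduction and clustering of steps 6 and 7.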

Follow this link to learn more about this approach. You can also contact me for further explanation if you are interested in applying it to your own brand by analyzing a large volume of Arabic text.

Philippe
