There are many lists of top data scientists on Twitter. Here we comment on one of them, published on BigData-MadeSimple; most other lists have similar drawbacks. It has been argued that the overlap between my Top 25 data scientists on LinkedIn and the Twitter list is small because the data scientists active on LinkedIn are not the same people as those active on Twitter (or maybe on Quora or Google+).
But is this the real explanation? It turns out that Twitter lists typically suffer from big gaps (my LinkedIn list has big gaps too, but at least I mention the gaps and biases). This Twitter list misses people who should be in the top three.
So how do you create a good list (with no big gaps)? The answer is simple: use data science! More explicitly, here are 7 tips to make a good list (and we'll share our list with you when we are done).
7 tips to make great lists of popular data scientists
- Do not rank people; put them in alphabetical order (the rankings in the list below look arbitrary)
- Use sound metrics to identify top data scientists: people tweeting about data science, machine learning, Hadoop etc. (assuming you know all popular terms related to data science; if not, read this)
- Focus on robust metrics, such as number of followers (difficult to fake), rather than soft metrics such as number of tweets, which are easy to fake
- Filter out bad data: people with a sudden massive spike in number of followers (likely fake followers purchased on the black market, especially if there's no matching spike in number of tweets and none of the followers are popular data scientists). Filter out people who are not data scientists (recruiters, for example) but occasionally or always tweet about data science. People whose tweets produce very few retweets are not influencers.
- Check profiles over time: an old profile that stopped growing 3 years ago should score lower than a recent profile that is growing fast
- Using hashtags to identify popular data scientists is not a great idea: you will miss all tweeters who rarely use hashtags or who use hashtags that you don't know; you must combine hashtags with other metrics to identify popular people
- Identify automated vs. manual tweets; most big accounts have some automated tweets, but if more than 90% of the tweets are automated, then we are dealing with a bot, not a human
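Several of these tips translate directly into filtering rules. The following is a minimal sketch in Python, not a tested production pipeline: the `Account` record, its field names, the 50% monthly growth threshold used to flag a follower spike, and the minimum follower count are all assumptions made for illustration. Only the 90% automated-tweet cutoff and the alphabetical ordering come from the tips above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Account:
    """Hypothetical record for one Twitter account (field names are assumptions)."""
    handle: str
    followers_by_month: List[int]  # monthly follower snapshots, oldest first
    tweets: int                    # total number of tweets
    automated_tweets: int          # tweets identified as automated


def has_follower_spike(acc: Account, max_monthly_growth: float = 0.5) -> bool:
    """Flag accounts whose followers grew by more than 50% in a single month.

    The 50% threshold is an arbitrary assumption; tune it on real data.
    """
    counts = acc.followers_by_month
    for prev, cur in zip(counts, counts[1:]):
        if prev > 0 and (cur - prev) / prev > max_monthly_growth:
            return True
    return False


def is_bot(acc: Account, threshold: float = 0.9) -> bool:
    """More than 90% automated tweets: treat the account as a bot."""
    return acc.tweets > 0 and acc.automated_tweets / acc.tweets > threshold


def shortlist(accounts: List[Account], min_followers: int = 1000) -> List[str]:
    """Drop spiky, bot-like, and small accounts; return handles alphabetically."""
    kept = [
        a for a in accounts
        if a.followers_by_month[-1] >= min_followers
        and not has_follower_spike(a)
        and not is_bot(a)
    ]
    # Alphabetical order instead of a ranking (tip 1).
    return sorted((a.handle for a in kept), key=str.lower)


# Made-up accounts, for illustration only.
accounts = [
    Account("@zeta", [1000, 1100, 1200], tweets=500, automated_tweets=100),
    Account("@alpha", [2000, 2100, 2150], tweets=300, automated_tweets=30),
    Account("@spiky", [1000, 1050, 5000], tweets=200, automated_tweets=10),
    Account("@botty", [3000, 3100, 3200], tweets=1000, automated_tweets=950),
]
print(shortlist(accounts))  # ['@alpha', '@zeta']
```

A real implementation would also need the topic filter (who actually tweets about data science), retweet counts, and a check on whether a follower spike coincides with a spike in tweets; those are left out here to keep the sketch short.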
One of the people not listed in the initial list of 33 data scientists (below) is @analyticbridge; it has more followers (22,900) than pretty much all the people listed below, and it is growing faster than many of them, currently faster than (say) @kdnuggets (both accounts reached 20,000 followers in almost the same week, a few weeks ago). I'm sure many are missing for the same reason: they don't use their name in their profile; @analyticbridge uses 'big data science' rather than his name. Of course, you need to filter out commercial accounts owned by corporations when considering these types of accounts, but that should be easy, using white lists of commercial Twitter accounts.
So who's @analyticbridge? Who else is missing? I'll leave it to you to find out, but we hope to provide a list of our own soon, to fill the gap. Of course, @analyticbridge must be someone with a small ego, as he's not interested in having his name published. In an era of privacy scares, not using your real name could be a good strategy.
- Besides making your list accurate, including very popular, highly connected data scientists will get your list of "top data science tweeters" re-tweeted and shared by the most connected thought leaders (the ones you did not miss!), potentially multiplying the traffic volume to your web site by a factor of ten. From a journalistic point of view, leaving these people out means missing a big opportunity for free traffic to your website.
- Another Twitter account, created by my business partner (and cofounder of Data Science Central), is @DataScienceCtrl. It has close to 7,000 followers and is growing even faster than @AnalyticBridge, so it could also fit in the top 33. However, we view this account more as a business account. Also, its tweets will soon be mostly automated, and we hope that it will become the second best source of automated tweets about data science (we expect the first source to be a secret project that we are currently working on).
BigData-MadeSimple list of top 33 data scientists
They added @analyticbridge in position #34 after I mentioned the issue. The number after each handle is the number of followers as of today; @analyticbridge (not in the original list) has 22,900.
- Hilary Mason @hmason - 44,600
- John Myles White @johnmyleswhite - 8,573
- Peter Skomoroch @peteskomoroch - 18,100
- Gregory Piatetsky @kdnuggets - 21,000
- Ryan Rosario @DataJunkie - 8,794
- DJ Patil @dpatil - 18,300
- Jeff Hammerbacher @hackingdata - 16,300
- David Smith @revodavid - 11,200
- Christopher D. Long @octonion - 11,300
- Carla Gentry @data_nerd - 13,400
- Ben Lorica @bigdata - 18,900
- Siah @siah - 4,876
- Ferenc Huszar @fhuszar - 2,113
- Drew Conway @drewconway - 9,902
- Michael Wu Ph.D. @mich8elwu - 8,332
- Matt Wood @mza - 6,643
- Olivier Grisel @ogrisel - 7,328
- Josh Wills @josh_wills - 5,714
- John Foreman @John4man - 9,151
- Jake Porway @jakeporway - 7,088
- Andrew Ng @AndrewYNg - 18,500
- Eric Xu @mathena - 10,400
- Monica Rogati @mrogati - 9,342
- P. Oscar Boykin @posco - 4,640
- Benedikt Koehler @furukama - 6,449
- David Gutelius @gutelius - 2,413
- Marck Vaisman @wahalulu - 1,340
- Andreas Weigend @aweigend - 2,413
- Amy Heineike @aheineike - 1,561
- Sebastian Thrun @SebastianThrun - 24,300
- Jen Lowe @datatelling - 4,558
- Doug Cutting @cutting - 10,500
- Kirk Borne @KirkDBorne - 12,600
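The claim that @analyticbridge has more followers than pretty much everyone on this list can be checked directly against the numbers above. The short Python sketch below copies the 33 follower counts from the list and counts how many accounts are ahead of 22,900; the variable names are arbitrary, and the counts are only as current as the list itself.

```python
# Follower counts copied from the list above (33 accounts).
followers = {
    "@hmason": 44600, "@johnmyleswhite": 8573, "@peteskomoroch": 18100,
    "@kdnuggets": 21000, "@DataJunkie": 8794, "@dpatil": 18300,
    "@hackingdata": 16300, "@revodavid": 11200, "@octonion": 11300,
    "@data_nerd": 13400, "@bigdata": 18900, "@siah": 4876,
    "@fhuszar": 2113, "@drewconway": 9902, "@mich8elwu": 8332,
    "@mza": 6643, "@ogrisel": 7328, "@josh_wills": 5714,
    "@John4man": 9151, "@jakeporway": 7088, "@AndrewYNg": 18500,
    "@mathena": 10400, "@mrogati": 9342, "@posco": 4640,
    "@furukama": 6449, "@gutelius": 2413, "@wahalulu": 1340,
    "@aweigend": 2413, "@aheineike": 1561, "@SebastianThrun": 24300,
    "@datatelling": 4558, "@cutting": 10500, "@KirkDBorne": 12600,
}

ANALYTICBRIDGE_FOLLOWERS = 22900  # figure quoted in the text

# Accounts on the list with more followers than @analyticbridge.
ahead = sorted(h for h, n in followers.items() if n > ANALYTICBRIDGE_FOLLOWERS)
behind = len(followers) - len(ahead)

print(len(followers))  # 33
print(ahead)           # ['@SebastianThrun', '@hmason']
print(behind)          # 31
```

By follower count alone, @analyticbridge would sit third among the 34 accounts, behind only @hmason and @SebastianThrun.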