Top 20 Big Data Experts to Follow (Includes Scoring Algorithm)

This article has two parts: 

  • Listing the top 20 experts in reverse rank order, along with each expert's Twitter handle, number of Twitter followers, and Klout score. We hope to soon see a woman among the top 10. The top woman is currently #11.
  • Discussing a robust methodology to score experts

1. The top 20

This is a subset of a bigger list published here. Note that our data scientist is ranked #6.

Rank  Name               Twitter          Followers  Score
20    Bernard Marr       BernardMarr      86K        66.5
19    Jeremy Waite       jeremywaite      93K        67.5
18    R Ray Wang         wang0            80K        67.6
17    Hadley Wickham     hadleywickham    23K        68
16    Mike Briercliffe   mikejulietbravo  54K        68.5
15    Evan Sinar         EvanSinar        29K        68.6
14    Bob E. Hayes       bobehayes        5K         68.65
13    Dez Blanchfield    dez_blanchfield  77K        68.7
12    Andrew Ng          andrewng         48K        69.5
11    Hilary Mason       hmason           68K        70
10    Gregory Piatetsky  kdnuggets        48K        70.35
9     Ronald van Loon    Ronald_vanLoon   29K        71.5
8     Hans Rosling       HansRosling      296K       72.05
7     Randy Olson        randal_olson     80K        73
6     Vincent Granville  analyticbridge   128K       73.5
5     Timothy Hughes     Timothy_Hughes   134K       73.6
4     Kirk Borne         kirkdborne       58K        74
3     Vala Afshar        ValaAfshar       101K       78.5
2     Simon Porter       simonlporter     66K        80.5
1     Nate Silver        NateSilver538    1328K      81

2. Proposed Algorithm to Score Experts

Scores can measure many things: popularity, how influential someone is in a specific domain, and so on. We have created various lists over the past few years, typically with a goal different from that of journalists: rewarding expertise and the volume of quality publications and references over traditional popularity metrics. We have built various lists of top data science / big data experts:

You should check these three lists and the associated literature, not just out of curiosity, but to discover the methodology used in each case: a methodology designed by a real data scientist, not a black-box tool used by a journalist. Thus our lists are robust, sound and unbiased – or at least the bias is known and disclosed.

Since we have seen lists in the past where the #1 expert was irrelevant, here we propose a three-step methodology to build lists and compute scores:

Step #1: Categorize sub-domains (of big data, data science, etc.)

Break the domain into sub-domains. For instance, we established a while back that 

Data Science = 0.24 * Data Mining + 0.15 * Machine Learning + 0.14 * Analytics + 0.11 * Big Data

Read this paper to learn about the methodology used to arrive at this equation. Note that weights and even sub-domains evolve over time. And these sub-domains overlap, though that’s not difficult to handle.
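As a rough illustration, the equation above can be stored as a plain weight table. The sketch below uses names of our own choosing (SUBDOMAIN_WEIGHTS, normalize), not names from the paper. It shows that the four listed weights sum to 0.64, and how one might rescale them to sum to 1 so that a blended score stays on the same scale as its inputs. Rescaling is an assumption made here for illustration, not a step described in the methodology.

```python
# Weights copied from the equation above; all names are illustrative.
SUBDOMAIN_WEIGHTS = {
    "data mining": 0.24,
    "machine learning": 0.15,
    "analytics": 0.14,
    "big data": 0.11,
}


def normalize(weights):
    """Rescale weights so they sum to 1 (an assumption, not part of the source equation)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return {name: w / total for name, w in weights.items()}


total_weight = sum(SUBDOMAIN_WEIGHTS.values())
normalized = normalize(SUBDOMAIN_WEIGHTS)

if __name__ == "__main__":
    print(f"sum of listed weights: {total_weight:.2f}")
    for name, w in normalized.items():
        print(f"{name}: {w:.3f}")
```

Because weights and sub-domains evolve over time, keeping them in one table like this makes it easy to swap in an updated equation.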

Step #2: Categorize experts, and score by sub-domains

Start with a large list of experts, and make sure you are not missing any big ones (I have seen lists that were missing the number one expert).

Then categorize these experts according to pre-selected sub-domains (big data, machine learning, and so on in this case). This is performed by

  • scraping tons of tweets or blog posts from these experts (or better, from high-score people talking about these experts),
  • creating keyword frequency tables,
  • extracting (for each expert) keywords associated with the sub-domains,
  • and eventually clustering these experts by sub-domains.

This is done using an indexation algorithm. We have used an indexation (or automated tagging) algorithm in a very similar context, to assign sub-categories to 2,500 data science blogs. The methodology is explained in detail here. If the data is well structured, you can proceed as here: we were able to determine that Gregory Piatetsky-Shapiro and Vincent Granville belong to the same cluster, while Kirk Borne and Monica Rogati belong to another, machine-learning-heavy cluster.
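The keyword-table part of this step can be sketched in a few lines of Python. This is a simplified stand-in, not the indexation algorithm referenced above: the keyword lists, the sample posts, and the function names are all hypothetical, and each expert is simply assigned to the sub-domain with the most keyword hits rather than clustered.

```python
import re
from collections import Counter

# Hypothetical keyword lists per sub-domain; a real system would curate or learn these.
SUBDOMAIN_KEYWORDS = {
    "machine learning": {"neural", "classifier", "training", "model"},
    "big data": {"hadoop", "spark", "cluster", "petabyte"},
    "analytics": {"dashboard", "kpi", "reporting", "metrics"},
}


def keyword_counts(texts):
    """Build a keyword frequency table from a list of tweets or posts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def subdomain_profile(texts):
    """Return the share of matched keywords that falls in each sub-domain."""
    counts = keyword_counts(texts)
    hits = {d: sum(counts[k] for k in kws) for d, kws in SUBDOMAIN_KEYWORDS.items()}
    total = sum(hits.values())
    if total == 0:
        return {d: 0.0 for d in hits}
    return {d: h / total for d, h in hits.items()}


def dominant_subdomain(texts):
    """Pick the sub-domain with the largest share, or None if nothing matched."""
    profile = subdomain_profile(texts)
    best = max(profile, key=profile.get)
    return best if profile[best] > 0 else None


# Made-up posts, for illustration only.
sample_posts = {
    "expert_a": ["Training a neural classifier on new data", "Which model generalizes best?"],
    "expert_b": ["Scaling a Spark cluster", "Hadoop is not dead"],
}
assignments = {name: dominant_subdomain(posts) for name, posts in sample_posts.items()}

if __name__ == "__main__":
    print(assignments)
```

The per-expert profile returned by subdomain_profile could also be fed to a proper clustering routine; the single-label assignment here is only the simplest possible choice.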

Note: Klout scores (actually ranks) are also available at the sub-domain level; click here for details.

Step #3: Blend scores across sub-domains

Blend the scores obtained at the sub-domain level (in step #2) using the blending formula obtained in step #1.
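A minimal sketch of this blending step, assuming each expert already has one score per sub-domain on a common scale. The weights are the ones from the step #1 equation; the expert names, their sub-domain scores, and the function name blended_score are made up for illustration. Treating a missing sub-domain score as 0 is also an assumption.

```python
# Weights from the step #1 equation.
WEIGHTS = {
    "data mining": 0.24,
    "machine learning": 0.15,
    "analytics": 0.14,
    "big data": 0.11,
}


def blended_score(subdomain_scores, weights=WEIGHTS):
    """Weighted sum of sub-domain scores; a missing sub-domain counts as 0 (assumption)."""
    return sum(w * subdomain_scores.get(d, 0.0) for d, w in weights.items())


# Made-up sub-domain scores on a 0-100 scale, for illustration only.
experts = {
    "expert_a": {"data mining": 80, "machine learning": 60, "analytics": 70, "big data": 50},
    "expert_b": {"data mining": 40, "machine learning": 90, "analytics": 50, "big data": 90},
}

scores = {name: blended_score(s) for name, s in experts.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    for name in ranking:
        print(f"{name}: {scores[name]:.1f}")
```

In this toy example, expert_a ranks first despite lower machine learning and big data scores, because data mining carries the largest weight in the equation.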

Caveat: Experts that do not tweet or publish much might not have sub-domain scores that are statistically significant. This can be handled by computing an aggregated score across sub-domains, and ignoring the sub-domain scores. Statistical significance, at the score level, can be computed using the following method
