Top 20 Big Data Experts to Follow (Includes Scoring Algorithm)

This article has two parts:

Listing the top 20 experts in reverse rank order, along with their Twitter handle, number of Twitter followers, and Klout score. We hope to soon see a woman among the top 10. The top woman is currently #11.
Discussing a robust methodology to score experts


1. The top 20

This is a subset of a bigger list published here. Note that our data scientist is ranked #6.

Rank	Name	Twitter Handle	Twitter Followers	Score

20	Bernard Marr	BernardMarr	86K	66.5
19	Jeremy Waite	jeremywaite	93K	67.5
18	R Ray Wang	rwang0	80K	67.6
17	Hadley Wickham	hadleywickham	23K	68
16	Mike Briercliffe	mikejulietbravo	54K	68.5
15	Evan Sinar	EvanSinar	29K	68.6
14	Bob E. Hayes	bobehayes	5K	68.65
13	Dez Blanchfield	dez_blanchfield	77K	68.7
12	Andrew Ng	andrewng	48K	69.5
11	Hilary Mason	hmason	68K	70
10	Gregory Piatetsky	kdnuggets	48K	70.35
9	Ronald van Loon	Ronald_vanLoon	29K	71.5
8	Hans Rosling	HansRosling	296K	72.05
7	Randy Olson	randal_olson	80K	73
6	Vincent Granville	analyticbridge	128K	73.5
5	Timothy Hughes	Timothy_Hughes	134K	73.6
4	Kirk Borne	kirkdborne	58K	74
3	Vala Afshar	ValaAfshar	101K	78.5
2	Simon Porter	simonlporter	66K	80.5
1	Nate Silver	NateSilver538	1328K	81

2. Proposed Algorithm to Score Experts

Scores can measure many things: popularity, how influential someone is in a specific domain, and so on. We have worked on creating various lists over the past few years, typically with a goal different from that of journalists: rewarding expertise and the volume of quality publications and references over traditional popularity metrics. We have built various lists of top data science / big data experts:

You should check these three lists and the associated literature, not just out of curiosity, but to discover the methodology used in each case: a methodology designed by a real data scientist, not a black-box tool used by a journalist. Thus our lists are robust, sound and unbiased – or at least the bias is known and disclosed.

Since we have seen lists in the past where the #1 expert was irrelevant, here we propose a three-step methodology to build lists and compute scores:

Step #1: Categorize sub-domains (of big data, data science, etc.)

Break the domain into sub-domains. For instance, we established a while back that

Data Science = 0.24 * Data Mining + 0.15 * Machine Learning + 0.14 * Analytics + 0.11 * Big Data

Read this paper to learn about the methodology used to arrive at this equation. Note that weights and even sub-domains evolve over time. And these sub-domains overlap, though that’s not difficult to handle.

Step #2: Categorize experts, and score by sub-domains

Start with a large list of experts, and make sure you are not missing any big ones (I have seen lists that were missing the number one expert).

Then categorize these experts according to pre-selected sub-domains (big data, machine learning, and so on in this case). This is performed by

scraping tons of tweets or blog posts from these experts (or better, from high-score people talking about these experts),
creating keyword frequency tables,
extracting (for each expert) keywords associated with the sub-domains,
and eventually clustering these experts by sub-domains.

This is done using an indexation algorithm. We have used an indexation (or automated tagging) algorithm in a very similar context, to assign sub-categories to 2,500 data science blogs. The methodology is explained in detail here. If the data is well structured, you can proceed as here: we were able to determine that Gregory Piatetsky-Shapiro and Vincent Granville belong to the same cluster, while Kirk Borne and Monica Rogati belong to another, machine-learning-heavy cluster.
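
As a rough illustration of the keyword-based categorization described above, here is a minimal Python sketch. It is not the indexation algorithm referenced in this article: the keyword lists, the function names, and the rule of assigning each expert to a single dominant sub-domain (a simplification of a full clustering) are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical keyword lists per sub-domain (illustrative only, not from the article).
SUBDOMAIN_KEYWORDS = {
    "data mining": {"mining", "clustering", "patterns"},
    "machine learning": {"learning", "neural", "classifier"},
    "analytics": {"analytics", "dashboard", "kpi"},
    "big data": {"hadoop", "spark", "petabyte"},
}


def keyword_frequencies(texts):
    """Build a keyword frequency table from an expert's tweets or blog posts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def subdomain_profile(texts):
    """Return the share of keyword hits that falls in each sub-domain."""
    counts = keyword_frequencies(texts)
    hits = {
        name: sum(counts[word] for word in words)
        for name, words in SUBDOMAIN_KEYWORDS.items()
    }
    total = sum(hits.values())
    if total == 0:
        # No sub-domain keyword found: return an all-zero profile.
        return {name: 0.0 for name in hits}
    return {name: n / total for name, n in hits.items()}


def dominant_subdomain(texts):
    """Assign the expert to the sub-domain with the largest share, if any."""
    profile = subdomain_profile(texts)
    best = max(profile, key=profile.get)
    return best if profile[best] > 0 else None
```

For instance, `dominant_subdomain(["Deep learning with a neural net", "Spark job tuning"])` counts two machine learning hits and one big data hit, so the expert lands in the machine learning group. Experts whose profiles look alike could then be grouped together with any standard clustering method.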

Note: Klout scores (actually ranks) are also available at the sub-domain level; click here for details.

Step #3: Blend scores across sub-domains

Blend the scores obtained at the sub-domain level (in step #2) using the blending formula obtained in step #1.
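
The blending step can be sketched in Python as follows. The weights are the ones from the equation in step #1. The function name, the score values in the usage note, and the choice to divide by the sum of the weights (they add up to 0.64, so dividing keeps the blended score on the same scale as the sub-domain scores) are assumptions made for this example, not part of the original methodology.

```python
# Weights taken from the equation in step #1.
WEIGHTS = {
    "data mining": 0.24,
    "machine learning": 0.15,
    "analytics": 0.14,
    "big data": 0.11,
}


def blend(subdomain_scores, weights=WEIGHTS, normalize=True):
    """Blend sub-domain scores into one score using the step #1 weights.

    Only sub-domains present in `subdomain_scores` are used. With
    normalize=True (an assumption of this sketch), the weighted sum is
    divided by the sum of the weights actually used.
    """
    used = {name: w for name, w in weights.items() if name in subdomain_scores}
    if not used:
        raise ValueError("no sub-domain score matches the known weights")
    weighted_sum = sum(w * subdomain_scores[name] for name, w in used.items())
    return weighted_sum / sum(used.values()) if normalize else weighted_sum
```

With hypothetical scores such as `{"data mining": 80, "machine learning": 60}`, the call `blend(...)` gives (0.24 * 80 + 0.15 * 60) / 0.39, or about 72.3. Passing `normalize=False` returns the raw weighted sum, which applies the equation literally.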

Caveat: Experts that do not tweet or publish much might not have sub-domain scores that are statistically significant. This can be handled by computing an aggregated score across sub-domains, and ignoring the sub-domain scores. Statistical significance, at the score level, can be computed using the following method.
