This article has two parts:
1. The top 20
|18||R Ray Wang||wang0||80K||67.6|
|14||Bob E. Hayes||bobehayes||5K||68.65|
|9||Ronald van Loon||Ronald_vanLoon||29K||71.5|
2. Proposed Algorithm to Score Experts
Scores can measure many things: popularity, how influential someone is in a specific domain, and so on. We have worked on creating various lists over the past few years, typically with a goal different from that of journalists: rewarding expertise and the volume of quality publications and references over traditional popularity metrics. We have built various lists of top data science / big data experts:
You should check these three lists and the associated literature, not just out of curiosity, but to discover the methodology used in each case: a methodology designed by a real data scientist, not a black-box tool used by a journalist. Thus our lists are robust, sound and unbiased - or at least the bias is known and disclosed.
Since we have seen lists in the past where the #1 expert was irrelevant, here we propose a three-step methodology to build lists and compute scores:
Step #1: Categorize sub-domains (of big data, data science, etc.)
Break the domain into sub-domains. For instance, we established a while back that
Data Science = 0.24 * Data Mining + 0.15 * Machine Learning + 0.14 * Analytics + 0.11 * Big Data
Read this paper to learn about the methodology used to arrive at this equation. Note that weights and even sub-domains evolve over time. And these sub-domains overlap, though that's not difficult to handle.
Step #2: Categorize experts, and score by sub-domains
Start with a large list of experts, and make sure you are not missing any big ones (I have seen lists that were missing the number one expert).
Then categorize these experts according to pre-selected sub-domains (big data, machine learning, and so on in this case).
This is done using an indexation algorithm. We have used an indexation (or automated tagging) algorithm in a very similar context, to assign sub-categories to 2,500 data science blogs. The methodology is explained in detail here. If the data is well structured, you can proceed as here: we were able to determine that Gregory Piatetsky-Shapiro and Vincent Granville belong to the same cluster, while Kirk Borne and Monica Rogati belong to another, machine-learning-heavy cluster.
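To make the tagging idea concrete, here is a minimal sketch of a keyword-based indexation pass over an expert's posts or tweets. It is not the algorithm used for the 2,500 blogs: the keyword lists, the function names `tag_counts` and `sub_domain_shares`, and the use of simple hit counts are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical keyword lists per sub-domain; a real index would be far larger.
KEYWORDS = {
    "data_mining": ["data mining", "clustering", "association rules"],
    "machine_learning": ["machine learning", "neural network", "deep learning"],
    "analytics": ["analytics", "dashboard", "kpi"],
    "big_data": ["big data", "hadoop", "spark"],
}

def tag_counts(texts, keywords=KEYWORDS):
    """Count keyword hits per sub-domain across an expert's posts or tweets."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for domain, terms in keywords.items():
            for term in terms:
                # Word-boundary match so "spark" does not match "sparkle".
                pattern = r"\b" + re.escape(term) + r"\b"
                counts[domain] += len(re.findall(pattern, lowered))
    return counts

def sub_domain_shares(texts, keywords=KEYWORDS):
    """Turn raw hit counts into shares that sum to 1 (or all zeros if no hits)."""
    counts = tag_counts(texts, keywords)
    total = sum(counts.values())
    return {d: (counts[d] / total if total else 0.0) for d in keywords}

# Illustrative input, not real data.
posts = ["Deep learning on Spark", "Big data and machine learning with Hadoop"]
shares = sub_domain_shares(posts)
```

The resulting shares can serve as a rough sub-domain profile for each expert. Experts with similar profiles would then fall into the same cluster.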
Note: Klout scores (actually ranks) are also available at the sub-domain level, click here for details.
Step #3: Blend scores across sub-domains
Blend the scores obtained at the sub-domain level (in step #2) using the blending formula obtained in step #1.
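As a rough illustration of the blending step, the sketch below applies the weights from the equation in step #1 to per-sub-domain scores. The expert names and scores are made up. Dividing by the sum of the weights (0.64) is an assumption added here so that the blended score stays on the same scale as the inputs; the equation itself does not specify it.

```python
# Weights taken from the equation in step #1 (they sum to 0.64, not 1).
WEIGHTS = {
    "data_mining": 0.24,
    "machine_learning": 0.15,
    "analytics": 0.14,
    "big_data": 0.11,
}

def blend(sub_scores, weights=WEIGHTS, normalize=True):
    """Weighted sum of sub-domain scores; missing sub-domains count as 0."""
    total = sum(w * sub_scores.get(d, 0.0) for d, w in weights.items())
    if normalize:
        # Assumption: rescale so the result is on the same scale as the inputs.
        total /= sum(weights.values())
    return total

# Hypothetical sub-domain scores on a 0-100 scale (illustrative, not real data).
experts = {
    "expert_a": {"data_mining": 80, "machine_learning": 60,
                 "analytics": 70, "big_data": 50},
    "expert_b": {"data_mining": 40, "machine_learning": 90,
                 "analytics": 50, "big_data": 85},
}

# Rank experts by blended score, highest first.
ranking = sorted(experts, key=lambda name: blend(experts[name]), reverse=True)
```

With these weights, an expert who is strong in data mining outranks one who is strong in machine learning and big data, which shows how much the result depends on the weights chosen in step #1.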
Caveat: Experts who do not tweet or publish much might not have statistically significant sub-domain scores. In that case, compute a single score aggregated across sub-domains and ignore the sub-domain scores. Statistical significance, at the score level, can be computed using the following method.
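One possible way to implement this fallback is sketched below. It uses a plain sample-size cutoff as a stand-in for a proper significance test, so it is not the method referenced above. The threshold `MIN_ITEMS`, the function name `robust_scores`, and the use of a mean over per-item scores are all hypothetical choices.

```python
from statistics import mean

MIN_ITEMS = 30  # hypothetical cutoff; not a value from the article

def robust_scores(items_by_domain, min_items=MIN_ITEMS):
    """Return per-sub-domain means if each has enough items, else one pooled mean.

    items_by_domain maps a sub-domain name to a list of per-item scores
    (for example, one score per tweet or article).
    """
    pooled = [x for items in items_by_domain.values() for x in items]
    if not pooled:
        return {"level": "none", "score": None}
    if all(len(items) >= min_items for items in items_by_domain.values()):
        return {
            "level": "sub_domain",
            "scores": {d: mean(items) for d, items in items_by_domain.items()},
        }
    # Too little data in at least one sub-domain: ignore the breakdown.
    return {"level": "aggregate", "score": mean(pooled)}

# Illustrative input for an expert who publishes little (not real data).
sparse = {"big_data": [50, 70], "analytics": [60]}
result = robust_scores(sparse)
```

For the sparse expert above, the function pools all three items into a single aggregated score instead of reporting two unreliable sub-domain scores.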