I present here the results of a data science study about data science. Based on LinkedIn data (top people listed when you do a people search for data science, from a LinkedIn account with 8,000+ data science connections), we identified the fields most frequently associated with data science, as well as top data scientists on LinkedIn.
The statistical validity of the data science related fields is strong, while validity is weak for the list of top data scientists. The reason is that you need at least 10 endorsements for data science in the skills section of your LinkedIn profile to be listed as a top data scientist in the following list. Pioneering data scientists such as Davenport are not listed because they don't bother adding data science skills to their LinkedIn profile, for the same reason that you are not listed in top big data people lists based on Twitter hashtags if you don't use hashtags in your tweets, or if you do not tweet.
Skill Set Found on Profile of Most Popular Data Scientist on LinkedIn
The following lists were created by searching for data scientists with 10+ data science skill endorsements on LinkedIn (see above image as an illustration), and analyzing the top 5 skills that they list, as in the above picture.
Here's the list of data science related fields (DS stands for Data Science):
Data science formula
In short, you could write the following:
Data Science = 0.24 * Data Mining + 0.15 * Machine Learning + 0.14 * Analytics + 0.11 * Big Data + ...
Note that, surprisingly, Visualization does not appear at the top. Explanation: Visualization, just like Unix, is perceived either as a tool (e.g. Tableau) or sometimes as a soft skill, rather than a hard skill, technique or field. It is thus frequently mentioned as a skill on LinkedIn profiles, but not in the top 5. Computer Science is also missing, probably because it is too broad; instead, people list in their profile a narrower field, such as data mining or machine learning.
The following table shows how (from a quantitative point of view) related skills contribute to Data Science, broken down per skill and per contributor. It is used to compute the above summary table. For instance, the first row below reads as follows: Monica Rogati lists data science as skill #3 in her LinkedIn profile; she is endorsed by 61 people for data science, and by 106 people for Machine Learning. The machine learning contribution to data science, coming from her, is 106 / 3 = 35.33. The people listed in the following table are data science pioneers, among the top data scientists according to LinkedIn.
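The per-person computation described above can be sketched in a few lines of Python. Only the Monica Rogati figures (data science as skill #3, 106 Machine Learning endorsements) come from this article; the second profile, the function names and the dictionary layout are hypothetical. The final normalization step assumes that the coefficients in the formula are the per-skill totals divided by the grand total, which is not spelled out here.

```python
from collections import defaultdict

def contributions(profiles):
    """Sum Endorsements(person, skill) / DS_Skill_Rank(person) for each skill."""
    totals = defaultdict(float)
    for person in profiles:
        for skill, endorsements in person["skills"].items():
            totals[skill] += endorsements / person["ds_rank"]
    return dict(totals)

def normalize(totals):
    """Rescale per-skill totals so the weights sum to 1 (assumed step)."""
    grand_total = sum(totals.values())
    return {skill: value / grand_total for skill, value in totals.items()}

profiles = [
    # From the article: data science is skill #3, 106 Machine Learning endorsements.
    {"name": "Monica Rogati", "ds_rank": 3,
     "skills": {"Machine Learning": 106}},
    # Hypothetical profile, for illustration only.
    {"name": "Example Person", "ds_rank": 2,
     "skills": {"Machine Learning": 40, "Data Mining": 90}},
]

rogati_only = contributions(profiles[:1])
print(round(rogati_only["Machine Learning"], 2))  # 106 / 3

weights = normalize(contributions(profiles))
print(weights)
```

Dividing by the rank of data science in each profile gives more weight to people who put data science near the top of their skill list.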
The full list, based on top 10 data scientists identified on LinkedIn, can be downloaded here. Anyone interested in automatically crawling thousands of pre-selected (relevant) LinkedIn profiles to refine my manual analysis?
Alternate formula for data science
If for each skill, instead of summing Endorsements(person, skill) / DS_Skill_Rank(person) over the 10 persons listed in the spreadsheet, we sum SQRT{ Endorsements(person, skill) * DS_Endorsements(person) }, over the same 10 persons, then we obtain a slightly different mix illustrated in the following table.
Data Science = 0.22 * Data Mining + 0.12 * Machine Learning + 0.13 * Analytics + 0.11 * Big Data + ...
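As a rough sketch, the alternate weighting can be computed as below. The Monica Rogati figures (61 data science endorsements, 106 Machine Learning endorsements) are the ones quoted earlier in this article; the function name and the dictionary layout are hypothetical, and a real run would loop over all 10 persons in the spreadsheet.

```python
import math
from collections import defaultdict

def alternate_contributions(profiles):
    """Sum SQRT(Endorsements(person, skill) * DS_Endorsements(person)) per skill."""
    totals = defaultdict(float)
    for person in profiles:
        for skill, endorsements in person["skills"].items():
            totals[skill] += math.sqrt(endorsements * person["ds_endorsements"])
    return dict(totals)

profiles = [
    # From the article: 61 data science endorsements, 106 for Machine Learning.
    {"name": "Monica Rogati", "ds_endorsements": 61,
     "skills": {"Machine Learning": 106}},
]

alt = alternate_contributions(profiles)
print(alt["Machine Learning"])  # sqrt(106 * 61)
```

Unlike the first formula, this version ignores where data science ranks in the profile and uses the number of data science endorsements instead, through a geometric mean.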
Parameters in the formula
Whether you use the first or second formula, you are dealing with three parameters k, n and m:
The most complicated problem is to identify all professionals on LinkedIn with at least 10 endorsements for data science. It can be solved by using a list of 100 well-known data scientists as seed persons, then looking at first-degree, second-degree and third-degree related persons. By finding related people, I mean accessing the LinkedIn feature that tells you "people who visit X's profile also visit Y's profile", and extracting endorsement counts and skills for each person, using a web crawler.
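The seed-and-expand idea can be sketched as a breadth-first search limited to three degrees. This is only an outline under strong assumptions: `get_related` and `get_ds_endorsements` are hypothetical callables standing in for whatever the crawler extracts from a profile page, not functions of any real LinkedIn API, and the toy data below is entirely made up.

```python
from collections import deque

def find_candidates(seeds, get_related, get_ds_endorsements,
                    max_degree=3, min_endorsements=10):
    """Breadth-first expansion from seed persons, up to max_degree hops."""
    seen = set(seeds)
    queue = deque((person, 0) for person in seeds)
    qualified = []
    while queue:
        person, degree = queue.popleft()
        if get_ds_endorsements(person) >= min_endorsements:
            qualified.append(person)
        if degree == max_degree:
            continue  # do not expand beyond third-degree persons
        for other in get_related(person):
            if other not in seen:
                seen.add(other)
                queue.append((other, degree + 1))
    return qualified

# Toy in-memory data standing in for crawled pages (hypothetical).
related = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["E"], "E": ["F"], "F": []}
endorsements = {"A": 50, "B": 12, "C": 3, "D": 10, "E": 25, "F": 99}

result = find_candidates(
    ["A"],
    lambda p: related.get(p, []),
    lambda p: endorsements.get(p, 0),
)
print(result)
```

In this toy example, F is never visited because it sits four hops away from the only seed, and C is visited but dropped because it has fewer than 10 endorsements.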
Question
In your opinion, which formula is best, from a methodology point of view? The first one, or the alternative? Not surprisingly, they both yield similar results. I like the second one better.
Also, it would be a good exercise to find the equivalent formulas for data mining, big data, machine learning, and so on. Finally, note that people can be more than just data scientists: for instance, a data scientist and a musician at the same time. This explains why the skill rank for data science is rarely, if ever, #1 for anybody. Even I get far more endorsements for data mining or analytics than for data science, in part because data science is relatively new.
Next steps
Read our article on types of data scientists to see what we've done with the data analyzed here. In short, we classified data scientists and made discoveries such as these: Gregory Piatetsky-Shapiro (KDnuggets, Founder) and Vincent Granville belong to the same cluster; Bhandarkar (Pivotal's Chief Scientist) and, to a lesser extent, Eric Colson (former VP Data Science and Engineering at Netflix) are outliers, both very strong in big data.
Skills Interactions
Machine learning is part of data mining (at least for some people). Data mining and machine learning both involve analytics, big data, and data science. Big data involves analytics, data mining, machine learning and data science, etc. So how do you handle skill interactions? Should you have multiple equations, one for data science, one for data mining, one for big data and so on, and try to solve a linear system of equations? Each equation could be obtained using the same methodology used for the data science equation. I'll leave it to you as an exercise.
Comment
Hi Vincent,
What do you think is the difference between Data Mining and Machine Learning? Can we just say Data Mining is kind of like data preparation, cleaning and even crawling from web pages? And Machine Learning is mainly about modeling, prediction and so on?
Roberto, it's a study that we will conduct shortly.
I'd try a follow-up question to what you analysed in this one: what skills differentiate the top-level data scientists from the other 8,000? The top skills identified for a data scientist are interesting, but there seems to be a general opinion that being able to analyse is only part of a successful data science career. I would like to prove or disprove that, and also understand what makes these people more successful than anyone else with a degree in data analysis/machine learning.
@Ralph: It's a bit like wanting to measure US demographics when your sample contains nobody with a social security number ending in an odd digit (1, 3, 5, 7, 9). Your analysis will still be valid. However, if you wanted to find the 10 oldest Americans still alive, you might miss 50% of them, maybe more, as the oldest people were born before social security numbers existed.
Note that I avoid using the phrase "top 10 data scientists", but instead I use "top 10 data scientists according to LinkedIn's search engine".
Vincent - How are you measuring "statistical validity of data science related fields"?
If your statistical validity for "top data scientists" is weak, what would be the basis for using them as the sample in your other blog?
Yes, statistics is generic indeed. Still, some data scientists include it as a skill, since it's something that potential employers are familiar with.
Hilary is great. Her definition of the data product is probably the best out there.
I think data science statisticians - at least me - pick predictive modeling or analytics as their statistical skill (or more specialized stuff such as time series or Bayesian networks). Statistics, like computer science, is too generic. Plus, statistics could mean something different, like sports statistics.
It is indeed disappointing that 'visualization' doesn't appear higher in the list. What good is a data product if its use and applicability to a problem domain can't be adequately communicated? I would also argue that the disciplines of human visual and auditory perception are well researched and firmly in the 'hard' science fields.
Intriguing article! I had no idea it was possible to model LinkedIn members based on this data. I would add Statistics in the feature set, as most data scientists are good at it, whether they rely on it or not in their work. I would be interested in learning more about this project after further analysis has been conducted.
© 2016 Data Science Central