
# O'Reilly 2015 Salary Survey for Data Scientists

Very interesting data compiled and analyzed by O'Reilly, using statistical models such as Lasso regression to predict salary from a range of factors. It reminds me of our own analysis, based on simulated (but realistic) data, which assessed whether knowing Python or R (or both) commands a higher salary, and how much extra each of these skills provides individually. The statistical model we used was Jackknife regression, and the analysis was designed for tutorial purposes.

The O'Reilly survey is much bigger, is based on real data, and includes many factors as well as factor selection. It uses standard statistical techniques, which might be less robust than Jackknife regression. The highlight, shown below, is a formula to estimate your salary. They tried different models and used R^2 for model selection. I would recommend an L^1 metric instead, since it is more robust than R^2.
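To make the R^2 versus L^1 point concrete, here is a minimal Python sketch. It assumes that "L^1 metric" means the mean absolute error, which is my reading rather than something spelled out above. The salaries (in thousands of dollars) are made up for illustration and are not taken from the survey. Model A fits the typical salaries exactly but misses one very high earner. Model B is off by 50 on every typical salary but gets closer to the high earner.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """L^1-type criterion (assumed reading): average absolute residual."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up salaries in $K; the last respondent is a high-earning outlier.
actual = [60, 70, 80, 90, 100, 400]
model_a = [60, 70, 80, 90, 100, 100]       # exact on the bulk, misses the outlier
model_b = [110, 120, 130, 140, 150, 300]   # off by 50 on the bulk, closer on the outlier

for name, predicted in [("A", model_a), ("B", model_b)]:
    print(
        name,
        "R^2 =", round(r_squared(actual, predicted), 3),
        "MAE =", round(mean_absolute_error(actual, predicted), 1),
    )
```

On this toy data, R^2 prefers model B because the single squared residual on the outlier dominates everything else, while the mean absolute error prefers model A. This is the sense in which an L^1 criterion is less sensitive to a few extreme salaries.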

All in all, a great analysis with numerous useful charts. You can download the survey here.

One of O'Reilly's statistical (regression-like) models for salary prediction:

For instance, start with a \$30,572 base salary. Add \$13,950 if you are 28 years old, that is, \$1,395 for each of the 10 years above 18 (this variable should probably be capped: I doubt you earn more at 58 than at 53, but I could be wrong). Add \$13,200 if you are in California. Add \$9,747 if you know Spark. And so on. Note that in our simulated data, the boosts provided by each skill were not additive: beyond 3-4 skills, there was no further boost. Indeed, I believe the number of skills was a component of our model, capped at 3, if I remember correctly.

30572 intercept
+1395 age (per year of age above 18)
+5911 bargaining skills (times 1 for “poor” skills to 5 for “excellent” skills)
+382 work_week (times # hours in week)
-2007 gender=Female
+1759 industry=Software (incl. security, cloud services)
-891 industry=Retail / E-Commerce
-6336 industry=Education
+718 company size: 2500+
-448 company size: <500
+8606 PhD
+851 master’s degree (but no PhD)
+13200 California
+10097 Northeast US
-3695 UK/Ireland
-18353 Europe (except UK/I)
-23140 Latin America
-30139 Asia
+7819 Meetings: 1 - 3 hours / day
+9036 Meetings: 4+ hours / day
+2679 Basic exploratory data analysis: 1 - 4 hours / week
-4615 Basic exploratory data analysis: 4+ hours / day
+352 Data cleaning: 1 - 4 hrs / week
+2287 cloud computing amount: Most or all cloud computing
-2710 cloud computing amount: Not using cloud computing
+9747 Spark
+6758 D3
+4878 Amazon Elastic MapReduce (EMR)
+3371 Scala
+2309 C++
+625 Hive
-1931 Visual Basic/VBA
+31280 level: Principal
+15642 title: Architect
+3340 title: Data Scientist
+2819 title: Engineer
-3272 title: Developer
-4566 title: Analyst

It would be nice to create an interactive Excel spreadsheet, or a web form (maybe in JavaScript), to compute the expected salary for your particular feature vector. It would also be worth checking whether this data is consistent with other sources such as Indeed.com, Payscale.com or Glassdoor.com (these sources also offer some level of granularity, although limited). Better still, blend these data sets together: survey data is not good at capturing outliers (people who don't have time to fill in surveys and who might be executives with a high salary, or people who don't speak English), so a blended data set might give a better picture of extreme salaries.
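As a starting point for such a calculator, here is a minimal Python sketch of the additive formula. The coefficients are copied from the list above, but only a subset of the yes/no factors is included to keep it short. The function name, the dictionary keys, and the choice to ignore years below 18 are my own assumptions, not part of O'Reilly's model.

```python
# Coefficients copied from the list above (US dollars).
INTERCEPT = 30572
PER_YEAR_ABOVE_18 = 1395
PER_BARGAINING_POINT = 5911   # scale: 1 ("poor") to 5 ("excellent")
PER_WEEKLY_HOUR = 382

# Yes/no factors: an illustrative subset of the list, with hypothetical keys.
BINARY_COEFFICIENTS = {
    "gender=Female": -2007,
    "PhD": 8606,
    "California": 13200,
    "Northeast US": 10097,
    "Spark": 9747,
    "D3": 6758,
    "Scala": 3371,
    "title: Data Scientist": 3340,
    "title: Analyst": -4566,
}


def estimate_salary(age, bargaining_skills, hours_per_week, factors=()):
    """Sum the intercept, the three numeric terms, and any yes/no factors."""
    if not 1 <= bargaining_skills <= 5:
        raise ValueError("bargaining_skills must be between 1 and 5")
    total = INTERCEPT
    # Assumption: ages below 18 contribute nothing rather than a negative amount.
    total += PER_YEAR_ABOVE_18 * max(age - 18, 0)
    total += PER_BARGAINING_POINT * bargaining_skills
    total += PER_WEEKLY_HOUR * hours_per_week
    for name in factors:
        total += BINARY_COEFFICIENTS[name]  # KeyError for a factor not in the subset
    return total


# A 28-year-old data scientist in California who knows Spark,
# with average bargaining skills and a 40-hour week.
example = estimate_salary(
    28, 3, 40, ["California", "Spark", "title: Data Scientist"]
)
print(example)  # 103822
```

Because the model is purely additive, extending the sketch to the full list only means adding entries to the dictionary. Capping the age term or the number of skills would require changing the function itself.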


Here's one of the numerous charts found in the report:

And the table of contents:

Introduction (p. 2)
Tools versus Tools (p. 21)
Tools and Salary: A More Complete Model (p. 30)
Integrating Job Titles into Our Final Model (p. 33)
Finding a New Position (p. 38)
Wrapping Up (p. 39)
