Very interesting data compiled and analyzed by O'Reilly, using statistical models such as Lasso regression to predict salary from a range of factors. It reminds me of our own analysis, based on simulated (but realistic) data, which assessed whether knowing Python or R (or both) commands a higher salary, and how much extra boost each of these skills provides individually. The statistical model used was Jackknife regression, and it was designed for tutorial purposes.

The O'Reilly survey is much bigger, is based on real data, and includes many factors, as well as factor selection. It uses standard statistical techniques, which might be less robust than Jackknife regression. Below is the highlight: a formula to estimate your salary. They tried different models and used R^2 for model selection. I would recommend using an L^1 metric, which is more robust, instead of R^2.
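To illustrate the model-selection point, here is a minimal Python sketch comparing R^2 with an L^1 criterion. It assumes that "L^1 metric" means the mean absolute error, which is my reading rather than something the report states. The salaries and the two sets of predictions are made-up toy numbers (in thousands of dollars), not survey data; the last salary plays the role of an outlier.

```python
def r_squared(actual, predicted):
    """Classic R^2: 1 minus (residual sum of squares / total sum of squares)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """L^1 criterion (assumed here): average absolute prediction error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical toy data, for illustration only.
actual = [60, 80, 100, 120, 400]     # last value is an outlier
model_a = [60, 80, 100, 120, 250]    # exact on typical salaries, misses the outlier
model_b = [100, 120, 140, 160, 360]  # off by 40 on every observation

for name, predicted in (("model_a", model_a), ("model_b", model_b)):
    print(name,
          "R^2 =", round(r_squared(actual, predicted), 3),
          "MAE =", mean_absolute_error(actual, predicted))
```

On these toy numbers R^2 ranks model_b higher, because the squared error on the single outlier dominates, while the mean absolute error ranks model_a higher. Which ranking is preferable depends on how much weight extreme salaries should carry.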

All in all, a great analysis with numerous useful charts. You can download the survey here.

**One of O'Reilly's statistical (regression-like) models** for salary prediction:

For instance, start with a $30,572 base salary. Add $13,950 if you are 28 years old (that is, $1,395 for each of the 10 years above 18). I think this variable should be capped, since you probably don't earn more at 58 than you do at 53, but I could be wrong. Add $13,200 if you are in California. Add $9,747 if you know Spark. And so on. Note that in our simulated data, the boosts provided by each skill were not additive: beyond 3-4 skills, there was no further boost. Indeed, I believe the number of skills was a component of the model, capped at 3, if I remember correctly.

30572 intercept

+1395 age (per year of age above 18)

+5911 bargaining skills (times 1 for “poor” skills to 5 for “excellent” skills)

+382 work_week (times # hours in week)

-2007 gender=Female

+1759 industry=Software (incl. security, cloud services)

-891 industry=Retail / E-Commerce

-6336 industry=Education

+718 company size: 2500+

-448 company size: <500

+8606 PhD

+851 master’s degree (but no PhD)

+13200 California

+10097 Northeast US

-3695 UK/Ireland

-18353 Europe (except UK/I)

-23140 Latin America

-30139 Asia

+7819 Meetings: 1 - 3 hours / day

+9036 Meetings: 4+ hours / day

+2679 Basic exploratory data analysis: 1 - 4 hours / week

-4615 Basic exploratory data analysis: 4+ hours / day

+352 Data cleaning: 1 - 4 hrs / week

+2287 cloud computing amount: Most or all cloud computing

-2710 cloud computing amount: Not using cloud computing

+9747 Spark

+6758 D3

+4878 Amazon Elastic MapReduce (EMR)

+3371 Scala

+2309 C++

+1173 Teradata

+625 Hive

-1931 Visual Basic/VBA

+31280 level: Principal

+15642 title: Architect

+3340 title: Data Scientist

+2819 title: Engineer

-3272 title: Developer

-4566 title: Analyst

It would be nice to create an interactive Excel spreadsheet, or a web form (maybe in JavaScript), to compute the expected salary given your particular feature vector. It would also be worth checking whether this data is consistent with other sources such as Indeed.com, Payscale.com or Glassdoor.com (these sources also offer some level of granularity, although limited). Better still, blend these data sets together: survey data is not good at capturing outliers (people who don't have time to fill out surveys and who might be executives with a high salary, or people who don't speak English), so blending might give a better picture for extreme salaries.
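As a starting point for such a calculator, here is a minimal sketch in Python (rather than Excel or JavaScript). It simply adds up coefficients copied from the list above. Only a subset of the coefficients is included, and the function name, the short feature names and the optional `age_cap` argument are my own hypothetical choices, not part of O'Reilly's model. Treating ages below 18 as contributing nothing is also an assumption.

```python
# Coefficients copied from the list above (a subset; extend as needed).
INTERCEPT = 30572
PER_YEAR_ABOVE_18 = 1395
PER_BARGAINING_POINT = 5911  # times 1 for "poor" skills to 5 for "excellent"
PER_WEEKLY_HOUR = 382

# Yes/no features, keyed by hypothetical short names.
FLAGS = {
    "female": -2007,
    "phd": 8606,
    "masters_no_phd": 851,
    "california": 13200,
    "northeast_us": 10097,
    "spark": 9747,
    "d3": 6758,
    "scala": 3371,
    "title_data_scientist": 3340,
    "title_analyst": -4566,
}


def estimate_salary(age, bargaining, hours_per_week, flags=(), age_cap=None):
    """Sum the intercept, the three numeric terms and the selected flags.

    age_cap is NOT in the published model; it only illustrates the idea
    of capping the age variable (for example at 53).
    """
    if not 1 <= bargaining <= 5:
        raise ValueError("bargaining must be between 1 and 5")
    effective_age = age if age_cap is None else min(age, age_cap)
    years_above_18 = max(effective_age - 18, 0)  # assumption: no penalty below 18
    total = INTERCEPT
    total += PER_YEAR_ABOVE_18 * years_above_18
    total += PER_BARGAINING_POINT * bargaining
    total += PER_WEEKLY_HOUR * hours_per_week
    total += sum(FLAGS[name] for name in flags)
    return total


# A 28-year-old in California who knows Spark, with average bargaining
# skills (3) and a 40-hour work week.
print(estimate_salary(28, 3, 40, flags=["california", "spark"]))  # 100482
```

Because the model is purely additive, each feature is a single table lookup. Capping the number of skills, as in our simulated data, would need an extra rule on top of this sum.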


**Here's one of the numerous charts found in the report**:

**And the table of contents**:

- Introduction (p. 2)
- How You Spend Your Time (p. 13)
- Tools versus Tools (p. 21)
- Tools and Salary: A More Complete Model (p. 30)
- Integrating Job Titles into Our Final Model (p. 33)
- Finding a New Position (p. 38)
- Wrapping Up (p. 39)
