The comparison is performed on a data set where linear regression works well: salary offered to a candidate, based on programming language requirements in the job ad: Python, R or SQL. This is a follow-up to the article highest paying programming skills. The increased accuracy of linear regression estimates is negligible, and well below the noise level present in the data set. The Jackknife method has the advantage to be more stable, easy to code, easy to understand (no need to know matrix algebra), and easy to interpret (meaningful coefficients).
Jackknife is not the first regression approximation developed by the author: check my book pages 172-176 for other examples.
1. Model Comparison
We compare three different methodologies:
The results are displayed in the comparison tab, in the same spreadsheet, as well as in the above figure.
The general framework is as follows:
The dependent variable (to be estimated or predicted) is the overall salary. The data was simulated and represents reality extremely well based on recruiting strategies (see data tab in the spreadsheet to see how this was done), using a model where each professional, in a specific education / years of experience segment, has a salary simulated based on 5 semi-random components:
Cross-correlations were introduced between Python, R, and SQL, and can be fine-tuned (increased or decreased) by playing with the parameters in cells L1:M1 (highlighted in orange background) in the data tab. There are eight types of professionals, encoded using the formula
Code = Python + 2 * R + 4 * SQL
where Python, R, and SQL are binary variables representing whether or not the data scientist has Python, R or SQL skills. When you change the cross-correlations, make sure you have at least 2 or 3 professionals for each code, otherwise the results in the comparison tab will show some errors.
We use modern metrics to show the yield for regression or Jackknife, above the base model (base model = salary, for each professional, is estimated as the average salary across all professionals).
In each case (Jackknife, regression, base model), the model estimates the individual salary based on which skills the data scientist owns, from a pre-selected list of skills: Python, R, and SQL.
2. The Data Set and Spreadsheet
A few comments about the data:
Also, note that L1 metrics are used in the spreadsheet, in the comparison tab, to compare models; these L1 metrics should be preferred over traditional L2 metrics such as R-Squared. L2 metrics are sensitive to outliers.
The formulas and details about Jackknife are found in my article Jackknife logistic and linear regression for clustering and predict.... The model is so simple that we implemented it in our Excel spreadsheet. Linear regression is also implemented in the same spreadsheet. Note that we used the most rudimentary version of Jackknife, using 0 parameters. The regression technique involves three parameters (not including the intercept in both models). Codes with a large number of onservations, as expected, have better predicted value, especially for the Jackknife.
Here's an additional summary table:
Even though standard regression seems to be performing much better, predictions for individual salary - regression versus Jackknife - are not far off, as illustrated in the top figure. Both for regression and Jackknife, only 8 different estimated values are generated, since we have just 8 codes. Note that if we boost correlations to the point that Correl(Python, R) = 1, then the linear regression model will crash, while the Jackknife will perform nicely.
Rudimentary, approximate methods such as Jackknife regression (not to be confused with Efron's bootstrap) are just nearly as good as so-called exact models such as traditional regression, for predictive modeling. The reason is because data is anything but exact, and statistical models are approximate representations of the reality: all models are wrong, some are not as wrong as others. Approximate solutions provide substantial advantages: easy to code (even in SQL) and understand, robust, and easy to interpret. In short, they are a good choice for inclusion in black-box, automated data science.
Finally, the following table pretty much summarizes the small differences between Jackknife and standard regression, for predictions. By error, I mean the absolute value of the error, averaged across the 8 codes, and weighted based on the number of observations in each code. So it's a weighted average. The base model corresponds to the case where the salary estimate is identical (and equal to the mean), for all data scientists.