The winner of our second data science competition is Tom De Smedt, a biostatistician completing a Ph.D. program at the University of Leuven, Belgium. His special interests are spatial statistics, environmental epidemiology, novel regression techniques, and data visualization.
The competition consisted of simulating data and testing the Jackknife regression technique, recently developed in our laboratory, on correlated features or variables. The technique approximates standard regression, but is far more robust and well suited to automated or black-box data science. The simplest version pretends that the variables are uncorrelated, which very quickly yields robust regression coefficients that are easy to interpret. This is the version Tom has been working on.
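As a minimal sketch of this simplest version: if the variables are treated as uncorrelated, each regression coefficient reduces to the univariate slope cov(x_i, y) / var(x_i), computed independently per variable. The function name and the simulated data below are illustrative assumptions, not the laboratory's actual code:

```python
import numpy as np

def uncorrelated_regression(X, y):
    """Approximate regression coefficients by treating features as
    uncorrelated: each coefficient is the univariate OLS slope
    cov(x_i, y) / var(x_i).  When the features truly are uncorrelated,
    these match the multivariate least-squares coefficients."""
    Xc = X - X.mean(axis=0)          # center each feature
    yc = y - y.mean()                # center the response
    slopes = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)
    intercept = y.mean() - slopes @ X.mean(axis=0)
    return intercept, slopes

# Simulated data with independent features, as in the competition setup
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=5000)

b0, b = uncorrelated_regression(X, y)
```

With truly uncorrelated features the recovered slopes are close to the true coefficients; the interesting case tested in the competition is how far they drift when the features are correlated.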
Findings
Initial findings suggest, as expected, that the Jackknife approach provides a rough approximation in the context of predictive modeling, although the parameter estimates differ substantially from standard regression in this test. It is also faster than standard regression when the number of variables is very large (> 10,000), though we are still investigating this point: the standard regression algorithm may be very efficiently implemented in R, while our Jackknife regression does not yet benefit from the same amount of code optimization.
Next steps
We would like to test the Jackknife regression with the clustering step applied, grouping the variables into 2, 3, or 4 subsets, to see the improvement in the context of predictive modeling. This is described in section 3 of our original article, and has not been tested yet.
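One plausible way to sketch this clustering step (the details in section 3 of the original article may differ, so treat this as an assumption): group features by the strength of their pairwise correlations, so that highly correlated variables land in the same cluster and cross-cluster correlations stay weak enough to ignore. The helper below uses ordinary hierarchical clustering on a 1 − |corr| distance:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_features(X, n_clusters):
    """Group features by correlation: strongly correlated features end
    up in the same cluster, so the correlations *between* clusters are
    weak and can be neglected by the simple Jackknife step."""
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)        # distance: 0 = perfectly correlated
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

rng = np.random.default_rng(1)
base = rng.normal(size=(1000, 2))
# Two blocks of three correlated features each, plus small noise
X = np.hstack([base[:, [0]] + 0.1 * rng.normal(size=(1000, 3)),
               base[:, [1]] + 0.1 * rng.normal(size=(1000, 3))])
labels = cluster_features(X, n_clusters=2)
```

Given such labels, the uncorrelated-variables version of the Jackknife regression would then be applied cluster by cluster rather than variable by variable.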
In particular, we would like to see the improvement when we have one million variables (thus 0.5 trillion correlations) and use sampling techniques to pick 10 million of these 0.5 trillion correlations (using a robust correlation metric), grouping variables with an algorithm identical to our sparse keyword clustering algorithm.
So, instead of using a 1 million by 1 million correlation table (for the similarity matrix), we would use a hash table of size 10 million, where each entry is a key-value pair $hash{Var A | Var B} = Corr(A, B). This is 50,000 times more compact than the full matrix, and nicely exploits the sparsity in the data. We would then measure the loss of accuracy incurred by using a sample 50,000 times smaller than the (highly redundant) full correlation matrix.
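The hash-table idea can be sketched as follows, with a Python dict standing in for the Perl-style hash of the article. The function name and the small problem size are illustrative; at the one-million-variable scale the pair sampling would of course be streamed rather than looped:

```python
import numpy as np

def sampled_correlation_table(X, n_pairs, rng):
    """Sparse alternative to the full p-by-p correlation matrix:
    sample n_pairs distinct variable pairs and store only those
    correlations, keyed by the pair (i, j) with i < j."""
    n, p = X.shape
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns
    table = {}
    while len(table) < n_pairs:
        i, j = sorted(rng.choice(p, size=2, replace=False))
        if (i, j) not in table:
            table[(i, j)] = float(Xc[:, i] @ Xc[:, j]) / n
    return table

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 50))   # 50 variables -> 1,225 possible pairs
table = sampled_correlation_table(X, n_pairs=100, rng=rng)
```

Keying on the ordered pair (i, j) stores each correlation once, mirroring the $hash{Var A | Var B} scheme above; a robust correlation metric could be swapped in for the plain dot product without changing the data structure.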
More on this soon: we plan to publish the full results and source code in the next two months. The award for this competition was $1,000. Congratulations, Tom!
You can check our first competition, and the winner, here.
Comment
It looks like this type of regression (when the number of clusters is set to 1, the default option) works best when the number of variables is very large (> 1 million), perhaps because in that case typical data have cross-correlations relatively well distributed among all pairs of variables. An example with one million variables is spam detection, where each variable represents a feature or rule, such as "the email contains the keyword viagra".
© 2017 Data Science Central