This applies to data science research as well as to any other analytic discipline. For centuries, scientific research was performed in Academia, by university professors managing their own labs. Much of the research was carried out by young scientists who had just completed their PhDs. The selection process has always favored the same type of personality. The basic rule is "publish or perish," which produces the following drawbacks:
Data Science Central Research Lab
With the tenure process, research directors must be careful not to engage in revolutionary experimentation, in order to please their grantors and faculty boards. They also spend a considerable amount of time chasing money, rather than doing research.
This hurts innovation. Private industry and some agencies have their own research labs, but they hire the same type of individual: the kid who always had perfect grades at school, on the assumption that this predicts research quality (and since they define what quality is, we are stuck in a loop here). The private sector does provide an alternative to Academia, though research results are often kept secret and incorporated into patents.
The New Model
Here I propose a new approach to scientific research, and discuss how it could be implemented on a larger scale, via proper monetization. It consists of independent professionals performing their research and publishing in popular blogs rather than in scientific journals, and obtaining for themselves the data that they need for their tests and experimentation (many data sources are free, many projects are posted on Kaggle, and research-oriented projects are posted on DataScienceCentral, some using simulated data). You can call it crowd-research.
The advantages are as follows:
In my case, I realized that publishing in blogs takes 1 hour per article, rather than 50 hours for scientific journals. At $1,000/hour (my hourly rate), and since scientific journals don't pay authors, it's a $49,000 saving per article, that is, hundreds of thousands of dollars saved per year. Also, my articles are shorter, published much faster, reach a thousand times more users, are easier to read (with source code that you can copy and paste, data sets that you can download), and written so as to be understood by many professionals from various applied disciplines, not just a dozen highly specialized theoretical experts. You can compare my article on data videos with one published by a traditional statistician, in a top traditional journal, independently and at the same time. I believe that mine is more useful, provides code to make much faster, longer videos, and is, in essence, of superior value.
How to pay for this new type of research?
The money can come from various sources. As a data scientist interested in doing research, you have the following options; you can combine several of them:
If you spend 25% of your time on these money-making activities (listed above), 25% on building your network and reaching out to clients, 25% on doing scientific research (including working on projects that support your research), and 25% on managing your business (organizing, planning, operations, finance), you will soon make more money than working in a cubicle, while doing things that you enjoy, with real control over your life.
I'll write more articles on how to get started with this career path, and offer mentoring, in the near future. For now, feel free to check out our research lab publications.
Comment
This sounds like an interesting new way to do research, but do not write off academia just yet, even though some published papers have been written by a machine. I like the idea of writing and partnering with someone who has a gift for sales and marketing, but I don't know anyone like that (and I certainly do not have that gift myself).
PCA is another methodology that I recommend avoiding. It leads to meaningless variables and a lack of interpretation. The whole concept is also sensitive to outliers. And no, I don't choose data to make the job easier, but to provide the best predictive power and ROI. Finally, you'll see yet another example that shows how Jackknife regression works well enough (given its simplicity) in my next paper. There was also another PhD guy who tested it and won a $1,000 award from us. His comments and conclusions can also be found on this blog. And all the references to my articles are accessible from this website, which has far more reach than any statistical journal, another reason why I don't publish in these journals.
Indeed, you criticize the most basic version of my regression, one that does not include any parameter, and compare it with one that uses a few parameters (ridge regression, for which an approximate solution - source code - can also be found on AnalyticBridge, one of the DSC channels). It is as if you stopped reading my article after the first paragraph. Google "jackknife regression", that's where the methodology is explained in detail.
Also, your focus is on the regression only, not the whole picture which goes from defining the problem, gathering the right data, to measuring yield over baseline (after factoring in costs of implementation and maintenance) when the technique is in production mode.
Even the concept of goodness-of-fit is subject to discussion, a reason why I published an L^1 rather than L^2 version, in another DSC paper (a reader also won a $1,000 award for disproving some fact and solving problems related to this L^1 goodness-of-fit metric).
You'll find everything in my upcoming book, Data Science 2.0. Past research is published in my first book, available from this website. These are not books meant for academics, though, but for the general educated public and developers with little or no background in statistics.
Research paper? Maybe if you had published it in an actual journal, I would know what to read. That's one of the points of having your results in a journal. I am reading this. If you could direct me to a better source, I'll read that as well and give you some feedback. But that page which I criticized has a number of errors and makes several erroneous and outlandish claims. You are making some strong claims again here. Have you actually proven these claims?
If you want to do variable selection at the same time, you can of course use the lasso for which there are solid libraries readily available. You also don't always get to pick the dataset that makes your job easiest. There can be important information in correlated data and ignoring much of it because it makes your job difficult is not a good solution. Multi-linear regression (or PCA) methods allow you to untangle the correlation and extract all the information.
David, to insinuate that my methods are incorrect is just plain wrong. To disprove my methods, use honest factual information. Regarding my Jackknife, first, I propose to select an optimal set of variables based on predictive power; in short, that's my variable reduction step. Then Jackknife regression includes a clustering step, to eliminate much of the correlation between variables. You seem to have missed that, which is also the reason why I call it Jackknife (because it does regression, clustering of variables, and clustering of observations at the same time). Then, if by correct, you mean that my regression coefficients are off from the theoretical solution, so be it. Mine are easy to interpret, not subject to over-fitting, and yield better predictive power and more reliable predictions, especially when all variables are highly correlated. Imagine a data set with 100 identical variables, that is, 100 variables as highly correlated with each other as one can imagine. My Jackknife regression will provide predictions IDENTICAL to any other sound method, but with meaningful regression coefficients (all identical), unlike many other methods. But then, if variables are highly correlated, I would question the process used to gather the data in the first place (data scientists are supposed to come up with metrics and data sets that make sense, and to track those metrics correctly; it's part of our job, even before a predictive analytics project gets started).
Read my research paper in detail before making erroneous conclusions. You are a perfect example why peer-review by academic scientists is a waste of time for people like me, and one of the many reasons why I chose to publish my research in public blogs where everyone, including you, can and do criticize my techniques. My techniques have been successfully tested on various real data sets over the last 25 years, and deployed in production mode, including to catch massive Botnets, fraud detection, and in various arbitraging contexts, with many big and small companies. They've also benefited from improvements thanks to many of our bloggers, but in this discussion, I don't see any constructive comment from you that would lead to a significant improvement.
Vincent you can be a successful marketer for your ideas but eventually peddling incorrect methods will catch up to you. Someone implementing incorrect methods can end up losing millions of dollars for a company. Intellectual integrity is paramount. Those making outlandish claims and refusing to face objective criticism are the ones peddling snake oil. I looked into your Jackknife Regression today. Before we get into specifics, just look at the jargon. You criticized "bootstrap re-resampling" because the term will scare people. Then you offer up "Jack-knife regression". Is that less jargon-ful?
I actually don't think it matters, but the fact is that your jackknife regression is simply incorrect and sometimes will be spectacularly incorrect. It is only going to be close to correct when the correlations among the regressor variables are negligible. But if you somehow knew that (I don't know how you would) then you don't need to do multiple linear regression at all. You just do linear regression one variable at a time. This is the Naive Bayes assumption, an assumption that, I've noticed, you have criticized many times on your blog. When you normalize all the variables, there is no intercept term, and so the already simple univariate linear regression is identical to your formula Cov(x_i,y)/Var(x_i). But this isn't an approximate attempt at multi-linear regression; it simply avoids the issue that regressors could in fact be correlated, which is the whole point of the problem that multi-linear regression is trying to solve.
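The contrast between one-variable-at-a-time slopes and a joint fit can be illustrated with a minimal sketch. It uses simulated data, not the data from either side of this discussion; the sample size, the correlation between the two regressors, and the true coefficients are all assumptions chosen for illustration, and the per-variable formula Cov(x_i,y)/Var(x_i) is only the basic, parameter-free version discussed in this comment.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
n = 5000

# Assumed setup: two strongly correlated regressors, x2 = x1 + small noise.
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)
X = np.column_stack([x1, x2])

# Assumed truth: y depends on x1 only.
y = 2.0 * x1 + 0.1 * rng.normal(size=n)

# Center everything so that no intercept term is needed.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# One variable at a time: Cov(x_i, y) / Var(x_i) for each column.
univariate = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)

# Joint least-squares fit (multiple linear regression).
joint, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

print("univariate slopes: ", univariate)
print("joint coefficients:", joint)
```

Under these assumptions the univariate slope for x2 comes out large even though x2 has no direct effect on y, while the joint fit assigns it a coefficient near zero. This says nothing about versions of the method that add a variable-selection or clustering step first.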
While your method isn't worthless, it is a far cry from your claims that it is almost as good as techniques like ridge regression, and from the ridiculous claim you make: "All the regression theory developed by statisticians over the last 200 years (related to the general linear model) is useless." That's an extremely irresponsible statement to make.
Here is my attempt to test your method on data with significant correlation among the regressors. Note both normal linear regression and ridge regression require less code than your method in R.
http://data-science-musings.blogspot.com/2015/01/granville-regressi...
To me, it doesn't look like it performs anywhere near as well as ridge regression. Look at the plot. Do you disagree?
Theoretically (assuming we don't ignore 200 years of research on regression), this should be no surprise. Regression is equivalent to trying to reproduce the inverse covariance matrix of the variables. The usual analytic formula for the solution of multilinear regression y = Mx is the normal equations, x = (M^T M)^(-1) M^T y. For centered variables, the matrix M^T M is also (up to scaling) the covariance matrix, so you see, you have to invert the covariance matrix to solve the normal equations (though there are better ways to do it). Essentially what your method does is approximate the inverse of a matrix by the element-wise inverse, which is of course incorrect and not even a good approximation most of the time. It is correct only for diagonal matrices, e.g. Naive Bayes.
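The linear-algebra argument above can be written out as a short worked derivation. It keeps this comment's notation (y = Mx, with x the coefficient vector) and assumes the columns of M are centered; the symbol m_i for the i-th column of M is introduced here only for illustration, and it plays the role of the variable x_i in the formula Cov(x_i,y)/Var(x_i).

```latex
% Normal equations for y = M x:
\hat{x} = (M^{T} M)^{-1} M^{T} y

% If the centered columns m_1, ..., m_p of M are mutually uncorrelated,
% M^T M is diagonal and its inverse is taken entry by entry on the diagonal:
M^{T} M = \operatorname{diag}\left(m_1^{T} m_1, \dots, m_p^{T} m_p\right)
\quad\Longrightarrow\quad
\hat{x}_i = \frac{m_i^{T} y}{m_i^{T} m_i}
          = \frac{\operatorname{Cov}(m_i, y)}{\operatorname{Var}(m_i)}

% With correlated columns the off-diagonal entries of M^T M are nonzero,
% and the per-variable ratio above no longer equals the i-th entry of \hat{x}.
```

The last equality holds because the sample-size factors in the covariance and the variance cancel. It is only a restatement of the diagonal special case, not a description of any clustering or variable-selection step.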
Your method will be particularly prone to the issue of confounding. That is, just because two variables are correlated doesn't mean there is a real relation between them. Multilinear regression is designed to deal with this issue of confounding and tries to unravel the real minimalist structure of correlations. In the simulations I've made, only 20% of the variables are truly connected. Your model's results imply they are all connected, and so your model does not discover the inherent simplicity behind the apparent complexity. The implementation is simpler but suggests a more complicated and ultimately incorrect model. If we no longer care about correctness, then it isn't hard to come up with easy-to-implement methods.
The regression coefficient between y and x_i is the linear slope of the graph of y versus x_i when all other x are held constant. The correlation coefficient you calculate is the slope of y versus x_i averaged over the probability distribution of all other variables. It's the marginal distribution. Regression is about discovering the conditional distribution. It's not the same problem.
Finally, getting back to this original post. If you came up with this method and discussed it with academics or those with deep statistical background, they would tell you this and probably direct you to something like ridge regression that works much better. Then perhaps you would have instead written a blog on the greatness of ridge regression. By writing off these people as out-of-touch academics, you have missed the opportunity to learn something important and have harmed your readers by misdirecting them to a method that in fact is not really very effective.
I'd be less harsh in my criticism if you didn't make such outlandish claims about the superiority of your methods. In academia, such claims would never see the light of day, as the reviewer would definitely recognize these flaws and reject the paper. In the blogosphere anything goes, I guess, and it takes someone like me to chime in and point out what is wrong. Is this really a better way to do it? Do you really want your techniques criticized so publicly? Is some reader new to this material going to know what to think now that some anonymous commenter (myself) seems to be criticizing your method? (You can find me on LinkedIn if you really want.)
If there is one thing that I have learned to appreciate about academia, it is that we can accomplish great things not because we're just smarter than everyone else, but because we can stand on the shoulders of giants or, more often, the shoulders of a pile of other regular people. We can only do that because we can trust that published results are very likely correct. I can publish a paper based on convex optimization techniques and only concern myself with proving my own work. I don't have to worry that maybe John Von Neumann or Carl Friedrich Gauss or even less famous people totally screwed up the theorems that I use in my everyday work as the base for my (probably modest) extension. I can simply trust them because they have been vetted extensively once before publication and likely many times afterward. This is why the blogosphere, with its anything-goes style, would not be an improvement on regular academic research.
The MapReduce NMF paper from Microsoft, I mentioned in my previous comment is available here, if readers of this thread are interested in implementing it. There are 3 NMF versions described in the paper which appeared in the "Proceedings of the 19th International World Wide Web Conference" published by ACM, 2010.
"Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce"
David, the marketing part is critical if you want your methods to be widely adopted. Use simple English that everyone understands, like "model-free confidence intervals" as opposed to "Bootstrap re-sampling" or more obscure terms. Provide an Excel spreadsheet with calculations - everyone knows how to deal with a basic Excel spreadsheet. And use a few sentences to explain why your methodology makes sense.
To mainstream people (or even engineers), words such as "bootstrap re-sampling" look like someone trying to package his product using marketing strategies similar to snake-oil salesmen: using words very few (but a clique of initiated) understand.
In my next paper, I will show that extremely simplified methods such as Jackknife regression lead to results very similar to those of sophisticated techniques that very few people know. So why use the sophisticated technique if the simplified version yields a similar return? Along the same lines, having hundreds of arcane methods in your arsenal confuses the practitioner: it hurts more than it helps.
To get back to the core of this discussion, the type of research that you advocate serves one purpose (boosting your academic career), mine serves a different purpose (dissemination of robust techniques outside academia). Even though sometimes the methods being deployed are identical, or at least of same quality.
Finally, plenty of amateur data scientists participate in Kaggle competitions. They sometimes bring value: all you have to do is solve a real problem or find root causes based on real datasets -- no degree needed. Awards range from $500 to $200,000. We also have similar competitions on DSC, open to anyone without restrictions.
Vincent, most academics just make pre-print copies of their papers available for free on the net, even before the final drafts of their submitted papers (to whichever journals they send them to) are accepted for publication. So, this means that R&D work is produced very fast (the only delay is the publishers) and anyone can get access to it quickly.
For myself, I will always prefer to follow academic journals first & foremost when I look for new solutions or algorithms to implement, because they are rigorous. But that's just me & of course others are free to ignore them. I disagree with the notion that academics keep their work secret for patent application purposes. The PageRank algorithm is patented by Google (the patent was originally held by Stanford but Google bought it back for $300 million) but the algorithm is published in the literature (1998). Their paper is available on the net for anyone to implement (there are already free implementations of it on the net), so sure, Google may by now have improved on the original variant of the algorithm, but that's what drives innovation. Keeping it secret will make the other guy (say Microsoft & others) want to beat the retrieval accuracy of PageRank by coming up with something new.
The kind of free & open publishing and sharing (remember that academics' work is open, even if you pay to buy a copy of their papers from the journals) will not drive innovation. Google, Microsoft, IBM, Intel, etc. have an army of PhDs that perhaps do nothing all day except invent new algorithms because they fear not beating the others. Once they simply follow others, that's the moment they have stopped innovating.
One just has to look at the tons of research papers pouring out of Microsoft Research, which are available in the academic literature here. That's really innovating. Sure, what goes into product development at Microsoft may not be exactly the same as the algorithm they published; perhaps it's a more robust variant (which makes sense, since they already make their research accessible to anyone through the journals, and keeping the robust version out of a journal is their commercial secret), but the fact that they publish their new algorithms is a revelation on its own. They do share their work.
Just browse along their R&D site here and see high quality papers on new algorithms & methods that are being submitted for publications in various journals. Microsoft has been doing this for over a decade.
http://research.microsoft.com/en-us/
I've frequently implemented some algorithms that are being published from Microsoft Research, like the MapReduce NMF for dyadic data (non-negative matrix factorization). So, Microsoft published their versions of MapReduce NMF to share with the wider world.
If decision makers are scared by jargon they probably shouldn't be decision makers. I'm fine with amateur data scientists as long as amateur means what it is supposed to mean, not doing data science for money. If someone in business wants ROI, they had better hire a professional data scientist not an amateur just as they hire professional lawyers and professional accountants.
But something like bootstrap re-sampling is just too simple and too useful for anyone to miss out on. It's as simple as this. Calculate your statistic on the data. Now repeat that some number of times after resampling the data (with replacement) and use the distribution of results to compute your confidence intervals or standard deviation or whatever you want. It's typically about 5 lines of code. It's simpler than your subsampling methods, has more statistical power, and is just as easy (easier really) to explain. Plus it is proven to give correct results. So why not just teach that instead?
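The recipe described above can be sketched in a few lines. This is a minimal illustration, assuming the percentile variant of the bootstrap and a simulated sample; the function name `bootstrap_ci`, the number of resamples, and the exponential test data are hypothetical choices, not something taken from this thread.

```python
import numpy as np

def bootstrap_ci(data, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for statistic(data)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    # Recompute the statistic on resamples drawn with replacement.
    stats = np.array([
        statistic(data[rng.integers(0, n, size=n)])
        for _ in range(n_resamples)
    ])
    # Read the interval off the empirical distribution of the statistic.
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Simulated sample (assumption): 500 draws from an exponential distribution.
sample = np.random.default_rng(1).exponential(scale=2.0, size=500)
low, high = bootstrap_ci(sample, np.mean)
print(f"sample mean = {sample.mean():.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Any function of the sample can be passed as `statistic` (for example `np.median`), which is the sense in which the approach needs no model-specific formula.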
David, my business experience is that jargon such as "bootstrap re-sampling" scares decision makers, it even scares me. And the use of "it is better" scares them even more, unless "better" means consistently increased ROI. There are millions of supposedly "better" statistical strategies, and great Princeton PhD's tried to use them to beat the market (Wall Street), but it seems that very few (if any) succeeded over extended time periods.
I'm not in the business of convincing academics to use my methods. My goal is to create a unified (scalable, robust, simple, automatable) approach to dealing with randomness, and maybe even substantially re-invent statistics, as long as it helps better solve problems that deal with modern data and modern computers.
On a different note, why not accept amateur data scientists? In astronomy, they've proved to be incredibly useful.
© 2019 Data Science Central ® Powered by