
This applies to data science research as well as any other analytic discipline. For centuries, scientific research was performed in Academia, by university professors managing their own labs. Much of the research was carried out by young scientists who had just completed their PhD. The selection process has always favored the same type of personality. The basic rule is "publish or perish", which produces the following drawbacks:

  • Re-use of old material (rather than brand new material) for fast publication turnaround
  • Professors accumulating thousands of publications (are they produced by a robot? Some are, but most are actually produced by PhD students)
  • The need to hire many low-paid PostDocs and PhD candidates to produce publications
  • Too many candidates lured into doing a PhD; many will end up as low-paid, part-time adjunct professors due to market saturation


With the tenure process, research directors must be careful not to engage in revolutionary experimentation, in order to please their grantors and faculty boards. They also spend a considerable amount of time chasing money, rather than doing research.

This hurts innovation. Private industry and some agencies have their own research labs. But they hire the same type of individuals: the kid who always had perfect grades at school, on the assumption that this is a predictor of research quality (and since they define what quality is, we are stuck in a loop here). The private sector does provide an alternative to Academia, though research results are often kept secret and incorporated into patents.

The New Model

Here I propose a new approach to scientific research, and discuss how it could be implemented on a larger scale, via proper monetization. It consists of independent professionals performing their research and publishing in popular blogs rather than in scientific journals, and obtaining for themselves the data that they need for their tests and experimentation (many data sources are free, many projects are posted on Kaggle, and research-oriented projects are posted on DataScienceCentral, some using simulated data). You can call it crowd-research.

The advantages are as follows:

  • Much faster lifecycle, from deciding on a project, to delivering value
  • Fosters innovation: anyone can participate, no need for a PhD
  • You are judged by your peers via comments, shares, and other social activity; the best scientists will naturally shine and attract publicity
  • You can collaborate with anyone
  • Impartiality of your research; you are not pressured by sponsors
  • No political impediments
  • You own copyrights, and your intellectual property is yours (not the case if you work for a private company other than one that you own)
  • You can work online, from home, and live wherever you want (you don't face the two-body dilemma in which one spouse must abandon her/his career and dreams)
  • No need to spend tons of time publishing in top journals
  • Your articles are available for free to other people (articles from scientific journals cost sometimes as much as $40 for access)
  • Your research reaches out to far more people

In my case, I realized that publishing in blogs takes 1 hour per article, rather than 50 hours for scientific journals. At $1,000/hour (my hourly rate), and since scientific journals don't pay authors, it's a $49,000 saving per article, that is, hundreds of thousands of dollars saved per year. Also, my articles are shorter, published much faster, reach a thousand times more users, are easier to read (with source code that you can copy and paste, data sets that you can download), and are written so as to be understood by many professionals from various applied disciplines, not just a dozen highly specialized theoretical experts. You can compare my article on data videos with one published by a traditional statistician, in a top traditional journal, independently and at the same time. I believe that mine is more useful, provides code to make much faster, longer videos, and is, in essence, of superior value.
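The arithmetic behind that figure can be checked with a short sketch. The hourly rate and the hours per article come from the paragraph above; the number of articles per year is not stated, so the value used here is a hypothetical placeholder chosen only to illustrate the "hundreds of thousands of dollars" order of magnitude.

```python
def savings_per_article(hourly_rate, journal_hours, blog_hours):
    """Opportunity cost avoided by blogging instead of writing for a journal."""
    return hourly_rate * (journal_hours - blog_hours)


# Figures quoted in the text: $1,000/hour, 50 hours vs. 1 hour per article.
per_article = savings_per_article(hourly_rate=1000, journal_hours=50, blog_hours=1)

# Hypothetical assumption (not stated in the article): 10 articles per year.
articles_per_year = 10
per_year = per_article * articles_per_year

print(f"Saved per article: ${per_article:,}")
print(f"Saved per year (assuming {articles_per_year} articles): ${per_year:,}")
```

The yearly total scales linearly with the assumed publication rate, so any rate of a few articles per year or more lands in the range claimed above.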

How to pay for this new type of research? 

The money can come from various sources. As a data scientist interested in doing research, you have the following options; you can combine several of them:

  • Gathering, curating and selling data: for instance, reports about job or market trends (sell directly to clients or on Amazon; data is gathered from sources such as Indeed.com or with your own web crawler, harvesting data and distilling summaries continuously)
  • Design a predictive API, monetized via subscription or advertising: for instance, about stock price forecasts, or to perform clustering of large data sets in the cloud
  • Blog for digital publishers, get paid per article; it's easy to earn $3,000 per month working only a few hours per week
  • Blog for yourself, monetize your blog with advertising (partner with a sales guy)
  • Consulting
  • Tutoring, helping others write their PhD dissertations (the statistics research)
  • Help recruit data science candidates, directly or via your own job board

If you spend 25% of your time on the money-making activities listed above, 25% on building your network and reaching out to clients, 25% on doing scientific research (including working on projects that support your research), and 25% on managing your business (organizing, planning, operations, finance), you will soon make more money than you would working in a cubicle, while doing things that you enjoy and with real control over your life.

I'll write more articles on how to get started with this career path, and offer mentoring, in the near future. For now, feel free to check out our research lab publications.


Comment by David Johnston on January 2, 2015 at 6:01pm

I think your format for presenting things like that model-free confidence interval is fine. I'm sure what might tick off academics in statistics is the way you present it. It's presented as if it is a new statistical method and it certainly is not. It's just the sampling method of estimating confidence intervals. But there is a better and even easier method called bootstrap resampling that's been widely known and widely used since 1979 (Efron). It is provably better in the sense that you can achieve the same accuracy in the confidence intervals with a smaller sample of data. If your data sample is enormous, then it's true that this is just as good. When you are data starved, bootstrap and similar resampling methods will improve things by a lot. Your method may indeed be easier to explain to a client. If I used bootstrap, I'd probably still explain it like this.

And what will really tick off academics is that you are comparing this to methods based on assumptions of normality as if that is what they are currently using as best in class methods for confidence intervals. Things like p-values haven't been cutting edge since the advent of computers. Analytic formulae were needed in those days because complex calculations not leveraging the symmetries of mathematics simply were out of reach. Some people may indeed be trying to use normal statistical formulae with non-normal data and could benefit from this. Perhaps some are out-of-touch professional statisticians but they aren't academic statisticians. 

Overall, I think you're being too hard on academic statisticians. Most are more practically minded than you think. Many published articles are about approximation methods and heuristics; algorithms that are fast and/or easy to implement are highly sought after.

I would admit that there is a selection against articles that are entirely based on numerical verification; that is, methods without any mathematical theory behind them. Perhaps these are simply viewed as unfinished ideas. Perhaps it is just that it is so easy to combine things to create heuristics that appear to work that journals would be flooded with them. Academics want ideas that they can build on top of. They don't want to be building on castles made of sand. Nothing creates certainty like a real theorem.

In fact, Efron did not invent the bootstrap. It had been widely used for years. Rather, he proved that it actually gave correct confidence intervals. Apparently that was considered a breakthrough whereas its actual invention (and likely reinventions) was not, and perhaps there is indeed something wrong with that.
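For readers unfamiliar with the bootstrap resampling mentioned in the comment above, here is a minimal sketch of one common variant, the percentile bootstrap confidence interval for a mean. It is an illustration only, not the method from the article under discussion nor a procedure given by the commenter: the function name `bootstrap_ci`, its parameters, and the toy data are all hypothetical.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for `stat` of `sample`.

    Hypothetical helper for illustration: resample the data with replacement,
    recompute the statistic each time, and read the interval off the
    empirical quantiles of those recomputed values.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    estimates = sorted(
        stat(rng.choices(sample, k=n))  # one resample, same size as the data
        for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo_idx = max(0, int(alpha * n_resamples))
    hi_idx = min(n_resamples - 1, int((1.0 - alpha) * n_resamples) - 1)
    return estimates[lo_idx], estimates[hi_idx]


# Toy data, made up for the example.
data = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 6.1, 3.3, 4.0]
low, high = bootstrap_ci(data)
print(f"sample mean = {statistics.mean(data):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

No normality assumption is used anywhere: the interval comes entirely from the resampled statistics, which is why the approach is attractive when the data are scarce or clearly non-normal.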

Comment by Vincent Granville on January 2, 2015 at 3:52pm

David, you make a number of good points. But publishing in a blog like this one - with many highly educated analytic people including PhDs like you - actually allows for real criticism and peer-review of whatever is posted. Let me illustrate this with one of my articles posted here: model-free confidence intervals. I can make the following comments:

  • It was proved by readers of this very blog, to be equivalent to traditional methods, so maybe it is not so original (someone posted that it was exactly identical to some very obscure method that I've never heard of, and that's OK with me)
  • However, it is part of a new unified approach, where no models are used, and robustness is paramount as the final purpose is to develop sound, coherent methods understood and successfully used by the non-initiated, automatable, and not subject to over-fitting or other misuses
  • It is OK if not truly original as long as it works as well or nearly as well as the best alternatives, in applied environments, and OK if other people copy it - actually, I'd love to see my article copied. For many statistical researchers, 'best' has a different meaning than mine. For me, 'best' means good enough, simple, scalable, outlier-resistant, fast and easy to code, fast algorithm, and robust (it also involves testing via sound, state-of-the-art cross-validation techniques)
  • Most importantly, the purpose is to disseminate broadly and quickly simple and efficient methods (part of a general, unified framework), not to gain fame, originality or a job in a research lab, or even any job, for that matter

So it serves a useful purpose, and a different purpose than academic research. And people like me have neither the time nor the interest to publish in scientific journals. I could not be hired in any research position anyway because the application process, quite tedious, would overwhelm me. So academia is biased in the sense that it only hires certain types of researchers, and it does not include very creative, applied, business-oriented people like me (the real reasons in my case are that I am overpaid, enjoy operating a business and what I am doing now, and fit badly in highly structured environments).

The fact that people like me publish independently, makes the world richer, not poorer. In some ways, I democratize knowledge, making it more accessible and robust to use in black-boxes and other production environments. My 2 cents.

Comment by David Johnston on January 2, 2015 at 2:17pm

I'm sorry but I have to say that I think this idea is off-base. I'm not going to defend everything about academic life but I will say that academic research definitely has its place. I'd hate to see research become dominated by internet popularity, moneyed interests and politics. There are some great things about academic research. When you read a peer-reviewed journal article you can't be sure everything is correct. However, you can be almost sure that the work isn't garbage created by charlatans because experts in the field have read through it closely. You can be almost sure that the ideas weren't simply stolen from others because a published article needs to have references to prior work and a brief summary of the current state of the field.

 

If internet popularity was the gauge of the quality of work then sensationalism would be the way to stardom. One could argue that sensationalism is the way to stardom in the blogosphere. (I don't mean that as a dig at you or any blogger.)

I've seen these sentiments expressed by people who seem to feel that there is an insurmountable barrier to publishing in academic journals if they are not academics. However, there is no such barrier except for the fact that one needs to meet the expectations of publishable work. Those without a research background typically aren't very good at preparing their work to meet those standards. The most difficult hurdles to overcome are showing that the work is correct and that it is original. The first is simply rigor. Often people have good ideas but not the training to produce a well thought out, rigorous argument. If they submit such work to a journal, the referee may simply not have confidence that the results are correct and put the burden back on the authors to supply this rigor, at which point the authors often give up as they don't know how to proceed.

The other big sticking point is originality. This is the biggest hurdle. Most non-academic authors (or academics publishing in a foreign domain) simply don't have the expertise in the field to know whether their work really is original. Not being original doesn't mean intentionally copying someone else's work. It means that you need to have arrived at the solution before anyone else. If you're not the first one there, you might write a review article (summarizing a topic) or a book but you can't publish it as original work. Many, many great ideas are rediscovered hundreds of times. 

The reason why academics are so good at knowing what is original is simply that they specialize in a small part of the world's knowledge. That's what academia is all about. Most people outside of academia are not able to do that and so cannot understand a field well enough to contribute original research. It's as simple as that.

The best way for a nonacademic to publish research in academic journals is simply to approach somebody in academia and partner with them on a publication. You contribute your idea and the academic contributes expertise. In many cases they will simply tell you that your idea is a good one but not original. Sorry. You can still blog about it or popularize it or monetize it but you can't claim to be the original discoverer because you aren't. If it is original but you haven't made a rigorous argument, they can help you do that. They'll know the level of rigor that is required for that particular field of study.

I suppose there is also a place in the world for the mavericks who just publish their research online in whatever format they see fit. Sometimes it will be correct and lead to great advances. The number of such successful attempts is, as far as I know, minimal. I can't really think of a single one in modern times. Incorrect, quackish works like that number in the thousands. Do you really want to put your work into that kind of environment? Yeah, you might get more readers but you are unlikely to get the readers you really want. And what's to stop someone from stealing the idea on your blog, publishing it in a journal and claiming it as their own?

 

 
