Why the statistical community is disconnected from Big Data and how to fix it

This discussion was posted on our large LinkedIn group (100,000+ members) by our friend Gregory, pictured below. It has generated a tremendous volume of great comments from a number of top leaders. Below are some of my comments. You can read and participate in the discussion by clicking here.

Gregory Piatetsky-Shapiro

One of Gregory's comments:

@carey - why should statisticians "be the leaders of the Big Data and data science movement"? Except for a few statisticians like Breiman & Tibshirani, most statisticians missed the boat on Data Science and Big Data, and statistics deals neither with the computational aspects that are critical for Big Data, nor with the business aspects that are critical for getting results.


Vincent Granville's reactions (selected comments - feel free to add yours even if you totally disagree with me; there will be no censorship):

For those who believe that big data and data science are just pure engineering or CS fields marked by ignorance or poor application of statistics, I invite you to read my book at http://www.datasciencecentral.com/profiles/blogs/my-data-science-book

You'll see that data science has its own core of statistics and statistical research. For instance, in my article "the curse of big data", I discuss the fact that in big data, you are bound to find spurious correlations when you compute billions or trillions of correlations. These spurious correlations overshadow real correlations, which go undetected. I suggest that instead of looking at correlations, you should compare correlograms: correlograms uniquely determine whether two time series are similar; correlations do not. I also talk about normalizing for size. You don't need to be a statistician to identify these issues and biases, and correct them. A data scientist should know these things too, as well as other material such as experimental design, applied extreme value theory, Monte Carlo simulations, confidence intervals created without an underlying statistical model (Analyticbridge's first theorem), identifying non-randomness, and much more.
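To see how quickly this bias appears, here is a minimal sketch (assuming NumPy; the sizes are illustrative, not taken from the article): every series below is pure noise, yet computing half a million pairwise correlations is bound to surface some that look strong.

```python
# Spurious correlations from sheer volume: 1,000 independent noise series
# yield ~500,000 pairwise correlations, and some will look "real" by chance.
import numpy as np

rng = np.random.default_rng(42)
n_series, n_points = 1000, 50
data = rng.standard_normal((n_series, n_points))   # pure noise, no signal

corr = np.corrcoef(data)                           # 1000 x 1000 matrix
pairs = corr[np.triu_indices(n_series, k=1)]       # 499,500 distinct pairs

print(f"pairs examined: {pairs.size}")
print(f"pairs with |r| > 0.5: {(np.abs(pairs) > 0.5).sum()}")  # > 0, all spurious
```

Every pair flagged here is spurious by construction; at big data scale, such false hits can easily outnumber and drown out the real correlations.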


Also, I think you can be both a data scientist and a statistician at the same time, just as you can be a data scientist and an entrepreneur; I would even go so far as to say that the combination is a requirement. The two are certainly not incompatible; we just have to be aware that the official image of statisticians, as pictured in AMSTAT publications or on job boards, does not (for now at least) represent the reality of what many statisticians do.


Still, data science and statistics are different, in my opinion. Many of the data science books I've read can give you the impression that the two are one and the same, but that's because the author just re-used old material (not even part of data science), added a bit of R or Python, and put a new name on it. I call this fake data science. Likewise, data science without statistics (or with reckless application of statistical principles) is not real data science either.

In the end, I guess everyone has their own idea of what data science, statistics, computer science, BI, or entrepreneurship is or is not. You decide what you want to call yourself; as for me, I'm clearly no longer a statistician but a data scientist (I was a computational statistician to begin with anyway). My knowledge and expertise are different from those of a statistician (probably closer to computer science). And although I have a good knowledge of experimental design, Monte Carlo, sampling, etc., I did not learn most of it at school or in a training program. The knowledge is available for free on the Internet. Anybody - a lawyer, a politician, a geographer - can acquire it without attending statistics classes. And part of my data science apprenticeship is to make this knowledge accessible to a broader group of people. My intuitive Analyticbridge theorem on model-free confidence intervals is an example (among several) of a "statistical" tool designed to be understood by a 12-year-old, with no mathematical prerequisites, and applied in big data environments with tons of data buckets.
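As a minimal sketch of that bucket idea (the statistic, bucket count, and percentile levels below are illustrative choices, not the theorem's exact statement):

```python
# Model-free confidence interval from data buckets: split the data into k
# buckets, compute the statistic on each, and read off empirical percentiles.
# No distributional model is assumed anywhere.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)       # arbitrary data

k = 100                                            # number of buckets (illustrative)
buckets = np.array_split(rng.permutation(x), k)
estimates = np.array([b.mean() for b in buckets])  # one estimate per bucket

lo, hi = np.percentile(estimates, [2.5, 97.5])     # empirical 95% interval
print(f"mean = {x.mean():.3f}, bucket-based 95% interval = [{lo:.3f}, {hi:.3f}]")
```

Because each bucket sees only a fraction of the data, the interval is wider than a classical full-sample interval; the point is that it requires no distributional assumption and can be explained without any mathematics.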


Statistics in data science can be built from within by data scientists, or brought in by outsiders (those who don't want to be called data scientists and call themselves, e.g., statisticians). I am an example of an insider creating a culture of statistics inside data science. This will be more apparent when my Wiley book is published and my data science apprenticeship takes off in three months. Some will prefer contributions from outsider statisticians; some will prefer insider contributions like mine, since I am already familiar with data science, understand the difference between statistics and data science, and tend to be very close to the business people, being one myself.


@Dan: I thought the FDA stepped in because 23andMe didn't abide by the highly regulated process of running clinical trials, a task that is at the core of statistical science. In my opinion, the FDA statisticians (actually, their bosses) think that they know better than other statisticians, when in fact they know less. These (FDA, government) statisticians are precisely the ones I criticize most, for having skills, expertise, and knowledge unfit for data science, yet claiming that we data scientists are doing things the wrong way and that they know better than us.

One of the issues is the FDA hiring process:

  • (1) extremely bureaucratic (discouraging creative people from applying),
  • (2) many jobs require a security clearance (eliminating all green card holders and applicants from abroad),
  • (3) the pay is not great, and
  • (4) the location is probably Washington DC (it's a fantastic, vibrant city, but if you are flexible about location, you can attract more talent, for instance people who only want to work in the Bay Area, the #1 spot for data science).

Some of the best practitioners won't apply for these FDA jobs because they face all four challenges (1)-(4) at once. I'm one of them.


A point that nobody discussed is how to design a database at a very high level: which metrics to use (with exact definitions), how they are captured, and how granular we need to be when keeping old data. I believe this is part of data science. It involves working very closely with DB engineers and data architects. I think a data scientist must have some of the knowledge of data architects, and also of business people, to understand exactly what we do and what we are going to capture. Data scientists should also be involved in dashboard design. Typically, these are things that many statisticians don't want to do, or believe are not part of their job.


@Bill: Evidently, discussing the question is the first, critical, fundamental step. Not doing it is like building a skyscraper without foundations. But flexibility should be allowed: if something does not work as expected, how can we rebuild or change the project? How do we adapt to change? Is the design flexible enough to adapt? Who's in charge of defining the project and its main features?

For instance, regarding Obamacare and its website (not really a data science project, although who knows - maybe it is one, aimed at collecting data on people to cluster them for further government action), it smells like the whole project was designed and architected almost entirely by a lawyer (possibly Obama himself), with the consequences that we all know. Should a lawyer (or a CEO or a business executive) be the one in charge of strategic thinking and pre-architecture? Sure, they should be involved, but to what extent? Who else should be involved in the preliminary discussions? CTOs? CMOs? COOs? CFOs?

PS: You can do "blind data science", that is, collecting data without asking any questions and seeing whether you manage to extract anything valuable. But you need to be a very good statistician (with vision and intuition), and it probably applies only to projects (small or big) entirely run by one or two senior people, combining the roles of CEO, CTO, CFO, lawyer, marketing / product biz dev / COO, and statistician in just one or two people. I call this extreme data science, and I compare it to extreme rock climbing: a guy summited Mount Everest in a solo expedition, in winter, in 48 hours, without oxygen (yes, someone did it). In short, very few people are able to do it.


@Bill: Big data is not stat-ignorant. What you are saying is that anyone who does stats but does not call himself a statistician is an ignorant and arrogant person. I do stats and big data, and I call myself a data scientist, but I completed my postdoc at the Statistical Laboratory at Cambridge, published in the Journal of the Royal Statistical Society Series B and other respected statistics journals, and own patents on statistical scoring, not to mention extensive industry experience (detection of low-frequency botnets in very large data sets). What more do you need to not be called ignorant (with respect to stats)? And while I recognize there is a lack of statistical knowledge in the big data community, I'm one of the guys who is here to help.

If someone who knows about sampling, ARIMA, and experimental design thinks he can do better than data scientists and knows better than big data practitioners with 20 years of experience, that's real arrogance (and ignorance as well).

I think there are two types of statisticians: those who associate themselves with AMSTAT, who seem rather bitter and resentful (possibly because they are criticized more), and then the other statisticians. Likewise, there are two types of big data practitioners:

  • Those who know little about stats (but who know far more than statisticians about other fundamental areas of big data),
  • Those who know as much as the best statisticians in the world, with statistical knowledge oriented towards data science. 

The concept of p-value is rarely used in data science, not because of ignorance, but because we use different wording and a different metric (far easier to understand, in my opinion) that serves a similar purpose. In data science contexts, there is often no underlying model; we do model-free inference (predictions, confidence intervals, and so on, data-driven, with no statistical model). Google "Analyticbridge First Theorem" as an illustration (interestingly, the proof of this theorem requires mathematical / combinatorial / probabilistic arguments, but its application is straightforward, and it is also intuitive, unlike p-values).

Rather than p-values, I frequently use "predictive power", a synthetic metric that I created myself, which is a bit similar to the natural metric called entropy. More on this at http://www.datasciencecentral.com/profiles/blogs/feature-selection-...
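The exact definition is in the linked article; as a stand-in, here is a hypothetical entropy-based score with the same flavor (the function and formula below are illustrative, not the actual metric):

```python
# Hypothetical "predictive power" score for one feature against a binary
# outcome: 1 - H(y | binned feature) / H(y). 0 means the feature tells you
# nothing about y; 1 means it determines y completely.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def predictive_power(feature, y, n_bins=10):
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
    idx = np.digitize(feature, edges[1:-1])        # bin index in 0..n_bins-1
    h_y = entropy(np.bincount(y) / y.size)         # marginal entropy of y
    h_cond = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            p = np.bincount(y[mask], minlength=2) / mask.sum()
            h_cond += mask.sum() / y.size * entropy(p)
    return 1 - h_cond / h_y if h_y > 0 else 0.0

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
y = (x + rng.normal(size=10_000) > 0).astype(int)  # x carries real signal
print(round(predictive_power(x, y), 3))            # clearly above 0
```

Unlike a p-value, a score like this reads directly as "how much of the outcome's uncertainty does this feature remove", which matches the "far easier to understand" goal described above.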


@Gregory: Sometimes you need the entire data set. When you create a system to estimate the value of every single home in the US, you probably want the entire data set to be part of the analysis: millions of people each day check millions of houses and neighboring houses for comparison shopping. In this case, it makes sense to use all the data. If your database only had 10% of all historical prices, sure, you would be able to do some inference (though you would miss a lot of the local patterns), but 90% of the time, when a user entered an address to get a price estimate, he would also have to provide square footage, sales history, number of bathrooms, school rankings, etc. In short, this application (www.zillow.com) would be useless.


@Michael: I would not say this is a widespread trend in big data, and I might be one of the very few to explore and develop what I call AEDA: automated exploratory data analysis. I believe that the bulk of EDA can be automated.
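As a toy sketch of what automating the routine parts of EDA could look like (the function, threshold, and sample data below are illustrative, not the actual AEDA implementation):

```python
# Toy automated EDA: profile every column and flag strongly related numeric
# pairs, with no human in the loop. Threshold and data are made up.
import numpy as np
import pandas as pd

def auto_eda(df: pd.DataFrame, corr_threshold: float = 0.8) -> None:
    print(df.describe(include="all").T)            # per-column summary stats
    print("missing values per column:\n", df.isna().sum())
    corr = df.select_dtypes(include=np.number).corr().abs()
    cols = list(corr.columns)
    for i, a in enumerate(cols):                   # flag suspicious pairs
        for b in cols[i + 1:]:
            if corr.loc[a, b] > corr_threshold:
                print(f"flag: {a} ~ {b} (|r| = {corr.loc[a, b]:.2f})")

rng = np.random.default_rng(2)
df = pd.DataFrame({"sqft": rng.uniform(500, 4000, 200)})
df["price"] = 100 * df["sqft"] + rng.normal(0, 20_000, 200)   # tied to sqft
df["noise"] = rng.normal(size=200)
auto_eda(df)   # flags sqft ~ price, stays silent on noise
```

The automated pass handles the mechanical bulk (summaries, missingness, redundancy); the analyst's time is then spent only on the flags that deserve judgment.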


@Dan: I've been working pretty much all my professional life (25 years) on observational big data, where each observation matters (not necessarily high-dimensional data). I'm of course an outlier in the statistical community, which explains why I call myself a data scientist. Nevertheless, I am as savvy about statistical science as any statistician.

Examples of observational data where sampling is not allowed include:

  • Credit card processing: each single transaction must be approved or declined.
  • Book recommendations on Amazon: each book, each user must be part of the data set.
  • Price estimates for each house.
  • High frequency trading (trillions of tiny bins of data; the more data per bin, the better).
  • Friends and other recommendations on social networks.
  • Email categorization: spam, not spam. Each single piece of email must be processed.
  • Sensor data: higher resolution and frequency provides better predictive power.
  • Customized hotel room pricing for each online request to book a room.
  • Keyword bidding: each of a billion keywords must be priced right, in real time.
  • Keyword correlations: find all keywords related to a specific keyword. Needed in search engine technology or for keyword taxonomy, covering billions of searches entered daily by users.
  • Ad relevancy: matching an ad with a user and a web page, billions of times a day and individually for each page request.
  • News feed aggregator: detection, categorization, and management of millions of micro-blog postings to deliver high-quality news to syndicated partners. Each posting counts.
