Is data science a new paradigm, or recycled material?

Data science is the result of a paradigm shift taking place in IT. The question in the title was raised recently, and here I explain how and why data science is part of this new paradigm, and not recycled material.

New arsenal of techniques and metrics

Many data science techniques are very different from, if not the opposite of, old techniques that were designed to be implemented on an abacus rather than on computers. These new tools are often model-free.

For instance, new tools include

Indeed, old techniques such as logistic regression and classification trees don't even belong to data science; more stable techniques are used instead. You can find many of them published as open intellectual property, in our data science research lab.

The way (big) data is processed has also changed dramatically: it requires optimizing complex Hadoop-like architectures, and computational complexity is no longer an issue in many cases (as long as you use efficient algorithms). The bottleneck is now the time it takes for data to flow back and forth in data pipeline systems.

A truly new paradigm

Saying that data science is not creating a paradigm shift is like saying that if we claim Earth revolves around the sun rather than the other way around, there's no change in paradigm, because after all, we are still dealing with 2 celestial bodies and 1 revolution: nothing changed. By the same logic, using an abacus or a computer means no change in paradigm: we are still dealing with automated computations to obtain more value faster.

The change in paradigm that I am referring to consists of moving away from models to focus on data. It is the data-to-algorithm approach (bottom-up) rather than model-to-data (top-down), and in the process many old tools are becoming obsolete. It also involves working with messy, unstructured data.

Also, big data has caused an explosion in spurious correlations and in wrong analyses and conclusions, produced by people still using the old paradigm. The new paradigm allows you to (just to name a few)

  • identify causes rather than correlations
  • find real signal buried under noise that disguises itself as strong...,
  • perform 20,000 A/B tests without having tons of false positives (the concepts of statistical testing and p-values no longer exist in the new paradigm)
  • use synthetic metrics designed to better measure outcomes or to select variables, rather than metrics derived from solving elegant mathematical equations (and sensitive to outliers). Some of the new metrics measure things that have never been measured before, such as bumpiness in time series (critical, e.g., for stock trading); others measure the predictive power of a variable, or the level of clustering or randomness in a data set
  • perform clustering on 100 million observations in less than one hour (try that with any traditional algorithm: it will take hundreds of years to complete, and it's not easy to implement under Hadoop)
  • develop real-time techniques, APIs, and machine-to-machine communications
  • automate updates of parameters, look-up tables, rules and training sets in machine learning algorithms such as scoring, according to some optimized schedule computed by the algorithm itself
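
To make the A/B testing item above concrete, here is a minimal Python sketch of why naive significance testing breaks down at that scale. It is an illustration only, not one of the new techniques referred to in this article. It assumes that none of the 20,000 tests has a real effect and that each test statistic is approximately standard normal; the function names and the 0.05 threshold are hypothetical choices made for the example.

```python
import math
import random


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic z."""
    # 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def count_false_positives(n_tests: int, alpha: float, seed: int = 0) -> int:
    """Simulate n_tests A/B tests in which the variants truly do not differ.

    Assumption: with no real effect, each z-statistic is drawn from N(0, 1),
    so every "significant" result counted here is a false positive.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return sum(
        1
        for _ in range(n_tests)
        if two_sided_p_value(rng.gauss(0.0, 1.0)) < alpha
    )


if __name__ == "__main__":
    n_tests, alpha = 20_000, 0.05
    hits = count_false_positives(n_tests, alpha)
    print(f"{hits} of {n_tests} null tests look significant at alpha={alpha}")
    print(f"expected about {n_tests * alpha:.0f} by chance alone")
```

Under these assumptions, roughly 5% of the tests (about 1,000 out of 20,000) come out "significant" even though there is nothing to find, which is the flood of false positives that the old approach produces when applied blindly to big data.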

Why don't some people see the unfolding data revolution?

They might see it coming but are afraid: it means automating data analyses at a fraction of the current cost and replacing employees with robots, yet producing better insights based on approximate solutions. It is a threat to would-be data scientists.

In addition, data science unifies domains that were previously considered independent silos, adds its own research core, and delivers knowledge (e.g. open-source intellectual property) outside traditional academia. Data science is also a cross-disciplinary field: it is a horizontal, not a vertical, domain. It might appear that nothing new is being created if you follow academic research closely, but that is because innovation now happens outside academia.

Comment by Nagaraj Kulkarni on October 17, 2014 at 5:01am

Yes.

Comment by Sione Palu on October 16, 2014 at 4:57pm

I may post a link to a brief description on Diffusion-Wavelet which may be useful to readers here:

http://en.wikipedia.org/wiki/Diffusion_wavelets

Comment by Sione Palu on October 16, 2014 at 3:17pm

I agree Theodore, scientific computing has a blend of data science. In fact, I think that these domains of scientific computing were the original data science, but they weren't labelled as such.

Vincent, good article overview. However, I think that old techniques are still around and have evolved into more sophisticated new versions (i.e., new variants with powerful capabilities that were missing from the original techniques).

A good example of this is wavelets. They have been around since the 1910s and re-emerged with new variants in the 1980s, and in recent years even more variants have appeared in the literature, with powerful capabilities not seen in any previously published technique. Wavelets are everywhere today: computer-network anomaly detection, fraud detection, image processing (MRI and so forth), telecommunication signal processing (electrical noise suppression), embedded systems (wavelet chips), multi-level (or multi-scale) text mining and topic modeling, text search engines, neural-wavelets (i.e., hybrids of neural nets and wavelets, where the activation functions of the neural network are the wavelet basis), privacy-preserving data mining using wavelets, and many more applications that keep coming up in the literature. The new types of wavelets can do certain mathematical computations that the old variants couldn't.

One new type in the literature is called Diffusion-Wavelet (DW). It has many applications, but it caught my eye for its application to text mining (multi-scale topic modeling), because LDA (latent Dirichlet allocation) and other latent semantic analysis algorithms (SVD, NMF, PLSA and so forth) are all single-scale algorithms: they discover topics on one scale only, not on many scales at once. I've seen work coming out of the IBM T.J. Watson lab that applies DW to automatically organize a corpus of textual data into an ontology of concepts, which is very hard to do using SVD, LDA, PLSA and other flat (single-scale) techniques.

So, there's no doubt that old techniques will keep evolving into more robust variants, now and in the future.

Comment by Theodore Omtzigt on October 16, 2014 at 8:34am

In the supercomputing world, computational science has become a 'third' component in the scientific method, alongside and on an equal footing with theory and experimentation. The ability to computationally characterize, resolve, or predict using 'big data' is IMHO similar to the computational component in the scientific method. But that also allows us to defend the statement that little has changed, as there is still a lot of use for the 'old' methods. Therefore, I would argue that it is a false dilemma: computational methods in science and statistics are a wonderful additional mechanism to pursue and create better understanding of complex systems, and that has been the goal for both science and statistics since the beginning of time.

© 2019 Data Science Central ®