According to Emcien, data mining is almost dead. Do you agree with this statement?
Here's my answer:
I don't think data mining is dead. It has been renamed big data / data science, although data science is much more than data mining: 20% of data science is pure data mining, that is, exploratory analysis to detect patterns, clusters etc. in order to develop automated predictive or visual solutions, or automated systems such as fraud detection, automated bidding or customized recommendations (books, restaurants, Facebook friends etc.) for targeted individuals. Indeed, data mining is at the very core of data science, together with data architecture / data warehousing / database modeling.
What's your opinion?
I agree with you. I don't think that it's dead by any stretch of the imagination. The author in the Emcien article gives three reasons for this assertion:
1. Too much data
2. Outdated data
3. Wrong questions, hence wrong queries, all at the hands of a small group of analysts
According to the author, all of the above points lead to an inefficient system. Moreover, this inefficiency stems from the fact that companies may be approaching the search for answers from a reactive stance, so the data pulled arrives too late.
With regard to point 1, the sense of being overwhelmed with too much data comes from the fact that the wrong questions are being asked. But if the people asking the questions are knowledgeable individuals in the organization, then the questions will be more accurate. So point 1 is intimately connected to point 3. As for point 2: by the time the data is pulled, is it antiquated? It depends on how often that data is being queried and the history that it's creating. A well-managed set of questions, where data is pulled over a period of time, can be very revealing in the long run. Even if we are looking for immediate results, historical data, depending on how it is broken down, can be very insightful.
There are reasons why good statistical analyses are based on valid and reliable methods. A good statistician or data miner will always make sure that his or her methods are compliant with established practices.
Having said the above, then, is a reactive approach worse than a proactive approach? Well, leaving aside the meaning of these fashionable terms, members of an organization will usually learn from their behavior by looking at the past. Once they know what they have done, measures can be created to monitor current practices.
So, is data mining dead? I don't believe so.
Some 20 years ago, when I was chartered to design memory and cache controllers for the Intel Pentium, we had the ability to run hundreds of thousands of simulations that would produce a couple hundred statistics each. My total disdain for Excel grew from that period as trying to graph any of these experiments would result in the blue screen of death. To deal with this blob of data, it was pretty clear that you needed two skills: statistics and programming. Without the statistics skill you couldn't interpret a summary statistic, and without the programming skill you couldn't get to the data.
I don't feel that "big data" has made any fundamental progress on this front. We may now program in MapReduce or Scala, and take advantage of the deep functionality of R, but the two skills are still needed, first to evaluate whether a question can be answered by the underlying data, and second to answer it.
So whether you call it data mining, data science, big data, or deep analytics, using raw data to answer deep questions hasn't gotten much simpler. What has changed is that data is now easily generated and easily stored.
It appears to me that the Emcien article author is referring to data mining as retrospective OLAP or SQL queries. That's not a technically incorrect definition: it describes manually mining data for information. However, I believe the more popular view of data mining -- building automated models that generate prospective insight -- is not only far from dead, but is in need of substantial expectations leveling in the context of big data.
As a short example, only a limited amount of data needs to be sampled in order for a predictive model to obtain a solid representation of the solution space. And once the model is trained, it's highly efficient in scoring new cases -- in real time and high volume. At least in my opinion, the Emcien article does not apply to machine-learning, prospective / descriptive / predictive models. Even so, business practitioners will still need to make direct retrospective queries of the data!
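To make the sampling point concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the data is synthetic (y = 3x + 2 plus noise), the "model" is a plain least-squares line, and the names fit_line and score are hypothetical helpers, not anything from the Emcien article. It only suggests that a model fitted on a 1% sample can land very close to one fitted on all the data, and that scoring a new case is then a cheap calculation.

```python
import random

def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def score(model, x):
    """Score one new case: a single multiply and add."""
    slope, intercept = model
    return slope * x + intercept

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic "full" data set, made up purely for illustration.
population = []
for _ in range(100_000):
    x = rng.uniform(0, 10)
    population.append((x, 3 * x + 2 + rng.gauss(0, 1)))

# Train once on all the data and once on a 1% random sample.
full_model = fit_line(population)
sample_model = fit_line(rng.sample(population, 1_000))

print("full data  :", full_model)
print("1% sample  :", sample_model)

# Scoring new cases with the sample-trained model.
new_cases = [rng.uniform(0, 10) for _ in range(5)]
scores = [score(sample_model, x) for x in new_cases]
print("scores     :", scores)
```

How small a sample is enough depends on the noise in the data and the complexity of the model, so real problems will not always be this forgiving.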
Probably the biggest change that will occur with Big Data is around dashboarding and real-time standard measurements and metering of data as it arrives in order to provide different ranges of moving averages / recency for leadership to oversee recent and current performance. But when it comes to the requirements and function of predictive modeling and machine-learning data mining, I believe the article is 180 degrees from reality.
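The moving averages mentioned above can be sketched as follows. This is only an illustration under assumed names and numbers: the MovingAverage class, the window sizes and the sample stream are all hypothetical. Each new value updates a short-window and a long-window average as it arrives, which is the sort of figure a dashboard might show for "recent" versus "longer-term" performance.

```python
from collections import deque

class MovingAverage:
    """Running mean over the most recent `window` values of a stream."""

    def __init__(self, window):
        self.values = deque(maxlen=window)
        self.total = 0.0

    def update(self, value):
        # When the window is full, the oldest value is about to be evicted,
        # so remove it from the running total first.
        if len(self.values) == self.values.maxlen:
            self.total -= self.values[0]
        self.values.append(value)
        self.total += value
        return self.total / len(self.values)

# Hypothetical metric values arriving one at a time.
stream = [10, 12, 11, 15, 20, 18]

short = MovingAverage(window=2)  # reacts quickly to recent changes
long = MovingAverage(window=4)   # smoother, longer-term view

short_avgs = []
long_avgs = []
for value in stream:
    short_avgs.append(short.update(value))
    long_avgs.append(long.update(value))

print("short window:", short_avgs)
print("long window :", long_avgs)
```

Keeping a running total means each update costs the same regardless of window size, which matters when the data arrives in high volume.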
Data science / mining / whatever-you-want-to-call-it is not even close to dead; I think it's really just beginning. If we don't use historical data, how can we predict? Simply using data to make a decision can be called data analysis or science. We are creating data at an unprecedented rate, which will allow for broader use and further creation and refinement of data models. We are creating data that has never existed before, giving new life and purpose to the field and to those who work in it. As long as someone wants to find a pattern in anything, it stands to reason that data mining is alive and well and will continue to be so.