There has been a great deal of discussion over the past several weeks regarding data mining and predictive models. Terms like “meta data” and “algorithm” are fast moving out of the domain of IT practitioners and into the realm of water cooler discussion. This is a good opportunity to briefly review some of these concepts in order to better understand data mining practices and standards.
First, some terms.
Meta Data - data about data. We frequently hear this term misused. Common meta data might include file size, extract dates, field counts, and definitions.
Algorithm - a set of precisely defined steps expected to arrive at an answer to a problem or set of problems. In the data world, this is typically a mathematical equation with a series of variables that are repeatedly populated with data from a designated data set. The results of the calculation form the basis of a decision engine.
Probability - a measure or estimation of how likely it is that something will happen. Probability is usually presented as a “percentage of likelihood”. In the case of a decision engine, you might hear, “If a customer enters the store, there is a 27% probability they will purchase at least one item.”
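To make the store example concrete, here is a minimal sketch in Python of how such a figure could be estimated from past visits. The function name and the visit data are hypothetical, invented only for illustration, and a real decision engine would draw on far more data and more variables.

```python
def purchase_probability(visits):
    """Return the share of visits that ended in at least one purchase.

    `visits` is a list holding the number of items bought on each visit.
    """
    if not visits:
        raise ValueError("need at least one visit")
    buyers = sum(1 for items_bought in visits if items_bought >= 1)
    return buyers / len(visits)


# Hypothetical data: items purchased on each of 100 store visits.
visits = [1] * 20 + [3] * 7 + [0] * 73

probability = purchase_probability(visits)
print(f"{probability:.0%} probability of at least one purchase")
```

With this made-up data, 27 of the 100 visits include a purchase, so the estimate is 27%.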
The purpose behind data mining is to consolidate a single large data set, or several large, diverse data sets, and create meaningful, decision-ready intelligence. Decision-makers are expected to take action based on the output once the intelligence is delivered. The action taken depends in some part on the confidence the decision-maker has in the intelligence source, the risks involved, and other factors that exist outside the data output. Data mining, at least in today’s corporate world, rarely provides a decision. Final decisions in the majority of today’s professional community rest largely with the individual or group who will be held accountable for results. It is thus possible that companies will take action that is contrary to the results of an analysis, despite the probability and risk involved.
For a decision engine to be of value, it must be reliable, have an algorithm that produces the probability needed for the decision process, and contain data that can be used for the calculation. There are many challenges to developing an effective decision engine. Data access is always one of the greatest, but knowledge and understanding of the process is also important, and often overlooked. Statistics can calculate correlation, but cannot establish cause and effect. Statistical calculations also have a margin of error, meaning the decision engine can produce “false positive” indications, or fail to identify occurrences that do happen (“false negatives”). This is why data mining and decision engines are rarely used as final decision-makers.
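The two kinds of error can be illustrated with a small sketch in Python. It assumes a hypothetical decision engine that flags an event whenever its probability score reaches a cutoff. The scores, outcomes, and the 0.5 cutoff are invented for illustration and do not come from any real system.

```python
def count_errors(scores, outcomes, threshold=0.5):
    """Count false positives and false negatives for a given cutoff.

    scores   -- probabilities produced by the (hypothetical) engine
    outcomes -- True where the event actually happened
    """
    false_positives = 0
    false_negatives = 0
    for score, happened in zip(scores, outcomes):
        predicted = score >= threshold
        if predicted and not happened:
            false_positives += 1  # flagged, but nothing happened
        elif happened and not predicted:
            false_negatives += 1  # happened, but was not flagged
    return false_positives, false_negatives


# Invented example data for six cases.
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
outcomes = [True, False, True, True, False, False]

fp, fn = count_errors(scores, outcomes)
print(f"false positives: {fp}, false negatives: {fn}")
```

Raising the cutoff generally trades false positives for missed occurrences, and lowering it does the reverse. Choosing that balance is a judgment about risk, which is one reason the final call tends to stay with an accountable person.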