There’s been a great deal of discussion over the past several weeks regarding data mining and predictive models. Terms like “meta data” and “algorithm” are quickly moving from the domain of IT practitioners into the realm of water-cooler discussion. This is a good opportunity to briefly review some of these concepts in order to better understand data mining practices and standards.

First, some terms.

Meta Data - data about data. We frequently hear this term misused. Common examples of meta data include file size, extract dates, field counts, and field definitions.

Algorithm - a set of precisely defined steps expected to arrive at an answer to a problem or set of problems. In the data world, this is typically a mathematical formula with a set of variables that are repeatedly populated with values from a designated data set. The results of these calculations form the basis of a decision engine.
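A minimal sketch of an algorithm in this sense: a fixed scoring formula whose variables are repeatedly populated with rows from a data set. The field names and weights here are hypothetical, purely for illustration.

```python
def score(row):
    # The "equation" at the heart of the engine: a weighted formula
    # whose variables are filled in from each record.
    return 0.5 * row["visits"] + 2.0 * row["past_purchases"]

# Hypothetical data set the engine iterates over.
customers = [
    {"visits": 10, "past_purchases": 3},
    {"visits": 2,  "past_purchases": 0},
]

# Repeatedly populate the formula with data; the resulting scores
# are what a decision engine would act on downstream.
scores = [score(row) for row in customers]
print(scores)
```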

Probability - a measure or estimate of how likely it is that something will happen, usually expressed as a percentage of likelihood. In the case of a decision engine, you might hear, “If a customer enters the store, there is a 27% probability they will purchase at least one item.”
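A probability like the 27% figure above is often estimated as a relative frequency over historical data. The counts below are hypothetical, just to show the arithmetic:

```python
# Hypothetical historical data behind the "27%" statement.
store_visits = 1000  # customers who entered the store
purchases = 270      # visits that ended with at least one item bought

# Probability estimated as a relative frequency.
p_purchase = purchases / store_visits
print(f"P(purchase | entered store) = {p_purchase:.0%}")
```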

The purpose of data mining is to consolidate one large data set, or several large, diverse data sets, into meaningful, decision-ready intelligence. Once that intelligence is delivered, decision-makers are expected to act on it. The action taken depends in part on the confidence the decision-maker has in the intelligence source, the risks involved, and other factors outside the data output. Data mining, at least in today’s corporate world, rarely provides a decision on its own. In most of today’s professional community, final decisions rest with the individual or group who will be held accountable for the results. It is therefore possible for a company to take action contrary to the results of an analysis, despite the probability and risk involved.

For a decision engine to be of value, it must be reliable, use an algorithm that produces the probability needed for the decision process, and have access to the data required for the calculation. There are many challenges to developing an effective decision engine. Data access is always one of the greatest, but knowledge and understanding of the underlying process is also important, and often overlooked. Statistics can measure correlation, but cannot establish cause and effect. Statistical calculations also carry a margin of error, meaning the decision engine can produce “false positive” indications or miss identifying possible occurrences. This is why data mining and decision engines are rarely used as final decision-makers.
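The margin of error can be made concrete by scoring labeled historical outcomes against a cutoff: some records above the cutoff did not actually result in the predicted event (false positives), and some below it did (false negatives). The scores, labels, and threshold below are hypothetical.

```python
# Hypothetical engine scores paired with the actual historical outcome.
records = [
    (0.9, True), (0.8, True),
    (0.7, False),   # scored high but did not purchase: false positive
    (0.4, True),    # scored low but did purchase: false negative (a miss)
    (0.3, False), (0.1, False),
]

THRESHOLD = 0.5  # cutoff the engine uses to flag likely purchasers

false_positives = sum(1 for s, bought in records if s >= THRESHOLD and not bought)
false_negatives = sum(1 for s, bought in records if s < THRESHOLD and bought)
print(false_positives, false_negatives)
```

Both error types are unavoidable in any statistical decision rule; moving the threshold only trades one for the other, which is part of why a human decision-maker stays in the loop.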

© 2019 Data Science Central
