An important principle of data science is that data mining is a *process.* It includes the application of information technology, such as the automated discovery and evaluation of patterns from data. It also includes an analyst’s creativity, business knowledge, and common sense. Understanding the whole process helps to structure data mining projects.

Since the data mining process breaks up the overall task of finding patterns from data into a set of well-defined subtasks, it is also useful for structuring discussions about data science.

Each data-driven business decision-making problem is unique, but sets of common tasks underlie many business problems. A data scientist decomposes a business problem into subtasks; the solutions to the subtasks can then be composed to solve the overall problem. Some of these subtasks are unique to the particular business problem, but others are common data mining tasks.

Despite the large number of specific data mining algorithms, there are only a handful of fundamentally different types of tasks they address. To illustrate the fundamental concepts, it is natural to start with classification and regression.

1. *Classification* and *class probability estimation* attempt to predict, for each individual in a population, which of a set of classes that individual belongs to.

2. *Regression* attempts to estimate or predict, for each individual, the numerical value of some variable.

3. *Similarity matching* attempts to *identify* similar entities based on data known about them.

4. *Clustering* attempts to *group* entities by their similarity.

5. *Co-occurrence grouping* attempts to find *associations* between entities based on transactions.

6. *Profiling* attempts to characterize the typical behavior of an entity.

7. *Link prediction* attempts to predict connections between data items.

8. *Data reduction* attempts to replace a large set of data with a smaller set of data.

9. *Causal modeling* attempts to help us understand what events or actions *influence* others.

The terms *supervised* and *unsupervised* were inherited from the field of machine learning.

Metaphorically, a teacher “supervises” the learner by carefully providing target information along with a set of examples. An unsupervised learning task might involve the same set of examples but would not include the target information.

If a specific target can be provided, the problem can be phrased as a supervised one. A supervised technique is given a specific purpose for the grouping, predicting the target. Clustering, an unsupervised task, produces groupings based on similarities, but there is no guarantee that these similarities will be useful.

Classification, regression, and causal modeling generally are solved with supervised methods.

Supervised learning is where you have input variables (X) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output:

Y = f(X)

The goal is to approximate the mapping function so well that when you have new input data (X), you can predict the output variable (Y) for that data.
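The idea of learning a mapping Y = f(X) from labeled examples can be sketched with a minimal supervised example. Here the "algorithm" is ordinary least squares fitting a line to invented training data, and the learned function is then applied to an unseen input; the data and the `fit_line` helper are illustrative assumptions, not a standard library API.

```python
# Minimal sketch of supervised learning: approximating Y = f(X).
# We fit a line y = a*x + b to labeled examples (inputs paired with
# known outputs) by ordinary least squares, then use the learned
# mapping to predict the output for a new input.

def fit_line(xs, ys):
    """Return slope a and intercept b minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Training examples: the "supervision" is the known output for each input.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # here the true mapping happens to be y = 2x + 1

a, b = fit_line(xs, ys)
predict = lambda x: a * x + b  # the learned approximation of f(X)
print(predict(5))              # prediction for an unseen input
```

In practice a library estimator would play the role of `fit_line`, but the shape of the task is the same: learn f from (X, Y) pairs, then apply f to new X.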

Classification and regression, are distinguished by the type of target. Regression involves a numeric target while classification involves a categorical target.

Clustering, co-occurrence grouping, and profiling generally are unsupervised.

Unsupervised learning is where you only have input data (X) and no corresponding output variables.

The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
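Modeling underlying structure without a target can be sketched with the simplest unsupervised algorithm: k-means clustering. The one-dimensional, two-center version below is an illustrative toy (data and starting centers are invented), but it shows the key point that the algorithm receives only inputs, no labels, and still discovers the two natural groups.

```python
# Sketch of an unsupervised task: 1-D k-means clustering with two centers.
# No target values are given; structure is discovered from inputs alone.

def kmeans_1d(points, c1, c2, iters=10):
    """Refine two cluster centers by alternating assignment and averaging."""
    for _ in range(iters):
        # Assign each point to its nearest center.
        g1 = [p for p in points if abs(p - c1) <= abs(p - c2)]
        g2 = [p for p in points if abs(p - c1) > abs(p - c2)]
        # Move each center to the mean of its assigned points.
        c1 = sum(g1) / len(g1)
        c2 = sum(g2) / len(g2)
    return c1, c2

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]   # two obvious groups, no labels
c1, c2 = kmeans_1d(points, 0.0, 10.0)
print(c1, c2)  # centers settle near the two natural groups
```

Whether the discovered groups are *useful* for a business purpose is exactly the question the supervised/unsupervised discussion above raises: clustering guarantees similarity-based groupings, not purposeful ones.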

For business applications, we often want a numerical prediction rather than just a class. For example, rather than asking whether a customer will continue subscribing to a magazine, we may want to estimate how likely the customer is to continue.

An early stage of the data mining process is deciding whether the line of attack will be supervised or unsupervised, and, if supervised, producing a precise definition of the target variable. This variable must be a specific quantity that will be the focus of the data mining.
