An important principle of data science is that data mining is a process. It includes the application of information technology, such as the automated discovery and evaluation of patterns from data. It also includes an analyst’s creativity, business knowledge, and common sense. Understanding the whole process helps to structure data mining projects.
Since the data mining process breaks up the overall task of finding patterns from data into a set of well-defined subtasks, it is also useful for structuring discussions about data science.
From Business Problems to Data Mining Tasks
Each data-driven business decision-making problem is unique, but there are sets of common tasks that underlie business problems. A data scientist decomposes a business problem into subtasks. The solutions to the subtasks can then be composed to solve the overall problem. Some of these subtasks are unique to the particular business problem, but others are common data mining tasks.
Despite the large number of specific data mining algorithms, there are only a handful of fundamentally different types of tasks they address. To illustrate the fundamental concepts, it is natural to start with classification and regression.
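1. Classification and class probability estimation attempt to predict which of a set of classes an entity belongs to, or how likely it is to belong to each class.
2. Regression attempts to estimate or predict the numerical value of some variable for an entity.
3. Similarity matching attempts to identify similar entities based on the data known about them.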
4. Clustering attempts to group entities by their similarity.
5. Co-occurrence grouping attempts to find associations between entities based on transactions.
6. Profiling attempts to characterize the typical behavior of an entity.
7. Link prediction attempts to predict connections between data items.
8. Data reduction attempts to replace a large set of data with a smaller set of data.
9. Causal modeling attempts to help us understand which events or actions actually influence others.
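Of the tasks listed above, co-occurrence grouping is one of the easiest to sketch in a few lines. The following is a minimal illustration, not a full association-mining algorithm: it simply counts how often each pair of items appears in the same transaction. The basket contents and the name pair_counts are invented for this example.

```python
from collections import Counter
from itertools import combinations

# Toy transactions, invented for illustration; each set is one shopping basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair a single canonical (a, b) ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
most_common_pair, support = counts.most_common(1)[0]
print(most_common_pair, support)  # ('bread', 'butter') 3
```

Pairs that occur together far more often than the rest are candidate associations; a real analysis would also compare each count against how often the items appear on their own.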
Supervised Versus Unsupervised Methods
The terms supervised and unsupervised were inherited from the field of machine learning.
Metaphorically, a teacher “supervises” the learner by carefully providing target information along with a set of examples. An unsupervised learning task might involve the same set of examples but would not include the target information.
If a specific target can be provided, the problem can be phrased as a supervised one. A supervised technique is given a specific purpose for the grouping: predicting the target. Clustering, an unsupervised task, produces groupings based on similarities, but there is no guarantee that these similarities will be useful.
Classification, regression, and causal modeling generally are solved with supervised methods.
In supervised learning, you have input variables (X) and an output variable (Y), and you use an algorithm to learn the mapping function from the input to the output.
Y = f(X)
The goal is to approximate the mapping function well enough that, given new input data (X), you can predict the output variable (Y) for that data.
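As a concrete illustration of approximating Y = f(X), the sketch below fits a straight line by ordinary least squares and then predicts Y for an input it has not seen. A straight line is only one possible choice of mapping function, and the training numbers and the name fit_line are made up for the example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns the learned mapping as a function of a single input x.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

# Toy training examples (invented): inputs X with known targets Y.
X = [1.0, 2.0, 3.0, 4.0]
Y = [3.0, 5.0, 7.0, 9.0]

f_hat = fit_line(X, Y)   # the learned approximation of f
print(f_hat(5.0))        # prediction for a new input: 11.0
```

The targets Y play the role of the teacher's supervision: without them there would be nothing to fit the line to.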
Classification and regression are distinguished by the type of target: regression involves a numeric target, while classification involves a categorical target.
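The distinction can be seen by applying the same simple technique to both kinds of target. The sketch below uses a nearest-neighbor approach, chosen here only for brevity: with a categorical target it takes a majority vote, and with a numeric target it takes an average. The data, feature (months as a subscriber), and function names are hypothetical.

```python
from collections import Counter

def nearest_neighbors(train, query, k):
    """Return the k training rows whose feature value is closest to query."""
    return sorted(train, key=lambda row: abs(row[0] - query))[:k]

def knn_classify(train, query, k=3):
    """Categorical target: majority class among the k nearest neighbors."""
    labels = [label for _, label in nearest_neighbors(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Numeric target: mean target value among the k nearest neighbors."""
    values = [value for _, value in nearest_neighbors(train, query, k)]
    return sum(values) / len(values)

# Toy data (invented): (months as a subscriber, target).
churn_labels = [(1, "leave"), (2, "leave"), (3, "leave"),
                (10, "stay"), (12, "stay"), (15, "stay")]
spend_values = [(1, 10.0), (2, 12.0), (3, 14.0),
                (10, 30.0), (12, 34.0), (15, 38.0)]

print(knn_classify(churn_labels, 11))  # stay
print(knn_regress(spend_values, 11))   # 34.0
```

The inputs are identical in both cases; only the type of the target, and therefore how the neighbors' targets are combined, differs.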
Clustering, co-occurrence grouping, and profiling generally are unsupervised.
Unsupervised learning is where you only have input data (X) and no corresponding output variables.
The goal for unsupervised learning is to model the underlying structure or distribution in the data in order to learn more about the data.
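To make the contrast with supervised learning concrete, here is a minimal clustering sketch: a one-dimensional version of k-means (Lloyd's algorithm). It receives only input values and no target, and groups them by similarity. The data, the starting centers, and the function names are assumptions made for the example.

```python
def assign(values, centers):
    """Put each value in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        clusters[nearest].append(v)
    return clusters

def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on one-dimensional data with given starting centers."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = assign(values, centers)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, assign(values, centers)

# Toy inputs (invented): monthly site visits per customer, with no target attached.
monthly_visits = [1, 2, 3, 20, 21, 22]
centers, clusters = kmeans_1d(monthly_visits, centers=[0, 10])
print(centers)   # [2.0, 21.0]
print(clusters)  # [[1, 2, 3], [20, 21, 22]]
```

Nothing in the procedure says whether "light" versus "heavy" visitors is a useful grouping for the business; that judgment has to come from the analyst.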
For business applications, we often need a numerical prediction. For example, we may want to estimate the probability that a customer will continue subscribing to a magazine.
An early step in the data mining process is to decide whether the line of attack will be supervised or unsupervised and, if supervised, to produce a precise definition of a target variable. This variable must be a specific quantity that will be the focus of the data mining.
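One way to check that a target variable is precisely defined is to write it down as a function of the raw records. The sketch below does this for a hypothetical definition, "the customer renewed no later than 90 days after the subscription expired". The 90-day window, the field names, and the records are all assumptions chosen for illustration.

```python
from datetime import date, timedelta

def renewed_within(expiry, renewal, window_days=90):
    """Target variable: 1 if the customer renewed by expiry + window_days, else 0.

    renewal is None when the customer never renewed.
    """
    if renewal is None:
        return 0
    return int(renewal <= expiry + timedelta(days=window_days))

# Toy customer records (invented).
customers = [
    {"id": "A", "expiry": date(2023, 1, 31), "renewal": date(2023, 2, 10)},
    {"id": "B", "expiry": date(2023, 1, 31), "renewal": None},
    {"id": "C", "expiry": date(2023, 1, 31), "renewal": date(2023, 9, 1)},
]

targets = {c["id"]: renewed_within(c["expiry"], c["renewal"]) for c in customers}
print(targets)  # {'A': 1, 'B': 0, 'C': 0}
```

Customer C shows why the precision matters: whether a renewal seven months later counts as "continuing to subscribe" depends entirely on the window chosen in the definition.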