The discovery process used by data scientists commonly consists of four steps (see also Figure 1):
Data acquisition: In this first step, data is collected from various data sources. Data scientists select the data sources that may be useful and relevant for their study.
Data preparation: In this step, data is transformed, aggregated, integrated and cleansed until it has the form that data scientists need for their study. For example, for many data mining algorithms, it can be useful to transform real-life values to binary values.
Data analysis: In this step, data is analyzed using various types of techniques, including simple reporting techniques; classic statistical techniques, such as forecasting, predictive modeling and clustering; advanced data mining techniques; data visualization techniques, such as affinity visualization, path visualization, scatter clouds and geo-visualization; and time-series analysis.
Data interpretation: When the techniques and tools present results and insights, it’s still the responsibility of the data scientist to determine whether the results make sense. This requires in-depth knowledge of the business and the data, and it demands common sense.
Figure 1: The data scientist’s discovery process consists of four steps.
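As an illustration of the data preparation step, the transformation to binary values mentioned above can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed method: the function names binarize and one_hot, the temperature readings and the threshold of 25.0 are all hypothetical and chosen only for the example.

```python
def binarize(values, threshold):
    """Map each numeric value to 1 if it is above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def one_hot(values):
    """Encode categorical values as binary indicator columns (one per category)."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]


# Hypothetical input data, used only to illustrate the transformation.
temperatures = [12.5, 25.0, 31.2, 18.0]
hot_day = binarize(temperatures, threshold=25.0)

regions = ["north", "south", "north"]
region_flags = one_hot(regions)
```

Thresholding suits numeric values, while indicator columns suit categories. Which one is appropriate depends on the data mining algorithm used in the analysis step.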
Characteristics of the Data Scientist’s Discovery Process
The discovery process deployed by data scientists has the following characteristics:
The discovery result consists of rules. In most situations, the result of a discovery process is a set of insights, and these insights are formulated as rules. These rules can be simple if-then rules. For example, if two payments are made with the same credit card within 10 seconds, they are probably fraudulent. Rules can also be advanced statistical formulas indicating the relationship between specific variables. For example, a 10-degree rise in temperature increases sales of barbecue meat by 300%. Sometimes rules are sophisticated, self-learning data mining models that can predict customer behavior by combining historical and new incoming data.
The discovery process is an iterative process. Figure 1 suggests that the discovery process is a serial process: when one step is finished, the next one starts, and we never return to a previous step. In practice, this is rarely the case. The discovery process is highly iterative. For example, when a data analysis step has been finished, the conclusion may be to collect more data and start all over again. Even a data preparation step may lead to a return to the data acquisition step. In fact, the entire four-step process may have to be repeated several times before the right insights rise to the surface.
Discovery results should be actionable. When a discovery process is finished, the organization has not yet gained anything: no money has been made and there is no ROI. The discovery process has to be followed up by a step called Act. In this step, the gained insights have to be used or implemented. Examples of implementing insights include changing organizational policies, embedding decision rules in operational applications, optimizing business processes and offering customers special discounts. Without the Act step, the entire discovery exercise has been for nothing. In other words, it’s important that discovery results are actionable. Note that the data scientist is not always involved in the Act step.
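To make the idea of a simple if-then rule concrete, the credit card example given under the first characteristic can be written down as a small Python function. This is only a sketch under stated assumptions: the Payment record, its field names and the function find_suspicious_pairs are hypothetical, timestamps are assumed to be in seconds, and "within 10 seconds" is read as a gap of at most 10 seconds between consecutive payments on the same card.

```python
from collections import namedtuple

# Hypothetical payment record; timestamp is assumed to be in seconds.
Payment = namedtuple("Payment", ["card_id", "timestamp"])


def find_suspicious_pairs(payments, window_seconds=10):
    """Return pairs of consecutive payments made with the same card
    whose timestamps are at most window_seconds apart."""
    last_payment = {}  # most recent payment seen per card
    pairs = []
    for payment in sorted(payments, key=lambda p: p.timestamp):
        previous = last_payment.get(payment.card_id)
        if previous is not None and payment.timestamp - previous.timestamp <= window_seconds:
            pairs.append((previous, payment))
        last_payment[payment.card_id] = payment
    return pairs


# Hypothetical payments, used only to illustrate the rule.
payments = [
    Payment("A", 0),
    Payment("B", 3),
    Payment("A", 8),
    Payment("B", 20),
    Payment("A", 30),
]
suspicious = find_suspicious_pairs(payments)
```

A rule in this form is the kind of decision rule that could be embedded in an operational application during the Act step. The 10-second window is a parameter, so it can be adjusted when a later iteration of the discovery process refines the rule.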