Data science (covering data mining and related practices) is a multidisciplinary field that requires knowledge of a number of different skills, practices, and technologies, including but not limited to machine learning, pattern recognition, mathematics, programming, algorithms, statistics, and databases. In the context of big data, more skills and knowledge are required, such as knowledge of distributed computing techniques, algorithms, and architectures. By nature, data science is a creative process that combines science, engineering, and art. Hence its success has depended heavily on the quality and experience of the team carrying it out. As a result, for some time data mining projects were not repeatable with the same level of success across different enterprises. As the practice has matured, however, that has changed.
Since the late 1990s there have been a variety of efforts to create standard methodologies and process models for data mining, such as CRISP-DM (Wirth and Hipp 2000). This methodology places an important focus on the business, data, and deployment aspects as well as on modeling, which used to be the main focus. Today, data science practices are more mature and well tested. Even though different methodologies may use different names for each step of the process, in general I can logically divide any data science exercise into four phases (see figure below):
Business Problem Understanding/Use,
Data Understanding/Use and Preparation,
Analytics and Assessment,
Implementation (Deployment and Monitoring).
In the context of big data, these logical phases stay the same; however, some low-level details of data preparation, analysis, and implementation may be impacted.
We all love acronyms, and I have been using DS-BuDAI to refer to this process when communicating with business sponsors and users. The lowercase “u” stands for “Understanding/Use” and is there to emphasize its importance during the business- and data-focused phases; it bridges the two. Analytics and Implementation are simply realizations of the data science deliverable.
The “Understanding” part needs no explanation, especially in the context of the business problem and the data that the effort will specifically address and leverage. “Use,” however, needs a bit of explanation given some recent experiences.
A DS project must start with a full understanding of the business challenge and of how it could be solved by leveraging data sources that are available or can be obtained. However, there can be cases where, after everything is done and the value is proven, the business users are still not willing to act on the new insights. This lack of responsiveness has a lot to do with the culture of the organization, how decisions have historically been made, and how marginal the improvement from the new actions is perceived to be. These obstacles, however, can be overcome through education and training and with the full support of senior management for change.
In some cases, though, business users perceive actionable insights as “this is what we already knew” and “it is good that the data analysis confirms that.” In effect, they are saying that there are no new findings, only a confirmation of what is already known. Sometimes there is truth to this perception, but at other times it is simply resistance to change or reluctance to accept changes in practice.
In the context of data, “Use” is also essential. Collection, storage, preparation, and management of big data are still expensive, no matter how much costs have dropped in recent years with the advent of open source systems and cheaper storage/processing systems. Data can easily be abused or misused. Sometimes too much data is used, and sometimes data is not used at the right level of detail or aggregation.
The lowercase “u” in BuDAI is there to emphasize understanding and use during the business- and data-focused phases.