What does the typical data science project life-cycle look like?

This post looks at practical aspects of implementing data science projects. It also assumes a certain level of maturity in big data (more on big data maturity models in the next post) and data science management within the organization. Therefore the life-cycle presented here differs, sometimes significantly, from purist definitions of 'science' which emphasize the hypothesis-testing approach. In practice, the typical data science project life-cycle resembles more of an engineering workflow, shaped by constraints on resources (budget, data and skills availability) and by time-to-market considerations.

The CRISP-DM model (CRoss Industry Standard Process for Data Mining) has traditionally defined six steps in the data mining life-cycle. Data science is similar to data mining in several aspects, hence there's some similarity with these steps.

Fig. CRISP-DM lifecycle

The CRISP model steps are:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

Given a certain level of maturity in big data and data science expertise within the organization, it is reasonable to assume availability of a library of assets related to data science implementations. Key among these are:
1. Library of business use-cases for big data/ data science applications
2. A matrix mapping data requirements to business use cases
3. Minimum data quality requirements (test cases that verify the minimum level of data quality needed for feasibility)

In most organizations, data science is a fledgling discipline, hence data scientists (except those from an actuarial background) are likely to have limited business domain expertise. They therefore need to be paired with business people and those with expertise in understanding the data. This pairing helps data scientists work through steps 1 and 2 of the CRISP-DM model - i.e. business understanding and data understanding.

The typical data science project then becomes an engineering exercise: a defined framework of steps or phases with exit criteria, which allows informed decisions on whether to continue a project based on pre-defined criteria, optimizing resource utilization and maximizing benefits from the data science project. This also prevents projects from degenerating into money pits through the pursuit of nonviable hypotheses and ideas.

The data science life-cycle thus looks somewhat like:
1. Data acquisition
2. Data preparation
3. Hypothesis and modeling
4. Evaluation and Interpretation
5. Deployment
6. Operations
7. Optimization

Fig. Data Science Project Life-cycle

Data acquisition - may involve acquiring data from both internal and external sources, including social media or web scraping. In a steady state, data extraction and transfer routines would be in place, and new sources, once identified, would be acquired following the established processes.
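As an illustration of what a steady-state extraction routine might look like, the sketch below parses a CSV extract into records; the field names and data are hypothetical, and in practice the input would come from internal exports, external feeds or web scraping:

```python
import csv
import io

def acquire_csv(source: str) -> list[dict]:
    """Parse a CSV extract (e.g. an internal system export) into records."""
    return list(csv.DictReader(io.StringIO(source)))

# Hypothetical extract: one complete row, one with a missing value.
raw = "customer_id,age,spend\n101,34,250.0\n102,,99.5\n"
records = acquire_csv(raw)
```

The same routine would apply unchanged whether the extract arrives from an established internal source or a newly on-boarded external one.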

Data preparation - Usually referred to as "data wrangling", this step involves cleaning the data and reshaping it into a readily usable form for performing data science. This is similar to the traditional ETL steps in data warehousing in certain aspects, but involves more exploratory analysis and is primarily aimed at extracting features in usable formats.
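A minimal sketch of such wrangling, under the assumption that records arrive as raw string fields from an upstream extract: incomplete rows are dropped (enforcing a minimum data quality requirement), values are cast to usable types, and a derived feature is extracted. The field names are illustrative:

```python
def wrangle(records: list[dict]) -> list[dict]:
    """Clean raw records and reshape them into typed feature rows."""
    clean = []
    for r in records:
        # Drop rows failing the minimum data quality requirement.
        if not r.get("age") or not r.get("spend"):
            continue
        age = int(r["age"])
        spend = float(r["spend"])
        clean.append({
            "age": age,
            "spend": spend,
            # Derived feature in a readily usable numeric format.
            "spend_per_year": spend / age,
        })
    return clean

rows = [{"age": "34", "spend": "250.0"}, {"age": "", "spend": "99.5"}]
features = wrangle(rows)
```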

Hypothesis and modeling are the traditional data mining steps - however, in a data science project these are not limited to statistical samples; indeed, the idea is to apply machine learning techniques to all the data. A key sub-step here is model selection: separating out a training set for training the candidate machine-learning models, and validation and test sets for comparing model performance, selecting the best-performing model, gauging model accuracy and preventing over-fitting.
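The train/validation/test separation described above can be sketched as follows; the 60/20/20 ratio is an illustrative choice, not one prescribed here:

```python
import random

def split(data: list, train: float = 0.6, val: float = 0.2, seed: int = 42):
    """Shuffle data and partition it into train, validation and test sets."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],                    # train the candidate models
            shuffled[n_train:n_train + n_val],     # compare model performance
            shuffled[n_train + n_val:])            # gauge accuracy of the winner

train_set, val_set, test_set = split(list(range(100)))
```

Holding the test set out until the very end is what guards against over-fitting: a model tuned against the validation set is scored once, on data it has never seen.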

Steps 2 through 4 are repeated a number of times as needed; as the understanding of data and business becomes clearer and results from initial models and hypotheses are evaluated, further tweaks are performed. These may sometimes include Step 5 (deployment) and be performed in a pre-production or "limited" / "pilot" environment before the actual full-scale "production" deployment, or could include fast-tweaks after deployment, based on the continuous deployment model.

Once the model has been deployed in production, it is time for regular maintenance and operations. This operations phase could also follow a target DevOps model which gels well with the continuous deployment model, given the rapid time-to-market requirements in big data projects. Ideally, the deployment includes performance tests to measure model performance, and can trigger alerts when the model performance degrades beyond a certain acceptable threshold.
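A minimal sketch of such a degradation alert, assuming per-run accuracy scores are logged somewhere; the function name, window size and 0.80 threshold are illustrative assumptions:

```python
def check_model_health(accuracy_history: list[float],
                       threshold: float = 0.80,
                       window: int = 3) -> dict:
    """Alert when mean accuracy over the last `window` runs falls below threshold."""
    recent = accuracy_history[-window:]
    mean_acc = sum(recent) / len(recent)
    return {"mean_accuracy": mean_acc, "alert": mean_acc < threshold}

# A drifting model: accuracy degrades across successive scoring runs.
status = check_model_health([0.91, 0.88, 0.84, 0.78, 0.75])
```

Averaging over a window rather than alerting on a single run avoids firing on one-off noise; an alert here would then trigger the optimization phase described next.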

The optimization phase is the final step in the data science project life-cycle. It could be triggered by degrading performance, by the need to add new data sources and retrain the model, or by the opportunity to deploy improved versions of the model based on better algorithms.

Agile development processes, especially continuous delivery, lend themselves well to the data science project life-cycle. As mentioned before, with increasing maturity and well-defined project goals, pre-defined performance criteria can help evaluate the feasibility of the data science project early in the life-cycle. This early comparison helps the data science team change approaches, refine hypotheses and even discard the project if the business case is nonviable or the benefits from the predictive models are not worth the effort to build them.


Comment by Sione Palu on January 15, 2015 at 9:50am

Data science and data mining are indistinguishable. You can't decouple machine learning from data mining, because machine learning is in fact data mining. Even the authors of WEKA, the popular open-source machine learning Java API, titled the book they published on it as a data mining book rather than a machine learning one:

"Data Mining: Practical Machine Learning Tools and Techniques"


Comment by jaap Karman on January 1, 2015 at 11:40pm

If you are seeing data science only as an extension to CRISP-DM / SEMMA, then you can automate it to a high level. You could see it as a rebranding of the old term KDD, Knowledge Discovery in Databases: http://www.executionmih.com/data-mining/kdd-methodology-framework-p...

That is a very limited area to focus your view on, mainly the marketing stuff.
It is different from actuarial work, many other fields, and real science.

Comment by Maloy Manna on January 1, 2015 at 11:27am

There are 2 figures - business understanding is included in figure 1 - CRISP model. Beyond theory, this is a practical life cycle used in projects where defining 'business understanding' as a gate is ambiguous. While the outcome is still business understanding, the steps iterate over 2-data prep -> 3-model -> 4-evaluate/interpret -> 5-deploy. A satisfactory evaluation (in the form of a predictive value/score) is usually used as a gate out of the cycle.

Regarding optimization, it is mostly useful to go back and repeat the life-cycle rather than spawn a separate iterative process of its own. This simplifies managing project processes in the organization and reduces complexity.

BTW, all iterative processes are based on the PDCA, as is this one.

Comment by jaap Karman on December 30, 2014 at 9:09am

At steps 2-4, business understanding is mentioned in the words but is missing from the figure.
It could be a block that is jumped to outside the cycle.

At steps 1-7, the optimization should IMO possibly lead into an iterative process, such as the OPDCA or DMAIC cycle.
Missing there is a block indicating business goals.
Funny having a big cycle (around 1-7) and a small one (2-4). Reminds me of Kondratiev.

Comment by Michael Clayton on December 27, 2014 at 6:43pm

2nd to Stephan: "Understanding of the business domain is a (THE in my opinion) key success factor in all of them."

There is a now-recognized critical role for a domain-knowledge TRANSLATOR between the big data scrapers and analytic graph-masters, so that actionable information can be gleaned and communicated to the leadership of a domain. Teams should include, or at least know well, the local deep-knowledge person who can also communicate to the leadership what the project team has discovered. But nowadays, those "translators" must be more computer-system and statistical-methods literate, to know the traps and limitations of various machine learning methods, of huge vs sparse sampling, and of spurious correlations vs insightful associations relative to the enterprise products and needs.

Comment by Stephan Meyn on December 22, 2014 at 3:59pm

Indeed many SW methodologies should fit well in this domain. Understanding of the business domain is a key success factor to all of them.

Additionally, in many situations there is a beneficial interaction between the activity of business domain discovery and the selection of an architectural approach, which leads to a mature solution approach that will stand the test of time. I am a great fan of an architecturally driven approach, as it tries to position the business need in the context of the technology landscape and derives from that a risk-based understanding of the solution space. Doing that should improve the efficiency of the subsequent modelling approach.
