In this article, I summarize the components of any data science / machine learning / statistical project, as well as the cross-dependencies between these components. This will give you a general idea of what a data science or other analytic project is about.


1. Problem

This is the top, fundamental component. I have listed 24 potential problems in my article 24 uses of statistical modeling. It can be anything from building a market segmentation or a recommendation system, to discovering association rules for fraud detection, to running simulations that predict extreme events such as floods.

2. Data

Data comes in many shapes: transactional (credit card transactions), real-time, sensor data (IoT), unstructured data (tweets), big data, images or videos, and so on. Typically, raw data needs to be identified or even built, put into databases (NoSQL or traditional), then cleaned and aggregated using EDA (exploratory data analysis). The process can include selecting and defining metrics.
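To make the cleaning and aggregation step concrete, here is a minimal sketch using Pandas on a hypothetical transaction table (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical raw transaction data with common quality issues:
# a missing customer ID, a missing amount, and a duplicate row.
raw = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b", None],
    "amount": [10.0, 10.0, 25.0, None, 5.0, 3.0],
})

# Cleaning: drop rows with no customer, remove exact duplicates.
clean = raw.dropna(subset=["customer"]).drop_duplicates()

# Aggregation: define a simple metric, total spend per customer
# (NaN amounts are skipped by the sum).
spend = clean.groupby("customer")["amount"].sum()
print(spend.to_dict())  # {'a': 10.0, 'b': 30.0}
```

Real projects involve far messier data, but the shape of the step is the same: identify, clean, aggregate, and define the metric you will model.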

3. Algorithms

Also called techniques. Examples include decision trees, indexation algorithms, Bayesian networks, and support vector machines. A rather big list can be found here.
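As a concrete illustration of one such technique, here is a minimal decision-tree example using scikit-learn (a library choice and toy dataset of my own, not from the article):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy training data: the label happens to depend only on the
# first feature, a pattern the tree can discover on its own.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

print(clf.predict([[1, 1], [0, 0]]))  # [1 0]
```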

4. Models

By models, I mean testing algorithms, then selecting, fine-tuning, and combining the best of them using techniques such as model fitting, model blending, data reduction, and feature selection, while assessing each model's lift over the baseline. This step also includes calibrating or normalizing data, imputing missing data, processing outliers, cross-validation, avoiding over-fitting, robustness testing, boosting, and maintenance. Criteria that make a model desirable include robustness or stability, scalability, simplicity, speed, portability, adaptability (to changes in the data), and accuracy (sometimes measured using R-squared, though I recommend this alternative instead).
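A minimal sketch of the test-select-assess loop, using scikit-learn's cross-validation on a standard dataset (the specific algorithms and dataset here are illustrative choices, not the article's):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Compare two candidate algorithms with 5-fold cross-validation
# and keep the one with the better mean accuracy.
candidates = {
    "tree": DecisionTreeClassifier(random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Cross-validation guards against over-fitting by scoring each model only on folds it was not trained on, which is exactly the robustness concern raised above.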

5. Programming

There is almost always some code involved, even if you use a black-box solution. Typically, data scientists use Python, R or Java, and SQL. However, I've completed some projects that did not involve real coding, but instead, machine-to-machine communications via APIs. Automation of code production (and of data science in general) is a hot topic, as evidenced by the publication of articles such as The Automated Statistician, and my own work to design simple, robust black-box solutions.

6. Environments

Some call these packages. It can be anything from a bare Unix box accessed remotely, combined with scripting languages and data science libraries such as Pandas (Python), to something more structured such as Hadoop. Or it can be an integrated database system from Teradata, Pivotal, or other vendors; a package such as SPSS, SAS, RapidMiner, or MATLAB; or, typically, a combination of these.

7. Presentation

By presentation, I mean presenting the results. Not all data science projects run continuously in the background, for instance to automatically buy stocks or predict the weather. Some are just ad-hoc analyses that need to be presented to decision makers, using Excel, Tableau, and other tools. In some cases, the data scientist must work with business analysts to create dashboards, or to design alarm systems with analysis results e-mailed to selected people based on priority rules.
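As a toy illustration of such priority rules, here is a minimal stdlib-only sketch (the thresholds and recipient groups are invented for the example):

```python
# Hypothetical alert routing: who gets e-mailed depends on how far
# a monitored metric deviates (relatively) from its expected value.
def alert_recipients(deviation, warn=0.05, critical=0.10):
    """Return the distribution list for a given relative deviation."""
    if deviation >= critical:
        return ["analysts", "managers", "executives"]
    if deviation >= warn:
        return ["analysts", "managers"]
    return []  # below the warning threshold: no alert sent

print(alert_recipients(0.12))  # ['analysts', 'managers', 'executives']
print(alert_recipients(0.07))  # ['analysts', 'managers']
print(alert_recipients(0.01))  # []
```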


These components interact as follows. I invite you to create a nice graph from the dependencies table below. The first relationship reads as "the problem impacts or dictates the data".

Problem -> Data

Problem -> Algorithms

Algorithms -> Models

Algorithms -> Programming

Algorithms -> Environment

Data -> Environment

Environment -> Data

Data -> Algorithms

Data -> Problem

Problem -> Presentation

Models -> Presentation
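The table above can be encoded directly as an adjacency list. A small stdlib-only sketch, which also identifies the "sink" components (those that influence nothing downstream):

```python
# The dependency table as a list of edges, read "X impacts Y".
# Note the feedback loops: Problem <-> Data and Data <-> Environment.
edges = [
    ("Problem", "Data"), ("Problem", "Algorithms"),
    ("Algorithms", "Models"), ("Algorithms", "Programming"),
    ("Algorithms", "Environment"), ("Data", "Environment"),
    ("Environment", "Data"), ("Data", "Algorithms"),
    ("Data", "Problem"), ("Problem", "Presentation"),
    ("Models", "Presentation"),
]

# Build the adjacency list.
graph = {}
for src, dst in edges:
    graph.setdefault(src, []).append(dst)

# Sinks are nodes with incoming edges only: the pipeline's end points.
nodes = {n for edge in edges for n in edge}
sinks = sorted(nodes - graph.keys())
print(sinks)  # ['Presentation', 'Programming']
```

Because of the feedback loops, this is not a strict pipeline: a plain topological sort would fail, which matches the iterative nature of real projects (the data can reshape the problem, and the environment can constrain the data).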

Also read the lifecycle of data science projects (see also this article).


Comment by Nancy Grady on June 7, 2017 at 9:39am

This follows the CRISP-DM process, in use since 2000, quite well, but doesn't cover aspects of big data or systems development. If the analytics are to be part of a system to be acted upon, then implementation considerations need to be much further up the list. Right after the problem statement needs to be an explicit determination of how the information will be used. If, for example, the analytics will be used as information to provide bank loans, then there is a restriction on which techniques can be used (some lack transparency in how the result was generated). With DevOps or continuous delivery, large systems can't strictly follow this model, which, like CRISP-DM, is focused on the individual data miner running on their own desktop and communicating the results manually.

Comment by Tim Cooper on August 13, 2016 at 12:05am

Agree as far as this goes, though one key step is missing: specifying the analytical problem.

The business problem definitely must be clear, as step 1 states. How this translates into the analytical specification can massively affect the type of outcome.

For example: how key concepts are defined (e.g. is defection of bank deposits measured as accounts closed, or as some drop in funds in the deposit accounts? Is this at a customer level or at a single-account level?), the framework applied (e.g. which groups of customers are included or excluded, and the time frame for the analysis, such as the observation period), and what level of result is conclusive (e.g. a 0.1% increase in some measure of marketing campaign results may be good in some circumstances, but deemed a failure in others).

After the above, going back to the business problem and ensuring that the analytical problem specification will effectively address it may seem like a statement of the obvious, but I'm sure we've all gone down some interesting analytical rabbit holes at some point. 
