
Architecture of Data Science Projects

In this article, I summarize the components of any data science / machine learning / statistical project, as well as the cross-dependencies between these components. This will give you a general idea of what a data science or other analytic project is about.


Components

1. Problem

This is the top, fundamental component. I have listed 24 potential problems in my article 24 uses of statistical modeling. It can be anything from building a market segmentation or a recommendation system, to association rule discovery for fraud detection, to simulations that predict extreme events such as floods.

2. Data

It comes in many shapes: transactional (credit card transactions), real-time, sensor data (IoT), unstructured data (tweets), big data, images or videos, and so on. Typically raw data needs to be identified or even built and put into databases (NoSQL or traditional), then cleaned and aggregated using EDA (exploratory data analysis). The process can include selecting and defining metrics.
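As a minimal sketch of the cleaning and aggregation step, here is a small Python example on a handful of made-up credit card transactions. The field names (`card_id`, `amount`), the sample records, and the chosen metrics (count, total, and mean per card) are all assumptions for illustration; a real project would define its own.

```python
from collections import defaultdict

# Hypothetical raw records; field names and values are illustrative only.
RAW_TRANSACTIONS = [
    {"card_id": "A", "amount": "12.50"},
    {"card_id": "A", "amount": "7.50"},
    {"card_id": "B", "amount": ""},        # missing amount
    {"card_id": "B", "amount": "100.00"},
    {"card_id": None, "amount": "3.00"},   # missing card id
]


def clean(records):
    """Keep only records with a card id and an amount that parses as a number."""
    cleaned = []
    for record in records:
        card_id = record.get("card_id")
        try:
            amount = float(record.get("amount"))
        except (TypeError, ValueError):
            continue  # amount is missing or not numeric
        if not card_id:
            continue  # cannot attribute the transaction to a card
        cleaned.append({"card_id": card_id, "amount": amount})
    return cleaned


def aggregate(records):
    """Compute per-card metrics: transaction count, total, and mean amount."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for record in records:
        counts[record["card_id"]] += 1
        totals[record["card_id"]] += record["amount"]
    return {
        card: {
            "count": counts[card],
            "total": totals[card],
            "mean": totals[card] / counts[card],
        }
        for card in counts
    }


if __name__ == "__main__":
    for card, metrics in aggregate(clean(RAW_TRANSACTIONS)).items():
        print(card, metrics)
```

Dropping incomplete records is only one possible choice here; imputing the missing values is another, depending on the problem.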

3. Algorithms

Also called techniques. Examples include decision trees, indexation algorithms, Bayesian networks, and support vector machines. A rather big list can be found here.

4. Models

By models, I mean testing algorithms, then selecting, fine-tuning, and combining the best ones using techniques such as model fitting, model blending, data reduction, and feature selection, and assessing the yield of each model over the baseline. It also includes calibrating or normalizing data, imputation techniques for missing data, outlier processing, cross-validation, over-fitting avoidance, robustness testing and boosting, and maintenance. Criteria that make a model desirable include robustness or stability, scalability, simplicity, speed, portability, adaptability (to changes in the data), and accuracy (sometimes measured using R-squared, though I recommend this alternative instead).
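To make cross-validation and "yield over the baseline" concrete, here is a minimal sketch in plain Python. It compares a one-variable least-squares line against a baseline that always predicts the training mean, using k-fold cross-validation. The function names, the contiguous fold layout, and the definition of yield as the relative reduction in mean squared error are my assumptions for this illustration, not a standard from any library.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Split range(n) into k contiguous test folds of nearly equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_mean(xs, ys):
    """Baseline: ignore x and always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Ordinary least squares with a single feature (xs must not be constant)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def cv_mse(fit, xs, ys, k=5):
    """Mean squared error over all held-out points across the k folds."""
    errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = fit(train_x, train_y)
        errors.extend((ys[i] - predict(xs[i])) ** 2 for i in test_idx)
    return mean(errors)


def yield_over_baseline(model_error, baseline_error):
    """Assumed definition: fraction of the baseline error removed by the model."""
    return 1 - model_error / baseline_error


if __name__ == "__main__":
    # Synthetic data for illustration: a linear trend plus a small alternating term.
    xs = [float(i) for i in range(20)]
    ys = [2.0 * x + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]
    baseline = cv_mse(fit_mean, xs, ys)
    model = cv_mse(fit_line, xs, ys)
    print("baseline MSE:", baseline)
    print("model MSE:", model)
    print("yield over baseline:", yield_over_baseline(model, baseline))
```

Scoring every candidate on held-out folds, rather than on the data it was fitted to, is what guards against over-fitting when the candidates are later ranked or blended.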

5. Programming

There is almost always some code involved, even if you use a black-box solution. Typically, data scientists use Python, R or Java, and SQL. However, I’ve completed some projects that did not involve real coding, but instead, machine-to-machine communications via APIs. Automation of code production (and of data science in general) is a hot topic, as evidenced by the publication of articles such as The Automated Statistician, and my own work to design simple, robust black-box solutions.

6. Environments

Some call it packages. It can be anything from a bare Unix box accessed remotely, combined with scripting languages and data science libraries such as Pandas (Python), to something more structured such as Hadoop. It can also be an integrated database system from Teradata, Pivotal or other vendors, or a package like SPSS, SAS, RapidMiner or MATLAB, or, typically, a combination of these.

7. Presentation

By presentation, I mean presenting the results. Not all data science projects run continuously in the background, for instance to automatically buy stocks or predict the weather. Some are just ad-hoc analyses that need to be presented to decision makers, using Excel, Tableau and other tools. In some cases, the data scientist must work with business analysts to create dashboards, or to design alarm systems, with results from analysis e-mailed to selected people based on priority rules.

Cross-Dependencies

These components interact as follows. I invite you to create a nice graph from the dependencies table below. The first relationship reads as “the problem impacts or dictates the data”.

Problem -> Data

Problem -> Algorithms

Algorithms -> Models

Algorithms -> Programming

Algorithms -> Environment

Data -> Environment

Environment -> Data

Data -> Algorithms

Data -> Problem

Problem -> Presentation

Models -> Presentation
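The dependency table above can be turned into a graph with a few lines of Python. The sketch below stores each row as a directed edge, builds an adjacency list, finds the pairs of components that influence each other in both directions, and emits Graphviz DOT text that a tool such as `dot` can render. The function names are my own, chosen for illustration; the edges are copied from the table as written.

```python
# Each pair (source, target) reads as "source impacts or dictates target".
EDGES = [
    ("Problem", "Data"),
    ("Problem", "Algorithms"),
    ("Algorithms", "Models"),
    ("Algorithms", "Programming"),
    ("Algorithms", "Environment"),
    ("Data", "Environment"),
    ("Environment", "Data"),
    ("Data", "Algorithms"),
    ("Data", "Problem"),
    ("Problem", "Presentation"),
    ("Models", "Presentation"),
]


def build_graph(edges):
    """Return an adjacency list: component -> list of components it impacts."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, []).append(target)
        graph.setdefault(target, [])  # keep components with no outgoing edge
    return graph


def mutual_dependencies(edges):
    """Return sorted pairs of components that impact each other both ways."""
    edge_set = set(edges)
    pairs = {tuple(sorted((a, b))) for a, b in edge_set if (b, a) in edge_set}
    return sorted(pairs)


def to_dot(edges):
    """Render the edges as Graphviz DOT source for a directed graph."""
    lines = ["digraph components {"]
    lines += [f'    "{a}" -> "{b}";' for a, b in edges]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    for component, impacted in build_graph(EDGES).items():
        print(component, "->", ", ".join(impacted) or "(nothing)")
    print("Two-way dependencies:", mutual_dependencies(EDGES))
    print(to_dot(EDGES))
```

The two-way pairs are the feedback loops in the table: the data shapes the environment and the environment constrains the data, and the data available can in turn reshape the problem.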

Also read the lifecycle of data science projects (see also this article).
