In this article, I summarize the components of any data science / machine learning / statistical project, as well as the cross-dependencies between these components. This will give you a general idea of what a data science or other analytic project is about.
The problem is the top, fundamental component. I have listed 24 potential problems in my article 24 uses of statistical modeling. It can be anything from building a market segmentation or a recommendation system, to association rule discovery for fraud detection, to simulations that predict extreme events such as floods.
Data comes in many shapes: transactional (credit card transactions), real-time, sensor data (IoT), unstructured data (tweets), big data, images or videos, and so on. Typically, raw data needs to be identified or even built and put into databases (NoSQL or traditional), then cleaned and aggregated using EDA (exploratory data analysis). The process can include selecting and defining metrics.
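To make the "cleaned and aggregated" step concrete, here is a minimal sketch in plain Python. The records, the field layout, and the rule that negative amounts are invalid are all assumptions made up for illustration; a real project would more likely do this in SQL or with a library such as Pandas.

```python
from collections import defaultdict

# Hypothetical raw records: (customer_id, amount as text). Some are malformed.
raw_transactions = [
    ("c1", "19.99"),
    ("c2", "5.00"),
    ("c1", ""),        # missing amount
    ("c2", "abc"),     # unparseable amount
    ("c1", "30.01"),
    ("c3", "-4.50"),   # negative amount, treated as invalid in this sketch
]

def clean(records):
    """Keep only records whose amount parses as a non-negative number."""
    cleaned = []
    for customer_id, amount_text in records:
        try:
            amount = float(amount_text)
        except ValueError:
            continue  # drop missing or unparseable amounts
        if amount >= 0:
            cleaned.append((customer_id, amount))
    return cleaned

def aggregate(records):
    """Compute two simple metrics per customer: transaction count and total spend."""
    summary = defaultdict(lambda: {"count": 0, "total": 0.0})
    for customer_id, amount in records:
        summary[customer_id]["count"] += 1
        summary[customer_id]["total"] += amount
    return dict(summary)

summary = aggregate(clean(raw_transactions))
for customer_id, metrics in sorted(summary.items()):
    print(customer_id, metrics["count"], round(metrics["total"], 2))
```

Choosing "count" and "total spend" as the per-customer metrics is itself an example of the metric selection mentioned above; a different problem would call for different metrics.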
Algorithms are also called techniques. Examples include decision trees, indexation algorithms, Bayesian networks, and support vector machines. A rather big list can be found here.
By models, I mean testing algorithms, then selecting, fine-tuning, and combining the best ones using techniques such as model fitting, model blending, data reduction, and feature selection, and assessing the yield of each model over the baseline. It also includes calibrating or normalizing data, imputation techniques for missing data, outlier processing, cross-validation, over-fitting avoidance, robustness testing and boosting, and maintenance. Criteria that make a model desirable include robustness or stability, scalability, simplicity, speed, portability, adaptability (to changes in the data), and accuracy (sometimes measured using R-squared, though I recommend this alternative instead).
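As an illustration of cross-validation and of measuring a model against a baseline, here is a small sketch using only the Python standard library. The data is synthetic, the model is a simple least-squares line, the baseline predicts the training mean, and "yield" is taken here to mean the relative reduction in held-out error; all of these are assumptions chosen for the example, not a prescribed method.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def mse(predictions, targets):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def cross_validate(xs, ys, k=5):
    """Return (model_mse, baseline_mse), each averaged over k held-out folds."""
    indices = list(range(len(xs)))
    folds = [indices[i::k] for i in range(k)]  # interleaved folds
    model_errors, baseline_errors = [], []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        test_x = [xs[i] for i in fold]
        test_y = [ys[i] for i in fold]
        intercept, slope = fit_line(train_x, train_y)
        train_mean = sum(train_y) / len(train_y)
        model_errors.append(mse([intercept + slope * x for x in test_x], test_y))
        baseline_errors.append(mse([train_mean] * len(test_y), test_y))
    return sum(model_errors) / k, sum(baseline_errors) / k

# Synthetic data: a noisy straight line (made up for this sketch).
rng = random.Random(0)
xs = [float(i) for i in range(50)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

model_mse, baseline_mse = cross_validate(xs, ys)
yield_over_baseline = 1 - model_mse / baseline_mse
print(f"model MSE={model_mse:.2f}, baseline MSE={baseline_mse:.2f}, "
      f"yield={yield_over_baseline:.1%}")
```

Because both errors are computed only on held-out folds, a model that merely memorizes its training data would not show a gain here, which is the point of using cross-validation to guard against over-fitting.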
There is almost always some code involved, even if you use a black-box solution. Typically, data scientists use Python, R or Java, and SQL. However, I've completed some projects that did not involve real coding, but instead machine-to-machine communications via APIs. Automation of code production (and of data science in general) is a hot topic, as evidenced by the publication of articles such as The Automated Statistician, and my own work to design simple, robust black-box solutions.
Some call the environment "packages". It can be anything from a bare Unix box accessed remotely, combined with scripting languages and data science libraries such as Pandas (Python), to something more structured such as Hadoop. Or it can be an integrated database system from Teradata, Pivotal or other vendors, or a package like SPSS, SAS, RapidMiner or MATLAB, or, typically, a combination of these.
By presentation, I mean presenting the results. Not all data science projects run continuously in the background, for instance to automatically buy stocks or predict the weather. Some are just ad-hoc analyses that need to be presented to decision makers, using Excel, Tableau and other tools. In some cases, the data scientist must work with business analysts to create dashboards, or to design alarm systems, with results from analysis e-mailed to selected people based on priority rules.
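The priority rules behind such an alarm system can be as simple as an ordered list of thresholds. The sketch below is purely illustrative: the score scale, the thresholds, the priority labels, and the e-mail addresses are all hypothetical, and the actual sending of e-mail is left out.

```python
# Hypothetical priority rules, highest priority first:
# (minimum score, priority label, recipients).
PRIORITY_RULES = [
    (0.9, "high", ["oncall@example.com", "manager@example.com"]),
    (0.6, "medium", ["analyst@example.com"]),
]

def route_alert(score):
    """Return (priority, recipients) for the first rule the score meets,
    or None if the score is below every threshold."""
    for threshold, priority, recipients in PRIORITY_RULES:
        if score >= threshold:
            return priority, recipients
    return None

for score in (0.95, 0.7, 0.1):
    print(score, route_alert(score))
```

Keeping the rules in a data structure rather than in nested conditions makes it easier for business analysts to review and adjust them.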
These components interact as follows. I invite you to create a nice graph from the dependencies table below. The first relationship reads as "the problem impacts or dictates the data".
Problem -> Data
Problem -> Algorithms
Algorithms -> Models
Algorithms -> Programming
Algorithms -> Environment
Data -> Environment
Environment -> Data
Data -> Algorithms
Data -> Problem
Problem -> Presentation
Models -> Presentation
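
One possible way to turn the table above into a graph is sketched below in Python. The edge list is copied from the table; the variable names and the choice of Graphviz DOT text as the output format are my own assumptions, and any graph-drawing tool would do.

```python
from collections import defaultdict

# Edges copied from the dependency table: (A, B) reads "A impacts or dictates B".
edges = [
    ("Problem", "Data"),
    ("Problem", "Algorithms"),
    ("Algorithms", "Models"),
    ("Algorithms", "Programming"),
    ("Algorithms", "Environment"),
    ("Data", "Environment"),
    ("Environment", "Data"),
    ("Data", "Algorithms"),
    ("Data", "Problem"),
    ("Problem", "Presentation"),
    ("Models", "Presentation"),
]

# Adjacency list: component -> components it impacts.
graph = defaultdict(list)
for source, target in edges:
    graph[source].append(target)

# Pairs of components that influence each other in both directions.
edge_set = set(edges)
mutual = sorted({tuple(sorted(e)) for e in edges if (e[1], e[0]) in edge_set})

def to_dot(edge_list):
    """Render the edges as Graphviz DOT text for a directed graph."""
    lines = ["digraph components {"]
    lines += [f'    "{source}" -> "{target}";' for source, target in edge_list]
    lines.append("}")
    return "\n".join(lines)

print("Mutual dependencies:", mutual)
print(to_dot(edges))
```

The DOT text can be saved to a file and rendered with Graphviz if it is installed. The mutual-dependency check highlights the two feedback loops in the table: between the problem and the data, and between the data and the environment.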