A traditional business project lifecycle, customized here for data science.
1. Identify the problem
- Identify metrics used to measure success over baseline (doing nothing)
- Identify type of problem: prototyping, proof of concept, root cause analysis, predictive analytics, prescriptive analytics, machine-to-machine implementation
- Identify key people within your organization and outside
- Get specifications, requirements, priorities, budgets
- How accurate does the solution need to be?
- Do we need all the data?
- Build internally versus use a vendor solution
- Vendor comparison, benchmarking
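As a rough illustration of measuring success over a "do nothing" baseline, the sketch below compares a model's accuracy with that of always predicting the most common outcome. The labels, the churn scenario, and the function names are hypothetical, and accuracy is only one possible metric.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of predictions that match the observed labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def baseline_accuracy(actual):
    """Accuracy of the 'do nothing' baseline: always predict the most common label."""
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)

# Hypothetical labels: 1 = customer churned, 0 = customer stayed.
actual    = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predicted = [0, 0, 0, 1, 0, 0, 0, 0, 1, 1]

model_acc = accuracy(predicted, actual)
base_acc = baseline_accuracy(actual)
lift = model_acc - base_acc  # improvement attributable to the model
print(f"model={model_acc:.2f} baseline={base_acc:.2f} lift={lift:+.2f}")
```

A lift close to zero suggests the solution adds little over the baseline, whatever its raw accuracy looks like.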
2. Identify available data sources
- Extract (or obtain) and check sample data (use sound sampling techniques); discuss the fields to make sure you understand the data
- Perform EDA (exploratory analysis, data dictionary)
- Assess quality of data, and value available in data
- Identify data glitches, find workarounds
- Are data quality and field population consistent over time?
- Are some fields a blend of different things (example: a keyword field that sometimes equals the user query and sometimes the advertiser keyword, with no way to tell them apart except through statistical analysis or by talking to business people)?
- How to improve data quality moving forward
- Do I need to create mini summary tables / database to
- Which tool do I need (R, Excel, Tableau, Python, Perl, SAS, and so on)?
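One way to check whether a field is populated consistently over time is to compute its fill rate per period on a sample extract. The sketch below is a minimal version of that idea; the row layout, the "month" and "keyword" field names, and the sample values are made up for illustration.

```python
from collections import defaultdict

def fill_rate_by_period(rows, field, period_key="month"):
    """Return {period: share of rows where `field` is populated}."""
    populated = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        period = row[period_key]
        total[period] += 1
        value = row.get(field)
        # Treat None and blank strings as "not populated".
        if value is not None and str(value).strip() != "":
            populated[period] += 1
    return {p: populated[p] / total[p] for p in sorted(total)}

# Hypothetical sample extract.
sample = [
    {"month": "2013-01", "keyword": "car insurance"},
    {"month": "2013-01", "keyword": "cheap flights"},
    {"month": "2013-02", "keyword": ""},
    {"month": "2013-02", "keyword": "used cars"},
    {"month": "2013-03", "keyword": None},
    {"month": "2013-03", "keyword": None},
]

rates = fill_rate_by_period(sample, "keyword")
for month, rate in rates.items():
    print(month, f"{rate:.0%}")
```

A sudden drop in fill rate from one period to the next is usually worth raising with the people who own the data before any modeling starts.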
3. Identify if additional data sources are needed
- What fields should be captured?
- How granular?
- How much historical data?
- Do we need real-time data?
- How to store or access the data (NoSQL? Map-Reduce?)
- Do we need experimental design?
4. Statistical Analyses
- Use imputation methods as needed
- Detect / remove outliers
- Select variables (variable reduction)
- Is the data censored (hidden data, as in survival analysis or time-to-crime statistics)?
- Cross-correlation analysis
- Model selection (as needed, favor simple models)
- Sensitivity analysis
- Cross-validation, model fitting
- Measure accuracy, provide confidence intervals
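Several of the steps above (imputation, outlier removal, a simple model, cross-validation, confidence intervals) can be sketched end to end with the Python standard library. This is only one possible set of choices, all of them assumptions made for illustration: median imputation, Tukey's IQR fences for outliers, a one-variable least-squares line as the simple model, 5-fold cross-validation, and a percentile bootstrap interval for the mean absolute error. The data is synthetic.

```python
import random
import statistics

def impute_median(values):
    """Replace missing values (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def iqr_bounds(values, k=1.5):
    """Tukey fences: points outside [Q1 - k*IQR, Q3 + k*IQR] are flagged."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def cross_validated_errors(xs, ys, k=5, seed=0):
    """Absolute prediction errors, each computed on a held-out fold."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    errors = []
    for fold in range(k):
        test = set(idx[fold::k])
        train = [i for i in idx if i not in test]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors += [abs(ys[i] - (a + b * xs[i])) for i in test]
    return errors

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2)) - 1]

# Synthetic data: y is roughly 2 + 3x, with one missing x and one gross error in y.
rng = random.Random(42)
x_raw = [float(i) for i in range(40)]
y_raw = [2 + 3 * x + rng.gauss(0, 1) for x in x_raw]
x_raw[20] = None      # simulate a missing predictor value
y_raw[7] = 1000.0     # simulate a gross recording error

x_clean = impute_median(x_raw)
low, high = iqr_bounds(y_raw)
kept = [(x, y) for x, y in zip(x_clean, y_raw) if low <= y <= high]
xs, ys = [x for x, _ in kept], [y for _, y in kept]

errors = cross_validated_errors(xs, ys)
ci_low, ci_high = bootstrap_ci(errors)
print(f"kept {len(kept)} of {len(y_raw)} rows")
print(f"mean abs error {statistics.fmean(errors):.2f}, "
      f"95% CI ({ci_low:.2f}, {ci_high:.2f})")
```

Keeping the model this simple is deliberate: it makes the effect of each cleaning step easy to see, and a more complex model can be swapped into `fit_line` later if the errors justify it.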
5. Implementation, development
- FSSRR: Fast, simple, scalable, robust, re-usable
- How frequently do I need to update lookup tables, white lists, data uploads, and so on?
- Need to create an API to communicate with other apps?
6. Communicate results
- Need to integrate results in dashboard? Need to create an email alert system?
- Decide on dashboard architecture, with business people
- Discuss potential improvements (with cost estimates)
- Provide training
- Comment the code and write a technical report explaining how your solution should be used, how its parameters are fine-tuned, and how its results are interpreted
- Test the model or implementation; stress tests
- Regular updates
- Final outsourcing to engineering and business people in your company, once the solution is stable
- Help move solution to new platform or vendor