The lifecycle of a traditional business project, adapted here to data science.

1. Identify the problem

  • Identify metrics used to measure success over baseline (doing nothing)
  • Identify type of problem: prototyping, proof of concept, root cause analysis, predictive analytics, prescriptive analytics, machine-to-machine implementation
  • Identify key people within your organization and outside
  • Get specifications, requirements, priorities, budgets
  • How accurate does the solution need to be?
  • Do we need all the data?
  • Build internally versus use a vendor solution
  • Vendor comparison, benchmarking
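
As one way to make "success over baseline" concrete, here is a minimal sketch for a classification problem. It assumes the "doing nothing" baseline means always predicting the most common label, and that accuracy is the agreed metric; both are illustrative choices, and the function names and data are made up.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most common label (assumed 'do nothing' baseline)."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def accuracy(labels, predictions):
    """Share of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def lift_over_baseline(labels, predictions):
    """How much better the solution is than the baseline, in accuracy points."""
    return accuracy(labels, predictions) - baseline_accuracy(labels)


# Made-up labels and predictions, for illustration only.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predictions = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

print("baseline:", baseline_accuracy(labels))
print("model:", accuracy(labels, predictions))
print("lift:", round(lift_over_baseline(labels, predictions), 2))
```

If the lift is close to zero, the project may not justify its budget, whatever the raw accuracy looks like.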

2. Identify available data sources

  • Extract (or obtain) and check sample data (use sound sampling techniques); discuss the fields to make sure you understand the data
  • Perform EDA (exploratory analysis, data dictionary)
  • Assess quality of data, and value available in data
  • Identify data glitches, find work-around
  • Are data quality and field population consistent over time?
  • Are some fields a blend of different things (example: a keyword field that is sometimes equal to the user query and sometimes to the advertiser keyword, with no way to tell them apart except via statistical analyses or by talking to business people)?
  • How to improve data quality moving forward
  • Do I need to create mini summary tables / database to
  • Which tool do I need (R, Excel, Tableau, Python, Perl, SAS and so on)
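
The question of whether fields are populated consistently over time can be checked with a few lines of code. The sketch below is only an illustration: the record layout, the `month` and `keyword` field names, and the 0.2 tolerance are all assumptions, not part of any real data set.

```python
import statistics
from collections import defaultdict


def fill_rate_by_period(records, period_key, field):
    """Share of records per period where `field` is present and non-blank."""
    totals = defaultdict(int)
    filled = defaultdict(int)
    for row in records:
        period = row[period_key]
        totals[period] += 1
        value = row.get(field)
        if value is not None and str(value).strip() != "":
            filled[period] += 1
    return {p: filled[p] / totals[p] for p in sorted(totals)}


def flag_inconsistent_periods(rates, tolerance=0.2):
    """Periods whose fill rate differs from the median rate by more than `tolerance`."""
    typical = statistics.median(rates.values())
    return [p for p, rate in rates.items() if abs(rate - typical) > tolerance]


# Made-up sample: the keyword field is mostly blank in the last month.
records = (
    [{"month": "2014-01", "keyword": "shoes"}] * 4
    + [{"month": "2014-02", "keyword": "boots"}] * 4
    + [{"month": "2014-03", "keyword": "hats"}]
    + [{"month": "2014-03", "keyword": ""}] * 3
)

rates = fill_rate_by_period(records, "month", "keyword")
print(rates)
print("check these periods:", flag_inconsistent_periods(rates))
```

A flagged period is a prompt to talk to the people who own the data, not proof of a glitch.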

3. Identify if additional data sources are needed

  • What fields should be captured?
  • How granular?
  • How much historical data?
  • Do we need real-time data?
  • How to store or access the data (NoSQL? Map-Reduce?)
  • Do we need experimental design?
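
If experimental design is needed, the simplest building block is random assignment of units to a control and a treatment group. The sketch below assumes a plain two-group split with a fixed seed for reproducibility; the function name and unit IDs are hypothetical, and real designs often need stratification or blocking on top of this.

```python
import random


def randomize_assignment(unit_ids, seed=0):
    """Shuffle the units and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "treatment": shuffled[half:]}


# Ten made-up user IDs.
groups = randomize_assignment(range(10), seed=42)
print(groups)
```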

4. Statistical Analyses

  • Use imputation methods as needed
  • Detect / remove outliers
  • Select variables (variable reduction)
  • Is the data censored (hidden data, as in survival analysis or time-to-crime statistics)?
  • Cross-correlation analysis
  • Model selection (as needed, favor simple models)
  • Sensitivity analysis
  • Cross-validation, model fitting
  • Measure accuracy, provide confidence intervals
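
Several of the steps above can be sketched with the Python standard library alone. The choices here are assumptions made for illustration, not the only options: median imputation, Tukey's 1.5 × IQR rule for outliers, shuffled k-fold splits for cross-validation, and a percentile bootstrap for the confidence interval. The data is made up.

```python
import random
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists; each index is tested exactly once."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up measurements with two missing values and one obvious glitch.
raw = [10, 12, None, 11, 13, 12, 95, 11, None, 12]
filled = impute_median(raw)
outliers = iqr_outliers(filled)
cleaned = [v for v in filled if v not in outliers]

print("outliers:", outliers)
print("95% CI for the mean:", bootstrap_mean_ci(cleaned))
for train, test in k_fold_indices(len(cleaned), k=3):
    print("train:", train, "test:", test)
```

Whether an outlier should be removed or investigated depends on the problem; the code only detects it.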

5. Implementation, development

  • FSSRR: Fast, simple, scalable, robust, re-usable
  • How frequently do I need to update lookup tables, white lists, data uploads, and so on
  • Debugging
  • Need to create an API to communicate with other apps?

6. Communicate results

  • Need to integrate results in dashboard? Need to create an email alert system?
  • Decide on dashboard architecture, with business people
  • Visualization
  • Discuss potential improvements (with cost estimates)
  • Provide training
  • Comment code, write a technical report, and explain how your solution should be used, how parameters should be fine-tuned, and how results should be interpreted

7. Maintenance

  • Test the model or implementation; stress tests
  • Regular updates
  • Final outsourcing to engineering and business people in your company, once the solution is stable
  • Help move solution to new platform or vendor
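
Regular updates are easier to schedule when the model is tested automatically. One possible check, sketched below, compares recent prediction errors against errors measured when the model was deployed. The 1.25 ratio and the function name are arbitrary assumptions; pick a threshold that matches your accuracy requirements.

```python
import statistics


def needs_refresh(reference_errors, recent_errors, max_ratio=1.25):
    """True if the recent mean error exceeds the reference mean error by more than max_ratio."""
    reference = statistics.fmean(reference_errors)
    recent = statistics.fmean(recent_errors)
    return recent > max_ratio * reference


# Made-up absolute errors: at deployment time, and from two later periods.
at_deployment = [1.0, 1.2, 0.8]
print(needs_refresh(at_deployment, [1.1, 0.9, 1.0]))  # errors unchanged
print(needs_refresh(at_deployment, [1.5, 1.6, 1.4]))  # errors have grown
```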

Comment by Khim Trillana on July 10, 2017 at 6:45pm

Thanks! I'm about to start my data science group and this is just what I needed. 

Comment by Saad Hamdan on January 3, 2017 at 6:33am

Thank you for sharing. At a high level, this is a typical IT project. How can this process be altered for the Agile process model?

Comment by Able Kuriakose on May 20, 2015 at 10:34am

Great to see this from a data science perspective. Thanks.

Comment by Sam Sur on August 4, 2014 at 8:56am

@Rana: The more data you have and the better the quality of data the better your models will be. Both quality and quantity of data are important.

Comment by Rana S Gautam on August 4, 2014 at 5:50am

How to judge how much historic data is required.

Comment by Vincent Granville on July 24, 2014 at 6:54pm

Sam - you are absolutely right. Even the stats models themselves need continuous re-assessment/refresh, in some automated way if possible.

Comment by Sam Sur on July 24, 2014 at 4:39pm

An important aspect of the entire lifecycle is being iterative with the steps mentioned above. The solution is only "stable" until business needs and metrics change, which is almost always continuously. Being able to iterate rapidly will show quick ROI to the business, though accuracy may be fairly low to start with.  

Comment by MuraliKrishnan on March 22, 2014 at 5:03pm

Good info. Thanks.
