According to Paco Nathan, a data scientist should:
- prepare an analysis and visualization of an unknown data set, while impatient stakeholders watch over your shoulder and ask pointed questions; be prepared to make quantitative arguments about the confidence of the results
- describe “loss function” and “regularization term” each in 25 words or less, with a compare/contrast of several examples, and show how to structure a range of tradeoffs for model transparency, predictive power, and resource requirements
- pitch a reorg proposal to an executive staff session which implies firing some ranking people
- interview 3-4 different departments that are hostile to your project, to tease out the metadata for datasets that they’ve been reluctant to release
- build, test, and deploy a mission-critical app with realtime SLAs, efficiently across a 1000+ node cluster
- troubleshoot intermittent bugs in somebody else’s code which is at least 2000 lines long, without their assistance
- leverage ensemble approaches to enhance a predictive model that you’re working on
- work on a deadline in pair programming with people from 3-4 different fields completely disjoint from the work that you’ve done
- learn to leverage the evolving PyData stack: IPython, pandas, scikit-learn, etc.
- learn how to lead an interdisciplinary team
- get experience in 1+ domains outside of data/analytics/programming
- get a good grounding in design and apply it to data visualization
- do everything you can to become a better writer and speaker (outside of academic confs)
- participate in meetups; publish blogs, presentations, etc. (hiring managers ignore resumes and look for published content online)
- get a good grounding in abstract algebra, Bayesian stats, linear algebra, convex optimization
- study up on algorithms and frameworks for streaming data (the bigger use cases on the horizon are not batch)
- learn Scalding and functional programming with type safety
- avoid Business Intelligence (like the plague)
- avoid anything referred to as “The Hadoop Ecosystem” or “Hadoop as an OS”
Do you agree with this?
Vincent Granville replied: There are all sorts of data scientists. In my case, as an entrepreneur managing a company on auto-pilot (no employees, 7-digit yearly revenue with 80% margins, with significant outsourcing to vendors), none of the above test questions apply. I'd probably fail most of them, but I am a data scientist nevertheless, as well as a business / growth / data hacker.