This article was written by Daniel McAuley.
I recently had the pleasure of speaking on a few panels about analytics to my fellow MBA students and alumni, as well as many Penn undergrads. After these talks, I was asked what the best resources are for someone coming from a non-technical business background who wants to develop the skills to become an effective data scientist. This post is an attempt to codify the advice I give and the general resources I point people towards. Hopefully, it will make what I have learned accessible to more people and provide some guidance for those who realize that the future belongs to the empirically inclined but don’t know where to start their journey to becoming part of the club.
However, I would caution the reader that what I propose here is only a starting point on a journey towards really understanding the power of good data science. And, as Sean Taylor once told me, learn only what you need to accomplish your goal; if there are things on this list that you know you don’t need, skip them. You won’t hurt my feelings. At its core, data science is really about curiosity, optimism, and continual learning, all of which are ongoing habits rather than boxes to be checked. Therefore, I expect this list to evolve as the tools themselves change and as I continue to discover more about data science itself.
1. Linear Algebra
Linear algebra is a topic that underlies a lot of the statistical techniques and machine learning algorithms that you will employ as a data scientist. I like to recommend a MOOC I took through Coursera years ago, Coding the Matrix: Linear Algebra through Computer Science Applications. As the name implies, the course teaches linear algebra in the context of computer science (specifically using Python, which lends itself well to data science). There is also an optional companion textbook that makes a great reference manual.
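To make the link between linear algebra and statistics concrete, here is a minimal sketch of ordinary least squares solved through the normal equations, written in plain Python. The helper names (`transpose`, `matmul`, `solve`, `least_squares`) and the toy data are my own illustrative assumptions; they are not taken from the course or its textbook.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Build the augmented matrix [a | b] so row operations apply to both sides.
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitute from the last row upwards.
    x = [0.0] * n
    for r in reversed(range(n)):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def least_squares(xs, ys):
    """Fit y = intercept + slope * x by solving (X^T X) beta = X^T y."""
    design = [[1.0, x] for x in xs]  # the column of ones gives the intercept
    xt = transpose(design)
    xtx = matmul(xt, design)
    xty = [row[0] for row in matmul(xt, [[y] for y in ys])]
    return solve(xtx, xty)


# Toy data lying exactly on y = 1 + 2x (made up for illustration).
intercept, slope = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(round(intercept, 6), round(slope, 6))
```

In practice you would reach for a numerical library rather than hand-rolled elimination, but writing it out once shows that a regression fit is just a small linear system.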
2. R Programming
Given that we use R at Wealthfront, I have a few resources that I think are important here. The first, R for Data Science by Garrett Grolemund and Hadley Wickham, will be published in physical form in July 2016 but is available for free online now. If you only read one data science book, it should be this one.
Next up, our friend Hadley has also written Advanced R, which covers functional programming, metaprogramming, and performant code as well as the quirks of R.
Hadley is also responsible for some of the packages I use every day that make 90% of common data science tasks quicker and less verbose. I recommend checking out the following libraries; they will change the way you write code in R:
- ggplot2 — An implementation of the Grammar of Graphics in R
- devtools — Tools to make an R developer’s life easier
- dplyr — Plyr specialized for data frames: faster & with remote data stores
- purrr — Make your pure R function purrr with functional programming
- tidyr — Easily tidy data with spread and gather functions
- lubridate — Make working with dates in R just that little bit easier
- testthat — An R package to make testing fun
For extra credit, check out yet another of Hadley’s books: R Packages. This is a great follow-up resource for those of you who want to write reproducible, well-documented R code that other people can easily use (and "other people" includes your future self!).
3. SQL
This is probably the easiest section of the guide as you can teach yourself most of SQL in a few hours. Code School has both introductory and intermediate courses that you can get through in an afternoon.
The Sequel to SQL covers everything from aggregate functions and joins to normalization and subqueries. And while mastering these skills takes practice, you can still get an idea of what SQL can and cannot do without too much work.
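To give a flavor of those topics, here is a small sketch that runs a join, a few aggregate functions, and a subquery against an in-memory SQLite database using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their rows are hypothetical data made up for this example; they do not come from the Code School courses.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edgar');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 150.0), (3, 2, 30.0);
""")

# Join + aggregates: order count and total spend per customer.
# LEFT JOIN keeps customers with no orders; COALESCE turns their NULL sum into 0.
totals = conn.execute("""
    SELECT c.name, COUNT(o.id) AS n_orders, COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total DESC
""").fetchall()

# Subquery: customers whose total spend exceeds the average order amount.
big_spenders = conn.execute("""
    SELECT c.name
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    HAVING SUM(o.amount) > (SELECT AVG(amount) FROM orders)
""").fetchall()

for row in totals:
    print(row)
print(big_spenders)
```

Splitting customers and orders into two tables linked by `customer_id` is also a small instance of normalization: each customer's name is stored once rather than repeated on every order.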
4. Bayesian Reasoning
Without wading into the age-old Frequentist vs. Bayesian debate (or non-debate), I think that a solid foundation in Bayesian reasoning and statistics is a crucial part of any data scientist’s repertoire. For example, Bayesian reasoning underpins much of modern A/B testing and Bayesian methods are applied in many other areas of data science (and are generally covered less in introductory statistics courses).
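As a rough illustration of the A/B testing point, the sketch below estimates the probability that variant B has a higher conversion rate than variant A using a Beta-Binomial model and Monte Carlo draws. The uniform Beta(1, 1) prior, the function name `prob_b_beats_a`, and the conversion counts are all assumptions chosen for the example, not a prescribed method.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    wins = 0
    for _ in range(draws):
        # With a Beta(1, 1) prior and binomial data, each posterior is
        # Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Made-up counts: 120 of 1000 visitors converted on A, 150 of 1000 on B.
print(prob_b_beats_a(120, 1000, 150, 1000))
```

The output is a direct statement about the quantity people usually care about ("how likely is B to be better?"), which is a large part of the appeal of the Bayesian framing for experiments.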
John K. Kruschke has a great ability to break down complex material and convey it in a way that is intuitive and practical. Along with R for Data Science, his textbook is probably one of the best all-around resources for learning how to do data science in the R programming language.
Additionally, Kruschke’s blog makes a great companion resource to the textbook if you’re looking for more examples of problems to solve or answers to questions you still have after reading the book. And if a textbook isn’t exactly what you’re looking for, then Rasmus Bååth’s research blog, Publishable Stuff, is another great resource for learning about Bayesian approaches to problem-solving.