I have been looking to create this list for a while now. There are many people on quora who ask me how I started in the data science field. And so I wanted to create this reference.
To be frank, when I first started learning it all looked very utopian and out of the world. The Andrew Ng course felt like black magic. And it still doesn’t cease to amaze me. After all, we are predicting the future. Take the case of Nate Silver – What else can you call his success if not Black Magic?
But it is not magic. And this is a way an aspiring guy could take to become a self-trained data scientist. Follow in order. I have tried to include everything that comes to my mind. So here goes:
The one stat course you gotta take. If not for the content then for Prof. Blitzstein sense of humor. I took this course to enhance my understanding of probability distributions and statistics, but this course taught me a lot more than that. Apart from Learning to think conditionally, this also taught me how to explain difficult concepts with a story.
This was a Hard Class but most definitely fun. The focus was not only on getting Mathematical proofs but also on understanding the intuition behind them and how intuition can help in deriving them more easily. Sometimes the same proof was done in different ways to facilitate learning of a concept.
One of the things I liked most about this course is the focus on concrete examples while explaining abstract concepts. The inclusion of Gambler’s Ruin Problem, Matching Problem, Birthday Problem, Monty Hall, Simpsons Paradox, St. Petersberg Paradox etc. made this course much much more exciting than a normal Statistics Course.
It will help you understand Discrete (Bernoulli, Binomial, Hypergeometric, Geometric, Negative Binomial, FS, Poisson) and Continuous (Uniform, Normal, expo, Beta, Gamma) Distributions and the stories behind them. Something that I was always afraid of.
He got a textbook out based on this course which is clearly a great text:
2. Data Science CS109: –
Again by Professor Blitzstein. Again an awesome course. Watch it after Stat110 as you will be able to understand everything much better with a thorough grinding in Stat110 concepts. You will learn about Python Libraries like Numpy,Pandas for data science, along with a thorough intuitive grinding for various Machine learning Algorithms. Course description from Website:
After doing these two above courses you will gain the status of what I would like to call a “Beginner”. Congrats!!!. You know stuff, you know how to implement stuff. Yet you do not fully understand all the math and grind that goes behind all this.
Here comes the Game Changer machine learning course. Contains the maths behind many of the Machine Learning algorithms. I will put this course as the one course you gotta take as this course motivated me into getting in this field and Andrew Ng is a great instructor. Also this was the first course that I took.
Also recently Andrew Ng Released a new Book. You can get the Draft chapters by subscribing on his website here.
You are done with the three musketeers of the trade. You know Python, you understand Statistics and you have gotten the taste of the math behind ML approaches. Now it is time for the new kid on the block. D’artagnan. This kid has skills. While the three musketeers are masters in their trade, this guy brings qualities that adds a new freshness to our data science journey. Here comes Big Data for you.
Let us first focus on the literal elephant in the room – Hadoop. Short and Easy Course. Taught the Fundamentals of Hadoop streaming with Python. Taken by Cloudera on Udacity. I am doing much more advanced stuff with python and Mapreduce now but this is one of the courses that laid the foundation there.
Once you are done through this course you would have gained quite a basic understanding of concepts and you would have installed a Hadoop VM in your own machine. You would also have solved the Basic Wordcount Problem. Read this amazing Blog Post from Michael Noll: Writing An Hadoop MapReduce Program In Python – Michael G. Noll. Just read the basic mapreduce codes. Don’t use Iterators and Generators yet. This has been a starting point for many of us Hadoop developers.
Now try to solve these two problems from the CS109 Harvard course from 2013:
A. First, grab the file word_list.txt from here. This contains a list of six-letter words. To keep things simple, all of the words consist of lower-case letters only.Write a mapreduce job that finds all anagrams in word_list.txt.
B. For the next problem, download the file baseball_friends.csv. Each row of this csv file contains the following:
- A person’s name
- The team that person is rooting for — either “Cardinals” or “Red Sox”
- A list of that person’s friends, which could have arbitrary length
For example: The first line tells us that Aaden is a Red Sox friend and he has 65 friends, who are all listed here. For this problem, it’s safe to assume that all of the names are unique and that the friendship structure is symmetric (i.e. if Alannah shows up in Aaden’s friends list, then Aaden will show up in Alannah’s friends list). Write an mr job that lists each person’s name, their favorite team, the number of Red Sox fans they are friends with, and the number of Cardinals fans they are friends with.
Try to do this yourself. Don’t use the mrjob (pronounced Mr. Job) way that they use in the CS109 2013 class. Use the proper Hadoop Streaming way as taught in the Udacity class as it is much more customizable in the long run.
If you are done with these, you can safely call yourself as someone who could “think in Mapreduce” as how people like to call it.Try to do groupby, filter and joins using Hadoop. You can read up some good tricks from my blog:
Hadoop Mapreduce Streaming Tricks and Techniques
If you are someone who likes learning from a book you can get:
5. Spark – In memory Big Data tool.
Now comes the next part of your learning process. This should be undertaken after a little bit of experience with Hadoop. Spark will provide you with the speed and tools that Hadoop couldn’t.
Now Spark is used for data preparation as well as Machine learning purposes. I would encourage you to take a look at the series of courses on edX provided by Berkeley instructors. This course delivers on what it says. It teaches Spark. Total beginners will have difficulty following the course as the course progresses very fast. That said anyone with a decent understanding of how big data works will be OK.
I have written a little bit about Basic data processing with Spark here. Take a look: Learning Spark using Python: Basics and Applications
Also take a look at some of the projects I did as part of course at github
If you would like a book to read:
If you don’t go through the courses, try solving the same two problems above that you solved by Hadoop using Spark too. Otherwise the problem sets in the courses are more than enough.
6. Understand Linux Shell:
Shell is a big friend for data scientists. It allows you to do simple data related tasks in the terminal itself. I couldn’t emphasize how much time shell saves for me everyday.
Read these tutorials by me for doing that:
Shell Basics every Data Scientist Should know -Part I Shell Basics every Data Scientist Should know – Part II(AWK)
If you would like a course you can go for this course on edX.
If you want a book, go for:
Congrats you are an “Hacker” now. You have got all the main tools in your belt to be a data scientist. On to more advanced topics. From here it depends on you what you want to learn. You may want to take a totally different approach than what I took going from here. There is no particular order. “All Roads lead to Rome” as long as you are running.
I took the previous version of the specialization which was a single course taught by Mine Çetinkaya-Rundel. She is a great instrucor and explains the fundamentals of Statistical inference nicely. A must take course. You will learn about hypothesis testing, confidence intervals, and statistical inference methods for numerical and categorical data. You can also use these books:
8. Deep Learning
Intro – Making neural nets uncool again. An awesome Deep learning class from Kaggle Master Jeremy Howard. Entertaining and enlightening at the same time.
Advanced – A series of notes from the Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition.
Bonus – A free online book by Michael Nielsen.
Advanced Math Book – A math intensive book by Yoshua Bengio & Ian Goodfellow
This course used to be there on Coursera but now only video links on youtube available. You can learn from this book too:
Apart from that if you want to learn about Python and the basic intricacies of the language you can take the Computer Science Mini Specialization from RICE universitytoo. This is a series of 6 short but good courses. I worked on these courses as Data science will require you to do a lot of programming. And the best way to learn programming is by doing programming. The lectures are good but the problems and assignments are awesome. If you work on this you will learn Object Oriented Programming,Graph algorithms and games in Python. Pretty cool stuff.
10. Advanced Maths:
Couldn’t write enough of the importance of Math. But here are a few awesome resources that you can go for.
Linear Algebra By Gilbert Strang – A Great Class by a great Teacher. I Would definitely recommend this class to anyone who wants to learn LA.
Convex Optimization – a MOOC on optimization from Stanford, by Steven Boyd, an authority on the subject.
The Machine learning field is evolving and new advancements are made every day. That’s why I didn’t put a third tier. The maximum I can call myself is a “Hacker” and my learning continues. Hope you do the same.
Hope you like this list. Please provide your inputs in comments on more learning resources as you see fit.
Till then. Ciao!!!