When I stumbled upon the phrase "Data Scientist" 3 years ago, I immediately recognized it as my best prospect for a productive career. How to start? What are the tools of the trade?
This is the blog post I wish I could have read back then.
Many of the things I list here didn't exist or were unstable until recently.
I discovered the "predictive analytics" rabbit hole and started to read and watch whatever I could find on the subject. Upon watching the GigaOM interview with Flip Kromer (https://www.youtube.com/watch?v=OaOKbeWs9d4), the rabbit holes multiplied. I watched the interview several times to make sure I had noted everything in his data management laundry list.
Every aspiring "Big Data" worker should watch his interview.
Which of these databases will I need for my DS career? As I learned the different flavours of NoSQL, it became clear that I needed to learn ALL of them: different nails need different hammers.
A Flip Kromer quote driving my current project is "We need a Mechanical Turk that slides up the talent scale." Mechanical Turk for specialists is a major frontier of Data Science, and the one I'm most interested in. My favourite example is Foldit (http://fold.it/portal/). I call these "Data Workbenches": the turning point towards visualizations and browsers doing more than just browsing.
Learning Node.js (closures, promises, ...) dumped even more tools on the list, and my learning curve was turning into a wall.
Can somebody please help me put all of this in context?!?
Thank you Hadley Wickham! The 3 main competencies of a Data Scientist are TRANSFORM, MODEL, and VISUALIZE.
Python: pandas DataFrames
R: plyr and dplyr
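To make TRANSFORM concrete, here is a minimal sketch of the split-apply-combine pattern that pandas, plyr, and dplyr are all built around. The `sales` table and its column names are made up for illustration; only the pandas calls themselves are real.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west"],
    "units": [10, 20, 5, 15, 10],
})

# Split the rows by region, apply two aggregations to each group,
# and combine the results back into a single table.
summary = (
    sales.groupby("region")["units"]
    .agg(["sum", "mean"])
    .reset_index()
)
print(summary)
```

The dplyr version of the same idea chains `group_by` and `summarise`; the shape of the thinking carries over even though the syntax differs.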
Statistical programming becomes straightforward when you know what stats are needed for your data.
Take a look at your data, learn some stats, and then go shopping for modules.
Python: numpy & scipy modules
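As a rough sketch of "look at your data, then pick the stats", the snippet below computes a few descriptive statistics and fits a straight line with numpy alone. The `hours` and `score` arrays are invented for the example, and a straight-line fit is just one assumption about what the data might need.

```python
import numpy as np

# Hypothetical measurements, invented for this illustration.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
score = np.array([52.0, 54.0, 56.0, 58.0, 60.0])

# Look first: basic descriptive statistics.
mean_score = score.mean()
std_score = score.std(ddof=1)  # ddof=1 gives the sample standard deviation

# Then model: least-squares straight line, score ~ slope * hours + intercept.
slope, intercept = np.polyfit(hours, score, deg=1)

# Pearson correlation, read off the 2x2 correlation matrix.
r = np.corrcoef(hours, score)[0, 1]

print(mean_score, std_score, slope, intercept, r)
```

Once numpy runs out, `scipy.stats` is where to shop next for distributions and hypothesis tests.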
Wait! That's not it. We still don't have any data! We need to get some and make it machine readable.
Scrapy web scraper (python libxml2 and libxslt)
PLY: Python Lex and Yacc
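"Machine readable" usually means turning markup into rows and columns. The sketch below is not Scrapy or PLY: it uses the standard library's `html.parser` as a stand-in so it runs without installing anything. The `TableExtractor` class and the HTML fragment are hypothetical, written only for this example.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()


# Hypothetical page fragment, invented for this illustration.
html = """
<table>
  <tr><th>name</th><th>count</th></tr>
  <tr><td>alpha</td><td>3</td></tr>
  <tr><td>beta</td><td>7</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(html)

# First row is the header; the rest become one dict per record.
header, *records = parser.rows
data = [dict(zip(header, rec)) for rec in records]
print(data)
```

A real scraper adds fetching, retries, and politeness on top of this, which is the part Scrapy takes care of.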
There are a few tools still missing that we need to set up an environment.
Does a Data Scientist really need to learn all of these tools? Yes. These, and more still. Serving data will be discussed in my next post.
Enterprise people might recognize that I haven't mentioned Hadoop. This is because I believe that people are trying to use MapReduce in situations where it causes more problems than it solves. I also don't like Java, and can't wait for an alternative to HDFS. Travis Oliphant has a great presentation about fighting against the tide of Hadoop in the enterprise: https://www.youtube.com/watch?v=i0FCn889ucs
I almost forgot the MOOCs - Massive Open Online Courses. Coursera.org has offerings for almost every subject you need to learn about. My personal favourite MOOC was Machine Learning with Andrew Ng. Go watch it on YouTube and do the exercises too: https://www.youtube.com/watch?v=UzxYlbK2c7E
I enjoyed videos on:
Introduction to Data Science
Computational Methods for Data Analysis
Statistics: Making Sense of Data
Big Data in Education
Mathematical Biostatistics Boot Camp
Probabilistic Graphical Models
Social and Economic Networks: Models and Analysis
Markets with Frictions
Maps and the Geospatial Revolution
Computational Finance and Financial Econometrics
Principles of Reactive Programming
Analytic Combinatorics I & II
If you're an aspiring Data Scientist, welcome to a career of lifetime learning. Good luck and have fun :)
Next post: web stacks, cloud services, converging trends, and why I say "Future Web == Big Data"