The Data Science Toolkit - taking your first steps towards becoming a Data Scientist

When I stumbled upon the phrase "Data Scientist" 3 years ago, I immediately recognized it as my best prospect for a productive career. How to start? What are the tools of the trade?

This is the blog post I wish I could have read back then.
Many of the things I list here didn't exist or were unstable until recently.

I discovered the "predictive analytics" rabbit hole and started to read and watch whatever I could find on the subject. Upon watching the GigaOM interview with Flip Kromer (https://www.youtube.com/watch?v=OaOKbeWs9d4), the rabbit holes multiplied. I watched the interview several times to make sure I had everything in his data management laundry list:

PostgreSQL
MongoDB
HBase
Cassandra
ElasticSearch
Redis
Tokyo Tyrant

Chef/Puppet/Ansible

Every aspiring "Big Data" worker should watch his interview.

Which of these databases will I need for my DS career? As I learned the different flavours of NoSQL, it became clear that I needed to learn ALL of them: different nails need different hammers.

A Flip Kromer quote driving my current project is "We need a Mechanical Turk that slides up the talent scale." Mechanical Turk for specialists is a major frontier of Data Science, and one I'm most interested in. My favourite example is FoldIt http://fold.it/portal/ . I call these "Data Workbenches": the turning point towards visualizations and browsers doing more than just browsing.

Visualizing data is the most glamorous of the DS skills, and most of us are dazzled by d3.js and feel love at first sight. Falling in love with d3.js brings a new set of rabbit holes: HTML5, JavaScript, CSS, SVG, and Node.js.

Learning Node.js (closures, promises, ...) dumps even more tools on the list, and my learning curve is turning into a wall.
Can somebody please help me put all of this in context?!?

https://www.youtube.com/watch?v=TaxJwC_MP9Q
Thank you Hadley Wickham! The 3 main competencies of a Data Scientist are TRANSFORM, MODEL, and VISUALIZE.

TRANSFORM
Python: pandas DataFrames
R: plyr and dplyr
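
To make TRANSFORM a little more concrete, here is a minimal sketch of the split-apply-combine pattern in pandas. The column names and numbers are made up for illustration; they are not from any real dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per sale (names and numbers are made up).
sales = pd.DataFrame(
    {
        "region": ["north", "south", "north", "south", "north"],
        "units": [10, 4, 6, 8, 2],
        "price": [2.0, 3.0, 2.0, 3.0, 2.5],
    }
)

# Add a derived column, then split by region, sum each group, and combine.
sales["revenue"] = sales["units"] * sales["price"]
summary = sales.groupby("region")[["units", "revenue"]].sum()

print(summary)
```

The same shape of operation (derive a column, group, aggregate) is what plyr and dplyr are built around on the R side.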

MODEL
Statistical programming becomes straightforward when you know what stats are needed for your data.
Take a look at your data, learn some stats, and then go shopping for modules.
Python: numpy & scipy modules
R: http://www.r-bloggers.com/the-50-most-used-r-packages/
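
As a rough illustration of "look at your data, then go shopping for modules", the sketch below uses only numpy: a couple of descriptive statistics first, then a straight-line least-squares fit. The data are invented so that they follow y = 2x + 1 exactly; real data will of course be noisier and may call for something from scipy instead.

```python
import numpy as np

# Hypothetical measurements that follow y = 2x + 1 exactly (made-up data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Look at the data first: simple descriptive statistics.
print("mean of y:", y.mean())
print("correlation:", np.corrcoef(x, y)[0, 1])

# Then fit a model: a degree-1 polynomial is an ordinary least-squares line.
slope, intercept = np.polyfit(x, y, deg=1)
print("slope:", round(slope, 6), "intercept:", round(intercept, 6))
```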

VISUALIZE
Python: matplotlib
R: ggplot2
HTML5: d3.js
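
For VISUALIZE, a minimal matplotlib sketch might look like the following. The points are made up, and the "Agg" backend is chosen here only so the example runs without a display; on a desktop you would more likely call plt.show().

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Hypothetical data points (made up for illustration).
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]

fig, ax = plt.subplots()
ax.scatter(x, y, label="observations")
ax.plot(x, y, linestyle="--", label="trend")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Toy scatter plot")
ax.legend()

# Write the figure to an in-memory PNG; fig.savefig("plot.png") writes a file.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)
```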

Wait! That's not it. We still don't have any data! We need to get some and make it machine readable.

GET DATA
Scrapy web scraper (python libxml2 and libxslt)
PLY: Python Lex and Yacc
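
To show what "make it machine readable" means in practice, here is a small sketch that turns an HTML table into a list of dictionaries. Note that it deliberately uses html.parser from the Python standard library rather than Scrapy, so that it runs with nothing installed; Scrapy adds crawling, selectors, and pipelines on top of this basic idea. The TableParser class and the page fragment are hypothetical, written just for this example.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self._row[-1] += data.strip()


# Hypothetical page fragment (values made up); a real scraper would download it.
html = """
<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Oslo</td><td>700000</td></tr>
  <tr><td>Bergen</td><td>290000</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
header, *records = parser.rows
data = [dict(zip(header, record)) for record in records]
print(data)
```

Once the rows are dictionaries, they can go straight into the TRANSFORM step.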

There are a few tools still missing that we need to set up an environment:
Vagrant
Docker
Git

Does a Data Scientist really need to learn all of these tools? Yes. These, and more still. Serving data will be discussed in my next post.

Enterprise people might recognize that I haven't mentioned Hadoop. This is because I believe that people are trying to use MapReduce in situations where it causes more problems than it solves. I also don't like Java, and can't wait for an alternative to HDFS. Travis Oliphant has a great presentation about fighting against the tide of Hadoop in the enterprise: https://www.youtube.com/watch?v=i0FCn889ucs

I almost forgot the MOOCs - Massive Open Online Courses. Coursera.org has offerings for almost every subject you need to learn about. My personal favourite MOOC was Machine Learning with Andrew Ng - go watch it on YouTube and do the exercises too: https://www.youtube.com/watch?v=UzxYlbK2c7E

I enjoyed videos on:

Introduction to Data Science
Scientific Computing
Computational Methods for Data Analysis
Startup Engineering
Statistics: Making Sense of Data
Big Data in Education
Mathematical Biostatistics Boot Camp
Information Theory
Probabilistic Graphical Models
Social and Economic Networks: Models and Analysis
Markets with Frictions
Maps and the Geospatial Revolution
Computational Finance and Financial Econometrics
Principles of Reactive Programming
Analytic Combinatorics I & II

If you're an aspiring Data Scientist, welcome to a career of lifetime learning. Good luck and have fun :)

Next post: web stacks, cloud services, converging trends, and why I say "Future Web == Big Data"

Comment

Comment by Adrian Walker on December 12, 2015 at 10:21am

You can also jump right in and write database apps in Executable English, using a browser. Then run your apps in the browser, and get English explanations of the results.

Here's a simple example...

"Code":


some-person is a man
-----------------------------
that-person is mortal


this-person is a man
================
 Socrates

Running the code gives the answer

this-person is mortal
================
Socrates

You can always get an explanation, such as


Socrates is a man
------------------------
Socrates is mortal


Here's an example in which this kind of literate programming can be
used to test and clarify policy hypotheses, such as "a high level of
debt reduces economic growth"

www.reengineeringllc.com/demo_agents/GrowthAndDebt1.agent

with a paper about it

www.astd.org/Publications/Magazines/The-Public-Manager/Archives/201...

Thanks for comments,  -- Adrian

Executable Open English / Internet Business Logic
Online at www.reengineeringllc.com  
Nothing to download, shared use is free, and there are no advertisements

Comment by Jerry Smith on October 20, 2015 at 4:57am

So much to learn 

Comment by Jeff Dixon on March 16, 2015 at 4:27am

This is a great post.  Thanks for taking the time to put this all together.  I'm just starting down this rabbit hole and feeling a bit overwhelmed, but this gives me a road map.  Much appreciated.

Comment by Dr.Sarma M.V.K on January 30, 2014 at 10:52pm

Good compilation and really helps to organize oneself.

Rgds

Dr. Sarma

© 2017 Data Science Central