When I stumbled upon the phrase "Data Scientist" 3 years ago, I immediately recognized it as my best prospect for a productive career. How to start? What are the tools of the trade?
This is the blog post I wish I could have read back then.
Many of the things I list here didn't exist or were unstable until recently.
I discovered the "predictive analytics" rabbit hole and started reading and watching whatever I could find on the subject. After I watched the GigaOM interview with Flip Kromer (https://www.youtube.com/watch?v=OaOKbeWs9d4), the rabbit holes multiplied. I watched the interview several times to make sure I had captured everything in his data management laundry list:
PostgreSQL
MongoDB
HBase
Cassandra
ElasticSearch
Redis
Tokyo Tyrant
Chef/Puppet/Ansible
Every aspiring "Big Data" worker should watch his interview.
Which of these databases would I need for my DS career? As I learned the different flavours of NoSQL, it became clear that I needed to learn ALL of them - different nails need different hammers.
A Flip Kromer quote driving my current project is "We need a Mechanical Turk that slides up the talent scale." Mechanical Turk for specialists is a major frontier of Data Science, and the one I'm most interested in. My favourite example is FoldIt (http://fold.it/portal/). I call these "Data Workbenches": the turning point towards visualizations and browsers doing more than just browsing.
Visualizing data is the most glamorous of the DS skills, and most of us are dazzled by d3.js and fall in love at first sight. Falling in love with d3.js brings a new set of rabbit holes: HTML5, JavaScript, CSS, SVG, and NodeJS.
Learning NodeJS (closures, promises, ...) dumped even more tools on the list, and my learning curve was turning into a wall.
Can somebody please help me put all of this in context?!?
https://www.youtube.com/watch?v=TaxJwC_MP9Q
Thank you Hadley Wickham! The 3 main competencies of a Data Scientist are TRANSFORM, MODEL, and VISUALIZE.
TRANSFORM
Python: pandas DataFrames
R: plyr and dplyr
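To make TRANSFORM concrete, here is a minimal sketch of the split-apply-combine pattern that both pandas and plyr/dplyr are built around. The data and the column names (region, units) are made up for illustration; only the pandas calls themselves are real.

```python
import pandas as pd

# Hypothetical toy data: one row per sale.
sales = pd.DataFrame(
    {
        "region": ["east", "east", "west", "west", "west"],
        "units": [10, 20, 5, 15, 10],
    }
)

# Split the rows by region, apply two aggregations, combine into one table.
summary = sales.groupby("region")["units"].agg(["sum", "mean"])
print(summary)
```

The same shape of code (group, aggregate, inspect) covers a large share of day-to-day data wrangling.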
MODEL
Statistical programming becomes straightforward when you know what stats are needed for your data.
Take a look at your data, learn some stats, and then go shopping for modules.
Python: numpy & scipy modules
R: http://www.r-bloggers.com/the-50-most-used-r-packages/
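As a rough illustration of "look at your data, learn some stats, then go shopping for modules", the sketch below summarizes a tiny hand-made dataset and fits a straight line with numpy alone. The numbers are invented, and a least-squares line is just one assumed choice of model; scipy offers richer statistical routines once you know what you need.

```python
import numpy as np

# Hypothetical measurements, typed in by hand for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Step 1: look at the data with basic descriptive statistics.
print("mean of y:", y.mean(), "sample std of y:", y.std(ddof=1))

# Step 2: fit a straight line by ordinary least squares (degree-1 polynomial).
slope, intercept = np.polyfit(x, y, 1)

# Step 3: sanity-check the fit with the correlation coefficient.
r = np.corrcoef(x, y)[0, 1]
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f}")
```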
VISUALIZE
Python: matplotlib
R: ggplot2
HTML5: d3.js
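For VISUALIZE, a minimal matplotlib sketch looks like the following. The data points are invented, and the off-screen "Agg" backend is an assumption made so the example runs on a machine without a display; in a notebook you would simply show the figure.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

# Hypothetical observations to plot.
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

fig, ax = plt.subplots()
ax.scatter(x, y, label="observations")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Toy scatter plot")
ax.legend()

# Write the figure to an in-memory PNG instead of a file on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
print(len(png_bytes), "bytes of PNG data")
```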
Wait! That's not it. We still don't have any data! We need to get some and make it machine readable.
GET DATA
Scrapy web scraper (python libxml2 and libxslt)
PLY: Python Lex and Yacc
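To show what "make it machine readable" means in practice, here is a small sketch that turns an HTML table into a list of records. It deliberately uses Python's standard-library html.parser rather than Scrapy, so it is a stand-in for the idea and not a Scrapy example; the HTML snippet, the TableParser class, and the field names are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this.
HTML = """
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>alpha</td><td>3</td></tr>
  <tr><td>beta</td><td>7</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a cell.
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = TableParser()
parser.feed(HTML)

# First row is the header; the rest become dictionaries keyed by it.
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

Once the data is a list of dictionaries, it can go straight into a pandas DataFrame or a database.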
There are still a few tools missing that we need to set up an environment:
Vagrant
Docker
Git
Does a Data Scientist really need to learn all of these tools? Yes. These, and more still. Serving data will be discussed in my next post.
Enterprise people might notice that I haven't mentioned Hadoop. This is because I believe that people are trying to use MapReduce in situations where it causes more problems than it solves. I also don't like Java, and can't wait for an alternative to HDFS. Travis Oliphant has a great presentation about fighting against the tide of Hadoop in the enterprise: https://www.youtube.com/watch?v=i0FCn889ucs
I almost forgot the MOOCs - Massive Open Online Courses. Coursera.org has offerings for almost every subject you need to learn about. My personal favourite MOOC was Machine Learning with Andrew Ng - go watch it on YouTube and do the exercises too: https://www.youtube.com/watch?v=UzxYlbK2c7E
I enjoyed videos on:
Introduction to Data Science
Scientific Computing
Computational Methods for Data Analysis
Startup Engineering
Statistics: Making Sense of Data
Big Data in Education
Mathematical Biostatistics Boot Camp
Information Theory
Probabilistic Graphical Models
Social and Economic Networks: Models and Analysis
Markets with Frictions
Maps and the Geospatial Revolution
Computational Finance and Financial Econometrics
Principles of Reactive Programming
Analytic Combinatorics I & II
If you're an aspiring Data Scientist, welcome to a career of lifelong learning. Good luck and have fun :)
Next post: web stacks, cloud services, converging trends, and why I say "Future Web == Big Data"
Comment
You can also jump right in and write database apps in Executable English, using a browser. Then run your apps in the browser, and get English explanations of the results.
Here's a simple example...
"Code":
some-person is a man
-----------------------------
that-person is mortal
this-person is a man
================
Socrates
Running the code gives the answer
this-person is mortal
================
Socrates
You can always get an explanation, such as
Socrates is a man
------------------------
Socrates is mortal
Here's an example in which this kind of literate programming can be
used to test and clarify policy hypotheses, such as "a high level of
debt reduces economic growth"
www.reengineeringllc.com/demo_agents/GrowthAndDebt1.agent
with a paper about it
www.astd.org/Publications/Magazines/The-Public-Manager/Archives/201...
Thanks for comments, -- Adrian
Executable Open English / Internet Business Logic
Online at www.reengineeringllc.com
Nothing to download, shared use is free, and there are no advertisements
So much to learn
This is a great post. Thanks for taking the time to put this all together. I'm just starting down this rabbit hole and feeling a bit overwhelmed, but this gives me a road map. Much appreciated.
Good compilation; it really helps to organize oneself.
Rgds
Dr. Sarma
© 2021 TechTarget, Inc.