I'm hoping someone can give me some advice around how to move into the field of data science.
My background is in physics, in which I earned a first-class master's degree in 2006, but since graduating I have pursued a career in corporate strategy consulting. I chose strategy because it is about looking beneath what is happening and analyzing trends to help guide decisions.
However, this career path has never fully satisfied my desire to manipulate data, identify patterns and make decisions based on them. Data science, on the other hand, is about doing just this, and I've started exploring options to move into the field. I recognize that I will have to develop new skills to get there and am currently assessing academic courses.
In terms of understanding statistical methods, my degree gave me a good grounding in this and my work within consulting has helped develop my skills in bringing data to life. The big gap in my knowledge though is around programming.
Are there any courses available in data science that are widely recognized and could be completed part-time? Or, if there are none, could people suggest the programming languages that I should focus on and any associated qualifications?
Answers to the above, as well as any other advice on how to get into this field, would be gratefully received and much appreciated.
Thanks Fari, much appreciated. The book on Visualizing should be very useful.
All of these suggestions are great! How about any suggestions for someone like me with a programming background looking to add the analytical experience to my resume? I am taking some of the Coursera classes and reading and trying things out, but is there something else I should be doing to add to my skill set?
There are several courses, but I'd be very suspicious of most of those that offer to teach you all the essentials in X weeks/months. However, taking a variety of specialized courses could help you gain the bare essentials so that you can proceed to the next steps of your training. IM me for more details.
A few courses I've found very useful are Web Intelligence and Big Data (IIT), Machine Learning (Stanford), and Computing for Data Analysis (Johns Hopkins). All of these are offered on Coursera, and you can get a certificate of accomplishment if you pass them (not that it's going to upgrade your resume, but it's nice to get something for all the work you put in). As I mentioned before, this is the bare minimum you need to do just to get started. I believe data science is a life-long pursuit.
I'm coming to this a little late, but another thing to consider is that R is nearly impossible for recruiters to search for directly, because searches treat it as just one letter. If you're interested in being approached by recruiters (e.g., through LinkedIn), I would strongly suggest beginning with Python.
It looks like this thread is a bit old, but it popped into my inbox today, so here's my 2 cents.
It appears R and Python are neck and neck in the data science field. Carl will probably find R easier to learn since he does not have a programming background and since he comes from a research and statistical background. R looks more like the things he's used to.
However, I prefer Python for a variety of reasons. The first is that I have a programming background, which makes Python more or less natural and familiar to me. Since Carl will no doubt encounter other programming languages in his data science career, learning a programming language is a must, and Python is a good first language to learn.
Ben James said:
I think the three languages you're most likely to encounter in the field are R, Matlab, and Python. I'd recommend learning Python.
R is good for classical stats, and has a number of nice libraries, but its syntax and usage will not give you a good intro to other programming languages - it's kind of its own thing. The biggest issues with R have to do with efficiency. By default, it's got very slow I/O, gobbles RAM like you wouldn't believe, and often doesn't make the most efficient use of your CPU. If you're working with big data sets, you'll run into R's bottlenecks pretty quickly. I've been told there are ways to mitigate some of these issues, but is the hoop-jumping really worth it?
Matlab tends to be pretty prevalent in academia. I'm not a big fan of this language, in part because of its syntax, and in part because it's proprietary, as well as a few other reasons.
Python is the one you'll probably most often encounter in professional environments. It's easy to learn and use, and once you've got a good grip on it, you can easily transition to many other programming languages. Some programmers don't like that whitespace is significant in Python, but it actually helps you produce extremely readable code. Python implementations tend to be very efficient - you sometimes get performance improvements between one and two orders of magnitude using Python over R, for example.
For data science purposes, check out Python's SciKits packages such as scikit-learn (bunch of machine learning/stats stuff), matplotlib (visualization), and pandas (you can use dataframes, which are perhaps the best feature inherent to R).
You'll also want to look up NoSQL databases, MapReduce, and Hadoop.
To get some experience, participate in Kaggle contests.
Very well said! In particular,
"R is good for classical stats, and has a number of nice libraries, but its syntax and usage will not give you a good intro to other programming languages" really hits the nail on the head. I would definitely start with Python.