It is interesting to see what Harvard considers to be data science. They use Python in all projects and training (nothing wrong with that, though exposure to other languages such as R, Stata, or SQL, on top of Python, would be valuable). The curriculum is too traditional, and too heavy on statistics in particular. I did not see anything about machine-to-machine communications (e.g. keyword bidding), real-time data processing, the curse of big data (and how to address it), API building and implementation, or automation. Too much time is spent on old regression and clustering methods. Exploratory analysis is something that should be automated, and experimental design needs to be added. Their recommended reading list is biased toward traditional data analysis.
The course covers five facets of an analysis:
- data munging/scraping/sampling/cleaning in order to get an informative, manageable data set;
- data storage and management in order to be able to access data - especially big data - quickly and reliably during subsequent analysis;
- exploratory data analysis to generate hypotheses and intuition about the data;
- prediction based on statistical tools such as regression, classification, and clustering; and
- communication of results through visualization, stories, and interpretable summaries.
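As a concrete illustration of the "prediction" facet above, here is a minimal sketch of fitting a simple linear regression with closed-form least squares, using only the standard library. The data points are invented for illustration.

```python
# Minimal sketch of the "prediction" step: ordinary least squares fit
# of y = a + b*x using the closed-form formulas (pure Python, no libraries).
# The data points below are made up for illustration.

def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]   # roughly y = 2x, with noise
intercept, slope = ols_fit(xs, ys)
```

Exploratory plots and summary statistics would normally precede a fit like this; the point here is only how little machinery the basic method requires.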
The three modules are as follows:
- Prediction and elections module: how did Nate Silver predict 50 out of 50 states correctly in the 2012 U.S. presidential election, and 49 out of 50 correctly in the 2008 election? How much of that was luck? We will discuss how to find, process, combine, visualize, simulate, and summarize election-related data and questions, especially if there are conflicting polls with different reliabilities.
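One standard way to combine conflicting polls of different reliabilities, as the module above describes, is inverse-variance weighting: each poll is weighted by the inverse of its squared standard error, so larger (more reliable) samples count for more. A hedged sketch, with invented poll numbers (the course does not specify its method):

```python
import math

# Sketch of pooling conflicting polls via inverse-variance weighting.
# Each poll's weight is 1/SE^2, where SE = sqrt(p*(1-p)/n) is the
# standard error of a sample proportion. Poll figures are invented.

def combine_polls(polls):
    """polls: list of (share_for_candidate, sample_size) pairs."""
    weights, estimates = [], []
    for p, n in polls:
        se = math.sqrt(p * (1 - p) / n)
        weights.append(1.0 / se ** 2)
        estimates.append(p)
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, estimates)) / total

# Three hypothetical polls: the large n=1600 poll dominates the estimate.
polls = [(0.52, 400), (0.48, 1600), (0.51, 900)]
pooled = combine_polls(polls)
```

The pooled estimate lands much closer to 0.48 than to 0.52, reflecting the larger sample's higher reliability.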
- Recommendation and business analytics module: the Netflix Prize was a famous recent example of collaborative filtering: given information about which movies various users have liked and disliked, how should Netflix make recommendations for what movies a user should watch? Many other companies are interested in closely related problems. Often there is a very large but very sparse data set (e.g., there could be millions of users and tens of thousands of movies, but very few users rate more than a few hundred movies). We will explore techniques for working with such data.
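A toy sketch of the collaborative-filtering idea, with sparsity handled by storing ratings in nested dicts (user → {movie: rating}) rather than a dense matrix. The users, movies, and ratings are invented; real systems use far more sophisticated models.

```python
import math

# Toy user-based collaborative filtering over a sparse rating structure.
# ratings: user -> {movie: rating}; most user/movie pairs are absent.

ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob":   {"A": 4, "B": 5},
    "carol": {"C": 5, "D": 4},
}

def cosine_sim(u, v):
    """Cosine similarity restricted to movies both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    nu = math.sqrt(sum(u[m] ** 2 for m in common))
    nv = math.sqrt(sum(v[m] ** 2 for m in common))
    return dot / (nu * nv)

def recommend(user, ratings):
    """Rank unseen movies by similarity-weighted ratings from other users."""
    seen = ratings[user]
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine_sim(seen, their)
        for movie, r in their.items():
            if movie not in seen:
                scores[movie] = scores.get(movie, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

recs = recommend("bob", ratings)
```

Because only rated pairs are stored and compared, the cost scales with the number of ratings rather than users × movies, which is the point when the matrix is mostly empty.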
- Sampling and social network analysis module: social, biological, and technological networks are attracting interest from many fields. They are examples of relational data, in which there are measurements on pairs of individuals, not just on individuals. But computation and visualization for a network with more than, say, 50 nodes (individuals) present many challenges in scalability and interpretability. We will study techniques for drawing a sample from a network, for analyzing network data (e.g., finding “communities” and “influential” nodes in the network), and for visualizing network data.
Prerequisites: programming knowledge at the level of CS 50 or above, and statistics knowledge at the level of Stat 100 or above (Stat 110 recommended).
On the plus side, they offer an interesting list of available data sets for prospective students, including LinkedIn Data.