In this post, I discuss a possible new approach to teaching Programming for Data Science.
Here, I argue that we look beyond Python vs. R debate and look to teach R, Python and SQL together. To do this, we need to look at the big picture first (the problem we are solving in Data science) and then see how that problem is broken down and solved by different approaches. In doing so, we can more easily master multiple approaches and then even combine them if needed.
On first impressions, this Polyglot approach (ability to master multiple languages) sounds complex.
Why teach 3 languages together? (For simplicity – I am including SQL as a language here)
Here is some background
Outside of Data science, I also co-founded a social enterprise to teach Computer Science to kids Feynlabs. At Feynlabs, we have been working with ways to accelerate learning to Code. One way to do this is to compare and contrast multiple programming languages. This approach makes sense for Data Science also because a learner can potentially approach Data science from many directions.
To learn programming for Data Science, it would thus help to build up from an existing foundation they are already familiar with and then co-relate new ideas to this foundation through other approaches. From a pedagogical standpoint, this approach is similar to David Asubel who stressed the importance of prior knowledge in being able to learn new concepts: "The most important single factor influencing learning is what the learner already knows.”
But first, we address what is the problem we are trying to solve and how that problem can be broken down
I also propose to make this approach as part of Data Science for IoT course/certification but I also expect I will teach it as a separate module – probably in a workshop format in London and USA. If you are interested to know more, please sign up on the mailing list HERE
Data science involves the extraction of knowledge from data. Ideally, we need lots of data from a variety of sources. Data Science lies at the intersection of multiple disciplines: Programming, Statistics, Algorithms, Data analysis etc. The quickest way to solve Data Science problems is to start analyzing data as soon as possible. However, Data Science also needs a good understanding of the theory – especially the machine learning approaches.
A Data Scientist typically approaches a problem using a methodology like OSEMN (Obtain, Scrub, Explore, Model, Interpret). Some of these steps are common to a classic data warehouse and are similar to classic ETL (Extract Transform Load) approach. However, the modelling and interpreting stage are unique to Data Science. Modelling needs an understanding of Machine Learning algorithms and how they fit together. For example: Unsupervised algorithms (Dimensionality reduction, Clustering) and Supervised algorithms (Regression, Classification)
To understand Data Science, I would expect some background in Programming. Certainly, one would not expect a Data Scientist to start from “Hello World”. But on the other hand, the syntax of a language is often over-rated. Languages have quirks – and they are easy to get around with most modern tools.
So, if we try to look at the problem / big picture first (ex the Obtain, Scrub, Explore, Model and Interpret) stages – it is easier to fit in the Programming languages to the stages. Machine Learning has 2 phases: the Model Building phase and the Prediction phase. We first build the model (often as a batch mode – and it takes longer). We then perform predictions on the model in a dynamic/real-time mode. Thus, to understand Programming for Data Science, we can divide the learning into four stages: The Tool itself (IDE), Data Management, Modelling and Visualization
After understanding the base syntax - it’s easier to understand the language in terms of its packages and libraries. Both Python and R have a vast number of packages (such as Statsmodels) – often distributed as libraries (scikit-learn). Both languages are interpreted. Both have good IDEs such as Spyder, iPython for Python and RStudio for R. If using Python, you would probably use a library like scikit-learn and a distribution of Python such as the Anaconda distribution. With R, you would use the RStudio and install specific packages using R’s CRAN package management system.
Apart from R and Python, you would also need to use SQL. I include SQL because SQL plays a key role in the Data Scrubbing stage. Some have called this stage as the Janitor work of Data Science and it takes a lot of time. SQL also plays a part in SQL on Hadoop approaches like Apache Drill which allow users to write SQL queries on data stored in Hadoop and receive results
With SQL, you are manipulating data in Sets. However, once the data is inside the Programming environment, it is treated differently depending on the language.
In R, everything is a vector and R Data structures and functions are vectorized . This means, most functions in R work on Vectors (i.e. on all the elements – not on individual elements in a loop). Thus, in R, you read your data in a data frame and use a built-in model (here are the steps / packages for linear regression) . In Python, if you did not use a library like scikit-learn , you would need to make many decisions yourselves and that can be a lot harder. However, with a package like scikit-learn, you get a consistent, well documented interface to the models. That makes your job a lot easier by focussing on the usage.
After the Data modelling stage, we come to Data exploration and visualization. Here, for Python – the pandas package is a powerful tool for data exploration. Here is a simple and quick intro to the power of Python Pandas (YouTube video). Similarly, R uses dplyr and ggplot2 packages for Data exploration and visualization.
Finally, much of this discussion is a rapidly moving goalpost. For example, in R, large calculations need the data to be loaded in a matrix (ex nxn matrix manipulation). But, with platforms like Revolution Analytics – that can be overcome. Especially with the acquisition of Revolution analytics by Microsoft – and with Microsoft’s history for creating good developer tools – we can expect development in R would be simplified.
Also, since both R and Python are operating in the context of Hadoop for Data science, we would expect to leverage the Hadoop architecture through HDFS connectors both for Python Hadoop frameworks and R Hadoop integration. Also, one would argue that we are already living in a post hadoop/mapreduce world with Spark and Storm especially for Real time calculations and that at least some Hadoop functions may be replaced by Spark
Here is a good introduction to Apache Spark and a post about Getting started with Spark in Python. Interestingly, the Spark programming guide includes integration with 3 languages (Scala, Java and Python) but no R. But the power of Open source means we have SparkR which integrates R with Spark.
The approach to cover multiple languages has some support - for instance, with the Beaker notebook . You could also achieve the same effect by working on the command line for example in Data Science at the Command Line
Even in a brief blog post – you can get a lot of insights when we look at the wider problem of Data science and compare how different approaches are addressing segments of that problem. You just need to get the bigger picture of how these Languages fit together for Data Science and understand the major differences (for example vectorization in R).
Use of good IDEs, packages etc softens the impact of programming.
It then changes our role, as Data Scientists, to mixing and matching a palette of techniques as APIs – sometimes spanning languages.
I hope to teach this approach as part of Data Science for IoT course/certification
Programming for Data Science will also be a separate module talk over the next few months at fablab london, London IT contractors meetup group, CREATE Miami, a venture accelerator at Miami Dade College, City Sciences conference(as part of a larger paper) in Shanghai and MCS Madrid
For more schedules and details please sign up here