
Programming for Data Science the Polyglot approach: Python + R + SQL

In this post, I discuss a possible new approach to teaching Programming for Data Science.

Discussion of Programming for Data Science is often focused on the R vs. Python question. Everyone seems to have a view, including the venerable journal Nature (Programming: Pick up Python).

Here, I argue that we should look beyond the Python vs. R debate and teach R, Python and SQL together. To do this, we need to look at the big picture first (the problem we are solving in Data Science) and then see how that problem is broken down and solved by different approaches. In doing so, we can more easily master multiple approaches and even combine them if needed.

At first glance, this Polyglot approach (the ability to master multiple languages) sounds complex.

Why teach 3 languages together? (For simplicity, I am including SQL as a language here.)

Here is some background

Outside of Data Science, I also co-founded Feynlabs, a social enterprise to teach Computer Science to kids. At Feynlabs, we have been working on ways to accelerate learning to code. One way to do this is to compare and contrast multiple programming languages. This approach also makes sense for Data Science because a learner can potentially approach Data Science from many directions.

To learn programming for Data Science, it thus helps learners to build up from a foundation they are already familiar with and then relate new ideas to this foundation through other approaches. From a pedagogical standpoint, this approach is similar to that of David Ausubel, who stressed the importance of prior knowledge in being able to learn new concepts: "The most important single factor influencing learning is what the learner already knows."

But first, we address the problem we are trying to solve and how that problem can be broken down.

I also propose to make this approach part of the Data Science for IoT course/certification, but I expect I will also teach it as a separate module, probably in a workshop format in London and the USA. If you are interested in knowing more, please sign up on the mailing list here.

Data Science – the problem we are trying to solve

Data science involves the extraction of knowledge from data. Ideally, we need lots of data from a variety of sources.  Data Science lies at the intersection of multiple disciplines: Programming, Statistics, Algorithms, Data analysis etc. The quickest way to solve Data Science problems is to start analyzing data as soon as possible. However, Data Science also needs a good understanding of the theory – especially the machine learning approaches.

A Data Scientist typically approaches a problem using a methodology like OSEMN (Obtain, Scrub, Explore, Model, Interpret). Some of these steps are common to a classic data warehouse and are similar to the classic ETL (Extract, Transform, Load) approach. However, the modelling and interpreting stages are unique to Data Science. Modelling needs an understanding of Machine Learning algorithms and how they fit together, for example Unsupervised algorithms (Dimensionality reduction, Clustering) and Supervised algorithms (Regression, Classification).
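As a rough illustration of how the OSEMN stages line up with code, here is a minimal Python sketch that uses only the standard library. The CSV data, the column names and the function names are all made up for this example; a real project would more likely use pandas and scikit-learn for these steps.

```python
import csv
import io
import statistics

# Hypothetical raw data: hours studied vs. exam score, with one unusable row.
RAW_CSV = """hours,score
1,52
2,55
3,
4,61
5,64
"""

def obtain(text):
    """Obtain: read raw records from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def scrub(rows):
    """Scrub: drop rows with missing values and convert strings to numbers."""
    return [(float(r["hours"]), float(r["score"]))
            for r in rows if r["hours"] and r["score"]]

def explore(pairs):
    """Explore: compute simple summary statistics."""
    scores = [s for _, s in pairs]
    return {"n": len(pairs), "mean_score": statistics.mean(scores)}

def model(pairs):
    """Model: fit a least-squares line score = intercept + slope * hours."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def interpret(intercept, slope):
    """Interpret: turn the fitted numbers into a statement a reader can use."""
    return f"Each extra hour is associated with about {slope:.1f} more points."

rows = obtain(RAW_CSV)
clean = scrub(rows)
summary = explore(clean)
intercept, slope = model(clean)
print(summary)
print(interpret(intercept, slope))
```

The point of the sketch is the shape of the pipeline rather than the model: each stage is a small function, so any one of them could be swapped for SQL, R or a library call without touching the others.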

To understand Data Science, I would expect some background in Programming. Certainly, one would not expect a Data Scientist to start from “Hello World”. But on the other hand, the syntax of a language is often over-rated. Languages have quirks – and they are easy to get around with most modern tools.

So, if we look at the problem / big picture first (e.g. the Obtain, Scrub, Explore, Model and Interpret stages), it is easier to fit the Programming languages to the stages. Machine Learning has 2 phases: the Model Building phase and the Prediction phase. We first build the model (often in batch mode, which takes longer). We then perform predictions with the model in a dynamic/real-time mode. Thus, to understand Programming for Data Science, we can divide the learning into four stages: the Tool itself (IDE), Data Management, Modelling and Visualization.
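The split between the Model Building phase and the Prediction phase can be sketched as follows. This is only an illustration under stated assumptions: the nearest-centroid classifier, the function names build_model and predict, and the sensor readings are all hypothetical stand-ins, and pickle is used simply to show that the fitted model can be stored by a batch job and loaded later by a separate prediction service.

```python
import pickle
from collections import defaultdict

def build_model(samples, labels):
    """Model building phase: compute one centroid per class (batch step)."""
    grouped = defaultdict(list)
    for point, label in zip(samples, labels):
        grouped[label].append(point)
    return {label: tuple(sum(axis) / len(points) for axis in zip(*points))
            for label, points in grouped.items()}

def predict(model, point):
    """Prediction phase: return the label of the nearest centroid."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(point, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

# Batch phase: train on historical, labelled readings and serialize the result.
# The readings and labels are made up for illustration.
train_x = [(1.0, 1.2), (0.8, 1.0), (4.0, 4.2), (4.4, 3.8)]
train_y = ["normal", "normal", "faulty", "faulty"]
stored = pickle.dumps(build_model(train_x, train_y))

# Later, possibly in another process: load the model and score new readings.
model = pickle.loads(stored)
predictions = [predict(model, p) for p in [(1.1, 0.9), (3.9, 4.1)]]
print(predictions)
```

Building the model touches all of the training data, while each prediction only compares one point against a handful of stored centroids, which is why the second phase is the one that can run in real time.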

Tools, IDE and Packages

After understanding the base syntax, it is easier to understand the language in terms of its packages and libraries. Both Python and R have a vast number of packages and libraries (such as Statsmodels and scikit-learn for Python). Both languages are interpreted. Both have good IDEs, such as Spyder and IPython for Python and RStudio for R. If using Python, you would probably use a library like scikit-learn and a distribution of Python such as the Anaconda distribution. With R, you would use RStudio and install specific packages from CRAN, R's package repository.

Data management

Apart from R and Python, you would also need to use SQL. I include SQL because it plays a key role in the Data Scrubbing stage. Some have called this stage the janitor work of Data Science, and it takes a lot of time. SQL also plays a part in SQL-on-Hadoop approaches like Apache Drill, which allow users to write SQL queries on data stored in Hadoop and receive results.

With SQL, you are manipulating data in Sets. However, once the data is inside the Programming environment, it is treated differently depending on the language.
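To make the set-based style concrete, here is a small sketch that uses Python's built-in sqlite3 module. The readings table, its columns and the -999 sentinel value are assumptions invented for the example. The scrubbing is expressed as one query over the whole set of rows, and only the cleaned result crosses into the Python environment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 21.5), ("a", None), ("b", 19.0), ("b", 19.0), ("c", -999.0), ("a", 22.5)],
)

# Scrubbing expressed as set operations: filter out missing and sentinel
# values, remove exact duplicates, then aggregate per sensor.
query = """
    SELECT sensor, COUNT(*) AS n, AVG(temperature) AS mean_temp
    FROM (
        SELECT DISTINCT sensor, temperature
        FROM readings
        WHERE temperature IS NOT NULL AND temperature > -100
    )
    GROUP BY sensor
    ORDER BY sensor
"""
rows = conn.execute(query).fetchall()
conn.close()

# Inside Python, the result is an ordinary list of tuples.
for sensor, n, mean_temp in rows:
    print(sensor, n, mean_temp)
```

Note that there is no loop in the SQL: the WHERE, DISTINCT and GROUP BY clauses each describe what should happen to the set as a whole. The loop only appears once the rows are in Python.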

In R, everything is a vector, and R data structures and functions are vectorized. This means most functions in R work on vectors (i.e. on all the elements at once, not on individual elements in a loop). Thus, in R, you read your data into a data frame and use a built-in model (for example, the steps / packages for linear regression). In Python, if you did not use a library like scikit-learn, you would need to make many decisions yourself, and that can be a lot harder. However, with a package like scikit-learn, you get a consistent, well-documented interface to the models. That makes your job a lot easier by letting you focus on usage.
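The difference between element-by-element and vectorized code can be sketched in plain Python. The Vector class below is a hypothetical, minimal stand-in written for this example only; it is not how R is implemented, and in practice Python users would normally get this behaviour from NumPy arrays or pandas columns.

```python
class Vector:
    """Hypothetical minimal vector: operators act on every element at once."""

    def __init__(self, values):
        self.values = list(values)

    def _combine(self, other, op):
        # A scalar is reused for every element, loosely like R's recycling rule.
        if isinstance(other, Vector):
            others = other.values
        else:
            others = [other] * len(self.values)
        return Vector(op(a, b) for a, b in zip(self.values, others))

    def __add__(self, other):
        return self._combine(other, lambda a, b: a + b)

    def __mul__(self, other):
        return self._combine(other, lambda a, b: a * b)

    def __repr__(self):
        return f"Vector({self.values})"

prices = [2.0, 3.5, 4.0]
quantities = [10, 4, 5]

# Element-by-element style: an explicit loop over indices.
totals_loop = []
for i in range(len(prices)):
    totals_loop.append(prices[i] * quantities[i] * 2)

# Vectorized style: one expression over whole columns,
# comparable to writing prices * quantities * 2 in R.
totals_vec = Vector(prices) * Vector(quantities) * 2

print(totals_loop)
print(totals_vec)
```

Both versions compute the same numbers. The vectorized form simply moves the loop out of the user's code and into the data structure, which is the habit of thought that R encourages from the start.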

Data Exploration and Visualization

After the Data modelling stage, we come to Data exploration and visualization. For Python, the pandas package is a powerful tool for data exploration. Here is a simple and quick intro to the power of Python pandas (YouTube video). Similarly, R uses the dplyr and ggplot2 packages for data exploration and visualization.
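A typical exploration step is a grouped summary: in pandas this is usually written with groupby, and in dplyr with group_by() followed by summarise(). The sketch below shows the same idea with only the Python standard library so that it runs anywhere. The city temperature records are invented for the example, and the text bar chart is a crude stand-in for a real plot.

```python
import statistics
from collections import defaultdict

# Hypothetical records: (city, daily temperature in degrees Celsius).
records = [
    ("London", 12.0), ("London", 14.0),
    ("Miami", 27.0), ("Miami", 29.0), ("Miami", 31.0),
]

# Split: collect the values that belong to each group.
by_city = defaultdict(list)
for city, temp in records:
    by_city[city].append(temp)

# Apply and combine: one row of summary statistics per group.
summary = {
    city: {
        "n": len(temps),
        "mean": statistics.mean(temps),
        "min": min(temps),
        "max": max(temps),
    }
    for city, temps in by_city.items()
}

# A crude text bar chart stands in for a plot.
for city, stats in sorted(summary.items()):
    print(f"{city:<8}{'#' * round(stats['mean'])} {stats['mean']:.1f}")
```

Libraries such as pandas and dplyr collapse the split, apply and combine steps into a single expression, but the underlying operation is the same in both languages.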

A moving goalpost and a Polyglot approach

Finally, much of this discussion is a rapidly moving goalpost. For example, in R, large calculations need the data to be loaded into a matrix (e.g. n x n matrix manipulation). But with platforms like Revolution Analytics, that can be overcome. Especially with the acquisition of Revolution Analytics by Microsoft, and with Microsoft's history of creating good developer tools, we can expect that development in R will be simplified.

Also, since both R and Python operate in the context of Hadoop for Data Science, we would expect to leverage the Hadoop architecture through HDFS connectors, both for Python Hadoop frameworks and for R Hadoop integration. One could also argue that we are already living in a post-Hadoop/MapReduce world with Spark and Storm, especially for real-time calculations, and that at least some Hadoop functions may be replaced by Spark.

Here is a good introduction to Apache Spark and a post about Getting started with Spark in Python. Interestingly, the Spark programming guide includes integration with 3 languages (Scala, Java and Python) but not R. But the power of Open source means we have SparkR, which integrates R with Spark.

The approach of covering multiple languages has some tool support, for instance with the Beaker notebook. You could also achieve the same effect by working on the command line, for example as in Data Science at the Command Line.

Conclusions

Even in a brief blog post, we can get a lot of insights when we look at the wider problem of Data Science and compare how different approaches address segments of that problem. You just need to get the bigger picture of how these languages fit together for Data Science and understand the major differences (for example, vectorization in R).

Use of good IDEs, packages, etc. softens the impact of programming.

It then changes our role, as Data Scientists, to mixing and matching a palette of techniques as APIs – sometimes spanning languages.

I hope to teach this approach as part of the Data Science for IoT course/certification.

Programming for Data Science will also be a separate module talk over the next few months at: fablab london; the London IT contractors meetup group; CREATE Miami, a venture accelerator at Miami Dade College; the City Sciences conference (as part of a larger paper) in Shanghai; and MCS Madrid.

For more schedules and details please sign up here


Comment by Baguinebie Bazongo on August 21, 2015 at 12:18pm

Thank you for sharing this information with us. I really liked it.

Best

Comment by Riya Saxena on May 4, 2015 at 3:24am

Thanks for your post! Data science has taken the world by storm. Every field of study and area of business has been affected as people increasingly realize the value of the incredible quantities of data being generated. But to extract value from those data, one needs to be trained in the proper data science skills. The R programming language has become the de facto programming language for data science. Its flexibility, power, sophistication, and expressiveness have made it an invaluable tool for data scientists around the world. More at

www.youtube.com/watch?v=1jMR4cHBwZE

Comment by IvánG Orozco on February 16, 2015 at 3:46pm

I have one question regarding how data science relates to big data. Are they the same? If so, can tools such as Python and R be used in big data analysis? Thanks.

Comment by Sione Palu on February 16, 2015 at 8:54am

Data-Science Triglot ==>  Java + Matlab + SQL

Comment by Siddhartha M. on February 13, 2015 at 5:50am

@Ihe

Great points & one shouldn't oversimplify the question of paradigms, especially as solution architects do have to be critical about picking the right tool for the job. This becomes more imperative (no pun) where there is a need & scope for a multi-paradigm approach.

Comment by Ihe Onwuka on February 13, 2015 at 4:37am

@Jitin

The issues you raise are a great opportunity for critical thought.

If there is no fixed paradigm for data science, how does it make sense to fix the programming languages to be used for it? First of all, before we dismiss them as irrelevant to the discussion, let's step back and ask what exactly is a paradigm and what is a programming paradigm.

I like the definition in Wikipedia:

In science and epistemology (the theory of knowledge), a paradigm /ˈpærədaɪm/ is a distinct concept or thought pattern. If I may simplify that, a paradigm is a way of thinking.

In  computer science, programming languages are used to codify solutions to problems and a language that follows a certain paradigm is a means of codifying thinking in that paradigm.

A person who can code in C++ and Java can code in multiple languages, but is that really polyglot programming? Is there a fundamental difference in the way they think about problems? Is there a fundamental difference in the Python approach vs the R approach to problem solving? You can say oh Python is multi-paradigm, but lots of people who write Python are uncomfortable with some of the multi-paradigm aspects (e.g. list comprehensions).

The author says

"But on the other hand, the syntax of a language is often over-rated. Languages have quirks – and they are easy to get around with most modern tools."


I'm not going to speculate as to whether that's why the author thinks he can divorce a discussion on a polyglot programming approach from considerations of programming paradigms. Instead, what I will do is apply the same concept to another domain.

Statisticians have paradigms too. Suppose I got up and made this statement.

The differences between the Bayesian and Frequentist approaches are overrated, basically a bunch of notational quirks that are easy to overcome with modern tools.

Well, first of all, as a computer scientist I would not dare to make such a statement. Secondly, if I did, I would instantly expose myself as someone who did not have a proper mastery of what he was talking about, and I shouldn't be surprised if a statistician hinted as much and suggested I STFU.

If a couple of months ago you were a no Math/Stat guy, what sort of statistical/mathematical models would you expect to be building now? Should anyone be expecting you to build any at all? So why do you expect to be programming then, or more importantly, why is anybody expecting you to be programming at all?

If you work in the field of data science and you are not very good at programming, there are several ways you can still be effective and valuable in the job. Should I go around saying that everybody should use linear regression or neural networks because I'm not comfortable with support vector machines? Or that everybody should use parametric or frequentist techniques because I don't understand non-parametric or Bayesian ones? At what point does that start to sound ridiculous?

The approach to programming needed in data science is the same as the approach needed in any other domain. If a person is not very good at programming, it is better for all if they stay away from it.

Comment by ajit jaokar on February 12, 2015 at 10:24pm

@Jitin - thanks for your comments. agree. 

Comment by Jitin Kapila on February 12, 2015 at 9:02pm

It seems that Mr Ajit here is addressing a much needed approach for Data Science. A few months back I was literally a no-Programming guy, but lately my old courses in C++ and Java have helped me in learning the "languages" and their approaches for R, Python and SQL.

The main problem I see with Data Science is that there is no fixed paradigm for it. In such a case, every successful attempt at a "Solution Methodology" is followed by seeking vast approval from practitioners. This is leading to chaos, where people are addressing methodologies more than the problem itself.

I haven't seen an article lately trying to help economies grow, or eradicate terrorism and poverty. With the help of Data Science, why can't we cure cancer, or stop the spread of AIDS or Ebola?

What Mr. Ajit and Dr. Vincent are emphasising is that we should address the problem to the best of our capabilities, irrespective of programming paradigm. Mr. Ihe, being a Computer Scientist yourself, I would expect a new software tool from you which could kill all problems in R, Python, Hadoop, SQL or any other, and which could help Data Scientists solve problems more pragmatically and holistically. Till then we have to use the resources which are comfortable to us.

Comment by ajit jaokar on February 11, 2015 at 11:03pm

@Vincent

Thanks. Exactly! "A data scientist who combines both (strategic / tactical) is usually better equipped to deliver high return."

Comment by Ihe Onwuka on February 11, 2015 at 5:30pm

Meanwhile "If you go to Silicon Valley and talk to any of the startups they are all using Scala nobody even questions it any more". - Dean Wampler

http://www.slideshare.net/deanwampler/why-scala-is-taking-over-the-...

or for the video

https://www.parleys.com/talk/53a7d2c5e4b0543940d9e544
