This article summarizes a trend in programming language usage, based on a number of proxy metrics. The change became more pronounced in early 2017: Python overtook R as the language of choice for data science and machine learning applications.
Statistics from Google
Google has an app called Google Trends that shows trends for specific subjects and compares interest across a number of search topics, broken down by region or time period.
Search index for Python Data Science (blue) versus R Data Science (red) over the last 5 years, in the US
We used Google Trends to compare search interest for R Data Science versus Python Data Science (see the chart above). R appears to have dominated until December 2016, then fell below Python by early 2017. The chart displays an interest index, with 100 the maximum and 0 the minimum. The interactive chart on Google also lets you check the results for countries other than the US, or even for specific regions such as California or New York.
Note that in overall search interest, Python has always dominated R by a long shot, because it is a general-purpose language while R is a specialized one. Here, however, we compare R and Python in the niche context of data science. The map below shows interest in Python (general purpose) by region, using the same Google index.
Interest for Python, by region (last 12 months; source: Google)
Indeed statistics
Indeed is a job aggregator. The jobs listed there might be expired, duplicated, or irrelevant, but it is still worth a quick look:
Python Data Science returns 15,741 full-time jobs. Top cities in the US are:
R Data Science returns 7,533 full-time jobs. Top cities in the US are:
Our internal statistics
We have 83 fresh, active job ads for Python, relevant to data science and mostly in the US and London: you can check them out here. For R, we have 66, and you can check them out here. It would be interesting to compare these numbers with job counts from LinkedIn.
Another metric of interest is the number of articles written about each language in the context of data science. On Data Science Central, we have 19,500 documents where R is mentioned (since 2008) versus 11,500 for Python. However, when you click on these two links to check out the top results, 9 out of 10 date from 2017 for Python, versus 7 out of 10 for R. In short, R is starting to show its age. A Google search for R or Python (on Data Science Central) will yield similar conclusions.
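To put the counts quoted above side by side, here is a minimal Python sketch that computes the Python-to-R ratio for each proxy metric. The numbers are the 2017 snapshots cited in this article. The metric labels and the function names (python_to_r_ratio, summarize) are made up for illustration and do not come from Indeed, Google, or Data Science Central.

```python
# Counts quoted in this article (2017 snapshots); labels are illustrative only.
metrics = {
    "Indeed full-time jobs": {"python": 15_741, "r": 7_533},
    "DSC active job ads": {"python": 83, "r": 66},
    "DSC documents since 2008": {"python": 11_500, "r": 19_500},
}


def python_to_r_ratio(counts):
    """Return the Python count divided by the R count for one metric."""
    return counts["python"] / counts["r"]


def summarize(all_metrics):
    """Map each metric name to its Python/R ratio, rounded to 2 decimals."""
    return {name: round(python_to_r_ratio(c), 2) for name, c in all_metrics.items()}


if __name__ == "__main__":
    for name, ratio in summarize(metrics).items():
        leader = "Python" if ratio > 1 else "R"
        print(f"{name}: Python/R = {ratio:.2f} ({leader} leads)")
```

A ratio above 1 means Python leads on that metric. On these figures, Python leads on both job counts, while R still leads on the cumulative document count, which is consistent with R having a longer history on the site.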
It would be interesting to check what is happening with Java and C++, as they have been the workhorses of software development for a long time.
Comment
I agree. I recently blogged on LinkedIn about the IEEE programming ranking 2017.
https://www.linkedin.com/pulse/python-top-language-2017-programming...
I feel like it took a long time for Python to reach the top, but as Vincent suggested, more needs to be done, especially for stats and math libraries (as of now, there are ~10000+ R libraries in CRAN and 5000+ in Bioconductor). I wonder whether we can have better, higher-quality libraries in Python (I don't believe in the sheer number of libraries). I think we should add a rule for library development in Python, in the spirit of the Zen of Python: quality is better than quantity. Let's do better library development.
Vincent, thanks for your insight. I want to clarify that I do agree we are seeing more and more people with "real programming/algorithmic background." No arguments there. I was talking about the practice of drawing upon mathematical statistics, mathematical economics, mathematical physics, numerical and nonstandard solutions, nonparametric methods and integrative and/or out-of-box thinking.
What I was saying is that we may be, for whatever reason, beginning to get tapped out on theoretical frameworks as doorways to designing truly innovative algorithms, such as simulated testing of unresolved distributional issues, etc., in data science.
This may instead be a function of limitations in current measurement technology, or even of a desire among data science teachers and mentors to focus on getting new folks more familiar with basic principles, such as how to minimize the need for debugging. Data engineering is prioritized, at this point in time, over data science, which is more prone to pursue the fringes of current knowledge (finding better, more statistically valid algorithms for use in particle colliders, for example). All of this seems to help explain the increasing favorability of Python, which is a beautiful, easy and flexible language. Increasing demand for data engineering is, of course, just one hypothesis to explain the observed trend, subject to further research.
I think we are seeing more and more people with a real programming/algorithmic background in data science / machine learning, and thus more Python. Besides, lots of statistical libraries have been added to Python, so the need to rely on a more specialized language (R) for statistical programming is not as strong as it used to be. Some domains (biostatistics, for instance) will continue to use R (and SAS) for a long time.
Vincent, this is an interesting trend. The usage and employee recruiting efforts do seem to have escalated for Python over R, and there are reasons, but perhaps not all have been included in the discussion. R is a tool for statisticians, numerical experts and optimization specialists. It has some great data-handling capabilities, but the strength of its libraries lies in designing and testing new statistical, numerical and optimization techniques. Here, it outweighs similar Python resources.
What we are seeing in this trend is that the innovative analytical techniques for machine learning have begun to peak (note: "begun"). Once you produce an algorithm that can optimize over several dimensions in non-continuous, non-differentiable space where optima are vastly multi-modal, you're pretty close to the outer limits of mathematical physics, except for quantum mechanics. (In fact, quantum mechanics may be the only place left for truly innovative data science methods.)
Otherwise the vast activity of data science has become a process of rehashing and recombining known techniques such as kerneling, neural nets and shrinkage methods, or bringing in past theories that are now computable given the machine environment.
Everything else new is in the data and network architectures. And why not? The original data scientists perhaps never had to worry about those architectures, while now all of us do. Hence, the builds, improvements and talent acquisition in data science focus more and more on implementing existing data science techniques (those that work) within these architectures. Python is better for these purposes than R, or at least it seems so to me, especially for massive text mining operations (which still use the original textual analytics that begin with frequencies of pairs of words and phrases, etc.).
One thing to note is that if Artificial Intelligence ever fulfills its ultimate vision, you can be sure the algorithms will be written in R, perhaps with embedded C++ to speed up complex processing. So, the trends in which Python is "winning" are obviously a reflection of maturity and evolution in the underlying field of data science, not how easy the program is to use relative to R.
© 2017 Data Science Central