In this post, I show a network analysis of the R and Python ecosystems in terms of their competitors. The objective of the post is to -

- Identify the key tools/languages being commonly searched in the data science area related to R/Python
- Identify the sets of associated tools

To identify the typical substitutes/ competitors of a tool, I use the Google search autofill recommendations. Google search prompts identify the most frequently searched terms which occur after a given string and automatically provides a list of suggestions. Thus, this may be treated as a proxy for the common substitutes people search for against a particular tool.

Figure 1 "R vs.." autofill suggestions

In Fig 1 when I start typing “R vs “ in the Google Search bar, Google provides a list of suggestions based on their ‘autocomplete’ feature. This predictions are based on factors like frequent searches by other people, the region, language, past web site visits, etc.

Those who are familiar with the data science wars, can immediately recognize that R vs Python was the most frequent search. Other common searches are substitute products for R e.g. sas/ matlab or commonly confused concepts e.g. r versus r squared. Since we are interested only in the products / tools /packages we ignore the suggestions which are not relevant (e.g. r squared). We add the rest of the terms to an edgelist table with source R and targets as ‘python’, ‘sas’ etc. Thus we have identified some of the common substitutes for R and also created an edge for them.

In the next step, we take each of the target nodes (‘python’, ‘sas’ etc.) identified through the above process and repeat the above steps. We add the results again to the edgelist table.

Figure 2 "Python vs.." autofill suggestions

Once the edgelist table is thus formed, we perform Network analysis of the eco-system to identify the most important entities.

We use ‘Force Atlas 2’ to identify components in the graph and visualize the eco-system. ForceAtlas2 is a force-directed layout used for network spatialization in Gephi. It is an attempt to integrate different techniques such as the Barnes Hut simulation, degree-dependent repulsive force, and local and global adaptive temperatures. In this algorithm – “Nodes repulse each other like charged particles, while edges attract their nodes, like springs. These forces create a movement that converges to a balanced state. This final configuration is expected to help the interpretation of the data.”

Thus, the layout positions the nodes which are compared together closer (since they have edges connecting them). Hence, groups of tools which are frequently compared occur closer in the network map than others. Thus, R, python, Matlab, SAS, are placed closed together, while tensorflow and php occur far apart. We then use ‘in-degree’ to size the nodes and use ‘Page rank’ of the nodes to size the labels. The edge colors indicate the cluster to which a tool is compared.

Figure 3 The R vs Python ecosystem

Based on the observations and analysis of the network we see that the eco-system consists of diverse sets of communities. Eight modularity classes emerged from the analysis using Blondel et. al., 2008, algorithm to identify communities as provided in Gephi. The modules are colored separately to allow easy visual identification. We can name them as follows -

I. Statistical data analysis tools and languages - R, Python, Matlab, SAS, SPSS, Minitab, Excel. etc

II. Enterprise tools for business intelligence - Weka, Rapidminer, Alteryx, Tableau etc.

III. Scripting and Traditional Programming languages – Java, php, C#, C++, etc.

IV. Database 1 – sql, Oracle, mysql, mongodb, Cassandra, etc.

V. Database 2 (Hadoop ecosystem) – Hadoop, Spark, Storm, Mapreduce, Hive, etc.

VI. Deep learning tools – Tensorflow, Theano, Cntk etc.

VII. Modules, libraries and packages for machine learning – pandas, scipy, opencv, tm, etc.

VIII. Machine learning platforms from Organizations – Google, IBM (Watson), Microsoft, Alphabet (Deepmind), etc.

Thus, the use of **Python** can be specifically more useful in teams where **programming and scripting knowledge** are already available, e.g. teams familiar with php, C#, Java, etc. It is also frequently compared to other **statistical analysis languages** like R, Matlab, and **machine learning languages on haddop and similar platforms** (e.g. mahout). Thus, the observation in the datacamp blog about Python being used by ‘programmers who want to delve into data analysis or apply statistical techniques, and by developers that turn to data science’ seems to still hold.

The use of **R**, on the other hand, is more commonly compared with the use of **statistical data analysis** tools like SAS, SPSS, Excel. It is also frequently compared to enterprise **business intelligence tools and self-service platforms** like Rapidminer, Tableau, Alteryx etc. Thus, similar to datacamps observation, R users are from academics, research, as well as from business analytics (the enterprise market). The frequent comparisons with Python, weka, etc. indicate the use of R by the machine learning programming community as well.

Similarly, those familiar with **database technologies** like SQL, oracle etc. may be more comfortable using big data techniques **and machine learning languages available on platforms** of Hadoop, spark etc. This is also the group which seems to be comparing their existing tools against **enterprise** **machine learning toolkits and platforms** on offer from firms like Google, IBM etc.

Those familiar with **Matlab**, frequently compare with **scientific and machine learning package** set from Python (numpy, scipy etc.) and also with R. They seem to be having an equally fair chance of **using R and Python**. This is the group, along with the R users, who are also comparing their tools against **deep learning technologies** like tensorflow, kafka etc.

Google autocomplete depends on regions and language as well as previous search behavior. This may have affected the suggestions.

ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network ..., Mathieu Jacomy , Tommaso Venturini, Sebastien Heymann, Mathieu Bastian, Published: June 10, 2014DOI: 10.1371/journal.pone.0098679

Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre, Fast unfolding of communities in large networks, in Journal of Statistical Mechanics: Theory and Experiment 2008 (10), P1000

https://support.google.com/websearch/answer/106230?hl=en

http://www.datasciencecentral.com/profiles/blogs/data-science-wars-...

https://www.datacamp.com/community/tutorials/r-or-python-for-data-a...

© 2020 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Upcoming DSC Webinar**

- DataOps: How Bell Canada Powers their Business with Data - July 15

Demand for data outstrips the capacity of IT organizations and data engineering teams to deliver. The enabling technologies exist today and data management practices are moving quickly toward a future of DataOps. DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. Register today.

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Upcoming DSC Webinar**

- DataOps: How Bell Canada Powers their Business with Data - July 15

Demand for data outstrips the capacity of IT organizations and data engineering teams to deliver. The enabling technologies exist today and data management practices are moving quickly toward a future of DataOps. DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. Register today.

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central