Summary:  Someone had to say it.  In my opinion R is not the best way to learn data science and not the best way to practice it either.  More and more large employers agree.

 

Someone had to say it.  I know this will be controversial and I welcome your comments but in my opinion R is not the best way to learn data science and not the best way to practice it either.

 

Why Should We Care What Language You Use For Data Science?

Here's why this rises to the top of my thoughts. Recently my local Meetup held a very well attended hackathon to walk people through the Titanic dataset using R. The turnout was much higher than I expected, which was gratifying. The result, not so much.

Not everyone in the audience was a beginner; many were folks who had probably been exposed to R at some point but were just out of practice. What struck me was how everyone was getting caught up in the syntax of each command, which is reasonably complex, and how many commands were necessary, for example, to run even the simplest decision tree.

Worse, it was as though they were learning a programming language and not data science. There was little or no conversation or questioning around cleansing, prep, transforms, feature engineering, feature selection, or model selection, and absolutely none about hyperparameter tuning. In short, I am convinced that group left thinking that data science is about a programming language whose syntax they had to master, and not about the underlying major issues in preparing a worthwhile model.
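To make the point concrete, here is roughly what that simplest decision tree looks like written out in R. This is only a minimal sketch, and it assumes the Kaggle Titanic training file train.csv is sitting in the working directory; notice how much cleaning and prep happens before the first model can even run:

library(rpart)

# Read the Kaggle Titanic training file (assumed to be in the working directory)
train <- read.csv("train.csv", stringsAsFactors = FALSE)

# Even the simplest tree needs some cleaning and prep first
train$Age[is.na(train$Age)] <- median(train$Age, na.rm = TRUE)  # impute missing ages
train$Sex      <- factor(train$Sex)
train$Embarked <- factor(train$Embarked)

# Fit a classification tree predicting survival
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare,
             data = train, method = "class")

printcp(fit)  # the complexity table, where hyperparameter tuning would begin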

 

Personal Experience

I have been a practicing data scientist with an emphasis on predictive modeling for about 16 years. I know enough R to be dangerous, but when I want to build a model I reach for SAS Enterprise Miner (it could just as easily be SPSS, RapidMiner, or one of the other complete platforms).

The key issue is that I can clean, prep, transform, engineer features, select features, and run 10 or more model types simultaneously in less than 60 minutes (sometimes a lot less) and get back a nice display of the most accurate and robust model along with exportable code in my selection of languages.
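To see what that automation replaces, here is a rough sketch of just the model-comparison step written out by hand in R with caret; a built-in dataset stands in for real project data, and in practice all the cleaning and feature engineering would still come first:

library(caret)  # the rpart and randomForest packages must also be installed

# A stand-in two-class problem in place of real project data
two_class <- subset(iris, Species != "setosa")
two_class$Species <- droplevels(two_class$Species)

ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

# Fit several model types against the same resampling plan
models <- list(
  tree = train(Species ~ ., data = two_class, method = "rpart", trControl = ctrl),
  glm  = train(Species ~ ., data = two_class, method = "glm",   trControl = ctrl),
  rf   = train(Species ~ ., data = two_class, method = "rf",    trControl = ctrl)
)

# Side-by-side accuracy across resamples; the platforms render this as a chart
summary(resamples(models))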

The reason I can do that is that these advanced platforms now all have drag-and-drop visual workspaces in which I deploy and rapidly adjust each major element of the modeling process without ever touching a line of code.

 

Some Perspective

Through the '90s (and actually well before), and up through the early 2000s, if you studied data science in school you learned it on SAS or SPSS, in base code for those packages that actually looks a lot like R. R wasn't around then, and for decades SAS and SPSS recognized that the way to earn market share was to basically give away the product to the colleges that used it for training. Those graduates would gravitate back to what they knew when they got out into the paying world.

In the mid-2000s these two platforms probably had at least an 80% market share, and even today they hold 36% and 17% respectively. That doesn't even begin to reflect how dominant they are among the largest companies, where their share is probably double these numbers.

By 2000 both these providers were offering advanced drag-and-drop platforms that deemphasized code. The major benefit, and it was huge, was that it let learners focus on the major elements of the process and understand what went on within each module or modeling technique without having to code it.

At the time, and still today, you will find SAS and SPSS purists who grew up coding and still maintain hand-coding shops. What you learned on is what you carry forward into commercial life.

 

Then Why Is R Now So Popular?

It's about the money. R is open source and free. Although SAS and SPSS provided very deep discounts to colleges and universities, each instructor still had to pay several thousand dollars for the teaching version and each student a few hundred dollars (eventually there were free, web-based student versions, but the instructor still had to pay).

The first stable beta version of R was released in 2000. If you look at the TIOBE index of software popularity you'll see that R adoption had its first uptick when Hadoop became open source (2007) and interest in data science began to blossom. In 2014 it started a strong upward adoption curve, along with the now-exploding popularity of data science as a career and R's wide-ranging adoption as the teaching tool of choice.

This was an economic boon for colleges but a step back for learners, who now had to drop back into code mode. The common argument is that R's syntax is at least easier than that of other languages, but that misses the point: drag-and-drop is not only orders of magnitude easier, it makes the modeling process much more logical and understandable.

 

Do Employers Care?

Here you have to watch out for what appears to be a logical contradiction. Among those who do the hiring, the requirement that you know R (or Python) is strong, almost a go/no-go factor. Why? Because those doing the hiring were very likely taught in R themselves, and their going-in assumption is "if I had to know it, then so do you."

Here's the catch. The largest employers, those with the most data scientists, are rapidly reconsolidating on packages like SAS and SPSS with drag-and-drop. Gartner says this trend is particularly strong in mid-size and large companies. You need at least 10 data scientists to break into this club, and the average large company has more like 50.

We're talking about the largest banks, mortgage lenders, insurance companies, retailers, brokerages, telecoms, utilities, manufacturers, transportation firms, and the largest B2C services companies. Probably where you'd like to work, unless you're in Silicon Valley.

Once you have this many data scientists to manage you rapidly become concerned about efficiency and effectiveness. That's a huge investment in high-priced talent that needs to show a good ROI. Also, in this environment you likely have several hundred to several thousand models directing core business functions that must be developed and maintained.

It's easy to see that if everyone is freelancing in R (or Python), managing for consistency of approach and quality of outcome, not to mention collaboration around a single project, is almost impossible. This is what's driving the largest companies to force their data science staffs (nicely, I'm sure) onto common platforms with drag-and-drop consistency and efficiency.

 

Gartner Won’t Even Rate You Unless You Have Drag-and-Drop

Gartner's 'Magic Quadrant for Advanced Analytic Platforms' and Forrester's report on 'Enterprise Insight Platform Suites' are both well-regarded ratings of comprehensive data science platforms. The difference is that Gartner won't even include you in its ranking unless you have a Visual Composition Framework (drag-and-drop).

As a result, Tibco, which ranks second in the 2016 Forrester chart, was not even considered by Gartner because it lacks this particular feature; Tibco users must work directly in code. Salford Systems was rejected by Gartner for the same reason.

Gartner is very explicit that working in code is incompatible with the large-organization need for quality, consistency, collaboration, speed, and ease of use. Large groups of data scientists freelancing in R and Python are very difficult to manage for these characteristics, and that's no longer acceptable.

Yes, essentially all of these platforms do allow highly skilled data scientists to insert their own R or Python code into the modeling process. The fact, however, is that the need for algorithms not already embedded in the platform is rapidly declining. If you absolutely need something as exotic as XGBoost you can import it, but only if that level of effort is warranted by a need for an unusually high level of accuracy. It's now about efficiency and productivity.
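To give a sense of scale, importing something like XGBoost is itself only a few lines. Here is a minimal sketch of calling it directly from R; stand-in data from a built-in dataset takes the place of a real project's features:

library(xgboost)

# Stand-in features and a 0/1 target from a built-in dataset
x <- as.matrix(mtcars[, c("mpg", "wt", "hp")])
y <- mtcars$am

# XGBoost wants its own matrix format
dtrain <- xgb.DMatrix(data = x, label = y)

# Train a small boosted classifier; parameters are illustrative only
bst <- xgb.train(params = list(objective = "binary:logistic", max_depth = 3),
                 data = dtrain, nrounds = 50)

pred <- predict(bst, x)  # predicted probabilities, to be thresholded downstream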

 

Should You Be Giving Up R?

If you are an established data scientist who learned in R, then my hat's off to you; don't change a thing. If you're in a smaller company with only a few colleagues you may be able to continue that way. If you move up into a larger company that wants you to use a standardized platform, you shouldn't have any trouble picking it up.

If you're an early learner you are probably destined to use whatever tools your instructor demands, and increasingly that's R. It's not likely that you have a choice. It's just that in the commercial world the need to actually code models in R is diminishing, and your road map to a deep understanding of predictive modeling is probably more complex than it needs to be.

 

A Quick Note About Deep Learning

Most programming in TensorFlow is occurring in Python, and if you know R you shouldn't have a problem picking it up. Right now, to the best of my knowledge, there are no drag-and-drop tools for deep learning. For one thing, deep learning is still expensive to execute in terms of manpower, computing resources, and data acquisition. The need for those skills here in 2017 is still pretty limited, albeit likely to grow rapidly. As with core predictive modeling, when things are difficult I'm sure there's someone out there focusing on making them easier, and I bet drag-and-drop is not far behind.

 

 

About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

[email protected]

 


Comment by R. Lanham on May 18, 2017 at 1:51am

I'll take R. First, you aren't dependent on a firm to keep interfaces and tools up to date. Second, knowing syntax allows you to integrate other code and optimise algorithms easily. Sure, if you do basic stats on small data, SPSS is fine. So is Excel. But most of us do a lot more. SAS allows that and always has, but it is vast, clunky, and arcane in my opinion. You get nowhere without classes, etc. I dispute the idea that their user base is growing; in my experience, especially in Europe, it is in cash-cow mode, used only at old and relatively low-innovation shops like food companies. But that is a fact that can be checked. I doubt anyone quite knows. One thing is certain: R is free and has thousands of brilliant people building free things for it every day. Over my 30+ years in data, I have found a lot of people just don't like free. So be it. Free tends to win, and it always will. BTW, R is not enough; you need Python too. But if R scares you for syntax, Python will give you nightmares. That said, I have seen several 9-year-olds who code it just fine. I am happy SAS exists; sometimes they innovate. Within weeks the equivalent is available for free in R. Long may that continue.

Comment by Tony Lange on May 18, 2017 at 12:40am

I think everyone is entitled to their opinion, but in essence I totally disagree with William. The purpose of R is not to teach data analytics; neither is that the purpose of any language, for that matter. If one uses IMSL, one of the oldest sources of numerical processing algorithms, written in FORTRAN, and goes through the documentation, one can start to get an idea of how complex many of these "analytical tools" are and how dangerous using any analytics can be.

I personally hate the hyped-up term "data analytics"; it's actually maths, stats, and numerical methods. How do you teach these topics? With a toolset? Absolutely not. Teaching can only be done by going through the pure and applied maths step by step, the equations, the assumptions, etc., with a pen or chalk. Computerisation is just the final small step.

You would be surprised how many people don't really know what a mean is, or that Gaussians are the biggest underlying assumption and are simply at odds with reality. Never mind the s-domain, which begat the FFT, and most theories and methods that simply do not fit with reality. Who understands the huge error issues in numerical techniques, significant digits, and the fact that computer ALUs randomly make mistakes?

Yes, graphical tools are awesome and easier to use, but in fact most users of these tools have no clue what the blocks do or how they work. In fact they worsen the situation, with people applying methods or algorithms willy-nilly and taking their results as gospel.

Today, with all these tools, every man and his dog can become a "data analyst", and "therein lies the rub". 

Comment by Mari Tietze on May 17, 2017 at 11:41pm

Thank YOU!!!  Teaching health informatics, including SPSS for analytics, I have often critiqued the value of R . . . this put it in perspective.  

Comment by Srikanth KS on May 17, 2017 at 9:32pm

Hi,

The article looks more like propaganda than criticism.

With the data science boom, we have people who want to learn quickly with languages like R, Python, and SAS without a deep understanding of the language, its design choices, and its memory management. Most workshops/hackathons are done in haste, leaving learners in dire straits.


Those who are willing to spend time and effort to understand the language itself choose to read the right books (Hadley's Advanced R, Matloff's The Art of R Programming) instead of doing a few MOOCs.


Quoting JJ Allaire: " ... Again, you should understand R as a user interface, or a language that's capable of providing very good user interfaces. Developers don't implement core algorithms or core distributed computing primitives in R. They use R as a way of interacting with those things. Once people understand this, and they see how good R is, how they can use dplyr to interact with a Spark cluster ... "


R provides convenient statistical modeling and machine learning capabilities (orchestrated by caret and mlr, the latter inspired by scikit-learn), plus the ability to pipeline cleaning, model building, hyperparameter tuning, ensembling, and building API endpoints to be used in production.
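For instance, a cross-validated hyperparameter search takes only a few lines with caret; this is a sketch in which a built-in dataset and an arbitrary grid stand in for a real problem:

library(caret)  # the rpart package must also be installed

# Candidate values for rpart's complexity parameter
grid <- expand.grid(cp = c(0.001, 0.01, 0.1))

tuned <- train(Species ~ ., data = iris, method = "rpart",
               trControl = trainControl(method = "cv", number = 5),
               tuneGrid = grid)

tuned$bestTune  # the cp value chosen by cross-validation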


There are a few GUIs built on R (a lot of commenters mentioned them). There is even radiant, built on shiny. Those who want drag-and-drop can choose them, but once you are adept at programming, you will most likely not use them. Note that visualization is a different behemoth altogether.


I believe that this is neither the time nor the place to talk more about R or Python. I have spent years doing data analysis in R and Python. I have built production models in R, Python, H2O, etc. I have written R packages, some of which are on CRAN.


For those who want to choose a language/tool for data analysis in an unbiased manner, please get hold of R, Python, Julia, SAS and SPSS (if your pockets permit) and spend some time with them. People are paid to write against R and open-source.

Best Wishes,
Srikanth KS

Comment by Nitin Sareen on May 17, 2017 at 6:07pm
This issue is not particular to R. These days, the apparent way to learn and practice data science is to know how to use pre-packaged tools. New practitioners hardly seem to be focusing on the core of the techniques.
Comment by William Vorhies on May 17, 2017 at 3:30pm

Thanks, everyone, for your thoughtful and well-considered opinions. As of this time there are 36 comments, a number that is sure to grow.

As a starter I did a quick tally of the commenters’ opinions and saw roughly this:

  • 25% Pro-R.
  • 50% Balanced: use it when you need it.
  • 25% Not necessarily anti-R, but sorry that it makes learning so complicated.

I do want to point out, and I believe the comments support this, that I was not suggesting we should not use R where it's needed, only that there is an opinion shared by many that it does indeed get in the way of learning the broader strokes of predictive analytics.

As for drag-and-drop versus code, I completely empathize with those who know R and want to 'get down in the weeds', as one contributor offered. Many of you who took this position were also pointing out that the work you did in R was to develop code to be implemented in operational systems and products. That's a skill level above and beyond the normal consumer propensity-scoring models that still represent the great majority of models developed by the largest companies with the most data scientists.

I particularly want to thank those of you who took the balanced middle ground that there's room for both. If you need it, there's no replacement for R (well, maybe Python). However, if it's about building and maintaining lots and lots of models, then the simplified platforms will do perfectly well.

As a final point, it was Gartner that first alerted me to what I'll call the 'efficiency and effectiveness' movement in data science. It really is a force in the largest data science shops, and you only need to stop and ask yourself how you personally would go about managing the work of, say, 20 other data scientists without adopting some unified standard platform. I've personally seen this happen even in mid-size shops. Lots of people freelancing in R and/or Python are almost impossible to manage for consistency, efficiency, or effectiveness, which is the point Gartner makes.

Thanks particularly to those who welcomed the opportunity to engage in this conversation. Creating engagement in our data science community is what we here at Data Science Central strive to do.

Comment by Con Menictas on May 17, 2017 at 3:12pm

One of the things I like about R is that anyone can do a few online MOOC courses with absolutely no formal statistical training whatsoever and start estimating models!!!

I was thinking of doing a few online MOOC courses myself on human behavior and then starting to practice as a clinical psychologist. Anyone have any objections to that?

Comment by Shantanu Karve on May 17, 2017 at 3:09pm

After the hackathon that Bill mentioned, the Titanic problem interested me enough to look harder at it. By the next day I'd got my team's rank to 14th of 6955, with a score of .99522. The other 13 teams ahead of me had a perfect score of 1.0.

How?

I bet by doing what I did. They and I understand the epistemology, the philosophical underpinnings of data science, I expect, and we remembered a key learning: "Look for more data" :-). Now, to teach that aspect you don't need R or Python, but you also don't need a high-priced software package like SAS or SPSS or Statistica.

Comment by Michele Chambers on May 17, 2017 at 2:50pm

There is no "one size fits all" approach to data science. In my work over the last 20 years in analytics then data science and now DL/AI, I have found that folks (analysts, statisticians, data miners, data scientists, data engineers, developers) want flexibility to use the right tool(s) for the problems that they work on.

Open source by its very nature provides that type of flexibility, but proprietary tools such as SAS, SPSS, and others also provide a limited "openness" via APIs and connectors. Each has its pros and cons, but arguments for or against any language or tool miss the point, since there are situations where each is appropriate and inappropriate.

In my experience, SAS users as well as data scientists prefer to use a language approach (using base SAS, Python, R, Scala, etc.) rather than a drag-and-drop approach, as they can develop faster. Gartner argues that you have to have a GUI approach so that analytics/data science can be "democratized for the masses" (aka business analysts). However, there is a major flaw in that thinking, as all the GUI tools (including sexy new tools such as IBM Data Science Experience and MS Azure ML) are designed for systems thinkers, and the typical business analyst is NOT a systems thinker. So only the most seasoned/advanced business analysts tend to use GUI/data-mining tools.

While SAS shops can and do use SAS Enterprise Miner, they are often using it to generate scoring code for production from models that were created in base SAS. However, there are folks who prefer to develop their models with an IDE (e.g., RStudio, Jupyter Notebooks, JupyterLab), others who prefer an Analytic Development Environment, or ADE (e.g., SAS Enterprise Miner, KNIME, RapidMiner, Dataiku, Orange Data Mining), and yet others who prefer automated ML tools such as DataRobot. All are valid tools based on your approach (statistical, data mining, exploratory data analysis, programming).

In an ideal open data science world, these languages and tools are open so that an individual or team can leverage the right tool and approach for the problem at hand today and use a different set of tools, libraries, and approaches for a different problem tomorrow. So rather than dismiss any of the technology (heck, we're still using Fortran because it's amazing at numerical processing!), we should be promoting and demanding openness in our tools so that we can solve increasingly complex problems rather than plowing over the same old use cases with old tools, methods, data, and approaches. The only way for us to solve the complex, messy problems that face us in business and in the world is to build on the shoulders of the giants who came before us and contribute to building a new generation of more advanced algorithms, models, and apps that can make the world a better place.

Comment by Vincent Granville on May 17, 2017 at 1:32pm

R has been changing a lot recently, with new libraries added all the time, the ability to process big data, integration with other platforms, and so on. It is no longer the little, limited tool that one uses to produce neat graphs or for one-time ad hoc analyses; it has become much more than that. The GUI is more than enough for me. There are different types of data scientists: some who use dashboards heavily, and some who don't. I don't think that having or not having a great GUI should be a criterion.

In my case, I use R for a number of things (even to create data videos) but also use many other tools, including all-purpose programming languages, to design new machine learning techniques. While I am used to designing systems that work entirely in batch mode or for machine-to-machine communications, using Perl or Python, I can see that such systems could benefit from having some components make calls to R functions.
