Summary:  Someone had to say it.  In my opinion R is not the best way to learn data science and not the best way to practice it either.  More and more large employers agree.


Someone had to say it.  I know this will be controversial, and I welcome your comments, but in my opinion R is not the best way to learn data science and not the best way to practice it either.


Why Should We Care What Language You Use For Data Science

Here’s why this rises to the top of my thoughts.  Recently my local Meetup held a very well attended hackathon walking people through the Titanic dataset using R.  The turnout was much higher than I expected, which was gratifying.  The result, not so much.

Not everyone in the audience was a beginner, and many were folks who had probably been exposed to R at some point but were just out of practice.  What struck me was how everyone was getting caught up in the syntax of each command, which is reasonably complex, and in how many commands were necessary, for example, to run the simplest decision tree.

Worse, it was as though they were learning a programming language and not the data science.  There was little or no conversation or questioning around cleansing, prep, transforms, feature engineering, feature selection, or model selection, and absolutely none about hyperparameter tuning.  In short, I am convinced that group left thinking that data science is about a programming language whose syntax they had to master, and not about the underlying major issues in preparing a worthwhile model.
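To be clear about what I mean by “the underlying major issues”: the mechanics a decision tree actually performs can be shown in a few lines, independent of any one language’s syntax.  This is a hypothetical Python sketch (the toy rows are invented stand-ins for the Titanic columns, not anything shown at the hackathon) of the core of tree induction: scoring every candidate split by Gini impurity and keeping the best one.

```python
# Minimal, hypothetical sketch of what a decision tree actually does:
# pick the split that best separates survivors from non-survivors.
# Toy rows: (sex, pclass, survived) -- invented stand-ins for Titanic columns.
rows = [
    ("female", 1, 1), ("female", 3, 1), ("male", 1, 0),
    ("male", 3, 0), ("female", 2, 1), ("male", 2, 0),
    ("male", 3, 1), ("female", 3, 0),
]

def gini(labels):
    """Gini impurity of a node: 0.0 means the node is perfectly pure."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_quality(rows, col, value):
    """Weighted impurity after splitting on row[col] == value (lower is better)."""
    left = [r[2] for r in rows if r[col] == value]
    right = [r[2] for r in rows if r[col] != value]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Try every candidate split and keep the best -- the heart of tree induction.
candidates = [(col, val) for col in (0, 1) for val in {r[col] for r in rows}]
best = min(candidates, key=lambda cv: split_quality(rows, *cv))
print(best)  # on this toy data, splitting on sex (column 0) separates best
```

These are the concepts (impurity, split search, and by extension pruning and tuning) that got no airtime at the hackathon while everyone wrestled with command syntax.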


Personal Experience

I have been a practicing data scientist with an emphasis on predictive modeling for about 16 years.  I know enough R to be dangerous, but when I want to build a model I reach for my SAS Enterprise Miner (it could just as easily be SPSS, RapidMiner, or one of the other complete platforms).

The key issue is that I can clean, prep, transform, engineer features, select features, and run 10 or more model types simultaneously in less than 60 minutes (sometimes a lot less) and get back a nice display of the most accurate and robust model along with exportable code in my selection of languages.

The reason I can do that is because these advanced platforms now all have drag-and-drop visual workspaces into which I deploy and rapidly adjust each major element of the modeling process without ever touching a line of code.
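What the visual workspace automates is, at heart, a leaderboard loop: fit several candidate models on the same training data, score each on a holdout, and report the winner.  Here is a deliberately tiny, hypothetical Python sketch of that loop (the toy data and the two toy models are invented for illustration; real platforms run far richer model families and handle prep and tuning as well):

```python
# Hypothetical sketch of a model "leaderboard": fit several candidate
# models on training data, score each on a holdout set, report the winner.
train = [({"sex": "female"}, 1), ({"sex": "male"}, 0),
         ({"sex": "female"}, 1), ({"sex": "male"}, 0),
         ({"sex": "male"}, 1), ({"sex": "female"}, 1)]
holdout = [({"sex": "female"}, 1), ({"sex": "male"}, 0),
           ({"sex": "male"}, 0), ({"sex": "female"}, 0)]

def majority_model(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return lambda x: guess

def one_rule_model(train):
    """OneR-style model: majority label per value of a single feature."""
    by_value = {}
    for x, y in train:
        by_value.setdefault(x["sex"], []).append(y)
    table = {v: max(set(ys), key=ys.count) for v, ys in by_value.items()}
    return lambda x: table.get(x["sex"], 0)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Fit every candidate, score on the holdout, and sort best-first.
candidates = {"majority": majority_model, "one_rule": one_rule_model}
leaderboard = sorted(
    ((accuracy(build(train), holdout), name) for name, build in candidates.items()),
    reverse=True)
for score, name in leaderboard:
    print(f"{name}: {score:.2f}")
```

A drag-and-drop workbench wraps exactly this comparison (plus the prep, transformation, and tuning stages) into configurable modules, which is why the whole cycle can fit in under an hour.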


Some Perspective

Through the ’90s (and actually well before), up through the early 2000s, if you studied data science in school you learned it on SAS or SPSS, in the base code for those packages that actually looks a lot like R.  R wasn’t around then, and for decades SAS and SPSS recognized that the way to earn market share was to basically give away your product to the colleges that used it for training.  Those graduates would gravitate back to what they knew when they got out into the paying world.

In the mid-2000s these two platforms probably had at least an 80% market share, and even today they hold 36% and 17% respectively.  This doesn’t even begin to reflect how dominant they are among the largest companies, where the figures are probably double these numbers.

By 2000 both of these providers were offering advanced drag-and-drop platforms that de-emphasized code.  The major benefit, and it was huge, was that it let learners focus on the major elements of the process and understand what went on within each module or modeling technique without having to code it.

At the time, and still today, you will find SAS and SPSS purists who grew up coding and still maintain hand-coding shops.  It’s what you learned on that you carry forward into commercial life.


Then Why Is R Now So Popular

It’s about the money.  R is open source and free.  Although SAS and SPSS provided very deep discounts to colleges and universities, each instructor had to pay several thousand dollars for the teaching version, and each student had to pay a few hundred dollars (eventually there were free web-based student versions, but the instructor still had to pay).

The first stable version of R was released in 2000.  If you look at the TIOBE index of software popularity you’ll see that R adoption had its first uptick when Hadoop became open source (2007) and interest in data science began to blossom.  In 2014 it started a strong upward adoption curve, along with the now exploding popularity of data science as a career and R’s wide-ranging adoption as the teaching tool of choice.

This was an economic boon for colleges but a step back for learners, who now had to drop back into code mode.  The common argument is that R’s syntax is at least easier than that of other languages, but this misses the point: drag-and-drop is not only orders of magnitude easier, it makes the modeling process much more logical and understandable.


Do Employers Care

Here you have to watch out for what appears to be a logical contradiction.  Among those who do the hiring, the requirement that you know R (or Python) is strong, almost a go/no-go factor.  Why?  Because those doing the hiring were very likely taught in R, and their going-in assumption is “if I had to know it, then so do you.”

Here’s the catch.  The largest employers, those with the most data scientists, are rapidly reconsolidating on packages like SAS and SPSS with drag-and-drop.  Gartner says this trend is particularly strong in mid-size and large companies.  You need at least 10 data scientists to break into this club, and the average large company has more like 50.

We’re talking about the largest banks, mortgage lenders, insurance companies, retailers, brokerages, telecoms, utilities, manufacturers, transportation, and largest B2C services companies.  Probably where you’d like to work unless you’re in Silicon Valley.

Once you have this many data scientists to manage, you rapidly become concerned about efficiency and effectiveness.  That’s a huge investment in high-priced talent that needs to show a good ROI.  Also, in this environment you likely have several hundred to several thousand models directing core business functions to develop and maintain.

It’s easy to see that if everyone is freelancing in R (or Python), managing for consistency of approach and quality of outcome, not to mention the ability to collaborate on a single project, is almost impossible.  This is what’s driving the largest companies to literally force their data science staffs (I’m sure in a nice way) onto common platforms with drag-and-drop consistency and efficiency.


Gartner Won’t Even Rate You Unless You Have Drag-and-Drop

Gartner’s ‘Magic Quadrant for Advanced Analytics Platforms’ and Forrester’s report on ‘Enterprise Insight Platform Suites’ are both well regarded ratings of comprehensive data science platforms.  The difference is that Gartner won’t even include you in its ranking unless you have a Visual Composition Framework (drag-and-drop).

As a result Tibco, which ranks second in the 2016 Forrester chart, was not even considered by Gartner because it lacks this particular feature.  Tibco users must work directly in code.  Salford Systems was also rejected by Gartner for the same reason.

Gartner is very explicit that working in code is incompatible with the large organization need for quality, consistency, collaboration, speed, and ease of use.  Large groups of data scientists freelancing in R and Python are very difficult to manage for these characteristics and that’s no longer acceptable.

Yes, essentially all of these platforms do allow highly skilled data scientists to insert their own R or Python code into the modeling process.  The fact, however, is that the need for algorithms not already embedded in the platform is rapidly declining.  If you absolutely need something as exotic as XGBoost you can import it, but only if that level of effort is warranted by a need for an unusually high level of accuracy.  It’s now about efficiency and productivity.


Should You Be Giving Up R

If you are an established data scientist who learned in R, then my hat’s off to you; don’t change a thing.  If you’re in a smaller company with only a few colleagues, you may be able to continue that way.  If you move up into a larger company that wants you to use a standardized platform, you shouldn’t have any trouble picking it up.

If you’re an early learner you are probably destined to use whatever tools your instructor demands.  Increasingly that’s R.  It’s not likely that you have a choice.  It’s just that in the commercial world the need to actually code models in R is diminishing and your road map to a deep understanding of predictive modeling is probably more complex than it needs to be.


A Quick Note About Deep Learning

Most programming in TensorFlow is occurring in Python, and if you know R you shouldn’t have a problem picking it up.  Right now, to the best of my knowledge, there are no drag-and-drops for deep learning.  For one thing, deep learning is still expensive to execute in terms of manpower, computing resources, and data acquisition.  The need for those skills here in 2017 is still pretty limited, albeit likely to grow rapidly.  As with core predictive modeling, when things are difficult I’m sure there’s someone out there focusing on making them easier, and I bet that drag-and-drop is not far behind.



About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

[email protected]



Tags: Python, R, SAS, SPSS, modeling, predictive



Comment by Tony Lange on May 19, 2017 at 3:54pm

To William: you are “stunned at the complexity of R,” etc.

That is your problem.  You cannot generalise, and in fact the irony is that the generalisation you want in analytics is not really achieved anyway.

Again going out on a limb: most people have no clue about this whole area, and ANNs, for example, are one of the most abused and dangerous sets of mathematical methods, and so is a mean standard deviation.

Understanding the minutiae of the problem space, the data, and the dangerous assumptions behind all these analytical tools can only be done, or taken into account, at a low level in FORTRAN, C, R, or even assembler.  Abstracting to a block is why things blow up, people die, and totally erroneous conclusions are reached.

Comment by William Vorhies on May 19, 2017 at 12:22pm

A number of readers have sent me comments directly and I wanted to provide this very thoughtful one sent by my LinkedIn colleague Paul:

I saw your recent note on Data Science Central about “Why R is Bad for you” and, since I’m in the process of joining the board and can’t yet post, wanted to at least send you a note.  Thanks so much for your post.  That, along with the ensuing discussion, is probably the most useful thing I’ve read in five years in any of the Data Science discussion boards, trade press, etc.  I’ve long wondered about the apparent disconnect between what Gartner and Forrester say about Data Science/Insight platforms and the fact that so many people are using R.  I’d read the various papers and magic quadrants, the definitions they were using and always wondered what the Gartner/Forrester analysts would say if asked how they square their assessments of the platforms with increasing use of R (I come from a market intelligence background and am quite familiar with Gartner/Forrester/IDC on computer networking/telecom issues so I knew it wasn’t that the firms simply didn’t understand what was going on in the Data Science world;  in fact, my decision to get into analytics began with looking at what they had to say about the industry.)  I realize now that the “disparity” is that the research firms are looking at the largest companies, with huge data science capabilities and infrastructures, and the firms are gravitating towards platforms for all the reasons you articulated.  The Data Science folks using R seem to be looking at a slightly different market segment.


A while back I decided to learn both SAS database programming and statistical programming with the thought of taking things as far as I could using Base SAS (hoping to replicate most of the capabilities in Enterprise Miner), and then start looking at R/Python to see what I needed to fill in the blanks.  I’ve recently begun doing the latter and am kind of stunned at the complexity/verboseness of R.  It’s no wonder that people complain about spending so much time learning the syntax rather than focusing on the concepts of statistics/data science.  SAS (both database programming and statistical programming) and SQL have made serious efforts to be “minimalistic” languages with excellent functionality.  R has obviously been written by, and for, computer scientists/statisticians with obvious implications for ease of learning, usability, and productivity.  For most things that matter, with SAS programming you can easily take a task in R and do it in one-third the time and with one-third the amount of code (and generate more useful output in terms of tables/graphics, with output in a variety of formats in PDF, HTML, rich text, etc, all of it “production-ready.”)   You may be able to reduce that time if you’re using Enterprise Guide or Enterprise Miner (not sure on the latter since I don’t have access to it right now.)  It’s no wonder that large companies are moving towards platforms.


One point on which we might differ is the issue of whether any coding at all is necessary/useful.  It’s probably true with Enterprise Miner that you can cover 90-95% of the things you want to do without programming.  But I suspect there are going to be situations where knowing the actual database or statistical programming is useful for extending the capabilities of Enterprise Miner (or Enterprise Guide.)  I can think of one situation where we were having to collapse some 5,000 columns down into a dataset with about 15 columns (I suspect the people doing this were using a SAS macro program which yielded over 20 pages of text when pasted into Word.)  I’m not sure you’d really want to try that with a point and click interface.  And there are other examples.  In fact, I’ve noticed in various SAS courses that you reach the point where they’re actually teaching you to use the “insert code” feature when doing something because that’s simpler than having SAS develop another drop-down menu.


One other point/question.  You said that to the best of your knowledge there are no drag-and-drops for deep learning.  If you’re referring to neural networking, I’m sort of curious about this myself as it pertains to SAS.  As you know, SAS Enterprise Miner has neural networking although it seems to use only a 1-layer or 2-layer neural network.  However, there was a paper done by a SAS person a couple years back entitled “SAS Does Data Science: How to Succeed in a Data Science Competition” that contained the following description of the use of neural networking:  “To visualize the combined results of both approaches, I used the NEURAL procedure in SAS Enterprise Miner to implement a type of deep neural network, known as a stacked denoising autoencoder, to accurately project the newly labeled points from the particular feature space into a two-dimensional space.”  The paper describing this and other techniques used during this Cloudera competition is at: https://support.sas.com/resources/papers/proceedings15/SAS2520-2015.pdf


So it must be possible to do some sort of deep neural networking in SAS although they don’t appear yet to have developed a regular PROC that is widely available in SAS/STAT, which you’d get with Base SAS.  As you say in your post, however, if neural networking is something that large firms want it’s quite likely that this will be developed and deployed in drag-and-drop platforms (and PROC NEURAL has existed since at least 2000, apparently.)


Anyway, thanks again for your post, which sparked a far more interesting/informational conversation than the ones I’ve seen previously when this topic has arisen.  I definitely think there is a place for both data science platforms and languages such as R and Python.   For someone like me who is fine with the annual licensing fee for Base SAS,  the most productive strategy is learning the major techniques (database programming, descriptive stats, correlation, regression, clustering, component reduction, decision trees, etc.) and then seeing if there’s anything that R can add.  I’m guessing that in terms of “structured data statistics” the answer is generally “no” and that Data Science platforms actually provide more capability in terms of speed and production.  But R (and Python) may, for a time, actually have a neural networking capability that is more available to people than that provided by SAS.  Hopefully, that changes as SAS validates and rolls out this capability at some point in SAS/STAT in the form of one or more PROCs. 

Comment by Shay Pal on May 19, 2017 at 10:02am

I would like to mention that in today’s growing world of “Citizen Data Scientists,” more GUI-based packages like SPSS come in handy.  That being said, R is essential if you are a statistician or a data scientist.

Comment by Tim Lin on May 19, 2017 at 8:21am

I’ve used a number of tools for “data science” work.  Although I can’t say I’ve been doing this long enough to exhaustively survey all the options, I do find myself going back to R over and over again. Reasons for this:

1)      Relatively speaking, it’s a high-level programming language that is intuitive enough for non-programmers to understand. R is the perfect fit for the data scientists described in that Venn diagram that gets published all over the place. Data scientists have programming skills, but are not programmers.  I think (maybe selfishly!) this description is completely appropriate. It’s easy to start programming in R and build from there.  Ever try to read a Java program?

2)      The community.  This is the best part. There are so many professional quality packages out there with great documentation.  I have to admit to learning a lot of theory from reading papers on R packages. 

3)      Yeah, it’s free – this matters, of course.  But it’s not the reason R is so popular.  Linux is free.  LibreOffice is free.  Just sayin’.

A few thoughts on drag-and-drop GUIs.  I hate them. There, I said it. The first problem is collaboration. How do you communicate the process of the analysis in Excel or Tableau?  In code, with a notebook like Jupyter, it is all there for the world (or at least your peers) to see.  Second, GUIs are necessarily limited in their capability.  I think this attribute, not WYSIWYG, is the reason for their ease of use.  However, because they are limited, there is a high probability that you will run into something you want to do that only your crazy brain thought of but that the GUI creators never envisioned.  I can’t count the number of times I started with a GUI thinking my task was easy, only to run into a roadblock and revert to a programming paradigm to get my result.

Well, that’s my opinion, based on my limited experience - hope it provides an alternative perspective.

Comment by David Reinke on May 19, 2017 at 7:32am

In the software engineering world it is considered bad form to try to convince someone else that your programming language/development environment/text editor/development tool is better than theirs. It is especially not cool to take a religious stance on one's own personal preferences when one has not had much experience with the alternatives, as this author admits. And, as Ammar A. Raja states below, basing one's opinion on a single hackathon is not valid reasoning.

There are a lot of tools out there. I've used S/S-Plus/R for 40 years, mainly because I like the object-oriented nature of the language. I've used SAS and SPSS as well, but I don't like being constrained to do data analysis their way. R provides a lot more flexibility than SAS or SPSS, and the graphics in R are superior to those in SPSS, SAS, and Python. If a particular tool doesn't have what I want, I'll code my own model in C# or F#.

I also suggest not constraining the debate to SAS vs R vs Python. Google's TensorFlow and MIT's Julia environments are two examples of emerging tools that appear to offer some distinct advantages over the "traditional" data analysis tools being bruited about in this discussion.

Comment by Wei-Chun Chu on May 19, 2017 at 5:09am

What the author really tried to say is "using coding platforms to do data science is bad". The title is a bit misleading; people will look at the title and think "ok R is bad compared with other programming languages."

Comment by Ammar A. Raja on May 19, 2017 at 4:58am
You are just basing your whole argument on the notion that since big companies have people who know SAS and SPSS, they will hire people with knowledge of SAS and SPSS. I had to use SAS and SPSS at university, but they never made me interested in data. They were good tools, but that’s it.

But R and Python changed everything in the data arena. So much flexibility: if you have an idea involving data, you can do it. It’s the always-challenging stuff that triggers the interest of the youth, and the drag and drop of SAS and SPSS was boring and inflexible. That’s the reason youngsters are so keen on becoming data scientists: they can solve problems in their own ways rather than SAS’s or SPSS’s way.

The other thing is about the hackathon: you based this opinion on just one hackathon, that’s it. Once people learn the basics they move on to experimenting with data in interesting ways, so it’s evolution, which you confused with an obsession with R and Python.
Comment by Michele Chambers on May 19, 2017 at 4:17am

Anaconda, based on Python, has a centralized repository and collaboration capabilities which allow Python data science teams to manage and version-control their work easily. So it is incorrect to say that with Python (or R) teams have to stray into the woods. There are many ways to manage your workflows in Python, and the Python Orange Data Mining library is equivalent to IBM SPSS Modeler and SAS Enterprise Miner.

Comment by Rick Henderson on May 19, 2017 at 4:14am

This is a great article with great discussion. You really do have a wide range of “data scientists”, or more generally “data workers” of some kind, with different needs and different requirements. The university I work at had its share of troubles when the price of SPSS went up after IBM took over. Many professors just aren’t savvy enough to move to PSPP to teach skills to students while still staying within a limited budget. For people who can’t code, drag-and-drop is great, but it will eventually run up against some limitation. It’s the same way that a WYSIWYG HTML editor can be fine for most people, but doing anything serious requires hand-coding. That is, until the technology itself becomes more and more developed and refined. I could easily see a drag-and-drop system developed around R... but at the moment people don’t really want it.

Comment by Harlan A Nelson on May 19, 2017 at 2:32am

I am going to write another comment because what I see as the central issue seems to be missed.  How is innovation managed?  SAS and IBM SPSS Modeler provide a way to keep everyone on the same page.  R and Python let people stray into the woods.  I worked with SPSS for a couple of years, and if you don’t have to do much data manipulation, the “data streams” stay reasonably transferable to others.  I debugged someone else’s stream once and it took about an hour.  That was even with Jython creating nodes. I also ended up using Jython to generate my streams, but the results were still readable.  Managing innovation is very difficult.  Standardized platforms like SPSS and SAS help you manage, but at the price of innovation.  In most cases that is not a problem.  The organization that rewards innovation is the exception; most would rather have control, and SAS, SPSS, and Azure give that to you.  That’s my main point.  A side point is that IBM SPSS Modeler was not able to extract and summarize the data I needed.  It (version 12) could not produce efficient enough code to query Hadoop.  There was too much data for its code to run.  I had to write my own Pig code to do the queries, which I did using Python. With Pig, I was able to use partitions to string together many terabyte-size queries and get my data in about 12 hours.  I had been asked to keep my queries down to around one terabyte per job.
