Summary:  Someone had to say it.  In my opinion R is not the best way to learn data science and not the best way to practice it either.  More and more large employers agree.


Someone had to say it.  I know this will be controversial and I welcome your comments but in my opinion R is not the best way to learn data science and not the best way to practice it either.


Why Should We Care What Language You Use for Data Science?

Here’s why this rises to the top of my thoughts.  Recently my local Meetup had a very well attended hackathon to walk people through the Titanic dataset using R.  The turnout was much higher than I expected, which was gratifying.  The result, not so much.

Not everyone in the audience was a beginner, and many were folks who had probably been exposed to R at some point but were just out of practice.  What struck me was how everyone was getting caught up in the syntax of each command, which is reasonably complex, and how many commands were necessary, for example, to run even the simplest decision tree.

Worse, it was as though they were learning a programming language and not data science.  There was little or no conversation or questioning around cleansing, prep, transforms, feature engineering, feature selection, or model selection, and absolutely none about hyperparameter tuning.  In short, I am convinced that group left thinking that data science is about a programming language whose syntax they had to master, and not about the underlying major issues in preparing a worthwhile model.
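To see how much of the real work sits outside any one modeling command, here is a deliberately tiny sketch of those skipped steps.  The data and the one-rule "stump" model are hypothetical stand-ins, and it is written in plain Python rather than R purely for illustration; a real project would do each step with far more care:

```python
# Toy illustration of the modeling steps that matter regardless of
# language: cleansing, feature engineering, modeling, evaluation.
# All data below is made up; a one-rule "decision stump" stands in
# for a real decision tree.

from statistics import median

# Titanic-style records; None marks a missing age (needs cleansing).
rows = [
    {"sex": "female", "age": 29,   "survived": 1},
    {"sex": "male",   "age": None, "survived": 0},
    {"sex": "female", "age": None, "survived": 1},
    {"sex": "male",   "age": 40,   "survived": 0},
    {"sex": "male",   "age": 21,   "survived": 1},
]

# 1. Cleansing: impute missing ages with the median of observed ages.
observed_ages = [r["age"] for r in rows if r["age"] is not None]
for r in rows:
    if r["age"] is None:
        r["age"] = median(observed_ages)

# 2. Feature engineering: encode sex as a numeric feature.
for r in rows:
    r["is_female"] = 1 if r["sex"] == "female" else 0

# 3. Model: a single-split stump (predict survival for females).
def predict(row):
    return row["is_female"]

# 4. Evaluation: simple training-set accuracy.
correct = sum(predict(r) == r["survived"] for r in rows)
accuracy = correct / len(rows)
print(f"accuracy = {accuracy:.2f}")  # 4 of 5 right -> 0.80
```

The point is not the code; it is that every numbered step above is a modeling decision, and the hackathon conversation never got past the syntax of step 3.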


Personal Experience

I have been a practicing data scientist with an emphasis on predictive modeling for about 16 years.  I know enough R to be dangerous, but when I want to build a model I reach for my SAS Enterprise Miner (it could just as easily be SPSS, RapidMiner, or one of the other complete platforms). 

The key issue is that I can clean, prep, transform, engineer features, select features, and run 10 or more model types simultaneously in less than 60 minutes (sometimes a lot less) and get back a nice display of the most accurate and robust model along with exportable code in my selection of languages.
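The bake-off those platforms automate, fitting many candidate models on the same data and surfacing the most accurate, can be pictured in miniature.  This is a hypothetical pure-Python sketch with made-up data and two one-line "models"; a real platform also handles partitioning, champion/challenger comparison, and exportable scoring code:

```python
# Miniature version of a platform model "bake-off": fit several
# candidate models on the same data and keep the most accurate.
# The data and both one-line "models" are hypothetical stand-ins.

data = [  # (feature x, label y)
    (1, 0), (2, 0), (3, 1), (4, 1), (5, 1),
]

candidates = {
    "always_positive": lambda x: 1,             # naive baseline
    "threshold_at_2.5": lambda x: int(x > 2.5), # simple split model
}

def accuracy(model):
    # Fraction of examples the model labels correctly.
    return sum(model(x) == y for x, y in data) / len(data)

scores = {name: accuracy(m) for name, m in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores[best])  # the threshold model wins at 1.0
```

A drag-and-drop workspace runs this same compare-and-select loop over ten or more full algorithms at once, which is exactly the work you would otherwise hand-code.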

The reason I can do that is because these advanced platforms now all have drag-and-drop visual workspaces into which I deploy and rapidly adjust each major element of the modeling process without ever touching a line of code.


Some Perspective

Through the ’90s (and actually well before), up through the early 2000s, if you studied data science in school you learned it on SAS or SPSS, in the base code for those packages that actually looks a lot like R.  R wasn’t around then, and for decades SAS and SPSS recognized that the way to earn market share was to basically give away your product to the colleges that used it for training.  Those graduates would gravitate back to what they knew when they got out into the paying world.

In the mid-2000s these two platforms probably had at least an 80% market share, and even today they have 36% and 17% respectively.  This doesn’t even begin to reflect how dominant they are among the largest companies, where their share is probably double these numbers.

By 2000 both of these providers were offering advanced drag-and-drop platforms that de-emphasized code.  The major benefit, and it was huge, was that it let learners focus on the major elements of the process and understand what went on within each module or modeling technique without having to code it.

At the time, and still today, you will find SAS and SPSS purists who grew up coding and still maintain hand-coding shops.  It’s what you learned on that you carry forward into commercial life.


Then Why Is R Now So Popular?

It’s about the money.  R is open source and free.  Although SAS and SPSS provided very deep discounts to colleges and universities, each instructor had to pay several thousand dollars for the teaching version and each student had to pay a few hundred dollars (eventually there were free, web-based student versions, but the instructor still had to pay).

The first stable beta version of R was released in 2000.  If you look at the TIOBE index of software popularity you’ll see that R adoption had its first uptick when Hadoop became open source (2007) and the interest in data science began to blossom.  In 2014 it started a strong upward adoption curve, along with the now-exploding popularity of data science as a career and R’s wide-ranging adoption as the teaching tool of choice.

This was an economic boon for colleges but a step back for learners, who now had to drop back into code mode.  The common argument is that R’s syntax is at least easier than that of other languages, but that misses the point: drag-and-drop is not only orders of magnitude easier, it makes the modeling process much more logical and understandable.


Do Employers Care?

Here you have to watch out for what appears to be a logical contradiction.  Among those who do the hiring, the requirement that you know R (or Python) is strong, almost a go/no-go factor.  Why?  Because those doing the hiring were very likely taught in R, and their going-in assumption is that if they had to know it, so do you.

Here’s the catch.  The largest employers, those with the most data scientists, are rapidly reconsolidating on packages like SAS and SPSS with drag-and-drop.  Gartner says this trend is particularly strong in mid-size and large companies.  You need to have at least 10 data scientists to break into this club, and the average large company has more like 50. 

We’re talking about the largest banks, mortgage lenders, insurance companies, retailers, brokerages, telecoms, utilities, manufacturers, transportation, and largest B2C services companies.  Probably where you’d like to work unless you’re in Silicon Valley.

Once you have this many data scientists to manage you rapidly become concerned about efficiency and effectiveness.  That’s a huge investment in high-priced talent that needs to show a good ROI.  Also, in this environment you likely have several hundred to several thousand models directing core business functions to develop and maintain. 

It’s easy to see that if everyone is freelancing in R (or Python), managing for consistency of approach and quality of outcome, not to mention collaboration around a single project, is almost impossible.  This is what’s driving the largest companies to literally force their data science staffs (I’m sure in a nice way) onto common platforms with drag-and-drop consistency and efficiency.


Gartner Won’t Even Rate You Unless You Have Drag-and-Drop

Gartner’s ‘Magic Quadrant for Advanced Analytic Platforms’ and Forrester’s report on ‘Enterprise Insight Platform Suites’ are both well regarded ratings of comprehensive data science platforms.  The difference is that Gartner won’t even include you in their ranking unless you have a Visual Composition Framework (drag-and-drop). 

As a result Tibco, which ranks second in the 2016 Forrester chart, was not even considered by Gartner because it lacks this particular feature.  Tibco users must work directly in code.  Salford Systems was also rejected by Gartner for the same reason.

Gartner is very explicit that working in code is incompatible with the large organization need for quality, consistency, collaboration, speed, and ease of use.  Large groups of data scientists freelancing in R and Python are very difficult to manage for these characteristics and that’s no longer acceptable.

Yes, essentially all of these platforms do allow highly skilled data scientists to insert their own R or Python code into the modeling process.  The fact, however, is that the need for algorithms not already embedded in the platform is rapidly declining.  If you absolutely need something as exotic as XGBoost, you can import it.  But only if that level of effort is warranted by a need for an unusually high level of accuracy.  It’s now about efficiency and productivity.


Should You Be Giving Up R?

If you are an established data scientist who learned in R, then my hat’s off to you; don’t change a thing.  If you’re in a smaller company with only a few colleagues you may be able to continue that way.  If you move up into a larger company that wants you to use a standardized platform, you shouldn’t have any trouble picking it up.

If you’re an early learner you are probably destined to use whatever tools your instructor demands.  Increasingly that’s R.  It’s not likely that you have a choice.  It’s just that in the commercial world the need to actually code models in R is diminishing and your road map to a deep understanding of predictive modeling is probably more complex than it needs to be.


A Quick Note About Deep Learning

Most programming in TensorFlow is occurring in Python, and if you know R you shouldn’t have a problem picking it up.  Right now, to the best of my knowledge, there are no drag-and-drop tools for deep learning.  For one thing, deep learning is still expensive to execute in terms of manpower, computing resources, and data acquisition.  The need for those skills here in 2017 is still pretty limited, albeit likely to grow rapidly.  As with core predictive modeling, when things are difficult I’m sure there’s someone out there focusing on making them easier, and I bet drag-and-drop is not far behind.



About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

[email protected]



Tags: Python, R, SAS, SPSS, modeling, predictive



Comment by Pedro Junqueira on November 29, 2017 at 3:51pm

Are you trying to criticize R or the audience of your meet-up?

The reason one uses R or not is such an "IT DEPENDS" scenario that this short article did not do justice to the pros of using R.

The bottom line is that you can use anything for data science, from XLS to expensive corporate (drag-and-drop) applications.  To do good data science, more important than the tool is knowledge of the methods, algorithms, and process; the tool comes down to personal preference and what each person or company can afford to get the job done.



Comment by Dmitry Zinoviev on November 28, 2017 at 1:42pm

I thought you would explain why R is bad for me, as opposed to Python. (Never mind, I know the answer.)

Comment by Kenneth C Black on August 7, 2017 at 8:06pm

Hi Bill,

It took me over 8 months to complete 3 articles that outline my thoughts on this topic, and I have to say that I agree with your basic premise (although I do not necessarily think that R is bad for anyone).

Starting in Sept 2016, I wrote the articles that took me 30 years of steady work in quantitative sciences to develop. The insights I explain in this series have been built throughout my career, starting with my work in numerical simulations and eventually migrating to my work in the field of advanced data analytics.

I call this series: "How to Improve Data Comprehension."

I encourage people that are interested in learning data science to consider reading these thoughts, although the series is long and takes commitment to finish. I believe my guidance is important for one simple reason: data science begins with data. To work effectively with data, you have to be able to comprehend data.

The very first sentence of my series is this: "My personal mission is to teach the next generation of analytics workers how to achieve better data comprehension."



Comment by Ralph Winters on August 3, 2017 at 4:29am
Good headline, but I don't think it's as dire for R as your article makes it out to be. It is true that drag and drop encourages more collaboration among participants of different skill levels (including some managers). In fact, one article I read recently indicated that an algorithm is more readily accepted if someone is able to 'tweak' it (quickly). This is much easier to do in a data-flow/GUI environment than in a pure programming context, where things are not so transparent.

Products like Alteryx and Shiny are demonstrating that GUI environments are possible, and I think it will take a bit of time for these and other types of GUI/drag-and-drop interfaces to develop, pending the demand, of course. Right now the (retro) programming paradigm is hot, and that is what (non-Enterprise) companies seem to want.

Also, as an alternative, intelligent program design based upon R functions (or SAS macros) can help expose the functionality and flow of algorithmic development without having to use GUIs.

Comment by Pablo Bernabeu on July 31, 2017 at 2:15pm

Really interesting! Here's also a very related article

Comment by Rosaria Silipo on July 29, 2017 at 12:10am

KNIME has a drag and drop option for deep learning.

Comment by Paul Bremner on May 24, 2017 at 2:08pm

There seems to be a lot of confusion about the role of programming in relation to the Data Science platforms that research firms Gartner and Forrester have identified as the future of Data Science in large corporations.  For example, numerous people have stated that SAS is pushing a drag-and-drop platform (Enterprise Miner) that somehow limits choices and is destined to fail because programming (that is, R) allows greater flexibility.

It’s certainly true that Gartner is telling people (based on feedback from the largest firms) that “drag and drop” is a minimum requirement for large data science teams (corporate and line-of-business teams, to use Gartner definitions) and that open source programming is not manageable in this context.  Gartner does not address the issue of whether programming in general can, or should, play any part in these platforms.  This is most likely because Gartner (and the large firms) are focused on dealing with the issue of whether unsupported open source programming—with all its shortcomings—can  be used in an enterprise-level “production environment.”  Various folks in the open-source community apparently assume this means that data science platforms and programming are mutually exclusive, that platforms are therefore limiting, R is the obvious choice in terms of extensibility/innovation, and that the latter will prevail over Data Science platforms.  However, it is not the case that Enterprise Miner is just a point and click app.  Enterprise Miner actually has a full programming interface as seen in the quotes below from the Enterprise Miner Reference document.


So, you ask, why does SAS include a full programming capability if drag and drop Data Science platforms (with modules, building blocks, etc.) are the wave of the future in Data Science?  It’s because of what several people have said at various points in this thread (me included, I’m the “Paul” whose note Bill pasted in a few days back; Thanks, Bill.)  That is, it’s not feasible to build drop-downs for every possible contingency, even with the exhaustive capabilities of a top-notch data science platform.  For instance, if you look at the initial course for SAS Enterprise Miner (“Applied Analytics for EM”), it’s essentially all drop-downs.  But in the advanced course (“Advanced Analytics Using EM”), you’re using drop-downs for 2/3 of the material and programming statements for maybe 1/3.  As the SAS description below says, using programming allows Enterprise Miner users to utilize all aspects of the entire SAS system and expand use of EM beyond those provided in the drop-downs.  (Including full programming capability has been the case for years not just in Enterprise Miner but also in Enterprise Guide, a point and click interface that’s very good, but which falls short of the capabilities of Enterprise Miner.  See the papers below that talk about how you use the programming interface in Enterprise Guide.)


My own view of this is that while it’s nice to have drop-downs for something relatively simple like Enterprise Guide, I’d prefer to use programming statements, so that’s what I do using the main SAS application (comparable to using an IDE for R/Python.)  That’s because the programming interface for Enterprise Guide is “too good” – it’s better than the programming window in the main SAS app.  For now, I actually want to be staring at a blank window to input programming rather than having SAS suggest code/files/etc as I start typing text.  If I don’t know the right programming statement or get the syntax wrong, I want to know that, learn the right code, and not make the same mistake again.  The types of things you do in Enterprise Guide are relatively “discrete” and programming statements work best for me (this situation is comparable to what you’d be doing with Python or R, in my opinion.) 


However, when you get to the enormous complexity of Data Science tasks in big firms with large Data Science staffs, drag and drop applications like Enterprise Miner become a necessity.  You may be running hundreds/thousands of models using multiple techniques (regression, decision trees, neural networking, perhaps additional input from sources like R/Python), then scoring everything and developing your own unique model/equation.  You could no doubt write the code for all that but it would be enormously time consuming.


At some point, however, you may need to push the analysis on a specific aspect even deeper than what the platform drag and drops provide, and at this juncture you can use programming statements since there is no effective limit on what you can do with programming.  Needless to say, if you’re using Enterprise Miner and are regularly using programming statements to extend Enterprise Miner’s capabilities, then you are in a very different world than most people who use Enterprise Guide or “discrete programming,” whether that means Base SAS, R, or Python.  This sort of capability probably exceeds what is normally necessary in firms, and probably doesn’t get much explicit attention from the syndicated research firms or from companies (i.e. as a job requirement, since few people have the expertise in Enterprise Miner, SAS database and statistical programming, that allows them to implement these kinds of super-advanced capabilities.)


So the question is not do we do drag and drop, or do we do programming, for Data Science?  It all depends on the complexity of what you’re doing (i.e. is it “discrete” analysis or things that require a full-fledged Data Science Platform) and whether you know programming.  Enterprise Guide and Enterprise Miner are agnostic about what you use: you can do either, or both in the same workflow with each complementing the other.


In terms of the original question posed in this thread regarding R, if (like me) you’re fine with paying a yearly license fee for Base SAS, then R is not the best option for learning or doing data science (I can fill in whatever gaps remain with some R/Python after learning SAS programming.)  If you don’t like license fees, go with R but then you “pay another price” in terms of increased complexity/learning difficulty/productivity issues, and also face the question of how to deal with the need for data science platform capabilities when R doesn’t seem to have anything like this (one way to partially address this shortcoming, I guess, is to use a cloud service like Azure ML.)  Regardless of whether you use SAS programming, R, or something else, if you are good at programming you have the option of extending the capability of the data science platform you use on the rare occasions when that’s required.


SAS Enterprise Miner 14.1 Reference Help

Chapter 75 – SAS Code Node (p. 1121)

"The SAS Code node enables you to incorporate new or existing SAS code into process flow diagrams that were developed using SAS Enterprise Miner.  The node extends the functionality of SAS Enterprise Miner by making other SAS System procedures available for use in your data mining analysis.  You can also write SAS DATA steps to create customized scoring code, conditionally process data, or manipulate existing data sets.  The SAS Code node is also useful for building predictive models, formatting SAS output, defining table and plot views in the user interface, and for modifying variables metadata.  The SAS Code node can be placed at any location within a SAS Enterprise Miner process flow diagram…..The exported data that is produced by a successful SAS Code node run can be used by subsequent nodes in a process flow diagram….The code pane is where you write new SAS code or where you import existing code from an external source.  Any valid SAS language program statement is valid for use in the SAS Code node with the exception that you cannot issue statements that generate a SAS windowing environment."



Technical Papers on Programming in SAS Enterprise Guide



Comment by Khalid Riaz on May 22, 2017 at 2:41pm

I agree with Wei-Chun Chu; the title seems to suggest a comparison between coding in R and coding in other programming languages, rather than a comparison between coding and the drag-and-drop functionality.

SAS has consistently lost market share, efficiency-driven preference of large companies for non-programming approaches notwithstanding. The pace of development in SAS is slow compared to open source languages. On the other hand, SAS's resilience in the large company market segment stems from its high-quality customer support, reliability, and seamless integration. 

I feel that the drag-and-drop functionality is overrated as a driver of market share dynamics.  The enterprise analytics solutions market is contestable, and if profits could be made by simply adding drag-and-drop interface, many other vendors would have entered the market,  leveraging open source algorithms. These competitors would have an edge over SAS because of the faster pace of development in the open source ecosystems.   

Things could change, however. As the market for analytics continues to grow very rapidly, the demand for variety could also rise.  A few vendors with differentiated products might aim to serve niche market segments. SAS  would find it challenging to provide this level of product differentiation with the slow pace of its own development.

I understand that open source libraries are accessible from within SAS, although I have not used them in that way.  This is really a safety valve, and the hope is that the programmer in need of something more esoteric would just write a patch in, say, R, and then return to programming in the SAS language.    

It is difficult to imagine SAS embracing the open source movement more fully. But as has been pointed out, sometimes companies stick to their legacy business models longer than they should. If the resilience of SAS is based on a complex range of competencies, not easily imitable by the competition, and the company is innovating, as it seems to be doing with SAS Viya and other high-end products, it should re-examine its business model.

Comment by Phil Rack on May 22, 2017 at 8:35am

Bill, a fine article and thank you for writing it. I'd like to offer my view which is perhaps a bit different from most of the postings and responses that I have read so far. To be transparent, I resell a product called WPS which is a SAS language compatible software product. It's important for me to state that so everyone understands my perspectives and how I arrived at my conclusions and observations.

First, when I started to resell WPS, I thought that my customer base was going to be individuals, consultants and small businesses. Those are the people who don't have the financial resources to license SAS. However, that didn't end up being the case. I do sell a lot of software to individuals and consultants. What I didn't expect was that the mid-sized businesses were the least likely organizations to purchase WPS. Much to my surprise, my market quickly became large organizations that can be found in the Fortune 2000.

SAS and perhaps other companies have Data Service Provider upcharges. If you use their product in a B-to-B environment, you can expect to pay much more for the license than if you just used it internally. The largest organizations quickly saw WPS, R and Python as a way of moving away from such egregious and heavy handed licensing.

Regarding R and Python, I am asked often about whether we support these two languages. The answer is we do. The reason the question is important is that my customers tell me how difficult it is for them to find SAS Language developers coming out of college. They all have R and some Python so they want to accommodate these new hires. The problem they run into is large data sets that cannot be confined to existing memory. Problem number two is if they want them to use WPS or SAS and the new employees only have R and Python then there is an additional training cost to the organization.

We recently did a survey of our customers and we asked them to rank in importance why they licensed WPS. The number one response was compatibility with the Language of SAS. The second highest response was the license terms and conditions and the third highest response was cost. This surprised me because I would have thought that "cost" was the most important. When I think about that, it makes sense because the cost to rewrite software can be quite high. So, compatibility was a proxy for cost.

Regarding "drag and drop capabilities" I think these interfaces are meant to provide managers the ability to access data without really learning the language. SAS was smart developing EG but seems to have antagonized individuals in an organization who must use it (and are coders) as their primary means of developing SAS programs.

I don't see R and Python fading away, but I am much in agreement with your conclusion when it comes to large organizations that employ many data scientists, statisticians and data managers. Production, stability, and reproducibility are all important aspects when it comes to creating your BI stack. There’s been a tremendous amount of work done developing R and Python, but for the large organizations, I see these two languages as a companion to the Language of SAS and not a fierce competitor.

Comment by Shantanu Karve on May 20, 2017 at 10:04am

This issue goes beyond R v SAS/SPSS. In the broadest sense, it captures the evolution of IT, hardware and software, in yet another domain. The barriers to entry, of cost and of skills, come crashing down; individuals and newer companies break in; and as costs drop, the application of the field expands rapidly, as do the customers for this technology. Unfortunately, the niche vendors don't adapt their business models to the new reality, and their tools, technology and skillsets diminish in mind and market share.

They are squeezed at both ends at that. I'd be willing to sell a customer propensity model for churn and retention for a thousand bucks or so, EXCEPT mobile payment vendors like Square ALREADY offer add-on analytics solutions, so why would the local micro-brewery (yeah, Westlake Brewing, I'm looking at you!) want me?  Most customer engagement, CRM, maildrop, and web-session software vendors provide built-in, integrated analytics modules. Now I grant that "this isn't your father's analytics," but commodification is always like that, and if it meets enough of the needs of customers still getting used to the idea of using math, not "seat of the pants," to run their business, then that suffices.

It's happened in other domains for sure. I remember and worked on IBM and other large-company office systems: PROFS, ALL-IN-1. They were decimated by rapidly falling hardware prices, and the key functionality of email, word processing, and spreadsheets was captured by PC-based WordPerfect and Lotus 1-2-3. Then networking: the fancy-priced token-ring networks and DECnet lost out to Novell in office LANs. 

Heck, in our field, once upon a time modeling was so expensive its use was very narrow and esoteric. In the 70s I worked for a "Seven Sisters" petrochemical company using fancy, pricey NAG and Harwell algorithm libraries on computers where you had to book time in 1-hour slots! And before my time, analytics was used in the Manhattan Project, where after all Bayesian inference became practically solvable, albeit only in billion-dollar projects, via the Metropolis-Hastings algorithm.

Look where analytics is now, and the number of businesses where it's used! And the process is inexorable.

Sadly for the niche vendors, they are now squeezed not only from the bottom end but from the top end too. I've just registered for Google's offer of 1,000 Cloud TPUs for selected researchers, for FREE! Will I get it? Who knows, but Azure and AWS won't let this pass; to capture mind share and market share they will respond, and IMO this process will only accelerate.

I've seen vendor presentations in the last year by SAS and IBM. Nope, they haven't impressed me. I'm sure they'll continue to service their installed base, who after all have sunk costs of money, expertise, and mature, performing productized models, but for new businesses and smaller businesses the value proposition isn't there.
