
Summary:  Someone had to say it.  In my opinion R is not the best way to learn data science and not the best way to practice it either.  More and more large employers agree.

 

Someone had to say it.  I know this will be controversial and I welcome your comments but in my opinion R is not the best way to learn data science and not the best way to practice it either.

 

Why Should We Care What Language You Use For Data Science

Here’s why this rises to the top of my thoughts.  Recently my local Meetup held a very well attended hackathon to walk people through the Titanic dataset using R.  The turnout was much higher than I expected, which was gratifying.  The result, not so much.

Not everyone in the audience was a beginner, and many were folks who had probably been exposed to R at some point but were just out of practice.  What struck me was how everyone was getting caught up in the syntax of each command, which is reasonably complex, and how many commands were necessary, for example, to run even the simplest decision tree.
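To give a sense of the flavor, here is a minimal sketch of that exercise in R (purely illustrative, assuming the usual Kaggle Titanic CSV with columns like Survived, Pclass, Sex, Age, and Fare; the meetup's actual code may have differed):

```r
# A minimal sketch of the Titanic decision-tree exercise (assumes the
# standard Kaggle titanic.csv; not the meetup's actual code).
library(rpart)

titanic <- read.csv("titanic.csv")
titanic$Survived <- factor(titanic$Survived)   # recode target as a class label
titanic$Sex      <- factor(titanic$Sex)
titanic$Age[is.na(titanic$Age)] <- median(titanic$Age, na.rm = TRUE)  # crude imputation

fit <- rpart(Survived ~ Pclass + Sex + Age + Fare,
             data = titanic, method = "class")
printcp(fit)          # complexity-parameter table
plot(fit); text(fit)  # quick-and-dirty tree plot
```

Even this stripped-down version involves a package, a formula interface, factor handling, and missing-value treatment before a single modeling question gets asked.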

Worse, it was as though they were learning a programming language and not data science.  There was little or no conversation or questioning around cleansing, prep, transforms, feature engineering, feature selection, or model selection, and absolutely none about hyperparameter tuning.  In short, I am convinced that group left thinking that data science is about a programming language whose syntax they had to master, and not about the underlying major issues in preparing a worthwhile model.
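To make that last point concrete, here is roughly what hyperparameter tuning looks like in code, reusing the titanic data frame from the sketch above (again an illustration; the caret package is my choice here, not something from the meetup):

```r
# A sketch of hyperparameter tuning with caret (an assumption; nothing like
# this came up at the meetup). Tunes the rpart complexity parameter (cp)
# via 5-fold cross-validation.
library(caret)

grid <- expand.grid(cp = seq(0.001, 0.05, by = 0.005))
tuned <- train(Survived ~ Pclass + Sex + Age + Fare,
               data = titanic, method = "rpart",
               trControl = trainControl(method = "cv", number = 5),
               tuneGrid = grid)
tuned$bestTune   # the cp value with the best cross-validated accuracy
```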

 

Personal Experience

I have been a practicing data scientist with an emphasis on predictive modeling for about 16 years.  I know enough R to be dangerous, but when I want to build a model I reach for my SAS Enterprise Miner (it could just as easily be SPSS, RapidMiner, or one of the other complete platforms).

The key issue is that I can clean, prep, transform, engineer features, select features, and run 10 or more model types simultaneously in less than 60 minutes (sometimes a lot less), and get back a nice display of the most accurate and robust model along with exportable code in my choice of languages.
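For readers who want to see the code-side equivalent, here is a rough sketch of a multi-model comparison using caret (the package choice is purely illustrative; in a platform like Enterprise Miner this is a handful of nodes on a canvas):

```r
# A rough sketch of comparing several model types on the same data
# (illustrative only; the platforms do this with drag-and-drop nodes).
library(caret)

ctrl    <- trainControl(method = "cv", number = 5)
methods <- c("rpart", "rf", "glm", "nnet")  # tree, random forest, logistic, neural net

fits <- setNames(
  lapply(methods, function(m)
    train(Survived ~ Pclass + Sex + Age + Fare,
          data = titanic, method = m, trControl = ctrl)),
  methods)

summary(resamples(fits))   # side-by-side cross-validated accuracy
```

Even at four model types this is real programming; the visual equivalent is dropping a few more nodes onto the canvas.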

The reason I can do that is because these advanced platforms now all have drag-and-drop visual workspaces into which I deploy and rapidly adjust each major element of the modeling process without ever touching a line of code.

 

Some Perspective

Through the ’90s (and actually well before), and up through the early 2000s, if you studied data science in school you learned it on SAS or SPSS, in base code that actually looks a lot like R.  R wasn’t around then, and for decades SAS and SPSS recognized that the way to earn market share was to basically give away your product to the colleges that used it for training.  Those graduates would gravitate back to what they knew when they got out into the paying world.

In the mid-2000s these two platforms probably had at least an 80% market share, and even today they hold 36% and 17% respectively.  This doesn’t even begin to reflect how dominant they are among the largest companies, where their share is probably double these numbers.

By 2000 both of these providers were offering advanced drag-and-drop platforms that de-emphasized code.  The major benefit, and it was huge, was that learners could focus on the major elements of the process and understand what went on within each module or modeling technique without having to code it.

At the time, and still today, you will find SAS and SPSS purists who grew up coding and still maintain hand-coding shops.  It’s what you learned on that you carry forward into commercial life.

 

Then Why Is R Now So Popular

It’s about the money.  R is open source and free.  Although SAS and SPSS provided very deep discounts to colleges and universities, each instructor had to pay several thousand dollars for the teaching version and each student a few hundred dollars (eventually there were free web-based student versions, but the instructor still had to pay).

The first stable version of R (1.0.0) was released in 2000.  If you look at the TIOBE index of software popularity, you’ll see that R adoption had its first uptick when Hadoop became open source (2007) and interest in data science began to blossom.  In 2014 it started a strong upward adoption curve, along with the now-exploding popularity of data science as a career and R’s wide-ranging adoption as the teaching tool of choice.

This was an economic boon for colleges but a step back for learners, who now had to drop back into code mode.  The common argument is that R’s syntax is at least easier than that of other languages, but that misses the point: drag-and-drop is not only orders of magnitude easier, it makes the modeling process much more logical and understandable.

 

Do Employers Care

Here you have to watch out for what appears to be a logical contradiction.  Among those who do the hiring, the requirement that you know R (or Python) is strong and almost a go/no-go factor.  Why?  Because those doing the hiring were very likely taught in R themselves, and their going-in assumption is “if I had to know it, then so do you.”

Here’s the catch.  The largest employers, those with the most data scientists, are rapidly reconsolidating on packages like SAS and SPSS with drag-and-drop.  Gartner says this trend is particularly strong among mid-size and large companies.  You need at least 10 data scientists to break into this club, and the average large company has more like 50.

We’re talking about the largest banks, mortgage lenders, insurance companies, retailers, brokerages, telecoms, utilities, manufacturers, transportation, and largest B2C services companies.  Probably where you’d like to work unless you’re in Silicon Valley.

Once you have this many data scientists to manage, you rapidly become concerned about efficiency and effectiveness.  That’s a huge investment in high-priced talent that needs to show a good ROI.  Also, in this environment you likely have several hundred to several thousand models directing core business functions that must be developed and maintained.

It’s easy to see that if everyone is freelancing in R (or Python), then managing for consistency of approach and quality of outcome, not to mention enabling collaboration on a single project, is almost impossible.  This is what’s driving the largest companies to literally force their data science staffs (I’m sure in a nice way) onto common platforms with drag-and-drop consistency and efficiency.

 

Gartner Won’t Even Rate You Unless You Have Drag-and-Drop

Gartner’s ‘Magic Quadrant for Advanced Analytics Platforms’ and Forrester’s report on ‘Enterprise Insight Platform Suites’ are both well-regarded ratings of comprehensive data science platforms.  The difference is that Gartner won’t even include you in its ranking unless you have a visual composition framework (drag-and-drop).

As a result, Tibco, which ranks second in the 2016 Forrester chart, was not even considered by Gartner because it lacks this particular feature; Tibco users must work directly in code.  Salford Systems was also excluded by Gartner for the same reason.

Gartner is very explicit that working in code is incompatible with the large-organization need for quality, consistency, collaboration, speed, and ease of use.  Large groups of data scientists freelancing in R and Python are very difficult to manage for these characteristics, and that’s no longer acceptable.

Yes, essentially all of these platforms allow highly skilled data scientists to insert their own R or Python code into the modeling process.  The fact is, however, that the need for algorithms not already embedded in the platform is rapidly declining.  If you absolutely need something as exotic as XGBoost, you can import it, but only if that level of effort is warranted by a need for an unusually high level of accuracy.  It’s now about efficiency and productivity.
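For the curious, importing something like XGBoost yourself looks roughly like this in R (a sketch using the xgboost package and the Titanic data from the earlier examples; the platforms wrap this kind of step in a single node):

```r
# A sketch of calling XGBoost directly (assumes the xgboost package and the
# titanic data frame from the earlier example; platforms hide this plumbing).
library(xgboost)

X <- model.matrix(Survived ~ Pclass + Sex + Age + Fare, data = titanic)[, -1]
y <- as.numeric(titanic$Survived) - 1        # recode factor levels to 0/1

bst <- xgboost(data = X, label = y, objective = "binary:logistic",
               nrounds = 100, max_depth = 4, eta = 0.1, verbose = 0)
pred <- predict(bst, X)                      # in-sample survival probabilities
```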

 

Should You Be Giving Up R

If you are an established data scientist who learned in R, then my hat’s off to you; don’t change a thing.  If you’re in a smaller company with only a few colleagues, you may be able to continue that way.  If you move up into a larger company that wants you to use a standardized platform, you shouldn’t have any trouble picking it up.

If you’re an early learner, you are probably destined to use whatever tools your instructor demands, and increasingly that’s R.  It’s not likely that you have a choice.  It’s just that in the commercial world the need to actually code models in R is diminishing, and your road map to a deep understanding of predictive modeling is probably more complex than it needs to be.

 

A Quick Note About Deep Learning

Most programming in TensorFlow is occurring in Python, and if you know R you shouldn’t have a problem picking it up.  Right now, to the best of my knowledge, there are no drag-and-drop tools for deep learning.  For one thing, deep learning is still expensive to execute in terms of manpower, computing resources, and data acquisition.  The need for those skills here in 2017 is still pretty limited, albeit likely to grow rapidly.  But as with core predictive modeling, when things are difficult I’m sure there’s someone out there focusing on making them easier, and I bet drag-and-drop is not far behind.

 

 

About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001.  He can be reached at:

[email protected]

 


Comment by Shantanu Karve on February 24, 2018 at 5:44am

Thanks, Dubravko Dolic, for the link to the 1998 Venables paper "Exegeses".  I loved the comparison of SAS to Microsoft, behemoths in 1998 in their respective domains, and the concern that "the program defines the subject rather than the subject dictating what the program should do."

The impact of the FDA on pharma is also well stated.  I've come across the same in the insurance field, where models had to be "classical" rather than machine learning because the regulators won't accept ML models.

Luckily it's now 2018, not 1998, and the field has sorted itself out.  Those who want the Microsoft, SAS, Intel, Oracle, FDA world are free to stay in it.  Luckily the hard work of the open-source community has paid off, and there's a thriving, even leading-edge, community served by great tools like Linux, R, Python-SKLearn, Hadoop/Spark, TensorFlow, and, ahem, unregulated "dietary supplements".

Live and let live.

Comment by Dubravko Dolic on February 24, 2018 at 2:56am

Only one approach?

First of all, I think we have a different understanding of what a Data Scientist should know.  From your article I understand that a Data Scientist is mainly responsible for understanding and applying methods of statistics and machine learning.  That is too little, in my opinion.  Business analysis is the most important part of being able to apply models or methods to data, so working with data is crucial for that task.  As the situation in both huge companies and small ones shows, you need to spend a lot of time getting those data, finding business-driven relations between them, and understanding why and how the business works with data the way it does.  This means you should be capable of doing any kind of programming.  As we want to bring our successful world of models and machine learning into production some day, you should also have a good deal of architectural knowledge.  And from that perspective it is vital not to stick to one language or software stack.

So generally it is good to learn more than one approach.  Work with SAS, Python, R, KNIME; even such strange things as SAP Analytics or IBM tools can be helpful for understanding the architectures and data environments we have to live in.

Saying that, I know it is possible to access, understand, clean, and prepare data with GUI-based tools like RapidMiner, Alteryx, SAS EM, and such.  But run any contest: a data-driven analyst working in code will be more efficient.  It also definitely takes a good deal of technical understanding to grasp how to transform, group, drill, project, filter, and otherwise prepare data to match whatever specific situation the business may ask about.

Freedom (not only free) of Code

Coming back to the roots of R, and therefore to the language itself: if you look at the early days (and I have been involved since 2000, when I was working in a field we called statistics in those days), you'll find good signs that R was driven by statisticians who dissented from the way SAS dominated how statistics and methods were defined.  To give one example, just review the very insightful discussion of linear models led by Venables (http://www.stats.ox.ac.uk/pub/MASS3/Exegeses.pdf).  So R came along not only as a cheaper alternative but also as a freer (as in freedom) tool for working with methods and statistics.

And concerning beginners' reservations about complicated code: there are many options nowadays to prepare and clean your data with one-liners (see #tidyverse), as in the sketch below.  Beyond that, I think it is always useful for a Data Scientist to understand the concepts of programming with data.
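For example, a minimal dplyr sketch (assuming a Titanic-style data frame with a numeric 0/1 Survived column and the usual Kaggle fields):

```r
# A minimal dplyr sketch (assumption: a Titanic-style data frame with a
# numeric 0/1 Survived column).
library(dplyr)

titanic %>%
  filter(!is.na(Age)) %>%                    # drop rows with missing age
  mutate(child = Age < 18) %>%               # quick feature engineering
  group_by(Pclass, Sex, child) %>%
  summarise(survival_rate = mean(Survived),  # share who survived per group
            n = n(), .groups = "drop")
```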

What about SAS?

While I still see some use for GUI-based software in the field of Data Science, I think the time for SAS is over.  Yes, finance will still work for a while with this dinosaur, but hey: they still work with COBOL, so what...

Having worked with SAS for more than 15 years, I know its pros, but overall they are too small to justify that dinosaur's costs.

Conclusion:

Don't restrict yourself by committing to one paradigm for doing Data Science, or even to only one software stack.  Embrace the freedom that open source and the community can give you.  Data Scientists need a good proportion of creativity in their jobs, and this has to be reflected in their tooling and their knowledge of it.

Let's invent...

Comment by Paul Bremner on January 4, 2018 at 7:39am

I would agree that for anyone who knows programming, it's quicker and easier to understand what's going on in a simple task by looking at 10-20 lines of a program rather than punching through several screens/tabs.  But getting to that point requires more than "five minutes" of learning.  GUIs vastly simplify understanding and the ability to get things done (and this is also true for experienced programmers when you're talking about data science projects that would otherwise require hundreds or thousands of lines of code, as when you're running hundreds of models in several different categories: regression, decision trees/random forests, and neural networks).

If GUIs/applications did not have this advantage of speed/simplification, why would we have any applications?  For example, why would we have Tableau and Power BI?  Why not just do all visualizations using R or Python (or C or SAS, or something else)?  Answer: it's much easier and more productive to use point-and-click applications.  It's a waste of time to do every task with programming.  For the small number of capabilities not included in such applications, you can use R or Python as add-ons (in Tableau, for instance).  The same is true for data science platforms (and, by the way, for SAS programming in SAS applications: for most of a massive data science project, you'd only insert your own programming into the process flow for the few things the application doesn't provide).

Comment by Richard g Lamb on January 4, 2018 at 7:06am

At some point in my career, I had to become comfortable with the front-end workings of a coded interface as compared to a GUI.  It was the look that was intimidating, not the difficulty.  Could it be that the modern professional in any discipline must take the five minutes to learn how to work in a coded front end?  When they do, we can pass scripts around as .txt files.  By contrast, when I pass a GUI result around, I have to explain step by step how to run it; a script is drop-and-run, and the details of what's in the model are easy to "read" as background explanation.

Comment by Ralph Winters on January 4, 2018 at 6:55am

The advantage of having a GUI-based predictive modeling environment cannot be overemphasized if true collaboration is desired among all the stakeholders in a data science project.  Code-based implementations can often be siloed and create an artificial wall between those who know code and those who don't.  GUIs help communicate data flows to non-programmers and help create dialogue.

Comment by Richard g Lamb on January 4, 2018 at 3:53am

A beauty of R is that if I am working with a team, everyone can immediately have top-tier software on their computer, and analytics can be passed to team members as a script they can run and explore once I have given them an understanding of the model.

Comment by Johnny Strings on January 2, 2018 at 9:05am

Certainly SAS offers some drag-and-drop convenience over raw R; however, I would not recommend SAS over R by any means.  The bang is not worth the buck.  (1) SAS is very expensive; (2) it has very complex licensing that even SAS salespeople don't understand; (3) base SAS is even harder to learn than R, with its macro language on top of "open code," never mind DS2, which is there but limited compared to any "real" language; (4) SAS can't use any proprietary DBMS without paying extra; (5) it cannot handle table names greater than 32 characters in length and doesn't handle long column names well either; (6) SAS doesn't support newer models, e.g., XGB or any deep learning; (7) if you think you can live in E-Miner without eventually needing to get your hands dirty with raw code, then I want your job: that's no more realistic than the assumption in this article that one can get by without drag-and-drop; and finally, (8) E-Miner is a "live" environment: there is no "save" button; you are modifying the code/project on the fly, which is horrific for actual code development in any sort of collaborative team environment.

That is my experience with SAS: run away as fast as you can.  Maybe raw, open-source, "totally free" R and/or Python is limited, but there are myriad approaches to building a collaborative, team-based practice with R and/or Python as the backbone.  I do not endorse any of them, but you mentioned SPSS and RapidMiner, which may be viable, and I'll add Alteryx, which offers an impressive combination of price, performance, and ease of development, and KNIME, which is similar to RapidMiner...

And finally, I'll throw Sense.io and Dataiku (perhaps others) out there as well: collaboration without the rigid requirement of drag-and-drop.  I question the whole argument that drag-and-drop is a necessity.  Anyone ever try to refactor something written in a drag-and-drop tool?  Good luck!  One example given in the article (paraphrased: isn't it nice to drop all this stuff on my canvas and go) is great for a single user with a client-only desktop E-Miner license, so that single user can understand their data faster.  But it is *not* an example relevant to large firms, where the results of that analysis need to be formally fed into a process of operationalization.  Large firms need the ability to collaborate, operationalize, refactor, version, and implement source control and audits... and SAS is either very limited or excruciatingly difficult to implement in these areas.  Perhaps other offerings (e.g., Alteryx/SPSS/RapidMiner/KNIME) address this better, but raw code-based collaboration tools can do all of the above very elegantly, with all the flexibility that raw code offers, from the ground up.

Comment by Richard g Lamb on January 1, 2018 at 6:19am

I had the pleasure of using SAS's Enterprise Miner (EM) when I studied statistics.  One cannot help but fall in love with R upon discovering that an EM license runs somewhere around $140,000 for the first year and about half that each subsequent year.  At the same time, I can answer all of the same questions (difference, relationship, time series, duration, and apparency) with R.

Comment by James S Herford on December 31, 2017 at 8:59pm

I'd like to provide a quote from a data scientist colleague of mine: "Data science is NOT plug and play."  What happens to creativity/ingenuity/innovation?  What about other great, scalable architectures that R/Python can utilize, like Hadoop and Spark?  Are SAS/SPSS able to handle the Big Data realities of today?

Comment by Pedro Junqueira on November 29, 2017 at 3:51pm

Are you trying to criticize R or the audience of your Meetup?

Whether or not one uses R is such an "it depends" scenario that this short article does not do justice to the pros of using R.

The bottom line is that you can use anything for data science, from XLS to expensive corporate (drag-and-drop) applications.  To do good data science, knowledge of the methods, algorithms, and process matters more than the tool; the tool is a personal preference and whatever each person or company can afford to get the job done.

Cheers

Pedro
