
Which one is best: R, SAS or Python, for data science?

This is of course the wrong question. I use R because I'm more familiar with it than with SAS or Python, and I use it mostly for graphics and visualization. Though things have changed, I still consider R mostly a tool for ad-hoc analysis or EDA (exploratory data analysis) rather than a component of enterprise analytic applications or production code running in batch mode or accessed via APIs. Is there an enterprise version of R? Also, R used to be limited by the amount of RAM, and I'm not sure how easy it is to get around this limitation. RHadoop is R for Hadoop; I suppose that's a possible solution for big data, though I'm not familiar with the product.

Picture from Kunal Jain's blog

I used SAS a while back, and I know it has significantly improved over the last 10 years, including offering a better sort, hash tables, and very fast SAS for really big data. If your client uses SAS, SAS is a great option. You also get support with SAS, more than with R.

My favorite would be Python, but since I code my own applications (as opposed to working with a team), I still use Perl for its automated memory allocation, nice string processing features (though many languages now do as well as Perl with NLP and regular expressions), and high flexibility. Clear, scalable, portable code is more important than the choice of language. But I definitely like programming (and scripting) languages more than R or SAS, because I develop proprietary techniques and don't like black boxes: you never know when they will stop working or what kind of data makes them fail, which is not an issue if you write your own code. Also, speed of execution (fast C versus relatively slow Perl, R or Python) is not a big issue anymore with big data, as most of the computer time is spent not in running algorithms (if the algorithms are well optimized) but in data transfers.

There are also many other tools for data mining, for instance RapidMiner or Mahout (Java code for machine learning). What about Excel? I actually use Perl to summarize data (big data processing), R for graphics, and Excel as the top layer.
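As a rough illustration of the "summarize first, then hand off to R or Excel" workflow described above, here is a minimal Python sketch (not the Perl code the author actually uses). It streams over the rows once and keeps only per-key counts and sums in memory, so the raw data never has to fit in RAM. The column names `country` and `amount`, the sample rows, and the helper names `summarize` and `write_summary` are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def summarize(rows, key_field, value_field):
    """Stream over rows once, keeping only a per-key count and sum in memory."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [count, sum]
    for row in rows:
        entry = totals[row[key_field]]
        entry[0] += 1
        entry[1] += float(row[value_field])
    return {key: (n, s) for key, (n, s) in totals.items()}


def write_summary(summary, out):
    """Write the summary as CSV, a format that both R and Excel can open."""
    writer = csv.writer(out)
    writer.writerow(["key", "count", "total", "mean"])
    for key in sorted(summary):
        n, s = summary[key]
        writer.writerow([key, n, s, s / n])


# Hypothetical input; in practice this would be a large file opened with open(...).
raw = io.StringIO("country,amount\nUS,10\nFR,5\nUS,30\n")
summary = summarize(csv.DictReader(raw), "country", "amount")

out = io.StringIO()
write_summary(summary, out)
print(out.getvalue())
```

The small summary file is what gets loaded into R for plotting or into Excel as the top layer; the heavy lifting stays in the scripting language.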

What about you?


Replies to This Discussion

The Data Science function is not fully formalized yet in many large F100 / F1000 enterprises. A few roles are starting to appear in both the business and IT domains of Data Science. Hence, expecting to find a comprehensive enterprise-class Data Science stack will be really hard.

There's a lot of hype and glamour surrounding Big Data, where both commercial and enterprise Hadoop stacks have been deployed and are playing a successful role. Even the Big 3 (IBM, Microsoft & Oracle) are working on their flavors of Hadoop, either organically or through partnerships. These Hadoop stacks offer conveniences for languages, connectors, APIs, storage management, DR, high availability, high performance, real-time vs. batch processing SLAs, etc. But there's little or no emphasis on analytics / statistical modules bundled with them. A few exceptions are Mahout and MapR, but these are mostly standards-based open source extensions.

As an Enterprise Architect, I prefer R (RStudio) for EDA, with a more collaborative context on Git / SVN, primarily for Data Scientists / Statisticians / Data Modellers. Although I have yet to do this, we are actively looking into cloud-based Hadoop stacks that have native support for running R at large scale on big data sets. From an analytics consumption perspective, we are exploring Tableau (which can really augment R graphs well) and our current landscape (MicroStrategy & Microsoft BI, i.e. fancy Excel).

Recently I discovered the language Julia (http://julialang.org) and it looks quite promising. Does anyone have any experience with Julia? What are your impressions?

Thanks,

  

Here's an interesting comment. The full (very long) version can be found here. Not sure who wrote this, but it's not me.

Over the past two years, my scientific computing toolbox has been steadily homogenizing. Around 2010 or 2011, my toolbox looked something like this:

  • Ruby for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development;
  • Python/Numpy (mostly) and MATLAB (occasionally) for numerical computing;
  • MATLAB for neuroimaging data analysis;
  • R for statistical analysis;
  • R for plotting and visualization;
  • Occasional excursions into other languages/environments for other stuff.

In 2013, my toolbox looks like this:

  • Python for text processing and miscellaneous scripting;
  • Ruby on Rails/JavaScript for web development, except for an occasional date with Django or Flask (Python frameworks);
  • Python (NumPy/SciPy) for numerical computing;
  • Python (Neurosynth, NiPy etc.) for neuroimaging data analysis;
  • Python (NumPy/SciPy/pandas/statsmodels) for statistical analysis;
  • Python (MatPlotLib) for plotting and visualization, except for web-based visualizations (JavaScript/d3.js);
  • Python (scikit-learn) for machine learning;
  • Excursions into other languages have dropped markedly.

I want to agree with Vincent's quote: a lot of heavy-duty processing tools have been developed for Python recently, and there are significant gains to be made from a single development environment in terms of language familiarity and ease of transferring data types. Heck, I was installing the Python scipy library the other day, and to get the install to run correctly I had to first install a number of Fortran development libraries.

SAS will probably always have a place in legacy systems and entrenched analytics platforms, but freshly developed analytics platforms and toolkits are likely to be picking up steam with Python. Especially since Python can also be used for general-purpose programming and even web development, it's no wonder that so many data analysts are picking up Python skills.

Another entry in the language wars. It's not going to end, and there is no one best language. This is a helpful article that has 330 reader comments as of this morning and ranges from Fortran to the aspirant Julia:

http://arstechnica.com/science/2014/05/scientific-computings-future...

No one language is supreme. Use what you need to use to get the job done. If you are a manager in an enterprise, do an analysis, and select the tools that best facilitate meeting your goals.

A thought I always keep in mind is that the time it takes to learn a new tool is time I cannot be productive in the tools I know.

What about non-language tools like KNIME? Very useful and very awesome.

As with Perl for text manipulation and pulling files, there is supreme utility in Bash scripts. It's not just Python vs. R vs. SAS. You have to have a skill stack that comprises many languages. I can't imagine a shop or individual that uses only Python, or only any other single language.

For the aspiring analysts that might be reading this and trying to make decisions about what to spend their time learning, the most important decision is the platform...go with Linux/Unix.

Another tool you may want to look at is Revolution R; it's a supported version of R that can handle larger data sets than standard R and RStudio. I also highly recommend Enthought Canopy for Python development in a Windows environment; it too is a supported tool. Python is so much easier to manage in a Linux environment than in Windows, but Enthought makes it much easier in Windows. These are products that you'll pay for (there are free full versions available for academics and students).

Can I consider myself a data scientist if I don't use Python, JavaScript, Hadoop, Julia, C++?
Seems like these days employers want computer science grads who build websites, or software developers, rather than statistics, machine learning, and data mining practitioners.

Clancy, there are many data scientists who do not work for an employer. Many employers (or at least their recruiters) have a very narrow vision of what data science is. Many highly successful data scientists are not employees, not even consultants, and have none of these skills (Python, JavaScript, Hadoop, Julia, C++), though they blend computer science, machine learning, statistics, software engineering, product development, computational marketing, and business hacking.

Might be a little late to add, but this adds a little more data-driven context to this topic. I performed a search on Google Trends for the following keywords: "r data science", "python data science" and "sas data science". The growth trends for R and Python are similar.

SAS programmer here. I used it on small data in clinical trials. I think SAS is into big data too, with Grid computing. There is a "for dummies" book on that subject published in 2013. It is a booklet and a quick read. The authors are Tim Bates, Tom Keefer, and Steven Sober. Given this, SAS can do all the layers you mention in one product.

SAS is ruling the market now. So, from an immediate job perspective, SAS - R - Python would be my rating. But I think that with time, more structured data will be in place, and then R & Python will be on par or maybe more in demand.

All three tools are fun to learn. Also, once you become comfortable with one of them, you can code in the others with ease. So, learn any of them according to your interest, convenience and usage. All three are best in their own way!

Happy Learning!


© 2017 Data Science Central
