This article presents various ways of measuring the popularity or market share of software for analytics, including: Alpine, Alteryx, Angoss, C / C++ / C#, BMDP, FICO, IBM SPSS Statistics, IBM SPSS Modeler, InfoCentricity Xeno, Java, JMP, KNIME, Lavastorm, Mathworks’ MATLAB, Megaputer’s PolyAnalyst, Minitab, NCSS, Python, R, RapidMiner, SAS, SAS Enterprise Miner, Salford Predictive Modeler (SPM), SAP KXEN, TIBCO Spotfire, Stata, Statistica, Systat, and WEKA / Pentaho.
Figure 1: Number of R packages in each new release - chart from Robert's article
I believe that adding new methods to statistical packages, to the point that each package now offers hundreds of functions (dozens of regressions, dozens of classifiers, dozens of time series methods and so on), is a bad idea. Most of these functions are never used. This proliferation only confuses the high-level user and makes these packages unsuitable for automated or black-box data science by non-statisticians (engineers, economists). If you really need that level of sophistication and fine-tuning, you are better off writing your own code in Perl, Python, R, or some other programming language.
Dr Granville is currently working on a new approach to statistical software development. It consists of producing very few, global methods with few parameters (one method per core problem, e.g. one generic clustering technique, one generic regression technique etc.), with a focus on automation (algorithms run in batch mode and/or automatically scheduled), streaming data, black-box data processing by non-statisticians, and the ability to process large data while avoiding the curse of big data. These methods are designed for robustness, simplicity and scalability, with minimum accuracy loss over traditional methods, and can be integrated as modules in existing production-mode machine learning applications, large and small. The new methods, initially designed in Data Science Central's research lab, are in the process of being made easy to implement, with code and explanations provided. The series started with model-free confidence intervals 2-3 weeks ago, including hypothesis testing. This week it will be about predictive power (a metric for feature selection), followed in September by Hidden Decision Trees blended with Jackknife regression. The results will be presented in an upcoming book, Automated Data Science.
For another article about software comparison, click here.
The material below comes from a very long article written (I believe in 2013) by Robert Muenchen. Read the full version for all the reports and charts. Robert's article uses the following metrics for software comparison.
- Job Advertisements – these are rich in information and are backed by money so they are perhaps the best measure of how popular each software is now, and what the trends are up to this point.
- Scholarly Articles – these are also rich in information and backed by significant amounts of effort. Since a large proportion come out of academia, the source of new college graduates, they are perhaps the best measurement of new trends in analytics.
- Books – the number of books that include a software package’s name in their title is particularly useful information, since it requires significant effort to write a book and publishers do their own study of market share before taking the risk of publishing. However, it can be difficult to search for books that use general-purpose languages and also focus only on analytics.
- Website Popularity – the PageRank measure is objective data, and for sites that clearly focus on analytics, it’s unbiased and especially useful for weeding out the weaker software. However, so much market consolidation has occurred that focused analytic tools like SPSS are now listed under corporations with much broader interests (IBM in that case). In addition, for general-purpose software like Java, many of the sites that point to http://www.java.com discuss programming in general and have nothing to do with its use for analytics.
- Blogs – the number of bloggers writing about analytics software is an interesting measure. Blog posts contain a great deal of information about their topic, and although it’s not as time consuming as a book to write, maintaining a blog certainly requires effort. Unfortunately, this measure is very hard to collect except where sites exist to maintain such lists.
- Surveys of Use – these add additional perspective, but they are commonly done using “snowball sampling”, in which the survey taker tries to distribute the link widely and then vendors vie to see who can get the most of their users to participate. So long as they all do so with equal effect, the results can be useful. However, the information content is often low, because the questions are short and precise (e.g. “tools data mining” or “program languages for data mining”) and responding requires only a few mouse clicks, rather than the commitment required to place an advertisement or publish an article.
- Discussion Forum Activity – these web sites or email-based discussion lists can be a very useful source of information because so many people participate, generating many tens of thousands of questions, answers and other commentary for popular software and virtually nothing for others. While talk may be cheap, it’s still a good indicator of popularity.
- Programming Activity – some software development is concentrated in repositories such as GitHub. That allows people to count the number of lines of programming code written for each project in a given time period. This is an excellent measure of popularity, since writing or changing programs requires substantial commitment. However, very popular commercial software may not have much user development activity.
- Popularity Measures – some sites exist that combine several of the measures discussed here into an overall composite score or rank. In particular, they use programming activity and discussion forums.
- IT Research Firm Reports – these firms study the analytics market, interview corporate clients regarding how their needs are being met and/or changing, and write reports describing their take on where each software package is now and where it is headed.
- Sales or Download Measures – the commercial analytics field has undergone a major merger and acquisition phase so that now it is hard to separate out the revenue that comes specifically from analytics. Open source software plays a major role and even the few packages that offer download figures are dicey at best.
- Competition Use – organizations that sponsor analytic competitions occasionally report what the winners tend to use. Unfortunately this information is only sporadically available.
- Growth in Capability – while programming activity (mentioned above) is required before growth in capability can occur, actual growth in capability is a measure of how many new methods of analysis a software package can perform; programming activity can include routine maintenance of existing capability. Unfortunately, most software vendors don’t track this measure and, of course, simply counting the number of new things does not mean they are widely useful new things. I have only been able to collect this data for R, but the results have been very interesting.
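The "Popularity Measures" item above mentions sites that combine several measures into an overall composite score or rank. Neither the list nor this summary says how those sites actually do the combining, so the following is only a minimal sketch of one plausible approach: rank the tools within each measure, then average the ranks. The function names, the tool names (`tool_a`, `tool_b`, `tool_c`) and all the counts are hypothetical placeholders invented for illustration; they are not data from Robert's article.

```python
from statistics import mean


def rank_descending(values):
    """Map each key to its rank (1 = largest value); ties share the average rank."""
    ordered = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover the whole run of items tied with item i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        avg_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = avg_rank
        i = j + 1
    return ranks


def composite_rank(measures):
    """Combine several measures into one ordering by mean rank.

    measures: {measure_name: {tool_name: raw_count}}
    Only tools present in every measure are scored.
    Returns a list of (tool_name, mean_rank), best (lowest) first.
    """
    per_measure = [rank_descending(counts) for counts in measures.values()]
    tools = set.intersection(*(set(r) for r in per_measure))
    scores = {t: mean(r[t] for r in per_measure) for t in tools}
    return sorted(scores.items(), key=lambda kv: (kv[1], kv[0]))


# Hypothetical placeholder counts, made up purely to show the mechanics.
measures = {
    "forum_posts": {"tool_a": 500, "tool_b": 300, "tool_c": 100},
    "repo_commits": {"tool_a": 40, "tool_b": 90, "tool_c": 10},
}

for tool, score in composite_rank(measures):
    print(f"{tool}: mean rank {score}")
```

Averaging ranks rather than raw counts keeps a measure with very large numbers (forum posts, say) from swamping one with small numbers (commits). It also discards how far apart the tools are within each measure, which is one reason a composite score of this kind is best read alongside the individual measures rather than in place of them.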