Home

Linear and logistic regression in Excel and R: try this free add-in (RegressIt)

Excel is often poorly regarded as a platform for regression analysis. The regression add-in in its Analysis Toolpak has not changed since it was introduced in 1995, and it was a flawed design even back then.  (See this link for a discussion.)    That’s unfortunate, because an Excel file can be a very good place in which to build regression models, compare and refine them, create high-quality editable tables and charts, share and present the results, and teach regression to those constituencies of students and practitioners for whom Excel is the only analytic tool they may ever use on a regular basis. 

Over the last 10 years I’ve developed an alternative, a free add-in called RegressIt, which is designed to take maximal advantage of the Excel environment and support good practices of data analysis.  Its home page is regressit.com, and a set of slides that gives a helicopter tour of its features is here.   I’ve used it for teaching an advanced course on regression and time series analysis to grad students in business and engineering, but it’s intended for use in teaching at all levels and in applications.  It was first released to the public in 2014 and has undergone major enhancements recently.  I urge you to take a look and give it a test drive.  It performs both linear and logistic regression in Excel, producing highly interactive model worksheets with well-designed outputs.  It also has some novel tools for navigating the model space, keeping an audit trail, and providing instruction as the user goes along.  The logistic model worksheets are particularly interesting:  they include a lot of tables and charts with spinners that can be used to play with their parameters.  For example, you can dial the cutoff value up and down after fitting a model, while watching what happens in classification tables and tracking your position on the ROC curve.  This feature does not require the program to be running, so  a single model worksheet is a self-contained demonstration tool for properties of a logistic model.  See the Titanic example on the web site.

And… it has an interface with R that allows R to be used as a computational engine for producing results in both environments.   See this short video for a demonstration.  This tool provides more analysis options and allows large data sets to be handled.   The full dataset does not need to fit in Excel.  It suffices to have matching variable names there. This allows Excel to provide a menu-driven front end for performing regression analysis in R that does not require the user to write any code.  The outputs in R include some custom tables and charts that resemble the ones that Excel produces for the same models, and the output that R sends back to Excel has most of the same interactive features as the native Excel output (color coding of coefficients by sign and significance, sorting of coefficient tables, deletion of insignificant variables directly from the coefficient table, inclusion in the model comparison table and other audit trail views, etc.).  The analysis options in R include a number of different kinds of testing (a fixed test set, iterated random test sets, k-fold cross-validation, and simultaneously fitting separate models to disjoint subsets of data), and they are accompanied by very detailed comparative statistics.  Stepwise variable selection can also be combined with them.  A nice (and often sobering) exercise is to run 10 or 20 iterations of a randomly chosen training set (say, 2/3 of the data) with stepwise variable selection in order to see what a range of models you may get.

I hope that many of you will find this synthesis of Excel and R to be useful, even if its implementation is not elegant by R standards.  The VBA code in the add-in writes a script file (which is verbose) and places a line of code to run the script on the clipboard.  The user just needs to hit Ctrl-V and Enter in the console in RStudio in order to run the script, which will produce output there and also send it back to Excel.  (The user also needs only to hit Ctrl-V and Enter in RStudio when importing data from Excel and loading packages. The entire R session can be controlled with those two keystrokes and mouse clicks.  Of course, the user can also do more playing around with the model by typing code.)  So, most of the program logic resides in Excel, and two-way communication with R takes place via the clipboard and text files.  It uses about a dozen existing R packages for various options, rather than providing a new package of its own. The system of generating a separate script file for each model leaves an audit trail:  there is a separate script behind every R model worksheet in the Excel file.  Usually the scripts will never be looked at, though, because the models can be instantly re-created at any time from their worksheets in the Excel file, which is where the user will probably spend the most time. If you are someone who has never used R before, you can install it and start fitting regression models in RStudio with this tool in about 10 minutes–instructions are on the web site. 

There are a lot of other distinctive features in the program.  It keeps a multi-threaded audit trail that includes a journal-style model comparison table by default and the ability to search through models on the basis of parent-child relationships (for those fitted with Excel).  It includes tools for evaluating and verifying the originality of work submitted by students, and it contains around 10,000 words of internal documentation and teaching notes. A regression model worksheet may contain very detailed teaching notes in the form of cell comments, and these could be customized to suit an instructor if you don’t like mine  More details of the teaching aids are here.  It also has a deep menu of variable transformations and a descriptive analysis procedure that provides rich graphical output for many variables at once, including scatterplot matrices in which each element is an editable Excel chart.  A very novel feature of the program (in fact, the most striking feature of the menu interface) is a set of tools for navigating among models, controlling the display of the output, and drilling down to see the layers of information stored within cells, which is designed to make it easy for users to explore the model space in a thoughtful fashion, look at the right outputs, and make good choices.

I invite you to take a look at it, regardless of your current level of fondness of Excel as a data analysis tool, and I welcome any feedback or questions.  This is free software that I am offering as a public service, and it is intended to serve as a complement, not necessarily a substitute, for whatever you and your colleagues and students may already be using for regression.

Tags: