
New, fast Excel to process billions of rows via the cloud

This could be a new startup idea: creating Excel-compatible software that works just like Excel but can handle much bigger datasets, much faster.

Like most data scientists, I've been using Excel a lot in my career, and it definitely has some very powerful features. Probably the greatest one is its ability to help design, test, and update models in real time: change the value of a few core parameters, and all your cells, thousands of them, and all your charts get updated at once. If you are not familiar with this, download our most recent spreadsheet to see how it works.

Other nice features include Excel's ability to connect to databases or the Internet (e.g. to the Bing API), extract useful information, and summarize it in cubes and pivot tables. Cubes and pivot tables have a strong old-fashioned, SQL relational-database feel, but they are still useful in many contexts. I'm not sure whether Excel can easily retrieve data via the Internet from non-Microsoft APIs, e.g. Google APIs. It should.

Yet the main drawback is Excel's slowness. It is slow in unexpected ways. If you sort 500,000 observations (one column), it's actually quite fast. But say you simultaneously sort two columns, A and B, where B is the log of the A values, so B contains a formula in each row. This dramatically slows down the sort. It is much faster to sort A alone and leave B as is: the formulas in B will correctly update all the values, very fast, and automatically.
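The sort-one-column workaround is easy to check outside Excel. Here is a minimal Python sketch (column size and values are illustrative) showing that sorting A alone and recomputing B gives exactly the same result as dragging both columns through the sort:

```python
import math
import random

# Illustrative stand-ins for columns A and B, where B = log(A).
a = [random.uniform(1.0, 1000.0) for _ in range(500_000)]

# Slow pattern: sort (A, B) pairs together, dragging the formula
# column through the sort.
pairs = sorted((x, math.log(x)) for x in a)

# Fast pattern: sort A alone, then recompute B from the sorted
# values, just as Excel's formulas update automatically.
a_sorted = sorted(a)
b_recomputed = [math.log(x) for x in a_sorted]

# Both approaches yield the same column B.
assert [p[1] for p in pairs] == b_recomputed
```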

As a general rule, any operation that involves re-computing (say) 200,000+ rows across multiple cells linked via dependency relationships is done very slowly if:

  • one of the columns includes functions such as VLOOKUP in each cell 
  • SORT or other super-linear processes are required (by super-linear, I mean processes that are O(n log n) or worse)

There's a very easy, efficient (but ugly) way around this, and I wonder why it's not built into Excel and made transparent to the user. For instance:

  • Replace the VLOOKUP formulas with hard-coded values, perform the update on the hard-coded values, then put the formulas back 
  • Perform the SORT only on the minimum number of columns necessary, then update the cells in the columns involving formulas

Also, I don't know why VLOOKUP is so slow in the first place. I use Perl all the time to join a (say) 3,000,000-row dataset with a 200,000-row lookup table, and it's very fast. In Excel, the dataset and the lookup table would be stored in two separate worksheets within the same spreadsheet, and it would take days if it were possible to perform this "join" (it isn't). Maybe Excel internally performs a full join, slowing the operation down quadratically? I almost never do, and people almost never do, a full join on large datasets. This is an area where significant improvements could be made.
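The fast Perl-style join described above is just a hash join: build a dictionary from the lookup table once, then probe it once per row, for a total cost of O(n + m) rather than the O(n × m) of a nested-loop full join. A minimal Python sketch, with toy sizes and made-up keys standing in for the 3,000,000-row dataset and the 200,000-row lookup table:

```python
# Toy data standing in for the dataset and the lookup table
# (sizes reduced so the sketch runs instantly; keys are made up).
dataset = ["id%d" % (i % 1_000) for i in range(100_000)]
lookup_table = {"id%d" % i: i * 10 for i in range(1_000)}

# Hash join: build once, probe many. Each of the n rows costs one
# O(1) dictionary lookup, so the whole join is O(n + m).
joined = [(key, lookup_table.get(key)) for key in dataset]
```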

Finally, I'm not sure whether Excel leverages the cloud, but this would be a great way to further speed up computations and to process data sets far bigger than 1 million rows. Microsoft should allow the user to export the data to some (Microsoft?) cloud in a transparent way, in one click (e.g. just click "export to cloud"), then let the user simulate the time-consuming operations in the local Excel version on her desktop (this amounts to using the local spreadsheet as a virtual spreadsheet), and, when done, click "retrieve from cloud" to get the spreadsheet updated. The "retrieve from cloud" step would:

  1. Send your Excel formulas to the cloud via the Internet
  2. Apply the formulas to the cloud copy of your data, leveraging MapReduce as needed
  3. Send the processed data back to the local spreadsheet on your desktop

Another painfully slow process is applying a formula to a whole column with 500,000 cells. Fortunately, there is a trick. Let's say you want to store the product of A and B in column C.

  • First, select the whole of column C.
  • Enter the formula =A1*B1.
  • Press Ctrl and Enter together.

I wish it were easier than that, something like ==A1*B1 (a formula with a double equal sign to indicate that it applies to the entire column, not just one cell). This is another example of how Excel is not user-friendly: many times, there is only some obscure way to do something efficiently. We'll see another example in my next spreadsheet, which will teach you how to write a formula that returns multiple cells - in particular with LINEST, which returns the regression coefficients associated with a linear regression. Yes, you can do it in basic Excel, without add-ins!
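For what it's worth, the whole-column formula that a double equal sign would provide is the default behavior in array-oriented tools. A minimal Python sketch of the same C = A × B computation over entire columns at once (column contents are illustrative):

```python
# Columns A and B as plain lists; the values are illustrative.
col_a = [1.0, 2.0, 3.0, 4.0]
col_b = [10.0, 20.0, 30.0, 40.0]

# One expression fills the whole of column C -- the equivalent of
# selecting column C, typing =A1*B1, and pressing Ctrl+Enter.
col_c = [a * b for a, b in zip(col_a, col_b)]
```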

For those interested in creating a column with 1,000,000 values (e.g. to test the speed of some Excel computations), here's how to proceed - this would indeed be a good job interview question:

  • Start with 200 cells.
  • Duplicate these cells; now you have 400.
  • Duplicate these cells; now you have 800.

Another 10 iterations of this process and you'll be at over 800,000 cells. It's much faster than the naive approach. And if your initial 200 cells contain the formula =RAND(), you'll end up with nearly a million pseudo-random numbers, though of poor quality.
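The doubling trick is just repeated concatenation; here is a short Python sketch of the same process, starting from 200 random values (note that in Python the copies repeat the same values, which is one reason the resulting sample is of poor quality):

```python
import random

# Start with 200 pseudo-random cells, as the recipe suggests.
column = [random.random() for _ in range(200)]

# Each pass duplicates the whole column: 200, 400, 800, ...
iterations = 0
while len(column) < 800_000:
    column = column + column
    iterations += 1

# 12 doublings take 200 cells to 200 * 2**12 = 819,200 -- the
# "over 800,000" figure the text mentions.
assert iterations == 12 and len(column) == 819_200
```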

Finally, I use Perl, Python, R, C, Sort, Grep and other languages to do data summarization, before feeding stuff to Excel. But if Excel came with the features I just discussed, much more could be done within Excel. Users would spend more time in Excel.

And one weird question to finish: why does Microsoft not show ads in Excel? They would get advertising dollars that could be used to improve Excel. They could then ship a version of Excel that transparently uses the cloud, a version that even the non technical guys could use, with all the benefits that I mentioned.


Comment by Luisa Pimentel on April 19, 2016 at 8:34am

Hello! I started watching some YouTube videos about Power BI, and it looks a lot like Tableau. I'm a bit new to the world of creating dashboards and I would like to know if they are indeed similar (Power BI / Tableau).

Comment by Jamie Thomson on May 29, 2015 at 2:54am

If you really want to manipulate/summarize data in Excel rather than first doing it in "Perl, Python, R, C, Sort, Grep and other languages", then I recommend you check out PowerQuery. It's an Excel add-in from the Power* family of products (of which PowerPivot is one) that enables transformation of your data before landing it in an Excel worksheet. Another way to describe it is as a "desktop ETL tool". Well worth a look.

Comment by Dr. Z on October 4, 2013 at 4:11am

Analytica and BigSheets (BigInsights) are pretty much what you are describing. The former has been around for a while now.

Comment by Jed Alpert on April 23, 2013 at 11:30am

We agree that there should be a cloud-based version of Excel that can handle billions of rows of data.  In fact, that’s why we, at www.1010data.com, have developed the Trillion Row Spreadsheet which allows companies to use a spreadsheet interface to perform data discovery and analyze trillions of rows of data all through a cloud-based GUI.  In addition to basic spreadsheet calculations, the Trillion Row Spreadsheet lets you run advanced statistical analyses and create machine learning algorithms, all using the familiar spreadsheet paradigm. Best of all, this is done without the need to write MapReduce code. For those that want to bring their data back into Excel, we have an Excel add-in that works seamlessly with 1010data.  Full disclosure, I work for 1010data.

Comment by Vincent Granville on April 18, 2013 at 7:12am

James Gallman says: I have been using PowerPivot for a few months now. The challenge is that MS has decided to take a product that was once free and put it into a very expensive edition of the software. That will likely be the death knell of the product. They do this to justify the cost of the OPP or Office365 versions. 

I have not used it for billions of rows. But millions on a laptop running 64 bit Windows and 64 bit office with 8 GB of RAM. It was pretty good.

Comment by Peter Scobbie on April 18, 2013 at 6:28am

The latest version of PowerPivot has extended external data connectors like OData, Azure Marketplace, and so on. I've been using PowerPivot for a couple of years now, and it's like having a 'proof of concept' version of SQL Server Analysis Services inside of Excel. I agree with Peter Bruce's sentiment about the internal politics within Microsoft product teams, but that 'barrier to progress' seems to be breaking down now - Office 13 is a good example.

Have a look at www.powerpivotpro.com/about/ and then journey off on the usual link trail to other interesting stuff

Comment by Vincent Granville on April 17, 2013 at 12:36pm

Can Excel detect if all cells in a column are based on the same formula? You could have a column where 50% of the cells use one formula and the other 50% use a different formula. But in practice, columns built on a formula almost always consist of cells with the exact same formula throughout the column.  

Comment by Vincent Granville on April 17, 2013 at 8:36am

 Michael Hausenblas says that my idea is already implemented as Google spreadsheet + BigQuery.

Comment by Vincent Granville on April 17, 2013 at 7:08am

Here are a few additional comments from our readers:

  • MS HDInsights with MS Excel, IBM BigSheets, and Datameer are already doing that over Hadoop - Sanjay Sharma
  • Excel can be used with large datasets on MS HPC Server 2008. While not as common as a Linux or Unix based cluster, an HPC Server 2008 cluster can be set up to use Excel. Look at chapter 7 in Step by Step Windows HPC Server 2008 R2: http://resourcekit.windowshpc.net/doc31.htm - Stephen Veronneau
  • Great idea. If you preserve Excel's VBA and add-in capabilities, you can do almost anything. For instance, to address the limitations of Excel's random function, incorporate NtRand's implementation of the Mersenne Twister as an add-in. If you need a special function, just code it in VBA. With some work, you should be able to make Excel a front-end for R and the possibilities are limitless. - John Wood
  • Suggest you test Gnumerics, looks like the answer. - Sten Carlsen
  • MS has already provided Excel-based solution which can process billions of rows. It is called PowerPivot and you can learn more about it here: https://www.microsoft.com/en-us/bi/powerpivot.aspx - Ivan Peev

Comment by Tatyana Yakushev on April 17, 2013 at 6:46am

You should use PowerPivot add-in from Microsoft to work with large datasets. It can be installed for Excel 2010 from powerpivot.com.

If you have Office 2013, PowerPivot is included with Office Professional Plus edition (it is disabled by default so you will need to enable it).

If you want to do predictive analysis in Excel, you should look at Predixion Insight (www.predixionsoftware.com). It was developed by the same group of people who did data mining add-ins at Microsoft. Developer edition of Predixion Insight is free. Predixion Insight has very nice integration with PowerPivot.

© 2016 Data Science Central