This could be a new startup idea: creating an Excel-compatible software, working just like Excel, but able to handle bigger datasets, much faster.
Like most data scientists, I've been using Excel a lot in my career, and it definitely has some very powerful features. Probably the greatest one is its ability to help design, test, and update models in real time: just changing the value of a few core parameters, and all your cells, thousands of them, and all you charts, get updated at once. If you are not familiar with this, download our most recent spreadsheet to see how it works.
Other nice features includes Excel ability to connect to databases or the Internet (e.g. to the Bing API) and extract useful information, and summarize it in cubes and pivot tables. Although cubes and pivot tables have a very strong feel of old-fashioned, SQL relational database environment. But it is still useful in many contexts. Not sure if Excel can easily retrieve data via the Internet, from non-Microsoft API's, e.g. Google API's. It should.
Yet the main drawback is Excel slowness. It is slow in ways that are unexpected. If you sort 500,000 observations (one column) it's actually quite fast. But let's say that you simultaneously sort two columns: A, and B, where B is the log of A values. So B contains a formula in each row. This dramatically slows down the sort. It is much faster to sort A alone and leave B as is. The formulas in B will correctly update all the values, very fast, and automatically.
As a general rule, any operation that involves re-computing (say) 200,000+ rows across multiple cells linked via dependence relationships, is done very slowly if:
There's a very easy, efficient (but ugly) way around this, and I'm wondering why it's not built-in in Excel, and transparent to the user. For instance:
Also, I don't know why VLOOKUP is so slow in the first place. I use it all the time to join a (say) 3,000,000 row dataset with a 200,000 lookup table in Perl, and it's very fast. In Excel, the dataset and the lookup table would be stored in two separate worksheets within the same spreadsheet, and it would take days if it was possible to perform this "join" (it isn't). Maybe Excel indirectly performs a full join, thus exponentially slowing down the operation? I almost never do, and people almost never do a full join on large datasets. This is an area where significant improvements could be done.
Finally, not sure if Excel leverages the cloud, but this would be a great way to further speed computations and process data sets far bigger than 1 million rows. Microsoft should allow the user to export the data on some (Microsoft?) cloud in a transparent way, in one click (e.g. just click on "export to cloud") then allow the user to simulate the time-consuming operations on the local Excel version on her desktop (this amounts to using your local Excel spreadsheet as a virtual spreadsheet), and when done, click on "retrieve from cloud" and get your spreadsheet updated. The "retrieve from cloud" would:
Another painfully slow process is when you need to apply a formula to a whole column with 500,000 cells. Fortunately, there is a trick. Let's say you want to store the product of A and B in column C.
I wish it would be easier than that, something like ==A1*A1 (formula with double equal to indicate that the formula applies to the entire column, not just one cell). This is another example of how Excel is not user-friendly. Many times, there are some obscure ways to efficiently do something. We'll see an another example in my next spreadsheet, which will teach you how to write a formula that returns multiple cells - in particular with LINEST which returns the regression coefficients associated with a linear regression. Yes you can do it in basic Excel, without add-in!
For those interested in creating a column with 1,000,000 values (e.g. to test the speed of some Excel computations), here's how to proceed - this would indeed be a good job interview question:
Another 10 iterations of this process, and you'll be at 800,000 cells. It's much faster than the naive approach. And if your initial 200 numbers consist of the formula =RAND(), at the end you'll end up with one million pseudo-random numbers, though of poor quality.
Finally, I use Perl, Python, R, C, Sort, Grep and other languages to do data summarization, before feeding stuff to Excel. But if Excel came with the features I just discussed, much more could be done within Excel. Users would spend more time in Excel.
And one weird question to finish: why does Microsoft not show ads in Excel? They would get advertising dollars that could be used to improve Excel. They could then ship a version of Excel that transparently uses the cloud, a version that even the non technical guys could use, with all the benefits that I mentioned.
Related articles:
Comment
Hello! I started watching some Youtube videos about Power BI and it looks a lot similar to Tableau, I'm a bit new in the world of creating dashboards and I would like to know if they are indeed similar (Power BI / Tableau)
If you really desire to manipulate/summarize data in Excel rather than first doing it in "Perl, Python, R, C, Sort, Grep and other languages" then I recommend you check out PowerQuery. Its an Excel add-in from the Power* family of products (of which PowerPivot is one) that enable transformation of your data before landing it into an Excel worksheet. Another way it might be described as a "desktop ETL tool". Well worth a look.
Analytica and BigSheets (BigInsights) are pretty much what you are describing. The former has been around for a while now.
We agree that there should be a cloud-based version of Excel that can handle billions of rows of data. In fact, that’s why we, at www.1010data.com, have developed the Trillion Row Spreadsheet which allows companies to use a spreadsheet interface to perform data discovery and analyze trillions of rows of data all through a cloud-based GUI. In addition to basic spreadsheet calculations, the Trillion Row Spreadsheet lets you run advanced statistical analyses and create machine learning algorithms, all using the familiar spreadsheet paradigm. Best of all, this is done without the need to write MapReduce code. For those that want to bring their data back into Excel, we have an Excel add-in that works seamlessly with 1010data. Full disclosure, I work for 1010data.
James Gallman says: I have been using PowerPivot for a few months now. The challenge is that MS has decided to take a product that was once free and put it into the a very expensive edition of the software. That will likely be the death knell of the product. They do this to justify the cost of the OPP or Office365 versions.
I have not used it for billions of rows. But millions on a laptop running 64 bit Windows and 64 bit office with 8 GB of RAM. It was pretty good.
The latest version of PowerPivot has extended external data connectors like OData, Azure marketplace, and so on. I've been using PowerPivot for a couple of years now and its like having a 'proof of concept' version of SQL Server analysis server inside of Excel. I agree with Peter Bruce's sentiment about the internal politics within Microsoft product teams but that 'barrier to progress' seems to be breaking down now - Office 13 is a good example.
Have a look at www.powerpivotpro.com/about/ and then journey off on the usual link trail to other interesting stuff
Can Excel detect if all cells in a column are based on the same formula? You could have a column where 50% of the cells use one formula and the other 50% use a different formula. But in practice, columns built on a formula almost always consist of cells with the exact same formula throughout the column.
Michael Hausenblas says that my idea is already implemented as Google spreadsheet + BigQuery.
Here are a few additional comments from our readers:
You should use PowerPivot add-in from Microsoft to work with large datasets. It can be installed for Excel 2010 from powerpivot.com.
If you have Office 2013, PowerPivot is included with Office Professional Plus edition (it is disabled by default so you will need to enable it).
If you want to do predictive analysis in Excel, you should look at Predixion Insight (www.predixionsoftware.com). It was developed by the same group of people who did data mining add-ins at Microsoft. Developer edition of Predixion Insight is free. Predixion Insight has very nice integration with PowerPivot.
© 2018 Data Science Central™ Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
You need to be a member of Data Science Central to add comments!
Join Data Science Central