** **

By *JIANG Buxing, Data Scientist *

Multidimensional analysis, commonly called OLAP, is an interactively data analysis process that performs operations, including rotation, slice and dice, drilldown etc., on a data cube. The structure of its back-end computation is simple, as shown by the following SQL:

SELECT *D*,..., SUM(*M*), ... FROM *C* WHERE *D'*=*d'* AND ... GROUP BY *D*,...

The statement aggregates some measures by some dimensions. *C* is a data cube, *D,…* represents dimensions to be selected, and *M,…* represents measures to be aggregated. Except for SUM, we can use other aggregate functions. *D’* is a slice dimension. The criteria for a dicing operation is phrased as *D IN (d,…)*. We can also define a rule over a measure in the WHERE clause to select values within a certain range.

OLAP analysis requires instant response, so high-performance is critical. Despite the statement’s simple structure, a lot of computations may be involved when we deal with large data cubes. Before we find a way to optimize them, the analysis is slow. The following lists common ways in which back-end performance can be improved for multidimensional analysis.

**Pre-aggregation**

Pre-aggregation is commonly adopted by early OLAP products, a way to trade storage consumption for efficiency. The approach is to pre-compute the aggregate values (over the measures defined in the SELECT query) by some of the dimensions or all of them (defined in GROUP BY clause) and store them. These intermediate results can be directly used in later computations, or generate new computations. This way the performance can be much improved.

The pre-aggregated results occupy a large space. Typically, there are a dozen of, even dozens of, dimensions and each dimension’s value range is from a one-digit number to a two-digit number. A simple math shows that the pre-aggregated result will be from several times to dozens of times larger than the original data cube (that is the ratio of *(k1+1)*(k2+1)*...* to *k1*k2*...* with various types of aggregate functions taken into consideration). Though a data cube won’t be too large in order to obtain an instant response, an amount of data dozens of times larger is unfeasible.

A compromise is to compute aggregate values only by some of the dimensions. Since there are only a few grouping dimensions (defined in the GROUP BY clause) will be displayed in the OLAP interface, we can perform aggregation by those *m* dimensions. If the value of *m* is not greater than 5, the storage consumption will be within a sensible range and most of the user operations will get a quick response.

Yet the partial aggregation can’t handle the slice criteria imposed on other dimensions. The drilldown, however, is based on slicing. To make things worse, even the all-inclusive aggregation is unable to handle a criterion over a measure (say, to get sales amounts that are more than ¥1,000), which is not uncommon for the multidimensional analysis. It’s also probably that an aggregate function contains a criterion too (say, to total the costs under ¥100 only). The pre-aggregated results are useless in all of these scenarios.

Pre-aggregation can only handle the most commonly seen scenarios, a minority in all types of multidimensional analysis scenarios. Full traversal is still needed in most scenarios.

**Segment-based parallel processing**

Essentially, the multidimensional analysis is data filtering and grouping, which are easily performed in parallel. Steps are divide data into multiple segments, process them separately, and collect the processing results for aggregation, during which the subtasks are independent of each other. From multi-threaded processing on a single machine, multi-node cluster computing, to a combination of both, each is not difficult to implement.

The result of multidimensional analysis is for viewing. But the data we can view with our eyes is far less than the data a modern computer’s memory can hold. For a data set that is small enough to be easily loaded into the memory, there’s no need to swap it over between memory and disk. Programming is relatively easy, and performance is great. A big data set generated during the computing process will be reported directly to the interface, and then the computation will be aborted.

According to our tests, if all subtasks in a multithreaded processing merge their results into a same result set, performance may be seriously affected due to the synchronized action for the multiple threads using a single resource, though, seemingly, the memory footprint is reduced by using a shared final data set.

And more threads are not necessarily better. It becomes ineffective when the threads are more than the CPU cores. For data stored in the external storage, tests are needed to get the actual result of the multithreaded processing, because the hard disk’s concurrency ability, which is generally smaller than the number of CPU cores, needs to be factored in.

It’s easy to divide static data according to the number of records and by marking the ending of each segment. But it’s a hassle to split dynamic data evenly. More will be discussed about this in a future article.

For a single computing task, parallel processing can bring a several-time increase in performance. As the OLAP operation is basically a concurrent transaction, the increased performance could be offset even when the number of users is small.

A better way is needed.

**Sorted index**

An aggregate operation without slicing always involves the whole data cube. Little we can do to reduce the computations except perform pre-aggregation. But with a slicing operation (drilldown), it’s not necessary to do a full-traverse if the data cube is already sorted.

If we can create an index for dimension *D*, which means sorting its values in a certain order with the sequence numbers of their corresponding records tied up, then we can quickly locate records meeting the criterion on the slice that containing dimension *D*. This is a simple binary search. Without a full-traverse on all data, the computations will decrease by several orders of magnitude (which also depends on the value range of *D*). Theoretically, we can create an index for each dimension. It isn’t expensive. Performance will then be greatly improved when relative slices are involved.

But a multi-field index on dimensions *D _{1}* and

Unfortunately, this primal approach is only applicable to the handle of in-memory data, which allows the frequent, small-amount access. In most cases, the data set to be handled is quite large and needs to be stored in the disk. But, even if with an index, retrieving a large number of unordered records has little impact in performance increase. Performance can only be obviously improved when the data is truly ordered and records in each slice are stored continuously.

As data needs to be duplicated for each sorting by a dimension, the cost is rather high.

A solution is creating two copies of data: one where data is sorted by dimensions *D _{1}, …, Dn*, and the other where data is sorted by dimensions

** **

**Compressed column storage**

A great tool for dealing with multidimensional analysis is the columnar storage.

Typically there are a large number of fields (dimensions and measures), from scores of to hundreds of them, in a data cube undergoing multidimensional analysis. But the useful ones are not that many, usually about 5 or fewer, if the slice dimension isn’t taken into account. Since a slice can be handled with an index, then only a few fields need to be traversed.

In view of this, the columnar storage could deliver advantages. During the external storage computing, the I/O operation takes overwhelming time. So, compared with reducing the amount of computations, there’s more meaning in cutting down on data to be retrieved for performance increase. For a data cube with 100 fields, the I/O consumption would fall to 1/20 of the original if only 5 fields are retrieved, causing a performance spike by orders of magnitude.

Another advantage of columnar storage is that it supports data compression. In sorting and storing data by dimensions *D _{1},…, Dn*, we find that

There are some issues we need to have in mind when using columnar storage

Since columnar storage won’t reduce the amount of computations, it helps little when manipulating the in-memory data. But a compressed storage scheme is useful in reducing memory consumption.

The columnar storage will complicate the segment-based parallel processing and the creation of indexes. The segmentation of columns needs to be kept in alignment with each other. An index should reference all the columns simultaneously and correctly. And there is more hassle when the compressed column storage is employed. Despite these tedious issues, generally it’s not difficult to use column storage on static data (only to make sure you won’t forget to handle them).

The use of column storage will risk the increase of concurrency pressure. It loses the edge when the total number of fields is not many or we need to retrieve a lot of fields. With the HDD, an additional use of parallel processing will further increase the concurrency pressure, and probably results in a decreased performance. It’s more applicable to the SSD, which supports concurrency much better.

© 2019 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central