By JIANG Buxing, Data Scientist
Multidimensional analysis, commonly called OLAP, is an interactively data analysis process that performs operations, including rotation, slice and dice, drilldown etc., on a data cube. The structure of its back-end computation is simple, as shown by the following SQL:
SELECT D,..., SUM(M), ... FROM C WHERE D'=d' AND ... GROUP BY D,...
The statement aggregates some measures by some dimensions. C is a data cube, D,… represents dimensions to be selected, and M,… represents measures to be aggregated. Except for SUM, we can use other aggregate functions. D’ is a slice dimension. The criteria for a dicing operation is phrased as D IN (d,…). We can also define a rule over a measure in the WHERE clause to select values within a certain range.
OLAP analysis requires instant response, so high-performance is critical. Despite the statement’s simple structure, a lot of computations may be involved when we deal with large data cubes. Before we find a way to optimize them, the analysis is slow. The following lists common ways in which back-end performance can be improved for multidimensional analysis.
Pre-aggregation is commonly adopted by early OLAP products, a way to trade storage consumption for efficiency. The approach is to pre-compute the aggregate values (over the measures defined in the SELECT query) by some of the dimensions or all of them (defined in GROUP BY clause) and store them. These intermediate results can be directly used in later computations, or generate new computations. This way the performance can be much improved.
The pre-aggregated results occupy a large space. Typically, there are a dozen of, even dozens of, dimensions and each dimension’s value range is from a one-digit number to a two-digit number. A simple math shows that the pre-aggregated result will be from several times to dozens of times larger than the original data cube (that is the ratio of (k1+1)*(k2+1)*... to k1*k2*... with various types of aggregate functions taken into consideration). Though a data cube won’t be too large in order to obtain an instant response, an amount of data dozens of times larger is unfeasible.
A compromise is to compute aggregate values only by some of the dimensions. Since there are only a few grouping dimensions (defined in the GROUP BY clause) will be displayed in the OLAP interface, we can perform aggregation by those m dimensions. If the value of m is not greater than 5, the storage consumption will be within a sensible range and most of the user operations will get a quick response.
Yet the partial aggregation can’t handle the slice criteria imposed on other dimensions. The drilldown, however, is based on slicing. To make things worse, even the all-inclusive aggregation is unable to handle a criterion over a measure (say, to get sales amounts that are more than ¥1,000), which is not uncommon for the multidimensional analysis. It’s also probably that an aggregate function contains a criterion too (say, to total the costs under ¥100 only). The pre-aggregated results are useless in all of these scenarios.
Pre-aggregation can only handle the most commonly seen scenarios, a minority in all types of multidimensional analysis scenarios. Full traversal is still needed in most scenarios.
Segment-based parallel processing
Essentially, the multidimensional analysis is data filtering and grouping, which are easily performed in parallel. Steps are divide data into multiple segments, process them separately, and collect the processing results for aggregation, during which the subtasks are independent of each other. From multi-threaded processing on a single machine, multi-node cluster computing, to a combination of both, each is not difficult to implement.
The result of multidimensional analysis is for viewing. But the data we can view with our eyes is far less than the data a modern computer’s memory can hold. For a data set that is small enough to be easily loaded into the memory, there’s no need to swap it over between memory and disk. Programming is relatively easy, and performance is great. A big data set generated during the computing process will be reported directly to the interface, and then the computation will be aborted.
According to our tests, if all subtasks in a multithreaded processing merge their results into a same result set, performance may be seriously affected due to the synchronized action for the multiple threads using a single resource, though, seemingly, the memory footprint is reduced by using a shared final data set.
And more threads are not necessarily better. It becomes ineffective when the threads are more than the CPU cores. For data stored in the external storage, tests are needed to get the actual result of the multithreaded processing, because the hard disk’s concurrency ability, which is generally smaller than the number of CPU cores, needs to be factored in.
It’s easy to divide static data according to the number of records and by marking the ending of each segment. But it’s a hassle to split dynamic data evenly. More will be discussed about this in a future article.
For a single computing task, parallel processing can bring a several-time increase in performance. As the OLAP operation is basically a concurrent transaction, the increased performance could be offset even when the number of users is small.
A better way is needed.
An aggregate operation without slicing always involves the whole data cube. Little we can do to reduce the computations except perform pre-aggregation. But with a slicing operation (drilldown), it’s not necessary to do a full-traverse if the data cube is already sorted.
If we can create an index for dimension D, which means sorting its values in a certain order with the sequence numbers of their corresponding records tied up, then we can quickly locate records meeting the criterion on the slice that containing dimension D. This is a simple binary search. Without a full-traverse on all data, the computations will decrease by several orders of magnitude (which also depends on the value range of D). Theoretically, we can create an index for each dimension. It isn’t expensive. Performance will then be greatly improved when relative slices are involved.
But a multi-field index on dimensions D1 and D2 is of little use. It can’t quickly locate a slice containing only dimension D2. It works well only when both D1 and D2 are contained in a slice. After locating records of a slice containing a dimension with the biggest value range, the amount of computations will be already reduced a lot. Then we can traverse the other slices by their dimensions.
Unfortunately, this primal approach is only applicable to the handle of in-memory data, which allows the frequent, small-amount access. In most cases, the data set to be handled is quite large and needs to be stored in the disk. But, even if with an index, retrieving a large number of unordered records has little impact in performance increase. Performance can only be obviously improved when the data is truly ordered and records in each slice are stored continuously.
As data needs to be duplicated for each sorting by a dimension, the cost is rather high.
A solution is creating two copies of data: one where data is sorted by dimensions D1, …, Dn, and the other where data is sorted by dimensions Dn, …, D1. The resulting data amount is simply two times than the original amount, which is acceptable. With the two dimension sequences, now a slice dimension will always find itself fall in the first half, making sure the data in a slice with this dimension is roughly continuous and thus securing a better performance enhancement.
Compressed column storage
A great tool for dealing with multidimensional analysis is the columnar storage.
Typically there are a large number of fields (dimensions and measures), from scores of to hundreds of them, in a data cube undergoing multidimensional analysis. But the useful ones are not that many, usually about 5 or fewer, if the slice dimension isn’t taken into account. Since a slice can be handled with an index, then only a few fields need to be traversed.
In view of this, the columnar storage could deliver advantages. During the external storage computing, the I/O operation takes overwhelming time. So, compared with reducing the amount of computations, there’s more meaning in cutting down on data to be retrieved for performance increase. For a data cube with 100 fields, the I/O consumption would fall to 1/20 of the original if only 5 fields are retrieved, causing a performance spike by orders of magnitude.
Another advantage of columnar storage is that it supports data compression. In sorting and storing data by dimensions D1,…, Dn, we find that D1 has the same value in a continuous run of records; and so has D2, in a fewer number of consecutive records; and so on, in a fewer and fewer number of consecutive records. There’s almost no such continuity for Dn. Considering there’s no need to store those continuous same values repeatedly, we can store them once and record their number. By reducing the space the data occupies, we reduce the I/O access of external storage and enhance the performance.
There are some issues we need to have in mind when using columnar storage
Since columnar storage won’t reduce the amount of computations, it helps little when manipulating the in-memory data. But a compressed storage scheme is useful in reducing memory consumption.
The columnar storage will complicate the segment-based parallel processing and the creation of indexes. The segmentation of columns needs to be kept in alignment with each other. An index should reference all the columns simultaneously and correctly. And there is more hassle when the compressed column storage is employed. Despite these tedious issues, generally it’s not difficult to use column storage on static data (only to make sure you won’t forget to handle them).
The use of column storage will risk the increase of concurrency pressure. It loses the edge when the total number of fields is not many or we need to retrieve a lot of fields. With the HDD, an additional use of parallel processing will further increase the concurrency pressure, and probably results in a decreased performance. It’s more applicable to the SSD, which supports concurrency much better.