Previously, we discussed the role of Amazon Redshift's sort keys and compared how both compound and interleaved keys work in theory. Throughout that post we used some dummy data and a set of Postgres queries in order to explore the Z-order curve and interleaved sorting without getting bogged down in implementation details. In this post, we will explore some of these implementation details, discuss a common tactic that can benefit from using compound and interleaved sort keys together, and run some benchmark queries against a data set with billions of rows. Throughout this post we will link to code that can be used to recreate our results.
In Part I, we discussed how Redshift eschews the B-tree in favor of another secondary data structure called a zone map. The zone maps for each table, as well as additional metadata, is stored in the
stv_blocklist system table. If we take a look at the definition of the
stv_blocklist table, we can see it has a width of about 100B. This corresponds with our estimates from Part I regarding the small total size of this secondary data structure relative to the tables it describes.
Taking a look at
stv_blocklist for a table with a compound key confirms what we've learned about zone maps for compound keys. Below we can see that the zone map for
c_custkey, the leading column of the sort key for the
orders_compound table, would prove helpful in pruning irrelevant blocks. However, the zone map for any of the other columns in the key would be useless.
The zone map for a compound sort key (code).
If we take a look at the zone map for
orders_interleaved_4, which has an interleaved sort key, we can see that the zone maps for all of the sort key columns look pretty useful.
The zone map for an interleaved sort key (code).
However, it's not the minimum and maximum column values that we are seeing here for each block but instead the minimum and maximum sort key values. Continue reading here.