A single query optimization tip can boost your database performance by 100x. At one point, we advised one of our customers that had a 10TB database to use a date-based multi-column index. As a result, their date range query sped up by 112x. In this post,we share five simple but still powerful tips for PostgreSQL query optimization.
To keep it easy, we ran examples for this article on a test dataset. Although it doesn’t show the actual performance improvement, you will see that our tips solve the significant set of optimization problems and work well in real-world case scenarios.
Postgres has a cool extension to the well-known `EXPLAIN` command, which is called `EXPLAIN ANALYZE`. The difference is that `EXPLAIN` shows you query cost based on collected statistics about your database, and `EXPLAIN ANALYZE` actually runs it to show the processed time for every stage.
We highly recommend you use `EXPLAIN ANALYZE` because there are a lot of cases when `EXPLAIN` shows a higher query cost, while the time to execute is actually less and vice versa.
Tip: The most important thing is that the EXPLAIN command will help you to understand if a specific index is used and how.
The ability to see indexes is the first step to learning PostgreSQL query optimization.
Indexes are materialized copies of your table. They contain only specific columns of the table, so you can quickly find data based on the values in these columns. Indexes in Postgres also store row identifiers or row addresses used to speed up the original table scans.
It’s always a trade-off between storage space and query time, and a lot of indexes can introduce overhead for DML operations. However, when read query performance is a priority, as is the case with business analytics, it’s usually a well-working approach.
Tip: Create one index per unique query for better performance.
Look further in this post to learn how to create indexes for specific queries.
Using multiple columns in index
Let’s review the explain analyze plan of the following simple query without indexes:
EXPLAIN ANALYZE SELECT line_items.product_id, SUM(line_items.price)
WHERE product_id > 80
GROUP BY 1
An explain analyze returns:
HashAggregate (cost=13.81..14.52 rows=71 width=12) (actual time=0.137..0.141 rows=20 loops=1)
Group Key: product_id
-> Seq Scan on line_items (cost=0.00..13.25 rows=112 width=8) (actual time=0.017..0.082 rows=112 loops=1)
Filter: (product_id > 80)
Rows Removed by Filter: 388
Planning time: 0.082 ms
Execution time: 0.187 ms
This query scans all of the line items to find a product with an id that is greater than 80, and then sums up all the values grouped by that product id.
Now we’ll add the index to this table:
CREATE INDEX items_product_id ON line_items(product_id)
We created a B-tree index, which contains only one column: `product_id`. After reading lots of articles about the benefits of using index, one can expect a query boost from such an operation. Sorry, bad news.
As we need to sum up the price column in the query above, we still need to scan the original table. Depending on the table statistics, Postgres will choose to scan the original table instead of the index. The thing is, index lacks a `price` column.
We can tweak this index by adding a price column as follows:
CREATE INDEX items_product_id_price ON line_items(product_id, price)
If we rerun the explain plan, we’ll see our index is the forth line:
GroupAggregate (cost=0.27..7.50 rows=71 width=12) (actual time=0.034..0.090 rows=20 loops=1)
Group Key: product_id
-> Index Only Scan using items_product_id_price on line_items (cost=0.27..6.23 rows=112 width=8) (actual time=0.024..0.049 rows=112 loops=1)
Index Cond: (product_id > 80)
Heap Fetches: 0
Planning time: 0.271 ms
Execution time: 0.136 ms
But how would putting the price column first affect the PostgreSQL query optimization?
Well, we figured out that a multicolumn index is used in the previous query because we included both columns. The interesting thing is that we can use another order for these columns while defining the index:
CREATE INDEX items_product_id_price_reversed ON line_items(price, product_id)
If we rerun explain analyze, we’ll see that `items_product_id_price_reversed` is not used. That’s because this index is sorted firstly on `price` and then on `product_id`. Using this index will lead to its full scan, which is nearly equivalent to scanning the table. That’s why Postgres opts to use scan for an original table.
Tip: Put in the first place columns, which you use in filters with the biggest number of unique values.
It’s time to figure out what the best set of indexes is for a specific join query, which also has some filter conditions. Usually, you can achieve optimal results by trial and error.
Tip: As in the case of simple filtering, choose the most restrictive filtering condition and add an index for it.
Let’s consider an example:
SELECT orders.product_id, SUM(line_items.price)
LEFT JOIN orders ON line_items.order_id = orders.id
WHERE line_items.created_at BETWEEN '2018-01-01' and '2018-01-02'
GROUP BY 1
Here we have join on `order_id` and filter on `created_at`. This way, we can create a multicolumn index that will contain `created_at` in the first place and `order_id` in the second:
CREATE INDEX line_items_created_at_order_id ON line_items(created_at, order_id)
We’ll get the following explain plan:
GroupAggregate (cost=16.62..16.64 rows=1 width=12) (actual time=0.026..0.026 rows=1 loops=1)Group Key: orders.product_id-> Sort (cost=16.62..16.62 rows=1 width=8) (actual time=0.023..0.023 rows=1 loops=1)Sort Key: orders.product_id
Sort Method: quicksort Memory: 25kB
-> Nested Loop Left Join (cost=0.56..16.61 rows=1 width=8) (actual time=0.014..0.016 rows=1 loops=1)
-> Index Scan using line_items_created_at_order_id on line_items (cost=0.27..8.29 rows=1 width=8) (actual time=0.008..0.008 rows=1 loops=1)
Index Cond: ((created_at >= '2018-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2018-01-02 00:00:00'::timestamp without time zone)) -> Index Scan using orders_pkey on orders (cost=0.29..8.30 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=1)
Index Cond: (line_items.order_id = id)
Planning time: 0.269 ms
Execution time: 0.065 ms
As you can see, `line_items_created_at_order_id` is used to reduce scan by date condition. After that, it’s joined with orders using the `orders_pkey` index scan.
Tip: Date filters are usually one of the best candidates for the first column in a multicolumn index as it reduces scanning throughput in a predictable manner.
Our tips for PostgreSQL query optimization will help you to speed up queries 10-100x for multi-GB databases. They can solve most of your performance bottlenecks in an 80/20 manner. It doesn’t mean you shouldn’t double check your queries with `EXPLAIN` for real-world case scenarios.