PostgreSQL with modern storage: what about a lower random_page_cost?


The topic was widely discussed last week, with several posts arguing for a lower value of random_page_cost:

100x Faster Query in Aurora Postgres with a lower random_page_cost
when we migrated ~1TB DB from heroku -> AWS, everything broke
Use random_page_cost = 1.1 on modern servers

The random_page_cost parameter accounts for the latency incurred by random reads, particularly from indexes with poor correlation factors. Its default value is 4. On the other hand, the seq_page_cost parameter accounts for the lower latency incurred by contiguous page reads, where many pages are read without additional seek. Its default value is 1. With modern storage, should you set a lower value for random_page_cost as those messages suggest?

As I care more about predictable performance and stable execution plans than about tuning magic global parameters, I'm in favor of keeping the default. Let me explain why.

HDD seek latency

It is commonly believed that default values were set based on disk latency when mechanical HDDs were prevalent.

Let's consider an example. If your storage has a seek latency of 10 milliseconds and a transfer rate of 4KB per millisecond, then reading one 8KB block takes 10+(8/4)=12 milliseconds. Reading 10 consecutive pages takes 10+10*(8/4)=30 milliseconds, or 3 milliseconds per page. With this model, reading a random page is 4 times more expensive than a sequential read (12/3=4).
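
As a quick sanity check, here is the same arithmetic expressed as a query (the 10ms seek, 4KB/ms transfer rate and 8KB page size are only the illustrative assumptions from above):

-- Illustrative HDD model: 10 ms seek, 4 KB/ms transfer, 8 KB pages
select 10 + 8/4.0                                as ms_per_random_page,  -- 12 ms
       (10 + 10 * (8/4.0)) / 10                  as ms_per_seq_page,     --  3 ms
       (10 + 8/4.0) / ((10 + 10 * (8/4.0)) / 10) as random_to_seq_ratio; --  4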

This could be the reason why default values were chosen in this manner:

postgres=# \dconfig *page_cost
     List of configuration parameters
    Parameter     | Value
------------------+-------
 random_page_cost | 4
 seq_page_cost    | 1

The input/output (I/O) specifications may appear low, but they are not too far from what 3600 RPM disks could achieve at that time. Oracle still employs these numbers, namely a 10ms I/O seek time and a 4KB per millisecond I/O transfer speed, when no system statistics are gathered, and this is the recommended practice. Oracle's cost model accounts for 8 contiguous blocks per I/O call, which makes random reads about 3.7 times more expensive than sequential ones. This estimate is not much different from the default of 4 used by PostgreSQL.
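
Working through that model, a single-block read takes 10 + 8/4 = 12 ms, and a multi-block read of 8 blocks takes 10 + 8*8/4 = 26 ms, or 3.25 ms per block, which gives the 3.7 ratio. Here is the same calculation as a query, assuming 8KB blocks:

-- Reconstructing the ratio: 10 ms seek, 4 KB/ms transfer, 8 KB blocks, 8 blocks per multi-block read
select 10 + 8/4.0                                      as single_block_ms,         -- 12 ms
       (10 + 8 * 8/4.0) / 8                            as multiblock_ms_per_block, -- 3.25 ms
       round((10 + 8/4.0) / ((10 + 8 * 8/4.0) / 8), 1) as ratio;                   -- ~3.7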

I'm comparing with Oracle Database on purpose, because this has been discussed a lot there: attempts to make index access more desirable by lowering optimizer_index_cost_adj, or by gathering system statistics to calibrate the storage. In the end, the consensus is that it brought more problems than solutions.

SSD random reads

When running on SSDs, which have no mechanical seek latency, people tend to want to change these values, since random reads are nearly as fast as sequential reads at the device level.

They often consider reducing random_page_cost to be closer to seq_page_cost. I believe this is not a good practice, for three reasons.

Firstly, modern storage systems are complex and have multiple levels of cache and network, which may increase or decrease the latency experienced by the database.
Secondly, it is better to have a stable and consistent model rather than having a different model for each system.
Lastly, a better outcome can be achieved by looking at the index design to get predictable performance.

Storage hierarchy

In modern systems, it is incorrect to assume that the pages that your query reads are equivalent to read requests to your HDD or SSD. In fact, the process is much more complex and involves several layers:

Firstly, the query reads pages from the PostgreSQL shared buffers.
If there are cache misses, it reads from Linux filesystem buffers.
If there are further cache misses, it prepares an I/O call.
The kernel may read more data when it detects sequential reads (readahead).
It sends an I/O call through the network (SAN, NAS, Cloud Storage).
It reads from the storage cache.
Finally, if there are still cache misses, it reads from the disk (HDD or SSD).

Is it reasonable to take the physical read latency of the SSD or HDD at the lowest level of this chain and use it to set the random_page_cost applied to the pages estimated at the highest level?

There is no seek penalty when accessing blocks that are already in memory at any of these cache levels, so for those reads the effective random_page_cost is equal to seq_page_cost. In OLTP applications, a high cache hit ratio is expected, which justifies a low value of random_page_cost regardless of the disk latency (you can check your hit ratio as shown below). But there is more to consider.
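
A quick way to check the shared-buffer hit ratio for your tables (only the PostgreSQL cache is visible here; the OS page cache and the storage cache are not):

-- Shared-buffer hit ratio per table, from PostgreSQL's point of view
select relname,
       heap_blks_hit, heap_blks_read,
       round(100.0 * heap_blks_hit / nullif(heap_blks_hit + heap_blks_read, 0), 2) as heap_hit_pct
from pg_statio_user_tables
order by heap_blks_read desc
limit 10;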

PostgreSQL reads data in single blocks, and sequential scans benefit from Linux read-ahead, which fetches contiguous blocks in advance. This advantage becomes more significant when I/O latency is high, which historically came from the rotation of HDDs directly attached to servers. Even if modern storage uses SSDs with lower latency, it often sits behind network calls and cloud instance limits. As a result, random_page_cost should remain higher than seq_page_cost.
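
Rather than reasoning from device specifications, you can measure the latency the database actually experiences. A minimal sketch, assuming you have the privilege to enable track_io_timing in your session (the table and predicate are placeholders):

-- Measure the read latency PostgreSQL actually waits for
set track_io_timing = on;
explain (analyze, buffers) select * from my_table where my_column <= 50000;
-- Blocks not found in shared buffers appear as "read" in the Buffers line,
-- and the plan then reports I/O Timings for the time spent waiting on them
-- (the exact label varies with the PostgreSQL version).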

Index with good correlation factor

I have created a table where the primary key index on id has a strong correlation factor, while the demo_val index on val has a weak one:

drop table if exists demo;
create table demo ( id bigserial primary key , val float default 100000*random() );
create index demo_val on demo (val asc);
-- id is inserted in order (strong correlation); val is random (weak correlation)
insert into demo(id) select generate_series(1,100000);
vacuum analyze demo;

The table contains one hundred thousand rows, which are spread across 541 pages:

postgres=# select reltuples, relpages
           from pg_class where relname = 'demo';
 reltuples | relpages
-----------+----------
    100000 |      541
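
You can verify the correlation the planner sees in pg_stats (the statistics were gathered by the vacuum analyze above); id should be close to 1 and val close to 0:

select attname, correlation
from pg_stats
where tablename = 'demo' and attname in ('id', 'val');
-- expect a correlation near 1 for id (stored in key order) and near 0 for val (random)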

When selecting half of the rows (50,000 out of 100,000), PostgreSQL uses an Index Scan:

postgres=# explain (analyze, buffers, summary off) select * from demo where id <= 50000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Scan using demo_pkey on demo (cost=0.29..1697.91 rows=49864 width=16) (actual time=0.013..8.382 rows=50000 loops=1)
Index Cond: (id <= 50000)
Buffers: shared hit=409
Planning:
Buffers: shared hit=3

This actually read 409 pages out of the table's 541. When reading a few more rows, it becomes equivalent to reading all the pages.

When I choose a search criterion that returns about 60% of the rows, the query planner selects a Sequential Scan instead of using the index:

postgres=# explain (analyze, buffers, summary off) select * from demo where id < 60000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Seq Scan on demo (cost=0.00..1791.00 rows=59968 width=16) (actual time=0.013..8.414 rows=59999 loops=1)
Filter: (id < 60000)
Rows Removed by Filter: 40001
Buffers: shared hit=541
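
The Seq Scan cost of 1791.00 is easy to reconstruct from the cost parameters, and random_page_cost plays no part in it: each page is charged seq_page_cost, and each row is charged cpu_tuple_cost plus cpu_operator_cost for the filter. A rough reconstruction, assuming the default values of 0.01 and 0.0025 for the CPU costs:

-- Seq Scan cost = relpages * seq_page_cost + reltuples * (cpu_tuple_cost + cpu_operator_cost)
select relpages * current_setting('seq_page_cost')::numeric
     + reltuples::numeric * ( current_setting('cpu_tuple_cost')::numeric
                            + current_setting('cpu_operator_cost')::numeric ) as seq_scan_cost
from pg_class where relname = 'demo';
-- 541*1 + 100000*(0.01 + 0.0025) = 1791, matching the plan above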

What changes can be expected if the value of random_page_cost is similar to that of seq_page_cost?

postgres=# set random_page_cost=1.1;
SET

postgres=# explain (analyze, buffers, summary off)
select * from demo where id < 50000
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Scan using demo_pkey on demo (cost=0.29..1299.66 rows=50021 width=12) (actual time=0.022..8.115 rows=49999 loops=1)
Index Cond: (id < 50000)
Buffers: shared hit=409

postgres=# explain (analyze, buffers, summary off)
select * from demo where id < 60000
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Scan using demo_pkey on demo (cost=0.29..1554.88 rows=59879 width=12) (actual time=0.023..9.898 rows=59999 loops=1)
Index Cond: (id < 60000)
Buffers: shared hit=490

postgres=# explain (analyze, buffers, summary off)
select * from demo where id < 70000
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Seq Scan on demo (cost=0.00..1791.00 rows=70015 width=12) (actual time=0.013..8.799 rows=69999 loops=1)
Filter: (id < 70000)
Rows Removed by Filter: 30001
Buffers: shared hit=541

By setting random_page_cost=1.1, the query planner chose an Index Scan over a Sequential Scan for a larger number of rows.

Does it matter? Probably not. I had to query tens of thousands of rows, more than half of the table, before the plan changed. Based on those numbers, I think it would be more scalable to add columns to the index and get an Index Only Scan with fewer pages to read, rather than switch to a full table scan on a large table that will keep growing. Alternatively, if this query reads a large portion of a table that isn't expected to grow, I would prefer a sequential scan even for a smaller result, to get predictable performance.
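
As a hedged sketch of the covering-index alternative mentioned above (the index name is mine, it assumes PostgreSQL 11 or later for INCLUDE, and whether it pays off depends on your real queries): with all selected columns in the index, the heap does not have to be visited and an Index Only Scan can serve the query from far fewer pages.

-- Hypothetical covering index: the val column is carried in the index leaf pages
create index demo_id_covering on demo (id) include (val);
vacuum analyze demo;  -- keep the visibility map fresh so the Index Only Scan avoids heap fetches
explain (analyze, buffers) select * from demo where id <= 50000;
-- expected: Index Only Scan using demo_id_covering ... Heap Fetches: 0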

Index with bad correlation factor

When there is poor correlation between the index entries and the physical location of the rows in the heap table, PostgreSQL uses a Bitmap Scan to avoid reading the same page multiple times. In my example, the inflection point where a Sequential Scan becomes preferable is between 38% and 39% of the rows:

postgres=# set random_page_cost=4;
SET
postgres=# explain (analyze, buffers, summary off)
select * from demo where val < 38000
;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on demo (cost=742.71..1765.04 rows=38506 width=12) (actual time=1.856..6.096 rows=38090 loops=1)
Recheck Cond: (val < 38000)
Heap Blocks: exact=541
Buffers: shared hit=651
-> Bitmap Index Scan on demo_val (cost=0.00..733.09 rows=38506 width=0) (actual time=1.792..1.792 rows=38090 loops=1)
Index Cond: (val < 38000)
Buffers: shared hit=110

postgres=# explain (analyze, buffers, summary off)
select * from demo where val < 39000
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Seq Scan on demo (cost=0.00..1791.00 rows=39502 width=12) (actual time=0.014..8.974 rows=39079 loops=1)
Filter: (val < 39000)
Rows Removed by Filter: 60921
Buffers: shared hit=541

If random_page_cost is set close to seq_page_cost, a distinct pattern emerges. A Bitmap Index Scan is used when reading less than 16% of rows, and a Seq Scan is used when reading more than 57% of rows:

postgres=# set random_page_cost=1.1;
SET

postgres=# explain (analyze, buffers, summary off)
select * from demo where val < 16000
;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on demo (cost=178.02..922.28 rows=16261 width=12) (actual time=0.736..3.692 rows=16091 loops=1)
Recheck Cond: (val < 16000)
Heap Blocks: exact=541
Buffers: shared hit=588
-> Bitmap Index Scan on demo_val (cost=0.00..173.95 rows=16261 width=0) (actual time=0.677..0.677 rows=16091 loops=1)
Index Cond: (val < 16000)
Buffers: shared hit=47

postgres=# explain (analyze, buffers, summary off)
select * from demo where val < 17000
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Scan using demo_val on demo (cost=0.29..953.18 rows=17302 width=12) (actual time=0.022..8.620 rows=17082 loops=1)
Index Cond: (val < 17000)
Buffers: shared hit=17103

postgres=# explain (analyze, buffers, summary off)
select * from demo where val < 57000
;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Seq Scan on demo (cost=0.00..1791.00 rows=57467 width=12) (actual time=0.012..9.650 rows=57286 loops=1)
Filter: (val < 57000)
Rows Removed by Filter: 42714
Buffers: shared hit=541

Between those two thresholds, a third plan appears: an Index Scan, chosen when the number of rows is estimated too large for a Bitmap Scan to stay cheaper, but not large enough to justify a sequential scan.

Is that better? It’s unlikely that having three execution plans instead of two will improve plan stability. In a test scenario where everything is in cache, the time taken doesn’t matter. However, the number of buffers read shows a problem: the Index Scan has read 30 times more blocks than a Seq Scan, and these blocks are random reads that cannot be optimized with read-ahead. If, for any reason, more memory is used on your system and the buffer reads become cache misses, the response time can increase by several orders of magnitude.

Find the inflection point

I didn't find these inflection points by guessing. Here is the code that I used:

set random_page_cost=1.1;
create temporary table if not exists plans( val float, sql text, plan json );
do $$
declare
 sql text;
 plan json;
begin
 -- run EXPLAIN for selectivities of 1%, 2%, ... 100% and store the JSON plans
 for i in 1..100 loop
  sql=format('explain (format json) select * from demo where val < %s', i*1000);
  execute sql into plan;
  insert into plans (val, sql, plan) values (i, sql, plan);
 end loop;
end;
$$;
-- group the stored plans by scan type to find the inflection points
with plans as (
 select val, plan->0->'Plan'->>'Node Type' scan, (plan->0->'Plan'->>'Total Cost')::float cost, sql from plans
) select scan, min(val) min_val, max(val) max_val, min(cost) min_cost, max(cost) max_cost from plans group by scan order by min_cost ;

This used the settings and the query of the last case, and here was the result:

       scan       | min_val | max_val | min_cost | max_cost
------------------+---------+---------+----------+----------
 Bitmap Heap Scan |       1 |      16 |   537.64 |   922.28
 Index Scan       |      17 |      56 |   953.18 |  1762.64
 Seq Scan         |      57 |     100 |     1791 |     1791
(3 rows)

It can be useful to test your query at different selectivities and see where it can go wrong.

What about your data model and indexing strategy?

This test doesn’t clearly indicate the most reasonable default for random_page_cost. Therefore, I suggest sticking with the default value because it offers a significant advantage: it’s what most people use. By doing so, you can avoid any unexpected surprises that may arise from a custom configuration.

Based on the test results, the index on the val column is not well suited for queries that read a large number of rows: they end up reading nearly all pages of the table, which cannot be expected to stay in cache once the table grows. With the non-default random_page_cost setting, the situation gets even worse, as the planner can switch to a plan that reads 30 times more pages than necessary.

There is an inflection point where the query planner has to choose between using an index or scanning the entire table. This decision may change when a single new row is added to a large table. If the table is near this point, the query’s performance will depend on several factors, such as available memory for cache, IOPS, and bandwidth limits. These factors cannot be accurately predicted because they depend on the total workload. You can modify this point using random_page_cost, but it’s best to avoid this unpredictable zone altogether.

I recommend spending more time looking at execution plans and less time tuning global parameters.

Using high random_page_cost as a red flag for bad indexes

If you notice that changing the random_page_cost parameter results in different response times, it could be an indication that some of your access patterns do not have optimal indexes. In this case, the query planner is forced to come up with the “least bad” solution. Instead of decreasing the parameter to get more indexes used, you should focus on making those indexes more appealing to the query planner. When your index is used with a high value of random_page_cost and provides the expected performance, you can be confident that you will avoid unpleasant surprises when your users query slightly more rows.

If you plan to lower random_page_cost, it is crucial to understand why it improves performance rather than basing the decision on a single test. As your data grows and the workload increases, the index access that provides a quick response time today may take much longer if the memory available for caching shrinks. Evaluate the performance impact of any such change and make sure it remains an improvement over time, rather than a short-term fix.
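
If you do experiment, here is a hedged sketch for testing the effect on a single query without touching the global setting; SET LOCAL only lasts until the end of the transaction:

begin;
set local random_page_cost = 1.1;   -- visible to this transaction only
explain (analyze, buffers) select * from demo where val < 20000;
rollback;                           -- the parameter reverts with the transaction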

I believe that the decision between an Index Scan and a Sequential Scan is about scalability rather than response time. The response time of an Index Scan is proportional to the size of the result set, while the response time of a Sequential Scan is proportional to the size of the table being scanned. It is important to understand which one will grow, in order to make an informed decision and provide indexes that the query planner will choose without depending on a dubious global setting.
