postgres buffer cache - ghdrako/doc_snipets GitHub Wiki

Buffer cache is located in the server’s shared memory and is accessible to all the processes. It takes the major part of the shared memory and is surely one of the most important and complex data structures in PostgreSQL.It consists of an array of buffers. Each buffer reserves a memory chuck that can accommodate a single data page together with its header.

A header contains some information about the buf f er and the page in it, such as:

  • physical location of the page (file ID, fork, and block number in the fork)
  • the attribute showing that the data in the page has been modified and sooner or later has to be written back to disk (such a page is called dirty)
  • buffer usage count
  • pin count (or reference count)

To get access to a relation’s data page, a process requests it from the buf f er man-ager1and receives the ID of the buffer that contains this page. Then it reads the cached data and modif i es it right in the cache if needed. While the page is in use, its buffer is pinned. Pins forbid eviction of the cached page and can be applied together with other locks. Each pin increments the usage count as well.

As long as the page is cached, its usage does not incur any file operations.

dirty page: it has been modif i ed, but is not written to disk yet.

Cache Hits

When the buffer manager has to read a page,it first checks the buffer cache. All buffer IDs are stored in a hash table,which is used to speed up their search.

Buffer cache uses the extendible table that resolves hash collisions by chaining.

A hash key consists of the ID of the relation file, the type of the fork, and the ID of the page within this fork’s file. Thus, knowing the page, PostgreSQL can quickly find the buffer containing this page or make sure that the page is not currently cached.

If the hash table contains the required buffer ID, the buffer manager pins this buffer and returns its ID to the process. Then this process can start using the cached page without incurring any I/O traffic. To pin a buffer, PostgreSQL has to increment the pin counter in its header; a buffer can be pinned by several processes at a time. While its pin counter is greater than zero, the buffer is assumed to be in use, and no radical changes in its contents are allowed.

The pg_statio_all_tables view contains the complete statistics on buffer cache usage by tables:

=> SELECT heap_blks_read, heap_blks_hit FROM pg_statio_all_tables

PostgreSQL provides similar views for indexes and sequences. They can also display statistics on I/O operations, but only if offtrack_io_timing is enabled.

The usage count is incremented each time the buffer is accessed (that is, pinned), and reduced when the buf fer manager is searching for pages to evict. The first unpinned buffer with the zero count found by the clock hand will be cleared. As a result, the least recently used pages are evicted first, while those that have been accessed more often will remain in the cache longer.

If all the buf f ers have a non-zero usage count, the clock hand has to complete more than one full circle before any of them f i nally reaches the zero value. To avoid running multiple laps, PostgreSQL limits the usage count by 5. Once the buffer to evict is found, the reference to the page that is still in this buffer must be removed from the hash table. But if this buffer is dirty, that is, it contains some modified data, the old page cannot be simply thrown away—the buffer manager has to write it to disk first.

Explain plan shows Hits(read from cache) and Reads(reads from disk). Jesli wspolczynnik jest niski za pierwszym razem a gdy powtarzamy zapytanie wysoki to nalezy rozwazyc zwiekszenie buffer cache. Wymaga to restartu instancji.

shared_buffers

The shared_buffers parameter controls the size of the shared memory, accessible to all backend processes. If any backend needs to retrieve a page, it first checks the shared buffers. If the page is present - it's a hit. No disk I/O is needed. Miss? Read it from disk (or OS cache) and store it in shared buffers for next time.

However, there's more to the buffer pool than just the pages themselves. PostgreSQL needs to track metadata and provide fast lookups, so the shared memory area is organized into three components:

  • buffer blocks - the actual 8KB pages where the data lives
  • buffer descriptors - a parallel array of ~64-byte structures, one per slot.
  • hash table used for mapping page identifiers to individual buffer slots.

Each descriptor then tracks which page is cached in the slot (tag), flags about the state (dirty, valid and I/O in-progress), and pin/usage counters. O(1) = constant time, regardless of buffer pool size. The hash table enables fast lookups. When a backend needs a specific page, it hashes the page identifier and jumps directly to the right bucket—no need to scan all 16,384 slots. This keeps buffer lookups at O(1) regardless of pool size.

When a backend needs page N of table orders, it hashes the identifier, looks up the hash table - which drives the hit/miss logic.

ring buffers

mall, private buffer pools for bulk operations. Instead of using the shared buffer pool, certain operations get their own limited ring.

The individual cases are:

Sequential scans on large tables over 1/4 of shared_buffers use a dedicated 256 KB ring buffer. Pages cycle through this tiny ring and never touch the main cache.

-- perform a sequential scan over a large table
EXPLAIN (ANALYZE, BUFFERS) SELECT count(1) FROM ring_buffer_test;

Which would show you Buffers: shared read=127285 instead of hit.

Bulk writes (COPY, CREATE TABLE AS) use a 16MB capped ring buffer, large enough for efficient batching, yet small enough not to pollute the shared buffer pool.

As VACUUM touches every page and shouldn't evict the hot data, it uses its own dedicated ring buffer. Historically it was set to 256KB, but since PostgreSQL 17 it can be configured using vacuum_buffer_usage_limit (while previous two are given).

Local buffers

The second kind of exception to shared buffer pool, is the session based temporary tables. In this case as the concurrency is out of question, each backend has its own local buffer pool, controlled by temp_buffers (8MB default).

Local buffers are faster than shared buffers because they have simpler locking. There is no need for the heavy cross-process coordination required in the main cache.

Many developers tend to default to complex CTE logic for intermediate data, but using temporary tables offers a distinct advantage in the form of lower I/O as the changes to temporary tables are not WAL logged and by their nature reduce pollution of shared buffer pools.

If your workload involves large temporary tables, increasing temp_buffers can help keep those operations purely in RAM. However, remember that this memory is per-connection, so it multiplies across all backends.

OS Page Cache

he OS also provides read-ahead - detecting sequential access patterns and pre-loading pages before PostgreSQL requests them.

This relationship explains the classic advice: set shared_buffers to 25% of RAM. You're deliberately leaving space for the OS to act as a safety net.

-- on dedicated servers with large RAM, 40% can work well
-- but always leave room for OS cache and other processes
ALTER SYSTEM SET shared_buffers = '8GB';

PostgreSQL needs to know about this combined cache for query planning. That's where effective_cache_size comes in - it tells the planner how much total cache (shared buffers + OS) to assume when estimating costs. This parameter allocates no memory - it's purely a hint for cost estimation.

-- estimate of total cache available (shared + OS)
SHOW effective_cache_size;

A higher value encourages the planner to favor index scans, assuming data is likely cached somewhere even if not in shared buffers.