postgres WAL - ghdrako/doc_snipets GitHub Wiki
- https://github.com/ghdrako/doc_snipets/wiki/postgres-vacume
- https://www.depesz.com/2011/07/14/write-ahead-log-understanding-postgresql-conf-checkpoint_segments-checkpoint_timeout-checkpoint_warning/
- https://pgpedia.info/l/LSN-log-sequence-number.html
WAL files are saved in "$PGDATA/pg_wal" by default
PostgreSQL stores transaction log information in the pg_wal directory (in PostgreSQL 9.6 and earlier: pg_xlog). The pg_wal directory contains a sequence of log files, each of which is called a WAL segment. The transaction log records the changes to the database in chronological order, including information about each transaction such as the type of transaction, the date and time of the transaction, the user who initiated the transaction, and the data that was changed.
In PostgreSQL 15, the WAL system uses a segmented architecture for performance and reliability. Each WAL segment is a fixed-size file on disk (16 MB by default) that is written sequentially; once a segment is no longer needed after a checkpoint, it is recycled or removed.
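The segment size is fixed when the cluster is initialized (initdb --wal-segsize) and can be checked at any time:
SHOW wal_segment_size; -- 16MB unless the cluster was initialized with a different size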
WAL configuration
SHOW wal_level;
SHOW max_wal_size;
SHOW min_wal_size;
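Both sizes can be changed without a restart; the values below are purely illustrative:
ALTER SYSTEM SET max_wal_size = '4GB';
ALTER SYSTEM SET min_wal_size = '1GB';
SELECT pg_reload_conf();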
Checkpoint configuration
SHOW checkpoint_timeout;
SHOW checkpoint_warning;
SHOW checkpoint_completion_target;
SHOW checkpoint_flush_after;
WAL monitoring
Monitor database performance metrics such as transaction throughput, checkpoint frequency, and disk I/O.
- https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-BACKUP
- script https://bucardo.org/check_postgres/
- pgmetrics
- pgDash
- views
select * from pg_settings where name like '%wal%';
select * from pg_ls_waldir();
select * FROM pg_stat_archiver;
Use views like pg_stat_bgwriter and pg_stat_progress_vacuum to analyze checkpoint activity and the vacuuming process.
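On PostgreSQL 14 and later, the pg_stat_wal view additionally reports cluster-wide WAL generation statistics (record count, full-page images, bytes written):
select wal_records, wal_fpi, pg_size_pretty(wal_bytes) as wal_bytes, wal_buffers_full
from pg_stat_wal;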
WAL size
-- Postgres 10+
select pg_size_pretty(sum(size)) from pg_ls_waldir();
-- as superuser (use 'pg_xlog' instead of 'pg_wal' on PostgreSQL 9.6 and earlier)
select * from pg_ls_dir('pg_wal');
select count(*) * pg_size_bytes(current_setting('wal_segment_size')) as total_size
from pg_ls_dir('pg_wal') as t(fname)
where fname <> 'archive_status';
-- Alternatively you can use pg_stat_file() to return information about the files:
select sum((pg_stat_file('pg_wal/'||fname)).size) as total_size
from pg_ls_dir('pg_wal') as t(fname);
-- Postgres 10+ also provides pg_current_wal_flush_lsn(), pg_current_wal_insert_lsn(), and pg_current_wal_lsn(), which you can feed to pg_walfile_name().
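For example, to see the current write position together with the WAL file it maps to (Postgres 10+):
select pg_current_wal_lsn() as current_lsn,
       pg_walfile_name(pg_current_wal_lsn()) as current_wal_file;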
WAL
WAL entries are often smaller than the page size. The following actions are recorded in WAL:
- page modifications performed in the buffer cache, since writes to data files are deferred
- transaction commits and rollbacks, since the status change happens in the clog buffers and does not make it to disk right away
- file operations (like creation and deletion of files and directories when tables get added or removed), since such operations must be in sync with data changes
WAL Structure
WAL is a stream of log entries of variable length. Each entry contains
- header
- data
WAL entries take up special buffers in the server's shared memory. The size of the cache used by WAL is defined by the wal_buffers parameter. By default, this size is chosen automatically as 1/32 of the total buffer cache size.
The WAL cache is quite similar to the buffer cache, but it usually operates in ring buffer mode: new entries are added at its head, while older entries are saved to disk starting at the tail. If the WAL cache is too small, disk synchronization will be performed more often than necessary.
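A quick way to inspect the current sizing (a wal_buffers setting of -1, the default, means the size is derived automatically from shared_buffers):
SHOW shared_buffers;
SHOW wal_buffers;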
Why WAL is growing
- https://www.depesz.com/2011/07/14/write-ahead-log-understanding-postgresql-conf-checkpoint_segments-checkpoint_timeout-checkpoint_warning/
Normally, PostgreSQL will keep the size of this directory between min_wal_size and max_wal_size.
Additionally, Pg checks the value of wal_keep_size. This can be set to an arbitrarily large value, but whatever it is set to, it is static, so while it can explain why the WAL directory is large, it can't explain why it keeps growing.
Generally there are three cases that cause it:
- archiving is failing
SELECT pg_walfile_name( pg_current_wal_lsn() ), * FROM pg_stat_archiver \gx
The most important columns are pg_walfile_name, which is the name of the current WAL file, and last_archived_wal.
Example 1
pg_walfile_name 00000001000123D100000076
last_archived_wal 00000001000123D10000006E
In this case there is a difference of 8 WAL segments, and since you can't archive the current WAL file (because it's still being written to), it looks like we have an archive lag of 7 segments. Each file is 16 MB, so the archiving lag on this system is/was 112 MB.
Example 2
pg_walfile_name 000000020000C6AE00000000
last_archived_wal 000000020000C6AD000000FF
To do the hex math, you first change "000000020000C6AE00000000" to "000000020000C6AE00" and "000000020000C6AD000000FF" to "000000020000C6ADFF", and then subtract one from the other; here the difference is 1 segment (see the SQL sketch below).
Check the PostgreSQL logs, or the logs of the tool you use for archiving.
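A SQL sketch of the same arithmetic, using the file names from Example 2: instead of doing the hex math by hand, each file name is converted back into an LSN (high 32 bits from characters 9-16, segment byte from characters 23-24) and Postgres subtracts them. This assumes the default 16 MB segment size and standard 24-character file names.
WITH names(cur, arch) AS (
  VALUES ('000000020000C6AE00000000', '000000020000C6AD000000FF')
)
SELECT (pg_wal_lsn_diff(
          (substr(cur,  9, 8) || '/' || substr(cur,  23, 2) || '000000')::pg_lsn,
          (substr(arch, 9, 8) || '/' || substr(arch, 23, 2) || '000000')::pg_lsn)
        / pg_size_bytes(current_setting('wal_segment_size')))::int AS segments_behind
FROM names;
Here it returns 1, matching the manual calculation.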
- archiving is slow: check the throughput of your archive_command; tools like pgBackRest support parallel, asynchronous archiving
- replication slots
SELECT pg_current_wal_lsn(), pg_current_wal_lsn() - restart_lsn, * FROM pg_replication_slots \gx
pg_current_wal_lsn │ A9C/97446FA8
slot_name │ my_slot
restart_lsn │ A9C/97091D60
Here we can see a situation where WAL is at location A9C/97446FA8 (which means it's in WAL file XXXXXXXX00000A9C00000097), while my_slot requires (column restart_lsn) WAL since location A9C/97091D60, which is still in the same WAL file as the current one, so it's OK.
Now let's consider a case where pg_current_wal_lsn is 1A4A5/602C8718 and restart_lsn is 1A4A4/DB60E3B0. In this case the replication lag on this slot is 2227938152 bytes (a bit over 2 GB). This means that even when archiving is working 100% OK, and min_wal_size, max_wal_size, and wal_keep_size are all below 200 MB, we will still have 2 GB of WAL files.
If such a case happens for you, find out what this slot is being used for, and either fix the replication so that WAL starts being consumed, or just drop the replication slot (using the pg_drop_replication_slot() function).
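A small sketch for reviewing all slots and the WAL they retain, and for dropping one that is confirmed to be abandoned (the slot name 'my_slot' is just an example):
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
-- only after confirming the slot is no longer needed:
SELECT pg_drop_replication_slot('my_slot');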
Config
=> ALTER SYSTEM SET full_page_writes = on;
=> ALTER SYSTEM SET wal_compression = on;
=> SELECT pg_reload_conf();
select sum(size) from pg_ls_waldir();
select pg_size_pretty(sum(size)) from pg_ls_waldir();
select * from pg_current_wal_flush_lsn();
select * from pg_current_wal_insert_lsn();
SELECT pg_size_pretty('0/3BE87658'::pg_lsn - '0/3A69C530'::pg_lsn); -- show the difference between two LSNs as a size
-- pg_walfile_name() requires an LSN argument:
select * from pg_walfile_name(pg_current_wal_lsn()); -- the WAL file that PostgreSQL is currently writing to
WAL increase
WAL is considered unneeded and eligible for removal after a checkpoint is made, which is why we check checkpoints first. Postgres has a special system view called pg_stat_bgwriter that has info on checkpoints:
- checkpoints_timed: a counter of checkpoints triggered because the time elapsed since the previous checkpoint exceeded the checkpoint_timeout setting. These are so-called scheduled checkpoints.
- checkpoints_req: a counter of checkpoints run because the amount of un-checkpointed WAL grew beyond the max_wal_size setting. These are requested checkpoints.
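A quick look at the ratio of requested to timed checkpoints (note: on PostgreSQL 17 and later these counters moved to the pg_stat_checkpointer view):
SELECT checkpoints_timed,
       checkpoints_req,
       round(100.0 * checkpoints_req / nullif(checkpoints_timed + checkpoints_req, 0), 1) AS pct_requested
FROM pg_stat_bgwriter;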
- WAL Bloats #1 - Long lasting transaction
- WAL Bloats #2 — Archiving
- WAL Bloats #3 — Replication Slots
pg_replication_slots
Write Ahead Log (WAL) (also called redo logs or transaction logs in other products)
As the DB is operating, changes are first written serially and synchronously to WAL files, then some time later, usually a very short time later, written to the DB data files. Once the data contained in these WAL files has been flushed out to its final destination in the data files, the WAL files are no longer needed by the primary.
When Postgres needs to make a change to data, it first records this change in the Write-Ahead-Log (WAL) and fsyncs the WAL to the disk. Fsync is a system call that guarantees that the data that was written has actually been written fully to persistent storage, not just stored in the filesystem cache.
ALTER SYSTEM SET fsync = 'on'; -- enable fsync (this is the default)
ALTER SYSTEM SET wal_sync_method = 'fsync'; -- options may include fdatasync, open_datasync, fsync, fsync_writethrough, open_sync (platform dependent)
By default, fsync is enabled.
Disabling fsync boosts insertion speed, but at the expense of data durability. Re-enabling fsync with the open_sync wal_sync_method strikes a balance, providing a compromise between performance and data safety. It is crucial to consider these trade-offs when configuring fsync based on the specific requirements and priorities of your application.
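To review the current durability-related settings:
SHOW fsync;
SHOW wal_sync_method;
SHOW synchronous_commit;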
At some future point, the database will issue a CHECKPOINT, which flushes all of the modified buffers to disk and stores the changes permanently in the data directory. When a CHECKPOINT has been completed, the database records the WAL position at which this occurs and knows that all of this data up to this point has been successfully written to disk. (It is worth noting that any changes that are written to disk via the CHECKPOINT already had their contents recorded previously in the WAL stream, so this is just a way of ensuring that the changes recorded by the WAL data are in fact applied to the actual on-disk relation files.)
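A checkpoint can also be forced manually, for example before taking a measurement or a backup (requires superuser, or membership in the pg_checkpoint role on PostgreSQL 15+):
CHECKPOINT;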
WAL is stored in on-disk files in 16MB chunks; as the database generates WAL files, it runs the archive_command on them to handle storing these outside of the database. The archive_command’s exit code will determine whether Postgres considers this to have succeeded or not. Once the archive_command completes successfully, Postgres considers the WAL segment to have been successfully handled and can remove it or recycle it to conserve space. If the archive_command fails, Postgres keeps the WAL file around and tries again indefinitely until this command succeeds. Even if a particular segment file is failing, Postgres will continue to accumulate WAL files for all additional database changes and continue retrying to archive the previous file in the background.
It is recommended to keep secondary copies of the WAL files in another location. This is known as WAL archiving and is done by ensuring archive_mode is turned on and a value has been set for archive_command.
WAL archiving and backups are typically used together since this then provides point-in-time recovery (PITR) where you can restore a backup to any specific point in time as long as you have the full WAL stream available between all backups.
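A minimal archiving sketch; the copy command and path are illustrative only (production setups usually use a dedicated tool such as pgBackRest or WAL-G). Note that changing archive_mode requires a server restart, while archive_command only needs a reload:
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'cp %p /mnt/server/archivedir/%f';
SELECT pg_reload_conf();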
Streaming Replication Slots
A replication slot is a persistent communication channel between a primary and a standby server. It is used to keep track of the WAL segments that the standby server has received and applied, and to ensure that the primary server does not remove any WAL segments that are still needed by the standby server.
Streaming replication slots are a feature available since PostgreSQL 9.4. Using replication slots will cause the primary to retain WAL files until the primary has been notified that the replica has received the WAL file.
If the replica is unavailable for any reason, WAL files will accumulate on the primary until it can deliver them. On a busy database, with replicas unavailable, disk space can be consumed very quickly as unreplicated WAL files accumulate. This can also happen on a busy DB with relatively slow infrastructure (network and storage).
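Since PostgreSQL 13, max_slot_wal_keep_size can cap how much WAL a slot may retain; slots that fall further behind are invalidated instead of filling the disk. The value below is illustrative:
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();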
Log Sequence Number
Transactions in PostgreSQL create WAL records which are ultimately appended to the WAL log (file). The position where the insert occurs is known as the Log Sequence Number (LSN). Values of LSN (of type pg_lsn) can be compared to determine the amount of WAL generated between two different offsets (in bytes). When used in this manner, it is important to know that the calculation assumes the full WAL segment (16 MB) was used if multiple WAL files are involved. A similar calculation to the one used here is often used to determine the latency of a replica.
The LSN is a 64-bit integer, representing a position in the write-ahead log stream. This 64-bit integer is split into two segments (high 32 bits and low 32 bits). It is printed as two hexadecimal numbers separated by a slash (XXXXXXXX/YYZZZZZZ). The 'X' represents the high 32 bits of the LSN and 'Y' is the high 8 bits of the lower 32-bit section. The 'Z' represents the offset position within the WAL file. Each element is a hexadecimal number. The 'X' and 'Y' values are used in the second and third parts of the WAL file name on a default PostgreSQL deployment.
-- Verify the current LSN on the primary server
SELECT pg_current_wal_lsn();
-- Verify the last receive and replay LSN on a delayed standby server
select now(),
pg_is_in_recovery(),
pg_is_wal_replay_paused(),
pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn();
The pg_wal_lsn_diff() function indicates the replication lag in bytes between the primary server and the standby server.
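For example, on the primary it can be combined with pg_stat_replication to see per-standby replay lag in bytes:
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;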
WAL File
The WAL file name is in the format TTTTTTTTXXXXXXXXYYYYYYYY. Here 'T' is the timeline, 'X' is the high 32-bits from the LSN, and 'Y' is the low 32-bits of the LSN.
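For example, with default 16 MB segments, LSN 0/3BE87658 falls in segment 0x3B of the low half of the LSN space (pg_walfile_name() uses the server's current timeline, so the first 8 digits may differ):
SELECT pg_walfile_name('0/3BE87658'::pg_lsn);
-- 00000001000000000000003B  (timeline | high 32 bits | segment number), assuming timeline 1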
SELECT pg_switch_wal(); -- force a switch to a new WAL segment file
WAL Structure
pg_waldump /var/lib/postgresql/data/pg_wal/000000010000000000000001
...
rmgr: Standby len (rec/tot): 42/ 42, tx: 1738, lsn: 0/01938698, prev 0/01938668, desc: LOCK xid 1738 db 5 rel 18065
rmgr: Heap len (rec/tot): 203/ 203, tx: 1738, lsn: 0/019386C8, prev 0/01938698, desc: INSERT off: 12, flags: 0x00, blkref #0: rel 1663/5/1259 blk 25
rmgr: Btree len (rec/tot): 64/ 64, tx: 1738, lsn: 0/01938798, prev 0/019386C8, desc: INSERT_LEAF off: 394, blkref #0: rel 1663/5/2662 blk 4
rmgr: Btree len (rec/tot): 112/ 112, tx: 1738, lsn: 0/019387D8, prev 0/01938798, desc: INSERT_LEAF off: 110, blkref #0: rel 1663/5/2663 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 1738, lsn: 0/01938848, prev 0/019387D8, desc: INSERT_LEAF off: 229, blkref #0: rel 1663/5/3455 blk 4
rmgr: Heap2 len (rec/tot): 176/ 176, tx: 1738, lsn: 0/01938888, prev 0/01938848, desc: MULTI_INSERT ntuples: 1, flags: 0x02, offsets: [36], blkref #0: rel 1663/5/1249 blk 113
rmgr: Btree len (rec/tot): 80/ 80, tx: 1738, lsn: 0/01938938, prev 0/01938888, desc: INSERT_LEAF off: 61, blkref #0: rel 1663/5/2658 blk 28
rmgr: Btree len (rec/tot): 64/ 64, tx: 1738, lsn: 0/01938988, prev 0/01938938, desc: INSERT_LEAF off: 319, blkref #0: rel 1663/5/2659 blk 18
rmgr: Heap len (rec/tot): 197/ 197, tx: 1738, lsn: 0/019389C8, prev 0/01938988, desc: INSERT off: 23, flags: 0x00, blkref #0: rel 1663/5/2610 blk 11
rmgr: Btree len (rec/tot): 64/ 64, tx: 1738, lsn: 0/01938A90, prev 0/019389C8, desc: INSERT_LEAF off: 200, blkref #0: rel 1663/5/2678 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 1738, lsn: 0/01938AD0, prev 0/01938A90, desc: INSERT_LEAF off: 180, blkref #0: rel 1663/5/2679 blk 2
rmgr: Heap2 len (rec/tot): 85/ 85, tx: 1738, lsn: 0/01938B10, prev 0/01938AD0, desc: MULTI_INSERT ntuples: 1, flags: 0x02, offsets: [132], blkref #0: rel 1663/5/2608 blk 26
rmgr: Btree len (rec/tot): 72/ 72, tx: 1738, lsn: 0/01938B68, prev 0/01938B10, desc: INSERT_LEAF off: 233, blkref #0: rel 1663/5/2673 blk 20
rmgr: Btree len (rec/tot): 72/ 72, tx: 1738, lsn: 0/01938BB0, prev 0/01938B68, desc: INSERT_LEAF off: 43, blkref #0: rel 1663/5/2674 blk 12
rmgr: XLOG len (rec/tot): 49/ 209, tx: 1738, lsn: 0/01938BF8, prev 0/01938BB0, desc: FPI , blkref #0: rel 1663/5/18065 blk 1 FPW
rmgr: XLOG len (rec/tot): 49/ 137, tx: 1738, lsn: 0/01938CD0, prev 0/01938BF8, desc: FPI , blkref #0: rel 1663/5/18065 blk 0 FPW
rmgr: Heap len (rec/tot): 188/ 188, tx: 1738, lsn: 0/01938D60, prev 0/01938CD0, desc: INPLACE off: 12, blkref #0: rel 1663/5/1259 blk 25
rmgr: Transaction len (rec/tot): 242/ 242, tx: 1738, lsn: 0/01938E20, prev 0/01938D60, desc: COMMIT 2024-10-02 21:47:28.453226 UTC; inval msgs: catcache 55 catcache 54 catcache 7 catcache 6 catcache 32 catcache 55 catcache 54 relcache 18065 relcache 17640 snapshot 2608 relcache 17640 relcache 18065
rmgr: Storage len (rec/tot): 42/ 42, tx: 0, lsn: 0/01938F18, prev 0/01938E20, desc: CREATE base/5/18066
rmgr: Standby len (rec/tot): 42/ 42, tx: 1739, lsn: 0/01938F48, prev 0/01938F18, desc: LOCK xid 1739 db 5 rel 18066
rmgr: Heap len (rec/tot): 203/ 203, tx: 1739, lsn: 0/01938F78, prev 0/01938F48, desc: INSERT off: 13, flags: 0x00, blkref #0: rel 1663/5/1259 blk 25
...
- Structure: Each line represents a WAL record, containing information about database operations.
- Components of each record:
- rmgr: Resource manager (e.g., Heap, Btree, Transaction)
- len: Length of the record
- tx: Transaction ID
- lsn: Log Sequence Number
- prev: Previous LSN
- desc: Description of the operation
- Types of operations visible:
- INSERT operations (Heap and Btree)
- MULTI_INSERT operations (Heap2)
- COMMIT transactions
- File operations (CREATE)
- Full Page Writes (FPW)
- Specific examples:
- Table inserts: `rmgr: Heap len (rec/tot): 203/203, tx: 1738, lsn: 0/019386C8, prev 0/01938698, desc: INSERT off: 12, flags: 0x00, blkref #0: rel 1663/5/1259 blk 25`
- Index updates: `rmgr: Btree len (rec/tot): 64/ 64, tx: 1738, lsn: 0/01938798, prev 0/019386C8, desc: INSERT_LEAF off: 394, blkref #0: rel 1663/5/2662 blk 4`
- Transaction commit: `rmgr: Transaction len (rec/tot): 242/ 242, tx: 1738, lsn: 0/01938E20, prev 0/01938D60, desc: COMMIT 2024-10-02 21:47:28.453226 UTC;`
This WAL output provides a detailed view of the database operations, allowing for analysis of transaction flow, data modifications, and system activities. It's particularly useful for understanding database behavior, troubleshooting, and in some cases, for point-in-time recovery.
pg_resetwal
pg_resetwal is a command that can be used to remove WAL files, for example when they have become corrupted. It should be used only as a last resort, since it can cause data loss.
pg_controldata
pg_controldata can be used to check control information about WAL and timelines in a PostgreSQL database cluster.