postgres WAL - ghdrako/doc_snipets GitHub Wiki
- https://github.com/ghdrako/doc_snipets/wiki/postgres-vacume
- https://www.depesz.com/2011/07/14/write-ahead-log-understanding-postgresql-conf-checkpoint_segments-checkpoint_timeout-checkpoint_warning/
- https://pgpedia.info/l/LSN-log-sequence-number.html
WAL files are saved in "$PGDATA/pg_wal" by default
PostgreSQL stores transaction log information in the pg_wal directory (in PostgreSQL 9.6 and earlier: pg_xlog). The pg_wal directory contains a sequence of log files, each of which is called a WAL segment. The transaction log records the changes to the database in chronological order, including information about each transaction such as the type of transaction, the date and time of the transaction, the user who initiated the transaction, and the data that was changed.
In PostgreSQL 15, the WAL system uses a segmented architecture for performance and reliability. Each WAL segment is a fixed-size file on disk (16 MB by default) that is written sequentially; once a segment is no longer needed after a checkpoint, it is recycled or removed.
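The segment size is fixed when the cluster is initialized (initdb --wal-segsize) and can be checked at any time:
SHOW wal_segment_size; -- 16MB unless the cluster was initialized with a different size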
WAL configuration
SHOW wal_level;
SHOW max_wal_size;
SHOW min_wal_size;
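Both sizes can be changed without a restart; the values below are purely illustrative:
ALTER SYSTEM SET max_wal_size = '4GB';
ALTER SYSTEM SET min_wal_size = '1GB';
SELECT pg_reload_conf();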
Checkpoint configuration
SHOW checkpoint_timeout;
SHOW checkpoint_warning;
SHOW checkpoint_completion_target;
SHOW checkpoint_flush_after;
WAL monitoring
Monitor database performance metrics such as transaction throughput, checkpoint frequency, and disk I/O.
- https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-BACKUP
- script https://bucardo.org/check_postgres/
- pgmetrics
- pgDash
- views
select * from pg_settings where name like '%wal%';
select * from pg_ls_waldir();
select * FROM pg_stat_archiver;
Use views like pg_stat_bgwriter and pg_stat_progress_vacuum to analyze checkpoint activity and the vacuuming process.
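On PostgreSQL 14 and later, the pg_stat_wal view additionally reports cluster-wide WAL generation statistics (record count, full-page images, bytes written):
select wal_records, wal_fpi, pg_size_pretty(wal_bytes) as wal_bytes, wal_buffers_full
from pg_stat_wal;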
WAL size
-- Postgres 10+
select pg_size_pretty(sum(size)) from pg_ls_waldir();
-- as superuser (use 'pg_xlog' instead of 'pg_wal' on PostgreSQL 9.6 and earlier)
select * from pg_ls_dir('pg_wal');
select count(*) * pg_size_bytes(current_setting('wal_segment_size')) as total_size
from pg_ls_dir('pg_wal') as t(fname)
where fname <> 'archive_status';
-- Alternatively you can use pg_stat_file() to return information about the files:
select sum((pg_stat_file('pg_wal/'||fname)).size) as total_size
from pg_ls_dir('pg_wal') as t(fname);
-- Postgres 10+ also provides pg_current_wal_flush_lsn(), pg_current_wal_insert_lsn(), and pg_current_wal_lsn(), which you can feed to pg_walfile_name().
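For example, to see the current write position together with the WAL file it maps to (Postgres 10+):
select pg_current_wal_lsn() as current_lsn,
       pg_walfile_name(pg_current_wal_lsn()) as current_wal_file;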
WAL
WAL entries are often smaller than the page size. The following actions are recorded in WAL:
- page modifications performed in the buffer cache, since writes to data files are deferred
- transaction commits and rollbacks, since the status change happens in the clog buffers and does not make it to disk right away
- file operations (like creation and deletion of files and directories when tables get added or removed), since such operations must be in sync with data changes
WAL Structure
WAL is a stream of log entries of variable length. Each entry contains
- header
- data
WAL entries take up special buffers in the server's shared memory. The size of the cache used by WAL is defined by the wal_buffers parameter. By default, this size is chosen automatically as 1/32 of the total buffer cache size.
The WAL cache is quite similar to the buffer cache, but it usually operates in ring buffer mode: new entries are added at its head, while older entries are saved to disk starting at the tail. If the WAL cache is too small, disk synchronization will be performed more often than necessary.
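A quick way to inspect the current sizing (a wal_buffers setting of -1, the default, means the size is derived automatically from shared_buffers):
SHOW shared_buffers;
SHOW wal_buffers;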
Why WAL is growing
- https://www.depesz.com/2011/07/14/write-ahead-log-understanding-postgresql-conf-checkpoint_segments-checkpoint_timeout-checkpoint_warning/
Normally, PostgreSQL will keep the size of this directory between min_wal_size and max_wal_size.
Additionally, Pg checks the value of wal_keep_size. This can be set to an arbitrarily large value, but whatever it is set to, it is static, so while it can explain why the WAL directory is large, it can't explain why it keeps growing.
Generally there are three cases that cause it:
- archiving is failing
SELECT pg_walfile_name( pg_current_wal_lsn() ), * FROM pg_stat_archiver \gx
The most important columns are pg_walfile_name, which is the name of the current WAL file, and last_archived_wal.
Example 1
pg_walfile_name 00000001000123D100000076
last_archived_wal 00000001000123D10000006E
In this case there is a difference of 8 WAL segments, and since you can't archive the current WAL file (because it's still being written to), it looks like we have an archive lag of 7 segments. Each file is 16 MB, so the archiving lag on this system is/was 112 MB.
Example 2
pg_walfile_name 000000020000C6AE00000000
last_archived_wal 000000020000C6AD000000FF
To do the hex math, you first change "000000020000C6AE00000000" to "000000020000C6AE00" and "000000020000C6AD000000FF" to "000000020000C6ADFF", and then subtract one from the other; here the difference is 1 segment (see the SQL sketch below).
Check the PostgreSQL logs, or the logs of the tool you use for archiving.
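A SQL sketch of the same arithmetic, using the file names from Example 2: instead of doing the hex math by hand, each file name is converted back into an LSN (high 32 bits from characters 9-16, segment byte from characters 23-24) and Postgres subtracts them. This assumes the default 16 MB segment size and standard 24-character file names.
WITH names(cur, arch) AS (
  VALUES ('000000020000C6AE00000000', '000000020000C6AD000000FF')
)
SELECT (pg_wal_lsn_diff(
          (substr(cur,  9, 8) || '/' || substr(cur,  23, 2) || '000000')::pg_lsn,
          (substr(arch, 9, 8) || '/' || substr(arch, 23, 2) || '000000')::pg_lsn)
        / pg_size_bytes(current_setting('wal_segment_size')))::int AS segments_behind
FROM names;
Here it returns 1, matching the manual calculation.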
- archiving is slow: check the throughput of your archive_command; tools like pgBackRest support parallel, asynchronous archiving
- replication slots
SELECT pg_current_wal_lsn(), pg_current_wal_lsn() - restart_lsn, * FROM pg_replication_slots \gx
pg_current_wal_lsn │ A9C/97446FA8
slot_name │ my_slot
restart_lsn │ A9C/97091D60
Here we can see a situation where WAL is at location A9C/97446FA8 (which means it's in WAL file XXXXXXXX00000A9C00000097), while my_slot requires (column restart_lsn) WAL since location A9C/97091D60, which is still in the same WAL file as the current one, so it's OK.
Now let's consider a case where pg_current_wal_lsn is 1A4A5/602C8718 and restart_lsn is 1A4A4/DB60E3B0. In this case the replication lag on this slot is 2227938152 bytes (a bit over 2 GB). This means that even when archiving is working 100% OK, and min_wal_size, max_wal_size, and wal_keep_size are all below 200 MB, we will still have 2 GB of WAL files.
If such a case happens for you, find out what this slot is being used for, and either fix the replication so that WAL starts being consumed, or just drop the replication slot (using the pg_drop_replication_slot() function).
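A small sketch for reviewing all slots and the WAL they retain, and for dropping one that is confirmed to be abandoned (the slot name 'my_slot' is just an example):
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
-- only after confirming the slot is no longer needed:
SELECT pg_drop_replication_slot('my_slot');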
Config
=> ALTER SYSTEM SET full_page_writes = on;
=> ALTER SYSTEM SET wal_compression = on;
=> SELECT pg_reload_conf();
select sum(size) from pg_ls_waldir();
select pg_size_pretty(sum(size)) from pg_ls_waldir();
select * from pg_current_wal_flush_lsn();
select * from pg_current_wal_insert_lsn();
SELECT pg_size_pretty('0/3BE87658'::pg_lsn - '0/3A69C530'::pg_lsn); -- show the difference between two LSNs as a size
-- pg_walfile_name() requires an LSN argument:
select * from pg_walfile_name(pg_current_wal_lsn()); -- the WAL file that PostgreSQL is currently writing to
WAL increase
WAL is considered unneeded and eligible for removal after a checkpoint is made, which is why we check checkpoints first. Postgres has a special system view called pg_stat_bgwriter that has info on checkpoints:
- checkpoints_timed: a counter of checkpoints triggered because the time elapsed since the previous checkpoint exceeded the checkpoint_timeout setting. These are so-called scheduled checkpoints.
- checkpoints_req: a counter of checkpoints run because the amount of un-checkpointed WAL grew beyond the max_wal_size setting. These are requested checkpoints.
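A quick look at the ratio of requested to timed checkpoints (note: on PostgreSQL 17 and later these counters moved to the pg_stat_checkpointer view):
SELECT checkpoints_timed,
       checkpoints_req,
       round(100.0 * checkpoints_req / nullif(checkpoints_timed + checkpoints_req, 0), 1) AS pct_requested
FROM pg_stat_bgwriter;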
- WAL Bloats #1 - Long lasting transaction
- WAL Bloats #2 — Archiving
- WAL Bloats #3 — Replication Slots
pg_replication_slots
Write Ahead Log (WAL) (also called redo logs or transaction logs in other products)
As the DB is operating, changes are first written serially and synchronously to WAL files, then some time later, usually a very short time later, written to the DB data files. Once the data contained in these WAL files has been flushed out to its final destination in the data files, the WAL files are no longer needed by the primary.
When Postgres needs to make a change to data, it first records this change in the Write-Ahead-Log (WAL) and fsyncs the WAL to the disk. Fsync is a system call that guarantees that the data that was written has actually been written fully to persistent storage, not just stored in the filesystem cache.
ALTER SYSTEM SET fsync = 'on'; -- enable fsync (this is the default)
ALTER SYSTEM SET wal_sync_method = 'fsync'; -- options may include fdatasync, open_datasync, fsync, fsync_writethrough, open_sync (platform dependent)
By default, fsync is enabled.
Disabling fsync boosts insertion speed, but at the expense of data durability. Re-enabling fsync with the open_sync wal_sync_method strikes a balance, providing a compromise between performance and data safety. It is crucial to consider these trade-offs when configuring fsync based on the specific requirements and priorities of your application.
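To review the current durability-related settings:
SHOW fsync;
SHOW wal_sync_method;
SHOW synchronous_commit;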
At some future point, the database will issue a CHECKPOINT, which flushes all of the modified buffers to disk and stores the changes permanently in the data directory. When a CHECKPOINT has been completed, the database records the WAL position at which this occurs and knows that all of this data up to this point has been successfully written to disk. (It is worth noting that any changes that are written to disk via the CHECKPOINT already had their contents recorded previously in the WAL stream, so this is just a way of ensuring that the changes recorded by the WAL data are in fact applied to the actual on-disk relation files.)
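A checkpoint can also be forced manually, for example before taking a measurement or a backup (requires superuser, or membership in the pg_checkpoint role on PostgreSQL 15+):
CHECKPOINT;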
WAL is stored in on-disk files in 16MB chunks; as the database generates WAL files, it runs the archive_command on them to handle storing these outside of the database. The archive_command’s exit code will determine whether Postgres considers this to have succeeded or not. Once the archive_command completes successfully, Postgres considers the WAL segment to have been successfully handled and can remove it or recycle it to conserve space. If the archive_command fails, Postgres keeps the WAL file around and tries again indefinitely until this command succeeds. Even if a particular segment file is failing, Postgres will continue to accumulate WAL files for all additional database changes and continue retrying to archive the previous file in the background.
It is recommended to keep secondary copies of the WAL files in another location. This is known as WAL archiving and is done by ensuring archive_mode is turned on and a value has been set for archive_command.
WAL archiving and backups are typically used together since this then provides point-in-time recovery (PITR) where you can restore a backup to any specific point in time as long as you have the full WAL stream available between all backups.
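A minimal archiving sketch; the copy command and path are illustrative only (production setups usually use a dedicated tool such as pgBackRest or WAL-G). Note that changing archive_mode requires a server restart, while archive_command only needs a reload:
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'cp %p /mnt/server/archivedir/%f';
SELECT pg_reload_conf();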
Streaming Replication Slots
A replication slot is a persistent communication channel between a primary and a standby server. It is used to keep track of the WAL segments that the standby server has received and applied, and to ensure that the primary server does not remove any WAL segments that are still needed by the standby server.
Streaming replication slots are a feature available since PostgreSQL 9.4. Using replication slots will cause the primary to retain WAL files until the primary has been notified that the replica has received the WAL file.
If the replica is unavailable for any reason, WAL files will accumulate on the primary until it can deliver them. On a busy database, with replicas unavailable, disk space can be consumed very quickly as unreplicated WAL files accumulate. This can also happen on a busy DB with relatively slow infrastructure (network and storage).
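Since PostgreSQL 13, max_slot_wal_keep_size can cap how much WAL a slot may retain; slots that fall further behind are invalidated instead of filling the disk. The value below is illustrative:
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();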
Log Sequence Number
Transactions in PostgreSQL create WAL records which are ultimately appended to the WAL log (file). The position where the insert occurs is known as the Log Sequence Number (LSN). Values of LSN (of type pg_lsn) can be compared to determine the amount of WAL generated between two different offsets (in bytes). When used in this manner, it is important to know that the calculation assumes the full WAL segment (16 MB) was used if multiple WAL files are involved. A similar calculation to the one used here is often used to determine the latency of a replica.
The LSN is a 64-bit integer, representing a position in the write-ahead log stream. This 64-bit integer is split into two segments (high 32 bits and low 32 bits). It is printed as two hexadecimal numbers separated by a slash (XXXXXXXX/YYZZZZZZ). The 'X' represents the high 32 bits of the LSN and 'Y' is the high 8 bits of the lower 32-bit section. The 'Z' represents the offset position within the WAL file. Each element is a hexadecimal number. The 'X' and 'Y' values are used in the second and third parts of the WAL file name on a default PostgreSQL deployment.
-- Verify the current LSN on the primary server
SELECT pg_current_wal_lsn();
-- Verify the last receive and replay LSN on a delayed standby server
select now(),
pg_is_in_recovery(),
pg_is_wal_replay_paused(),
pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn();
The pg_wal_lsn_diff() function indicates the replication lag in bytes between the primary server and the standby server.
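For example, on the primary it can be combined with pg_stat_replication to see per-standby replay lag in bytes:
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;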
WAL File
The WAL file name is in the format TTTTTTTTXXXXXXXXYYYYYYYY. Here 'T' is the timeline, 'X' is the high 32-bits from the LSN, and 'Y' is the low 32-bits of the LSN.
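For example, with default 16 MB segments, LSN 0/3BE87658 falls in segment 0x3B of the low half of the LSN space (pg_walfile_name() uses the server's current timeline, so the first 8 digits may differ):
SELECT pg_walfile_name('0/3BE87658'::pg_lsn);
-- 00000001000000000000003B  (timeline | high 32 bits | segment number), assuming timeline 1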
SELECT pg_switch_wal(); -- force a switch to a new WAL segment file
WAL Structure
pg_waldump /var/lib/postgresql/data/pg_wal/000000010000000000000001
...
rmgr: Standby len (rec/tot): 42/ 42, tx: 1738, lsn: 0/01938698, prev 0/01938668, desc: LOCK xid 1738 db 5 rel 18065
rmgr: Heap len (rec/tot): 203/ 203, tx: 1738, lsn: 0/019386C8, prev 0/01938698, desc: INSERT off: 12, flags: 0x00, blkref #0: rel 1663/5/1259 blk 25
rmgr: Btree len (rec/tot): 64/ 64, tx: 1738, lsn: 0/01938798, prev 0/019386C8, desc: INSERT_LEAF off: 394, blkref #0: rel 1663/5/2662 blk 4
rmgr: Btree len (rec/tot): 112/ 112, tx: 1738, lsn: 0/019387D8, prev 0/01938798, desc: INSERT_LEAF off: 110, blkref #0: rel 1663/5/2663 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 1738, lsn: 0/01938848, prev 0/019387D8, desc: INSERT_LEAF off: 229, blkref #0: rel 1663/5/3455 blk 4
rmgr: Heap2 len (rec/tot): 176/ 176, tx: 1738, lsn: 0/01938888, prev 0/01938848, desc: MULTI_INSERT ntuples: 1, flags: 0x02, offsets: [36], blkref #0: rel 1663/5/1249 blk 113
rmgr: Btree len (rec/tot): 80/ 80, tx: 1738, lsn: 0/01938938, prev 0/01938888, desc: INSERT_LEAF off: 61, blkref #0: rel 1663/5/2658 blk 28
rmgr: Btree len (rec/tot): 64/ 64, tx: 1738, lsn: 0/01938988, prev 0/01938938, desc: INSERT_LEAF off: 319, blkref #0: rel 1663/5/2659 blk 18
rmgr: Heap len (rec/tot): 197/ 197, tx: 1738, lsn: 0/019389C8, prev 0/01938988, desc: INSERT off: 23, flags: 0x00, blkref #0: rel 1663/5/2610 blk 11
rmgr: Btree len (rec/tot): 64/ 64, tx: 1738, lsn: 0/01938A90, prev 0/019389C8, desc: INSERT_LEAF off: 200, blkref #0: rel 1663/5/2678 blk 1
rmgr: Btree len (rec/tot): 64/ 64, tx: 1738, lsn: 0/01938AD0, prev 0/01938A90, desc: INSERT_LEAF off: 180, blkref #0: rel 1663/5/2679 blk 2
rmgr: Heap2 len (rec/tot): 85/ 85, tx: 1738, lsn: 0/01938B10, prev 0/01938AD0, desc: MULTI_INSERT ntuples: 1, flags: 0x02, offsets: [132], blkref #0: rel 1663/5/2608 blk 26
rmgr: Btree len (rec/tot): 72/ 72, tx: 1738, lsn: 0/01938B68, prev 0/01938B10, desc: INSERT_LEAF off: 233, blkref #0: rel 1663/5/2673 blk 20
rmgr: Btree len (rec/tot): 72/ 72, tx: 1738, lsn: 0/01938BB0, prev 0/01938B68, desc: INSERT_LEAF off: 43, blkref #0: rel 1663/5/2674 blk 12
rmgr: XLOG len (rec/tot): 49/ 209, tx: 1738, lsn: 0/01938BF8, prev 0/01938BB0, desc: FPI , blkref #0: rel 1663/5/18065 blk 1 FPW
rmgr: XLOG len (rec/tot): 49/ 137, tx: 1738, lsn: 0/01938CD0, prev 0/01938BF8, desc: FPI , blkref #0: rel 1663/5/18065 blk 0 FPW
rmgr: Heap len (rec/tot): 188/ 188, tx: 1738, lsn: 0/01938D60, prev 0/01938CD0, desc: INPLACE off: 12, blkref #0: rel 1663/5/1259 blk 25
rmgr: Transaction len (rec/tot): 242/ 242, tx: 1738, lsn: 0/01938E20, prev 0/01938D60, desc: COMMIT 2024-10-02 21:47:28.453226 UTC; inval msgs: catcache 55 catcache 54 catcache 7 catcache 6 catcache 32 catcache 55 catcache 54 relcache 18065 relcache 17640 snapshot 2608 relcache 17640 relcache 18065
rmgr: Storage len (rec/tot): 42/ 42, tx: 0, lsn: 0/01938F18, prev 0/01938E20, desc: CREATE base/5/18066
rmgr: Standby len (rec/tot): 42/ 42, tx: 1739, lsn: 0/01938F48, prev 0/01938F18, desc: LOCK xid 1739 db 5 rel 18066
rmgr: Heap len (rec/tot): 203/ 203, tx: 1739, lsn: 0/01938F78, prev 0/01938F48, desc: INSERT off: 13, flags: 0x00, blkref #0: rel 1663/5/1259 blk 25
...
- Structure: Each line represents a WAL record, containing information about database operations.
- Components of each record:
- rmgr: Resource manager (e.g., Heap, Btree, Transaction)
- len: Length of the record
- tx: Transaction ID
- lsn: Log Sequence Number
- prev: Previous LSN
- desc: Description of the operation
- Types of operations visible:
- INSERT operations (Heap and Btree)
- MULTI_INSERT operations (Heap2)
- COMMIT transactions
- File operations (CREATE)
- Full Page Writes (FPW)
- Specific examples:
- Table inserts: `rmgr: Heap len (rec/tot): 203/203, tx: 1738, lsn: 0/019386C8, prev 0/01938698, desc: INSERT off: 12, flags: 0x00, blkref #0: rel 1663/5/1259 blk 25`
- Index updates: `rmgr: Btree len (rec/tot): 64/ 64, tx: 1738, lsn: 0/01938798, prev 0/019386C8, desc: INSERT_LEAF off: 394, blkref #0: rel 1663/5/2662 blk 4`
- Transaction commit: `rmgr: Transaction len (rec/tot): 242/ 242, tx: 1738, lsn: 0/01938E20, prev 0/01938D60, desc: COMMIT 2024-10-02 21:47:28.453226 UTC;`
This WAL output provides a detailed view of the database operations, allowing for analysis of transaction flow, data modifications, and system activities. It's particularly useful for understanding database behavior, troubleshooting, and in some cases, for point-in-time recovery.
pg_resetwal
pg_resetwal is a command that can be used to remove WAL files, for example when they have become corrupted. It should be used only as a last resort, since it can cause data loss.
pg_controldata
pg_controldata can be used to check control information about WAL and timelines in a PostgreSQL database cluster.