Postgres monitoring - ghdrako/doc_snipets GitHub Wiki
Monitoring tools
- https://wiki.postgresql.org/wiki/Monitoring
- https://dev.to/bolajiwahab/progress-reporting-in-postgresql-1i0d
Examle of tools:
- pgBadger
- pgCluu
- sar
- pg_buffercache
- Nagios
- Zabbix
- datadog
Frequent Monitoring
The following database information needs frequent monitoring every minute or 5 minutes:
- Active session
- Inactive session
- Long-running query (session query running time)
- Locks
- Waiting sessions
- SQL queries being run
- CPU and memory occupied by each session
- Number of connections
- Primary/standby delay
- Any errors in logs
Daily or Weekly Basis
Other database-related information that can be monitored on a daily or weekly basis is:
- Database size
- Tablespace size
- Object (table/index) size
- Last vacuum/autovacuum
- Last analyze/auto analyzed
- Bloat on table/indexes
- Number of checkpoints
- Number of wal files generated
Long running query
SELECT
pid,
now() - pg_stat_activity.query_start AS duration,
usename,
query,
state,
wait_event_type,
wait_event
FROM pg_stat_activity
WHERE state = 'active' and (now() - pg_stat_activity.query_start) > interval '1 minute';
The wait_event_type and wait_event columns will show IO and DataFileRead if the query is reading from disk.
List Databases and Sizes
SELECT pg_database.datname,
pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database;
Show Running Queries in Postgres
SELECT * FROM pg_stat_activity;
Show Blocking Locks
SELECT bl.pid AS blocked_pid,
a.usename AS blocked_user,
kl.pid AS blocking_pid,
ka.usename AS blocking_user
FROM pg_catalog.pg_locks bl
JOIN pg_catalog.pg_stat_activity a ON bl.pid = a.pid
JOIN pg_catalog.pg_locks kl
JOIN pg_catalog.pg_stat_activity ka ON kl.pid = ka.pid ON bl.transactionid = kl.transactionid AND bl.pid != kl.pid
WHERE NOT bl.granted ;
Show Table Usage
If you want to know accesses or I/O per table or index you can use the pg_stat_tables and pg_statio_tables relations. For example:
SELECT * FROM pg_statio_user_tables;
to show the I/O caused by your relations. Or for the number of accesses and scan types and tuples fetched:
SELECT * FROM pg_stat_user_tables;
Show table size
monitor_table_size.sh:
#!/bin/bash
# Get table sizes in MB
psql -U your_username -d your_database -c "SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname || '.' || tablename)) AS size FROM pg_tables ORDER BY pg_total_relation_size(schemaname || '.' || tablename) DESC;" >> /path/to/table_sizes.log
Monitor replication lag
monitor_replication_lag.sh:
#!/bin/bash
# Get replication lag in bytes
psql -U your_username -d your_database -c "SELECT
pg_is_in_recovery() AS is_slave,
pg_last_wal_receive_lsn() AS receive_lsn,
pg_last_wal_replay_lsn() AS replay_lsn,
pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() AS is_synced,
EXTRACT(EPOCH FROM now()) - EXTRACT(EPOCH FROM pg_last_xact_replay_timestamp()) AS replication_lag_seconds,
pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replication_lag_bytes
FROM
pg_stat_replication;ยป >> /path/to/replication_lag.log
Long running query
monitor_long_running_queries.sh:
#!/bin/bash
# Set the threshold for long-running queries in seconds
THRESHOLD=60
# Log queries exceeding the threshold
psql -U your_username -d your_database -c "SELECT * FROM pg_stat_activity WHERE state = 'active' AND now() - pg_stat_activity.query_start >= interval '$THRESHOLD seconds';" >> /path/to/long_running_queries.log
Monitoring progres operations
operation | view |
---|---|
ANALYZE | pg_stat_progress_analyze |
CLUSTER or VACUUM FULL | pg_stat_progress_cluster |
COPY | pg_stat_progress_copy |
CREATE INDEX or REINDEX | pg_stat_progress_create_index |
VACUUM | pg_stat_progress_vacuum |
pg_basebackup | pg_stat_progress_basebackup |
Monitoring the progress of create index, create index concurrently, reindex, reindex concurrently
SELECT p.datname AS database_name,
p.pid,
clock_timestamp() - a.xact_start AS duration_so_far,
a.application_name,
a.client_addr,
a.usename,
coalesce(a.wait_event_type ||'.'|| a.wait_event, 'false') AS waiting,
p.command,
trim(trailing ';' from a.query) AS query,
a.state,
p.index_relid::regclass AS index_name,
p.relid::regclass AS table_name,
pg_size_pretty(pg_relation_size(p.relid)) AS table_size,
p.phase,
CASE p.phase
WHEN 'initializing' THEN '1 of 12'
WHEN 'waiting for writers before build' THEN '2 of 12'
WHEN 'building index: scanning table' THEN '3 of 12'
WHEN 'building index: sorting live tuples' THEN '4 of 12'
WHEN 'building index: loading tuples in tree' THEN '5 of 12'
WHEN 'waiting for writers before validation' THEN '6 of 12'
WHEN 'index validation: scanning index' THEN '7 of 12'
WHEN 'index validation: sorting tuples' THEN '8 of 12'
WHEN 'index validation: scanning table' THEN '9 of 12'
WHEN 'waiting for old snapshots' THEN '10 of 12'
WHEN 'waiting for readers before marking dead' THEN '11 of 12'
WHEN 'waiting for readers before dropping' THEN '12 of 12'
END AS phase_progress,
format(
'%s (%s of %s)',
coalesce(round(100.0 * p.blocks_done / nullif(p.blocks_total, 0), 2)::text || '%', 'not applicable'),
p.blocks_done::text,
p.blocks_total::text
) AS scan_progress,
format(
'%s (%s of %s)',
coalesce(round(100.0 * p.tuples_done / nullif(p.tuples_total, 0), 2)::text || '%', 'not applicable'),
p.tuples_done::text,
p.tuples_total::text
) AS tuples_loading_progress,
format(
'%s (%s of %s)',
coalesce((100 * p.lockers_done / nullif(p.lockers_total, 0))::text || '%', 'not applicable'),
p.lockers_done::text,
p.lockers_total::text
) AS lockers_progress,
format(
'%s (%s of %s)',
coalesce((100 * p.partitions_done / nullif(p.partitions_total, 0))::text || '%', 'not applicable'),
p.partitions_done::text,
p.partitions_total::text
) AS partitions_progress,
p.current_locker_pid,
trim(trailing ';' from l.query) AS current_locker_query
FROM pg_stat_progress_create_index AS p
JOIN pg_stat_activity AS a ON a.pid = p.pid
LEFT JOIN pg_stat_activity AS l ON l.pid = p.current_locker_pid
ORDER BY clock_timestamp() - a.xact_start DESC;
Monitoring the progress of vacuum, vacuum analyze
SELECT p.datname AS database_name,
p.pid,
clock_timestamp() - a.xact_start AS duration_so_far,
a.application_name,
a.client_addr,
a.usename,
coalesce(a.wait_event_type ||'.'|| a.wait_event, 'false') AS waiting,
trim(trailing ';' from a.query) AS query,
a.state,
p.relid::regclass AS table_name,
pg_size_pretty(pg_relation_size(p.relid)) AS table_size,
pg_size_pretty(pg_total_relation_size(p.relid)) AS total_table_size,
CASE
WHEN ltrim(a.query) ~* '^autovacuum.to prevent wraparound' THEN 'wraparound'
WHEN ltrim(a.query) ~ '^vacuum' THEN 'user'
ELSE 'regular'
END AS mode,
p.phase,
CASE p.phase
WHEN 'initializing' THEN '1 of 7'
WHEN 'scanning heap' THEN '2 of 7'
WHEN 'vacuuming indexes' THEN '3 of 7'
WHEN 'vacuuming heap' THEN '4 of 7'
WHEN 'cleaning up indexes' THEN '5 of 7'
WHEN 'truncating heap' THEN '6 of 7'
WHEN 'performing final cleanup' THEN '7 of 7'
END AS vacuum_phase_progress,
format(
'%s (%s of %s)',
coalesce(round(100.0 * p.heap_blks_scanned / nullif(p.heap_blks_total, 0), 2)::text || '%', 'not applicable'),
p.heap_blks_scanned::text,
p.heap_blks_total::text
) AS vacuum_scan_progress,
format(
'%s (%s of %s)',
coalesce(round(100.0 * p.heap_blks_vacuumed / nullif(p.heap_blks_total, 0), 2)::text || '%', 'not applicable'),
p.heap_blks_vacuumed::text,
p.heap_blks_total::text
) AS vacuum_progress,
p.index_vacuum_count,
p.max_dead_tuples,
p.num_dead_tuples
FROM pg_stat_progress_vacuum AS p
JOIN pg_stat_activity AS a ON a.pid = p.pid
ORDER BY clock_timestamp() - a.xact_start DESC;
Monitoring the progress of vacuum full, cluster
SELECT p.datname AS database_name,
p.pid,
clock_timestamp() - a.xact_start AS duration_so_far,
a.application_name,
a.client_addr,
a.usename,
coalesce(a.wait_event_type ||'.'|| a.wait_event, 'false') AS waiting,
p.command,
trim(trailing ';' from a.query) AS query,
a.state,
p.relid::regclass AS table_name,
p.phase,
CASE p.phase
WHEN 'initializing' THEN '1 of 8'
WHEN 'seq scanning heap' THEN '2 of 8'
WHEN 'index scanning heap' THEN '3 of 8'
WHEN 'sorting tuples' THEN '4 of 8'
WHEN 'writing new heap' THEN '5 of 8'
WHEN 'swapping relation files' THEN '6 of 8'
WHEN 'rebuilding index' THEN '7 of 8'
WHEN 'performing final cleanup' THEN '7 of 8'
END AS vacuum_phase_progress,
cluster_index_relid::regclass AS cluster_index,
format(
'%s (%s of %s)',
coalesce(round(100.0 * p.heap_blks_scanned / nullif(p.heap_blks_total, 0), 2)::text || '%', 'not applicable'),
p.heap_blks_scanned::text,
p.heap_blks_total::text
) AS heap_scan_progress,
format(
'%s (%s of %s)',
coalesce(round(100.0 * p.heap_tuples_written / nullif(p.heap_tuples_scanned, 0), 2)::text || '%', 'not applicable'),
p.heap_tuples_written::text,
p.heap_tuples_scanned::text
) AS heap_tuples_written_progress,
p.index_rebuild_count
FROM pg_stat_progress_cluster AS p
JOIN pg_stat_activity AS a ON a.pid = p.pid
ORDER BY clock_timestamp() - a.xact_start DESC;
Monitoring the progress of analyze
SELECT p.datname AS database_name,
p.pid,
clock_timestamp() - a.xact_start AS duration_so_far,
a.application_name,
a.client_addr,
a.usename,
coalesce(a.wait_event_type ||'.'|| a.wait_event, 'false') AS waiting,
trim(trailing ';' from a.query) AS query,
a.state,
p.relid::regclass AS table_name,
pg_size_pretty(pg_relation_size(p.relid)) AS table_size,
pg_size_pretty(pg_total_relation_size(p.relid)) AS total_table_size,
CASE
WHEN ltrim(a.query) ~* '^analyze' THEN 'user'
ELSE 'regular'
END AS mode,
p.phase,
CASE p.phase
WHEN 'initializing' THEN '1 of 6'
WHEN 'acquiring sample rows' THEN '2 of 6'
WHEN 'acquiring inherited sample rows' THEN '3 of 6'
WHEN 'computing statistics' THEN '4 of 6'
WHEN 'computing extended statistics' THEN '5 of 6'
WHEN 'finalizing analyze' THEN '6 of 6'
END AS phase_progress,
format(
'%s (%s of %s)',
coalesce(round(100.0 * p.sample_blks_scanned / nullif(p.sample_blks_total, 0), 2)::text || '%', 'not applicable'),
p.sample_blks_scanned::text,
p.sample_blks_total::text
) AS scan_progress,
format(
'%s (%s of %s)',
coalesce((100 * p.ext_stats_computed / nullif(p.ext_stats_total, 0))::text || '%', 'not applicable'),
p.ext_stats_computed::text,
p.ext_stats_total::text
) AS extended_statistics_progress,
format(
'%s (%s of %s)',
coalesce((100 * p.ext_stats_computed / nullif(p.child_tables_total, 0))::text || '%', 'not applicable'),
p.child_tables_done::text,
p.child_tables_total::text
) AS child_tables_progress,
current_child_table_relid::regclass AS current_child_table
FROM pg_stat_progress_analyze AS p
JOIN pg_stat_activity AS a ON a.pid = p.pid
ORDER BY clock_timestamp() - a.xact_start DESC;
Monitoring the progress of copy
SELECT p.datname AS database_name,
p.pid,
clock_timestamp() - a.xact_start AS duration_so_far,
a.application_name,
a.client_addr,
a.usename,
coalesce(a.wait_event_type ||'.'|| a.wait_event, 'false') AS waiting,
p.command,
p.type,
trim(trailing ';' from a.query) AS query,
a.state,
p.relid::regclass AS table_name,
coalesce(pg_size_pretty(pg_relation_size(p.relid)), '-') AS table_size,
coalesce(pg_size_pretty(pg_total_relation_size(p.relid)), '-') AS total_table_size,
format(
'%s (%s of %s)',
coalesce(round(100.0 * p.bytes_processed / nullif(p.bytes_total, 0), 2)::text || '%', 'not applicable'),
p.bytes_processed::text,
p.bytes_total::text
) AS bytes_progress,
p.tuples_processed,
p.tuples_excluded
FROM pg_stat_progress_copy AS p
JOIN pg_stat_activity AS a ON a.pid = p.pid
ORDER BY clock_timestamp() - a.xact_start DESC;
Monitoring the progress of basebackup
SELECT a.datname AS database_name,
p.pid,
clock_timestamp() - a.query_start AS duration_so_far,
a.application_name,
a.client_addr,
a.usename,
coalesce(a.wait_event_type ||'.'|| a.wait_event, 'false') AS waiting,
trim(trailing ';' from a.query) AS query,
a.state,
p.phase,
CASE p.phase
WHEN 'initializing' THEN '1 of 6'
WHEN 'waiting for checkpoint to finish' THEN '2 of 6'
WHEN 'estimating backup size' THEN '3 of 6'
WHEN 'streaming database files' THEN '4 of 6'
WHEN 'waiting for wal archiving to finish' THEN '5 of 6'
WHEN 'transferring wal files' THEN '6 of 6'
END AS phase_progress,
format(
'%s (%s of %s)',
coalesce(round(100.0 * p.backup_streamed / nullif(p.backup_total, 0), 2)::text || '%', 'not applicable'),
p.backup_streamed::text,
coalesce(p.backup_total::text, '0')
) AS backup_progress,
format(
'%s (%s of %s)',
coalesce((100 * p.tablespaces_streamed / nullif(p.tablespaces_total, 0))::text || '%', 'not applicable'),
p.tablespaces_streamed::text,
p.tablespaces_total::text
) AS tablespace_progress
FROM pg_stat_progress_basebackup AS p
JOIN pg_stat_activity AS a ON a.pid = p.pid
ORDER BY clock_timestamp() - a.query_start DESC;