DB2mon details - stanislawbartkowski/wikis GitHub Wiki

Introduction

This page contains details about DB2 monitoring tool, sample and an explanation

Environments

Env1

Property Value
Hardware Commodity notebook
DB2 version DB2 11.5.6
CPU 8 cores
RAM 32 GB
Storage SSD disk
Workload (homegrown DB2 XML performance tool, intensive JMeter workload updating XML column)

Env2

Property Value
Hardware IBM Integrated Analytics System (IIAS)
DB2 version DB2 11.5.5
CPU 192 cores
RAM 500 GB
Storage Appropriate
Workload OLAP workload, loading

(sample metrics for coordinator node: member 0)

DB2Mon

https://www.ibm.com/support/pages/performance-monitor-data-collection-and-reports-using-db2mon

https://www.ibm.com/docs/en/db2/11.5?topic=tuning-collecting-reporting-performance-monitor-data

Common concepts

Concept Description Link
Activity Every DML,DDL,CALL,LOAD. Only execution of the access plan https://www.ibm.com/docs/en/db2/11.5?topic=identification-activities
The package cache An area of memory that DB2 uses to store access plans https://datageek.blog/en/2016/09/20/db2-memory-area-in-depth-the-package-cache/
"Compile" Explicit compiles are compilations directly initiated by a user request such as a bind, rebind, prepare or execute immediate https://www.ibm.com/docs/en/db2/9.7?topic=elements-total-compile-time-total-compile-time
"Section" A section is the DB2 UDB executable version of a SQL statement https://www.progress.com/tutorials/odbc/db2-udb
Latches The internal DB2 locks used to serialize access to shared data structures https://www.ibm.com/support/pages/identifying-bufferpool-latch-contention https://www.ibm.com/support/pages/measuring-latching-overhead-monitoring-functions-db2
CF Cluster caching facility PureScale concept https://www.ibm.com/docs/en/db2/11.5?topic=ct-db2-purescale-feature-cluster-caching-facility-cf-configuration
Prefetch Prefetch is a mechanism for reading a set of pages into the buffer pool with only one asynchronous I/O operation. https://www.ibm.com/docs/en/db2-for-zos/12?topic=explain-prefetch-access-paths-prefetchd-s-l-u
FCM Fast Communication Manager https://www.ibm.com/docs/en/db2/11.5?topic=allocation-fcm-buffer-pool-memory-requirements
Request (internal) DB2 UDB processing actually works in terms of SQL requests rather than SQL statements https://www.progress.com/tutorials/odbc/db2-udb
Asynchronous/synchronous Synchronous read - CPU wait for the result. Asynchronous - opposite. Sequential Prefetch,Dynamic prefetch https://www.toolbox.com/tech/oracle/question/synchronous-i-o-vs-asynchronous-i-o-030714/

Find the database throughput section near the beginning to identify work done by the system:

Throughput metrics at database level

Metric Meaning Sample value Env1 Env2
TS_DELTA db2mon time parameter 35 121 190
ACT_PER_S Number of completed activities per second 22629.7 641.7 11.5
CMT_PER_S Number of commits per second 2361.1 641.4 6.8
RB_PER_S Rollback per second 0.0 0.0 1.1
DDLCK_PER_S Deadlock per second 0.0 0.0 0.0
SEL_P_S Selects per second 13089.6 0.0 5.0
UID_P_S Update/Insert/Delete statements per second 9540.0 641.6 1.9
ROWS_INS_P_S Rows inserted per second 4364.0 0.0 0.6
ROWS_UPD_P_S Rows updated per second 641.4 0.1 (403.1 members)
ROWS_RET_P_S Rows returned per second 0.0 13436.9
ROWS_MOD_P_S Rows modified per second 648.0 2500.0
PKG_CACHE_INS_P_S The total number of times that a requested section was not available for use and had to be loaded into the package cache / per second 0.2 0.6
P_RD_PER_S Number of physical reads from containers to buffer pools per second 26.5 0.0 (1467 members)

Analysis of processing time for the database

Time breakdown at database level (wait + processing)

PCT_COMPILE: Spotting high compile times (over 15%) on a transactional system can be a sign of an application that is not following the best practice of using parameter markers and is instead using literals. Comment:

  • Env1 : transactional, most of the time is committing

  • Env2 : OLAP, no committing, also column organized tables

Metric Meaning Sample value Env1 Env2
TOTAL_RQST_TM The total amount of time spent working on requests. 1035374 2209856 6701006
PCT_COMPILE Percentage of request time spent on explicit compile 0.04 0.05 0.13
PCT_SECTION Percentage of request time spent on executing section 78.59 2.21 98.78
PCT_SORT Percentage of request time spent on executing sorting inside section 0.74 0.0 0.02
PCT_COL Percentage of request time spent on accessing column data in columnar tables 0.0 0.0 25.47 (members 68.54)
PCT_COL_SYNOP Percentage of request time spent accessing column synopsis tables 0.0 0.0 0.00
PCT_COMMIT Percentage of request time spent on commits 9.38 91.59 0.01
PCT_RBACK Percentage of request time spent on rollbacks 0.0 0.0 0.00
PCT_CONN Percentage of request time spent on connects or reconnects 0.0 0.0 0.33
PCT_RTN_USR_CODE Percentage of request time spent on total amount of time spent executing in routines outside DB2, usually user code routines 0.0 0.0
PCT_BACKUP The percentage of request time spent on online backups 0.0 0.0 0.00
PCT_IDX_BLD The percentage of request time spent on index creation or recreation 0.0 0.0 0.00
PCT_RUNSTATS The percentage of request time spent on creating stat, running runstats 0.0 0.0 0.02
PCT_REORG The percentage of request time spent on performing reorg 0.0 0.0 0.00
PCT_LOAD The percentage of request time spent on loading 0.0 0.0 0.00

Analysis of time spent waiting by the database. This report is often the most important top-level information on performance health

Wait times at database level

In this example, each member spends 70% of each request waiting (PCT_RQST_WAIT). The majority of this wait time is spent performing CF communication (PCT_CF), followed by log writes (PCT_LG_DST) and lock wait (PCT_LOCK).

  • Env1: PCT_LG_DSK, transactional workload, most of the time transaction log writing

  • Env2: PCT_FCM, PCT_LTCH, Distributed, MMP, most of the time communication between nodes, PC

Metric Meaning Sample value Env1 Env2 (coord) Env2 (member)
TOTAL_RQST_TM The total amount of time spent working on requests (The same as above) 1035374 2209856 6701006 5608004
TOTAL_WAIT_TM The total amount of time spent on waiting 732450 2176203 6367786 3107040
PCT_RQST_WAIT The percentage of total time spent on waiting 70.74 98.47 95.02 55.40
PCT_LOCK The percentage of total time spent on waiting on locks 1.31 0.00 0.00 3.33
PCT_GLB_LOCK The percentage of time spent on global lock waits, the lock being on a remote member 0.57 0.0 0.00 0.00
PCT_LTCH The percentage of time spent on latches 0.89 0.11 0.01 11.28
PCT_LG_DSK The percentage of time spent on waiting for log records to be flushed to disk 4.28 91.52 0.00 0.00
PCT_LG_BUF The percentage of time spent on s waiting for space into the log buffer 0.00 0.00 0.00
PCT_RCLM (PureScale) The percentage of time spent on waiting on page locks, where the lock request caused a page to be reclaimed 0.0 0.00 0.00
PCT_CF (PureScale) The percentage of time spent on communicating with the cluster caching facility. 56.26 0.0 0.00 0.00
PCT_PFTCH The percentage of time spent on waiting for an I/O server(prefetcher) to finish loading pages into the buffer pool. 0.0 0.00 0.79
PCT_DIAG The percentage of time spent on waiting on a write to the db2diag logfile. 0.0 0.00 0.00
PCT_POOL_R The percentage of time spent on reading data and index pages from the table space containers (physical) for all types of tablespaces 0.82 0.00 4.53
PCT_DIR_R The percentage of time spent required to perform the direct reads, outside buffer pool 0.6 0.29 0.00
PCT_DIR_W The percentage of time required to perform the direct writes, outside buffer pool 0.00 0.41 0.38
PCT_FCM The percentage of time spent on waiting to receive data through FCM or blocking on an FCM send operation 0.0 91.61 35.77
PCT_TCPIP The percentage of time spent on blocking on a TCP/IP send to the client or waiting for an incoming client request over TCP/IP excluding idle time 5.93 0.00 0.00

Identify the busiest statements

Top SQL statements by execution time

Metric Meaning Sample value Env1 Env2
NUM_EXEC Number of executions 61729 19399 2
COORD_STMT_EXEC_TIME Execution time for statement by coordinator agent monitor element 301979 15940 802533
AVG_COORD_EXEC_TIME Average coordinator execution time per single statement execution 4.89 33.03 401266.50
PCT_COORD_STMT_EXEC_TIME Percentage of coordinator statement time per this statement 15.09 47.00
TOTAL_CPU_TIME Total CPU time spent on statement 29928189 10449038 4981
AVG_CPU_TIME Average CPU time spend on a single execution of the statement 484 538 2490
PCT_WAIT_TIME The percentage of total CPU time spent on waiting 53.67 36.31 95.73
AVG_SECT_TIME The average section time spent in statement 4.89 0.82 0.00
AVG_COL_TIME The average elapsed time spent accessing columnar data in a query on column-organized tables. 0.00 0.00 0.00
Metric Meaning Sample value Env1 Env2
PCT_WAIT The percentage of total CPU time spent on waiting (as above) 53.67 36.31 95.73
PCT_LG_DSK The percentage of time spent on waiting for log records to be flushed to disk 0.00 0.00 0.55
PCT_LG_BUF The percentage of time spent on waiting for space into the log buffer 0.00 0.00 0.00
PCT_LOCK The percentage of total time spent on waiting on locks 2.37 0.00 0.00
PCT_GL_LK The percentage of time spent on global lock waits, the lock is on a remote member 0.36 6.84 0.00
PCT_LTCH The percentage of time spent on latches 2.34 0.00 0.00
PCT_RCLM (PureScale) The percentage of time spent on waiting on page locks, where the lock request caused a page to be reclaimed 0.00 0.00 0.00
PCT_CF (PureScale) The percentage of time spent on communicating with the cluster caching facility. 47.55 0.00 0.00
PCT_PFTCH The percentage of time spent on waiting for an I/O server(prefetcher) to finish loading pages into the buffer pool. 0.00 0.00 0.00
PCT_DIAG The percentage of time spent on waiting on a write to the db2diag logfile. 0.00 0.00 0.00
PCT_POOL_R The percentage of time spent on reading data and index pages from the tablespace containers (physical) for all types of tablespaces 1.40 29.47 0.00
PCT_DIR_R The percentage of time spent required to perform the direct reads, outside buffer pool 0.00 0.00 0.55
PCT_DIR_W The percentage of time required to perform the direct writes, outside buffer pool 0.00 0.00
PCT_FCM The percentage of time spent on waiting to receive data through FCM or blocking on an FCM send operation 0.00 25.97
Metric Meaning Sample value Env1 Env2
NUM_EXEC Number of executions (the same as above) 61729 19339 2
AVG_D_LRD The average number of data pages that have been requested from the buffer pool (logical) for regular and large tablespaces 0.0 5.1 4
AVG_D_PRD The average number of data pages read in from the tablespace containers (physical) for regular and large table spaces 0.0 0.0 0
AVG_I_LRD The average number of index pages that have been requested from the buffer pool (logical) for regular and large tablespaces. 607.4 3.0 3
AVG_I_PRD The average number of index pages read in from the tablespace containers (physical) for regular and large tablespaces. 0.1 0.0 0
AVG_TD_PRD The average number of data pages read in from the tablespace containers (physical) for temporary tablespaces 0.0 0.0 0
AVG_TI_PRD The average number of index pages read in from the tablespace containers (physical) for temporary tablespaces 0.0 0.0
AVG_COL_LRD The average number of column-organized pages read in from the tablespace containers (physical) for regular and large table spaces 0.0 0.0 0
AVG_COL_PRD The average number of column-organized pages read in from the tablespace containers (physical) for regular and large table spaces 0.0 0.0 0
AVG_DIR_R_RQS The average number of requests to perform a direct read of one or more sectors of data 0.0 0.0 2
AVG_DIR_W_RQS The average number of requests to perform a direct-write of one or more sectors of data 0.0 0.0 0

Examining Log Disk Performance

Database log write times

A high percentage of log disk wait time (PCT_LG_DSK in db2mon output) is common for transactional applications.

ENV1/LOG_WRITE_TIME_PER_IO_MS : log write is very slow, a bottleneck

Metric Meaning Sample value Env1 Env2 coord Env2 member
NUM_LOG_WRITE_IO The number of I/O requests issued by the logger for writing logdata to the disk 57237 8756 867 4334
LOG_WRITE_IO_PER_S The number of I/O log requests per second 1788.6562 1788.6562 4.5631 22.8105
LOG_WRITE_MB_PER_S The size of writing log data to disk in MB 7.0000 5.2809 0.0421 0.3421
LOG_WRITE_TIME The total elapsed time spent by the logger writing log data to the disk 25306 110772 650 4217
LOG_WRITE_TIME_PER_IO_MS The time elapsed by the logger writing log data per a single I/O request 0.4421 12.6509 0.7497 1.1421
NUM_LOG_BUFFER_FULL Number of times full log buffer caused agents to wait monitor element 0 0 0 0

Buffer pool read time

High percentage bufferpool read time (PCT_POOL_R) is a common bottleneck, especially when the database is warming up (e.g. shortly after database activation or when a new application starts making connections). Average pool read times (AVG_READ_TIME) are reported at bufferpool and tablespace levels. Similarly, a high percentage for direct I/O time (for Large Objects) can be tracked in AVG_DRCT_READ_TIME and AVG_DRCT_WRITE_TIME.

Metric Meaning Sample value Env1 Env2 coord Env2 member
BP_NAME BP16
POOL_DATA_L_READS The number of data pages that have been requested from the buffer pool (logical) for regular and large table spaces 3037899 359590 58506 167737
POOL_DATA_P_READS Indicates the number of data pages read in from the tablespace containers (physical) for regular and large table spaces 44159 3218 0 2723
POOL_INDEX_L_READS Indicates the number of index pages that have been requested from the buffer pool (logical) for regular and large tablespaces. 45043859 232830 56719 102246
POOL_INDEX_P_READS Indicates the number of index pages read in from the tablespace containers (physical) for regular and large table spaces 27427 0 0 75
POOL_COL_L_READS Indicates the number of column-organized pages requested from the buffer pool (logical) for regular and large tablespaces 0 0 0 909672
POOL_COL_P_READS Indicates the number of column-organized pages read in from the table space containers (physical) for regular and large table spaces 0 0 0 257798
POOL_READ_TIME Indicates the total amount of time spent reading in data and index pages from the table space containers (physical) for all types of table spaces 19619 18273 0 221163
AVG_READ_TIME Average pool read time per a single pool read request 5.67 0.27 0 0.84
Metric Meaning Sample value Env1 Env2
TBSP_NAME Tablespace name TBS_OM ----
NUM_READS Number of all read from tablespace: data, index and columnar 29235 3217 329906
AVG_READ_TIME The average read time for all requests 0.27 5.68 0.86
DIRECT_READ_REQS The number of requests to perform a direct read of one or more sectors of data 0 0 0
AVG_DRCT_READ_TIME The average direct read time 0 0 0
NUM_WRITES The number of all types of buffer pool data/index/columnar page was physically written to disk 77301 74
AVG_WRITE_TIME The average write time of all types of data 2.06 0.1
DIRECT_WRITE_REQS The number of direct-write requests 0 0
AVG_DRCT_WRITE_TIME The average time od direct write request 0 0

Round-trip time for messages between the members and CFs (PureScale)

For reports that show high percentage CF wait time (PCT_CF), the following report shows performance data on messages exchanged between the Db2 member and the CFs.

Round-trip CF command execution counts and average response times

Metric Meaning Sample value Env1
CF_CMD_NAME Command name
TOTAL_CF_REQUESTS
AVG_CF_REQUEST_TIME_MICRO

Reclaims of Index and Data pages (Pure Scale)

Metric Meaning Sample value Env1
TABSCHEMA
TABNAME
OBJTYPE
PAGE_RECLAIMS The number of times a page related to the object was reclaimed by another member in the Db2pureScale
RECLAIM_WAIT_TIME The amount of time spent waiting on page locks,where the lock request caused a page to be reclaimed.

Reclaims of SMP pages (PureScale)

The following example shows reclaims of SMP pages, which are often due to heavy inserts into a table with a too-small tablespace extent size.

Metric Meaning Sample value Env1
TABSCHEMA
TABNAME
OBJTYPE
SMP_PAGE_RECLAIMS The number of times a page related to a space map pagewas reclaimed by another member in the Db2pureScale® instancebefore its planned release
SMP_PAGE_RECLAIM_WAIT_TIME The amount of time spent waiting on page locks,where the lock request caused a page to be reclaimed.

Latch wait time

Metric Meaning Sample value Env1 Env2
LATCH_NAME SQLO_LT_SQLB_BPD__bpdLatch_SX SQLO_LT_ibm_cde__query__TableQueue__mTQMutex
TOT_EXT_LATCH_WAIT_TIME_MS The amount of time, in milliseconds, spent in extended latch waits 7182 2588 601828
TOT_EXT_LATCH_WAITS The number of extended latch waits 4454 453 101
TIME_PER_LATCH_WAIT_MS The average amount of time spent on single wait 1.61 5.71 5958.69

Example for ENV1

MEMBER LATCH_NAME                                                   TOT_EXT_LATCH_WAIT_TIME_MS TOT_EXT_LATCH_WAITS  TIME_PER_LATCH_WAIT_MS
------ ------------------------------------------------------------ -------------------------- -------------------- ----------------------
     0 SQLO_LT_SQLB_BPD__bpdLatch_SX                                                      2588                  453                   5.71
     0 SQLO_LT_SQLP_DBCB__sqlp_dbcb_sem                                                    343                   28                  12.25
     0 SQLO_LT_SQLD_PAGE_CACHE__pageCacheLatch                                              20                  537                   0.03
     0 SQLO_LT_sqlo_waitpost__cs                                                             6                    5                   1.20
     0 SQLO_LT_SQLP_LHSH__xhshlatch                                                          1                    1                   1.00
     0 SQLO_LT_sqlo_WaitPostThreshold__cs                                                    1                    1                   1.00
     0 SQLO_LT_sqmPartitionedRequestMetrics__metricsSyncPoint                                1                    1                   1.00
     0 SQLO_LT_NO_IDENTITY                                                                   0                    8                   0.00
     0 SQLO_LT_SMemPool__MemLatchType__latch                                                 0                    1                   0.00
     0 SQLO_LT_sqmObjMetricsTable__m_objMetricsLatch                                         0                    4                   0.00
     0 SQLO_LT_sqlra_workspace_sibling__workspace_sibling_latch                              0                    2                   0.00

Example for doc sample

MEMBER LATCH_NAME                                                   TOT_EXT_LATCH_WAIT_TIME_MS TOT_EXT_LATCH_WAITS  TIME_PER_LATCH_WAIT_MS
------ ------------------------------------------------------------ -------------------------- -------------------- ----------------------
 0 SQLO_LT_SQLB_BPD_bpdLatch_SX                                                       7182                 4454                   1.61  
 3 SQLO_LT_SQLB_BPD_bpdLatch_SX                                                       4085                 1965                   2.07  
 2 SQLO_LT_SQLB_BPD_bpdLatch_SX                                                       3086                 2088                   1.47  
 1 SQLO_LT_SQLB_BPD_bpdLatch_SX                                                       1764                 1724                   1.02  
 0 SQLO_LT_SQLB_BPD_WARLatch                                                          1272                 1783                   0.71  
 2 SQLO_LT_SQLB_BPD_WARLatch                                                          1168                 2637                   0.44  
 3 SQLO_LT_SQLB_BPD_WARLatch                                                           706                 2110                   0.33  
 1 SQLO_LT_SQLB_BPD_WARLatch                                                           509                 1870                   0.27  
 0 SQLO_LT_SQLP_LHSH_hshlatch                                                          263                  758                   0.34  
 3 SQLO_LT_SQLI_INX_PAGE_CACHE_ipcLatch                                                240                   36                   6.66  
 2 SQLO_LT_SQLI_INX_PAGE_CACHE_ipcLatch                                                221                   45                   4.91  
 1 SQLO_LT_SQLI_INX_PAGE_CACHE_ipcLatch                                                193                   21                   9.19  
 0 SQLO_LT_SQLI_INX_PAGE_CACHE_ipcLatch                                                175                   25                   7.00

Example for Env2

MEMBER LATCH_NAME                                                   TOT_EXT_LATCH_WAIT_TIME_MS TOT_EXT_LATCH_WAITS  TIME_PER_LATCH_WAIT_MS
------ ------------------------------------------------------------ -------------------------- -------------------- ----------------------
    15 SQLO_LT_SQLKT_CB__multiReaderLatch                                              2760789                   28               98599.60
     7 SQLO_LT_ibm_cde__query__TableQueue__mTQMutex                                     601828                  101                5958.69
    12 SQLO_LT_ibm_cde__query__TableQueue__mTQMutex                                     596051                   96                6208.86
    16 SQLO_LT_ibm_cde__query__TableQueue__mTQMutex                                     559837                   98                5712.62
     2 SQLO_LT_SQLB_BPD__bpdLatch_SX                                                      4592                  310                  14.81
     1 SQLO_LT_SQLB_BPD__bpdLatch_SX                                                      4574                  263                  17.39