DB2mon details - stanislawbartkowski/wikis GitHub Wiki
Introduction
This page contains details about DB2 monitoring tool, sample and an explanation
Environments
Env1
Property | Value |
---|---|
Hardware | Commodity notebook |
DB2 version | DB2 11.5.6 |
CPU | 8 cores |
RAM | 32 GB |
Storage | SSD disk |
Workload | (homegrown DB2 XML performance tool, intensive JMeter workload updating XML column) |
Env2
Property | Value |
---|---|
Hardware | IBM Integrated Analytics System (IIAS) |
DB2 version | DB2 11.5.5 |
CPU | 192 cores |
RAM | 500 GB |
Storage | Appropriate |
Workload | OLAP workload, loading |
(sample metrics for coordinator node: member 0)
DB2Mon
https://www.ibm.com/support/pages/performance-monitor-data-collection-and-reports-using-db2mon
https://www.ibm.com/docs/en/db2/11.5?topic=tuning-collecting-reporting-performance-monitor-data
Common concepts
Concept | Description | Link |
---|---|---|
Activity | Every DML,DDL,CALL,LOAD. Only execution of the access plan | https://www.ibm.com/docs/en/db2/11.5?topic=identification-activities |
The package cache | An area of memory that DB2 uses to store access plans | https://datageek.blog/en/2016/09/20/db2-memory-area-in-depth-the-package-cache/ |
"Compile" | Explicit compiles are compilations directly initiated by a user request such as a bind, rebind, prepare or execute immediate | https://www.ibm.com/docs/en/db2/9.7?topic=elements-total-compile-time-total-compile-time |
"Section" | A section is the DB2 UDB executable version of a SQL statement | https://www.progress.com/tutorials/odbc/db2-udb |
Latches | The internal DB2 locks used to serialize access to shared data structures | https://www.ibm.com/support/pages/identifying-bufferpool-latch-contention https://www.ibm.com/support/pages/measuring-latching-overhead-monitoring-functions-db2 |
CF | Cluster caching facility PureScale concept | https://www.ibm.com/docs/en/db2/11.5?topic=ct-db2-purescale-feature-cluster-caching-facility-cf-configuration |
Prefetch | Prefetch is a mechanism for reading a set of pages into the buffer pool with only one asynchronous I/O operation. | https://www.ibm.com/docs/en/db2-for-zos/12?topic=explain-prefetch-access-paths-prefetchd-s-l-u |
FCM | Fast Communication Manager | https://www.ibm.com/docs/en/db2/11.5?topic=allocation-fcm-buffer-pool-memory-requirements |
Request | (internal) DB2 UDB processing actually works in terms of SQL requests rather than SQL statements | https://www.progress.com/tutorials/odbc/db2-udb |
Asynchronous/synchronous | Synchronous read - CPU wait for the result. Asynchronous - opposite. Sequential Prefetch,Dynamic prefetch | https://www.toolbox.com/tech/oracle/question/synchronous-i-o-vs-asynchronous-i-o-030714/ |
Find the database throughput section near the beginning to identify work done by the system:
Throughput metrics at database level
Metric | Meaning | Sample value | Env1 | Env2 |
---|---|---|---|---|
TS_DELTA | db2mon time parameter | 35 | 121 | 190 |
ACT_PER_S | Number of completed activities per second | 22629.7 | 641.7 | 11.5 |
CMT_PER_S | Number of commits per second | 2361.1 | 641.4 | 6.8 |
RB_PER_S | Rollback per second | 0.0 | 0.0 | 1.1 |
DDLCK_PER_S | Deadlock per second | 0.0 | 0.0 | 0.0 |
SEL_P_S | Selects per second | 13089.6 | 0.0 | 5.0 |
UID_P_S | Update/Insert/Delete statements per second | 9540.0 | 641.6 | 1.9 |
ROWS_INS_P_S | Rows inserted per second | 4364.0 | 0.0 | 0.6 |
ROWS_UPD_P_S | Rows updated per second | 641.4 | 0.1 (403.1 members) | |
ROWS_RET_P_S | Rows returned per second | 0.0 | 13436.9 | |
ROWS_MOD_P_S | Rows modified per second | 648.0 | 2500.0 | |
PKG_CACHE_INS_P_S | The total number of times that a requested section was not available for use and had to be loaded into the package cache / per second | 0.2 | 0.6 | |
P_RD_PER_S | Number of physical reads from containers to buffer pools per second | 26.5 | 0.0 (1467 members) |
Analysis of processing time for the database
Time breakdown at database level (wait + processing)
PCT_COMPILE: Spotting high compile times (over 15%) on a transactional system can be a sign of an application that is not following the best practice of using parameter markers and is instead using literals. Comment:
-
Env1 : transactional, most of the time is committing
-
Env2 : OLAP, no committing, also column organized tables
Metric | Meaning | Sample value | Env1 | Env2 |
---|---|---|---|---|
TOTAL_RQST_TM | The total amount of time spent working on requests. | 1035374 | 2209856 | 6701006 |
PCT_COMPILE | Percentage of request time spent on explicit compile | 0.04 | 0.05 | 0.13 |
PCT_SECTION | Percentage of request time spent on executing section | 78.59 | 2.21 | 98.78 |
PCT_SORT | Percentage of request time spent on executing sorting inside section | 0.74 | 0.0 | 0.02 |
PCT_COL | Percentage of request time spent on accessing column data in columnar tables | 0.0 | 0.0 | 25.47 (members 68.54) |
PCT_COL_SYNOP | Percentage of request time spent accessing column synopsis tables | 0.0 | 0.0 | 0.00 |
PCT_COMMIT | Percentage of request time spent on commits | 9.38 | 91.59 | 0.01 |
PCT_RBACK | Percentage of request time spent on rollbacks | 0.0 | 0.0 | 0.00 |
PCT_CONN | Percentage of request time spent on connects or reconnects | 0.0 | 0.0 | 0.33 |
PCT_RTN_USR_CODE | Percentage of request time spent on total amount of time spent executing in routines outside DB2, usually user code | routines | 0.0 | 0.0 |
PCT_BACKUP | The percentage of request time spent on online backups | 0.0 | 0.0 | 0.00 |
PCT_IDX_BLD | The percentage of request time spent on index creation or recreation | 0.0 | 0.0 | 0.00 |
PCT_RUNSTATS | The percentage of request time spent on creating stat, running runstats | 0.0 | 0.0 | 0.02 |
PCT_REORG | The percentage of request time spent on performing reorg | 0.0 | 0.0 | 0.00 |
PCT_LOAD | The percentage of request time spent on loading | 0.0 | 0.0 | 0.00 |
Analysis of time spent waiting by the database. This report is often the most important top-level information on performance health
Wait times at database level
In this example, each member spends 70% of each request waiting (PCT_RQST_WAIT). The majority of this wait time is spent performing CF communication (PCT_CF), followed by log writes (PCT_LG_DST) and lock wait (PCT_LOCK).
-
Env1: PCT_LG_DSK, transactional workload, most of the time transaction log writing
-
Env2: PCT_FCM, PCT_LTCH, Distributed, MMP, most of the time communication between nodes, PC
Metric | Meaning | Sample value | Env1 | Env2 (coord) | Env2 (member) |
---|---|---|---|---|---|
TOTAL_RQST_TM | The total amount of time spent working on requests (The same as above) | 1035374 | 2209856 | 6701006 | 5608004 |
TOTAL_WAIT_TM | The total amount of time spent on waiting | 732450 | 2176203 | 6367786 | 3107040 |
PCT_RQST_WAIT | The percentage of total time spent on waiting | 70.74 | 98.47 | 95.02 | 55.40 |
PCT_LOCK | The percentage of total time spent on waiting on locks | 1.31 | 0.00 | 0.00 | 3.33 |
PCT_GLB_LOCK | The percentage of time spent on global lock waits, the lock being on a remote member | 0.57 | 0.0 | 0.00 | 0.00 |
PCT_LTCH | The percentage of time spent on latches | 0.89 | 0.11 | 0.01 | 11.28 |
PCT_LG_DSK | The percentage of time spent on waiting for log records to be flushed to disk | 4.28 | 91.52 | 0.00 | 0.00 |
PCT_LG_BUF | The percentage of time spent on s waiting for space into the log buffer | 0.00 | 0.00 | 0.00 | |
PCT_RCLM | (PureScale) The percentage of time spent on waiting on page locks, where the lock request caused a page to be reclaimed | 0.0 | 0.00 | 0.00 | |
PCT_CF | (PureScale) The percentage of time spent on communicating with the cluster caching facility. | 56.26 | 0.0 | 0.00 | 0.00 |
PCT_PFTCH | The percentage of time spent on waiting for an I/O server(prefetcher) to finish loading pages into the buffer pool. | 0.0 | 0.00 | 0.79 | |
PCT_DIAG | The percentage of time spent on waiting on a write to the db2diag logfile. | 0.0 | 0.00 | 0.00 | |
PCT_POOL_R | The percentage of time spent on reading data and index pages from the table space containers (physical) for all types of tablespaces | 0.82 | 0.00 | 4.53 | |
PCT_DIR_R | The percentage of time spent required to perform the direct reads, outside buffer pool | 0.6 | 0.29 | 0.00 | |
PCT_DIR_W | The percentage of time required to perform the direct writes, outside buffer pool | 0.00 | 0.41 | 0.38 | |
PCT_FCM | The percentage of time spent on waiting to receive data through FCM or blocking on an FCM send operation | 0.0 | 91.61 | 35.77 | |
PCT_TCPIP | The percentage of time spent on blocking on a TCP/IP send to the client or waiting for an incoming client request over TCP/IP excluding idle time | 5.93 | 0.00 | 0.00 |
Identify the busiest statements
Top SQL statements by execution time
Metric | Meaning | Sample value | Env1 | Env2 |
---|---|---|---|---|
NUM_EXEC | Number of executions | 61729 | 19399 | 2 |
COORD_STMT_EXEC_TIME | Execution time for statement by coordinator agent monitor element | 301979 | 15940 | 802533 |
AVG_COORD_EXEC_TIME | Average coordinator execution time per single statement execution | 4.89 | 33.03 | 401266.50 |
PCT_COORD_STMT_EXEC_TIME | Percentage of coordinator statement time per this statement | 15.09 | 47.00 | |
TOTAL_CPU_TIME | Total CPU time spent on statement | 29928189 | 10449038 | 4981 |
AVG_CPU_TIME | Average CPU time spend on a single execution of the statement | 484 | 538 | 2490 |
PCT_WAIT_TIME | The percentage of total CPU time spent on waiting | 53.67 | 36.31 | 95.73 |
AVG_SECT_TIME | The average section time spent in statement | 4.89 | 0.82 | 0.00 |
AVG_COL_TIME | The average elapsed time spent accessing columnar data in a query on column-organized tables. | 0.00 | 0.00 | 0.00 |
Metric | Meaning | Sample value | Env1 | Env2 |
---|---|---|---|---|
PCT_WAIT | The percentage of total CPU time spent on waiting (as above) | 53.67 | 36.31 | 95.73 |
PCT_LG_DSK | The percentage of time spent on waiting for log records to be flushed to disk | 0.00 | 0.00 | 0.55 |
PCT_LG_BUF | The percentage of time spent on waiting for space into the log buffer | 0.00 | 0.00 | 0.00 |
PCT_LOCK | The percentage of total time spent on waiting on locks | 2.37 | 0.00 | 0.00 |
PCT_GL_LK | The percentage of time spent on global lock waits, the lock is on a remote member | 0.36 | 6.84 | 0.00 |
PCT_LTCH | The percentage of time spent on latches | 2.34 | 0.00 | 0.00 |
PCT_RCLM | (PureScale) The percentage of time spent on waiting on page locks, where the lock request caused a page to be reclaimed | 0.00 | 0.00 | 0.00 |
PCT_CF | (PureScale) The percentage of time spent on communicating with the cluster caching facility. | 47.55 | 0.00 | 0.00 |
PCT_PFTCH | The percentage of time spent on waiting for an I/O server(prefetcher) to finish loading pages into the buffer pool. | 0.00 | 0.00 | 0.00 |
PCT_DIAG | The percentage of time spent on waiting on a write to the db2diag logfile. | 0.00 | 0.00 | 0.00 |
PCT_POOL_R | The percentage of time spent on reading data and index pages from the tablespace containers (physical) for all types of tablespaces | 1.40 | 29.47 | 0.00 |
PCT_DIR_R | The percentage of time spent required to perform the direct reads, outside buffer pool | 0.00 | 0.00 | 0.55 |
PCT_DIR_W | The percentage of time required to perform the direct writes, outside buffer pool | 0.00 | 0.00 | |
PCT_FCM | The percentage of time spent on waiting to receive data through FCM or blocking on an FCM send operation | 0.00 | 25.97 |
Metric | Meaning | Sample value | Env1 | Env2 |
---|---|---|---|---|
NUM_EXEC | Number of executions (the same as above) | 61729 | 19339 | 2 |
AVG_D_LRD | The average number of data pages that have been requested from the buffer pool (logical) for regular and large tablespaces | 0.0 | 5.1 | 4 |
AVG_D_PRD | The average number of data pages read in from the tablespace containers (physical) for regular and large table spaces | 0.0 | 0.0 | 0 |
AVG_I_LRD | The average number of index pages that have been requested from the buffer pool (logical) for regular and large tablespaces. | 607.4 | 3.0 | 3 |
AVG_I_PRD | The average number of index pages read in from the tablespace containers (physical) for regular and large tablespaces. | 0.1 | 0.0 | 0 |
AVG_TD_PRD | The average number of data pages read in from the tablespace containers (physical) for temporary tablespaces | 0.0 | 0.0 | 0 |
AVG_TI_PRD | The average number of index pages read in from the tablespace containers (physical) for temporary tablespaces | 0.0 | 0.0 | |
AVG_COL_LRD | The average number of column-organized pages read in from the tablespace containers (physical) for regular and large table spaces | 0.0 | 0.0 | 0 |
AVG_COL_PRD | The average number of column-organized pages read in from the tablespace containers (physical) for regular and large table spaces | 0.0 | 0.0 | 0 |
AVG_DIR_R_RQS | The average number of requests to perform a direct read of one or more sectors of data | 0.0 | 0.0 | 2 |
AVG_DIR_W_RQS | The average number of requests to perform a direct-write of one or more sectors of data | 0.0 | 0.0 | 0 |
Examining Log Disk Performance
Database log write times
A high percentage of log disk wait time (PCT_LG_DSK in db2mon output) is common for transactional applications.
ENV1/LOG_WRITE_TIME_PER_IO_MS : log write is very slow, a bottleneck
Metric | Meaning | Sample value | Env1 | Env2 coord | Env2 member |
---|---|---|---|---|---|
NUM_LOG_WRITE_IO | The number of I/O requests issued by the logger for writing logdata to the disk | 57237 | 8756 | 867 | 4334 |
LOG_WRITE_IO_PER_S | The number of I/O log requests per second | 1788.6562 | 1788.6562 | 4.5631 | 22.8105 |
LOG_WRITE_MB_PER_S | The size of writing log data to disk in MB | 7.0000 | 5.2809 | 0.0421 | 0.3421 |
LOG_WRITE_TIME | The total elapsed time spent by the logger writing log data to the disk | 25306 | 110772 | 650 | 4217 |
LOG_WRITE_TIME_PER_IO_MS | The time elapsed by the logger writing log data per a single I/O request | 0.4421 | 12.6509 | 0.7497 | 1.1421 |
NUM_LOG_BUFFER_FULL | Number of times full log buffer caused agents to wait monitor element | 0 | 0 | 0 | 0 |
Buffer pool read time
High percentage bufferpool read time (PCT_POOL_R) is a common bottleneck, especially when the database is warming up (e.g. shortly after database activation or when a new application starts making connections). Average pool read times (AVG_READ_TIME) are reported at bufferpool and tablespace levels. Similarly, a high percentage for direct I/O time (for Large Objects) can be tracked in AVG_DRCT_READ_TIME and AVG_DRCT_WRITE_TIME.
Metric | Meaning | Sample value | Env1 | Env2 coord | Env2 member |
---|---|---|---|---|---|
BP_NAME | BP16 | ||||
POOL_DATA_L_READS | The number of data pages that have been requested from the buffer pool (logical) for regular and large table spaces | 3037899 | 359590 | 58506 | 167737 |
POOL_DATA_P_READS | Indicates the number of data pages read in from the tablespace containers (physical) for regular and large table spaces | 44159 | 3218 | 0 | 2723 |
POOL_INDEX_L_READS | Indicates the number of index pages that have been requested from the buffer pool (logical) for regular and large tablespaces. | 45043859 | 232830 | 56719 | 102246 |
POOL_INDEX_P_READS | Indicates the number of index pages read in from the tablespace containers (physical) for regular and large table spaces | 27427 | 0 | 0 | 75 |
POOL_COL_L_READS | Indicates the number of column-organized pages requested from the buffer pool (logical) for regular and large tablespaces | 0 | 0 | 0 | 909672 |
POOL_COL_P_READS | Indicates the number of column-organized pages read in from the table space containers (physical) for regular and large table spaces | 0 | 0 | 0 | 257798 |
POOL_READ_TIME | Indicates the total amount of time spent reading in data and index pages from the table space containers (physical) for all types of table spaces | 19619 | 18273 | 0 | 221163 |
AVG_READ_TIME | Average pool read time per a single pool read request | 5.67 | 0.27 | 0 | 0.84 |
Metric | Meaning | Sample value | Env1 | Env2 |
---|---|---|---|---|
TBSP_NAME | Tablespace name | TBS_OM | ---- | |
NUM_READS | Number of all read from tablespace: data, index and columnar | 29235 | 3217 | 329906 |
AVG_READ_TIME | The average read time for all requests | 0.27 | 5.68 | 0.86 |
DIRECT_READ_REQS | The number of requests to perform a direct read of one or more sectors of data | 0 | 0 | 0 |
AVG_DRCT_READ_TIME | The average direct read time | 0 | 0 | 0 |
NUM_WRITES | The number of all types of buffer pool data/index/columnar page was physically written to disk | 77301 | 74 | |
AVG_WRITE_TIME | The average write time of all types of data | 2.06 | 0.1 | |
DIRECT_WRITE_REQS | The number of direct-write requests | 0 | 0 | |
AVG_DRCT_WRITE_TIME | The average time od direct write request | 0 | 0 |
Round-trip time for messages between the members and CFs (PureScale)
For reports that show high percentage CF wait time (PCT_CF), the following report shows performance data on messages exchanged between the Db2 member and the CFs.
Round-trip CF command execution counts and average response times
Metric | Meaning | Sample value | Env1 |
---|---|---|---|
CF_CMD_NAME | Command name | ||
TOTAL_CF_REQUESTS | |||
AVG_CF_REQUEST_TIME_MICRO |
Reclaims of Index and Data pages (Pure Scale)
Metric | Meaning | Sample value | Env1 |
---|---|---|---|
TABSCHEMA | |||
TABNAME | |||
OBJTYPE | |||
PAGE_RECLAIMS | The number of times a page related to the object was reclaimed by another member in the Db2pureScale | ||
RECLAIM_WAIT_TIME | The amount of time spent waiting on page locks,where the lock request caused a page to be reclaimed. |
Reclaims of SMP pages (PureScale)
The following example shows reclaims of SMP pages, which are often due to heavy inserts into a table with a too-small tablespace extent size.
Metric | Meaning | Sample value | Env1 |
---|---|---|---|
TABSCHEMA | |||
TABNAME | |||
OBJTYPE | |||
SMP_PAGE_RECLAIMS | The number of times a page related to a space map pagewas reclaimed by another member in the Db2pureScale® instancebefore its planned release | ||
SMP_PAGE_RECLAIM_WAIT_TIME | The amount of time spent waiting on page locks,where the lock request caused a page to be reclaimed. |
Latch wait time
Metric | Meaning | Sample value | Env1 | Env2 |
---|---|---|---|---|
LATCH_NAME | SQLO_LT_SQLB_BPD__bpdLatch_SX | SQLO_LT_ibm_cde__query__TableQueue__mTQMutex | ||
TOT_EXT_LATCH_WAIT_TIME_MS | The amount of time, in milliseconds, spent in extended latch waits | 7182 | 2588 | 601828 |
TOT_EXT_LATCH_WAITS | The number of extended latch waits | 4454 | 453 | 101 |
TIME_PER_LATCH_WAIT_MS | The average amount of time spent on single wait | 1.61 | 5.71 | 5958.69 |
Example for ENV1
MEMBER LATCH_NAME TOT_EXT_LATCH_WAIT_TIME_MS TOT_EXT_LATCH_WAITS TIME_PER_LATCH_WAIT_MS
------ ------------------------------------------------------------ -------------------------- -------------------- ----------------------
0 SQLO_LT_SQLB_BPD__bpdLatch_SX 2588 453 5.71
0 SQLO_LT_SQLP_DBCB__sqlp_dbcb_sem 343 28 12.25
0 SQLO_LT_SQLD_PAGE_CACHE__pageCacheLatch 20 537 0.03
0 SQLO_LT_sqlo_waitpost__cs 6 5 1.20
0 SQLO_LT_SQLP_LHSH__xhshlatch 1 1 1.00
0 SQLO_LT_sqlo_WaitPostThreshold__cs 1 1 1.00
0 SQLO_LT_sqmPartitionedRequestMetrics__metricsSyncPoint 1 1 1.00
0 SQLO_LT_NO_IDENTITY 0 8 0.00
0 SQLO_LT_SMemPool__MemLatchType__latch 0 1 0.00
0 SQLO_LT_sqmObjMetricsTable__m_objMetricsLatch 0 4 0.00
0 SQLO_LT_sqlra_workspace_sibling__workspace_sibling_latch 0 2 0.00
Example for doc sample
MEMBER LATCH_NAME TOT_EXT_LATCH_WAIT_TIME_MS TOT_EXT_LATCH_WAITS TIME_PER_LATCH_WAIT_MS
------ ------------------------------------------------------------ -------------------------- -------------------- ----------------------
0 SQLO_LT_SQLB_BPD_bpdLatch_SX 7182 4454 1.61
3 SQLO_LT_SQLB_BPD_bpdLatch_SX 4085 1965 2.07
2 SQLO_LT_SQLB_BPD_bpdLatch_SX 3086 2088 1.47
1 SQLO_LT_SQLB_BPD_bpdLatch_SX 1764 1724 1.02
0 SQLO_LT_SQLB_BPD_WARLatch 1272 1783 0.71
2 SQLO_LT_SQLB_BPD_WARLatch 1168 2637 0.44
3 SQLO_LT_SQLB_BPD_WARLatch 706 2110 0.33
1 SQLO_LT_SQLB_BPD_WARLatch 509 1870 0.27
0 SQLO_LT_SQLP_LHSH_hshlatch 263 758 0.34
3 SQLO_LT_SQLI_INX_PAGE_CACHE_ipcLatch 240 36 6.66
2 SQLO_LT_SQLI_INX_PAGE_CACHE_ipcLatch 221 45 4.91
1 SQLO_LT_SQLI_INX_PAGE_CACHE_ipcLatch 193 21 9.19
0 SQLO_LT_SQLI_INX_PAGE_CACHE_ipcLatch 175 25 7.00
Example for Env2
MEMBER LATCH_NAME TOT_EXT_LATCH_WAIT_TIME_MS TOT_EXT_LATCH_WAITS TIME_PER_LATCH_WAIT_MS
------ ------------------------------------------------------------ -------------------------- -------------------- ----------------------
15 SQLO_LT_SQLKT_CB__multiReaderLatch 2760789 28 98599.60
7 SQLO_LT_ibm_cde__query__TableQueue__mTQMutex 601828 101 5958.69
12 SQLO_LT_ibm_cde__query__TableQueue__mTQMutex 596051 96 6208.86
16 SQLO_LT_ibm_cde__query__TableQueue__mTQMutex 559837 98 5712.62
2 SQLO_LT_SQLB_BPD__bpdLatch_SX 4592 310 14.81
1 SQLO_LT_SQLB_BPD__bpdLatch_SX 4574 263 17.39