SAP DB2 semget - HewlettPackard/LinuxKI GitHub Wiki
LinuxKI Warning
Poor Performance for Linux SAP clients accessing DB2 database server
Date: 01/08/2014
Problem
When using an application running on a remote DB2 client such as SAP, poor performance can be observed when accessing the DB2 database due to a high number of failed semget() calls used by the internal trace facility (db2trc), even if tracing facility is not enabled.
Background
In a recent benchmark, a customer was comparing the performance of their SAP application running on a remote DB2 client using both an HP DL380 (Linux) server and an IBM P770 (AIX) server. While the P770 server had a higher clock speed (3.1 GHz) than the DL380 (2.2 GHz), the performance of the DL380 was 2.5 to 3 times worse, which is worse than expected.
Investigation
While the CPU speed was considered a primary factor, Linux KI trace data was collected for 40 seconds with the runki script and analyzed to understand how the SAP client processes spent their time. Below is edited output from the PID Analysis Report (kiinfo -kipid) for one of the SAP client processes:
PID 1026 dw.sapRU1_D00
PPID 7939 dw.sapRU1_D00
********* SCHEDULER ACTIVITY REPORT ********
RunTime : 6.187232 SysTime : 0.451327 UserTime : 5.735905
SleepTime : 31.581517 Sleep Cnt : 30901 Wakeup Cnt : 26398
RunQTime : 2.191165 Switch Cnt: 35158 PreemptCnt : 4257
Last CPU : 2 CPU Migrs : 7346 NODE Migrs : 4128
Policy : SCHED_NORMAL vss : 8060548 rss : 58919
busy : 15.48%
sys : 1.13%
user : 14.35%
runQ : 5.48%
sleep : 79.03%
Note that while the CPU is slower, the process spends most of its time sleeping. The System Call Report for this process shows the a lot of time spent sleeping in rwsem_down_failed_common():
******** SYSTEM CALL REPORT ********
System Call Name Count Rate ElpTime Avg Max Errs AvSz KB/s
semget 27022 676.2 26.619859 0.000985 0.006248 27022
SLEEP 26305 658.3 24.837537 0.000944
Sleep Func 26304 24.837537 0.000944 0.006212 rwsem_down_failed_common
RUNQ 1.462762
CPU 0.319563
recvfrom 4649 116.3 7.017916 0.001510 0.110619 60 761 86.5
SLEEP 4535 113.5 6.711683 0.001480
Sleep Func 4535 6.711683 0.001480 0.110599 sk_wait_data
RUNQ 0.250706
CPU 0.055527
sendto 4485 112.2 0.073516 0.000016 0.001277 0 686 75.2
RUNQ 0.004954
CPU 0.068562
Note that the process made 27022 semget() calls that all failed (27022 errors). Many of the calls to semget() would sleep trying to obtain an exclusive lock on the internal semaphore's rwsem lock. The lock contention is due to many SAP processes calling semget() in parallel. Looking at the individual kitrace records for the processes, we can verify that all the semget() calls fail with an errno 2 (ENOENT) using the same semaphore key value:
0.100159 cpu=2 pid=1026 recvfrom ret=0xd68 syscallbeg= 0.000802 sockfd=9 *buf=0x2b035e9d5fe0 len=131072
flags=0x0 *src_addr=0x0 *addrlen=0x0
0.100908 cpu=3 pid=1026 semget ret=-2 syscallbeg= 0.000725 key=0x10cabb74 nsems=1 semflag=0x0
0.101352 cpu=3 pid=1026 semget ret=-2 syscallbeg= 0.000441 key=0x10cabb74 nsems=1 semflag=0x0
0.102538 cpu=3 pid=1026 semget ret=-2 syscallbeg= 0.001183 key=0x10cabb74 nsems=1 semflag=0x0
0.103752 cpu=3 pid=1026 semget ret=-2 syscallbeg= 0.001210 key=0x10cabb74 nsems=1 semflag=0x0
0.104775 cpu=3 pid=1026 semget ret=-2 syscallbeg= 0.001020 key=0x10cabb74 nsems=1 semflag=0x0
0.106332 cpu=3 pid=1026 semget ret=-2 syscallbeg= 0.001554 key=0x10cabb74 nsems=1 semflag=0x0
0.106633 cpu=3 pid=1026 writev ret=0x224 syscallbeg= 0.000016 fd=16 *iov=0x7fffa5a18230 iovcnt=2
This ENOENT error indicates that the semaphore with the key 0x10cabb74 does not exist and the IPC_CREAT flag is not set in the semflag parameter. To identify why the SAP process is trying to obtain a semaphore using a non-existent key, gstack/pstack was used to identify user-space stack trace when the semget() call was made:
#0 0x00000033baaea777 in semget () from /lib64/libc.so.6
#1 0x00002baee1b3c959 in sharedMutexAttach(TRC_SHARED_IPC_INFO_T*, int*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#2 0x00002baee1b3bb60 in sharedAttach(TRC_SHARED_IPC_INFO_T*, bool) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#3 0x00002baee1b37c80 in trcAttachShared () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#4 0x00002baee1ea4eeb in internalProcessTraceInit () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#5 0x00002baee1ea4cea in sqltinit () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#6 0x00002baee2081f9d in CLI_scrollGetDescAttr(db2UCinterface*, dataDescriptor**, unsigned long*, unsigned char) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#7 0x00002baee1c5145c in CLI_sqlCompleteCallback(db2UCinterface*, int) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#8 0x00002baee1e09871 in sqljrRqstComplete(sqljrDrdaArCb*, db2UCinterface*, int, bool) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#9 0x00002baee22ca8d1 in sqljrParseOpenQueryReply(db2UCinterface*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#10 0x00002baee22e1f45 in sqljrParse(db2UCinterface*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#11 0x00002baee22d397f in sqljrDrdaArOpen(db2UCinterface*, db2UCCursorInfo*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#12 0x00002baee21255d7 in csmOpen(db2UCinterface*, db2UCCursorInfo*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#13 0x00002baee1c593ab in CLI_sqlOpen(CLI_STATEMENTINFO*, CLI_ERRORHEADERINFO*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#14 0x00002baee1fe112f in SQLExecute2(CLI_STATEMENTINFO*, CLI_ERRORHEADERINFO*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#15 0x00002baee1ffc245 in SQLExecute () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so...
From the stack trace, the semaphore appears to be defined in the SQL tracing facility for DB2 (libdb2.so). Note the call to internalProcessTraceInit() above.
Note that the P770 system has the same problem with the failing semget() calls, but the semget() calls on AIX do not suffer from the same rwlock contention in the kernel code that Linux does. So what is a minor problem on AIX is a major problem on Linux.
Note that the Kparse report (kiinfo -kparse) will detect the issue:
- Warning: SAP worker task with many semget() errors (Freq > 50)
Solution
The problem and its solution are documented by IBM in the following article:
https://www-304.ibm.com/support/docview.wss?rs=71&uid=swg21209523&wv=1
The solution is to issue the following command at the DB2 client to allocate the semaphore in question.
db2trc alloc
Take special care that the right db2trc binary is used if multiple library versions are present on the system. Customer should consult with IBM if further details are needed.
After implementing the db2trc alloc, the performance of the DL380 was 30% faster than the P770, despite the difference in the CPU clock speed for this specific test.