SAP DB2 semget - HewlettPackard/LinuxKI GitHub Wiki

LinuxKI Warning

Poor Performance for Linux SAP clients accessing DB2 database server
Date: 01/08/2014

Problem

When using an application running on a remote DB2 client such as SAP, poor performance can be observed when accessing the DB2 database due to a high number of failed semget() calls used by the internal trace facility (db2trc), even if tracing facility is not enabled.

Background

In a recent benchmark, a customer was comparing the performance of their SAP application running on a remote DB2 client using both an HP DL380 (Linux) server and an IBM P770 (AIX) server. While the P770 server had a higher clock speed (3.1 GHz) than the DL380 (2.2 GHz), the performance of the DL380 was 2.5 to 3 times worse, which is worse than expected.

Investigation

While the CPU speed was considered a primary factor, Linux KI trace data was collected for 40 seconds with the runki script and analyzed to understand how the SAP client processes spent their time. Below is edited output from the PID Analysis Report (kiinfo -kipid) for one of the SAP client processes:

PID 1026  dw.sapRU1_D00
  PPID 7939  dw.sapRU1_D00

    ********* SCHEDULER ACTIVITY REPORT ********
    RunTime    :  6.187232  SysTime   :  0.451327   UserTime   :  5.735905
    SleepTime  : 31.581517  Sleep Cnt :     30901   Wakeup Cnt :     26398
    RunQTime   :  2.191165  Switch Cnt:     35158   PreemptCnt :      4257
    Last CPU   :         2  CPU Migrs :      7346   NODE Migrs :      4128
    Policy     : SCHED_NORMAL     vss :   8060548          rss :     58919

    busy   :     15.48%
      sys  :      1.13%
      user :     14.35%
    runQ   :      5.48%
    sleep  :     79.03%

Note that while the CPU is slower, the process spends most of its time sleeping. The System Call Report for this process shows the a lot of time spent sleeping in rwsem_down_failed_common():

    ******** SYSTEM CALL REPORT ********
    System Call Name     Count     Rate     ElpTime        Avg        Max    Errs    AvSz     KB/s
    semget               27022    676.2   26.619859   0.000985   0.006248   27022
       SLEEP             26305    658.3   24.837537   0.000944
          Sleep Func     26304            24.837537   0.000944   0.006212  rwsem_down_failed_common
       RUNQ                                1.462762
       CPU                                 0.319563
    recvfrom              4649    116.3    7.017916   0.001510   0.110619      60     761     86.5
       SLEEP              4535    113.5    6.711683   0.001480
          Sleep Func      4535             6.711683   0.001480   0.110599  sk_wait_data
       RUNQ                                0.250706
       CPU                                 0.055527
    sendto                4485    112.2    0.073516   0.000016   0.001277       0     686     75.2
       RUNQ                                0.004954
       CPU                                 0.068562

Note that the process made 27022 semget() calls that all failed (27022 errors). Many of the calls to semget() would sleep trying to obtain an exclusive lock on the internal semaphore's rwsem lock. The lock contention is due to many SAP processes calling semget() in parallel. Looking at the individual kitrace records for the processes, we can verify that all the semget() calls fail with an errno 2 (ENOENT) using the same semaphore key value:

    0.100159 cpu=2 pid=1026 recvfrom ret=0xd68 syscallbeg=    0.000802 sockfd=9 *buf=0x2b035e9d5fe0 len=131072 
flags=0x0 *src_addr=0x0 *addrlen=0x0
    0.100908 cpu=3 pid=1026 semget ret=-2 syscallbeg=    0.000725 key=0x10cabb74 nsems=1 semflag=0x0
    0.101352 cpu=3 pid=1026 semget ret=-2 syscallbeg=    0.000441 key=0x10cabb74 nsems=1 semflag=0x0
    0.102538 cpu=3 pid=1026 semget ret=-2 syscallbeg=    0.001183 key=0x10cabb74 nsems=1 semflag=0x0
    0.103752 cpu=3 pid=1026 semget ret=-2 syscallbeg=    0.001210 key=0x10cabb74 nsems=1 semflag=0x0
    0.104775 cpu=3 pid=1026 semget ret=-2 syscallbeg=    0.001020 key=0x10cabb74 nsems=1 semflag=0x0
    0.106332 cpu=3 pid=1026 semget ret=-2 syscallbeg=    0.001554 key=0x10cabb74 nsems=1 semflag=0x0
    0.106633 cpu=3 pid=1026 writev ret=0x224 syscallbeg=    0.000016 fd=16 *iov=0x7fffa5a18230 iovcnt=2 

This ENOENT error indicates that the semaphore with the key 0x10cabb74 does not exist and the IPC_CREAT flag is not set in the semflag parameter. To identify why the SAP process is trying to obtain a semaphore using a non-existent key, gstack/pstack was used to identify user-space stack trace when the semget() call was made:

#0  0x00000033baaea777 in semget () from /lib64/libc.so.6
#1  0x00002baee1b3c959 in sharedMutexAttach(TRC_SHARED_IPC_INFO_T*, int*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#2  0x00002baee1b3bb60 in sharedAttach(TRC_SHARED_IPC_INFO_T*, bool) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#3  0x00002baee1b37c80 in trcAttachShared () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#4  0x00002baee1ea4eeb in internalProcessTraceInit () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#5  0x00002baee1ea4cea in sqltinit () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#6  0x00002baee2081f9d in CLI_scrollGetDescAttr(db2UCinterface*, dataDescriptor**, unsigned long*, unsigned char) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#7  0x00002baee1c5145c in CLI_sqlCompleteCallback(db2UCinterface*, int) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#8  0x00002baee1e09871 in sqljrRqstComplete(sqljrDrdaArCb*, db2UCinterface*, int, bool) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#9  0x00002baee22ca8d1 in sqljrParseOpenQueryReply(db2UCinterface*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#10 0x00002baee22e1f45 in sqljrParse(db2UCinterface*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#11 0x00002baee22d397f in sqljrDrdaArOpen(db2UCinterface*, db2UCCursorInfo*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#12 0x00002baee21255d7 in csmOpen(db2UCinterface*, db2UCCursorInfo*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#13 0x00002baee1c593ab in CLI_sqlOpen(CLI_STATEMENTINFO*, CLI_ERRORHEADERINFO*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#14 0x00002baee1fe112f in SQLExecute2(CLI_STATEMENTINFO*, CLI_ERRORHEADERINFO*) () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so
#15 0x00002baee1ffc245 in SQLExecute () from /usr/sap/RU1/D00/exe/db6_clidriver/lib/libdb2.so...

From the stack trace, the semaphore appears to be defined in the SQL tracing facility for DB2 (libdb2.so). Note the call to internalProcessTraceInit() above.

Note that the P770 system has the same problem with the failing semget() calls, but the semget() calls on AIX do not suffer from the same rwlock contention in the kernel code that Linux does. So what is a minor problem on AIX is a major problem on Linux.

Note that the Kparse report (kiinfo -kparse) will detect the issue:

  • Warning: SAP worker task with many semget() errors (Freq > 50)

Solution

The problem and its solution are documented by IBM in the following article:

https://www-304.ibm.com/support/docview.wss?rs=71&uid=swg21209523&wv=1

The solution is to issue the following command at the DB2 client to allocate the semaphore in question.

db2trc alloc

Take special care that the right db2trc binary is used if multiple library versions are present on the system. Customer should consult with IBM if further details are needed.

After implementing the db2trc alloc, the performance of the DL380 was 30% faster than the P770, despite the difference in the CPU clock speed for this specific test.