Semaphore Lock Scaling

LinuxKI Warning

SysV/Posix Semaphore Lock Scaling Issue
Date: 10/16/2015

Problem

High system time and poor performance for applications which employ semaphores, such as Oracle, on scale-up hardware.

Investigation

On a Superdome-X system running RHEL 6.7 with Oracle 12c, extremely high system time was observed. A Linux KI Toolset data collection was taken and the kparse report flagged high CPU utilization:

1.1 Global CPU Usage Counters
    nCPU          sys%        user%        idle%
      80        89.60%        8.82%        1.59%
Warning: CPU Bottleneck (Idle < 10%)

The kiprof (profile) report showed semtimedop() and semctl() system calls accounted for the majority of system CPU consumption:

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Kernel Functions executed during profile
   Count     Pct  State  Function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   99387  65.74%  SYS    sys_semtimedop
   36853  24.38%  SYS    sys_semctl

Examination of the hardclock records showed three main code locations:

$ grep " hardclock state=SYS sys_sem" ki.MMDD_HHMM|awk '{print $7}'|sort|uniq -c|sort -rn
  65522 sys_semtimedop+0x3c1
  34897 sys_semctl+0x137
  33852 sys_semtimedop+0x615

A review of the kernel debug information via gdb shows that the hot instruction addresses fall within the inlined sem_lock():

$ cat uname-a.MMDD_HHMM
Linux tux 2.6.32-573.el6.x86_64 #1 SMP Wed Jul 1 18:23:37 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux

$ gdb /usr/lib/debug/lib/modules/2.6.32-573.el6.x86_64/vmlinux
(gdb) list *(sys_semtimedop+0x3c1)
0xffffffff81221fa1 is in sys_semtimedop (ipc/sem.c:1668).
1663          error = security_sem_semop(sma, sops, nsops, alter);
1664          if (error)
1665                 goto out_rcu_wakeup;
1666  
1667          error = -EIDRM;
1668          locknum = sem_lock(sma, sops, nsops);
1669          if (sma->sem_perm.deleted)
1670                 goto out_unlock_free;
1671          /*
1672          * semid identifiers are not unique - find_alloc_undo may have
 
0218 /*
0219  * If the request contains only one semaphore operation, and there are
0220  * no complex transactions pending, lock only the semaphore involved.
0221  * Otherwise, lock the entire semaphore array, since we either have
0222  * multiple semaphores in our own semops, or we need to look at
0223  * semaphores from other pending complex operations.
0224  */
0225 static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
0226                               int nsops)
0227 {

sem_lock() either obtains the spinlock protecting the entire semaphore set or the spinlock protecting only the individual semaphore involved in the operation.
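
To illustrate the decision being described, here is a minimal user-space model of that logic. It is a sketch only, not the kernel code; the structure and field names are invented for the example:

/*
 * Illustrative model only -- not the actual kernel implementation.
 * A "simple" operation (one semaphore, no complex operations pending)
 * takes just the lock of the semaphore involved; anything else takes
 * the lock covering the whole semaphore set.
 */
#include <pthread.h>

struct sem_model {
        pthread_spinlock_t lock;        /* per-semaphore lock */
        int value;
};

struct sem_set_model {
        pthread_spinlock_t set_lock;    /* protects the entire set */
        int complex_count;              /* complex operations pending */
        struct sem_model *sems;
        int nsems;
};

/* Returns the index locked, or -1 if the whole set was locked. */
static int sem_lock_model(struct sem_set_model *sma, int sem_num, int nsops)
{
        if (nsops == 1 && sma->complex_count == 0) {
                /* Simple op: only this one semaphore is serialized. */
                pthread_spin_lock(&sma->sems[sem_num].lock);
                return sem_num;
        }
        /* Complex op: every other operation on the set must wait. */
        pthread_spin_lock(&sma->set_lock);
        return -1;
}

With a simple operation only that semaphore's lock is held, so operations on different semaphores in the same set can proceed in parallel; a complex operation serializes against every other operation on the set, which is why a very large set can become a contention point.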

An examination of the semaphore configuration showed large values for SEMMSL (the number of semaphores per semaphore set) and SEMOPM (the maximum number of operations per semop() system call):

$ grep sem sysctl-a.MMDD_HHMM
kernel.sem = 4096 512000 1600 2048
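
The four kernel.sem values are, in order, SEMMSL, SEMMNS, SEMOPM and SEMMNI. As a cross-check, the effective limits can also be read programmatically with the Linux-specific semctl(IPC_INFO) call; a minimal sketch (the caller must define union semun, and struct seminfo requires _GNU_SOURCE):

#define _GNU_SOURCE                   /* for struct seminfo */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>

union semun {                         /* the caller must define this */
        int val;
        struct semid_ds *buf;
        unsigned short *array;
        struct seminfo *__buf;
};

int main(void)
{
        struct seminfo si;
        union semun arg;

        arg.__buf = &si;
        if (semctl(0, 0, IPC_INFO, arg) == -1) {
                perror("semctl(IPC_INFO)");
                return 1;
        }

        /* Same order as kernel.sem: SEMMSL SEMMNS SEMOPM SEMMNI */
        printf("SEMMSL=%d SEMMNS=%d SEMOPM=%d SEMMNI=%d\n",
               si.semmsl, si.semmns, si.semopm, si.semmni);
        return 0;
}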

Solution

Reducing SEMMSL to 250 resulted in the required semaphores being spread across a greater number of semaphore sets; because each set is protected by its own lock, this reduced the lock contention and resolved the high system CPU utilization. Note that the Oracle 12c documentation recommends the following settings:

kernel.sem = 250 32000 100 128
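
To make the effect concrete, the sketch below splits an invented requirement of 1000 semaphores across sets no larger than SEMMSL; it is illustrative only, not Oracle's actual allocation logic:

/*
 * Sketch: with SEMMSL capped at 250, a request for 1000 semaphores
 * is split across four sets, each with its own lock, instead of one
 * heavily contended set.  Error handling is minimal.
 */
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
        int needed = 1000;            /* total semaphores required (example) */
        int semmsl = 250;             /* max semaphores per set              */
        int nsets  = (needed + semmsl - 1) / semmsl;
        int ids[16];

        for (int i = 0; i < nsets; i++) {
                int nsems = (needed > semmsl) ? semmsl : needed;
                ids[i] = semget(IPC_PRIVATE, nsems, IPC_CREAT | 0600);
                if (ids[i] == -1) {
                        perror("semget");
                        return 1;
                }
                needed -= nsems;
        }
        printf("created %d semaphore sets\n", nsets);

        /* Remove the sets again. */
        for (int i = 0; i < nsets; i++)
                semctl(ids[i], 0, IPC_RMID);
        return 0;
}

Simple semop() operations on semaphores in different sets then contend on different locks, which is what relieved the sem_lock() contention seen in the profile.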

Please note that kernels prior to RHEL 6.6 (2.6.32-504) and SLES 12 are missing the following critical locking change, which can make the semaphore set spinlock contention even worse:

BZ#880024

Previously, the locking of a semtimedop semaphore operation was not fine enough with remote non-uniform memory architecture (NUMA) node accesses. As a consequence, spinlock contention occurred, which caused delays in the semop() system call and high load on the server when running numerous parallel processes accessing the same semaphore. This update improves scalability and performance of workloads with a lot of semaphore operations, especially on larger NUMA systems. This improvement has been achieved by turning the global lock for each semaphore array into a per-semaphore lock for many semaphore operations, which allows multiple simultaneous semop() operations. As a result, performance degradation no longer occurs.