NUMA Balancing

LinuxKI Warning

Oracle Performance degrades after upgrading to RHEL 7
Dated: 04/06/2015

Problem

After upgrading from RHEL 6 to RHEL 7, some Oracle customers have reported degraded performance, especially on large Non-Uniform Memory Access (NUMA) systems like the DL980 or Superdome X servers.

Investigation

During normal Oracle processing, LinuxKI data was collected and analyzed. Analysis of the individual Oracle processes shows that they spend a great deal of time sleeping in sleep_on_page(). A review of the stack traces for this wait event shows that each thread is in the middle of a page fault, sleeping while it waits for a page to be migrated from one NUMA node to another.

PID 150089  oraclePROC1
  PPID 1  /usr/lib/systemd/system

    ********* SCHEDULER ACTIVITY REPORT ********
    RunTime    :  1.240211  SysTime   :  0.300939   UserTime   :  0.939272
    SleepTime  : 18.566175  Sleep Cnt :     11321   Wakeup Cnt :       629
    RunQTime   :  0.182789  Switch Cnt:     11714   PreemptCnt :       393
    Last CPU   :        43  CPU Migrs :      4330   NODE Migrs :       858
    Policy     : SCHED_NORMAL     vss :  27916259          rss :      7205    

    busy   :      6.20%
      sys  :      1.51% 
      user :      4.70%
    runQ   :      0.91%
    sleep  :     92.88%
 
    Kernel Functions calling sleep() - Top 20 Functions
       Count     Pct    SlpTime    Slp% TotalTime%   Msec/Slp   MaxMsecs  Func
         806   7.12%     5.9461  32.03%     29.75%      7.377     80.777  sleep_on_page
        1669  14.74%     5.5614  29.95%     27.82%      3.332     75.438  do_blockdev_direct_IO
        6728  59.43%     3.9617  21.34%     19.82%      0.589    159.637  sk_wait_data
        1958  17.30%     2.7493  14.81%     13.75%      1.404    139.797  poll_schedule_timeout
         126   1.11%     0.2368   1.28%      1.18%      1.879     73.157  read_events
          32   0.28%     0.1013   0.55%      0.51%      3.167     10.763  __mutex_lock_slowpath
           1   0.01%     0.0095   0.05%      0.05%      9.541      9.541  sleep_on_page_killable
 
    Process Sleep stack traces (sort by % of total wait time) - Top 20 stack traces
       count    wpct      avg   Stack trace
                  %     msecs
    ===============================================================
        1668  29.90     3.328   do_blockdev_direct_IO  __blockdev_direct_IO  blkdev_direct_IO  generic_file_aio_read 
    do_sync_read  vfs_read  sys_pread64  tracesys  |  __pread_nocancel  ksfd_skgfqio  ksfd_io  ksfdread  kcfrbd1
    kcbzib  kcbgtcr
        6621  21.33     0.598   sk_wait_data  tcp_recvmsg  inet_recvmsg  sock_aio_read.part.7  sock_aio_read
    do_sync_read  vfs_read  sys_read  tracesys  |  __read_nocancel  nttfprd  nsbasic_brc  nsbrecv  nioqrc  opikndf2
        1907  14.16     1.379   poll_schedule_timeout  do_sys_poll  sys_poll  tracesys  |  __poll_nocancel 
    sskgxp_selectex  skgxpiwait  skgxpwaiti  skgxpwait  ksxpwait  ksliwat  kslwaitctx  ksxprcv_int  ksxprcvimdwctx
    kclwcrs
         100   4.49     8.332   sleep_on_page  __wait_on_bit  wait_on_page_bit  __migration_entry_wait.isra.37
    migration_entry_wait  handle_mm_fault  __do_page_fault  do_page_fault  page_fault  |  lxeg2u  ldxdts  evadis
    evaopn2  qerixGetKey  qerixStart
          84   3.53     7.805   sleep_on_page  __wait_on_bit  wait_on_page_bit  __migration_entry_wait.isra.37 
    migration_entry_wait  handle_mm_fault  __do_page_fault  do_page_fault  page_fault  |  lxeg2u  ldxdts  evadis
    evaopn2  qerixGetKey  qerixStart
          82   3.45     7.802   sleep_on_page  __wait_on_bit  wait_on_page_bit  __migration_entry_wait.isra.37
    migration_entry_wait  handle_mm_fault  __do_page_fault  do_page_fault  page_fault  |  ttcpip  opitsk  opiino
          62   2.41     7.218   sleep_on_page  __wait_on_bit  wait_on_page_bit  __migration_entry_wait.isra.37
    migration_entry_wait  handle_mm_fault  __do_page_fault  do_page_fault  page_fault  |  kpobii  kpobav  opibvg
    opiexe  opiefn  opiodr
          44   1.28     5.396   sleep_on_page  __wait_on_bit  wait_on_page_bit  __migration_entry_wait.isra.37
    migration_entry_wait  handle_mm_fault  __do_page_fault  do_page_fault  page_fault  |  opitsk  opiino
...

Meanwhile, the kiprof CPU profile also shows time being spent in the page migration code:

non-idle GLOBAL HARDCLOCK STACK TRACES (sort by count):

   Count     Pct  Stack trace
============================================================
     416   0.82%  get_gendisk  blkdev_get  raw_open  chrdev_open  do_dentry_open  finish_open  do_last  path_openat 
do_filp_open  do_sys_open  sys_open  tracesys
     382   0.75%  __blk_run_queue  __elv_add_request  blk_insert_cloned_request  dm_dispatch_request  dm_request_fn 
__blk_run_queue  queue_unplugged  blk_flush_plug_list  blk_finish_plug  do_blockdev_direct_IO  __blockdev_direct_IO
blkdev_direct_IO  generic_file_aio_read  do_sync_read  vfs_read  sys_pread64
     379   0.74%  remove_migration_pte  rmap_walk  migrate_pages  migrate_misplaced_page  do_numa_page handle_mm_fault
 __do_page_fault  do_page_fault  page_fault
     362   0.71%  __page_check_address  try_to_unmap_one  try_to_unmap_file  try_to_unmap  migrate_pages 
 migrate_misplaced_page  do_numa_page  handle_mm_fault  __do_page_fault  do_page_fault  page_fault
     344   0.67%  __mutex_lock_slowpath  mutex_lock  try_to_unmap_file  try_to_unmap  migrate_pages
 migrate_misplaced_page  do_numa_page  handle_mm_fault  __do_page_fault  do_page_fault  page_fault
     304   0.60%  syscall_trace_leave  int_check_syscall_exit_work
     269   0.53%  syscall_trace_enter  tracesys
     257   0.50%  __schedule  schedule  schedule_timeout  sk_wait_data  tcp_recvmsg  inet_recvmsg sock_aio_read.part.7
 sock_aio_read  do_sync_read  vfs_read  sys_read  tracesys
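
To corroborate that these samples come from automatic NUMA balancing rather than some other source of page migration, the NUMA balancing counters in /proc/vmstat can be sampled while the workload runs. This is only a hedged example; the numa_* counters are present only when the kernel is built with CONFIG_NUMA_BALANCING, as the RHEL 7 kernel is:

# grep -E 'numa_(pte_updates|hint_faults|hint_faults_local|pages_migrated)' /proc/vmstat
# sleep 10
# grep -E 'numa_(pte_updates|hint_faults|hint_faults_local|pages_migrated)' /proc/vmstat

Rapidly increasing numa_hint_faults and numa_pages_migrated counts between the two samples indicate that the NUMA balancing code is actively unmapping and migrating pages during the run.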

Root Cause

In RHEL 7, automatic NUMA balancing was added to make the Linux kernel more NUMA aware: the kernel periodically migrates a task's pages to the NUMA node where the task is currently running. However, given the size of the Oracle SGA and the frequency with which the Oracle tasks are moved between NUMA nodes, this page migration becomes costly, and the faulting threads must sleep until each migration completes.
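
For perspective on how much memory is involved, the per-node spread of the Oracle processes and shared memory can be inspected with numastat from the numactl package. This is only an illustration; PID 150089 is the example Oracle shadow process from the report above:

# numastat -p 150089
# numastat -m

The first command shows the per-node memory breakdown for a single process; the second shows a /proc/meminfo-style breakdown for each NUMA node, including shared memory such as the SGA. Any portion of that memory that is remote to the node where a task runs is a candidate for migration by the NUMA balancing code.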

Solution

The new NUMA page migration code can be disabled by setting the kernel parameter kernel.numa_balancing to 0. Both of the commands below must be run as root:

# sysctl -w kernel.numa_balancing=0

or

# echo 0 > /proc/sys/kernel/numa_balancing
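
The setting above takes effect immediately but does not persist across a reboot. A minimal sketch of making it permanent, assuming the standard /etc/sysctl.d mechanism on RHEL 7 (the file name 99-numa-balancing.conf is only an example):

# echo "kernel.numa_balancing = 0" >> /etc/sysctl.d/99-numa-balancing.conf
# sysctl -p /etc/sysctl.d/99-numa-balancing.conf
# cat /proc/sys/kernel/numa_balancing

The second command applies the file to the running kernel, and the third verifies the current value (0 means automatic NUMA balancing is disabled).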