Unaligned Direct IO - HewlettPackard/LinuxKI GitHub Wiki

LinuxKI Warning

Poor Oracle Logwriter performance using XFS
Date: 03/01/2016

Problem

Customer was using an SAP/Oracle application and noticed that Log File Sync was one of the top wait events, and wanted to know if this could be improved. The LinuxKI Toolset was used to take a KI dump via the runki script, and the dump was analyzed by HPE.

Customer was using RHEL 7.1 and the database was configured using XFS as the filesystem.

Investigation

When analyzing the Oracle logwriter (lgwr) task, the logwriter was observed spending significant time in the io_submit() system call, sleeping in inode_dio_wait():

PID 6108  ora_lgwr_PROD1

    ********* SCHEDULER ACTIVITY REPORT ********
    RunTime    :  1.646096  SysTime   :  1.092170   UserTime   :  0.553926
    SleepTime  : 15.103984  Sleep Cnt :     19145   Wakeup Cnt :     33631
    RunQTime   :  3.249784  Switch Cnt:     19293   PreemptCnt :       148
    Last CPU   :        36  CPU Migrs :       126   NODE Migrs :         0
    Policy     : SCHED_NORMAL     vss : 108988353          rss :      4716

...

    ******** SYSTEM CALL REPORT ********
    System Call Name     Count     Rate     ElpTime        Avg        Max    Errs    AvSz     KB/s
    io_submit             4593    229.6   11.602889   0.002526   0.052115       0
       SLEEP             10755    537.7    8.972124   0.000834
          Sleep Func     10754             8.972124   0.000834   0.013605  inode_dio_wait
       RUNQ                                1.902624
       CPU                                 0.729568
    io_getevents          4593    229.6    5.003639   0.001089   0.016506       0
       SLEEP              6723    336.1    3.938404   0.000586
          Sleep Func      6722             3.938395   0.000586   0.009174  read_events
          Sleep Func         1             0.000009   0.000009   0.000009  mutex_lock
       RUNQ                                1.005667
       CPU                                 0.059568
...

    ******** PHYSICAL DEVICE REPORT ********
        device   rw  avque avinflt   io/s   Kb/s  avsz   avwait   avserv    tot    seq    rnd  reque  flush maxwait maxserv
    0x00800010  /dev/sdb
      0x00800010  r   0.50    0.00      0      1     5     0.00     0.25      3      0      3      0      0     0.0     0.3
      0x00800010  w   0.55    0.05    718 108100   150     0.01     0.95  14352  14328     25      0      0     1.0    13.6
      0x00800010  t   0.55    0.05    718 108101   150     0.01     0.95  14355  14328     28      0      0     1.0    13.6
    0x008000c0  /dev/sdm
      0x008000c0  r   0.50    0.00      1     10    13     0.00     0.25     15      0     15      0      0     0.0     0.3
      0x008000c0  w   0.55    0.05    717 108100   150     0.01     0.92  14350  14326     25      0      0     1.4    12.4
      0x008000c0  t   0.55    0.05    718 108111   150     0.01     0.91  14365  14326     40      0      0     1.4    12.4

Ideally, io_submit() should be asynchronous, so all of the sleeping should occur in io_getevents() while waiting for the I/Os to complete. Notice also that the physical I/Os are fast (less than 1 millisecond), yet the io_submit() system calls take an average of 2.53 milliseconds, on top of the waits in io_getevents() for the I/O to complete.

The top sleep stack traces show the code path for the io_submit() calls:

 Process Sleep stack traces (sort by % of total wait time) - Top 20 stack traces
   count    wpct      avg   Stack trace
              %     msecs
===============================================================
   10752  59.38     0.834   inode_dio_wait  xfs_file_dio_aio_write  xfs_file_aio_write  do_io_submit 
sys_io_submit  tracesys  |  (null)  skgfqio  ksfd_skgfqio  ksfdgo  ksfd_sbio  ksfdbio  kcrfw_do_write
kcrfw_redo_write  ksbabs
    6705  26.04     0.587   read_events  sys_io_getevents  tracesys  |  (null)  skgfospo  skgfrwat  ksfdwtio 
ksfd_sbio  ksfdbio  kcrfw_do_write  kcrfw_redo_write  ksbabs  ksbrdp  opirip  opidrv

So inode_dio_wait() is called from xfs_file_dio_aio_write(). The next step is to look at the RHEL 7.1 source code for xfs_file_dio_aio_write() to understand why inode_dio_wait() is called:

0709         /*
0710          * If we are doing unaligned IO, wait for all other IO to drain,
0711          * otherwise demote the lock if we had to flush cached pages
0712          */
0713         if (unaligned_io)
0714                 inode_dio_wait(inode);
0715         else if (iolock == XFS_IOLOCK_EXCL) {
0716                 xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
0717                 iolock = XFS_IOLOCK_SHARED;
0718         }

So, inode_dio_wait() is called because the I/O is unaligned. By default, XFS uses a 4KB filesystem block size (see the mkfs.xfs manpage). Looking at the io_submit() system calls in the KI ASCII trace for the lgwr task, you can see that the I/Os are aligned only on 512-byte boundaries, not on the 4KB filesystem block size:

$ grep "pid=6108" ki.1220_1157 | grep " io_submit " | more
    0.000482 cpu=1 pid=6108 tgid=6108 io_submit [209] entry ctx_id=0x7f6fc4c6d000 nr=6 iocbpp=0x7fff96f0c040
IOCB[0]: iocbp=0x7f6fc0973e38 fd=277 op=1 bytes=140288 offset=0x57f4ce00 IOCB[1]: iocbp=0x7f6fc09740b0 fd=278 op=1 
bytes=140288 offset=0x57f4ce00 IOCB[2]: iocbp=0x7f6fc0974328 fd=277 op=1 bytes=108544 offset=0x57f6f200 
IOCB[3]: iocbp=0x7f6fc09745a0 fd=278 op=1 bytes=108544 offset=0x57f6f200 IOCB[4]: iocbp=0x7f6fc0974818 fd=277 op=1
bytes=136192 offset=0x57f89a00 IOCB[5]: iocbp=0x7f6fc0974a90 fd=278 op=1 bytes=136192 offset=0x57f89a00
    0.003240 cpu=1 pid=6108 tgid=6108 io_submit [209] ret=0x6 syscallbeg=    0.002758 ctx_id=0x7f6fc4c6d000 nr=6 
iocbpp=0x7fff96f0c040
    0.004548 cpu=1 pid=6108 tgid=6108 io_submit [209] entry ctx_id=0x7f6fc4c6d000 nr=6 iocbpp=0x7fff96f0c040 
IOCB[0]: iocbp=0x7f6fc0973e38 fd=277 op=1 bytes=426496 offset=0x57faae00 IOCB[1]: iocbp=0x7f6fc09740b0 fd=278 op=1 
bytes=426496 offset=0x57faae00 IOCB[2]: iocbp=0x7f6fc0974328 fd=277 op=1 bytes=440320 offset=0x58013000
IOCB[3]: iocbp=0x7f6fc09745a0 fd=278 op=1 bytes=440320 offset=0x58013000 IOCB[4]: iocbp=0x7f6fc0974818 fd=277 op=1
bytes=312832 offset=0x5807e800 IOCB[5]: iocbp=0x7f6fc0974a90 fd=278 op=1 bytes=312832 offset=0x5807e800
    0.008862 cpu=1 pid=6108 tgid=6108 io_submit [209] ret=0x6 syscallbeg=    0.004314 ctx_id=0x7f6fc4c6d000 nr=6
iocbpp=0x7fff96f0c040

Note that the byte counts are evenly divisible by 512 but not by 4096, the XFS filesystem block size. For example, 312832 / 512 = 611, an odd multiple, so the count is not even divisible by 1024, let alone 4096.

Solution

If the Oracle redo log files reside on an XFS filesystem, use a 512-byte filesystem block size for that filesystem for best results, since the redo writes are issued in multiples of 512 bytes.
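A sketch of how to verify and apply this; the mount point /oralog and device /dev/sdX are placeholders for the actual redo log filesystem, and mkfs.xfs destroys all existing data on the device:

```shell
# Show the current filesystem block size (bsize) of the log filesystem
xfs_info /oralog

# Recreate the filesystem with a 512-byte block size
# (destructive: back up the redo logs and unmount first)
mkfs.xfs -f -b size=512 /dev/sdX
```

Note that newer xfsprogs releases enable metadata CRCs by default, which may require a larger minimum block size; consult the mkfs.xfs manpage for your release before reformatting.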