Unaligned Direct IO
LinuxKI Warning
Poor Oracle Logwriter performance using XFS
Date: 03/01/2016
Problem
The customer was running an SAP/Oracle application and noticed that the Log File Sync wait was one of the top wait events, and wanted to know whether it could be improved. A KI dump was collected with the LinuxKI Toolset's runki script and analyzed by HPE.
The customer was running RHEL 7.1, and the database was configured with XFS as the filesystem.
Investigation
When analyzing the Oracle logwriter (lgwr) task, the logwriter was observed spending significant time in the io_submit() system call, sleeping in inode_dio_wait():
PID 6108 ora_lgwr_PROD1
********* SCHEDULER ACTIVITY REPORT ********
RunTime : 1.646096 SysTime : 1.092170 UserTime : 0.553926
SleepTime : 15.103984 Sleep Cnt : 19145 Wakeup Cnt : 33631
RunQTime : 3.249784 Switch Cnt: 19293 PreemptCnt : 148
Last CPU : 36 CPU Migrs : 126 NODE Migrs : 0
Policy : SCHED_NORMAL vss : 108988353 rss : 4716
...
******** SYSTEM CALL REPORT ********
System Call Name Count Rate ElpTime Avg Max Errs AvSz KB/s
io_submit 4593 229.6 11.602889 0.002526 0.052115 0
SLEEP 10755 537.7 8.972124 0.000834
Sleep Func 10754 8.972124 0.000834 0.013605 inode_dio_wait
RUNQ 1.902624
CPU 0.729568
io_getevents 4593 229.6 5.003639 0.001089 0.016506 0
SLEEP 6723 336.1 3.938404 0.000586
Sleep Func 6722 3.938395 0.000586 0.009174 read_events
Sleep Func 1 0.000009 0.000009 0.000009 mutex_lock
RUNQ 1.005667
CPU 0.059568
...
******** PHYSICAL DEVICE REPORT ********
device rw avque avinflt io/s Kb/s avsz avwait avserv tot seq rnd reque flush maxwait maxserv
0x00800010 /dev/sdb
0x00800010 r 0.50 0.00 0 1 5 0.00 0.25 3 0 3 0 0 0.0 0.3
0x00800010 w 0.55 0.05 718 108100 150 0.01 0.95 14352 14328 25 0 0 1.0 13.6
0x00800010 t 0.55 0.05 718 108101 150 0.01 0.95 14355 14328 28 0 0 1.0 13.6
0x008000c0 /dev/sdm
0x008000c0 r 0.50 0.00 1 10 13 0.00 0.25 15 0 15 0 0 0.0 0.3
0x008000c0 w 0.55 0.05 717 108100 150 0.01 0.92 14350 14326 25 0 0 1.4 12.4
0x008000c0 t 0.55 0.05 718 108111 150 0.01 0.91 14365 14326 40 0 0 1.4 12.4
Ideally, you want io_submit() to be asynchronous, so all of the sleeping should be done in io_getevents() waiting for the I/Os to complete. Notice also that the physical I/Os are fast (average service times under 1 millisecond), yet the io_submit() system calls take an average of 2.5 milliseconds, in addition to the waits in io_getevents() for the I/Os to complete.
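For reference, the expected native AIO pattern looks like the sketch below: io_submit() merely queues the requests, and any waiting for completion happens in io_getevents(). This is a minimal illustration only; the file name, buffer size, and error handling are placeholders, not Oracle's actual code.

/* Minimal Linux native AIO sketch (compile with: gcc aio_sketch.c -laio) */
#define _GNU_SOURCE
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    void *buf;

    /* Placeholder file; O_DIRECT mirrors how the logwriter opens redo logs */
    int fd = open("/tmp/aio-test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0 || io_setup(1, &ctx) < 0)
        return 1;

    /* O_DIRECT buffers must be aligned; 4096 matches the default XFS block size */
    if (posix_memalign(&buf, 4096, 4096))
        return 1;
    memset(buf, 0, 4096);

    /* Queue one aligned 4 KB write. Ideally io_submit() returns as soon as
     * the request is queued... */
    io_prep_pwrite(&cb, fd, buf, 4096, 0);
    if (io_submit(ctx, 1, cbs) != 1)
        return 1;

    /* ...and all of the waiting happens here, in io_getevents() */
    if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
        return 1;

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}

In this incident the pattern is inverted: most of the sleep time is charged to io_submit() itself.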
The top sleep stack traces show the code path for the io_submit() calls:
Process Sleep stack traces (sort by % of total wait time) - Top 20 stack traces
count wpct avg Stack trace
% msecs
===============================================================
10752 59.38 0.834 inode_dio_wait xfs_file_dio_aio_write xfs_file_aio_write do_io_submit
sys_io_submit tracesys | (null) skgfqio ksfd_skgfqio ksfdgo ksfd_sbio ksfdbio kcrfw_do_write
kcrfw_redo_write ksbabs
6705 26.04 0.587 read_events sys_io_getevents tracesys | (null) skgfospo skgfrwat ksfdwtio
ksfd_sbio ksfdbio kcrfw_do_write kcrfw_redo_write ksbabs ksbrdp opirip opidrv
So inode_dio_wait() is called from xfs_file_dio_aio_write(). The next step is to look at the RHEL 7.1 source for xfs_file_dio_aio_write() (in fs/xfs/xfs_file.c) to understand why inode_dio_wait() is called:
    /*
     * If we are doing unaligned IO, wait for all other IO to drain,
     * otherwise demote the lock if we had to flush cached pages
     */
    if (unaligned_io)
        inode_dio_wait(inode);
    else if (iolock == XFS_IOLOCK_EXCL) {
        xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
        iolock = XFS_IOLOCK_SHARED;
    }
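For context, unaligned_io is set earlier in the same function when the write's starting or ending offset is not a multiple of the filesystem block size. The sketch below shows the idea of that test; the names are illustrative (the kernel uses the mount's block mask), not the exact RHEL 7.1 code.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative: a direct write is "unaligned" for XFS when its start or end
 * offset is not a multiple of the filesystem block size */
static bool is_unaligned_io(uint64_t pos, uint64_t count, uint64_t blocksize)
{
    uint64_t blockmask = blocksize - 1;   /* e.g., 0xfff for 4 KB blocks */
    return (pos & blockmask) || ((pos + count) & blockmask);
}

With a 4 KB block size, a 512-byte-aligned redo write fails this test and takes the inode_dio_wait() path; with a 512-byte block size, it passes.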
So, inode_dio_wait() is called because the I/O is unaligned. By default, XFS uses a 4KB filesystem block size (see the mkfs.xfs manpage). Looking at the io_submit() system calls in the KI ASCII trace for the lgwr task, you can see that the I/Os are aligned on 512-byte boundaries, but not on 4KB boundaries:
$ grep "pid=6108" ki.1220_1157 | grep " io_submit " | more
0.000482 cpu=1 pid=6108 tgid=6108 io_submit [209] entry ctx_id=0x7f6fc4c6d000 nr=6 iocbpp=0x7fff96f0c040
IOCB[0]: iocbp=0x7f6fc0973e38 fd=277 op=1 bytes=140288 offset=0x57f4ce00
IOCB[1]: iocbp=0x7f6fc09740b0 fd=278 op=1 bytes=140288 offset=0x57f4ce00
IOCB[2]: iocbp=0x7f6fc0974328 fd=277 op=1 bytes=108544 offset=0x57f6f200
IOCB[3]: iocbp=0x7f6fc09745a0 fd=278 op=1 bytes=108544 offset=0x57f6f200
IOCB[4]: iocbp=0x7f6fc0974818 fd=277 op=1 bytes=136192 offset=0x57f89a00
IOCB[5]: iocbp=0x7f6fc0974a90 fd=278 op=1 bytes=136192 offset=0x57f89a00
0.003240 cpu=1 pid=6108 tgid=6108 io_submit [209] ret=0x6 syscallbeg= 0.002758 ctx_id=0x7f6fc4c6d000 nr=6 iocbpp=0x7fff96f0c040
0.004548 cpu=1 pid=6108 tgid=6108 io_submit [209] entry ctx_id=0x7f6fc4c6d000 nr=6 iocbpp=0x7fff96f0c040
IOCB[0]: iocbp=0x7f6fc0973e38 fd=277 op=1 bytes=426496 offset=0x57faae00
IOCB[1]: iocbp=0x7f6fc09740b0 fd=278 op=1 bytes=426496 offset=0x57faae00
IOCB[2]: iocbp=0x7f6fc0974328 fd=277 op=1 bytes=440320 offset=0x58013000
IOCB[3]: iocbp=0x7f6fc09745a0 fd=278 op=1 bytes=440320 offset=0x58013000
IOCB[4]: iocbp=0x7f6fc0974818 fd=277 op=1 bytes=312832 offset=0x5807e800
IOCB[5]: iocbp=0x7f6fc0974a90 fd=278 op=1 bytes=312832 offset=0x5807e800
0.008862 cpu=1 pid=6108 tgid=6108 io_submit [209] ret=0x6 syscallbeg= 0.004314 ctx_id=0x7f6fc4c6d000 nr=6 iocbpp=0x7fff96f0c040
Note that the byte counts are evenly divisible by 512, but not by 4096. For example, 312832 / 512 = 611, while 312832 / 4096 = 76.375. Since the filesystem block size is 4096 bytes, these I/Os are treated as unaligned.
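To double-check, the short program below simply applies the modulo arithmetic to the sizes and the first submit's offsets captured in the trace above:

#include <stdio.h>

int main(void)
{
    /* I/O sizes and file offsets copied from the io_submit trace above */
    unsigned long long sizes[]   = { 140288, 108544, 136192,
                                     426496, 440320, 312832 };
    unsigned long long offsets[] = { 0x57f4ce00ULL, 0x57f6f200ULL,
                                     0x57f89a00ULL };

    for (int i = 0; i < 6; i++)
        printf("size %7llu: mod 512 = %llu, mod 4096 = %llu\n",
               sizes[i], sizes[i] % 512, sizes[i] % 4096);
    for (int i = 0; i < 3; i++)
        printf("offset 0x%llx: mod 512 = %llu, mod 4096 = %llu\n",
               offsets[i], offsets[i] % 512, offsets[i] % 4096);
    return 0;
}

Each of these sizes and offsets is a multiple of 512, and none is a multiple of 4096, so every one of these writes triggers the unaligned path.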
Solution
If the Oracle redo log files reside on an XFS filesystem, use a 512-byte filesystem block size for best results; for example, create the filesystem with mkfs.xfs -b size=512 so that the logwriter's 512-byte-aligned writes are no longer treated as unaligned direct I/O.