Poor DirectIO Reads - HewlettPackard/LinuxKI GitHub Wiki

LinuxKI Warning

Poor Direct I/O read performance using Oracle with XFS on RHEL 7.1
Date: 03/01/2016

Problem

In a TPC-C benchmark test on a Superdome X running RHEL 7.1 with multiple Oracle databases installed, two of the eight databases were experiencing poor performance, while the remaining six were performing as expected. The AWR reports showed severe read performance issues on the affected databases, even though the data resided on the same 3PAR storage array.

Top 10 Foreground Events by Total Wait Time

Event                           Waits    Total Wait Time (sec)   Wait Avg(ms)   % DB time   Wait Class
db file scattered read        126,247                    13.5K         107.05        48.0   User I/O
db file sequential read       128,698                    13.5K         104.90        48.0   User I/O
enq: TX - row lock contention   3,879                    869.1         224.05         3.1   Application
DB CPU                                                   322.3                        1.1

All of the databases resided on XFS filesystems.

Investigation

The LinuxKI Toolset was used to collect a 20-second KI dump of the system. The File Activity Report (kifile.*.txt) showed heavy mutex contention in __mutex_lock_slowpath() on one of the database files:

device: 0x0fd00000 node: 1716550918 fstype:      REG  filename:  /oracle_data/tpccdata.dbf
System Call Name     Count     Rate     ElpTime        Avg        Max    Errs    AvSz     KB/s
readv                10485    524.2  910.428597   0.086832   0.393425       0   37954  19431.0
   SLEEP             41089   2054.4  902.106968   0.021955
      Sleep Func     10237           875.541484   0.085527   0.387676  __mutex_lock_slowpath
      Sleep Func      9844             9.275561   0.000942   0.013671  rwsem_down_read_failed
      Sleep Func     10475             8.958206   0.000855   0.015064  rwsem_down_write_failed
      Sleep Func     10485             8.331718   0.000795   0.011906  do_blockdev_direct_IO
   RUNQ                                0.371456
   CPU                                 7.459641
pread64               8793    439.6  748.182554   0.085088   0.390384       0    8192   3517.2
   SLEEP             34473   1723.6  741.649100   0.021514
      Sleep Func      8710           720.760638   0.082751   0.382259  __mutex_lock_slowpath
      Sleep Func      8168             7.578238   0.000928   0.014525  rwsem_down_read_failed
      Sleep Func      8771             7.568110   0.000863   0.014681  rwsem_down_write_failed
      Sleep Func      8793             5.742113   0.000653   0.013306  do_blockdev_direct_IO
   RUNQ                                0.307918
   CPU                                 6.080786
lseek                10531    526.5    0.010822   0.000001   0.000029       0
   RUNQ                                0.000003 
   CPU                                 0.000020 
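
For reference, the report above comes from the LinuxKI toolset's usual collect-and-post-process pass. A hedged sketch of the commands (the script names and the /opt/linuxki install path are the toolset's typical defaults and may differ by version):

$ /opt/linuxki/runki          # collect a KI trace (about 20 seconds by default)
$ /opt/linuxki/kiall          # generate the reports, including kifile.*.txt and kipid.*.txt

The per-process detail shown next appears in the kipid report; a single PID can be pulled out of it with something like grep -A40 "PID 295368" kipid.*.txt.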

Subsequent analysis of one of the Oracle processes reading this file shows the top sleep stack traces:

PID 295368  oracleORCL
  PPID 255353  bash

    ********* SCHEDULER ACTIVITY REPORT ********
    RunTime    :  0.761588  SysTime   :  0.225303   UserTime   :  0.536284  
    SleepTime  : 19.006968  Sleep Cnt :      1268   Wakeup Cnt :       751
    RunQTime   :  0.006764  Switch Cnt:      1272   PreemptCnt :         4
    Last CPU   :       252  CPU Migrs :        63   NODE Migrs :         0
    Policy     : SCHED_NORMAL     vss :  44006798          rss :      6492

...

    Process Sleep stack traces (sort by % of total wait time) - Top 20 stack traces
      count    wpct      avg   Stack trace
                  %     msecs
    ===============================================================
          80  43.68   103.783   __mutex_lock_slowpath  mutex_lock  xfs_file_aio_read  do_sync_readv_writev 
do_readv_writev  vfs_readv  sys_readv  tracesys  |  sstk  ksfd_skgfqio  ksfd_skgfrvio  ksfd_vio  ksfdvread  kcfrbd1 
kcbz_table_scan_read
          41  18.58    86.132   __mutex_lock_slowpath  mutex_lock  xfs_file_aio_read  do_sync_read  vfs_read 
sys_pread64  tracesys  |  __pread_nocancel  ksfd_skgfqio  ksfd_io  ksfdread  kcfrbd1  kcbzib  kcbgtcr  ktrgtc2
          29  11.60    75.996   SYSC_semtimedop  sys_semtimedop  tracesys  |  semop  skgpwwait  ksliwat  kslwaitctx 
ksqcmi  ksqgtlctx  ksqgelctx  ktuGetTxForXid  ktcwit1  kdddgb  kdusru  updrowFastPath
          23   9.46    78.144   __mutex_lock_slowpath  mutex_lock  xfs_file_aio_read  do_sync_read  vfs_read 
sys_pread64  tracesys  |  __pread_nocancel  ksfd_skgfqio  ksfd_io  ksfdread  kcfrbd1  kcbzib  kcbgtcr  ktrget2
          19   6.98    69.874   __mutex_lock_slowpath  mutex_lock  xfs_file_aio_read  do_sync_read  vfs_read 
sys_pread64  tracesys  |  __pread_nocancel  ksfd_skgfqio  ksfd_io  ksfdread  kcfrbd1  kcbzibmlt  kcbzib  kcbgtcr

Note that in the stack traces above, mutex_lock() is called from xfs_file_aio_read(). One advantage of Linux is that the kernel source code is readily available, and while reading kernel code can seem intimidating, it is often not difficult to follow what is going on. Below is the RHEL 7.1 source for xfs_file_aio_read() where the locking is performed:

        /*
         * Locking is a bit tricky here. If we take an exclusive lock
         * for direct IO, we effectively serialise all new concurrent
         * read IO to this file and block it behind IO that is currently in
         * progress because IO in progress holds the IO lock shared. We only
         * need to hold the lock exclusive to blow away the page cache, so
         * only take lock exclusively if the page cache needs invalidation.
         * This allows the normal direct IO case of no page cache pages to
         * proceeed concurrently without serialisation.
         */
        xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
        if ((ioflags & IO_ISDIRECT) && inode->i_mapping->nrpages) {
                xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
                xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);

                if (inode->i_mapping->nrpages) {
                        ret = filemap_write_and_wait_range(
                                                        VFS_I(ip)->i_mapping,
                                                        pos, pos + size - 1);
                        if (ret) {
                                xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
                                return ret;
                        }

                        /*
                         * Invalidate whole pages. This can return an error if
                         * we fail to invalidate a page, but this should never
                         * happen on XFS. Warn if it does fail.
                         */
                        ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
                                        pos >> PAGE_CACHE_SHIFT,
                                        (pos + size - 1) >> PAGE_CACHE_SHIFT);
                        WARN_ON_ONCE(ret);
                        ret = 0;
                }
                xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
        }

In the above code, if the access is direct I/O (IO_ISDIRECT) and the file has pages in the page cache (inode->i_mapping->nrpages is non-zero), the shared lock (XFS_IOLOCK_SHARED) is dropped and the exclusive lock (XFS_IOLOCK_EXCL) is taken. Once the exclusive lock is held, the cached pages associated with the direct I/O are written back and invalidated. Note that not all of the file's pages are flushed, only the ones that overlap the direct I/O. Only if ALL of the pages eventually get pushed out of the page cache does performance return to normal.
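
As an illustration (not part of the original trace), the contention path above is only entered when the same file is touched both through the page cache and with direct I/O, so that inode->i_mapping->nrpages is non-zero when the direct read arrives. A minimal sketch of that mix using dd (the file name is simply the one from the trace above):

# Buffered read: leaves pages for the file in the page cache (nrpages becomes non-zero)
$ dd if=/oracle_data/tpccdata.dbf of=/dev/null bs=1M count=16

# Direct read of the same range: on the affected kernels it must drop the shared lock,
# take XFS_IOLOCK_EXCL, and write back and invalidate the overlapping cached pages
$ dd if=/oracle_data/tpccdata.dbf of=/dev/null bs=1M count=16 iflag=direct

In the Oracle case, any tool that reads the datafiles through the page cache (a backup or copy, for example) can leave such pages behind while the instance itself uses direct I/O.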

Linux versions impacted

In Linux 3.16 and earlier, when the cached pages were invalidated, they were invalidated from the position of the Direct I/O to the end of the file. Starting with Linux 3.17, only the pages overlapping the Direct I/O are invalidated. As a result, each new Direct I/O invalidates only a small portion of the cached pages, and the exclusive lock keeps being taken until the entire file has been pushed out of the cache, which may never happen.

In Linux version 4.3, when the first Direct I/O is performed, the entire file is forced out of the cache, thus avoiding the exclusive lock after the first Direct I/O.

RHEL 7.1 follows the 3.17 behavior (see also http://oss.sgi.com/archives/xfs/2015-08/msg00296.html). RHEL 7.0 follows the 3.16 behavior and can still suffer from the same issue. RHEL 7.2 follows the 4.3 behavior and does not exhibit the problem.

SLES 12 and earlier follow the 3.16 behavior.
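
Which behavior a given system exhibits therefore depends on the distribution kernel. A quick check (standard commands; for reference, the RHEL kernel releases are 3.10.0-229.* for 7.1 and 3.10.0-327.* for 7.2):

$ uname -r                   # running kernel release
$ cat /etc/redhat-release    # distribution release on RHEL systems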

Workaround

A quick way to push the database pages out of the cache is to drop all of the page cache pages (this must be done as root):

$ echo 3 > /proc/sys/vm/drop_caches

While this resolves the Direct I/O performance problem, note that ALL cached pages for every filesystem are pushed out of the cache. If other files are accessed through the page cache rather than with direct I/O, removing their pages may hurt their performance.
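
To verify whether a particular datafile still has pages in the page cache before or after dropping caches, a third-party tool such as vmtouch can report per-file residency (vmtouch is not part of the LinuxKI toolset or this write-up and must be installed separately):

$ vmtouch /oracle_data/tpccdata.dbf    # reports how many of the file's pages are resident in the page cache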

If you only want to impact the one filesystem in question, you will need to stop the database, unmount the filesystem, and mount it again.
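
For example (the mount point is the one from the trace above; the database instance must be shut down and the filesystem idle before it can be unmounted):

$ umount /oracle_data
$ mount /oracle_data     # assumes /oracle_data is defined in /etc/fstab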

Another workaround exists for Linux distributions that follow the 3.16 behavior (RHEL 7.0 and earlier, SLES 12 and earlier). Since the pages are flushed from the position of the Direct I/O operation to the end of the file, you can simply issue a single Direct I/O read at file offset 0. For example:

$ dd if=<filename> of=/dev/null bs=4k count=1 iflag=direct
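
If several datafiles are affected, the same read can be issued against each one; a hedged sketch (the glob is illustrative):

$ for f in /oracle_data/*.dbf; do dd if="$f" of=/dev/null bs=4k count=1 iflag=direct; done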

Solution

To resolve the problem, upgrade to RHEL 7.2 or to a Linux kernel that follows the version 4.3 or later behavior.
