Barrier Writes - HewlettPackard/LinuxKI GitHub Wiki

LinuxKI Warning

Performance impact of barrier writes on Oracle DB writers using Oracle ASM

Date: 02/27/2013 updated: 11/06/2014

Problem

When using Oracle Automatic Storage Management (ASM), Oracle accesses the block devices directly instead of going through a file system. When Oracle opens the block devices, it sets the O_SYNC flag to ensure that writes are synchronous. The O_SYNC flag can then cause Linux to issue a barrier write to the block device for each physical write request sent to the device. These barrier writes can delay the Oracle DB writers in the io_submit() system call and can result in "Free buffer busy" waits in Oracle. Barrier writes can impact the Oracle Redo Logwriter as well.

One of the purposes of this article is to show how the LinuxKI Toolset can be used to troubleshoot performance issues on Linux systems.

Background

The purpose of barrier writes is to ensure the on-disk consistency and ordering of critical data. Barrier writes are implemented using the SYNCHRONIZE_CACHE SCSI command, which forces the disk to flush its internal cache to the physical spindles so that a power outage does not cause any loss of data. However, on more intelligent disk arrays with battery-backed (non-volatile) cache, barrier writes are typically not needed.

Whether or not the SYNCHRONIZE_CACHE command is issued depends on how the storage array advertises itself. If the array advertises as "write cache disabled" or "write through cache", the SYNCHRONIZE_CACHE command is not issued. Note that some disk arrays with non-volatile caches will advertise as "write cache disabled" even though they do indeed have a write cache. However, if the array advertises as "write cache enabled" or "writeback cache", the SYNCHRONIZE_CACHE command is sent for synchronous writes to the device. The write cache setting can typically be seen during bootup, for example:

sd 7:0:5:0: [sdo] Write cache: enabled, read cache: enabled, supports DPO and FUA
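Besides the boot message, each disk's advertised setting can be read from sysfs at runtime. A minimal sketch (the /sys/class/scsi_disk layout is standard on recent RHEL kernels, but which disks are present varies per system):

```shell
# List the advertised cache type for every SCSI disk visible in sysfs.
found=0
for f in /sys/class/scsi_disk/*/cache_type; do
    [ -e "$f" ] || continue        # glob stays literal when no SCSI disks exist
    found=1
    printf '%s -> %s\n' "${f%/cache_type}" "$(cat "$f")"
done
[ "$found" -eq 1 ] || echo "no SCSI disks visible"
```

A device reporting "write back" is a candidate for the SYNCHRONIZE_CACHE behavior described above.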

The LinuxKI Toolset can show the barrier write and the potential impact it can have on performance. In the LinuxKI trace data, a barrier write sequence looks similar to the following...

0.018948 cpu=1 pid=9999 block_rq_insert dev_t=0x04100030 wr=write flags=SYNC|**FLUSH sector=0x0 len=0**
0.018949 cpu=1 pid=9999 block_rq_issue dev_t=0x04100030 wr=write flags=SYNC|**FLUSH sector=0xffffffffffffffff len=0**
0.019129 cpu=18 pid=13147 block_rq_complete dev_t=0x04100030 wr=write flags=SYNC **sector=0x0 len=0** qpid=0 spid=0 qtm= 0.000000 svtm= 0.000000

Note that the block_rq_insert and block_rq_issue records have the FLUSH flag set. The sector will be either 0x0 or 0xffffffffffffffff, and the length will be 0. The block_rq_complete record is similar, but the FLUSH flag may be cleared. Also note that the service time (svtm) shows 0.000000. Unfortunately, kiinfo cannot track the queue time (qtm) and service time (svtm) because a valid sector and length are not used. However, the time difference between the block_rq_complete and block_rq_issue records is the actual service time (0.019129 - 0.018949 = 0.000180 seconds). While the barrier write is fast, it is overhead and it is synchronous.
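The issue-to-complete arithmetic can be automated with a small awk filter. An illustrative sketch, run here against the sample records above (in practice you would pipe kiinfo -kitrace output into the same awk program):

```shell
# Compute barrier-write service time as block_rq_complete minus block_rq_issue.
# Sample records are inlined; real usage pipes kiinfo -kitrace output instead.
awk '/block_rq_issue/    { issue = $1 }
     /block_rq_complete/ { printf "%.6f\n", $1 - issue }' <<'EOF'
0.018949 cpu=1 pid=9999 block_rq_issue dev_t=0x04100030 wr=write flags=SYNC|FLUSH sector=0xffffffffffffffff len=0
0.019129 cpu=18 pid=13147 block_rq_complete dev_t=0x04100030 wr=write flags=SYNC sector=0x0 len=0
EOF
```

This prints 0.000180, matching the hand calculation above.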

Investigation

So how do the barrier writes impact the Oracle DB writers? First, you need to understand how Oracle issues asynchronous I/O when the Oracle parameter DISK_ASYNCH_IO is set to TRUE. Oracle manages the I/O using the asynchronous I/O system calls io_submit() and io_getevents(). The io_submit() system call submits one or more I/Os in parallel, and io_getevents() is used later to read the completed asynchronous I/O events from the completion queue, waiting if needed.

During recent internal Proof-of-Concept testing, a LinuxKI trace sample was taken with the "runki -p" command. Analysis of one of the Oracle DB writer processes using kiinfo -kipid showed the following:

PID 11771  ora_dbw0_proddb
  PPID 1  /sbin/init

********* SCHEDULER ACTIVITY REPORT ********
RunTime    :  5.438581  SysTime   :  3.748545   UserTime   :  1.690036
SleepTime  :  5.243795  Sleep Cnt :     33225   Wakeup Cnt :      2523
RunQTime   :  9.281119  Switch Cnt:     39427   PreemptCnt :      6202
Last CPU   :        12  CPU Migrs :      8887   NODE Migrs :       673
Policy     : SCHED_NORMAL

******** SYSTEM CALL REPORT ********
System Call Name     Count     Rate     ElpTime        Avg        Max Errs
io_submit               73      3.7   16.157268   **0.221332**   0.382323    0
   SLEEP             32647   1635.2    5.039375   0.000154
      **Sleep Func     32647             5.039375   0.000154   0.006489  wait_for_completion**
   RUNQ                                7.609908
   CPU                                 3.591303
semctl                2534    126.9    0.656791   0.000259   0.023531    0
   RUNQ                                0.547052
   CPU                                 0.109739
semop                  137      6.9    0.226927   0.001656   0.021632    0
   SLEEP               107      5.4    0.133690   0.001249
      Sleep Func       107             0.133690   0.001249   0.021362  sys_semtimedop
   RUNQ                                0.090507
   CPU                                 0.002730
io_getevents           266     13.3    0.003781   0.000014   0.000171    0
   RUNQ                                0.000105
   CPU                                 0.003676
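As a quick sanity check of the report's Avg column, the average can be recomputed from the ElpTime and Count columns:

```shell
# io_submit average = total elapsed time / call count (values from the report above)
awk 'BEGIN { printf "%.6f\n", 16.157268 / 73 }'
```

This prints 0.221332, matching the reported average for io_submit.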

While there is certainly CPU pressure, as the RunQTime is high, notice that most of the sleep time is spent in the io_submit() system call in the kernel function wait_for_completion(). This is unusual, since the io_submit() system call is supposed to submit I/O asynchronously without blocking. Note that the average time for each io_submit() call is 0.221332 seconds. Typically, a process sleeps in io_getevents() if the I/O has not completed, but in the example above, io_getevents() always returns without sleeping. To understand why the process is sleeping in io_submit(), we can examine the individual KI trace records from kiinfo -kitrace:

Command line: kiinfo -kitrace pid=11771,sysenter,sysargs -ts 1204_0546
...
    0.274873 cpu=17 pid=11771 io_submit [209] entry ctx_id=0x7f32e270c000 nr=531 iocbpp=0x7fffa7931c80
    0.274905 cpu=17 pid=11771 block_rq_insert dev_t=0x00800020 wr=write flags=ELVPRIV|SYNC|ALLOCED sector=0x565e00 len=8192
    0.274910 cpu=17 pid=11771 block_rq_issue dev_t=0x00800020 wr=write flags=SORTED|ELVPRIV|SYNC|ALLOCED sector=0x565e00 len=8192
    **0.274979 cpu=17 pid=11771 block_rq_insert dev_t=0x00800020 wr=write flags=SYNC|ALLOCED|NOIDLE|FLUSH sector=0x0 len=0**
    **0.274981 cpu=17 pid=11771 block_rq_issue dev_t=0x00800020 wr=write flags=SYNC|NOIDLE|FLUSH sector=0xffffffffffffffff len=0**
    0.275061 cpu=17 pid=11771 sched_switch prio=120 state=SSLEEP next_pid=3410 next_prio=120 next_tgid=n/a policy=n/a
        STACKTRACE:
            wait_for_completion+0x1d
            __blkev_issue_flush+0xad
            blkdev_issue_flush+0x16
            blkdev_fsync+0x3f
            vfs_fsync_range+0xa1
            generic_write_sync+0x4b
            blkdev_aio_write+0x7e
            aio_rw_vect_retry+0x84
            aio_run_iocb+0x64
            do_io_submit+0x291
            sys_io_submit+0x10
            tracesys+0xd9
    0.275198 cpu=12 pid=-1 sched_wakeup target_pid=11771 prio=120 target_cpu=17 success=1

Note that the process issued two I/Os: the first for the user data, and the second for the barrier write. After issuing the barrier write, the process goes to sleep waiting for the barrier write to complete. We can examine the trace records for CPU 12 to see what occurred just prior to the sched_wakeup listed above:

    0.275196 cpu=12 pid=-1 block_rq_complete dev_t=0x00800020 wr=write flags=SOFTBARRIER|SYNC|ALLOCED|NOIDLE sector=0x0 len=0 qpid=0 spid=0 qtm=  0.000000 svtm=   0.000000
   0.275198 cpu=12 pid=-1 sched_wakeup target_pid=11771 prio=120 target_cpu=17 success=1

Note that the wakeup occurred just after the completion of the barrier write. The DB writer is not waiting for the user data to be written to disk; it is waiting for the completion of the barrier write, which makes the io_submit() call take much longer.
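To gauge how many barrier writes a trace contains, the FLUSH-flagged block requests can simply be counted. A sketch using records from the trace above as inline sample data (real usage would pipe kiinfo -kitrace output through the same grep):

```shell
# Count block requests carrying the FLUSH flag (i.e., barrier writes).
grep -c 'flags=[^ ]*FLUSH' <<'EOF'
0.274905 cpu=17 pid=11771 block_rq_insert dev_t=0x00800020 wr=write flags=ELVPRIV|SYNC|ALLOCED sector=0x565e00 len=8192
0.274979 cpu=17 pid=11771 block_rq_insert dev_t=0x00800020 wr=write flags=SYNC|ALLOCED|NOIDLE|FLUSH sector=0x0 len=0
0.274981 cpu=17 pid=11771 block_rq_issue dev_t=0x00800020 wr=write flags=SYNC|NOIDLE|FLUSH sector=0xffffffffffffffff len=0
EOF
```

This prints 2, since the barrier write appears as both an insert and an issue record.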

With LinuxKI Toolset 2.0, the SCSI subsystem traces scsi_dispatch_cmd_start and scsi_dispatch_cmd_done can be enabled to identify the SYNCHRONIZE_CACHE commands. For example:

$ kiinfo -kitrace subsys=scsi -a 1000 -p 1| grep SYNC | more
...
    0.001219 cpu=20 pid=156881 scsi_dispatch_cmd_start path=3:0:5:0 **opcode=SYNCHRONIZE_CACHE** cmd_len=10 data_sglen=0, prot_sglen=0 prot_op=0 cmd=[35 00 00 00 00 00 00 00 00 00]
    0.001370 cpu=4 pid=-1 scsi_dispatch_cmd_done path=3:0:5:0 **opcode=SYNCHRONIZE_CACHE** cmd_len=10 data_sglen=0, prot_sglen=0 prot_op=0 cmd=[35 00 00 00 00 00 00 00 00 00] result=0
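The SCSI-level trace can likewise be summarized per path. An illustrative awk sketch that tallies SYNCHRONIZE_CACHE dispatches by H:C:T:L path, with sample records inlined (real usage would pipe the kiinfo -kitrace subsys=scsi output in):

```shell
# Tally SYNCHRONIZE_CACHE dispatches per SCSI path (H:C:T:L).
awk '/scsi_dispatch_cmd_start/ && /SYNCHRONIZE_CACHE/ {
         for (i = 1; i <= NF; i++)
             if ($i ~ /^path=/) n[substr($i, 6)]++
     }
     END { for (p in n) print p, n[p] }' <<'EOF'
0.001219 cpu=20 pid=156881 scsi_dispatch_cmd_start path=3:0:5:0 opcode=SYNCHRONIZE_CACHE cmd_len=10
0.001370 cpu=4 pid=-1 scsi_dispatch_cmd_done path=3:0:5:0 opcode=SYNCHRONIZE_CACHE cmd_len=10 result=0
EOF
```

Only the cmd_start records are counted, so each flush is tallied once per path.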

Solution / Workaround

As a result of this investigation, Bugzilla reports 955441 (RHEL 6) and 1022216 (RHEL 7) were filed. Starting with RHEL 6.5 and RHEL 7, a device's cache_type attribute can be modified to alter the behavior of the disk device. For example, to avoid the barrier writes, a device can be changed from "write back" to "write through" by writing the value "temporary write through" to its cache_type file.

$ cat  /sys/devices/pci0000:00/0000:00:05.0/0000:11:00.0/host3/rport-3:0-5/target3:0:5/3:0:5:0/scsi_disk/3:0:5:0/cache_type
write back
$ echo "temporary write through" > /sys/devices/pci0000:00/0000:00:05.0/0000:11:00.0/host3/rport-3:0-5/target3:0:5/3:0:5:0/scsi_disk/3:0:5:0/cache_type
$ echo 1 > /sys/devices/pci0000:00/0000:00:05.0/0000:11:00.0/host3/rport-3:0-5/target3:0:5/3:0:5:0/rescan
$ cat  /sys/devices/pci0000:00/0000:00:05.0/0000:11:00.0/host3/rport-3:0-5/target3:0:5/3:0:5:0/scsi_disk/3:0:5:0/cache_type
write through

Note that the device must be rescanned for the change to take effect, and the change remains in effect only until the next reboot. The SCSI block devices will also need to be closed and reopened for the change to take effect, which may mean that applications (such as Oracle) need to be restarted. Also, if you are using the multipath driver, you can reconfigure the devices through the multipathd command, as the SCSI block devices are held open by the multipath daemon:

$ echo "reconfigure" | multipathd -k 

_CAUTION: Use with care. Changing the cache_type of a disk device that has a volatile cache, or modifying it improperly, can result in data loss, data corruption, or performance degradation. Before using this in a production environment, be sure to test it in a test, benchmark, or POC environment to measure the impact and verify the behavior._
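Given the caution above, one conservative approach is a dry-run script that only prints the commands for every SCSI disk so they can be reviewed before being run as root. This is a sketch, not a supported tool; the /sys/class/scsi_disk and /sys/class/scsi_device layouts are as on RHEL 6.5+/RHEL 7, and nothing is actually written:

```shell
# Print (do not execute) the cache_type change plus rescan for every SCSI disk.
for disk in /sys/class/scsi_disk/*; do
    [ -e "$disk/cache_type" ] || continue   # skip if no SCSI disks are visible
    hctl=$(basename "$disk")                # H:C:T:L address, e.g. 3:0:5:0
    echo "echo 'temporary write through' > $disk/cache_type"
    echo "echo 1 > /sys/class/scsi_device/$hctl/device/rescan"
done
```

Review the printed commands and drop any devices (e.g. local disks with volatile caches) that should keep their current setting.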

Important note on 3PAR storage arrays

The issue has been common on 3PAR storage arrays that advertise as "Write cache: enabled". However, with 3PAR OS version 3.2.1 GA/MU1, the 3PAR storage arrays should now advertise as "Write cache: disabled". The change is documented in the HP 3PAR OS 3.2.1 GA/MU1 Release Notes:

| Issue ID | Item | Description |
| --- | --- | --- |
| 104168 | Correct reporting of writeback cache enable bit | The WCE bit in the Cache Mode page (08h) is set to zero. This enables correct reporting of the writeback cache enable bit. |

After the 3PAR OS upgrade, be sure to reboot the Linux server or rescan each of the 3PAR devices as previously mentioned.