Unterminated ixgbe NICs - HewlettPackard/LinuxKI GitHub Wiki

LinuxKI Warning

**10Gb Network Performance issues with unterminated ixgbe NICs **
Date: 04/20/2015

Problem

Intel ixgbe based network interface cards (NICs) with driver revision 3.22.02 or lower that have no SFP adapter connected can cause system-wide network delays in the 100-1000ms range when certain commands and utilities attempt to detect the SFP capabilities of the NIC. Other symptoms will include elevated Load Averages and processes in an uninterruptible sleep state. The utilities that attempt to identify the SFP features will wait for multiple 100ms timeouts if no SFP is connected. The waiting task is holding a transport-wide rtnl_lock used to control access to routing and device tables in exclusive mode causing many other unrelated network activities to block for this lock. Common network tasks and services such as the snmpd daemon frequently query the NIC while holding the rtnl_lock.

Investigation

During a period where network commands such as 'ifconfig' appear to hang, Linux KI data was collected and analyzed. Analysis of the hanging 'ifconfig' command shows that the task spent a lot of time sleeping in stacks related to acquisition of the rtnl_lock:

Process Sleep stack traces (sort by % of total wait time) - Top 20 stack traces
   count    wpct      avg   Stack trace
              %     msecs
===============================================================
           6  50.12  1135.401   __mutex_lock_slowpath  mutex_lock  rtnl_lock  devinet_ioctl  inet_ioctl  sock_ioctl
vfs_ioctl  do_vfs_ioctl  sys_ioctl  tracesys
       4  33.76  1146.952   __mutex_lock_slowpath  mutex_lock  rtnl_lock  rtnetlink_rcv  netlink_unicast
netlink_sendmsg  sock_sendmsg  sys_sendto  tracesys
       1   8.36  1136.040   __mutex_lock_slowpath  mutex_lock  netlink_dump  netlink_recvmsg  sock_recvmsg
__sys_recvmsg  sys_recvmsg  tracesys
       1   7.76  1055.000   __mutex_lock_slowpath  mutex_lock  rtnl_lock  dev_ioctl  sock_ioctl  vfs_ioctl 
do_vfs_ioctl  sys_ioctl  tracesys 

The “COOPERATING/COMPETING TASKS REPORT” section for this task will show the other tasks that wake it from these mutex lock sleeps:

******** COOPERATING/COMPETING TASKS REPORT ********

Tasks woken up by this task (Top 10)
       PID    Count   SlpPcnt     Slptime  Command 
    138108        1    36.20%   13.720052  -bash
     71219        1    14.93%    4.515843  kiinfo  (kiinfo)
       231        1     7.54%    2.281812  [migration/76]
     93828        4     6.43%    2.270062  cmathreshd

Tasks that have woken up this task(Top 10)
       PID    Count   SlpPcnt     Slptime  Command 
    144390       11    91.65%   12.456243  /usr/sbin/snmpd
     93806        1     8.35%    1.135013  cmahostd
        -1        0     0.00%    0.000000   ICS

Since the snmpd task appears to be involved with the vast majority of the sleep time, we take a closer look at what it is doing with its time. Note that the snmpd task is also sleeping the majority of the time in the following paths related to querying the ixgbe NIC :

PID 144390  /usr/sbin/snmpd
  PPID 1  /sbin/init

    ********* SCHEDULER ACTIVITY REPORT ********
    RunTime    :  4.518001  SysTime   :  4.511822   UserTime   :  0.006178
    SleepTime  : 35.032811  Sleep Cnt :       325   Wakeup Cnt :        72
    RunQTime   :  0.001950  Switch Cnt:       326   PreemptCnt :         1
    Last CPU   :         0  CPU Migrs :         3   NODE Migrs :         0
    Policy     : SCHED_NORMAL     vss :     27542          rss :      2583

    busy   :     11.42%
      sys  :     11.41%
      user :      0.02%
    runQ   :      0.00%
    sleep  :     88.57%

    ******** COOPERATING/COMPETING TASKS REPORT ********

    Tasks woken up by this task (Top 10)
           PID    Count   SlpPcnt     Slptime  Command 
          8114       40   100.00%   30.640394  /sbin/rsyslogd  (rsyslogd)
         71606       12   100.00%   13.496665  /sbin/ifconfig
         71591       11    91.65%   12.456243  /sbin/ifconfig
         71314        5   100.00%    5.417794  /sbin/ifconfig
         71310        1    14.70%    5.311666  kiinfo  (kiinfo)
         93806        2     2.89%    1.136224  cmahostd
         93828        1     0.00%    0.000000  cmathreshd

    Tasks that have woken up this task(Top 10)
           PID    Count   SlpPcnt     Slptime  Command 
            -1      325   100.00%   35.032811   ICS 
         93828        1     0.00%    0.000000  cmathreshd

    ******** SLEEP REPORT ********
    Kernel Functions calling sleep() - Top 20 Functions
       Count     Pct    SlpTime    Slp% TotalTime%   Msec/Slp   MaxMsecs  Func
         317  97.54%    31.9682  91.25%     80.82%    100.846    101.647  msleep
           7   2.15%     3.0646   8.75%      7.75%    437.799   1002.136  poll_schedule_timeout

    Process Sleep stack traces (sort by % of total wait time) - Top 20 stack traces
       count    wpct      avg   Stack trace
                  %     msecs
    ===============================================================
         310  89.23   100.842   msleep  ixgbe_read_i2c_byte_generic  ixgbe_read_i2c_eeprom_generic 
ixgbe_identify_sfp_module_generic  ixgbe_identify_module_generic  ixgbe_get_settings  dev_ethtool  dev_ioctl   
sock_ioctl  vfs_ioctl  do_vfs_ioctl  sys_ioctl  tracesys  |  ioctl
           7   8.75   437.799   poll_schedule_timeout  do_select  core_sys_select  sys_select  tracesys  | 
__select_nocancel  0x702e64706d6e732f
           7   2.02   101.032   msleep  ixgbe_read_i2c_byte_generic  ixgbe_read_i2c_eeprom_generic 
ixgbe_identify_sfp_module_generic  ixgbe_identify_module_generic  ixgbe_get_settings  dev_ethtool  dev_ioctl 
sock_ioctl  vfs_ioctl  do_vfs_ioctl  sys_ioctl  system_call_fastpath  |  ioctl

Other behaviors that can be observed without LinuxKI data include elevated Load Averages due to processing sleeping in an uninterruptible state. The ps -ax command will show these sleepers with a state flag of "D", and they will typically be those processes making the ioctl() syscall into the driver.

Root Cause

Source code review shows that the rtnl_lock is acquired in dev_ioctl() and held for the duration of the subsequent calls in the stack. The attempt to read the SFP ‘type’ via the call to ixgbe_read_i2c_byte_generic() will sleep for 10 iterations each with a 100ms timeout, thus holding the rtnl_lock for ~1 second. Other commands and network code paths requiring the lock will block waiting for the lock. The ixgbe driver revisions 3.22.0.2 and older can exhibit this problem.

Solution

A fix for the SFP data retrieval algorithm is available in driver versions 3.22.0.3 and newer.

The latest “HP Service Pack for Proliant”, version 2015.04.0, contains the 3.23.0.79-1 version of the ixgbe driver. This Service Pack is available at http://h17007.www1.hpe.com/us/en/enterprise/servers/products/service_pack/spp/index.aspx.

Workarounds

Connect an SFP termination to the NIC(s).

Identify and stop the processes querying the NIC (such as the snmpd daemon).