# When cow doesn't reboot - tum-t38/firefly GitHub Wiki
The disks in Slots 0 and 1 are configured as a RAID1 mirror, Virtual Disk 0. It contains the OS and is marked bootable in the MegaRAID controller configuration. The controller,

```
LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 03)
```

is very old and has trouble selecting the correct boot disk. As a consequence, if cow ever reboots, the BIOS will fail to find the boot disk: for some reason it tries to boot off a different Virtual Disk used by the ZFS setup.
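From a running system, the Virtual Disk the controller intends to boot from can be inspected and pinned with storcli. This is a hedged sketch, assuming controller 0 and VD 0 as described above; whether the BIOS honours the setting through the hot-plug procedure below is untested, so treat it as a diagnostic aid rather than a fix.

```shell
# Sketch: check and pin the boot Virtual Disk on controller 0.
# Assumes storcli64 is on the PATH; prints a note where the tool is absent.
if command -v storcli64 >/dev/null 2>&1; then
    # Show which Virtual Disk the controller currently boots from
    storcli64 /c0 show bootdrive
    # Pin VD0 (the RAID1 OS mirror) as the boot drive
    storcli64 /c0/v0 set bootdrive=on
else
    echo "storcli64 not found; skipping"
fi
```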
The only known solution is to shut down cow, disconnect all the disks except those in Slots 0 and 1, and power it back on. The OS will then boot normally. Once Linux is back up, the remaining disks can be reconnected live. They will be detected by the OS, but they will not yet be properly known to the controller, nor will they belong to their ZFS pools.
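Before touching the controller configuration, it is worth confirming that the kernel has actually seen the hot-reconnected disks. A quick sketch using plain Linux tooling, nothing cow-specific:

```shell
# Sketch: verify the kernel noticed the hot-plugged disks.
# /proc/partitions lists every block device the kernel currently knows about.
cat /proc/partitions
# Recent kernel messages should show "Attached SCSI disk" lines for each drive
# (dmesg may need root; errors are suppressed here)
dmesg 2>/dev/null | tail -n 20 || true
```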
At this point, follow the procedure below. Alternatively, reconnect only the set of disks belonging to a single ZFS pool, run the following commands, and then repeat for the next set of disks.
```shell
# List the physical drives known to the controller: slot, size, firmware state, SMART alerts
storcli64 -PDList -aAll | sed -e "s@\[.* Sectors\]@@g" | grep -e "Firmware state\|Slot Number:\|Raw Size:\|Drive has flagged a S.M.A.R.T alert" | sed 'N;N;N;s/\n/\t\t/g'

# Restart the controller if disks are not appearing in the list
storcli64 /c0 restart

# Clear all foreign flags
storcli64 /c0 /fall delete

# Set up each unconfigured-good drive as a single-disk RAID0 to be ready for ZFS
storcli64 -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0
#storcli64 /c0 restart

# Re-import the ZFS pools and check their state
zpool import -d /dev/disk/by-id -aN
zpool status
```
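The first command above only reformats the controller's drive list into one line per drive: the first `sed` strips the raw sector count, `grep` keeps the four lines of interest per drive, and the final `sed` joins each group of four lines into one. To illustrate, here is the same pipeline fed a fabricated two-drive sample (all values made up):

```shell
# Fabricated sample of the legacy PDList output format (values are made up)
sample='Slot Number: 0
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Firmware state: Online, Spun Up
Drive has flagged a S.M.A.R.T alert : No
Slot Number: 1
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Firmware state: Unconfigured(good), Spun Up
Drive has flagged a S.M.A.R.T alert : No'

# Same pipeline as above, minus the storcli64 call:
# drop the sector count, keep four lines per drive, join them with tabs
echo "$sample" \
  | sed -e "s@\[.* Sectors\]@@g" \
  | grep -e "Firmware state\|Slot Number:\|Raw Size:\|Drive has flagged a S.M.A.R.T alert" \
  | sed 'N;N;N;s/\n/\t\t/g'
```

Each drive comes out as a single tab-separated line, which makes it easy to scan the firmware state of every slot at a glance.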
Once the ZFS pools are back, you may notice on NFS clients that the exported filesystems are still unavailable, or you may get messages about stale NFS file handles.
- Check that rpcbind is functioning correctly with `service rpcbind status`
- Try to restart the NFS server on cow with `service nfs-kernel-server restart`
- It may take a short amount of time for the clients to recover.
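If a client still reports stale file handles after the server is back, a lazy unmount followed by a remount usually clears it. A hedged sketch; the mount point `/mnt/cow` is hypothetical and should be replaced with the real one on the client:

```shell
# Sketch: recover an NFS client after the server restart.
# MOUNTPOINT is hypothetical; substitute the client's real mount point.
MOUNTPOINT=/mnt/cow
# Check the server is exporting again (showmount queries rpcbind on the server)
showmount -e cow 2>/dev/null || echo "server not answering yet"
# Lazily unmount the stale mount, then mount it again from /etc/fstab
umount -l "$MOUNTPOINT" 2>/dev/null
mount "$MOUNTPOINT" 2>/dev/null || echo "remount failed; check the fstab entry for $MOUNTPOINT"
```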