# Replacing Hard Drive in Ambrosia (April 2016)
ZFS reported an unrecoverable (but corrected) error in the zhome pool - note the non-zero CKSUM count on mfid25:

```
# zpool status zhome
  pool: zhome
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 1.69T in 25h10m with 0 errors on Tue Feb  9 13:52:05 2016
config:

        NAME        STATE     READ WRITE CKSUM
        zhome       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            mfid24  ONLINE       0     0     0
            mfid25  ONLINE       0     0     2
            mfid26  ONLINE       0     0     0
            mfid27  ONLINE       0     0     0
            mfid28  ONLINE       0     0     0
            mfid34  ONLINE       0     0     0
            mfid29  ONLINE       0     0     0
            mfid30  ONLINE       0     0     0
            mfid31  ONLINE       0     0     0
            mfid35  ONLINE       0     0     0
            mfid32  ONLINE       0     0     0
            mfid33  ONLINE       0     0     0

errors: No known data errors
```
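The `action:` line offers two ways out. Had we believed the checksum errors were a one-off (say, a transient transport glitch), we could have simply cleared the counters and kept watching; a minimal sketch:

```sh
# Clear the error counters on the suspect vdev, then re-check. If the
# CKSUM count climbs again, the drive itself is the likely culprit.
zpool clear zhome mfid25
zpool status zhome
```

Given the drive's error history below, we opted to replace it instead.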
Which HDD is mfid25?
In /var/log/messages:

```
Apr  4 10:34:53 ambrosia kernel: mfi0: 1014842 (513077710s/0x0002/info) - Unexpected sense: PD 11(e0x09/s2) Path 50000c0f01d2d59a, CDB: 8f 00 00 00 00 01 ca 1e f4 49 00 00 10 00 00 00, Sense: b/11/03
```
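To tell an isolated hiccup from a failing drive, the controller's sense errors can be filtered out of the log (the pattern is just the string visible above); repeated hits on the same PD point at one bad disk rather than, say, a cabling problem:

```sh
# List all mfi0 sense errors; recurring entries for the same PD number
# suggest a failing drive rather than a transient bus issue.
grep 'mfi0' /var/log/messages | grep 'Unexpected sense'
```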
MegaCli shows a Predictive Failure for the drive at [9:1] (enclosure 9, slot 1):
```
# MegaCli -PDList -aAll
# MegaCli -pdInfo -PhysDrv '[9:1]' -a0
Enclosure Device ID: 9
Slot Number: 1
Drive's position: DiskGroup: 25, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 19
WWN: 50000C0F01D2BAD5
Sequence Number: 2
Media Error Count: 34509
Other Error Count: 3094
Predictive Failure Count: 13
Last Predictive Failure Event Seq Number: 1009334
PD Type: SAS
Inquiry Data: WD WD4001FYYG-01SL3VR07WD-WMC1F1253802
```
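Rather than eyeballing the full `-PDList` dump, a quick filter on the counters shown above will surface any other drives trending the same way; a sketch:

```sh
# Print each drive's slot and its trouble counters; any non-zero
# Predictive Failure Count deserves a closer look with -pdInfo.
MegaCli -PDList -aAll | \
    egrep 'Enclosure Device ID|Slot Number|Media Error|Other Error|Predictive Failure Count'
```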
Cross-referencing the drive's serial number with mfiutil:
```
# mfiutil show drives | grep WMC1F1253802
19 (  3726G) ONLINE <WD WD4001FYYG-01SL3 VR07 serial=WD-WMC1F1253802> SCSI-6 E2:S1
```

So mfid25 = PD 11(e0x09/s2) = E2:S1.
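The same mapping can also be read off the controller configuration directly; `mfiutil show config` prints which physical drives make up each array and which volume (mfidN) each array backs:

```sh
# Dump the array/volume layout; each volume lists the arrays and
# physical drives behind it, so serials need not be grepped at all.
mfiutil show config
```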
We blinked the locate LED to find the drive in the chassis, turned it off again, and then failed the drive on the controller before pulling it:

```
# mfiutil locate E2:S1 on
# mfiutil locate E2:S1 off
# mfiutil fail E2:S1
```
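Before yanking the disk, it's worth a quick check that the controller really took the drive out of service; presumably it should no longer show as ONLINE:

```sh
# Confirm the state change at E2:S1 before physically removing the disk.
mfiutil show drives | grep 'E2:S1'
```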
Then we physically replaced the hard drive. This time the machine didn't automatically reboot, likely because of the BIOS change!
Now zhome is degraded:
```
# zpool status zhome
  pool: zhome
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: resilvered 1.69T in 25h10m with 0 errors on Tue Feb  9 13:52:05 2016
config:

        NAME            STATE     READ WRITE CKSUM
        zhome           DEGRADED     0     0     0
          raidz2-0      DEGRADED     0     0     0
            mfid24      ONLINE       0     0     0
            1701713305  REMOVED      0     0     0  was /dev/mfid25
            mfid26      ONLINE       0     0     0
            mfid27      ONLINE       0     0     0
            mfid28      ONLINE       0     0     0
            mfid34      ONLINE       0     0     0
            mfid29      ONLINE       0     0     0
            mfid30      ONLINE       0     0     0
            mfid31      ONLINE       0     0     0
            mfid35      ONLINE       0     0     0
            mfid32      ONLINE       0     0     0
            mfid33      ONLINE       0     0     0

errors: No known data errors
```
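Note the placeholder 1701713305: with /dev/mfid25 gone, ZFS falls back to the missing vdev's numeric ID, and that number is what gets handed to `zpool replace` below. For a quick health check across every pool on the box, a one-liner:

```sh
# Prints only pools with problems ('all pools are healthy' otherwise).
zpool status -x
```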
The new drive appears at E2:S1 as UNCONFIGURED GOOD:

```
# mfiutil show drives
48 (  3726G) UNCONFIGURED GOOD <WD WD4001FYYG-01SL3 VR08 serial=WD-WMC1F0E8KLXY> SCSI-6 E2:S1
```
```
# mfiutil create jbod -v E2:S1
Adding drive 48 to array 35
Adding array 35 to volume 25
```

It worked! No need to discard preserved cache!
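For reference, had the controller pinned preserved cache for the dead volume (as happened in an earlier replacement), it would have refused to reuse the slot until that cache was discarded. As far as we know, the MegaCli incantation is:

```sh
# List any preserved cache, then discard it for the affected logical
# drive (here L25, matching volume 25 above). Destructive: only do this
# for a volume whose data is already gone.
MegaCli -GetPreservedCacheList -a0
MegaCli -DiscardPreservedCache -L25 -a0
```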
```
# mfiutil show volumes
mfi0 Volumes:
  Id     Size    Level   Stripe  State   Cache   Name
 mfid0 (  3725G) RAID-0      64k OPTIMAL Writes
 mfid1 (  3725G) RAID-0      64k OPTIMAL Writes
 mfid2 (  3725G) RAID-0      64k OPTIMAL Writes
 mfid3 (  3725G) RAID-0      64k OPTIMAL Writes
 mfid4 (  3725G) RAID-0      64k OPTIMAL Writes
 mfid5 (  3725G) RAID-0      64k OPTIMAL Writes
 mfid6 (  3725G) RAID-0      64k OPTIMAL Writes
 mfid7 (  3725G) RAID-0      64k OPTIMAL Writes
 mfid8 (  3725G) RAID-0      64k OPTIMAL Writes
 mfid9 (  3725G) RAID-0      64k OPTIMAL Writes
mfid10 (  3725G) RAID-0      64k OPTIMAL Writes
mfid11 (  3725G) RAID-0      64k OPTIMAL Writes
mfid12 (  3725G) RAID-0      64k OPTIMAL Writes
mfid13 (  3725G) RAID-0      64k OPTIMAL Writes
mfid14 (  3725G) RAID-0      64k OPTIMAL Writes
mfid15 (  3725G) RAID-0      64k OPTIMAL Writes
mfid16 (  3725G) RAID-0      64k OPTIMAL Writes
mfid17 (  3725G) RAID-0      64k OPTIMAL Writes
mfid18 (  3725G) RAID-0      64k OPTIMAL Writes
mfid19 (  3725G) RAID-0      64k OPTIMAL Writes
mfid20 (  3725G) RAID-0      64k OPTIMAL Writes
mfid21 (  3725G) RAID-0      64k OPTIMAL Writes
mfid22 (  3725G) RAID-0      64k OPTIMAL Writes
mfid23 (  3725G) RAID-0      64k OPTIMAL Writes
mfid24 (  3725G) RAID-0      64k OPTIMAL Writes
mfid26 (  3725G) RAID-0      64k OPTIMAL Writes
mfid27 (  3725G) RAID-0      64k OPTIMAL Writes
mfid28 (  3725G) RAID-0      64k OPTIMAL Writes
mfid29 (  3725G) RAID-0      64k OPTIMAL Writes
mfid30 (  3725G) RAID-0      64k OPTIMAL Writes
mfid31 (  3725G) RAID-0      64k OPTIMAL Writes
mfid32 (  3725G) RAID-0      64k OPTIMAL Writes
mfid33 (  3725G) RAID-0      64k OPTIMAL Writes
mfid34 (  3725G) RAID-0      64k OPTIMAL Writes
mfid35 (  3725G) RAID-0      64k OPTIMAL Writes
mfid25 (  3725G) RAID-0      64k OPTIMAL Writes
```

So the new volume is still mfid25.
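Before handing the fresh volume back to ZFS, it doesn't hurt to confirm the device node exists and reports the same capacity as its siblings:

```sh
# Print sector size, media size, and other details for the new volume.
diskinfo -v /dev/mfid25
```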
```
# zpool replace zhome 1701713305 mfid25
# zpool status zhome
  pool: zhome
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr  4 11:51:39 2016
        5.46M scanned out of 24.3T at 200K/s, (scan is slow, no estimated time)
        421K resilvered, 0.00% done
config:

        NAME              STATE     READ WRITE CKSUM
        zhome             DEGRADED     0     0     0
          raidz2-0        DEGRADED     0     0     0
            mfid24        ONLINE       0     0     0
            replacing-1   REMOVED      0     0     0
              1701713305  REMOVED      0     0     0  was /dev/mfid25/old
              mfid25      ONLINE       0     0     0  (resilvering)
            mfid26        ONLINE       0     0     0
            mfid27        ONLINE       0     0     0
            mfid28        ONLINE       0     0     0
            mfid34        ONLINE       0     0     0
            mfid29        ONLINE       0     0     0
            mfid30        ONLINE       0     0     0
            mfid31        ONLINE       0     0     0
            mfid35        ONLINE       0     0     0
            mfid32        ONLINE       0     0     0
            mfid33        ONLINE       0     0     0

errors: No known data errors
```
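The previous resilver of this pool took just over a day (25h10m for 1.69T), so rather than stare at it we check back periodically; a minimal watch loop:

```sh
# Re-print the scan progress every 10 minutes until the resilver
# finishes, then show the final pool status.
while zpool status zhome | grep -q 'resilver in progress'; do
    zpool status zhome | grep -A2 'scan:'
    sleep 600
done
zpool status zhome
```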