Replacing Hard Drive In Ambrosia, April 2016

zhome

The zhome pool reported an unrecoverable (but corrected) error on one of its devices:

# zpool status zhome
  pool: zhome
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 1.69T in 25h10m with 0 errors on Tue Feb  9 13:52:05 2016
config:

	NAME        STATE     READ WRITE CKSUM
	zhome       ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    mfid24  ONLINE       0     0     0
	    mfid25  ONLINE       0     0     2
	    mfid26  ONLINE       0     0     0
	    mfid27  ONLINE       0     0     0
	    mfid28  ONLINE       0     0     0
	    mfid34  ONLINE       0     0     0
	    mfid29  ONLINE       0     0     0
	    mfid30  ONLINE       0     0     0
	    mfid31  ONLINE       0     0     0
	    mfid35  ONLINE       0     0     0
	    mfid32  ONLINE       0     0     0
	    mfid33  ONLINE       0     0     0

errors: No known data errors
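
For reference, had the two checksum errors on mfid25 been a one-off transient, they could have been cleared in place as the status message suggests:

# zpool clear zhome mfid25

Given the drive's error history below, replacement was the safer call.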

Which HDD is mfid25?
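
One way to answer that on FreeBSD is mfiutil's configuration dump, which maps each mfid volume to its array and physical drives:

# mfiutil show config

In this case, though, the controller log pointed at the culprit first.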

From /var/log/messages:
Apr  4 10:34:53 ambrosia kernel: mfi0: 1014842 (513077710s/0x0002/info) - Unexpected sense: PD 11(e0x09/s2) Path 50000c0f01d2d59a, CDB: 8f 00 00 00 00 01 ca 1e f4 49 00 00 10 00 00 00, Sense: b/11/03
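
To pull every controller message for that slot, a plain grep over the slot string from the line above works:

# grep 'e0x09/s2' /var/log/messages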

Scanning all physical drives with MegaCli, we see a Predictive Failure on the drive at [9:1] (enclosure 9, slot 1):

# MegaCli -PDList -aAll

# MegaCli -pdInfo -PhysDrv '[9:1]' -a0

Enclosure Device ID: 9
Slot Number: 1
Drive's position: DiskGroup: 25, Span: 0, Arm: 0
Enclosure position: 1
Device Id: 19
WWN: 50000C0F01D2BAD5
Sequence Number: 2
Media Error Count: 34509
Other Error Count: 3094
Predictive Failure Count: 13
Last Predictive Failure Event Seq Number: 1009334
PD Type: SAS

Inquiry Data: WD      WD4001FYYG-01SL3VR07WD-WMC1F1253802 
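
For more detail on those predictive-failure events, the adapter's event log can be dumped to a file; a sketch from memory of the MegaCli syntax (the file name is arbitrary), worth double-checking against your version:

# MegaCli -AdpEventLog -GetEvents -f mfi-events.log -a0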

Cross-referencing the serial number with mfiutil:

# mfiutil show drives | grep WMC1F1253802
19 ( 3726G) ONLINE <WD WD4001FYYG-01SL3 VR07 serial=WD-WMC1F1253802> SCSI-6 E2:S1

So mfid25 = PD 11(e0x09/s2) = E2:S1. We blinked the drive's locate LED to identify it in the chassis, then marked it failed so the controller would release it:
# mfiutil locate E2:S1 on
# mfiutil locate E2:S1 off
# mfiutil fail E2:S1
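
Before pulling the drive, a quick sanity check that the right one is now marked FAILED:

# mfiutil show drives | grep WMC1F1253802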

Then we physically replaced the hard drive. This time the machine didn't automatically reboot, likely because of the BIOS change!

Now zhome is degraded:

# zpool status zhome
  pool: zhome
 state: DEGRADED
status: One or more devices has been removed by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
  scan: resilvered 1.69T in 25h10m with 0 errors on Tue Feb  9 13:52:05 2016
config:

	NAME            STATE     READ WRITE CKSUM
	zhome           DEGRADED     0     0     0
	  raidz2-0      DEGRADED     0     0     0
	    mfid24      ONLINE       0     0     0
	    1701713305  REMOVED      0     0     0  was /dev/mfid25
	    mfid26      ONLINE       0     0     0
	    mfid27      ONLINE       0     0     0
	    mfid28      ONLINE       0     0     0
	    mfid34      ONLINE       0     0     0
	    mfid29      ONLINE       0     0     0
	    mfid30      ONLINE       0     0     0
	    mfid31      ONLINE       0     0     0
	    mfid35      ONLINE       0     0     0
	    mfid32      ONLINE       0     0     0
	    mfid33      ONLINE       0     0     0

errors: No known data errors

The replacement shows up as UNCONFIGURED GOOD, so we wrap it in a new JBOD volume:

# mfiutil show drives
48 ( 3726G) UNCONFIGURED GOOD <WD WD4001FYYG-01SL3 VR08 serial=WD-WMC1F0E8KLXY> SCSI-6 E2:S1
# mfiutil create jbod -v E2:S1
Adding drive 48 to array 35
Adding array 35 to volume 25
It worked! No need to discard the preserved cache!
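
For the record, had the controller been holding preserved (pinned) cache for the lost volume, the create would have been refused until that cache was discarded; roughly, with MegaCli (syntax from memory; -L takes the volume number, 25 here):

# MegaCli -GetPreservedCacheList -a0
# MegaCli -DiscardPreservedCache -L25 -a0
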
# mfiutil show volumes
mfi0 Volumes:
  Id     Size    Level   Stripe  State   Cache   Name
 mfid0 ( 3725G) RAID-0      64k OPTIMAL Writes  
 mfid1 ( 3725G) RAID-0      64k OPTIMAL Writes  
 mfid2 ( 3725G) RAID-0      64k OPTIMAL Writes  
 mfid3 ( 3725G) RAID-0      64k OPTIMAL Writes  
 mfid4 ( 3725G) RAID-0      64k OPTIMAL Writes  
 mfid5 ( 3725G) RAID-0      64k OPTIMAL Writes  
 mfid6 ( 3725G) RAID-0      64k OPTIMAL Writes  
 mfid7 ( 3725G) RAID-0      64k OPTIMAL Writes  
 mfid8 ( 3725G) RAID-0      64k OPTIMAL Writes  
 mfid9 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid10 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid11 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid12 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid13 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid14 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid15 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid16 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid17 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid18 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid19 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid20 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid21 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid22 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid23 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid24 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid26 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid27 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid28 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid29 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid30 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid31 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid32 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid33 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid34 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid35 ( 3725G) RAID-0      64k OPTIMAL Writes  
mfid25 ( 3725G) RAID-0      64k OPTIMAL Writes  

So the new volume is still mfid25, and we can tell ZFS to replace the missing vdev (referenced by the numeric GUID from the status output) with it:

# zpool replace zhome 1701713305 mfid25
# zpool status zhome
  pool: zhome
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Apr  4 11:51:39 2016
        5.46M scanned out of 24.3T at 200K/s, (scan is slow, no estimated time)
        421K resilvered, 0.00% done
config:

	NAME              STATE     READ WRITE CKSUM
	zhome             DEGRADED     0     0     0
	  raidz2-0        DEGRADED     0     0     0
	    mfid24        ONLINE       0     0     0
	    replacing-1   REMOVED      0     0     0
	      1701713305  REMOVED      0     0     0  was /dev/mfid25/old
	      mfid25      ONLINE       0     0     0  (resilvering)
	    mfid26        ONLINE       0     0     0
	    mfid27        ONLINE       0     0     0
	    mfid28        ONLINE       0     0     0
	    mfid34        ONLINE       0     0     0
	    mfid29        ONLINE       0     0     0
	    mfid30        ONLINE       0     0     0
	    mfid31        ONLINE       0     0     0
	    mfid35        ONLINE       0     0     0
	    mfid32        ONLINE       0     0     0
	    mfid33        ONLINE       0     0     0

errors: No known data errors
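
The resilver will take a while (the previous one took just over 25 hours); its progress can be checked periodically:

# zpool status zhome | grep -A 2 'scan:'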