HDD bad block remap
[!WARNING] It later turned out to be the flaky disk bay again: it made the disk malfunction (with mechanical-looking errors, including disk clicking, etc.). Connecting the disk directly to the cables magically resolved all issues...
Honestly, I can no longer tell which errors were real disk errors and which were caused by the problematic disk bay contact (or rather the disk bay connectors?), because a few individual sectors are still reported in self-tests...
I'm lucky to see a failing HDD in action (a very old 200GB Maxtor, SATA variant; I also have an identical disk in the PATA/IDE variant).
I was fortunate enough to have backed it up recently (I had installed 3 BSDs on it for experiments). You can find details on BSD Dump Restore.
How to distinguish from power issues
In the past I had issues with disk bays not working properly (I had to disconnect and reconnect both SATA cables to make them work).
Typical symptoms when there is a faulty connection:
- it takes a long time for the SATA controller to detect the HDD at all
- even the MBR read takes forever or does not complete at all
- there are NO NEW ERRORS in `smartctl` output (no new error log entries and no increase in error counters)
When the drive is really failing, the typical (but not guaranteed!) behavior is:
- drive is quickly detected
- MBR read without issues
- but suddenly some other sector read will take forever and will fail
- increased error numbers in SMART
- new entries in smart error log
Here are typical entries in the SMART error log (output from `smartctl -l error /dev/DRIVE`):
Error 68 occurred at disk power-on lifetime: 30548 hours (1272 days + 20 hours)
When the command that caused the error occurred, the device was in an unknown state.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
5e 00 08 8c 76 ea e0
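Since new error log entries are one of the signs of a really failing drive, it can help to record the highest error number before and after a test. The following is only a minimal sketch: `latest_error_number` is a hypothetical helper name, and it assumes the log entries start with `Error N occurred` exactly as in the entry above.

```shell
#!/bin/sh
# Print the highest "Error N occurred" number found in `smartctl -l error`
# style text on stdin (prints 0 when no entry is found).
# Hypothetical helper; assumes entries look like the one shown above.
latest_error_number() {
    awk '/^Error [0-9]+ occurred/ { if ($2 + 0 > max) max = $2 + 0 }
         END { print max + 0 }'
}

# Real use (needs root): smartctl -l error /dev/sdb | latest_error_number
# Demo with the header line from the entry above:
printf '%s\n' \
  'Error 68 occurred at disk power-on lifetime: 30548 hours (1272 days + 20 hours)' |
  latest_error_number
# prints 68
```

If the number is the same before and after a failed read, the problem is more likely in the connection than in the drive itself.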
To find the LBA of the bad block we can start a long self-test:
smartctl -t long /dev/sdb
After a while we can query the results, and we should see the LBA of the problematic sector:
$ smartctl -l selftest /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-10-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Selective offline Completed: read failure 40% 32584 15365772
# 2 Selective offline Completed: read failure 40% 32584 15365772
# 3 Extended offline Completed: read failure 40% 32584 15365772
Now we know that troublesome LBA sector address is 15365772.
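As a cross-check, the same address can be decoded from the registers in the SMART error log entry shown earlier (`SN=8c CL=76 CH=ea DH=e0`). This is a sketch that assumes plain 28-bit LBA addressing; `lba28_from_regs` is a made-up helper name.

```shell
#!/bin/sh
# Decode a 28-bit LBA from ATA task-file registers (hex, without 0x prefix).
# Assumed layout (LBA28): SN = bits 0-7, CL = bits 8-15, CH = bits 16-23,
# low nibble of DH = bits 24-27.
lba28_from_regs() {
    sn=$1 cl=$2 ch=$3 dh=$4
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

# Registers from the SMART error log entry: SN=8c CL=76 CH=ea DH=e0
lba28_from_regs 8c 76 ea e0
# prints 15365772
```

The result matches the `LBA_of_first_error` column, so the error log and the self-test log point at the same sector.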
To verify that it is indeed true we can try non-destructive read using:
$ dd if=/dev/sdb skip=15365772 of=bad.bin count=1
dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 3.02316 s, 0.0 kB/s
Notice the `Input/output error` and `0 bytes copied`. Also the output of `dmesg | tail`:
[ 2743.673661] ata10.00: status: { DRDY ERR }
[ 2743.678046] ata10.00: error: { UNC }
[ 2743.723253] ata10.00: configured for UDMA/133
[ 2743.727563] sd 9:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
[ 2743.731959] sd 9:0:0:0: [sdb] tag#14 Sense Key : Medium Error [current]
[ 2743.736376] sd 9:0:0:0: [sdb] tag#14 Add. Sense: Unrecovered read error - auto reallocate failed
[ 2743.740901] sd 9:0:0:0: [sdb] tag#14 CDB: Read(10) 28 00 00 ea 76 88 00 00 08 00
[ 2743.745341] I/O error, dev sdb, sector 15365772 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 2743.749858] Buffer I/O error on dev sdb, logical block 1920721, async page read
[ 2743.754401] ata10: EH complete
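The numbers in this log differ from the sector we asked for because a buffered read fetches a whole page. The sketch below reproduces that arithmetic, assuming 512-byte logical sectors and a 4096-byte page (both are assumptions about this system); `page_span` is an illustrative name.

```shell
#!/bin/sh
# Relate a 512-byte sector to the 4096-byte page read by buffered I/O.
SECTOR_SIZE=512    # assumed logical sector size
PAGE_SIZE=4096     # assumed page cache page size

page_span() {
    lba=$1
    per_page=$(( PAGE_SIZE / SECTOR_SIZE ))   # 8 sectors per page
    block=$(( lba / per_page ))               # "logical block" in the Buffer I/O line
    first=$(( block * per_page ))             # first sector of that page (LBA in the CDB)
    echo "block=$block first_sector=$first offset_in_page=$(( lba - first ))"
}

page_span 15365772
# prints block=1920721 first_sector=15365768 offset_in_page=4
```

Here 15365768 is `00 ea 76 88` in the `Read(10)` CDB, whose transfer length is `00 08`, that is 8 sectors, and 1920721 is the logical block in the `Buffer I/O error` line.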
Now the DESTRUCTIVE write test:
WARNING! The command below is wrong: it uses buffered I/O, which tries to read the data first, and the failing read prevents the write that would remap the bad sector.
##### DESTRUCTIVE - OVERWRITES BAD SECTOR! #####
$ dd if=/dev/zero of=/dev/sdb seek=15365772 count=1
dd: writing to '/dev/sdb': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 1.51658 s, 0.0 kB/s
And `dmesg` output:
[ 2824.075133] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 2824.079610] ata10.00: irq_stat 0x40000001
[ 2824.083941] ata10.00: failed command: READ DMA
[ 2824.088292] ata10.00: cmd c8/00:08:88:76:ea/00:00:00:00:00/e0 tag 28 dma 4096 in
res 51/40:08:8c:76:ea/00:00:00:00:00/e0 Emask 0x9 (media error)
[ 2824.097341] ata10.00: status: { DRDY ERR }
[ 2824.101902] ata10.00: error: { UNC }
[ 2824.146833] ata10.00: configured for UDMA/133
[ 2824.151431] sd 9:0:0:0: [sdb] tag#28 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
[ 2824.156146] sd 9:0:0:0: [sdb] tag#28 Sense Key : Medium Error [current]
[ 2824.160835] sd 9:0:0:0: [sdb] tag#28 Add. Sense: Unrecovered read error - auto reallocate failed
[ 2824.165625] sd 9:0:0:0: [sdb] tag#28 CDB: Read(10) 28 00 00 ea 76 88 00 00 08 00
[ 2824.170377] I/O error, dev sdb, sector 15365772 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 2824.175227] ata10: EH complete
[ 2824.225367] sdb: sdb1 sdb2 sdb3
sdb1: <bsd: sdb5 sdb6 >
sdb2: <netbsd: sdb7 sdb8 >
sdb3: <openbsd: sdb9 sdb10bad subpartition - ignored
bad subpartition - ignored
>
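Do you see `READ DMA`? It should be a write! The read is caused by buffered I/O: to change only 512 bytes, the kernel first reads the whole 4096-byte page that contains the sector (note `dma 4096 in`), and that read fails on the bad sector.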
Trying Direct I/O write:
##### DESTRUCTIVE - OVERWRITES BAD SECTOR! #####
$ dd if=/dev/zero of=/dev/sdb seek=15365772 count=1 oflag=direct
1+0 records in
1+0 records out
512 bytes copied, 0.000588475 s, 870 kB/s
Looks good! Now try reading:
$ dd if=/dev/sdb skip=15365772 of=bad.bin count=1 iflag=direct
1+0 records in
1+0 records out
512 bytes copied, 0.00862607 s, 59.4 kB/s
Also looks good!
We can also see the change in the SMART attributes (`diff` against the old SMART output):
65,66c65,66
< 197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 2
< 198 Offline_Uncorrectable 0x0008 253 253 000 Old_age Offline - 0
---
> 197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 1
> 198 Offline_Uncorrectable 0x0008 252 252 000 Old_age Offline - 1
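To watch only these two counters instead of diffing the whole output, a small filter can help. This is a sketch: `bad_sector_attrs` is a hypothetical name, and it assumes the raw value is the last field of the line, which holds for the lines above but not for every attribute on every drive.

```shell
#!/bin/sh
# Print "ID NAME RAW_VALUE" for attributes 197 and 198 from
# `smartctl -A` style text on stdin. Assumes the raw value is the last field.
bad_sector_attrs() {
    awk '$1 == 197 || $1 == 198 { print $1, $2, $NF }'
}

# Real use (needs root): smartctl -A /dev/sdb | bad_sector_attrs
# Demo with the two attribute lines from the diff above:
printf '%s\n' \
  '197 Current_Pending_Sector 0x0008 253 253 000 Old_age Offline - 1' \
  '198 Offline_Uncorrectable 0x0008 252 252 000 Old_age Offline - 1' |
  bad_sector_attrs
# prints:
# 197 Current_Pending_Sector 1
# 198 Offline_Uncorrectable 1
```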
Looks good. Now rerun the SMART self-test, starting from the previously broken sector:
smartctl -t select,15365772-max /dev/sdb
(you should hear HDD seeking for some time)
You can watch progress of this selective self-test with:
$ smartctl -l selective /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-10-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 15365772 398297087 Self_test_in_progress [40% left] (20870712-20936247)
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
OS Recovery
I have 3 BSDs on this disk (FreeBSD, NetBSD, OpenBSD). The affected system was FreeBSD. Unexpectedly, its filesystem was heavily corrupted, with essential commands missing.
So I decided to overwrite that partition with zeroes.
- first try - too slow:
$ dd if=/dev/zero of=/dev/sdb1 oflag=direct status=progress
2525140480 bytes (2.5 GB, 2.4 GiB) copied, 328 s, 7.7 MB/s^
It would take 150 minutes to overwrite complete FreeBSD partition (64GB).
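The estimate can be reproduced with shell arithmetic. This is a rough sketch that assumes the partition is 64 GiB and that the rate reported by `dd` stays constant; `eta_minutes` is a made-up helper name.

```shell
#!/bin/sh
# Estimate whole minutes needed to write SIZE_GIB gibibytes at RATE_KBS kB/s.
# dd reports decimal units, so 7.7 MB/s is passed as 7700 kB/s.
eta_minutes() {
    size_gib=$1 rate_kbs=$2
    bytes=$(( size_gib * 1024 * 1024 * 1024 ))
    echo $(( bytes / (rate_kbs * 1000) / 60 ))
}

eta_minutes 64 7700
# prints 148
```

That is roughly the 150 minutes mentioned above; passing a higher rate as the second argument shows how much a faster transfer would shorten it.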
Trying 1MB block size:
```shell
$ dd if=/dev/zero of=/dev/sdb1 bs=1024k oflag=direct status=progress
14296285184 bytes (14 GB, 13 GiB) copied, 264 s, 54.2 MB/s
```
Ummm, after some time:
[ 870.611126] ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 870.616378] ata10.00: configured for UDMA/133
[ 878.331127] ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ 878.336387] ata10.00: configured for UDMA/133
But I again heard the infamous sounds the disk makes when the disk bay is flaky... That looked suspicious, so I moved the HDD out of the bay to a direct SATA connection and tried again:
$ dd if=/dev/zero of=/dev/sdb1 bs=1024k oflag=direct status=progress
5373952000 bytes (5.4 GB, 5.0 GiB) copied, 81 s, 66.3 MB/s
And surprise! The write finished without a single error (or SATA disconnect) at a nice 63.5 MB/s average rate (remember that this is one of the first SATA drives, actually a PATA drive with an on-board PATA to SATA adapter).
So now I can follow my own guide at BSD Dump Restore to restore the FreeBSD partition.