HDD bad block remap - hpaluch/hpaluch.github.io GitHub Wiki

HDD bad block remap

[!WARNING] It was actually the wacky disk bay again - causing the disk to malfunction (with mechanical-like errors, including disk clicking). Connecting the disk directly to the cables magically resolved all issues...

Honestly, I'm no longer able to distinguish what was a real disk error and what was again a problematic disk bay contact (or rather disk bay connectors?)... Because a few individual sectors are still reported in self-tests...

I'm lucky to see a failing HDD in action (a very old 200GB Maxtor - SATA variant - I also have an identical disk in PATA/IDE variant).

I was fortunate enough to have backed it up recently (I had installed 3 BSDs on it for experiments) - you can find details on BSD Dump Restore.

How to distinguish from power issues

In the past I had issues with disk bays not working properly (I had to disconnect and reconnect both SATA cables to make them work).

Typical symptoms when there is a faulty connection:

- it takes a long time for the SATA controller to detect the HDD at all
- even an MBR read takes forever or does not complete at all
- there are NO NEW ERRORS in smartctl output (no new error log entries and no increase of error counters)

When the drive is really failing, the typical (but not guaranteed!) behavior is:

- the drive is detected quickly
- the MBR is read without issues
- but suddenly a read of some other sector takes forever and then fails
- error counters in SMART increase
- new entries appear in the SMART error log

Here are typical entries in the SMART error log (output from smartctl -l error /dev/DRIVE):

Error 68 occurred at disk power-on lifetime: 30548 hours (1272 days + 20 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  5e 00 08 8c 76 ea e0
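
The last four register bytes already encode the failing address. A minimal sketch of decoding them, assuming the drive uses 28-bit LBA addressing (SN holds bits 0-7, CL bits 8-15, CH bits 16-23 and the low nibble of DH bits 24-27); `lba_from_regs` is a hypothetical helper name, not part of smartmontools:

```shell
# Hypothetical helper: rebuild a 28-bit LBA from the SN, CL, CH and DH
# register bytes (given as hex without the 0x prefix).
lba_from_regs() {
    sn=$1 cl=$2 ch=$3 dh=$4
    # Only the low nibble of DH belongs to the LBA (assumption: LBA28 layout)
    echo $(( ((0x$dh & 0x0f) << 24) | (0x$ch << 16) | (0x$cl << 8) | 0x$sn ))
}

# Register values copied from the error log entry above
lba_from_regs 8c 76 ea e0
```

For the entry above this prints 15365772.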

To find the LBA of the bad block we can start a long self-test:

smartctl -t long /dev/sdb

After a while we can query the results, and we should see the LBA of the problematic sector:

$ smartctl -l selftest /dev/sdb

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-10-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Selective offline   Completed: read failure       40%     32584         15365772
# 2  Selective offline   Completed: read failure       40%     32584         15365772
# 3  Extended offline    Completed: read failure       40%     32584         15365772

Now we know that the LBA of the troublesome sector is 15365772.

To verify that this is indeed true we can try a non-destructive read using:

$ dd if=/dev/sdb skip=15365772 of=bad.bin count=1

dd: error reading '/dev/sdb': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 3.02316 s, 0.0 kB/s

Notice: Input/output error and 0 bytes copied. Also dmesg | tail:

[ 2743.673661] ata10.00: status: { DRDY ERR }
[ 2743.678046] ata10.00: error: { UNC }
[ 2743.723253] ata10.00: configured for UDMA/133
[ 2743.727563] sd 9:0:0:0: [sdb] tag#14 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
[ 2743.731959] sd 9:0:0:0: [sdb] tag#14 Sense Key : Medium Error [current]
[ 2743.736376] sd 9:0:0:0: [sdb] tag#14 Add. Sense: Unrecovered read error - auto reallocate failed
[ 2743.740901] sd 9:0:0:0: [sdb] tag#14 CDB: Read(10) 28 00 00 ea 76 88 00 00 08 00
[ 2743.745341] I/O error, dev sdb, sector 15365772 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 2743.749858] Buffer I/O error on dev sdb, logical block 1920721, async page read
[ 2743.754401] ata10: EH complete
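
The CDB line in the kernel log shows what was actually requested from the drive. A rough sketch of decoding it, assuming the standard Read(10) layout (bytes 2-5 are the big-endian start LBA, bytes 7-8 the sector count, counting from 0); `decode_read10` is a hypothetical helper name:

```shell
# Hypothetical helper: print "start_lba sector_count" for a Read(10) CDB.
# $1..$10 are the ten CDB bytes in hex, in the order shown by the kernel.
decode_read10() {
    lba=$(( (0x$3 << 24) | (0x$4 << 16) | (0x$5 << 8) | 0x$6 ))
    len=$(( (0x$8 << 8) | 0x$9 ))
    echo "$lba $len"
}

# CDB bytes copied from the dmesg output above
decode_read10 28 00 00 ea 76 88 00 00 08 00
```

This prints `15365768 8`: the kernel read eight 512-byte sectors (one 4 KiB page) starting at LBA 15365768, although dd asked for a single sector. 15365768 / 8 = 1920721 matches the "logical block" in the Buffer I/O error line, and the bad sector 15365772 sits at offset 4 within that page.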

Now DESTRUCTIVE write test:

WARNING! The command below is wrong, because it uses buffered I/O, which tries to read first (preventing the bad sector remap).

##### DESTRUCTIVE - OVERWRITES BAD SECTOR! #####

$ dd if=/dev/zero of=/dev/sdb seek=15365772 count=1

dd: writing to '/dev/sdb': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 1.51658 s, 0.0 kB/s

And dmesg output:

[ 2824.075133] ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 2824.079610] ata10.00: irq_stat 0x40000001
[ 2824.083941] ata10.00: failed command: READ DMA
[ 2824.088292] ata10.00: cmd c8/00:08:88:76:ea/00:00:00:00:00/e0 tag 28 dma 4096 in
                        res 51/40:08:8c:76:ea/00:00:00:00:00/e0 Emask 0x9 (media error)
[ 2824.097341] ata10.00: status: { DRDY ERR }
[ 2824.101902] ata10.00: error: { UNC }
[ 2824.146833] ata10.00: configured for UDMA/133
[ 2824.151431] sd 9:0:0:0: [sdb] tag#28 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
[ 2824.156146] sd 9:0:0:0: [sdb] tag#28 Sense Key : Medium Error [current]
[ 2824.160835] sd 9:0:0:0: [sdb] tag#28 Add. Sense: Unrecovered read error - auto reallocate failed
[ 2824.165625] sd 9:0:0:0: [sdb] tag#28 CDB: Read(10) 28 00 00 ea 76 88 00 00 08 00
[ 2824.170377] I/O error, dev sdb, sector 15365772 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 2824.175227] ata10: EH complete
[ 2824.225367]  sdb: sdb1 sdb2 sdb3
                sdb1: <bsd: sdb5 sdb6 >
                sdb2: <netbsd: sdb7 sdb8 >
                sdb3: <openbsd: sdb9 sdb10bad subpartition - ignored
               bad subpartition - ignored
                >

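Do you see READ DMA? It should be a write! This is caused by buffered I/O: the kernel first tries to read the block containing the sector, and that read fails before any write reaches the drive.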

Trying Direct I/O write:

##### DESTRUCTIVE - OVERWRITES BAD SECTOR! #####

$ dd if=/dev/zero of=/dev/sdb seek=15365772 count=1 oflag=direct

1+0 records in
1+0 records out
512 bytes copied, 0.000588475 s, 870 kB/s

Looks good! Now try reading:

$ dd if=/dev/sdb skip=15365772 of=bad.bin count=1 iflag=direct

1+0 records in
1+0 records out
512 bytes copied, 0.00862607 s, 59.4 kB/s

Also looks good!

We can also see the change in the SMART attributes (diff against the old SMART output):

65,66c65,66
< 197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       2
< 198 Offline_Uncorrectable   0x0008   253   253   000    Old_age   Offline      -       0
---
> 197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       1
> 198 Offline_Uncorrectable   0x0008   252   252   000    Old_age   Offline      -       1
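
To track these counters without eyeballing a diff, the raw value can be pulled out of saved `smartctl -A` output. A small sketch, assuming the raw value starts in the tenth whitespace-separated column as in the lines above; `smart_raw` is a hypothetical helper name and the sample line is copied from the diff (without the leading diff marker):

```shell
# Hypothetical helper: print the raw value of one SMART attribute.
# Reads `smartctl -A` style output on stdin; $1 is the attribute ID.
smart_raw() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# Sample attribute line taken from the diff above
line='197 Current_Pending_Sector  0x0008   253   253   000    Old_age   Offline      -       2'
printf '%s\n' "$line" | smart_raw 197
```

On a live system the same helper could be fed directly, for example `smartctl -A /dev/sdb | smart_raw 197`.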

Looks good - now run the SMART self-test again, starting from the previously broken sector:

smartctl -t select,15365772-max /dev/sdb

(you should hear HDD seeking for some time)

You can watch progress of this selective self-test with:

$ smartctl -l selective /dev/sdb

smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.8.12-10-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Selective self-test log data structure revision number 1
 SPAN   MIN_LBA    MAX_LBA  CURRENT_TEST_STATUS
    1  15365772  398297087  Self_test_in_progress [40% left] (20870712-20936247)
    2         0          0  Not_testing
    3         0          0  Not_testing
    4         0          0  Not_testing
    5         0          0  Not_testing

OS Recovery

I have 3 BSDs there (FreeBSD, NetBSD, OpenBSD). The affected system was FreeBSD. Unexpectedly, its filesystem was heavily corrupted - essential commands were missing.

So I decided to overwrite that partition with zeroes.

First try - too slow:

$ dd if=/dev/zero of=/dev/sdb1 oflag=direct status=progress

2525140480 bytes (2.5 GB, 2.4 GiB) copied, 328 s, 7.7 MB/s

At this rate it would take about 150 minutes to overwrite the complete FreeBSD partition (64GB).
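
The estimate is simply partition size divided by throughput. A back-of-the-envelope sketch, assuming the partition is 64 GiB (68719476736 bytes) and the rate stays constant; `eta_minutes` is a hypothetical helper name and uses integer arithmetic, so results are rounded down:

```shell
# Hypothetical helper: estimated minutes to write $1 bytes at $2 bytes per second.
eta_minutes() {
    echo $(( $1 / $2 / 60 ))
}

size=$(( 64 * 1024 * 1024 * 1024 ))   # assumed partition size: 64 GiB
eta_minutes "$size" 7700000           # 7.7 MB/s, the rate dd reported above
```

This prints 148, in line with the roughly 150 minutes quoted above.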

Trying 1MB block size:

```shell
$ dd if=/dev/zero of=/dev/sdb1 bs=1024k oflag=direct status=progress

14296285184 bytes (14 GB, 13 GiB) copied, 264 s, 54.2 MB/s
```

Ummm, after some time:

[  870.611126] ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[  870.616378] ata10.00: configured for UDMA/133
[  878.331127] ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[  878.336387] ata10.00: configured for UDMA/133

But I again heard the infamous sounds from the disk - the ones it makes when the disk bay is wacky... That looked suspicious, so I moved the HDD from the bay to a direct SATA connection and tried again:

$ dd if=/dev/zero of=/dev/sdb1 bs=1024k oflag=direct status=progress

5373952000 bytes (5.4 GB, 5.0 GiB) copied, 81 s, 66.3 MB/s

And surprise! The write finished without a single error (or SATA disconnect) at a nice 63.5 MB/s average rate (remember that this is one of the first SATA drives - actually PATA with an on-board PATA to SATA adapter).

So now I can follow my own guide at BSD Dump Restore to restore the FreeBSD partition.