[Solved] Weird I/O Errors: "Logical unit not ready, cause not reportable"

#1 Post by **Zoot** » 2023-01-07 12:52

EDIT - I/O errors on HDDs/SSDs are usually a sign of a failing drive, but in my case the problem doesn't seem to be hardware related at all (at least as far as I can tell).

UPDATE (19/03/2023): I think I can mark this as solved. The TL;DR is that, in my use case, the Seagate Ironwolf and Toshiba N300 10TB HDDs I am using seem to have a serious problem waking from standby when the standby/spindown timer is set via hdparm -S XXX <drive>, whereas the Western Digital Red 10TB drives do not. Seagate maintains an open-source tool for configuring its drives (openSeaChest); I've switched to it to get the Ironwolf to spin down, and it seems to work quite well. However, I'm unsure what to do about the Toshiba N300; maybe I just have to leave it spinning all the time...

My server, which runs Debian Bullseye, has an array of 8x10TB hard drives. I run MergerFS to pool the drives and SnapRAID for parity protection of the data. All 8 hard drives are connected to an LSI 9207-8i HBA card.

For quite a while now I've been getting repeated I/O errors on a single drive in my server. It all started after I replaced the last 4TB drive in the system with a 10TB drive, in this case a Seagate Ironwolf (non-Pro) 10TB.

Here is a sample of the logs reported by dmesg. After the errors, the drive either gets unmounted entirely or remounted read-only. The errors are always reported on one particular drive in the system; I have yet to see them on any other drive.

Code:

Jan 07 06:43:27 server.home.lan kernel: sd 0:0:2:0: attempting task abort!scmd(0x000000006689c7fe), outstanding for 30464 ms & timeout 30000 ms
Jan 07 06:43:27 server.home.lan kernel: sd 0:0:2:0: [sdc] tag#8278 CDB: Read(16) 88 00 00 00 00 00 52 40 89 00 00 00 00 08 00 00
Jan 07 06:43:27 server.home.lan kernel: scsi target0:0:2: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jan 07 06:43:27 server.home.lan kernel: scsi target0:0:2: enclosure logical id(0x500605b006bbdb30), slot(1)
Jan 07 06:43:31 server.home.lan kernel: sd 0:0:2:0: task abort: SUCCESS scmd(0x000000006689c7fe)
Jan 07 06:43:31 server.home.lan kernel: sd 0:0:2:0: [sdc] tag#8308 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=34s
Jan 07 06:43:31 server.home.lan kernel: sd 0:0:2:0: [sdc] tag#8308 Sense Key : Not Ready [current]
Jan 07 06:43:31 server.home.lan kernel: sd 0:0:2:0: [sdc] tag#8308 Add. Sense: Logical unit not ready, cause not reportable
Jan 07 06:43:31 server.home.lan kernel: sd 0:0:2:0: [sdc] tag#8308 CDB: Read(16) 88 00 00 00 00 00 52 40 89 00 00 00 00 08 00 00
Jan 07 06:43:31 server.home.lan kernel: blk_update_request: I/O error, dev sdc, sector 1379961088 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
Jan 07 06:43:31 server.home.lan kernel: EXT4-fs warning (device sdc1): htree_dirblock_to_tree:1042: inode #21561345: lblock 0: comm Thread Pool Wor: error -5 reading directory block
Jan 07 06:43:33 server.home.lan kernel: [UFW BLOCK] IN=enp35s0 OUT= MAC=d0:50:99:d6:ed:6b:52:54:00:c1:9c:45:08:00 SRC=10.1.9.192 DST=10.1.9.190 LEN=296 TOS=0x00 PREC=0x00 TTL=64 ID>
Jan 07 06:43:53 server.home.lan kernel: [UFW BLOCK] IN=enp35s0 OUT= MAC=d0:50:99:d6:ed:6b:52:54:00:c1:9c:45:08:00 SRC=10.1.9.192 DST=10.1.9.190 LEN=296 TOS=0x00 PREC=0x00 TTL=64 ID>
Jan 07 06:44:01 server.home.lan mono[2954]: [Error] DownloadedEpisodesImportService: Import failed, path does not exist or is not accessible by Sonarr: /Storage1/Torrents/tv-sonarr>
Jan 07 06:44:07 server.home.lan kernel: sd 0:0:2:0: device_block, handle(0x000a)
Jan 07 06:44:09 server.home.lan kernel: sd 0:0:2:0: device_unblock and setting to running, handle(0x000a)
Jan 07 06:44:09 server.home.lan kernel: blk_update_request: I/O error, dev sdc, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
Jan 07 06:44:09 server.home.lan kernel: sd 0:0:2:0: [sdc] Synchronizing SCSI cache
Jan 07 06:44:09 server.home.lan kernel: sd 0:0:2:0: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 07 06:44:09 server.home.lan systemd[1]: Unmounting /Media6...
Jan 07 06:44:09 server.home.lan systemd[26428]: Media6.mount: Succeeded.
Jan 07 06:44:09 server.home.lan systemd[1]: Media6.mount: Succeeded.
Jan 07 06:44:09 server.home.lan systemd[1]: Unmounted /Media6.
Jan 07 06:44:09 server.home.lan kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221102000000)
Jan 07 06:44:09 server.home.lan kernel: mpt2sas_cm0: removing handle(0x000a), sas_addr(0x4433221102000000)
Jan 07 06:44:09 server.home.lan kernel: mpt2sas_cm0: enclosure logical id(0x500605b006bbdb30), slot(1)
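
As a cross-check on the log above, the CDB bytes the kernel prints can be decoded by hand. The following is a minimal Python sketch, not anything taken from the kernel: `decode_rw16_cdb` is a hypothetical helper name, and it assumes the standard 16-byte READ(16)/WRITE(16) layout (opcode in byte 0, an 8-byte big-endian LBA in bytes 2-9, a 4-byte transfer length in bytes 10-13).

```python
def decode_rw16_cdb(cdb_hex: str) -> dict:
    """Decode a READ(16)/WRITE(16) CDB as printed in the kernel log.

    Hypothetical helper for illustration; assumes the standard 16-byte layout.
    """
    raw = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(raw) != 16 or raw[0] not in (0x88, 0x8A):
        raise ValueError("expected a 16-byte READ(16) or WRITE(16) CDB")
    return {
        "op": "READ(16)" if raw[0] == 0x88 else "WRITE(16)",
        "lba": int.from_bytes(raw[2:10], "big"),      # starting logical block
        "blocks": int.from_bytes(raw[10:14], "big"),  # transfer length in blocks
    }


# CDB copied from the failed read on sdc in the log above.
print(decode_rw16_cdb("88 00 00 00 00 00 52 40 89 00 00 00 00 08 00 00"))
# {'op': 'READ(16)', 'lba': 1379961088, 'blocks': 8}
```

The decoded LBA equals the sector in the "I/O error, dev sdc, sector 1379961088" line, so the failed command is an ordinary 8-block directory read and not anything exotic. The two numbers only line up like this when the drive exposes 512-byte logical sectors, which appears to be the case here.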

Based on the "Logical unit not ready, cause not reportable" part of the logs, it looks like the drive stops responding. The only way to get it back is either a reboot or physically removing power from the drive and reconnecting it. The issue is random and not easy to reproduce: sometimes the system will be fine for weeks on end, but other times it happens pretty quickly after a reboot. The drive that the errors are reported on has many more files than the other drives; I'm not sure if that could be a factor. See below:

Code:

SnapRAID status report:

   Files Fragmented Excess  Wasted  Used    Free  Use Name
            Files  Fragments  GB      GB      GB
   18801      42      90   -76.9    5939    3978  59% media1
   15233      17      26   -77.8    8216    1700  82% media2
   11843      14      17   -78.1    6419    3498  64% media3
    5867      18      64    -1.5    6270    3724  62% media4
   44713      36      55   -73.7    5531    4386  55% media5
  690930    1076    1843    14.8    4126    5774  42% media6
 --------------------------------------------------------------------------
  787387    1203    2095    14.8   36503   23063  61%
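
To put a number on how lopsided the file counts are, the per-disk rows of the status report can be turned into shares of the total. This is only an illustrative Python sketch: the rows are copied from the report above, while `parse_file_counts` and `file_shares` are hypothetical helper names, and the parser assumes each data row has exactly eight whitespace-separated fields with the file count first and the disk name last.

```python
STATUS_ROWS = """\
   18801      42      90   -76.9    5939    3978  59% media1
   15233      17      26   -77.8    8216    1700  82% media2
   11843      14      17   -78.1    6419    3498  64% media3
    5867      18      64    -1.5    6270    3724  62% media4
   44713      36      55   -73.7    5531    4386  55% media5
  690930    1076    1843    14.8    4126    5774  42% media6
"""


def parse_file_counts(rows: str) -> dict:
    """Map disk name -> file count for each eight-field data row."""
    counts = {}
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) == 8:
            counts[fields[7]] = int(fields[0])
    return counts


def file_shares(counts: dict) -> dict:
    """Map disk name -> fraction of all files stored on that disk."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


counts = parse_file_counts(STATUS_ROWS)
for name, share in file_shares(counts).items():
    print(f"{name}: {counts[name]:>7} files ({share:.1%})")
```

On these figures media6 holds roughly 87.7% of all files in the array while using only 42% of its own capacity, so it is dominated by small files.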

I do call hdparm at boot to set the drives to spin down after 15 minutes; I'm not sure whether that could be a factor.
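
For reference, the value passed to hdparm -S is an encoded timeout, not a number of minutes. The sketch below follows the encoding described in the hdparm man page; the function names are hypothetical, and the exact value used on this server is not stated in the thread (180 is simply what the encoding gives for 15 minutes).

```python
from typing import Optional


def hdparm_s_to_seconds(value: int) -> Optional[int]:
    """Translate an `hdparm -S` value into a standby timeout in seconds.

    Returns None where no fixed duration applies
    (0 = timer disabled, 253 = vendor-defined, 254 = reserved).
    """
    if not 0 <= value <= 255:
        raise ValueError("hdparm -S takes a value from 0 to 255")
    if value in (0, 253, 254):
        return None
    if value <= 240:
        return value * 5                # 1..240: multiples of 5 seconds
    if value <= 251:
        return (value - 240) * 30 * 60  # 241..251: units of 30 minutes
    if value == 252:
        return 21 * 60                  # 252: 21 minutes
    return 21 * 60 + 15                 # 255: 21 minutes 15 seconds


def seconds_to_hdparm_s(seconds: int) -> int:
    """Encode a timeout of up to 20 minutes (must be a multiple of 5 s)."""
    if seconds <= 0 or seconds > 1200 or seconds % 5:
        raise ValueError("only multiples of 5 s up to 1200 s are handled here")
    return seconds // 5


print(seconds_to_hdparm_s(15 * 60))  # 180
print(hdparm_s_to_seconds(180))      # 900
```

The man page also notes that some drives interpret these values differently, so the table is a starting point and not a guarantee of what a given drive's firmware will do.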

Normally, these types of errors are down to a failing hard drive, and in almost all cases in the past that has indeed been the case for me, but now I'm not convinced it's a hardware problem at all. Here is the list of things I've tried:

SMART Data - All clean

Long, Short & Extended SMART Tests on the Drive - All come up OK

Swapping the Data Cable (SFF8087 to 4X SATA) Twice

Running the badblocks utility on the Drive - No badblocks are found

Swapping the drive for a different one - a Toshiba N300 10TB - Both show the same issues.

Moving the Drive from the LSI HBA card to the motherboard's onboard SATA ports - Same issue

Swapping the Power Supply - Issue re-occurs

At this point, I'm not sure it's a hardware problem at all. It can't be the drive itself, given that swapping it for a drive from a totally different manufacturer made no difference. The HBA card and the motherboard SATA ports showing the same error more or less rules those out too, and swapping out the power supply points away from that as well. If anyone has any comments or suggestions, they would be appreciated.

Things I intend to try:

Different Kernel Versions - As of this morning, I've switched from the 5.10 kernel that ships with Bullseye to kernel 6.0.0 from Bullseye backports

Disabling the Standby timeout on the drive that shows the issue.

#2 Post by **Zoot** » 2023-01-07 19:41

These are the corresponding dmesg logs for the error occurring on a Seagate Ironwolf 10TB. The logs in the post above are from a Toshiba N300 10TB, which gets unmounted entirely after the issue occurs.

The behaviour is slightly different on both drives, but fundamentally the issue appears to be the same.

This time, the drive gets remounted read-only after the issue occurs. The file system on the Ironwolf also reports errors afterwards (according to tune2fs -l); running fsck fixes them, however.

Code:

Dec 23 21:00:47.415403 server.home.lan kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Dec 23 21:00:47.415607 server.home.lan kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Dec 23 21:00:47.415652 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7052 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=4s
Dec 23 21:00:47.416103 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7052 CDB: Read(16) 88 00 00 00 00 03 74 d5 16 00 00 00 02 00 00 00
Dec 23 21:00:47.416440 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 14845023744 op 0x0:(READ) flags 0x80700 phys_seg 63 prio class 0
Dec 23 21:00:47.416492 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7057 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=4s
Dec 23 21:00:47.420680 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7057 CDB: Read(16) 88 00 00 00 00 03 74 d5 18 00 00 00 02 00 00 00
Dec 23 21:00:47.421330 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 14845024256 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 0
Dec 23 21:00:47.421420 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7081 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:47.421903 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7081 Sense Key : Not Ready [current]
Dec 23 21:00:47.422139 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7081 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:47.422396 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7081 CDB: Read(16) 88 00 00 00 00 03 74 d5 16 00 00 00 00 08 00 00
Dec 23 21:00:47.422636 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 14845023744 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Dec 23 21:00:47.422675 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7085 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:47.422911 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7085 Sense Key : Not Ready [current]
Dec 23 21:00:47.423101 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7085 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:47.423265 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7085 CDB: Read(16) 88 00 00 00 00 03 74 d5 1a 00 00 00 02 00 00 00
Dec 23 21:00:47.423431 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 14845024768 op 0x0:(READ) flags 0x80700 phys_seg 59 prio class 0
Dec 23 21:00:47.423460 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7086 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:47.423632 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7086 Sense Key : Not Ready [current]
Dec 23 21:00:47.423798 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7086 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:47.423966 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7086 CDB: Read(16) 88 00 00 00 00 03 74 d5 18 00 00 00 00 08 00 00
Dec 23 21:00:47.424136 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 14845024256 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Dec 23 21:00:47.424166 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7035 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:47.424332 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7035 Sense Key : Not Ready [current]
Dec 23 21:00:47.424499 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7035 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:47.424690 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7035 CDB: Read(16) 88 00 00 00 00 03 74 d5 1c 00 00 00 02 00 00 00
Dec 23 21:00:47.424853 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 14845025280 op 0x0:(READ) flags 0x80700 phys_seg 59 prio class 0
Dec 23 21:00:47.424887 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7037 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:47.425044 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7037 Sense Key : Not Ready [current]
Dec 23 21:00:47.425186 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7037 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:47.425339 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7037 CDB: Read(16) 88 00 00 00 00 03 74 d5 1a 00 00 00 00 08 00 00
Dec 23 21:00:47.425485 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 14845024768 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Dec 23 21:00:47.425510 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6981 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:47.425648 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6981 Sense Key : Not Ready [current]
Dec 23 21:00:47.425799 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6981 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:47.425976 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6981 CDB: Read(16) 88 00 00 00 00 03 74 d5 1e 00 00 00 02 00 00 00
Dec 23 21:00:47.426109 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 14845025792 op 0x0:(READ) flags 0x80700 phys_seg 63 prio class 0
Dec 23 21:00:47.426128 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6982 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:47.426258 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6982 Sense Key : Not Ready [current]
Dec 23 21:00:47.426416 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6982 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:47.426558 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6982 CDB: Read(16) 88 00 00 00 00 03 74 d5 1c 00 00 00 00 08 00 00
Dec 23 21:00:47.426684 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 14845025280 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Dec 23 21:00:47.426702 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6984 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:47.426833 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6984 Sense Key : Not Ready [current]
Dec 23 21:00:47.426968 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6984 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:47.427094 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6984 CDB: Read(16) 88 00 00 00 00 03 74 d5 1e 00 00 00 00 08 00 00
Dec 23 21:00:47.427220 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 14845025792 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Dec 23 21:00:49.308555 server.home.lan kernel: EXT4-fs warning (device sdd1): ext4_end_bio:347: I/O error 10 writing to inode 2067970 starting block 623695621)
Dec 23 21:00:49.312531 server.home.lan kernel: EXT4-fs warning (device sdd1): ext4_end_bio:347: I/O error 10 writing to inode 2067970 starting block 623696128)
Dec 23 21:00:49.312661 server.home.lan kernel: Buffer I/O error on device sdd1, logical block 623695365
Dec 23 21:00:49.312733 server.home.lan kernel: Buffer I/O error on device sdd1, logical block 623695366
Dec 23 21:00:49.312812 server.home.lan kernel: Buffer I/O error on device sdd1, logical block 623695367
Dec 23 21:00:49.312878 server.home.lan kernel: Buffer I/O error on device sdd1, logical block 623695368
Dec 23 21:00:49.312940 server.home.lan kernel: Buffer I/O error on device sdd1, logical block 623695369
Dec 23 21:00:49.312996 server.home.lan kernel: Buffer I/O error on device sdd1, logical block 623695370
Dec 23 21:00:49.313060 server.home.lan kernel: Buffer I/O error on device sdd1, logical block 623695371
Dec 23 21:00:49.313121 server.home.lan kernel: Buffer I/O error on device sdd1, logical block 623695372
Dec 23 21:00:49.313181 server.home.lan kernel: Buffer I/O error on device sdd1, logical block 623695373
Dec 23 21:00:49.313237 server.home.lan kernel: Buffer I/O error on device sdd1, logical block 623695374
Dec 23 21:00:49.313389 server.home.lan kernel: EXT4-fs warning (device sdd1): ext4_end_bio:347: I/O error 10 writing to inode 2067970 starting block 623698176)
Dec 23 21:00:49.316525 server.home.lan kernel: EXT4-fs warning (device sdd1): ext4_end_bio:347: I/O error 10 writing to inode 2067970 starting block 623700224)
Dec 23 21:00:49.320526 server.home.lan kernel: EXT4-fs warning (device sdd1): ext4_end_bio:347: I/O error 10 writing to inode 2067970 starting block 623702272)
Dec 23 21:00:49.320659 server.home.lan kernel: EXT4-fs warning (device sdd1): ext4_end_bio:347: I/O error 10 writing to inode 2067970 starting block 623703920)
Dec 23 21:00:49.320722 server.home.lan kernel: EXT4-fs warning (device sdd1): ext4_end_bio:347: I/O error 10 writing to inode 2067970 starting block 623704176)
Dec 23 21:00:49.320780 server.home.lan kernel: EXT4-fs warning (device sdd1): ext4_end_bio:347: I/O error 10 writing to inode 2067970 starting block 623704320)
Dec 23 21:00:49.320837 server.home.lan kernel: EXT4-fs warning (device sdd1): ext4_end_bio:347: I/O error 10 writing to inode 2067970 starting block 623704576)
Dec 23 21:00:49.320938 server.home.lan kernel: EXT4-fs warning (device sdd1): ext4_end_bio:347: I/O error 10 writing to inode 2067970 starting block 623705092)
Dec 23 21:00:53.480548 server.home.lan kernel: scsi_io_completion_action: 357 callbacks suppressed
Dec 23 21:00:53.480795 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6994 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:53.481305 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6994 Sense Key : Not Ready [current]
Dec 23 21:00:53.481668 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6994 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:53.482026 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6994 CDB: Write(16) 8a 00 00 00 00 01 29 69 d7 00 00 00 0a 00 00 00
Dec 23 21:00:53.482380 server.home.lan kernel: print_req_error: 357 callbacks suppressed
Dec 23 21:00:53.482440 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 4989769472 op 0x1:(WRITE) flags 0x4000 phys_seg 60 prio class 0
Dec 23 21:00:53.482496 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6995 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:53.482854 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6995 Sense Key : Not Ready [current]
Dec 23 21:00:53.483225 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6995 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:53.483588 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6995 CDB: Write(16) 8a 00 00 00 00 01 29 69 e1 00 00 00 0a 00 00 00
Dec 23 21:00:53.483939 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 4989772032 op 0x1:(WRITE) flags 0x4000 phys_seg 55 prio class 0
Dec 23 21:00:53.483988 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6996 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:53.484339 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6996 Sense Key : Not Ready [current]
Dec 23 21:00:53.484725 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6996 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:53.485077 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6996 CDB: Write(16) 8a 00 00 00 00 01 29 69 eb 00 00 00 0a 00 00 00
Dec 23 21:00:53.485429 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 4989774592 op 0x1:(WRITE) flags 0x4000 phys_seg 28 prio class 0
Dec 23 21:00:53.485476 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6997 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:53.485825 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6997 Sense Key : Not Ready [current]
Dec 23 21:00:53.486171 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6997 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:53.486522 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6997 CDB: Write(16) 8a 00 00 00 00 01 29 69 f5 00 00 00 08 10 00 00
Dec 23 21:00:53.486928 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 4989777152 op 0x1:(WRITE) flags 0x0 phys_seg 113 prio class 0
Dec 23 21:00:53.486982 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6998 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:53.487337 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6998 Sense Key : Not Ready [current]
Dec 23 21:00:53.487694 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6998 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:53.488056 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6998 CDB: Write(16) 8a 00 00 00 00 01 29 69 fd 10 00 00 06 70 00 00
Dec 23 21:00:53.488407 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 4989779216 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Dec 23 21:00:53.488457 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6999 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:53.488809 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6999 Sense Key : Not Ready [current]
Dec 23 21:00:53.489138 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6999 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:53.489462 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6999 CDB: Write(16) 8a 00 00 00 00 01 29 6a 03 80 00 00 04 80 00 00
Dec 23 21:00:53.489786 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 4989780864 op 0x1:(WRITE) flags 0x0 phys_seg 57 prio class 0
Dec 23 21:00:53.489830 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7000 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:53.490151 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7000 Sense Key : Not Ready [current]
Dec 23 21:00:53.490477 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7000 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:53.490797 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7000 CDB: Write(16) 8a 00 00 00 00 01 29 6a 08 00 00 00 0a 00 00 00
Dec 23 21:00:53.491118 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 4989782016 op 0x1:(WRITE) flags 0x4000 phys_seg 59 prio class 0
Dec 23 21:00:53.491168 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7001 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:53.491493 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7001 Sense Key : Not Ready [current]
Dec 23 21:00:53.491841 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7001 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:53.492166 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7001 CDB: Write(16) 8a 00 00 00 00 01 29 6a 12 00 00 00 0a 00 00 00
Dec 23 21:00:53.492487 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 4989784576 op 0x1:(WRITE) flags 0x4000 phys_seg 60 prio class 0
Dec 23 21:00:53.492547 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7002 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:53.492865 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7002 Sense Key : Not Ready [current]
Dec 23 21:00:53.493186 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7002 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:53.493516 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7002 CDB: Write(16) 8a 00 00 00 00 01 29 6a 1c 00 00 00 0a 00 00 00
Dec 23 21:00:53.493846 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 4989787136 op 0x1:(WRITE) flags 0x4000 phys_seg 60 prio class 0
Dec 23 21:00:53.493889 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7003 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
Dec 23 21:00:53.494234 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7003 Sense Key : Not Ready [current]
Dec 23 21:00:53.494558 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7003 Add. Sense: Logical unit not ready, cause not reportable
Dec 23 21:00:53.494889 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#7003 CDB: Write(16) 8a 00 00 00 00 01 29 6a 26 00 00 00 0a 00 00 00
Dec 23 21:00:53.495214 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 4989789696 op 0x1:(WRITE) flags 0x4000 phys_seg 57 prio class 0
Dec 23 21:00:53.704541 server.home.lan kernel: Buffer I/O error on dev sdd1, logical block 1220581893, lost sync page write
Dec 23 21:00:53.704702 server.home.lan kernel: Aborting journal on device sdd1-8.
Dec 23 21:00:53.704746 server.home.lan kernel: Buffer I/O error on dev sdd1, logical block 1220575232, lost sync page write
Dec 23 21:00:53.704781 server.home.lan kernel: JBD2: Error -5 detected when updating journal superblock for sdd1-8.
Dec 23 21:00:53.751807 server.home.lan kernel: Buffer I/O error on dev sdd1, logical block 0, lost sync page write
Dec 23 21:00:53.751913 server.home.lan kernel: EXT4-fs (sdd1): I/O error while writing superblock
Dec 23 21:00:53.751977 server.home.lan kernel: EXT4-fs error (device sdd1): ext4_journal_check_start:83: Detected aborted journal
Dec 23 21:00:53.752139 server.home.lan kernel: EXT4-fs (sdd1): Remounting filesystem read-only
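
The first two lines of the log above carry a packed log_info word from the HBA driver. Below is a small Python sketch of how that word splits into the fields the driver prints. The bit layout (4-bit bus type, 4-bit originator, 8-bit code, 16-bit sub-code, from the top down) and the originator names are my assumption based on how the mpt3sas driver formats this message; `decode_sas_loginfo` is a hypothetical helper, and the sketch deliberately does not attempt to interpret what sub-code 0x0d00 means.

```python
def decode_sas_loginfo(log_info: int) -> dict:
    """Split a 32-bit mpt2sas/mpt3sas log_info word into its fields.

    Assumed layout, from most to least significant bits:
    bus_type (4) | originator (4) | code (8) | sub_code (16).
    """
    originators = {0: "IOP", 1: "PL", 2: "IR"}  # assumed driver naming
    originator = (log_info >> 24) & 0xF
    return {
        "bus_type": (log_info >> 28) & 0xF,
        "originator": originators.get(originator, "unknown"),
        "code": (log_info >> 16) & 0xFF,
        "sub_code": log_info & 0xFFFF,
    }


info = decode_sas_loginfo(0x31110D00)
print(info["originator"], hex(info["code"]), hex(info["sub_code"]))
# PL 0x11 0xd00
```

The result agrees with the originator(PL), code(0x11), sub_code(0x0d00) text in the log line itself, which suggests the assumed layout is at least consistent with this driver's output.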

The only thing I can think of now is that the large file count on the problematic disk compared with the other disks is somehow related to what's going on. I wonder if it would be worth trying a filesystem other than EXT4 (all disks are EXT4) to rule that out too.
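
Since the two incidents produce hundreds of near-identical lines, it can help to reduce a log to counts per device before comparing drives. The following Python sketch is one way to do that; the regular expressions are my own and are written only against the line formats visible in this thread, `summarise` is a hypothetical helper, and the sample lines are copied from the logs posted above.

```python
import re
from collections import Counter

# Patterns written against the kernel lines quoted in this thread.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+) op 0x[0-9a-f]+:\((\w+)\)")
SENSE = re.compile(r"\[(\w+)\] tag#\d+ Add\. Sense: (.+)$")

SAMPLE = [
    "Jan 07 06:43:31 server.home.lan kernel: sd 0:0:2:0: [sdc] tag#8308 Add. Sense: Logical unit not ready, cause not reportable",
    "Jan 07 06:43:31 server.home.lan kernel: blk_update_request: I/O error, dev sdc, sector 1379961088 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0",
    "Jan 07 06:44:09 server.home.lan kernel: blk_update_request: I/O error, dev sdc, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0",
    "Dec 23 21:00:53.481668 server.home.lan kernel: sd 0:0:3:0: [sdd] tag#6994 Add. Sense: Logical unit not ready, cause not reportable",
    "Dec 23 21:00:53.482440 server.home.lan kernel: blk_update_request: I/O error, dev sdd, sector 4989769472 op 0x1:(WRITE) flags 0x4000 phys_seg 60 prio class 0",
]


def summarise(lines):
    """Count block-layer I/O errors and SCSI sense messages per device."""
    io_errors = Counter()  # (device, READ/WRITE) -> count
    sense = Counter()      # (device, sense text) -> count
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            io_errors[(match.group(1), match.group(3))] += 1
            continue
        match = SENSE.search(line)
        if match:
            sense[(match.group(1), match.group(2).strip())] += 1
    return io_errors, sense


io_errors, sense = summarise(SAMPLE)
for (dev, op), count in sorted(io_errors.items()):
    print(f"{dev} {op}: {count}")
for (dev, text), count in sorted(sense.items()):
    print(f"{dev} sense '{text}': {count}")
```

The same function could be fed the full kernel log (for example, lines saved from journalctl -k) in place of the short sample, which would make it easier to confirm that only one device is ever affected.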

#3 Post by **Zoot** » 2023-01-08 19:07

A few more thoughts on this:

The Power Supply being the issue is quite unlikely now that I think of it. The PSU I'm using powers 4 SATA devices from each cable. I have the 8 HDDs connected via two individual cables with 4 SATA connectors each.

This of course means that the 3.3V, 5V & 12V rails are shared between a number of the disks, given the cable is shared. Thus a PSU issue is very unlikely, as one would expect to see problems on more than just a single drive.

Initially, when I first got the problem back in Jan 2022, I swapped the data cable (SFF8087 to 4X SATA) and the problem disappeared until Dec 2022. I now know from my debugging that the cable wasn't to blame, though. Could it be that successive kernel releases for Bullseye introduced the problem, removed it, and re-introduced it?

It seems unlikely, but the kernel is definitely one thing that would have been changing as time went on. I don't recall how the releases lined up with my getting the issue back in Jan 2022.

#4 Post by **Aki** » 2023-01-09 14:14

Hello,
Could it be the effect of a race condition between heavy disk load from user programs accessing the disk (e.g. SnapRAID) and a periodic SMART disk scrub (e.g. smartd)?

#5 Post by **p.H** » 2023-01-09 17:37

Zoot wrote: ↑2023-01-08 19:07 The Power Supply being the issue is quite unlikely now that I think of it. The PSU I'm using powers 4 SATA devices from each cable. I have the 8 HDDs connected via two individual cables with 4 SATA connectors each.

This of course means that the 3.3V, 5V & 12V rails are shared between a number of the disks, given the cable is shared. Thus a PSU issue is very unlikely, as one would expect to see problems on more than just a single drive.

Or one would expect that the most active drive is the most likely to have problems.

Zoot wrote: ↑2023-01-08 19:07 Initially, when I first got the problem back in Jan 2022, I swapped the data cable (SFF8087 to 4X SATA) and the problem disappeared until Dec 2022. I now know from my debugging that the cable wasn't to blame

What about the power cable?

#6 Post by **Zoot** » 2023-01-09 22:32

Aki wrote: ↑2023-01-09 14:14 Hello,
Could it be the effect of a race condition between heavy disk load from user programs accessing the disk (e.g. SnapRAID) and a periodic SMART disk scrub (e.g. smartd)?

Possibly, I don't know enough to say yes or no. The one thing I do know is that it always happens when there's load on multiple disks. I've seen it happen during a SnapRAID scrub of the array, during a SnapRAID sync of the array, or when other services I have running (Sonarr/Radarr/Plex etc.) are scanning the drives. I haven't yet found out how to reproduce it reliably. As far as I can see, no single piece of software brings on the issue.

Is there any reason you mention smartd? Are there known issues with it? I do notice I had smartd running; I usually disable it, given that I have scripts of my own that I run as scheduled tasks to check for SMART errors.

p.H wrote: ↑2023-01-09 17:37 Or one would expect that the most active drive is the most likely to have problems.

What about the power cable?

The most active drive consistently having issues points away from a hardware issue, I think. As for the power cable, I have swapped the whole PSU, and that includes the power cables.

#7 Post by **Aki** » 2023-01-10 07:15

Hello,

Zoot wrote: ↑2023-01-09 22:32
Aki wrote: ↑2023-01-09 14:14 Hello,
Could it be the effect of a race condition between heavy disk load from user programs accessing the disk (e.g. SnapRAID) and a periodic SMART disk scrub (e.g. smartd)?
Possibly, I don't know enough to say yes or no. The one thing I do know is that it always happens when there's load on multiple disks. I've seen it happen during a SnapRAID scrub of the array, during a SnapRAID sync of the array, or when other services I have running (Sonarr/Radarr/Plex etc.) are scanning the drives. I haven't yet found out how to reproduce it reliably. As far as I can see, no single piece of software brings on the issue.
Is there any reason you mention smartd? Are there known issues with it? I do notice I had smartd running; I usually disable it, given that I have scripts of my own that I run as scheduled tasks to check for SMART errors.

I don't know about specific issues with SMART during heavy disk loads.
Anyway, it could be reasonable to change the schedule of the SMART checks so that they don't run while other processes are putting heavy load on the disks.

#8 Post by **Zoot** » 2023-01-11 09:03

Aki wrote: ↑2023-01-10 07:15 Hello,

I don't know about specific issues with SMART during heavy disk loads.
Anyway, it could be reasonable to change the schedule of the SMART checks so that they don't run while other processes are putting heavy load on the disks.

Thank you for your suggestion. smartd was enabled and running. It was active a minute or two after the last time I got the error, but it doesn't show up in the logs before then.

When I switched back to Debian a little over 3 years ago with Buster, it was one of the things I disabled as I didn't want it waking the disks in the background from standby.

I meant to actually learn to configure it to my needs, but in the end I wrote my own script to check the SMART data of the drives and run short tests on them instead, set it up as a scheduled task, and just left smartd disabled. It must have been re-enabled when I upgraded from Buster to Bullseye back at release time. I've now disabled it once more.

So I have 4 days of uptime now without the error since I switched to the 6.0.0 kernel in backports. I was encountering the problem within 24-48 hours after a reboot in the past month so that is a very good sign.

If I get to over a week, I'll tentatively call the problem solved and just continue to use the newer kernel from backports until Bookworm is released later this year.

Is it worth my filing a bug report? I've never done so before.

#9 Post by **Zoot** » 2023-01-15 21:55

More updates - Hopefully I'll get to the bottom of it and this thread will prove a useful reference for someone in the future with a similar issue.

So to my disappointment, I got the same error again with the 6.0.0 Kernel present in Backports after a few days of uptime. See below:

Code:

Jan 11 22:20:02 server.home.lan kernel: sd 8:0:2:0: attempting task abort!scmd(0x00000000c502c162), outstanding for 31208 ms & timeout 30000 ms
Jan 11 22:20:02 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5144 CDB: Read(16) 88 00 00 00 00 00 00 00 c9 08 00 00 00 08 00 00
Jan 11 22:20:02 server.home.lan kernel: scsi target8:0:2: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jan 11 22:20:02 server.home.lan kernel: scsi target8:0:2: enclosure logical id(0x500605b006bbdb30), slot(1)
Jan 11 22:20:06 server.home.lan bazarr[3717]: 2023-01-11 22:20:06,159 - urllib3.connectionpool           (7f94288f3bf0) :  WARNING (connectionpool:664) - Retrying (Retry(total=7, c>
Jan 11 22:20:06 server.home.lan kernel: sd 8:0:2:0: task abort: SUCCESS scmd(0x00000000c502c162)
Jan 11 22:20:06 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5160 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=35s
Jan 11 22:20:06 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5160 Sense Key : Not Ready [current]
Jan 11 22:20:06 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5160 Add. Sense: Logical unit not ready, cause not reportable
Jan 11 22:20:06 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5160 CDB: Read(16) 88 00 00 00 00 00 00 00 c9 08 00 00 00 08 00 00
Jan 11 22:20:06 server.home.lan kernel: I/O error, dev sda, sector 51464 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 2
Jan 11 22:20:06 server.home.lan kernel: EXT4-fs error (device sda1): __ext4_find_entry:1663: inode #2: comm bash: reading directory lblock 0
Jan 11 22:20:06 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5161 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Jan 11 22:20:06 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5161 Sense Key : Not Ready [current]
Jan 11 22:20:06 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5161 Add. Sense: Logical unit not ready, cause not reportable
Jan 11 22:20:06 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5161 CDB: Read(16) 88 00 00 00 00 00 00 00 c9 08 00 00 00 08 00 00
Jan 11 22:20:06 server.home.lan kernel: I/O error, dev sda, sector 51464 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 2
Jan 11 22:20:06 server.home.lan kernel: EXT4-fs error (device sda1): __ext4_find_entry:1663: inode #2: comm bash: reading directory lblock 0
Jan 11 22:20:11 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5133 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Jan 11 22:20:11 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5133 Sense Key : Not Ready [current]
Jan 11 22:20:11 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5133 Add. Sense: Logical unit not ready, cause not reportable
Jan 11 22:20:11 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5133 CDB: Write(16) 8a 00 00 00 00 02 46 05 10 98 00 00 00 10 00 00
Jan 11 22:20:11 server.home.lan kernel: I/O error, dev sda, sector 9764671640 op 0x1:(WRITE) flags 0x800 phys_seg 2 prio class 2
Jan 11 22:20:11 server.home.lan kernel: Aborting journal on device sda1-8.
Jan 11 22:20:11 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5134 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Jan 11 22:20:11 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5134 Sense Key : Not Ready [current]
Jan 11 22:20:11 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5134 Add. Sense: Logical unit not ready, cause not reportable
Jan 11 22:20:11 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5134 CDB: Write(16) 8a 08 00 00 00 02 46 04 08 00 00 00 00 08 00 00
Jan 11 22:20:11 server.home.lan kernel: I/O error, dev sda, sector 9764603904 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 2
Jan 11 22:20:11 server.home.lan kernel: Buffer I/O error on dev sda1, logical block 1220575232, lost sync page write
Jan 11 22:20:11 server.home.lan kernel: JBD2: Error -5 detected when updating journal superblock for sda1-8.
Jan 11 22:20:15 server.home.lan kernel: [UFW BLOCK] IN=enp35s0 OUT= MAC=d0:50:99:d6:ed:6b:52:54:00:c1:9c:45:08:00 SRC=10.1.9.192 DST=10.1.9.190 LEN=296 TOS=0x00 PREC=0x00 TTL=64 ID>
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5183 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5183 Sense Key : Not Ready [current]
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5183 Add. Sense: Logical unit not ready, cause not reportable
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5183 CDB: Read(16) 88 00 00 00 00 00 00 00 c9 08 00 00 00 08 00 00
Jan 11 22:20:19 server.home.lan kernel: I/O error, dev sda, sector 51464 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 2
Jan 11 22:20:19 server.home.lan kernel: EXT4-fs error (device sda1): __ext4_find_entry:1663: inode #2: comm bash: reading directory lblock 0
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5101 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5101 Sense Key : Not Ready [current]
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5101 Add. Sense: Logical unit not ready, cause not reportable
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5101 CDB: Read(16) 88 00 00 00 00 00 00 00 c9 08 00 00 00 08 00 00
Jan 11 22:20:19 server.home.lan kernel: I/O error, dev sda, sector 51464 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 2
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5102 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5102 Sense Key : Not Ready [current]
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5102 Add. Sense: Logical unit not ready, cause not reportable
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5102 CDB: Write(16) 8a 08 00 00 00 00 00 00 08 00 00 00 00 08 00 00
Jan 11 22:20:19 server.home.lan kernel: I/O error, dev sda, sector 2048 op 0x1:(WRITE) flags 0x23800 phys_seg 1 prio class 2
Jan 11 22:20:19 server.home.lan kernel: Buffer I/O error on dev sda1, logical block 0, lost sync page write
Jan 11 22:20:19 server.home.lan kernel: EXT4-fs (sda1): I/O error while writing superblock
Jan 11 22:20:19 server.home.lan kernel: EXT4-fs error (device sda1): __ext4_find_entry:1663: inode #2: comm bash: reading directory lblock 0
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5103 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5103 Sense Key : Not Ready [current]
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5103 Add. Sense: Logical unit not ready, cause not reportable
Jan 11 22:20:19 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5103 CDB: Write(16) 8a 08 00 00 00 00 00 00 08 00 00 00 00 08 00 00
Jan 11 22:20:19 server.home.lan kernel: I/O error, dev sda, sector 2048 op 0x1:(WRITE) flags 0x23800 phys_seg 1 prio class 2
Jan 11 22:20:19 server.home.lan kernel: Buffer I/O error on dev sda1, logical block 0, lost sync page write
Jan 11 22:20:19 server.home.lan kernel: EXT4-fs (sda1): I/O error while writing superblock
Jan 11 22:20:25 server.home.lan bazarr[3717]: 2023-01-11 22:20:25,179 - urllib3.connectionpool           (7f94288f3bf0) :  WARNING (connectionpool:664) - Retrying (Retry(total=6, c>
Jan 11 22:20:34 server.home.lan kernel: [UFW BLOCK] IN=enp35s0 OUT= MAC=d0:50:99:d6:ed:6b:d0:50:99:f0:ac:a6:08:00 SRC=10.1.9.191 DST=10.1.9.190 LEN=246 TOS=0x00 PREC=0x00 TTL=64 ID>
Jan 11 22:20:42 server.home.lan kernel: sd 8:0:2:0: device_block, handle(0x000a)
Jan 11 22:20:44 server.home.lan kernel: sd 8:0:2:0: device_unblock and setting to running, handle(0x000a)
Jan 11 22:20:44 server.home.lan kernel: device offline error, dev sda, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 2
Jan 11 22:20:44 server.home.lan systemd[1]: Unmounting /Media6...
Jan 11 22:20:44 server.home.lan kernel: sd 8:0:2:0: [sda] Synchronizing SCSI cache
Jan 11 22:20:44 server.home.lan kernel: sd 8:0:2:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 11 22:20:44 server.home.lan kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221102000000)
Jan 11 22:20:44 server.home.lan kernel: mpt2sas_cm0: removing handle(0x000a), sas_addr(0x4433221102000000)
Jan 11 22:20:44 server.home.lan kernel: mpt2sas_cm0: enclosure logical id(0x500605b006bbdb30), slot(1)
Jan 11 22:20:45 server.home.lan systemd[1]: Media6.mount: Succeeded.
Jan 11 22:20:45 server.home.lan systemd[1]: Unmounted /Media6
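
For anyone reading logs like the ones above: the "CDB" lines are the raw SCSI commands that failed, and they can be decoded to see which sectors were involved. Below is a minimal sketch, with a hypothetical function name, that decodes the 16-byte Read(16)/Write(16) commands seen here. It assumes the standard layout (opcode in byte 0, 8-byte big-endian LBA in bytes 2-9, 4-byte block count in bytes 10-13) and that the drive presents 512-byte logical sectors, so the LBA lines up with the kernel's "sector" number.

```python
def decode_rw16_cdb(cdb_hex):
    """Decode a Read(16)/Write(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 16 or cdb[0] not in (0x88, 0x8A):
        raise ValueError("not a 16-byte Read(16)/Write(16) CDB")
    op = "READ(16)" if cdb[0] == 0x88 else "WRITE(16)"
    lba = int.from_bytes(cdb[2:10], "big")      # starting logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # transfer length in blocks
    return op, lba, blocks

# The repeated failing read from the log above:
print(decode_rw16_cdb("88 00 00 00 00 00 00 00 c9 08 00 00 00 08 00 00"))
# -> ('READ(16)', 51464, 8)
```

The decoded LBA 51464 matches the "I/O error, dev sda, sector 51464" line, so the same small read is simply being retried and failing each time rather than different areas of the disk going bad.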

That sort of rules out the kernel, so I've now returned to the default 5.10.0-20 from the Bullseye repos. I had noticed issues mentioned for audio in the 5.10.0-20 kernel in this thread, but I went back through my logs and I've seen the issue occur in the Bullseye 5.10.0-19 kernel too. Seeing it on three kernel builds across two entirely different kernel versions (5.10 and 6.0) kind of rules that out too.
viewtopic.php?t=153532

I also accidentally induced the problem by accessing files on the disk in question - It was in standby and upon spinning up the error came back once again. I believe the problem only occurs when the disk in question is awoken from standby, it would fit with what goes on in the logs. It seems to happen at the start of Scrubs or when other Software is scanning multiple disks at once.
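
To correlate the errors with standby, it helps to record each drive's power state without waking it. A minimal sketch, assuming `hdparm` is installed and that its `-C` output contains a "drive state is:" line (the helper names are made up for illustration):

```python
import re
import subprocess

def parse_drive_state(hdparm_output):
    # hdparm -C prints e.g. " drive state is:  standby" or "active/idle"
    match = re.search(r"drive state is:\s*(\S+)", hdparm_output)
    return match.group(1) if match else None

def drive_state(device):
    # -C queries the power mode and should not spin the drive up
    result = subprocess.run(["hdparm", "-C", device], capture_output=True, text=True)
    return parse_drive_state(result.stdout)
```

Logging `drive_state("/dev/sdX")` for each disk shortly before a scrub or scan starts would show whether the failing drive was in standby at the time. Whether the query passes cleanly through a given HBA is something to verify on the system in question.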

The good news is that disabling spinning down the drive should work - I did that 2 days ago and the system is still running well, but that is more of a workaround rather than a solution. Although if I have to leave the drive spinning all the time, I don't consider it a total show-stopper, it does have certain advantages too.
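
For reference, the spindown timer mentioned in the first post is set with `hdparm -S`, and its value is not a plain number of seconds. The sketch below, using a hypothetical function name, follows the encoding described in the hdparm man page for values 0-251: 0 disables the timer, 1-240 are multiples of 5 seconds, and 241-251 are multiples of 30 minutes. Values 252-255 have special meanings and are deliberately left out.

```python
def spindown_timeout_seconds(value):
    """Translate an `hdparm -S` value into seconds (None means disabled)."""
    if not 0 <= value <= 251:
        # 252-255 are special/vendor-defined values; see the hdparm man page
        raise ValueError("only values 0-251 are handled in this sketch")
    if value == 0:
        return None                    # standby timer disabled
    if value <= 240:
        return value * 5               # 5 seconds up to 20 minutes
    return (value - 240) * 30 * 60     # 30 minutes up to 5.5 hours

print(spindown_timeout_seconds(0))    # None
print(spindown_timeout_seconds(120))  # 600
print(spindown_timeout_seconds(242))  # 3600
```

So disabling spindown as a workaround corresponds to `hdparm -S 0 <drive>`, while a value such as 242 would mean one hour of idle time before standby.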

The last time I got the error, the drive didn't return after a reboot - I had to re-seat the SATA power connections to get it to come up again. This simply must be a problem with power delivery to the drive. It's likely having multiple drives spinning up at the same time is an issue, why it only occurs on one drive I'll never understand though.

Transient current pulled from the PSU during spin up is probably an issue; maybe that drive just isn't able to pull enough current from the rail while it spins up under certain circumstances. My experience in electronics tells me that the voltage rail should be pulled down if that's the case, and given the voltage rail is common to the rest of the drives, they should experience issues as well. Nonetheless, I lack the appropriate electronic equipment at home to investigate that theory.

I have experimented in manually putting all the drives into standby and waking them all at once but I haven't been able to reproduce the problem intentionally at all. To stress things I tried a scrub of 33% of the Array via SnapRAID with the Prime95 Torture Test running in the background over the weekend and that passed without issue. That wouldn't be totally surprising though if the spin-up specifically is the problem.

I have tried two separate Power Supplies within the server and both have yielded the same result, which sort of points away from the Power Supply. However, both of these Power Supplies are old - one being 10 years old and the other being 14 years old.

Maybe it's reasonable to assume that a consumer grade power supply from 10+ years ago is going to have an issue with 8 modern High Capacity NAS/Enterprise Hard-Drives, given far less power hungry 1-2TB drives were the norm back then. It might still be time for me to try replacing the PSU with a new one, having a new one in the server is probably worth it for the 10 year warranty most PSUs come with now anyway.

I'll continue to update this thread depending on what I find out. I think I do at least have a workaround now.

Zoot · #10 Post by **Zoot** » 2023-01-22 17:55

Some updates:

I got the same error with a 3rd Power Supply, so I know it's definitely not the PSU. I also thought it might be pressure on the power connectors to the drives so I re-did the cable management in the case to remedy that problem, needless to say that didn't solve the issue either. The same drive dropped offline after 2 days. The issue I described with the drive not coming online after a reboot is always solved by doing a cold boot, pointing away from the PSU.

I started to question this part of the logs. It's the only thing that seems to pop up every time just before the error occurs and the drive drops offline, and it points to the LSI card.

Code: Select all

Jan 11 22:20:02 server.home.lan kernel: sd 8:0:2:0: attempting task abort!scmd(0x00000000c502c162), outstanding for 31208 ms & timeout 30000 ms
Jan 11 22:20:02 server.home.lan kernel: sd 8:0:2:0: [sda] tag#5144 CDB: Read(16) 88 00 00 00 00 00 00 00 c9 08 00 00 00 08 00 00
Jan 11 22:20:02 server.home.lan kernel: scsi target8:0:2: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jan 11 22:20:02 server.home.lan kernel: scsi target8:0:2: enclosure logical id(0x500605b006bbdb30), slot(1)
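
The "attempting task abort" line says a command had been outstanding for longer than the SCSI command timeout (30000 ms here, which matches the usual 30 second default exposed in /sys/block/<dev>/device/timeout). A small sketch for pulling those numbers out of the journal, with a hypothetical function name and a regex written against the exact line format shown above:

```python
import re

ABORT_RE = re.compile(
    r"sd (?P<hctl>[\d:]+): attempting task abort!\s*scmd\((?P<scmd>0x[0-9a-f]+)\), "
    r"outstanding for (?P<outstanding>\d+) ms & timeout (?P<timeout>\d+) ms"
)

def parse_task_abort(line):
    match = ABORT_RE.search(line)
    if match is None:
        return None
    outstanding = int(match["outstanding"])
    timeout = int(match["timeout"])
    return {
        "hctl": match["hctl"],               # host:channel:target:lun
        "outstanding_ms": outstanding,
        "timeout_ms": timeout,
        "overrun_ms": outstanding - timeout,  # how far past the timeout it ran
    }
```

Running this over older logs gives a quick list of when, and on which host:channel:target:lun address, a command timed out, which can then be lined up against spin-up events.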

After some research, it does seem I'm not alone with this issue. Here are a few instances of a very similar issue to the one I'm experiencing. The setups are different, (the 1st two threads are from FreeNas & FreeBSD), but the behavior of drives randomly dropping offline with an LSI card is common to all of the people posting in these threads.
https://www.truenas.com/community/threa ... 016.58251/
https://bugs.freebsd.org/bugzilla//show ... ?id=224496
https://bugzilla.kernel.org/show_bug.cgi?id=215446

I did mention above that I saw an issue upon moving the drive from the LSI card to the onboard SATA ports, but I don't recall which drive showed issues in that case and I still would have had the LSI HBA card present in the system with 4 out of the 8 drives plugged into it.

I had not yet tried removing the LSI 9207-8i card completely from my system, so that's what I've gone ahead and done now. I've now gone back to the old setup of a mixture of using the onboard SATA ports and an Add-In PCIe SATA Card. I know this setup to be stable as I had over 300 days of uptime with it while running Buster. These are the SATA Controllers I'm using now, I have no problems after 4 days now.

AMD X470
ASMedia 1062
Marvell 88SE9215

If I get beyond 1 week, I'll tentatively call the issue solved and just retire the LSI 9207-8i card. It's now a "Legacy" controller on Broadcom's website, and the latest firmware release dates from 2015, so there isn't any prospect of a new version.

The Marvell & the ASMedia do have the potential to hold back the 10TB Drives given they're both only fed by a single lane of PCIe 2.0 at 5Gbps or 500MB/s max, but despite that I know them to work well. There are newer 2X PCIe 3.0 SATA Controllers that can be had cheaply that will solve that issue. 8-10 drives is the max I'm going to need for a long time to come.
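
The 500MB/s figure follows from the link arithmetic: PCIe 2.0 signals at 5 GT/s per lane with 8b/10b encoding, and PCIe 3.0 at 8 GT/s with 128b/130b. A quick sketch of the numbers, ignoring protocol overhead and assuming (purely for illustration) that four drives share one controller and are all busy at once:

```python
def lane_mb_per_s(gigatransfers, payload_bits, line_bits):
    # GT/s -> usable MB/s per lane after line encoding (protocol overhead ignored)
    return gigatransfers * 1000 * payload_bits / line_bits / 8

pcie2_x1 = lane_mb_per_s(5, 8, 10)         # 500.0 MB/s
pcie3_x2 = 2 * lane_mb_per_s(8, 128, 130)  # roughly 1969 MB/s

drives = 4  # assumed number of drives active on one controller
print(pcie2_x1 / drives)         # 125.0 MB/s per drive
print(round(pcie3_x2 / drives))  # 492 MB/s per drive
```

A single-lane PCIe 2.0 card would therefore only become the bottleneck when several drives stream at the same time, for example during a scrub, while a two-lane PCIe 3.0 controller leaves plenty of headroom.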

I'll be sure to update the thread once I can confirm the issue is truly down to some issue with the LSI Card.

steve_v · #11 Post by **steve_v** » 2023-01-22 19:08

While I have no immediate answers for you, this is all quite fascinating.
My home server is running 8TB ironwolfs on a pair of old LSI SAS2008 based HBAs, and I can't say it's missed a beat in years. I don't spin down my disks though, and I strongly suspect this has much to do with things.

Have you by chance seen this? I haven't tried it, as I'm not seeing this issue and my policy WRT firmware is "don't fix it if it isn't broken"... But it does sound pretty plausible.

Zoot · #12 Post by **Zoot** » 2023-01-22 19:56

steve_v wrote: ↑2023-01-22 19:08 While I have no immediate answers for you, this is all quite fascinating.
My home server is running 8TB ironwolfs on a pair of old LSI SAS2008 based HBAs, and I can't say it's missed a beat in years. I don't spin down my disks though, and I strongly suspect this has much to do with things.

Have you by chance seen this? I haven't tried it, as I'm not seeing this issue and my policy WRT firmware is "don't fix it if it isn't broken"... But it does sound pretty plausible.

Definitely agree with that policy for firmware, particularly for hard drives.

Interesting you've had no issues with a SAS2008 based card. I suppose it could be an issue with the SAS2308 specifically, which is the controller on the card I have, maybe the combination of that and 10TB drives. Is 8TB the biggest capacity drive you have? What kind of file count would be on your drives out of interest?

I did see that issue reported on the Unraid forums, but it specifically refers to the Ironwolfs. I never did try any of the stuff posted there though. Once I saw the same thing happen to the Toshiba N300, it kind of ruled out that for me.

I also came across this blog post. It's a similar issue to mine, but the author was using ZFS, which I'm not. He had different messages in the logs, but it was resolved through a firmware release from Seagate themselves.
https://blog.quindorian.org/2019/09/iro ... efix.html/

Initially I thought it was that issue I was getting given it was a problem on an Ironwolf for me too, but the Ironwolf I have already has Firmware SC61 on it. Plus I got it on a Toshiba N300 too. I do have a cold spare WD 10TB red that I could try for interest to see if it still happens on that.

However I'm kind of tired of the random drive disconnections since it's a colossal show-stopper for what is supposed to be a NAS. If the setup I have now solves the issue, I'll be leaving it as it is.

steve_v · #13 Post by **steve_v** » 2023-01-22 20:16

Zoot wrote: ↑2023-01-22 19:56Is 8TB the biggest capacity drive you have?

It is, though that will undoubtedly change at some point in the future, hence my interest in this issue.

Zoot wrote: ↑2023-01-22 19:56What kind of file count would be on your drives out of interest?

ZFS (RAIDZ2), so "file count" WRT individual drives is a pretty slippery thing.

Zoot wrote: ↑2023-01-22 19:56Once I saw the same thing happen to the Toshiba N300, it kind of ruled out that for me.

I wouldn't be at all surprised if both Seagate and Toshiba have added similar firmware features in recent models, or if those features cause the same problems with certain older HBAs for that matter.
Since you have ruled out my first guess, power delivery, firmware bugs or incompatibilities are where I would be looking next.
Changing a couple of settings with seatools doesn't sound all that scary to me either, but there's not much point in me trying it as I don't appear to have a problem to solve, at least not right now.

Zoot wrote: ↑2023-01-22 19:56If the setup I have now solves the issue, I'll be leaving it as it is.

Fair enough. Personally I've had nothing but grief from cheap PCIE SATA controllers in the past, and I don't see any options there that get me 16 ports in 2 8x slots anyway. But hey, if it works for you, it works for you.

Zoot · #14 Post by **Zoot** » 2023-01-29 18:26

Just to follow up - I think I can tentatively mark this issue solved. My conclusion is that there is some incompatibility between the particular Hard-Drives I have and the LSI 9207-8i card I have. From the links I posted above it would seem I'm not alone.

I've removed the LSI card and reverted back to a combination of the onboard AMD X470 chipset SATA ports and a PCIe SATA controller card - The system has been 100% stable since. I did reboot once but that was to boot into the new kernel release (5.10.0-21) in Bullseye last week.

These are the SATA controllers present in my system now:

Code: Select all

mark : server @ ~ $ lspci | grep SATA
03:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset SATA Controller (rev 01)
25:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)
27:00.0 SATA controller: ASMedia Technology Inc. Device 1164 (rev 02)
29:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)

6 Drives (2 SSD, 4 HDD) are connected to the AMD X470 Chipset, and the rest are connected to the ASM1164 (4 HDD). I'm not using the ASM1062 since it's a little slower (although should still be no issue for 2 HDDs), it is an additional SATA controller included with my motherboard.

steve_v wrote: ↑2023-01-22 20:16 Fair enough. Personally I've had nothing but grief from cheap PCIE SATA controllers in the past, and I don't see any options there that get me 16 ports in 2 8x slots anyway. But hey, if it works for you, it works for you.

I would definitely like a more modern Server-Grade HBA Card, but they're just too expensive to justify for my use-case. If you require lots of drives - I would agree that you definitely need to go that way.

Aki · #15 Post by **Aki** » 2023-01-29 22:08

Hello,
Thanks for updating the thread.

Zoot · #16 Post by **Zoot** » 2023-02-08 19:46

Sadly to my immense disappointment and frustration, the same drive showed errors again this morning during a scheduled backup task. (I've removed the solved tag from the thread title now)

This time the drive was remounted read-only. These are the logs this time:

Code: Select all

Feb 08 09:13:12 server.home.lan kernel: ata13: link is slow to respond, please be patient (ready=0)
Feb 08 09:13:17 server.home.lan kernel: ata13: COMRESET failed (errno=-16)
Feb 08 09:13:17 server.home.lan kernel: ata13: hard resetting link
Feb 08 09:13:18 server.home.lan bazarr[3764]: 2023-02-08 09:13:18,398 - urllib3.connectionpool           (7f653b20b260) :  WARNING (connectionpool:664) - Retrying (Retry(total=7, c>
Feb 08 09:13:22 server.home.lan kernel: ata13: link is slow to respond, please be patient (ready=0)
Feb 08 09:13:25 server.home.lan bazarr[3764]: 2023-02-08 09:13:25,714 - root                             (7f6542593ae0) :  ERROR (get_episodes:346) - BAZARR Error trying to get ser>
Feb 08 09:13:30 server.home.lan kernel: [UFW BLOCK] IN=enp35s0 OUT= MAC=d0:50:99:d6:ed:6b:d0:50:99:f0:ac:a6:08:00 SRC=10.1.9.191 DST=10.1.9.190 LEN=246 TOS=0x00 PREC=0x00 TTL=64 ID>
Feb 08 09:13:37 server.home.lan bazarr[3764]: 2023-02-08 09:13:37,417 - urllib3.connectionpool           (7f653b20b260) :  WARNING (connectionpool:664) - Retrying (Retry(total=6, c>
Feb 08 09:13:49 server.home.lan kernel: [UFW BLOCK] IN=enp35s0 OUT= MAC=d0:50:99:d6:ed:6b:d0:50:99:f0:ac:a6:08:00 SRC=10.1.9.191 DST=10.1.9.190 LEN=246 TOS=0x00 PREC=0x00 TTL=64 ID>
Feb 08 09:13:52 server.home.lan kernel: ata13: COMRESET failed (errno=-16)
Feb 08 09:13:52 server.home.lan kernel: ata13: limiting SATA link speed to 3.0 Gbps
Feb 08 09:13:52 server.home.lan kernel: ata13: hard resetting link
Feb 08 09:13:57 server.home.lan kernel: ata13: COMRESET failed (errno=-16)
Feb 08 09:13:57 server.home.lan kernel: ata13: reset failed, giving up
Feb 08 09:13:57 server.home.lan kernel: ata13.00: disabled
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#15 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=91s
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#15 Sense Key : Not Ready [current]
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#15 Add. Sense: Logical unit not ready, hard reset required
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#15 CDB: Read(16) 88 00 00 00 00 00 ce 40 09 08 00 00 01 00 00 00
Feb 08 09:13:57 server.home.lan kernel: blk_update_request: I/O error, dev sdj, sector 3460303112 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
Feb 08 09:13:57 server.home.lan kernel: ata13: EH complete
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#20 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=91s
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#20 CDB: Read(16) 88 00 00 00 00 00 ce 40 09 00 00 00 00 08 00 00
Feb 08 09:13:57 server.home.lan kernel: blk_update_request: I/O error, dev sdj, sector 3460303104 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=91s
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#21 CDB: Read(16) 88 00 00 00 00 00 ce 40 09 10 00 00 00 f8 00 00
Feb 08 09:13:57 server.home.lan kernel: blk_update_request: I/O error, dev sdj, sector 3460303120 op 0x0:(READ) flags 0x80000 phys_seg 31 prio class 0
Feb 08 09:13:57 server.home.lan kernel: EXT4-fs error (device sdj1): ext4_get_inode_loc:4522: inode #54067207: block 432537632: comm mono: unable to read itable block
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#20 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#20 CDB: Write(16) 8a 00 00 00 00 00 00 00 08 00 00 00 00 08 00 00
Feb 08 09:13:57 server.home.lan kernel: blk_update_request: I/O error, dev sdj, sector 2048 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Feb 08 09:13:57 server.home.lan kernel: blk_update_request: I/O error, dev sdj, sector 2048 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Feb 08 09:13:57 server.home.lan kernel: Buffer I/O error on dev sdj1, logical block 0, lost sync page write
Feb 08 09:13:57 server.home.lan kernel: EXT4-fs (sdj1): I/O error while writing superblock
Feb 08 09:13:57 server.home.lan kernel: EXT4-fs error (device sdj1) in ext4_reserve_inode_write:5819: IO failure
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#21 CDB: Write(16) 8a 00 00 00 00 00 00 00 08 00 00 00 00 08 00 00
Feb 08 09:13:57 server.home.lan kernel: blk_update_request: I/O error, dev sdj, sector 2048 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Feb 08 09:13:57 server.home.lan kernel: blk_update_request: I/O error, dev sdj, sector 2048 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Feb 08 09:13:57 server.home.lan kernel: Buffer I/O error on dev sdj1, logical block 0, lost sync page write
Feb 08 09:13:57 server.home.lan kernel: EXT4-fs (sdj1): I/O error while writing superblock
Feb 08 09:13:57 server.home.lan kernel: EXT4-fs error (device sdj1): ext4_dirty_inode:6021: inode #54067207: comm mono: mark_inode_dirty error
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#22 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#22 CDB: Write(16) 8a 00 00 00 00 00 00 00 08 00 00 00 00 08 00 00
Feb 08 09:13:57 server.home.lan kernel: blk_update_request: I/O error, dev sdj, sector 2048 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Feb 08 09:13:57 server.home.lan kernel: blk_update_request: I/O error, dev sdj, sector 2048 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Feb 08 09:13:57 server.home.lan kernel: Buffer I/O error on dev sdj1, logical block 0, lost sync page write
Feb 08 09:13:57 server.home.lan kernel: EXT4-fs (sdj1): I/O error while writing superblock
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#31 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#31 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
Feb 08 09:13:57 server.home.lan kernel: blk_update_request: I/O error, dev sdj, sector 9764735552 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
Feb 08 09:13:57 server.home.lan kernel: Aborting journal on device sdj1-8.
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#0 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#0 CDB: Write(16) 8a 00 00 00 00 02 46 04 08 00 00 00 00 08 00 00
Feb 08 09:13:57 server.home.lan kernel: Buffer I/O error on dev sdj1, logical block 1220575232, lost sync page write
Feb 08 09:13:57 server.home.lan kernel: JBD2: Error -5 detected when updating journal superblock for sdj1-8.
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#23 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK cmd_age=0s
Feb 08 09:13:57 server.home.lan kernel: sd 12:0:0:0: [sdj] tag#23 CDB: Write(16) 8a 00 00 00 00 00 00 00 08 00 00 00 00 08 00 00
Feb 08 09:13:57 server.home.lan kernel: Buffer I/O error on dev sdj1, logical block 0, lost sync page write
Feb 08 09:13:57 server.home.lan kernel: EXT4-fs (sdj1): I/O error while writing superblock
Feb 08 09:13:57 server.home.lan kernel: EXT4-fs error (device sdj1): ext4_journal_check_start:83: Detected aborted journal
Feb 08 09:13:57 server.home.lan kernel: EXT4-fs (sdj1): Remounting filesystem read-only
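
A side note on reading these logs: the numeric codes such as "errno=-16" on the COMRESET lines and "Error -5" on the JBD2 line are negated Linux errno values. A tiny sketch using only the Python standard library (the function name is made up) to translate them:

```python
import errno
import os

def describe_errno(value):
    # The kernel reports errors as negative errno values, so take the magnitude
    code = abs(value)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)

print(describe_errno(-16)[0])  # EBUSY
print(describe_errno(-5)[0])   # EIO
```

So the link reset was failing with EBUSY (the device stayed busy and never reported ready), and the journal update failed with a generic I/O error, which is consistent with the drive itself not responding.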

To re-cap and list out all of the things I've tried:

Swapping the SATA Power & Data Cables
Swapping the Power Supply - I've seen the problem with 3 different power supplies.
Swapping and Cloning the Problematic Drive - I've encountered the issue with a Seagate Ironwolf 10TB and a Toshiba N300 10TB.
Different Kernel Versions - I've gotten the issue on the 5.10.0-19, 5.10.0-20 & 5.10.0-21 kernels from the Bullseye Repos, and also the 6.0.0 from the Bullseye Backports.
Connecting the problematic drive to different SATA controllers - An LSI 9207-8i (SAS2308 Controller), an ASMedia 1164 and the Onboard X470 all have shown the same issue.

The only thing that is now left is the CPU, memory or motherboard.

Regarding the memory - I left Memtest86+ running on the server today for 2 hours today to complete 2 passes, and no errors were shown. I had not tried this until today. This leaves me with the CPU and motherboard. My system has an ASRockRack X470D4U motherboard and an AMD Ryzen 5 2600.

Since this problem began plaguing me in December, the system will usually hang on the reboot that follows an occurrence, before it even gets to GRUB, and obviously doesn't come up (as I've mentioned above). It will always fully come up after a full cold boot though.

Hangs and freezes on boot/reboot or while the OS is running have been commonly reported issues with some of the earlier Ryzen generations on Linux. I run my server command line only with no desktop environment installed, so maybe I'm more immune to hangups, but obviously not completely. I think it's reasonable to assume my problem with the drive dropping offline is related to the hangups I get on boot, since both issues tend to occur one after the other.

These sources, among many others, mention changing a particular power setting in the BIOS, the theory being that this low power state is poorly supported. I have dug through the settings in my motherboard's BIOS and enabled it appropriately.
https://forums.unraid.net/topic/84707-r ... zes-daily/
https://wiki.archlinux.org/title/Ryzen
https://forums.unraid.net/topic/46802-f ... ent-819173

I'll have to wait and see if that solves the problem for me. If that turns out to be the problem, it's definitely consistent with my troubleshooting thus far.

The Arch Wiki page suggests that the AMD AGESA 1.2.0.2 release solves the hangs on bootup. ASRock have a newer firmware release for my motherboard that provides that but since I'm on a Ryzen 2000 series the newest I can get is 1.0.0.6 from ASRock. Going by their recommendations, I would have to upgrade my CPU for that which I'm not yet prepared to do.
https://wiki.archlinux.org/title/Ryzen

On another note, my firewall runs IPFire with an AMD Athlon 200GE which is 1st generation Ryzen. I've had no major issues with it on Kernels 4.14, 5.10 & 5.15.

However, it did lock up on me at least once requiring a forced reboot, but it hasn't done so with newer motherboard firmware updates from Asus. It does have AGESA 1.2.0.7 now which is interesting, given the wiki page above. I don't recall the AGESA release it was running when it locked up however. Needless to say, that also runs without any desktop environment but only has a single M.2 drive since it's just a firewall.

As frustrating as this is, if the UEFI setting that I've changed solves the issue for me, I'll settle for that. If not, I'll have to look at moving away from a Ryzen based setup.

Zoot · #17 Post by **Zoot** » 2023-02-21 17:01

Some updates:

Changing the setting in the motherboard's UEFI as I mentioned in the previous post didn't do anything to solve the issue.

I've also since changed the CPU to a newer Ryzen 5 3600, and upgraded to the very latest firmware available from ASRock. Sadly, I had the same drive encounter errors and get re-mounted read-only once again after a week of uptime. Image Attached below - Even less information this time.

I also found out that the hangups on bootup, after the drive drops offline or gets re-mounted read-only, occur while initializing the ROM of the PCIe SATA controller connected to the hard drive in question. I previously had CSM in the BIOS off - I enabled it and now I can see the PCIe Option ROM being initialized, and never going beyond that.

This is kind of bringing me full-circle to start questioning the drive itself once again, as it very much points away from the OS given grub isn't even reached by this point. My previous work of trying different power supplies, SATA controllers, data/power cables and Memtest86 runs also supports that. I do know that removing the power to the drive and re-connecting it to force a hard reset fixes things. Pretty much all of this points to some sort of issue within the drive itself; it's like the drive locks up and stops responding. It fits with the messages from the logs above - "Logical unit not ready. Cause Unreportable", "Logical Unit not ready. Hard Reset required." etc.

My earlier assumption that the CPU was to blame is likely not correct. That makes more sense, since other than the problem with the drive I've discussed here, the system is stable.

What is bizarre is that I've encountered the problem with both a Seagate 10TB Ironwolf & a Toshiba N300 10TB. I'd have thought that two different drives from different manufacturers showing the same issue would point away from the drive, but as steve_v suggested above, maybe it's reasonable to assume that Toshiba & Seagate could be doing something very similar in their firmware. Seagate did already issue a firmware fix to their Ironwolf line to fix a problem similar to what I'm seeing, although that firmware release came included with the Ironwolf when I got it. I can't seem to find any information about the Toshiba drive, but I have seen some posts from people on other forums saying that switching to Western Digital drives from Seagate solved the issue for them.

I have a spare Western Digital 10TB Red drive to try, it's a little old, but still perfectly functional. I've cloned the data from the problematic drive to this one and it's been running without issue for 2 days, but I do know from experience that 2 days is nowhere near enough to declare the problem solved - 1 Month would be much better. Once again, I have to wait and see.

Zoot · #18 Post by **Zoot** » 2023-03-19 18:14

Okay – So by way of update:-

My older Western Digital Red 10TB (WD100EFAX) does NOT show any problem being awoken from standby. It’s been 100% solid for over 28 days, it’s been spun down and awoken many times throughout that period and there have been no issues at all. As a result, it would seem that the Seagate Ironwolf drives & Toshiba N300 drives are affected in my use case, but the Western Digitals are not.

My working theory now is that neither of these drives likes how I set the spindown timer for them, which is to call the following via systemd on bootup for each of the drives.

Code: Select all

hdparm -B 127 -S 180 /dev/disk/by-uuid/<UUID>

The Seagate Ironwolfs don’t support APM (they use EPC instead), so the “-B” switch above is useless for them. The Toshiba does support APM, yet it still has the issue.
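For reference, the numbers in that hdparm call are not plain seconds or percentages. According to the hdparm(8) man page, -S values from 1 to 240 are multiples of 5 seconds and 241 to 251 are multiples of 30 minutes, so -S 180 asks for a 15 minute standby timer. For -B, values from 1 to 127 permit spin-down and 128 to 254 do not, so 127 is the highest setting that still allows the drive to spin down. The small bash helper below is only a sketch of the -S arithmetic; the function name standby_seconds is made up for illustration and is not part of hdparm.

```shell
#!/usr/bin/env bash
# Decode an hdparm -S value into a standby timeout in seconds,
# following the encoding described in the hdparm(8) man page.
# standby_seconds is a hypothetical helper name, not an hdparm feature.
standby_seconds() {
    local v=$1
    if   (( v == 0 ));               then echo 0                        # timer disabled
    elif (( v >= 1   && v <= 240 )); then echo $(( v * 5 ))             # units of 5 seconds
    elif (( v >= 241 && v <= 251 )); then echo $(( (v - 240) * 1800 ))  # units of 30 minutes
    elif (( v == 252 ));             then echo 1260                     # 21 minutes
    elif (( v == 255 ));             then echo 1275                     # 21 minutes 15 seconds
    else
        # 253 is vendor-defined and 254 is reserved, so there is no fixed answer
        echo "vendor-defined or reserved" >&2
        return 1
    fi
}

standby_seconds 180   # prints 900, i.e. 15 minutes
```

Whether a given drive firmware honours the requested timer is a separate matter; the helper only shows what is being asked for.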

I’ve also since learned that Seagate maintains an open-source tool to manage many of the low-level features of their drives:
https://github.com/Seagate/openSeaChest

I’ve been playing around with this and it really gives a lot of fine control over the behavior of the drives. I’ve stopped using hdparm to spin down the Seagate Ironwolf & Exos, and have instead set the appropriate timers using the openSeaChest tool.

It gets the drives to spin down as expected and seems to work well. It also lets me configure the reduced rpm idle_c state, which I was not able to access via hdparm before. I think it probably is the solution to the above problem in the case of the Ironwolf, and to confirm that I may switch the Seagate Ironwolf back in as Media6 once again. However, I’m still at a loss as regards what to do about spinning down the Toshiba N300.

I moved the problematic Toshiba N300 from Media6 to Media2 in my array and I have seen a similar issue while the drive is holding different data. The problem follows the drive, which again confirms the drive is the cause of the issue. This is the SnapRAID array again:

Code: Select all

SnapRAID status report:

   Files Fragmented Excess  Wasted  Used    Free  Use Name
            Files  Fragments  GB      GB      GB
   18990      44      92   -76.9    6223    3694  62% media1
   15396      18      27    -0.3    8410    1584  84% media2
   12481      14      17   -78.1    6887    3029  69% media3
    5933      25      71   -79.1    6440    3477  64% media4
   44320      51      74   -73.9    5665    4252  57% media5
  701721    1161    2296    15.9    4353    5546  44% media6
 --------------------------------------------------------------------------
  798841    1313    2577    15.9   37980   21585  63%

It doesn’t seem the Toshibas are very popular, so it’s hard to find information on spinning them down in Linux. Spinning the N300 down via hdparm gives me the above issue, but I am unsure of another method to spin it down. hd-idle could be worth a shot, which is something I haven’t tried before. Failing that, I think it’s just a case of leaving it spinning all the time, which isn’t ideal.
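Whichever tool ends up setting the timer, one way to check that a drive has actually spun down is hdparm -C, which reports the current power state. The bash sketch below only parses that output; the function name drive_state is made up for illustration, and the sample text assumes the usual "drive state is:" line that hdparm prints.

```shell
#!/usr/bin/env bash
# Extract the power state from `hdparm -C` output read on stdin.
# drive_state is a hypothetical helper name, not part of hdparm.
# Assumes the output contains a line of the form " drive state is:  <state>".
drive_state() {
    awk -F': *' '/drive state is/ { print $2 }'
}

# Sample output fed in by hand, so this runs without root or a real disk.
printf '\n/dev/sdc:\n drive state is:  standby\n' | drive_state   # prints "standby"

# On a real system it would be used along these lines (needs root):
#   hdparm -C /dev/sdX | drive_state
```

Logging this state every few minutes alongside the kernel log would show whether the I/O errors line up with the drive being woken from standby.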

Maybe the moral of the story for me is to stick to Western Digital Reds or Seagate Ironwolfs in my server.

#19 Post by **donald** » 2023-03-19 21:45

@Best_Threads
