UPDATE (19/03/2023): I think I can mark this as solved. The TL;DR version is that, in my use case, the Seagate Ironwolf and Toshiba N300 10TB HDDs I am using seem to have a serious issue with being woken from standby when the standby/spindown timer is set via hdparm -S XXX <drive>, whereas the Western Digital Red 10TB drives do not. Seagate maintains an open-source tool for configuring its drives (openSeaChest), so I've switched to that to get the Ironwolf to spin down, and it seems to work quite well. However, I'm unsure what to do about the Toshiba N300; maybe I just have to leave it spinning all the time...
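For anyone unsure what the XXX in hdparm -S XXX means: it is not a plain number of seconds. Below is a small Python sketch of the encoding as I read it in the hdparm man page. The function name is my own and purely illustrative, and drive firmware may not honour every value, so treat it as a rough guide rather than a definitive reference.

```python
from typing import Optional


def standby_timeout_seconds(value: int) -> Optional[int]:
    """Decode an `hdparm -S` value into seconds (per the hdparm man page).

    Returns 0 when the standby timer is disabled, and None for the
    vendor-defined (253) and reserved (254) values.
    """
    if not 0 <= value <= 255:
        raise ValueError("hdparm -S takes a value between 0 and 255")
    if value == 0:
        return 0                         # standby timer disabled
    if value <= 240:
        return value * 5                 # 1..240: multiples of 5 seconds
    if value <= 251:
        return (value - 240) * 30 * 60   # 241..251: multiples of 30 minutes
    if value == 252:
        return 21 * 60                   # 21 minutes
    if value == 255:
        return 21 * 60 + 15              # 21 minutes 15 seconds
    return None                          # 253 vendor-defined, 254 reserved


if __name__ == "__main__":
    for v in (0, 120, 241, 252):
        print(v, standby_timeout_seconds(v))
```

So, for example, a value of 120 corresponds to a 10 minute timer and 241 to 30 minutes.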
My server, which runs Debian Bullseye, has an array of 8x10TB hard drives. I run MergerFS to pool the drives, and SnapRAID for parity protection of the data. The 8 hard drives are all connected to an LSI 9207-8i HBA card.
I've been getting repeated I/O errors on a single drive in my server for quite a while now. It all started after I upgraded the final 4TB drive in the system to a 10TB drive, in this case a Seagate Ironwolf Non-Pro 10TB.
The errors are always reported on one particular drive; I have yet to see them on any other drive in the system. The drive then either gets unmounted entirely or re-mounted read-only. Here is a sample of the logs reported by dmesg:
Jan 07 06:43:27 server.home.lan kernel: sd 0:0:2:0: attempting task abort!scmd(0x000000006689c7fe), outstanding for 30464 ms & timeout 30000 ms
Jan 07 06:43:27 server.home.lan kernel: sd 0:0:2:0: [sdc] tag#8278 CDB: Read(16) 88 00 00 00 00 00 52 40 89 00 00 00 00 08 00 00
Jan 07 06:43:27 server.home.lan kernel: scsi target0:0:2: handle(0x000a), sas_address(0x4433221102000000), phy(2)
Jan 07 06:43:27 server.home.lan kernel: scsi target0:0:2: enclosure logical id(0x500605b006bbdb30), slot(1)
Jan 07 06:43:31 server.home.lan kernel: sd 0:0:2:0: task abort: SUCCESS scmd(0x000000006689c7fe)
Jan 07 06:43:31 server.home.lan kernel: sd 0:0:2:0: [sdc] tag#8308 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=34s
Jan 07 06:43:31 server.home.lan kernel: sd 0:0:2:0: [sdc] tag#8308 Sense Key : Not Ready [current]
Jan 07 06:43:31 server.home.lan kernel: sd 0:0:2:0: [sdc] tag#8308 Add. Sense: Logical unit not ready, cause not reportable
Jan 07 06:43:31 server.home.lan kernel: sd 0:0:2:0: [sdc] tag#8308 CDB: Read(16) 88 00 00 00 00 00 52 40 89 00 00 00 00 08 00 00
Jan 07 06:43:31 server.home.lan kernel: blk_update_request: I/O error, dev sdc, sector 1379961088 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
Jan 07 06:43:31 server.home.lan kernel: EXT4-fs warning (device sdc1): htree_dirblock_to_tree:1042: inode #21561345: lblock 0: comm Thread Pool Wor: error -5 reading directory block
Jan 07 06:43:33 server.home.lan kernel: [UFW BLOCK] IN=enp35s0 OUT= MAC=d0:50:99:d6:ed:6b:52:54:00:c1:9c:45:08:00 SRC=10.1.9.192 DST=10.1.9.190 LEN=296 TOS=0x00 PREC=0x00 TTL=64 ID>
Jan 07 06:43:53 server.home.lan kernel: [UFW BLOCK] IN=enp35s0 OUT= MAC=d0:50:99:d6:ed:6b:52:54:00:c1:9c:45:08:00 SRC=10.1.9.192 DST=10.1.9.190 LEN=296 TOS=0x00 PREC=0x00 TTL=64 ID>
Jan 07 06:44:01 server.home.lan mono[2954]: [Error] DownloadedEpisodesImportService: Import failed, path does not exist or is not accessible by Sonarr: /Storage1/Torrents/tv-sonarr>
Jan 07 06:44:07 server.home.lan kernel: sd 0:0:2:0: device_block, handle(0x000a)
Jan 07 06:44:09 server.home.lan kernel: sd 0:0:2:0: device_unblock and setting to running, handle(0x000a)
Jan 07 06:44:09 server.home.lan kernel: blk_update_request: I/O error, dev sdc, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
Jan 07 06:44:09 server.home.lan kernel: sd 0:0:2:0: [sdc] Synchronizing SCSI cache
Jan 07 06:44:09 server.home.lan kernel: sd 0:0:2:0: [sdc] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 07 06:44:09 server.home.lan systemd[1]: Unmounting /Media6...
Jan 07 06:44:09 server.home.lan systemd[26428]: Media6.mount: Succeeded.
Jan 07 06:44:09 server.home.lan systemd[1]: Media6.mount: Succeeded.
Jan 07 06:44:09 server.home.lan systemd[1]: Unmounted /Media6.
Jan 07 06:44:09 server.home.lan kernel: mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221102000000)
Jan 07 06:44:09 server.home.lan kernel: mpt2sas_cm0: removing handle(0x000a), sas_addr(0x4433221102000000)
Jan 07 06:44:09 server.home.lan kernel: mpt2sas_cm0: enclosure logical id(0x500605b006bbdb30), slot(1)
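To back up the claim that only one drive is affected, the I/O errors in a log like the one above can be tallied per device. This is a minimal Python sketch; the regular expression is my own and only assumes the "I/O error, dev ..., sector ..." wording seen in these log lines, so it may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Matches the block-layer error lines seen in the log excerpt above.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def count_io_errors(log_lines):
    """Return a Counter mapping device name -> number of I/O error lines."""
    counts = Counter()
    for line in log_lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines copied from the excerpt above; only two are I/O errors.
sample = [
    "Jan 07 06:43:31 server.home.lan kernel: blk_update_request: I/O error, "
    "dev sdc, sector 1379961088 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0",
    "Jan 07 06:43:31 server.home.lan kernel: sd 0:0:2:0: [sdc] tag#8308 "
    "Sense Key : Not Ready [current]",
    "Jan 07 06:44:09 server.home.lan kernel: blk_update_request: I/O error, "
    "dev sdc, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0",
]

if __name__ == "__main__":
    for device, count in count_io_errors(sample).items():
        print(device, count)
```

In practice the lines would come from a saved copy of the kernel log (for example the output of journalctl -k written to a file) rather than a hard-coded list.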
SnapRAID status report:
 Files  Fragmented     Excess  Wasted   Used   Free  Use  Name
             Files  Fragments      GB     GB     GB
 18801          42         90   -76.9   5939   3978  59%  media1
 15233          17         26   -77.8   8216   1700  82%  media2
 11843          14         17   -78.1   6419   3498  64%  media3
  5867          18         64    -1.5   6270   3724  62%  media4
 44713          36         55   -73.7   5531   4386  55%  media5
690930        1076       1843    14.8   4126   5774  42%  media6
----------------------------------------------------------------
787387        1203       2095    14.8  36503  23063  61%
Normally, this type of error is down to a failing hard drive, and in almost all cases in the past that has indeed been the case for me, but now I'm not even convinced it's a hardware problem. Here is the list of things I've tried:
- SMART Data is all Clean.
- Long, Short & Extended SMART Tests on the Drive - All come up OK
- Swapping the Data Cable (SFF8087 to 4X SATA) Twice
- Running the badblocks utility on the Drive - No badblocks are found
- Swapping the drive for a different one - a Toshiba N300 10TB - Both show the same issues.
- Moving the Drive from the LSI HBA card to the motherboard's onboard SATA ports - Same issue
- Swapping the Power Supply - Issue re-occurs
Things I intend to try:
- Different Kernel Versions - As of this morning, I've switched from the 5.10 kernel that ships with Bullseye to Kernel 6.0.0 from the Bullseye backports
- Disabling the Standby timeout on the drive that shows the issue.
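
For the last item, this is roughly what I have in mind, written as a small Python wrapper around hdparm. It is only a sketch: the helper names are my own, /dev/sdc is simply the device name from the log above (it can change between boots), and I'm assuming that hdparm -S 0 turns the standby timer off and hdparm -C reports the current power state, as the hdparm man page describes. By default it only prints the commands; actually running them needs root.

```python
import subprocess


def disable_standby_cmd(device):
    """Command that sets the standby (spindown) timer to 0, i.e. disabled."""
    return ["hdparm", "-S", "0", device]


def power_state_cmd(device):
    """Command that asks the drive for its current power mode."""
    return ["hdparm", "-C", device]


def run(cmd, dry_run=True):
    """Return the command as a string (dry run) or execute it and return stdout."""
    if dry_run:
        return " ".join(cmd)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    device = "/dev/sdc"  # assumption: the affected drive from the log above
    print(run(disable_standby_cmd(device)))
    print(run(power_state_cmd(device)))
```

Checking the power state before and after a period of inactivity should show whether the drive still drops into standby once the timer is disabled.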