[Solved] Software raid5 and possible failed disk

Postby coppolino97 » 2020-01-25 18:07

Hi all,
I have bought a used HP Microserver to use as a home server for DLNA, file shares and printers.
Today I started configuring it by installing Debian (there is a dedicated SSD for the operating system).
Later I started creating a software RAID. I decided to use RAID 5 because there are 4 disks that I will use to keep my data.

The RAID creation is very slow, so I tried to find out why.
After some checks I found this in the kernel log:

Code: Select all
[ 4320.966575] sd 2:0:0:0: [sdc] tag#22 CDB: Read(10) 28 00 00 40 17 58 00 00 08 00
[ 4320.966581] blk_update_request: I/O error, dev sdc, sector 4200284 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
[ 4320.966606] md/raid:md0: read error not correctable (sector 4198232 on sdc1).
[ 4353.206788] sd 2:0:0:0: [sdc] tag#6 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4353.206795] sd 2:0:0:0: [sdc] tag#6 Sense Key : Medium Error [current]
[ 4353.206802] sd 2:0:0:0: [sdc] tag#6 Add. Sense: Unrecovered read error - auto reallocate failed
[ 4353.206809] sd 2:0:0:0: [sdc] tag#6 CDB: Read(10) 28 00 00 40 21 e0 00 00 08 00
[ 4353.206816] blk_update_request: I/O error, dev sdc, sector 4202976 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
[ 4353.206842] md/raid:md0: read error not correctable (sector 4200928 on sdc1).
[ 4375.038385] sd 2:0:0:0: [sdc] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4375.038393] sd 2:0:0:0: [sdc] tag#26 Sense Key : Medium Error [current]
[ 4375.038399] sd 2:0:0:0: [sdc] tag#26 Add. Sense: Unrecovered read error - auto reallocate failed
[ 4375.038406] sd 2:0:0:0: [sdc] tag#26 CDB: Read(10) 28 00 00 40 21 e8 00 00 08 00
[ 4375.038413] blk_update_request: I/O error, dev sdc, sector 4202984 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
[ 4375.038440] md/raid:md0: read error not correctable (sector 4200936 on sdc1).
[ 4474.201327] sd 2:0:0:0: [sdc] tag#29 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4474.201334] sd 2:0:0:0: [sdc] tag#29 Sense Key : Medium Error [current]
[ 4474.201341] sd 2:0:0:0: [sdc] tag#29 Add. Sense: Unrecovered read error - auto reallocate failed
[ 4474.201347] sd 2:0:0:0: [sdc] tag#29 CDB: Read(10) 28 00 00 40 10 b0 00 00 08 00
[ 4474.201355] blk_update_request: I/O error, dev sdc, sector 4198576 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
[ 4474.201380] md/raid:md0: read error not correctable (sector 4196528 on sdc1).
[ 4476.839511] sd 2:0:0:0: [sdc] tag#22 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4476.839518] sd 2:0:0:0: [sdc] tag#22 Sense Key : Medium Error [current]
[ 4476.839524] sd 2:0:0:0: [sdc] tag#22 Add. Sense: Unrecovered read error - auto reallocate failed
[ 4476.839531] sd 2:0:0:0: [sdc] tag#22 CDB: Read(10) 28 00 00 40 10 b8 00 00 08 00
[ 4476.839537] blk_update_request: I/O error, dev sdc, sector 4198584 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
[ 4476.839563] md/raid:md0: read error not correctable (sector 4196536 on sdc1).
[ 4479.475997] sd 2:0:0:0: [sdc] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4479.476004] sd 2:0:0:0: [sdc] tag#23 Sense Key : Medium Error [current]
[ 4479.476010] sd 2:0:0:0: [sdc] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
[ 4479.476017] sd 2:0:0:0: [sdc] tag#23 CDB: Read(10) 28 00 00 40 10 c0 00 00 08 00
[ 4479.476024] blk_update_request: I/O error, dev sdc, sector 4198592 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
[ 4479.476051] md/raid:md0: read error not correctable (sector 4196544 on sdc1).
[ 4482.120801] sd 2:0:0:0: [sdc] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4482.120808] sd 2:0:0:0: [sdc] tag#7 Sense Key : Medium Error [current]
[ 4482.120814] sd 2:0:0:0: [sdc] tag#7 Add. Sense: Unrecovered read error - auto reallocate failed
[ 4482.120821] sd 2:0:0:0: [sdc] tag#7 CDB: Read(10) 28 00 00 40 10 c8 00 00 08 00
[ 4482.120828] blk_update_request: I/O error, dev sdc, sector 4198600 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
[ 4482.120853] md/raid:md0: read error not correctable (sector 4196552 on sdc1).
[ 4494.781246] sd 2:0:0:0: [sdc] tag#20 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4494.781254] sd 2:0:0:0: [sdc] tag#20 Sense Key : Medium Error [current]
[ 4494.781260] sd 2:0:0:0: [sdc] tag#20 Add. Sense: Unrecovered read error - auto reallocate failed
[ 4494.781267] sd 2:0:0:0: [sdc] tag#20 CDB: Read(10) 28 00 00 40 17 78 00 00 08 00
[ 4494.781274] blk_update_request: I/O error, dev sdc, sector 4200312 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
[ 4494.781301] md/raid:md0: read error not correctable (sector 4198264 on sdc1).
[ 4497.440733] sd 2:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 4497.440740] sd 2:0:0:0: [sdc] tag#0 Sense Key : Medium Error [current]
[ 4497.440747] sd 2:0:0:0: [sdc] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[ 4497.440754] sd 2:0:0:0: [sdc] tag#0 CDB: Read(10) 28 00 00 40 17 80 00 00 08 00
[ 4497.440761] blk_update_request: I/O error, dev sdc, sector 4200320 op 0x0:(READ) flags 0x4000 phys_seg 1 prio class 0
[ 4497.440787] md/raid:md0: read error not correctable (sector 4198272 on sdc1).


Is this drive faulty in your opinion? I formatted all the disks before starting the RAID setup.
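One quick way to check whether the errors are confined to a single drive is to count the "I/O error" lines per device. The following is only a sketch: `count_io_errors` is a hypothetical helper name, and it assumes the kernel messages have the same `I/O error, dev <name>` form as the log above.

```shell
# Count "I/O error" lines per block device in kernel log text read from stdin.
# Prints one "<device> <count>" line per device, sorted by device name.
count_io_errors() {
    awk '
        match($0, /I\/O error, dev [a-z0-9]+/) {
            # The matched text is "I/O error, dev sdc"; the 4th word is the device.
            split(substr($0, RSTART, RLENGTH), parts, " ")
            count[parts[4]]++
        }
        END { for (dev in count) print dev, count[dev] }
    ' | sort
}

# Usage (reading the kernel log may need root):
#   dmesg | count_io_errors
```

If only one device shows up in the result, the problem is more likely that drive (or its cable) than the controller.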

How can I identify the failed physical disk? There are four connected disks for data.
Thanks so much! :)
Last edited by coppolino97 on 2020-02-02 11:05, edited 3 times in total.
HP Elitebook 840 G3 | 8Gbyte of RAM | Intel core i5 | SSD 250GB | Debian 10
coppolino97
 
Posts: 61
Joined: 2018-06-05 15:23

Re: Software raid5 and possible failed disk

Postby Head_on_a_Stick » 2020-01-25 18:38

Have you tried smartmontools? Those I/O errors don't look good but I have zero experience with RAID.
Head_on_a_Stick
 
Posts: 11204
Joined: 2014-06-01 17:46
Location: /dev/chair

Re: Software raid5 and possible failed disk

Postby Bloom » 2020-01-25 20:16

sdc does indeed look faulty. Be safe and replace it.
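For reference, swapping a member out of an md array is usually done with mdadm's `--fail`, `--remove` and `--add` operations. The sketch below only prints the commands instead of running them; `plan_disk_replacement` is a hypothetical helper, and `/dev/sde1` is an assumed name for the new partition, not something from this thread.

```shell
# Print (do not run) the mdadm commands to replace one array member.
# Arguments: <array> <failed member> <new member>
plan_disk_replacement() {
    array="$1"
    failed="$2"
    new="$3"
    # Mark the old member as failed, then detach it from the array.
    printf 'mdadm %s --fail %s\n' "$array" "$failed"
    printf 'mdadm %s --remove %s\n' "$array" "$failed"
    # Add the new partition; md then rebuilds onto it.
    printf 'mdadm %s --add %s\n' "$array" "$new"
}

# Example with an assumed new partition name:
plan_disk_replacement /dev/md0 /dev/sdc1 /dev/sde1
```

Review the printed commands and check the device names before running any of them as root.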
Bloom
 
Posts: 203
Joined: 2017-11-11 12:23

Re: Software raid5 and possible failed disk

Postby coppolino97 » 2020-01-26 09:27

After one night the RAID 5 seems to be ready:

Code: Select all
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd1[4](S) sdc1[2] sdb1[1] sda1[0]
      2929886208 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      bitmap: 0/8 pages [0KB], 65536KB chunk
unused devices: <none>

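A note on reading this output: `[4/3] [UUU_]` means the array expects four members but only three are active, and `sdd1` is marked `(S)`, i.e. spare. A small sketch for spotting this automatically follows; `md_degraded` is a hypothetical helper and it assumes the `/proc/mdstat` layout shown above.

```shell
# Read /proc/mdstat-style text from stdin and report arrays whose
# active member count is lower than the expected member count.
md_degraded() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        match($0, /\[[0-9]+\/[0-9]+\]/) {
            # The matched text looks like "[4/3]": expected/active members.
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[1] != n[2]) print name, "degraded", n[2] "/" n[1]
        }
    '
}

# Usage:
#   md_degraded < /proc/mdstat
```

No output means every array listed has all of its members active.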

I tried with smartmontools. This is the output:

Code: Select all
 sudo smartctl -s on -a /dev/sdc
[sudo] password for federico:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.3.0-26-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital AV-GP
Device Model:     WDC WD10EVVS-63M5B0
Serial Number:    WD-WCAV5C471935
LU WWN Device Id: 5 0014ee 2af10c229
Firmware Version: 01.00A01
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Sun Jan 26 10:32:41 2020 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)   Offline data collection activity
               was suspended by an interrupting command from host.
               Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)   The previous self-test routine completed
               without error or no self-test has ever
               been run.
Total time to complete Offline
data collection:       (19980) seconds.
Offline data collection
capabilities:           (0x7b) SMART execute Offline immediate.
               Auto Offline data collection on/off support.
               Suspend Offline collection upon new
               command.
               Offline surface scan supported.
               Self-test supported.
               Conveyance Self-test supported.
               Selective Self-test supported.
SMART capabilities:            (0x0003)   Saves SMART data before entering
               power-saving mode.
               Supports SMART auto save timer.
Error logging capability:        (0x01)   Error logging supported.
               General Purpose Logging supported.
Short self-test routine
recommended polling time:     (   2) minutes.
Extended self-test routine
recommended polling time:     ( 230) minutes.
Conveyance self-test routine
recommended polling time:     (   5) minutes.
SCT capabilities:           (0x303f)   SCT Status supported.
               SCT Error Recovery Control supported.
               SCT Feature Control supported.
               SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   136   110   021    Pre-fail  Always       -       6183
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       132
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3250
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       131
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       112
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3978
194 Temperature_Celsius     0x0022   118   090   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       41
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       3
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      3232         4183466

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


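In the attribute table above, the overall result is PASSED, but Current_Pending_Sector has a raw value of 41 and the short self-test ended with a read failure. To pull out just the sector-related attributes, something like the following could be used. It is a sketch: `smart_key_attrs` is a hypothetical helper, and it assumes the ten-column attribute table printed by smartctl above.

```shell
# Read smartctl attribute output from stdin and print the raw values
# of the attributes that track bad or unreadable sectors.
smart_key_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            # Column 2 is the attribute name, column 10 the raw value.
            print $2 "=" $10
        }
    '
}

# Usage (needs root and smartmontools):
#   smartctl -A /dev/sdc | smart_key_attrs
```

A non-zero Current_Pending_Sector value means the drive has sectors it could not read and has not yet reallocated.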

I will try to identify the right physical disk to replace (there are four disks in the HP N36L and there are no LEDs to identify them).
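Without drive LEDs, one common approach is to match the serial number reported by the system against the label printed on each drive. The sketch below lists whole-disk entries from `/dev/disk/by-id`, whose names normally contain the model and serial; `list_disk_ids` is a hypothetical helper, and the `ata-` prefix is an assumption that fits SATA drives like these.

```shell
# Print "<kernel name> <by-id name>" for each whole disk.
# The optional argument overrides the directory to scan.
list_disk_ids() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/ata-*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        # Skip partition entries such as ...-part1.
        case "$link" in *-part*) continue ;; esac
        printf '%s %s\n' "$(basename "$(readlink "$link")")" "$(basename "$link")"
    done
}

# Usage:
#   list_disk_ids | grep '^sdc '
```

For the faulty drive here, the by-id name should end in the serial shown by smartctl (WD-WCAV5C471935), which can then be compared with the drive labels.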

I will keep you updated!
Thanks
coppolino97
 
Posts: 61
Joined: 2018-06-05 15:23

Re: Software raid5 and possible failed disk

Postby coppolino97 » 2020-01-26 13:03

Hi all,
I identified the faulty hard disk and removed it.
I am now working with three disks instead of the original four.
The RAID 5 setup is now running much faster (it will take about three hours) and there are no errors on the remaining disks.

At the moment these are the only log entries for the connected drives:
Code: Select all
root@server:~# dmesg | grep sda
[   13.915415] sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
[   14.115087] sd 0:0:0:0: [sda] Write Protect is off
[   14.201325] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[   14.553413] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
[   14.861605]  sda: sda1
[   14.862342] sd 0:0:0:0: [sda] Attached SCSI disk
[ 8564.535313] md/raid:md0: device sda1 operational as raid disk 0


I have the same logs for /dev/sdb and /dev/sdc.

I think everything is working fine now! :D
coppolino97
 
Posts: 61
Joined: 2018-06-05 15:23

