HDD errors

Ltlbkofjim
Posts: 8
Joined: 2018-11-19 19:25

HDD errors

#1 Post by Ltlbkofjim »

Hi

Not sure if this is the right place to request help with this, if not please direct me to a more appropriate group.

Had a small incident with my Raspberry Pi running Raspbian the other day: a power interruption led to the external HDD (ext4) not mounting on boot, which put the Pi into single-user emergency mode.
After repairing the superblock by replacing it with one of the backups, the drive now mounts and the Pi boots correctly.
However, it's displaying some funny behaviour. The files are still on the drive and they are all still accessible, but du reports 186GB under the mount point while df reports only 77MB used. I've seen df report higher usage than du before, due to deleted files still being held open by running processes, but I've never seen it report a practically empty drive that has plenty of data on it.
Running fsck on the drive initially reports a file system with errors, then gets to “Pass 5: Checking group summary information” and reports that fsck exited with signal 9, or just "Killed", but I am not killing the process and neither is any other user.
I'm currently running badblocks on the drive, but it looks like this could take a few days given the size of the drive; no errors found so far at 54%.
Like I said, I still have access to all the files and the data looks intact, but I worry about just carrying on and using the drive if there's something wrong underneath that I'm missing.

Any ideas welcome

Thanks

milomak
Posts: 2168
Joined: 2009-06-09 22:20
Been thanked: 2 times

Re: HDD errors

#2 Post by milomak »

it is often easier for us reading your posts if you actually post the output of the commands you have run. so we can see the actual results rather than trying to build them in our minds.

you may also find that someone may spot something else you have missed.

also post the dmesg output for when that drive is being mounted
Desktop: A320M-A PRO MAX, AMD Ryzen 5 3600, GALAX GeForce RTX™ 2060 Super EX (1-Click OC) - Sid, Win10, Arch Linux, Gentoo, Solus
Laptop: hp 250 G8 i3 11th Gen - Sid
Kodi: AMD Athlon 5150 APU w/Radeon HD 8400 - Sid

Segfault
Posts: 993
Joined: 2005-09-24 12:24
Has thanked: 5 times
Been thanked: 17 times

Re: HDD errors

#3 Post by Segfault »

First, you should check the SMART data. However, SMART may not report errors it is not aware of. To make it aware you need to force write. This is where badblocks -n comes in. After running it run smartctl -t long /dev/<device>. After it finishes run smartctl -a, see if there are any errors. OTOH, if the test errors out and does not finish then it is time to replace the drive.
Next layer is filesystem. There is not much point repairing it if there are bad blocks and I/O errors indeed.
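
A rough sketch of the sequence described in this post, written as a shell function. It assumes the drive is passed as a whole-disk device node (`/dev/sdX` below is a placeholder), that the filesystem on it is unmounted, and that `badblocks` and `smartctl` are installed. The function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run the checks from this post in order.
# $1 = whole-disk device (placeholder, e.g. /dev/sdX). Must be unmounted.
smart_workflow() {
    dev="${1:-}"
    [ -n "$dev" ] || { echo "usage: smart_workflow /dev/<device>" >&2; return 2; }

    # 1. Non-destructive read-write pass, so every sector gets rewritten
    #    and the drive has a chance to notice weak ones.
    sudo badblocks -svn "$dev" || return 1

    # 2. Start the extended (long) self-test; it runs inside the drive.
    sudo smartctl -t long "$dev" || return 1

    # 3. The results only show up once the test has finished.
    echo "When the test is done, run: sudo smartctl -a $dev"
}
```

Calling `smart_workflow /dev/sdX` runs the write pass first and then starts the self-test; the final `smartctl -a` is left as a manual step because the long test can take hours.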

debiman
Posts: 3063
Joined: 2013-03-12 07:18

Re: HDD errors

#4 Post by debiman »

i take it the drive does not contain the operating system?
but couldn't that also have been damaged?
maybe you should check the root partition as well.

apart from that the answers to your problem (data loss through power outage) are just a few web searches away.
but we are here to help, regardless, so please do provide what was requested & answer our questions.

milomak
Posts: 2168
Joined: 2009-06-09 22:20
Been thanked: 2 times

Re: HDD errors

#5 Post by milomak »

i used to run raspbian (on the original pi) some years ago and this was very common when a power failure happened

my external was an lvm setup (ext4) which required a bit more work to fix. but usually once i had recovered the lvm, an e2fsck would work.

Ltlbkofjim
Posts: 8
Joined: 2018-11-19 19:25

Re: HDD errors

#6 Post by Ltlbkofjim »

Thanks for the responses, all - hopefully I'll answer all your questions.
milomak wrote:it is often easier for us reading your posts if you actually post the output of the commands you have run. so we can see the actual results rather than trying to build them in our minds.

you may also find that someone may spot something else you have missed.

also post the dmesg output for when that drive is being mounted
Here are the original df and du commands that led me to see something was wrong

Code: Select all

pi@raspberrypi:~ $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root        29G  1.3G   26G   5% /
devtmpfs        460M     0  460M   0% /dev
tmpfs           464M     0  464M   0% /dev/shm
tmpfs           464M   12M  452M   3% /run
tmpfs           5.0M  4.0K  5.0M   1% /run/lock
tmpfs           464M     0  464M   0% /sys/fs/cgroup
/dev/sda1       916G   77M  916G   1% /mnt/ext1
/dev/mmcblk0p1   43M   22M   21M  52% /boot
tmpfs            93M     0   93M   0% /run/user/1000
pi@raspberrypi:~ $ du -sh /mnt/ext1
186G	/mnt/ext1
pi@raspberrypi:~ $ 
Output from fsck

Code: Select all

pi@raspberrypi:~ $ sudo fsck -Vt ext4 /dev/sda1
fsck from util-linux 2.29.2
[/sbin/fsck.ext4 (1) -- /mnt/ext1] fsck.ext4 /dev/sda1 
e2fsck 1.43.4 (31-Jan-2017)
/dev/sda1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
fsck: Warning... fsck.ext4 for device /dev/sda1 exited with signal 9.
Output from dmesg

Code: Select all

pi@raspberrypi:~ $ sudo dmesg | grep sda1
[    4.741198]  sda: sda1
[    5.378211] EXT4-fs (sda1): warning: mounting unchecked fs, running e2fsck is recommended
[    5.439727] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
[  174.839239] EXT4-fs error (device sda1): ext4_validate_block_bitmap:387: comm kworker/u8:2: bg 1: bad block bitmap checksum
[  175.588450] EXT4-fs error (device sda1): ext4_validate_block_bitmap:387: comm kworker/u8:2: bg 2: bad block bitmap checksum
[  175.604480] EXT4-fs error (device sda1): ext4_validate_block_bitmap:387: comm kworker/u8:2: bg 3: bad block bitmap checksum
[  175.638518] EXT4-fs error (device sda1): ext4_validate_block_bitmap:387: comm kworker/u8:2: bg 4: bad block bitmap checksum
[  175.654950] EXT4-fs error (device sda1): ext4_validate_block_bitmap:387: comm kworker/u8:2: bg 5: bad block bitmap checksum
[  175.672891] EXT4-fs error (device sda1): ext4_validate_block_bitmap:387: comm kworker/u8:2: bg 6: bad block bitmap checksum
[  175.691135] EXT4-fs error (device sda1): ext4_validate_block_bitmap:387: comm kworker/u8:2: bg 7: bad block bitmap checksum
[  175.708230] EXT4-fs error (device sda1): ext4_validate_block_bitmap:387: comm kworker/u8:2: bg 8: bad block bitmap checksum
[  175.731107] EXT4-fs error (device sda1): ext4_validate_block_bitmap:387: comm kworker/u8:2: bg 9: bad block bitmap checksum
[  175.748550] EXT4-fs error (device sda1): ext4_validate_block_bitmap:387: comm kworker/u8:2: bg 10: bad block bitmap checksum
[  310.234981] EXT4-fs (sda1): error count since last fsck: 14137
[  310.235027] EXT4-fs (sda1): initial error at time 1542582174: ext4_validate_inode_bitmap:101
[  310.235053] EXT4-fs (sda1): last error at time 1542747678: ext4_validate_block_bitmap:387
debiman wrote:i take it the drive does not contain the operating system?
but couldn't that also have been damaged?
maybe you should check the root partition as well.

apart from that the answers to your problem (data loss through power outage) are just a few web searches away.
but we are here to help, regardless, so please do provide what was requested & answer our questions.
The OS is on a separate flash drive, and yes, it is possible that the OS on the flash drive is also damaged. There doesn't "seem" to be any issue with the OS at the moment, but I will check soon regardless.
I have done quite a number of web searches on this issue but did not find anything that helped me, hence the post to the forum. The confusing part is that there doesn't seem to be any data loss at all that I can see. I've just pulled a random 5GB from the drive and there weren't any problems with that sample.
Segfault wrote:First, you should check the SMART data. However, SMART may not report errors it is not aware of. To make it aware you need to force write. This is where badblocks -n comes in. After running it run smartctl -t long /dev/<device>. After it finishes run smartctl -a, see if there are any errors. OTOH, if the test errors out and does not finish then it is time to replace the drive.
Next layer is filesystem. There is not much point repairing it if there are bad blocks and I/O errors indeed.
I have just run the SMART tools and the results weren't exactly what I was expecting; it doesn't look like it can even run the tests.

Code: Select all

pi@raspberrypi:~ $ sudo smartctl -t long /dev/sda1
smartctl 6.6 2016-05-31 r4324 [armv7l-linux-4.14.70-v7+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Extended Background Self Test has begun
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
Use smartctl -X to abort test
pi@raspberrypi:~ $ sudo smartctl -a /dev/sda1
smartctl 6.6 2016-05-31 r4324 [armv7l-linux-4.14.70-v7+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               Seagate
Product:              Desktop
Revision:             0130
User Capacity:        1,000,204,886,016 bytes [1.00 TB]
Logical block size:   512 bytes
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
scsiModePageOffset: response length too short, resp_len=4 offset=4 bd_len=0
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
The results of badblocks are also in

Code: Select all

pi@raspberrypi:~ $ sudo badblocks -svn /dev/sda1
Checking for bad blocks in non-destructive read-write mode
From block 0 to 976761559
Checking for bad blocks (non-destructive read-write test)
Pass completed, 0 bad blocks found. (0/0/0 errors)
milomak wrote:i used to run raspbian (on the original pi) some years ago and this was very common when a power failure happened

my external was an lvm setup (ext4) which required a bit more work to fix. but usually once i had recovered the lvm, an e2fsck would work.
Yes, I used to run the previous iteration of this Pi as an LVM setup and IMO it was more hassle than it was worth, but after this I'm thinking it may just be the Pi itself.

If I've missed anything helpful, please let me know. I think I've covered all the questions.
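
The "initial error at time" and "last error at time" values in the dmesg output above are Unix epoch seconds. A small sketch for turning them into readable dates, assuming GNU `date` (which Raspbian ships); the function name is only for illustration.

```shell
#!/bin/sh
# Convert epoch seconds (as printed by the EXT4-fs summary lines) to UTC.
# Relies on the GNU date "-d @SECONDS" syntax.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S UTC'
}

epoch_to_utc 1542582174   # initial error
epoch_to_utc 1542747678   # last error
```

This prints 2018-11-18 23:02:54 UTC and 2018-11-20 21:01:18 UTC, so the first recorded error dates from a couple of days before the thread was started, which fits the power interruption described in the first post.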

Segfault
Posts: 993
Joined: 2005-09-24 12:24
Has thanked: 5 times
Been thanked: 17 times

Re: HDD errors

#7 Post by Segfault »

RE: smartctl failure. It probably has to do with your USB-SATA adapter, it may not support smartctl commands. I'd say hook it up to a real SATA port for diagnostics.

Edit: Just noticed you are trying to run smartctl on partition sda1, it must be run on raw device sda.
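
To avoid the partition/whole-disk mix-up, the parent device can be derived from the partition name. The helper below is a hypothetical sketch that works on the name alone and assumes the usual Linux naming schemes (`sdXN`, `mmcblkNpM`, `nvmeXnYpZ`); on a live system `lsblk -no PKNAME /dev/sda1` asks the kernel instead and is likely the more robust choice.

```shell
#!/bin/sh
# Hypothetical helper: map a partition node to its parent disk by name only.
parent_disk() {
    case "$1" in
        /dev/mmcblk*|/dev/nvme*)
            # These use a "p<N>" suffix for partitions.
            printf '%s\n' "$1" | sed 's/p[0-9][0-9]*$//' ;;
        *)
            # sdXN style: the partition number is just trailing digits.
            printf '%s\n' "$1" | sed 's/[0-9][0-9]*$//' ;;
    esac
}

parent_disk /dev/sda1        # prints /dev/sda
parent_disk /dev/mmcblk0p1   # prints /dev/mmcblk0
```

With that, something like `sudo smartctl -a "$(parent_disk /dev/sda1)"` always targets the raw device.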

Ltlbkofjim
Posts: 8
Joined: 2018-11-19 19:25

Re: HDD errors

#8 Post by Ltlbkofjim »

Segfault wrote:RE: smartctl failure. It probably has to do with your USB-SATA adapter, it may not support smartctl commands. I'd say hook it up to a real SATA port for diagnostics.

Edit: Just noticed you are trying to run smartctl on partition sda1, it must be run on raw device sda.
Thanks, spot on! Sometimes you stare at something for so long you just miss the obvious. It's now running!

EDIT: And the results of the smart test

Code: Select all

pi@raspberrypi:~ $ sudo smartctl -a /dev/sda
smartctl 6.6 2016-05-31 r4324 [armv7l-linux-4.14.70-v7+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.12
Device Model:     ST31000528AS
Serial Number:    6VP23DPQ
LU WWN Device Id: 5 000c50 01b7f7a77
Firmware Version: CC38
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Wed Nov 21 07:35:58 2018 UTC

==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/213891en

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (  25)	The self-test routine was aborted by
					the host.
Total time to complete Offline 
data collection: 		(  600) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 175) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   114   099   006    Pre-fail  Always       -       81162197
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   095   095   020    Old_age   Always       -       5978
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       4310433297
  9 Power_On_Hours          0x0032   073   071   000    Old_age   Always       -       24166
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       122
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       210456608817
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   078   037   045    Old_age   Always   In_the_past 22 (Min/Max 22/44 #3586)
194 Temperature_Celsius     0x0022   022   063   000    Old_age   Always       -       22 (0 9 0 0 0)
195 Hardware_ECC_Recovered  0x001a   041   023   000    Old_age   Always       -       81162197
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       9732 (113 126 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       617330298
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1866303093

SMART Error Log Version: 1
ATA Error Count: 1
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 22222 hours (925 days + 22 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 00 00 00 00 00 ff  10d+03:35:16.783  NOP [Abort queued commands]
  b0 d4 00 82 4f c2 00 00  10d+03:32:41.652  SMART EXECUTE OFF-LINE IMMEDIATE
  b0 d0 01 00 4f c2 00 00  10d+03:32:41.619  SMART READ DATA
  ec 00 01 00 00 00 00 00  10d+03:32:41.611  IDENTIFY DEVICE
  b0 d5 01 09 4f c2 00 00  10d+03:26:12.455  SMART READ LOG

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host               90%     24159         -
# 2  Extended captive    Interrupted (host reset)      90%     22222         -
# 3  Short captive       Completed without error       00%     22222         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

vbrummond
Posts: 4432
Joined: 2010-03-02 01:42

Re: HDD errors

#9 Post by vbrummond »

I thought this line was disconcerting...

Code: Select all

Error 1 occurred at disk power-on lifetime: 22222 hours (925 days + 22 hours)
  When the command that caused the error occurred, the device was in an unknown state.
But that actually seems to be just a failure to read smart data. It might be because of running the command you did above.

Looks like the drive is fine, might be an error reporting space by df. Can you access all of the data just fine?

If you are worried about this I would copy all of the data off, reformat it fresh, and copy it back. Also always have backups.
Always on Debian Testing

Segfault
Posts: 993
Joined: 2005-09-24 12:24
Has thanked: 5 times
Been thanked: 17 times

Re: HDD errors

#10 Post by Segfault »

Code: Select all

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Aborted by host               90%     24159         -
The test failed. You can't say the drive is fine until the test passes and you can see the actual results.

llivv
Posts: 5340
Joined: 2007-02-14 18:10
Location: cold storage

Re: HDD errors

#11 Post by llivv »

Ltlbkofjim wrote:Hi

After repairing the superblock by replacing it with one of the backups, the drive now mounts and the Pi boots correctly.

Running fsck on the drive initially reports a file system with errors but gets to “pass 5: checking group summary information” and just reports fsck exited with signal 9, or just killed, but I am not killing the process nor is any other user.

Any ideas welcome
brummond - long time - good to see

HDD errors
If I had this issue I'd look at ext4 tools since fsck borked.
ext4
e2fsck - all the current options for ext2 3 and 4 - note read the - see also: section at the bottom of the man page.
e2image - for filesystem, superblock and inode backup
debugfs - might be able to show you more of what's going on and/or why fsck borked.

if all the disk needs is to clean the superblocks and inodes from the restore?

Edit: I see milomak suggested e2fsck back in this post. :wink:
http://forums.debian.net/viewtopic.php? ... 06#p685285
In memory of Ian Ashley Murdock (1973 - 2015) founder of the Debian project.

debiman
Posts: 3063
Joined: 2013-03-12 07:18

Re: HDD errors

#12 Post by debiman »

i'd say that the filesystem on /dev/sda1 is partially corrupt.
this just seems the most plausible explanation at this moment.
smartmontools look for other, more physical problems, they will not see a broken filesystem.

you need to fix the filesystem:
https://www.startpage.com/do/dsearch?qu ... filesystem
please have a good look at these results.
this problem is not distro-specific, most solutions should apply.

another option is to pull off the data (1% of 1TB - should be possible) - since you say it is accessible, then just reformat the whole thing.

Segfault
Posts: 993
Joined: 2005-09-24 12:24
Has thanked: 5 times
Been thanked: 17 times

Re: HDD errors

#13 Post by Segfault »

There is a reason why filesystem gets corrupted. Often this reason is failing hard drive. As we see the drive is not healthy, it is not passing the test. IMHO repairing the filesystem or reformatting is fixing consequences, not reasons.

Ltlbkofjim
Posts: 8
Joined: 2018-11-19 19:25

Re: HDD errors

#14 Post by Ltlbkofjim »

Thanks all for getting back to me
vbrummond wrote:I thought this line was disconcerting...

Code: Select all

Error 1 occurred at disk power-on lifetime: 22222 hours (925 days + 22 hours)
  When the command that caused the error occurred, the device was in an unknown state.
But that actually seems to be just a failure to read smart data. It might be because of running the command you did above.

Looks like the drive is fine, might be an error reporting space by df. Can you access all of the data just fine?

If you are worried about this I would copy all of the data off, reformat it fresh, and copy it back. Also always have backups.
Yes, from the few GB of data I have pulled off and tested, it seems like it's fine (obviously I can't say 100% for the rest of the data, but it seems that way).
I don't know how df works behind the scenes. Is it possible that the problem is just in what it's reporting?
I may end up copying all the data and reformatting anyway, especially if it's looking like it's going to be a difficult fix. But it's always useful to know how to fix these things rather than just restore from a backup. I do actually have a backup of most of the important data; it's just a bit of a pain to restore it from where it's all stored.
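
On how df works: it does not look at files at all. It asks the filesystem for its own totals (the statfs(2) call), and ext4 answers from the free-block counters it keeps in its superblock and group descriptors. du instead walks the directory tree and adds up every file it finds. Those counters belong to the same metadata that was restored from a backup superblock here and that e2fsck never finished checking in Pass 5, so it seems plausible (not confirmed) that they are simply stale while the files themselves are intact. A minimal sketch for putting the two views side by side; the function name is invented for illustration.

```shell
#!/bin/sh
# Print used space for a directory's filesystem as seen by df (filesystem
# counters) and for the directory tree as seen by du (sum over files), in KiB.
compare_usage() {
    mnt="$1"
    # POSIX df output: Filesystem, 1024-blocks, Used, Available, ...
    df_used=$(df -kP "$mnt" | awk 'NR == 2 { print $3 }')
    # -s: one total, -k: KiB, -x: do not cross into other filesystems.
    du_used=$(du -skx "$mnt" 2>/dev/null | cut -f1)
    echo "df used: ${df_used} KiB"
    echo "du used: ${du_used} KiB"
}
```

Run as `compare_usage /mnt/ext1` (with sudo if some directories are not readable). On a healthy filesystem the two figures are normally of the same order; a difference like 77M versus 186G points at the filesystem's bookkeeping rather than at the files.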
llivv wrote:
Ltlbkofjim wrote:Hi

After repairing the superblock by replacing it with one of the backups, the drive now mounts and the Pi boots correctly.

Running fsck on the drive initially reports a file system with errors but gets to “pass 5: checking group summary information” and just reports fsck exited with signal 9, or just killed, but I am not killing the process nor is any other user.

Any ideas welcome
brummond - long time - good to see

HDD errors
If I had this issue I'd look at ext4 tools since fsck borked.
ext4
e2fsck - all the current options for ext2 3 and 4 - note read the - see also: section at the bottom of the man page.
e2image - for filesystem, superblock and inode backup
debugfs - might be able to show you more of what's going on and/or why fsck borked.

if all the disk needs is to clean the superblocks and inodes from the restore?

Edit: I see milomak suggested e2fsck back in this post. :wink:
http://forums.debian.net/viewtopic.php? ... 06#p685285
Sorry, I'm not really following what you're suggesting I should do. I have already run e2fsck above, but the process gets killed at some point for some reason. And after reading the man pages of the other two I am clueless where to even start, to be honest, although I'm more than willing to give it a go if someone can point me in the right direction.
debiman wrote:i'd say that the filesystem on /dev/sda1 is partially corrupt.
this just seems the most plausible explanation at this moment.
smartmontools look for other, more physical problems, they will not see a broken filesystem.

you need to fix the filesystem:
https://www.startpage.com/do/dsearch?qu ... filesystem
please have a good look at these results.
this problem is not distro-specific, most solutions should apply.

another option is to pull off the data (1% of 1TB - should be possible) - since you say it is accessible, then just reformat the whole thing.
Thanks I will have a look through these later on tonight when I get a chance to have a thorough look, but as you say I may just end up pulling the data and reformatting if I don't get much success.
Segfault wrote:There is a reason why filesystem gets corrupted. Often this reason is failing hard drive. As we see the drive is not healthy, it is not passing the test. IMHO repairing the filesystem or reformatting is fixing consequences, not reasons.
I appreciate what you're saying, but as far as I can tell you only think it's failing because of the previous SMART test result, which didn't in fact say the test failed; it just said it was aborted. I did a little digging, and it appears the test was aborted when the drive went to sleep after a few minutes. With a small script keeping the drive awake every 60 seconds for the duration of the test, I got the following result.

Code: Select all

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     24192         -
If I've missed something and you think the drive is failing for another reason, then please let me know, because if it's necessary I would rather replace it than have it completely fail.
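
The keep-alive script itself was not posted. The following is only a guess at what such a script could look like: it assumes that a small uncached read every 60 seconds is enough to stop the drive (or its USB enclosure) from spinning down, and that `smartctl -a` reports "Self-test routine in progress" while the test is running. Function names and the device argument are placeholders.

```shell
#!/bin/sh
# Succeeds while the text on stdin (smartctl -a output) shows a running test.
selftest_running() {
    grep -q 'Self-test routine in progress'
}

# Hypothetical keep-alive: touch the disk every 60 s until the test ends.
# $1 = whole-disk device (placeholder, e.g. /dev/sdX).
keep_awake_during_test() {
    dev="$1"
    while sudo smartctl -a "$dev" | selftest_running; do
        # Read one 512-byte sector with O_DIRECT so the request
        # is not answered from the page cache.
        sudo dd if="$dev" of=/dev/null bs=512 count=1 iflag=direct 2>/dev/null
        sleep 60
    done
}
```

Start the test with `sudo smartctl -t long /dev/sdX`, then run `keep_awake_during_test /dev/sdX`; the loop exits by itself once the self-test is no longer reported as in progress.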

Segfault
Posts: 993
Joined: 2005-09-24 12:24
Has thanked: 5 times
Been thanked: 17 times

Re: HDD errors

#15 Post by Segfault »

Test passed is good, you may want to look at the results, too. Below are the ones to look at.

Code: Select all

  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       16
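
A small filter for pulling just these four attributes out of the attribute table, as a sketch. It assumes the standard ten-column layout that `smartctl -A` prints for ATA drives (raw value in the last column); the function name is made up.

```shell
#!/bin/sh
# Print "ID NAME RAW_VALUE" for attributes 5, 197, 198 and 199.
# Reads smartctl attribute-table text on stdin; the NF check skips
# unrelated lines that merely start with one of these numbers.
key_smart_attrs() {
    awk 'NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198 || $1 == 199) {
        print $1, $2, $10
    }'
}
```

Usage would be `sudo smartctl -A /dev/sda | key_smart_attrs`. As a rule of thumb, non-zero raw values for 5, 197 or 198 suggest the drive is remapping or struggling with sectors, while 199 tends to point at the cable or the USB-SATA link rather than the platters.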

milomak
Posts: 2168
Joined: 2009-06-09 22:20
Been thanked: 2 times

Re: HDD errors

#16 Post by milomak »

what is the fsck or e2fsck command you run?

llivv
Posts: 5340
Joined: 2007-02-14 18:10
Location: cold storage

Re: HDD errors

#17 Post by llivv »

besides milomak inquiry above regarding the full commands used for fsck and e2fsck
(if they were just plain ole # fsck and # e2fsck that's ok because that is usually all a user needs most of the time )

my questions are:
is the 186GB disk one ext4 partition?

what was the badblocks command you used?
example #badblocks -svn /dev/sdb

and did it report any badblocks ?

finally
is the disk mounted when running the commands?

Ltlbkofjim
Posts: 8
Joined: 2018-11-19 19:25

Re: HDD errors

#18 Post by Ltlbkofjim »

When I first ran fsck on /dev/sda1 it gave errors about the superblock and would not run (I haven't got the exact wording, as I wasn't recording the output at the time).

So I then ran
either e2fsck -b 32768 /dev/sda or fsck -b 32768 /dev/sda, I can't remember which, but it ended with

Code: Select all

Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Killed
But after this it then allowed me to run
fsck -Vt ext4 /dev/sda1

Which went a lot further than it did before but still ended with

Code: Select all

Pass 5: Checking group summary information
fsck: Warning... fsck.ext4 for device /dev/sda1 exited with signal 9.
e2fsck /dev/sda1 also ended the same way as fsck

In answer to your other questions
the 186GB is on a single disk which has just one partition
I used the command badblocks -svn /dev/sda1 which resulted in 0 bad blocks found (0/0/0 errors)
The drive was unmounted for all the above commands

I hope I haven't missed anything

llivv
Posts: 5340
Joined: 2007-02-14 18:10
Location: cold storage

Re: HDD errors

#19 Post by llivv »

I'd read the man page again for both fsck and e2fsck looking specifically for exit code 9
I don't find it in the debian man pages for those commands.
I also don't see an option -b for fsck
there is an option -b for e2fsck
but there is also a warning in the fsck man about issuing options from specific filesystem checkers to generic fsck
saying options from specific filesystem checkers don't take arguments when running fsck because fsck has no way to guess what the arguments are and results may not be what's expected.

after checking the raspbian man page to see if the options for e2fsck -pv are p=preen v=verbose
try

Code: Select all

e2fsck -pv /dev/sda1
and post the results

Ltlbkofjim
Posts: 8
Joined: 2018-11-19 19:25

Re: HDD errors

#20 Post by Ltlbkofjim »

Yeah, I've read the man pages for both and, like the Debian ones, the Raspbian man pages don't mention 9; they seem to list every other code but 9. I seem to remember someone on a different forum commenting that 9 supposedly means the process was killed externally, but I can't confirm this anywhere reputable.

I checked the man page for e2fsck and -pv is preen and verbose, so after running the command I was given this

Code: Select all

e2fsck -pv /dev/sda1
/dev/sda1 contains a file system with errors, check forced.
Killed
It did take quite a while to do this, say 5-10mins

I also ran -b with e2fsck rather than against generic fsck for completeness as I see what you mean about the arguments not being passed along correctly, and this is what I got

Code: Select all

e2fsck -b 32768 /dev/sda1
e2fsck 1.43.4 (31-Jan-2017)
/dev/sda1 was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Killed
I have managed to pull all the data off the drive so would be quite trivial to just start again from fresh, but at this stage I think I'm just quite interested to find out what happened and how to fix it in case I have issues in the future where I can't simply just pull the data
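
On the "signal 9" question: it is not an e2fsck exit code at all (those are a bitmask of 0, 1, 2, 4, 8, 16, 32 and 128 according to e2fsck(8)). It is signal number 9, SIGKILL, which is why the man pages don't list it. When nobody sent that signal by hand, one common sender on a small-memory machine is the kernel's out-of-memory killer. Whether that is what happened here is an assumption, but it would fit e2fsck dying late in a check of a 1 TB filesystem on a Pi, and the kernel log would show it. A minimal sketch for checking, with an invented function name:

```shell
#!/bin/sh
# Translate signal number 9 into its name.
kill -l 9    # prints KILL

# Succeeds if kernel-log text on stdin shows the OOM killer at work.
oom_killed() {
    grep -qiE 'out of memory|oom-killer|oom_reaper'
}

# On the affected machine, right after e2fsck has been killed:
#   sudo dmesg | oom_killed && echo "something was killed for lack of memory"
```

If the log does show e2fsck as the victim, the usual remedies are adding swap space for the duration of the check, or pointing e2fsck at an on-disk scratch area through the `[scratch_files]` section (`directory` setting) described in e2fsck.conf(5), which trades memory for speed.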
