?howto troubleshoot? systems dies

Postby te36 » 2020-01-04 21:55

I am out of ideas for how to troubleshoot a situation where, under a certain condition, the system just dies/hangs.
I see nothing on the console, nor in any log files after reboot.

Are there any options to enable more kernel diagnostics or the like?

The specific condition under which this happens is when hd-idle (SCSI stop unit ioctl()) is executed on one disk while other disks perform a lot of I/O (e.g.: copy from disk 1 to disk 2, switch off disk 3). If disks do not have a lot of I/O activity, it works.

I first had this on a Banana Pi with 5 SATA HDDs connected via a SATA port multiplier, but I attributed the problem to the old Bananian release or to port multiplier issues.

Now I have upgraded to a Rock Pi 4B, where the 5 SATA HDDs are connected via a 5x SATA miniPCIe card, with the system running the standard Debian stretch image (4.4 kernel) from Radxa's web page. Same effect.

Of course, I could blame the non-amd64 hardware and try putting the miniPCIe card into an amd64 system, or blame the outdated kernel and try updating it, but what would I do if I had the same problem on amd64 with the latest kernel? Nobody could figure out the problem with the complete lack of diagnostics that I have right now.
te36
 
Posts: 11
Joined: 2018-01-29 17:36

Re: ?howto troubleshoot? systems dies

Postby ComputerBob » 2020-01-05 00:50

Many years ago, I ran into a similar problem. Sudden, random freezes, with nothing in the log files to indicate what had gone wrong.

In my case, the problem ended up being just ONE of the drive connector "tubes" fitting onto one of the drive pins too loosely (apparently, microscopically), so that, when the drive had to do a lot of work, that one pin vibrated enough (again, apparently microscopically) to cause it to intermittently disconnect. Pinching it a tiny bit, to "tighten it", solved the problem to this day.

It sounds like that may NOT be your problem, but maybe it could give you some ideas of things to check on your system. Good luck -- I know it can be very frustrating.
ComputerBob - Making Geek-Speak Chic (TM)
ComputerBob.com - Nearly 6,000 Posts and 22 Million Views
My Ministry
My Massive Stroke
ComputerBob
 
Posts: 1193
Joined: 2007-11-30 04:49
Location: The Beautiful Sunshine State

Re: ?howto troubleshoot? systems dies

Postby Bulkley » 2020-01-05 01:08

Intermittent problems can drive one crazy. Keep a pencil and pad next to your machine and note every detail when it crashes.

Intermittent freezes are frequently hardware related, such as the bad connection ComputerBob found. A weak power supply can cause unpredictable failures. Resource-hungry software can stress components that normally cruise along.

Get the covers off and blow out the dust. Make sure all fans are loose and spin freely. Disconnect and reconnect every plug/jack, including auxiliary circuit boards and memory sticks. Do a memory stress test.

As to software, have you installed anything not from an official Debian repository appropriate for your version?
Bulkley
 
Posts: 5875
Joined: 2006-02-11 18:35

Re: ?howto troubleshoot? systems dies

Postby eor2004 » 2020-01-05 02:17

Hi, in my experience system hangs are related to either CPU damage, high CPU temperatures, or an HDD going bad. Check your CPU cooling system for collected dust and/or dirt or dried-out thermal grease. Also check your HDD for "SMART" warnings, and check whether the HDD activity LED stays steadily lit and doesn't blink at all while you're having the issue; this can indicate HDD problems. To test whether the power supply is delivering enough power, disconnect any device that is not necessary to boot the system and see if the issue comes back. Good luck!

P.S. I would also make a visual inspection of the motherboard for any damaged components like capacitors and connectors, etc.
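
For the SMART check mentioned above, a minimal sketch using smartctl from the smartmontools package (assumed to be installed, run as root). The function names and the /dev/sdX device names are only illustrative. Depending on the controller, smartctl may also need a device type option such as `-d sat`.

```shell
# Sketch only: assumes smartmontools is installed and the script runs as root.
# Function names are made up for this example.

# Print the overall SMART health verdict for every device given as an argument.
smart_health() {
    for dev in "$@"; do
        echo "== $dev =="
        smartctl -H "$dev"
    done
}

# Print the vendor attribute table (reallocated sectors, temperature, ...)
# for a single device.
smart_attributes() {
    smartctl -A "$1"
}

# Example (device names are placeholders):
#   smart_health /dev/sda /dev/sdb /dev/sdc
#   smart_attributes /dev/sda
```

Note that querying a drive this way can spin it up if it is in standby.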
OS: Debian 10 Buster 64-bit DE: MATE 1.20 CPU: AMD Phenom II X4 925 @ 2.8GHZ RAM: 8GB CORSAIR XMS2 PC2-6400U DDR2 (CM2X2048-6400C5C) GPU: ATI Radeon HD 3200 Mobo: Gigabyte GA-MA78GPM-DS2H HDD: Hitachi 2TB (HUA723020ALA641) 7200RPM
eor2004
 
Posts: 213
Joined: 2013-10-01 22:49
Location: Puerto Rico

Re: ?howto troubleshoot? systems dies

Postby te36 » 2020-01-05 07:10

ComputerBob wrote: In my case, the problem ended up being just ONE of the drive connector "tubes" fitting onto one of the drive pins too loosely (apparently, microscopically), so that, when the drive had to do a lot of work, that one pin vibrated enough (again, apparently microscopically) to cause it to intermittently disconnect. Pinching it a tiny bit, to "tighten it", solved the problem to this day.


Thanks. My issue only ever happens under reproducible circumstances: high load on two disks while triggering hd-idle on another.
Running hd-idle all day long when there is no load on the other disks works fine.

Of course it could still be an electrical issue, with some timing or signal on the PCIe x4 link to the host going awry; one never knows. But I was hoping that something like this should never cause the CPU to freeze up, especially given that I can perfectly well run the 5 disks in parallel at a sustained 430 MByte/sec, which is a lot more load than the test where the SCSI stop causes the issue.
te36
 
Posts: 11
Joined: 2018-01-29 17:36

Re: ?howto troubleshoot? systems dies

Postby Head_on_a_Stick » 2020-01-05 10:05

Enable persistent logging:
# mkdir -p /var/log/journal

Then use the systemd journal to investigate.
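
A minimal sketch of what investigating could look like after the next hang and reboot, assuming systemd-journald is in use and persistent storage is active (journald may need a restart or a reboot after the directory is created before it starts writing there). The function names are made up for this example; the journalctl options are standard ones.

```shell
# Sketch only: assumes systemd-journald with persistent storage in /var/log/journal.
# Function names are made up for this example.

# List the boots the journal knows about; offset -1 is the previous boot.
list_boots() {
    journalctl --list-boots --no-pager
}

# Show the last N kernel messages (default 200) from the previous boot,
# i.e. the boot that ended in the hang.
prev_boot_kernel_tail() {
    journalctl -k -b -1 --no-pager -n "${1:-200}"
}

# Show everything of priority "warning" or worse from the previous boot.
prev_boot_warnings() {
    journalctl -b -1 -p warning --no-pager
}

# Example:
#   list_boots
#   prev_boot_kernel_tail 100
#   prev_boot_warnings
```

With a hard freeze, the very last messages may never reach the disk, so an empty tail here does not by itself prove that the kernel logged nothing.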
Head_on_a_Stick
 
Posts: 11021
Joined: 2014-06-01 17:46
Location: /dev/chair

