Catastrophic system failure :)


Post by Lysander » 2020-01-08 20:27

Well, this hasn't happened to me in Slackware, but then I'd probably never pushed the system like this. I was playing a [kind of graphically intensive] game, then rather than closing it I minimised it and started watching a YouTube video. After a couple of minutes the audio started to stutter, then the screen went black and turned off. The audio continued. I watched this with interest for about ten seconds and tried REISUB, which didn't work. After a couple of goes, I had to do a hard reset. The logs came up with this; I've included all the stuff that was red:

Code:
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 0: f200084000000800
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: TSC 0
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: PROCESSOR 0:1067a TIME 1578516588 SOCKET 0 APIC 0 microcode a0b
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: Machine check events logged
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: f200001034000e0f
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: TSC 0
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: PROCESSOR 0:1067a TIME 1578516588 SOCKET 0 APIC 0 microcode a0b
Jan 08 20:49:52 psychopig-xxxvii kernel: Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
Jan 08 20:49:52 psychopig-xxxvii kernel: ... version:                2
Jan 08 20:49:52 psychopig-xxxvii kernel: ... bit width:              40
Jan 08 20:49:52 psychopig-xxxvii kernel: ... generic registers:      2
Jan 08 20:49:52 psychopig-xxxvii kernel: ... value mask:             000000ffffffffff
Jan 08 20:49:52 psychopig-xxxvii kernel: ... max period:             000000007fffffff
Jan 08 20:49:52 psychopig-xxxvii kernel: ... fixed-purpose events:   3
Jan 08 20:49:52 psychopig-xxxvii kernel: ... event mask:             0000000700000003
Jan 08 20:49:52 psychopig-xxxvii kernel: rcu: Hierarchical SRCU implementation.
Jan 08 20:49:52 psychopig-xxxvii kernel: NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
Jan 08 20:49:52 psychopig-xxxvii kernel: smp: Bringing up secondary CPUs ...
Jan 08 20:49:52 psychopig-xxxvii kernel: x86: Booting SMP configuration:
Jan 08 20:49:52 psychopig-xxxvii kernel: .... node  #0, CPUs:      #1
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: f200001010000e0f
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: TSC 0
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: PROCESSOR 0:1067a TIME 1578516588 SOCKET 0 APIC 1 microcode a0b
Jan 08 20:49:52 psychopig-xxxvii kernel:  #2
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 0: f200084000000800
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: TSC 0
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: PROCESSOR 0:1067a TIME 1578516588 SOCKET 0 APIC 2 microcode a0b
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: f200000034000e0f
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: TSC 0
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: PROCESSOR 0:1067a TIME 1578516588 SOCKET 0 APIC 2 microcode a0b
Jan 08 20:49:52 psychopig-xxxvii kernel:  #3
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: f200000010000e0f
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: TSC 0
Jan 08 20:49:52 psychopig-xxxvii kernel: mce: [Hardware Error]: PROCESSOR 0:1067a TIME 1578516588 SOCKET 0 APIC 3 microcode a0b
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: Device: /dev/sdb [SAT], can't monitor Current_Pending_Sector count - no Attribute 19
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: Device: /dev/sdb [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: Device: /dev/sdb [SAT], is SMART capable. Adding to "monitor" list.
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: Device: /dev/sdb [SAT], state read from /var/lib/smartmontools/smartd.GIGABYTE_GP_GS
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: Device: /dev/sdc, type changed from 'scsi' to 'sat'
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: Device: /dev/sdc [SAT], opened
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: Device: /dev/sdc [SAT], INTEL SSDSA2M080G2GC, S/N:CVPO012502X3080JGN, WWN:5-001517-9
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: Device: /dev/sdc [SAT], found in smartd database: Intel X18-M/X25-M/X25-V G2 SSDs
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: Device: /dev/sdc [SAT], WARNING: This drive may require a firmware update to
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: fix possible drive hangs when reading SMART self-test log:
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: http://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=18363
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: Device: /dev/sdc [SAT], can't monitor Current_Pending_Sector count - no Attribute 19
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: Device: /dev/sdc [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: Device: /dev/sdc [SAT], is SMART capable. Adding to "monitor" list.
Jan 08 20:49:57 psychopig-xxxvii smartd[547]: Device: /dev/sdc [SAT], state read from /var/lib/smartmontools/smartd.INTEL_SSDSA2M0
Jan 08 20:50:21 psychopig-xxxvii gnome-session[771]: gnome-session-binary[771]: WARNING: App 'org.gnome.Shell.desktop' exited with
Jan 08 20:50:21 psychopig-xxxvii gnome-session-binary[771]: Unrecoverable failure in required component org.gnome.Shell.desktop
Jan 08 20:50:21 psychopig-xxxvii gnome-session-binary[771]: WARNING: App 'org.gnome.Shell.desktop' exited with code 1
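
For anyone reading the log above: the hex value after each "Bank N:" is the raw machine-check status register for that bank. The following is a minimal Python sketch, not a replacement for a proper decoder, that pulls out the architectural flag bits (63 down to 57) and the two 16-bit code fields as laid out in Intel's Software Developer's Manual. The function name `decode_mci_status` and the dictionary keys are made up for illustration.

```python
# Decode the architectural fields of an IA32_MCi_STATUS value as printed
# in "Machine Check: 0 Bank N: <hex>" kernel log lines.

FLAG_BITS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # MCi_MISC holds extra information
    58: "ADDRV",  # MCi_ADDR holds an address
    57: "PCC",    # processor context may be corrupt
}


def decode_mci_status(hex_value):
    """Split a status value into flag names and the two 16-bit code fields."""
    status = int(hex_value, 16)
    flags = [
        name
        for bit, name in sorted(FLAG_BITS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


# Two of the values from the log above.
for bank, value in [(0, "f200084000000800"), (5, "f200001034000e0f")]:
    d = decode_mci_status(value)
    print(
        f"bank {bank}: {' '.join(d['flags'])} "
        f"mca=0x{d['mca_error_code']:04x} "
        f"model=0x{d['model_specific_code']:04x}"
    )
```

Every status value in the log starts with f2, so each one decodes to VAL, OVER, UC, EN and PCC: uncorrected errors with the processor context flagged as possibly corrupt. That fits a hard lock-up, though it does not by itself say why the errors occurred.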


Now, I'm not sure whose 'fault' this is - mine, Debian's, systemd's or GNOME's. But I was pushing things a little bit. Can anyone verify what was the main culprit here?
Lysander
 
Posts: 593
Joined: 2017-02-23 10:07
Location: London

Re: Catastrophic system failure :)

Post by stevepusser » 2020-01-08 23:25

Have you ruled out hardware overheating as a cause?
stevepusser
 
Posts: 11397
Joined: 2009-10-06 05:53

Re: Catastrophic system failure :)

Post by esp7 » 2020-01-09 00:33

this is clearly failing hardware
esp7
 
Posts: 160
Joined: 2013-06-23 20:31

Re: Catastrophic system failure :)

Post by Head_on_a_Stick » 2020-01-09 05:28

CPU µcode?
Head_on_a_Stick
 
Posts: 11021
Joined: 2014-06-01 17:46
Location: /dev/chair

Re: Catastrophic system failure :)

Post by Lysander » 2020-01-09 08:23

stevepusser wrote: Have you ruled out hardware overheating as a cause?


No, I haven't. I suppose the most likely would be the GPU, though the error log shows the CPU. The other possibility is insufficient power - this computer has a 550W PSU, but even with all components at 90% load, it would only just breach the 500W mark. And I think if that were the cause, the PC would just reboot. Also, I am interested in why REISUB didn't work.
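
On the REISUB question, one thing that can at least be checked is whether the kernel permits those keys at all. The `kernel.sysrq` setting is a bitmask, and a sketch like the one below decodes it using the bit values from the kernel's SysRq documentation. The mask 438 is only an example value, and the function names are hypothetical. Note that if the machine is locked up hard enough, SysRq may not respond whatever this setting is.

```python
# Decode a kernel.sysrq bitmask and report which REISUB keys it permits.
from pathlib import Path

# Permission bit required by each key, per the kernel's sysrq documentation.
SYSRQ_BITS = {
    "R": 4,    # keyboard control (unraw)
    "E": 64,   # signalling of processes (SIGTERM to all)
    "I": 64,   # signalling of processes (SIGKILL to all)
    "S": 16,   # sync
    "U": 32,   # remount read-only
    "B": 128,  # reboot/poweroff
}


def reisub_allowed(mask):
    """Map each REISUB key to whether the given kernel.sysrq value permits it."""
    if mask == 1:  # a value of exactly 1 enables every SysRq function
        return {key: True for key in SYSRQ_BITS}
    return {key: bool(mask & bit) for key, bit in SYSRQ_BITS.items()}


def read_sysrq_mask(path="/proc/sys/kernel/sysrq"):
    """Read the current setting from procfs (Linux only)."""
    return int(Path(path).read_text().strip())


# Example with a fixed mask; on a live system pass read_sysrq_mask() instead.
print(reisub_allowed(438))
```

With the example mask 438, the unraw, sync, remount and reboot keys are permitted, but the two process-signalling keys (E and I) are not.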

esp7 wrote: this is clearly failing hardware


Could you please be a little more specific in your definitions of the words 'clearly' and 'failing'? The log points to a hardware issue, but is that the one and only obvious cause? And when you say 'failing', do you mean in this one instance because it was 'pushed' too far, or because the hardware is degrading with age?

Head_on_a_Stick wrote: CPU µcode?


Yes, do you think this is a problem with the microcode? I haven't manually installed any. Would you recommend I do so?

One thing I should have done, but omitted to do, was provide some specs of this box:

CPU - Intel Q8400
RAM - 6GB DDR2
GPU - Radeon HD 5870

The machine also has a new SSD that was installed a few weeks ago and is the main drive, on which Debian is installed. There are another three internal drives.

Most of the system [including all the above components bar the new SSD] was bought and installed in 2010.
Lysander
 
Posts: 593
Joined: 2017-02-23 10:07
Location: London

Re: Catastrophic system failure :)

Post by CwF » 2020-01-09 13:20

Lysander wrote: a new SSD that was installed a few weeks ago


I'd start there, just my two pence.
CwF
 
Posts: 545
Joined: 2018-06-20 15:16

Re: Catastrophic system failure :)

Post by Head_on_a_Stick » 2020-01-09 14:07

Lysander wrote: Yes, do you think this is a problem with the microcode? I haven't manually installed any. Would you recommend I do so?

Oh yes, Intel are really shit at making processors and the µcode is needed to correct all the mistakes they make; the Haswell generation locks up regularly without the fixes.
Head_on_a_Stick
 
Posts: 11021
Joined: 2014-06-01 17:46
Location: /dev/chair

Re: Catastrophic system failure :)

Post by CwF » 2020-01-09 14:44

Head_on_a_Stick wrote: Oh yes


Except for the fact that the CPU in question is ancient. That one has been figured out for a while now.
CwF
 
Posts: 545
Joined: 2018-06-20 15:16

Re: Catastrophic system failure :)

Post by Head_on_a_Stick » 2020-01-09 14:59

CwF wrote: That one has been figured out for a while now.

Haswell has been out for five years and they still need the fix. It won't hurt the OP to try, they can always remove it afterwards if it doesn't help.
Head_on_a_Stick
 
Posts: 11021
Joined: 2014-06-01 17:46
Location: /dev/chair

Re: Catastrophic system failure :)

Post by pylkko » 2020-01-09 15:29

Also, smartd appears to be warning that the Intel SSD possibly needs a firmware update.
pylkko
 
Posts: 1619
Joined: 2014-11-06 19:02

Re: Catastrophic system failure :)

Post by Lysander » 2020-01-09 19:05

Head_on_a_Stick wrote:
CwF wrote: That one has been figured out for a while now.

Haswell has been out for five years and they still need the fix. It won't hurt the OP to try, they can always remove it afterwards if it doesn't help.


Hmm well, I wish it were that easy, but it was already installed.

Code:
root@psychopig-xxxvii:~# apt install intel-microcode
Reading package lists... Done
Building dependency tree       
Reading state information... Done
intel-microcode is already the newest version (3.20191115.2~deb10u1).
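
apt only confirms that the package is installed; the revision the CPU is actually running is shown in the kernel log (the `microcode a0b` field in the MCE lines above) and in `/proc/cpuinfo`. A minimal sketch for pulling it out follows. The `sample` text is a hand-made stand-in for `/proc/cpuinfo` built from the `a0b` value in the log, and `microcode_revisions` is a hypothetical helper name.

```python
# Extract the running microcode revision(s) from /proc/cpuinfo-style text.
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions (as ints) found in the text."""
    matches = re.findall(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.MULTILINE
    )
    return {int(m, 16) for m in matches}


# Hand-made sample; on a live system use open("/proc/cpuinfo").read() instead.
sample = (
    "processor\t: 0\n"
    "microcode\t: 0xa0b\n"
    "processor\t: 1\n"
    "microcode\t: 0xa0b\n"
)
print({hex(rev) for rev in microcode_revisions(sample)})
```

If the set holds more than one value, the cores are running different revisions, which would be worth a closer look.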


pylkko wrote: Also, smartd appears to be warning that the Intel SSD possibly needs a firmware update.


Indeed, though that drive is approaching ten years old and I don't think it was even mounted at the time.

I have checked the CPU fan and it was clear of dust apart from a very small amount.

I am thinking the most likely cause is the CPU overheating as a result of the game being minimised [which I note does take 30-40% CPU power with certain games] and streaming an HD video. This CPU normally operates around 30-40C when idle and shouldn't be more than 70C ideally. I can't think what else it would be [esp given the error logs] - if it were the GPU I'd probably be able to tell from the fan noise since it does whine when the temp gets high.
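
One way to test the overheating theory is to watch the sensors while repeating the same load. Below is a rough sketch that reads the kernel's hwmon sysfs files (`temp*_input`, in millidegrees Celsius) and flags anything above the 70C figure mentioned above. Which sensors appear depends on the drivers loaded, so treat the paths as an assumption; `read_temps` and `over_limit` are illustrative names.

```python
# Read temperatures from the Linux hwmon sysfs tree and flag hot sensors.
from pathlib import Path


def read_temps(hwmon_root="/sys/class/hwmon"):
    """Return {sensor file path: degrees C} for every temp*_input file."""
    temps = {}
    for sensor in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            # sysfs reports millidegrees Celsius as a plain integer
            temps[str(sensor)] = int(sensor.read_text().strip()) / 1000.0
        except (OSError, ValueError):
            continue  # sensor unreadable at the moment; skip it
    return temps


def over_limit(temps, limit_c=70.0):
    """Return only the sensors reading above limit_c."""
    return {name: value for name, value in temps.items() if value > limit_c}


if __name__ == "__main__":
    readings = read_temps()
    for name, value in readings.items():
        print(f"{name}: {value:.1f} C")
    print("over limit:", over_limit(readings))
```

Running it in a loop during the game-plus-video load would show whether the CPU sensors climb past the limit before the lock-up.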
Lysander
 
Posts: 593
Joined: 2017-02-23 10:07
Location: London

Re: Catastrophic system failure :)

Post by pendrachken » 2020-01-09 21:40

My bet would be dried-up, crusty thermal paste on the CPU die. Get a tube of some decent thermal paste and slap some in there, making sure the cooler snaps into the socket decently, as the stock ones at the time were also known to break (usually) one mount leg and lift from the CPU, causing random thermal shutdowns. It could be lifted enough to cause some thermal lockups, but not enough to cause a thermal shutdown.
pendrachken
 
Posts: 1358
Joined: 2007-03-04 21:10
Location: U.S.A. - WI.

Re: Catastrophic system failure :)

Post by Lysander » 2020-01-10 21:44

pendrachken wrote: My bet would be dried-up, crusty thermal paste on the CPU die. Get a tube of some decent thermal paste and slap some in there, making sure the cooler snaps into the socket decently, as the stock ones at the time were also known to break (usually) one mount leg and lift from the CPU, causing random thermal shutdowns. It could be lifted enough to cause some thermal lockups, but not enough to cause a thermal shutdown.


OK pendrachken, I took your advice and checked out the thermal paste which was last applied several years ago. This is what I was met with:

[Two images of the old thermal paste, not preserved]

Doesn't look like a great contact to me. I cleaned it up with lighter fluid and applied some new paste. Temps seem good for now. Will see how this goes but this looks like it needed to be done. Thanks for the assistance.
Lysander
 
Posts: 593
Joined: 2017-02-23 10:07
Location: London

