mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:


Postby GeNe64 » 2020-07-24 07:13

Hello,

I have a server running Debian 10 (Proxmox) that keeps restarting on its own. Frankly, I can't figure out where the issue is.

Code: Select all
ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE events:
1 2020-07-19 03:43:06 +0200 error: Instruction CACHE Level-0 Instruction-Fetch Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c0e, status=0x9400004000040150, addr=0x1ffff9c8e93c0, tsc=0x199d94e3f312c, walltime=0x5f13a52a, cpu=0x00000001, cpuid=0x000906ec, apicid=0x00000002
2 2020-07-19 03:55:10 +0200 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c0e, status=0x9000004000010005, tsc=0x19c37efad6712, walltime=0x5f13a7fe, cpu=0x00000001, cpuid=0x000906ec, apicid=0x00000002
3 2020-07-23 15:37:51 +0200 error: Instruction CACHE Level-0 Instruction-Fetch Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c0e, status=0x9400004000040150, addr=0x974d56e7, tsc=0x91c13254a62a, walltime=0x5f1992af, cpu=0x00000001, cpuid=0x000906ec, apicid=0x00000002


Code: Select all
Jul 23 16:30:10 E2S kernel: smpboot: CPU0: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz (family: 0x6, model: 0x9e, stepping: 0xc)
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: Machine check events logged
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: be00000000800400
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: TSC 0 ADDR 63de0dd1 MISC 63de0dd1
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: PROCESSOR 0:906ec TIME 1595514604 SOCKET 0 APIC 0 microcode d6
...
Jul 23 16:30:10 E2S kernel: .... node  #0, CPUs:        #1
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: Machine check events logged
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 3: be00000000800400
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: TSC 0 ADDR 63de0dd1 MISC 63de0dd1
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: PROCESSOR 0:906ec TIME 1595514604 SOCKET 0 APIC 2 microcode d6


Is that a hardware, firmware, or software error?
Where should I dig in?

Thanks.
GeNe64
 
Posts: 10
Joined: 2020-07-24 07:05

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby LE_746F6D617A7A69 » 2020-07-24 09:41

Machine check exceptions are triggered by hardware faults - caused by physical problems with the hardware (overheating, unstable power, damaged CPU) or by regressions in the firmware.
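
For what it's worth, the raw status words in logs like the ones above can be decoded by hand. Below is a minimal Python sketch, assuming the architectural IA32_MCi_STATUS layout described in the Intel SDM (flag bits 63 down to 57, MCA error code in bits 15:0). The name decode_mci_status is made up for illustration, and the corrected-error count field is only meaningful on CPUs that report it.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
FLAG_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status):
    """Split a 64-bit machine-check status word into its main fields.

    Hypothetical helper for illustration; not part of mcelog or rasdaemon.
    """
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    return {
        "flags": flags,
        "mca_code": status & 0xFFFF,                    # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "corrected_count": (status >> 38) & 0x7FFF,      # bits 52:38
    }
```

For the be00000000800400 value from the kernel log this reports VAL, UC, EN, MISCV, ADDRV and PCC set (an uncorrected error with processor context corrupt) and MCA code 0x0400, which, if I read the SDM correctly, is an internal timer error. For 0x9400004000040150 it reports a corrected error (UC clear) with code 0x0150, consistent with the instruction-fetch cache error that ras-mc-ctl printed.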

The i9s are prone to overheating - what temps do You have?
Will it crash if You set a constant CPU clock well below the maximum (using linux-cpupower or cpufrequtils)?
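
If no monitoring tool is installed yet, the temperatures can be read straight from sysfs over SSH. This is a rough Python sketch, assuming the usual Linux hwmon layout (temp*_input files holding millidegrees Celsius under /sys/class/hwmon); the function name read_hwmon_temps is hypothetical, and which sensors show up depends on the loaded drivers.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {(chip, label): degrees_celsius} for every temp*_input under root."""
    temps = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        # Fall back to the directory name if the driver exposes no "name" file.
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for input_file in sorted(chip_dir.glob("temp*_input")):
            label_file = input_file.with_name(
                input_file.name.replace("_input", "_label")
            )
            label = (
                label_file.read_text().strip()
                if label_file.exists()
                else input_file.name
            )
            try:
                millidegrees = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor unreadable right now; skip it
            temps[(chip, label)] = millidegrees / 1000.0
    return temps
```

Calling read_hwmon_temps() in a loop while the machine is under load should show whether the package temperature climbs toward the throttling point before a restart.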

If You have upgraded the firmware/BIOS recently, You may try to use the previous version - to confirm or reject the possibility of firmware regression.

Anyway, I would say that You should contact Proxmox regarding this issue.
Bill Gates: "(...) In my case, I went to the garbage cans at the Computer Science Center and I fished out listings of their operating system."
The_full_story and Nothing_have_changed
LE_746F6D617A7A69
 
Posts: 394
Joined: 2020-05-03 14:16

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby GeNe64 » 2020-07-24 13:40

LE_746F6D617A7A69 wrote:Machine check exceptions are triggered by hardware faults - caused by physical problems with the hardware (overheating, unstable power, damaged CPU) or by regressions in the firmware.

The i9s are prone to overheating - what temps do You have?
Will it crash if You set a constant CPU clock well below the maximum (using linux-cpupower or cpufrequtils)?

If You have upgraded the firmware/BIOS recently, You may try to use the previous version - to confirm or reject the possibility of firmware regression.

Anyway, I would say that You should contact Proxmox regarding this issue.


Thanks for the hints.
I haven't monitored the CPU temperature yet, but I noticed that the server restarts under increased load.

Is it possible to load a CPU firmware version of my choosing?
I have another machine with the same CPU and software that works perfectly, but it runs an older firmware version. I'd like to test that as well.

I posted it on the Proxmox forum, but nobody has responded. Maybe they didn't notice it, so I have to find a solution myself.
GeNe64
 
Posts: 10
Joined: 2020-07-24 07:05

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby CwF » 2020-07-24 14:02

Perhaps check in the BIOS. Many have logging options; check them and turn them on.
CwF
 
Posts: 790
Joined: 2018-06-20 15:16

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby GeNe64 » 2020-07-24 14:08

CwF wrote:perhaps check in the bios. Many have log options, check them and turn them on.

I can't check the BIOS now because the server is located in a datacenter. I don't have physical access to it.
GeNe64
 
Posts: 10
Joined: 2020-07-24 07:05

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby LE_746F6D617A7A69 » 2020-07-24 16:10

GeNe64 wrote:Is it possible to load CPU firmware that I want to?

Of course. There are at least 2 ways to do this:
1. Use apt-cache policy intel-microcode to view the available versions, then install an older one using apt-get install intel-microcode=<version>
2. Download an older version of the intel-microcode package from http://snapshot.debian.org/
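
After a reboot it is worth confirming which revision the CPU is actually running, since (as far as I know) the kernel will not load a revision older than the one the BIOS has already applied. This is a small Python sketch, assuming the "microcode : 0x.." lines that x86 kernels print in /proc/cpuinfo; the helper name microcode_revisions is made up for this example.

```python
import re


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions (as ints) found in /proc/cpuinfo text."""
    matches = re.findall(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.MULTILINE
    )
    return {int(value, 16) for value in matches}
```

On the server itself, microcode_revisions(open("/proc/cpuinfo").read()) should return a single value; if it still contains 0xd6 (the "microcode d6" from the log above) after the package downgrade, the newer revision is probably coming from the BIOS.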
LE_746F6D617A7A69
 
Posts: 394
Joined: 2020-05-03 14:16

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby Bulkley » 2020-07-24 19:48

A weak power supply can drive you crazy. So can a flaky on-off switch.

I'd blow the dust out and then re-seat every electrical connection including boards and memory.

I once had a motherboard with a cold-solder joint causing intermittent freezes.

Sometimes you can find the problem while running with the cover off and very carefully tapping components with something that does not conduct electricity. A drinking straw, a plastic pencil or a fine wood dowel work. Be gentle. Set your machine to run something demanding like a movie and poke around among the works. If the machine quits while you are knocking around you know roughly where the problem is. I can't stress enough to be careful at this and if you are not comfortable with it don't do it.
Bulkley
 
Posts: 6004
Joined: 2006-02-11 18:35

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby CwF » 2020-07-24 20:18

Bulkley wrote:I'd blow... Be gentle..something demanding...knocking...roughly...problem is...don't do it.

ya, when they get to the bios...
! read much?
CwF
 
Posts: 790
Joined: 2018-06-20 15:16

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby Bulkley » 2020-07-24 20:31

CwF wrote:ya, when they get to the bios...
! read much?


Yes, the BIOS. I saw that. However I've seen resets caused by hardware malfunction. You probably have too. Both need to be checked.
Bulkley
 
Posts: 6004
Joined: 2006-02-11 18:35

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby CwF » 2020-07-24 20:40

Bulkley wrote: You probably have too

Oh ya, just saying somewhere 'in a datacenter' means it may not be touched. Time for the browser plugin tunneling into the public ipmi data connect to check the bios, and we'll just hope the roomba is around to clean it.
..just kidding..
CwF
 
Posts: 790
Joined: 2018-06-20 15:16

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby GeNe64 » 2020-07-25 07:14

LE_746F6D617A7A69 wrote:
GeNe64 wrote:Is it possible to load CPU firmware that I want to?

Of course. There are at least 2 ways to do this:
1. Use apt-cache policy intel-microcode to view available versions and then install older version using apt-get install intel-microcode=<version>
2. Download older version of the intel-microcode package from http://snapshot.debian.org/

Many thanks, I've downgraded the firmware to 3.20191115.2~deb10u1.
At least I no longer see these errors:
Code: Select all
Jul 23 16:30:10 E2S kernel: smpboot: CPU0: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz (family: 0x6, model: 0x9e, stepping: 0xc)
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: Machine check events logged
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: be00000000800400
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: TSC 0 ADDR 63de0dd1 MISC 63de0dd1
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: PROCESSOR 0:906ec TIME 1595514604 SOCKET 0 APIC 0 microcode d6


The CPU temperature test (89°C for 10 hours) passed.
Testing it now with the downgraded firmware...
GeNe64
 
Posts: 10
Joined: 2020-07-24 07:05

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby GeNe64 » 2020-07-25 07:33

Bulkley wrote:A weak power supply can drive you crazy. So can a flaky on-off switch.

I'd blow the dust out and then re-seat every electrical connection including boards and memory.

I once had a mother board with a cold-solder joint causing intermittent freezes.

Sometimes you can find the problem while running with the cover off and very carefully tapping components with something that does not conduct electricity. A drinking straw, a plastic pencil or a fine wood dowel work. Be gentle. Set your machine to run something demanding like a movie and poke around among the works. If the machine quits while you are knocking around you know roughly where the problem is. I can't stress enough to be careful at this and if you are not comfortable with it don't do it.

The server was ordered as a dedicated server, I have remote access only.
GeNe64
 
Posts: 10
Joined: 2020-07-24 07:05

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby GeNe64 » 2020-08-01 06:07

Guys, I couldn't resolve the issues with the server above, so I ordered a new one with the same specs (Intel® Core™ i9-9900K etc.), but I'm still getting errors like these:
Code: Select all
Jul 30 03:01:16 E3S kernel: [21083.991177] mce: CPU8: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991178] mce: CPU9: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991179] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991179] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991180] mce: CPU13: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991181] mce: CPU14: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991182] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991340] mce: CPU10: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.992167] mce: CPU7: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992168] mce: CPU4: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992168] mce: CPU11: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992169] mce: CPU15: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992170] mce: CPU12: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992171] mce: CPU3: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992204] mce: CPU2: Core temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992205] mce: CPU2: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992208] mce: CPU5: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992209] mce: CPU13: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992210] mce: CPU0: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992210] mce: CPU8: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992211] mce: CPU1: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992212] mce: CPU9: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992213] mce: CPU6: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992213] mce: CPU14: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992235] mce: CPU10: Core temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.995378] mce: CPU10: Package temperature/speed normal
Jul 30 03:50:03 E3S kernel: [24010.129044] perf: interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Jul 30 05:40:02 E3S kernel: [30609.900970] mce: [Hardware Error]: Machine check events logged
Jul 31 00:00:01 E3S rsyslogd:  [origin software="rsyslogd" swVersion="8.1901.0" x-pid="725" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Jul 31 03:01:12 E3S kernel: [107480.111168] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
Jul 31 03:01:12 E3S kernel: [107480.111168] mce: CPU8: Core temperature above threshold, cpu clock throttled (total events = 1)
Jul 31 03:01:12 E3S kernel: [107480.111169] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111171] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111172] mce: CPU10: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111173] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111201] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111203] mce: CPU14: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111204] mce: CPU15: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111205] mce: CPU11: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111210] mce: CPU8: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111210] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111211] mce: CPU9: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111212] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 2)


Then I lose access to my server.
Any tips?
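
A per-CPU tally of the throttle messages may be useful to hand to the provider. This is a rough Python sketch, assuming the kernel log lines look exactly like the ones above; count_throttle_events and its regular expression are only an illustration, not part of any existing tool.

```python
import re
from collections import Counter

# Matches e.g. "mce: CPU8: Package temperature above threshold, cpu clock throttled"
THROTTLE_RE = re.compile(
    r"mce: CPU(?P<cpu>\d+): (?P<scope>Core|Package) temperature above threshold"
)


def count_throttle_events(log_lines):
    """Count 'temperature above threshold' messages per (cpu, scope) pair."""
    counts = Counter()
    for line in log_lines:
        match = THROTTLE_RE.search(line)
        if match:
            counts[(int(match.group("cpu")), match.group("scope"))] += 1
    return counts
```

It can be fed any iterable of lines, for example an open log file; /var/log/kern.log is the usual location with rsyslog on Debian, but that path is an assumption about this particular setup.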
GeNe64
 
Posts: 10
Joined: 2020-07-24 07:05

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby cuckooflew » 2020-08-01 13:07

The server was ordered as a dedicated server, I have remote access only.

I think I would be communicating with the provider of this server. If it is hardware, and it sounds like it is, then someone with physical access will need to do the "mechanic" part.
cuckooflew
 
Posts: 683
Joined: 2018-05-10 19:34
Location: Some where out west

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby Deb-fan » 2020-08-01 14:32

Only a random thought: the same challenges that Debian stable users face with newer hardware on the desktop surely hold true for any other use, servers included. Perhaps consider installing a newer kernel, newer firmware versions, etc. Rather than downgrading, You would be going the other way, hopefully gaining improved support for the chosen hardware. You should install some monitoring software on a server anyway. A dead simple way to rule out hardware and determine whether it is an OS misconfiguration: install another GNU/Linux distro like Ubuntu on it and see whether that install shows the same quirks and problems.

If it runs smoother without displaying similar negative behavior, the hardware is fine and Debian is not set up right for that system.
Deb-fan
 
Posts: 954
Joined: 2012-08-14 12:27
