mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

GeNe64 · #1 Post by **GeNe64** » 2020-07-24 07:13

Hello,

I have a server with Debian 10 (Proxmox) that restarts automatically. Frankly, I can't figure out where the issue is.

Code:

ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE events:
1 2020-07-19 03:43:06 +0200 error: Instruction CACHE Level-0 Instruction-Fetch Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c0e, status=0x9400004000040150, addr=0x1ffff9c8e93c0, tsc=0x199d94e3f312c, walltime=0x5f13a52a, cpu=0x00000001, cpuid=0x000906ec, apicid=0x00000002
2 2020-07-19 03:55:10 +0200 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c0e, status=0x9000004000010005, tsc=0x19c37efad6712, walltime=0x5f13a7fe, cpu=0x00000001, cpuid=0x000906ec, apicid=0x00000002
3 2020-07-23 15:37:51 +0200 error: Instruction CACHE Level-0 Instruction-Fetch Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c0e, status=0x9400004000040150, addr=0x974d56e7, tsc=0x91c13254a62a, walltime=0x5f1992af, cpu=0x00000001, cpuid=0x000906ec, apicid=0x00000002

Code:

Jul 23 16:30:10 E2S kernel: smpboot: CPU0: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz (family: 0x6, model: 0x9e, stepping: 0xc)
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: Machine check events logged
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: be00000000800400
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: TSC 0 ADDR 63de0dd1 MISC 63de0dd1
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: PROCESSOR 0:906ec TIME 1595514604 SOCKET 0 APIC 0 microcode d6
...
Jul 23 16:30:10 E2S kernel: .... node  #0, CPUs:        #1
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: Machine check events logged
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 3: be00000000800400
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: TSC 0 ADDR 63de0dd1 MISC 63de0dd1
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: PROCESSOR 0:906ec TIME 1595514604 SOCKET 0 APIC 2 microcode d6

Is that a hardware, firmware or software error?
Where should I dig in?

Thanks.

LE_746F6D617A7A69 · #2 Post by **LE_746F6D617A7A69** » 2020-07-24 09:41

Machine check exceptions are triggered by hardware faults - caused by physical problems with the hardware (overheating, unstable power, damaged CPU) or by regressions in the firmware.

i9 CPUs are prone to overheating - what temps do you have?
Does it still crash if you set a constant CPU clock well below the maximum (using linux-cpupower or cpufrequtils)?

If you have upgraded the firmware/BIOS recently, you may try the previous version to confirm or rule out a firmware regression.

Anyway, I would say that you should contact Proxmox regarding this issue.
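
For reference, the status words in the first post can be decoded by hand. The sketch below is only an illustration: it assumes the architectural IA32_MCi_STATUS layout described in Intel's Software Developer's Manual (flag bits 63 to 57, MCA error code in bits 15:0), the names FLAGS, SIMPLE_CODES and decode_status are made up here, and the lookup table covers just two of the simple error codes.

```python
# Flag bits of IA32_MCi_STATUS (assumed architectural layout, see lead-in).
FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}

# Two "simple" MCA error codes (bits 15:0); deliberately incomplete.
SIMPLE_CODES = {
    0x0005: "internal parity error",
    0x0400: "internal timer error",
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    return {
        "flags": flags,
        "mca_code": mca_code,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "meaning": SIMPLE_CODES.get(mca_code, "not in this small table"),
    }


# Values copied from the kernel log and the ras-mc-ctl output in post #1.
for raw in ("be00000000800400", "9000004000010005"):
    print(raw, decode_status(int(raw, 16)))
```

Read this way, be00000000800400 is a valid, uncorrected error with the PCC flag set, while the ras-mc-ctl entries carry no UC flag, which matches their Corrected_error label.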

GeNe64 · #3 Post by **GeNe64** » 2020-07-24 13:40

LE_746F6D617A7A69 wrote: Machine check exceptions are triggered by hardware faults - caused by physical problems with the hardware (overheating, unstable power, damaged CPU) or by regressions in the firmware.

i9 CPUs are prone to overheating - what temps do you have?
Does it still crash if you set a constant CPU clock well below the maximum (using linux-cpupower or cpufrequtils)?

If you have upgraded the firmware/BIOS recently, you may try the previous version to confirm or rule out a firmware regression.

Anyway, I would say that you should contact Proxmox regarding this issue.

Thanks for the hints.
I haven't monitored the CPU temperature yet, but I noticed that the server restarts under increased load.

Is it possible to load a CPU firmware version of my choice?
I have another machine with the same CPU and software that works perfectly, but it has an older firmware version. I'd like to test that as well.

I posted this on the Proxmox forum, but nobody responded. Maybe they didn't notice it, so I have to find a solution myself.

#4 Post by **CwF** » 2020-07-24 14:02

Perhaps check in the BIOS. Many have logging options; check them and turn them on.

GeNe64 · #5 Post by **GeNe64** » 2020-07-24 14:08

CwF wrote: Perhaps check in the BIOS. Many have logging options; check them and turn them on.

I can't check the BIOS now because the server is located in a datacenter. I don't have physical access to it.

LE_746F6D617A7A69 · #6 Post by **LE_746F6D617A7A69** » 2020-07-24 16:10

GeNe64 wrote: Is it possible to load a CPU firmware version of my choice?

Of course. There are at least 2 ways to do this:
1. Use apt-cache policy intel-microcode to view the available versions, then install an older version with apt-get install intel-microcode=<version>
2. Download an older version of the intel-microcode package from http://snapshot.debian.org/
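
After installing a different intel-microcode package and rebooting, it helps to confirm which revision is actually running (the kernel log in post #1 reports it as "microcode d6"). A minimal sketch, assuming an x86 Linux system whose /proc/cpuinfo carries a "microcode" field per logical CPU; the function name microcode_revisions is made up for illustration.

```python
import re


def microcode_revisions(cpuinfo_text: str) -> set:
    """Return the distinct microcode revisions found in /proc/cpuinfo text."""
    return set(re.findall(r"^microcode\s*:\s*(\S+)", cpuinfo_text, re.MULTILINE))


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            # Normally a single value; several values would mean the
            # logical CPUs do not all run the same revision.
            print(microcode_revisions(f.read()))
    except OSError:
        print("could not read /proc/cpuinfo on this system")
```

If the printed revision has not changed after the downgrade, the BIOS may be supplying a newer revision than the package does.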

Bulkley · #7 Post by **Bulkley** » 2020-07-24 19:48

A weak power supply can drive you crazy. So can a flaky on-off switch.

I'd blow the dust out and then re-seat every electrical connection including boards and memory.

I once had a motherboard with a cold-solder joint causing intermittent freezes.

Sometimes you can find the problem while running with the cover off and very carefully tapping components with something that does not conduct electricity. A drinking straw, a plastic pencil or a fine wood dowel work. Be gentle. Set your machine to run something demanding like a movie and poke around among the works. If the machine quits while you are knocking around you know roughly where the problem is. I can't stress enough to be careful at this and if you are not comfortable with it don't do it.

#8 Post by **CwF** » 2020-07-24 20:18

Bulkley wrote:I'd blow... Be gentle..something demanding...knocking...roughly...problem is...don't do it.

ya, when they get to the bios...
! read much?

Bulkley · #9 Post by **Bulkley** » 2020-07-24 20:31

CwF wrote:ya, when they get to the bios...
! read much?

Yes, the BIOS. I saw that. However I've seen resets caused by hardware malfunction. You probably have too. Both need to be checked.

#10 Post by **CwF** » 2020-07-24 20:40

Bulkley wrote: You probably have too

Oh ya, just saying somewhere 'in a datacenter' means it may not be touched. Time for the browser plugin tunneling into the public ipmi data connect to check the bios, and we'll just hope the roomba is around to clean it.
..just kidding..

GeNe64 · #11 Post by **GeNe64** » 2020-07-25 07:14

LE_746F6D617A7A69 wrote:
GeNe64 wrote: Is it possible to load a CPU firmware version of my choice?
Of course. There are at least 2 ways to do this:
1. Use apt-cache policy intel-microcode to view the available versions, then install an older version with apt-get install intel-microcode=<version>
2. Download an older version of the intel-microcode package from http://snapshot.debian.org/

Many thanks, I've downgraded the firmware to 3.20191115.2~deb10u1.
At least I no longer see these errors:

Code:

Jul 23 16:30:10 E2S kernel: smpboot: CPU0: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz (family: 0x6, model: 0x9e, stepping: 0xc)
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: Machine check events logged
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: be00000000800400
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: TSC 0 ADDR 63de0dd1 MISC 63de0dd1
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: PROCESSOR 0:906ec TIME 1595514604 SOCKET 0 APIC 0 microcode d6

The CPU temperature test (89°C) for 10 hrs passed.
Testing it now with downgraded firmware...

GeNe64 · #12 Post by **GeNe64** » 2020-07-25 07:33

Bulkley wrote:A weak power supply can drive you crazy. So can a flaky on-off switch.

I'd blow the dust out and then re-seat every electrical connection including boards and memory.

I once had a motherboard with a cold-solder joint causing intermittent freezes.

Sometimes you can find the problem while running with the cover off and very carefully tapping components with something that does not conduct electricity. A drinking straw, a plastic pencil or a fine wood dowel work. Be gentle. Set your machine to run something demanding like a movie and poke around among the works. If the machine quits while you are knocking around you know roughly where the problem is. I can't stress enough to be careful at this and if you are not comfortable with it don't do it.

The server was ordered as a dedicated server, I have remote access only.

GeNe64 · #13 Post by **GeNe64** » 2020-08-01 06:07

Guys, I couldn't resolve the issues with the server above, so I ordered a new one with the same specs (Intel® Core™ i9-9900K etc), but I'm still getting errors like these:

Code:

Jul 30 03:01:16 E3S kernel: [21083.991177] mce: CPU8: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991178] mce: CPU9: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991179] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991179] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991180] mce: CPU13: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991181] mce: CPU14: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991182] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991340] mce: CPU10: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.992167] mce: CPU7: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992168] mce: CPU4: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992168] mce: CPU11: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992169] mce: CPU15: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992170] mce: CPU12: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992171] mce: CPU3: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992204] mce: CPU2: Core temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992205] mce: CPU2: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992208] mce: CPU5: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992209] mce: CPU13: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992210] mce: CPU0: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992210] mce: CPU8: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992211] mce: CPU1: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992212] mce: CPU9: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992213] mce: CPU6: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992213] mce: CPU14: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992235] mce: CPU10: Core temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.995378] mce: CPU10: Package temperature/speed normal
Jul 30 03:50:03 E3S kernel: [24010.129044] perf: interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Jul 30 05:40:02 E3S kernel: [30609.900970] mce: [Hardware Error]: Machine check events logged
Jul 31 00:00:01 E3S rsyslogd:  [origin software="rsyslogd" swVersion="8.1901.0" x-pid="725" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Jul 31 03:01:12 E3S kernel: [107480.111168] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
Jul 31 03:01:12 E3S kernel: [107480.111168] mce: CPU8: Core temperature above threshold, cpu clock throttled (total events = 1)
Jul 31 03:01:12 E3S kernel: [107480.111169] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111171] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111172] mce: CPU10: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111173] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111201] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111203] mce: CPU14: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111204] mce: CPU15: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111205] mce: CPU11: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111210] mce: CPU8: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111210] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111211] mce: CPU9: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111212] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 2)

Then I lose access to my server.
Any tips?

cuckooflew · #14 Post by **cuckooflew** » 2020-08-01 13:07

GeNe64 wrote: The server was ordered as a dedicated server, I have remote access only.

I think I would be communicating with the provider of this server. If it is hardware, and it sounds like it is, then someone with physical access will need to do the "mechanic" part.

Deb-fan · #15 Post by **Deb-fan** » 2020-08-01 14:32

Only a random thought: the same challenges faced by Debian stable users with newer hardware on the desktop certainly hold true for any other use (servers). Perhaps consider installing a newer kernel, newer firmware versions, etc. Rather than downgrading, you would be going the other way, hopefully getting improved support for the chosen hardware. You should install some monitoring software on a server anyway. A dead simple way to rule out hardware and determine whether it is an OS misconfiguration: install another GNU/Linux distro such as Ubuntu on it. Does that install show the same quirks and problems?

If it runs smoother/better without displaying similar negative behavior, the hardware is fine and Debian is not set up right for that system.

GeNe64 · #16 Post by **GeNe64** » 2020-08-01 15:30

cuckooflew wrote:
The server was ordered as a dedicated server, I have remote access only.
I think I would be communicating with the provider of this server. If it is hardware, and it sounds like it is, then someone with physical access will need to do the "mechanic" part.

They've tested it and said it's ok.

GeNe64 · #17 Post by **GeNe64** » 2020-08-01 15:37

Deb-fan wrote: Only a random thought: the same challenges faced by Debian stable users with newer hardware on the desktop certainly hold true for any other use (servers). Perhaps consider installing a newer kernel, newer firmware versions, etc. Rather than downgrading, you would be going the other way, hopefully getting improved support for the chosen hardware. You should install some monitoring software on a server anyway. A dead simple way to rule out hardware and determine whether it is an OS misconfiguration: install another GNU/Linux distro such as Ubuntu on it. Does that install show the same quirks and problems?

If it runs smoother/better without displaying similar negative behavior, the hardware is fine and Debian is not set up right for that system.

I need Debian only to install Proxmox on it. When I run it unloaded (OS, Proxmox and 10 unloaded VMs) it's OK and works for 6+ days. If I start to load the VMs (CPU, SSD, network), then it crashes or otherwise fails in 1/2 days, on both servers, in different datacenters.

I can't find anything useful in the standard logs. Any suggestions regarding monitoring software?
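
One low-tech option is to log the temperatures yourself, so that the last readings before a crash survive on disk. A minimal sketch, assuming the kernel exposes sensors under /sys/class/hwmon (temp*_input in millidegrees Celsius, with optional temp*_label and name files); the function name read_temps and its parameter are made up for illustration.

```python
from pathlib import Path


def read_temps(hwmon_root: str = "/sys/class/hwmon") -> dict:
    """Map 'chip/label' to degrees Celsius for each temp*_input file found."""
    temps = {}
    for sensor in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        chip_file = sensor.parent / "name"
        chip = chip_file.read_text().strip() if chip_file.exists() else sensor.parent.name
        label_file = sensor.with_name(sensor.name.replace("_input", "_label"))
        label = label_file.read_text().strip() if label_file.exists() else sensor.name
        try:
            # sysfs reports millidegrees Celsius
            temps[f"{chip}/{label}"] = int(sensor.read_text()) / 1000.0
        except (OSError, ValueError):
            continue  # some sensors cannot be read; skip them
    return temps


if __name__ == "__main__":
    for name, celsius in read_temps().items():
        print(f"{name}: {celsius:.1f} C")
```

Run from cron once a minute with the output appended to a file, this would show how the temperatures develop as the VMs are loaded.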

Deb-fan · #18 Post by **Deb-fan** » 2020-08-01 15:59

Nope .. only vaguely aware of wth proxmox even is. A virtualization container type thing. Badly lacking in learning about virtualization all around.

So what/which tools or how to approach trouble shooting it(proxmox) is beyond me ... Sorry, though from what those techs apparently said to you and what you're describing, not hardware but software problems. If they have help forums and likely do would spend time and ask the people there, who use that software for help n pointers on running down issues and possible fixes.

LE_746F6D617A7A69 · #19 Post by **LE_746F6D617A7A69** » 2020-08-01 16:02

GeNe64 wrote:
cuckooflew wrote:
The server was ordered as a dedicated server, I have remote access only.
I think I would be communicating with the provider of this server. If it is hardware, and it sounds like it is, then someone with physical access will need to do the "mechanic" part.
They've tested it and said it's ok.

The thermal throttling alerts say otherwise: the machine has a cooling problem, which suggests that, for example, it's too hot in the server room.
Temperatures from SMART could give a clearer picture of what is happening in that data center.
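
The throttle messages themselves can also be summarised. Both bursts in the log from post #13 begin at 03:01 on consecutive days, which may point at something scheduled rather than a constantly hot room. A minimal sketch that counts "above threshold" messages per minute, assuming the syslog line format shown in that post; the names THROTTLE and throttle_bursts are made up for illustration.

```python
import re
from collections import Counter

# Matches e.g. "Jul 30 03:01:16 E3S kernel: [...] mce: CPU8: Package temperature above threshold, ..."
THROTTLE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d):\d\d .*mce: CPU(?P<cpu>\d+): "
    r"(?P<scope>Core|Package) temperature above threshold"
)


def throttle_bursts(lines):
    """Count 'temperature above threshold' messages per minute of wall time."""
    bursts = Counter()
    for line in lines:
        match = THROTTLE.match(line)
        if match:
            bursts[match.group("stamp")] += 1
    return bursts


# Three lines copied from the log in post #13.
sample = [
    "Jul 30 03:01:16 E3S kernel: [21083.991177] mce: CPU8: Package temperature above threshold, cpu clock throttled (total events = 1)",
    "Jul 30 03:01:16 E3S kernel: [21083.992167] mce: CPU7: Package temperature/speed normal",
    "Jul 31 03:01:12 E3S kernel: [107480.111168] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)",
]
print(throttle_bursts(sample))
```

Fed with the full kernel log, the minutes with the highest counts can be compared against cron jobs, backups or VM activity at the same time of day.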

Deb-fan · #20 Post by **Deb-fan** » 2020-08-01 16:23

Come on now. The server room is too hot in a couple datacenters? Errr or, incorrectly configged software w processes hammering hell out of cpus causing heat and crash issues ? Which of these more likely? Still again ... asking this in a mostly desktop oriented Debian gnu/nix community vs asking in proxmox or kvm ones?

Debian User Forums
