mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

GeNe64
Posts: 10
Joined: 2020-07-24 07:05

#1 Post by GeNe64 »

Hello,

I have a server with Debian 10 (Proxmox) that restarts automatically. Frankly, I can't figure out where the issue is.

Code: Select all

ras-mc-ctl --errors
No Memory errors.

No PCIe AER errors.

No Extlog errors.

MCE events:
1 2020-07-19 03:43:06 +0200 error: Instruction CACHE Level-0 Instruction-Fetch Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c0e, status=0x9400004000040150, addr=0x1ffff9c8e93c0, tsc=0x199d94e3f312c, walltime=0x5f13a52a, cpu=0x00000001, cpuid=0x000906ec, apicid=0x00000002
2 2020-07-19 03:55:10 +0200 error: Internal parity error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c0e, status=0x9000004000010005, tsc=0x19c37efad6712, walltime=0x5f13a7fe, cpu=0x00000001, cpuid=0x000906ec, apicid=0x00000002
3 2020-07-23 15:37:51 +0200 error: Instruction CACHE Level-0 Instruction-Fetch Error, mcg mcgstatus=0, mci Corrected_error Error_enabled, mcgcap=0x00000c0e, status=0x9400004000040150, addr=0x974d56e7, tsc=0x91c13254a62a, walltime=0x5f1992af, cpu=0x00000001, cpuid=0x000906ec, apicid=0x00000002

Code: Select all

Jul 23 16:30:10 E2S kernel: smpboot: CPU0: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz (family: 0x6, model: 0x9e, stepping: 0xc)
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: Machine check events logged
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: be00000000800400
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: TSC 0 ADDR 63de0dd1 MISC 63de0dd1
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: PROCESSOR 0:906ec TIME 1595514604 SOCKET 0 APIC 0 microcode d6
...
Jul 23 16:30:10 E2S kernel: .... node  #0, CPUs:        #1
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: Machine check events logged
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 3: be00000000800400
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: TSC 0 ADDR 63de0dd1 MISC 63de0dd1
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: PROCESSOR 0:906ec TIME 1595514604 SOCKET 0 APIC 2 microcode d6
Is that a hardware, firmware, or software error?
Where should I dig in?

Thanks.

LE_746F6D617A7A69
Posts: 521
Joined: 2020-05-03 14:16

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#2 Post by LE_746F6D617A7A69 »

Machine check exceptions are triggered by hardware faults - caused by physical problems with the hardware (overheating, unstable power, damaged CPU) or by regressions in the firmware.
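
As a rough aid for reading the status values in the logs above, here is a minimal sketch that splits a machine-check status word into its flag bits and error-code fields. It assumes the bit layout of `IA32_MCi_STATUS` as described in the Intel SDM; the names `FLAG_BITS` and `decode_mci_status` are made up for this illustration and are not part of any tool mentioned in the thread.

```python
# Sketch: decode the architectural flag bits of an IA32_MCi_STATUS value.
# Assumption: bit positions follow the layout in the Intel SDM; check the
# manual for your CPU before relying on this.

FLAG_BITS = {
    "VAL": 63,    # register contains valid information
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting enabled
    "MISCV": 59,  # MISC register valid
    "ADDRV": 58,  # ADDR register valid
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit status word into flags and error-code fields."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    decoded["mca_error_code"] = status & 0xFFFF               # bits 15:0
    decoded["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    return decoded


if __name__ == "__main__":
    # Status values copied from the logs earlier in this thread.
    for raw in ("be00000000800400", "9400004000040150", "9000004000010005"):
        info = decode_mci_status(int(raw, 16))
        flags = [name for name in FLAG_BITS if info[name]]
        print(f"{raw}: flags={flags} mca=0x{info['mca_error_code']:04x}")
```

Under this layout, `be00000000800400` from the boot log has UC and PCC set (an uncorrected error), while the two distinct status values reported by `ras-mc-ctl` have UC clear, which is consistent with their `Corrected_error` label.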

The i9 is prone to overheating. What temperatures do you see?
Does it still crash if you set a constant CPU clock well below the maximum (using linux-cpupower or cpufrequtils)?
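
For anyone who prefers to script the clock limit rather than use the packaged tools, here is a minimal sketch that writes a frequency cap through the cpufreq sysfs files. It is an alternative to linux-cpupower or cpufrequtils, not what those tools do internally. The function names and the 2.0 GHz example value are assumptions for illustration; writing to the real sysfs tree needs root.

```python
from pathlib import Path

# Standard cpufreq sysfs location on Linux; values in these files are in kHz.
CPU_ROOT = Path("/sys/devices/system/cpu")


def cap_max_freq(khz: int, cpu_root: Path = CPU_ROOT) -> list:
    """Write khz to every cpuN/cpufreq/scaling_max_freq; return the files written."""
    written = []
    for path in sorted(cpu_root.glob("cpu[0-9]*/cpufreq/scaling_max_freq")):
        path.write_text(f"{khz}\n")
        written.append(path)
    return written


def current_freqs(cpu_root: Path = CPU_ROOT) -> dict:
    """Return {cpu_name: current frequency in kHz} read from scaling_cur_freq."""
    freqs = {}
    for path in cpu_root.glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        freqs[path.parent.parent.name] = int(path.read_text())
    return freqs


# Example (run as root; 2_000_000 kHz is an arbitrary value below the
# 3.60 GHz base clock reported in the boot log):
#     cap_max_freq(2_000_000)
#     print(current_freqs())
```

The cap does not survive a reboot, which makes it convenient for a stability test: if the machine stays up under load with the cap in place, heat or power delivery becomes the more likely suspect.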

If you have upgraded the firmware/BIOS recently, you could try the previous version to confirm or rule out a firmware regression.

Anyway, I would say that you should contact Proxmox regarding this issue.

GeNe64
Posts: 10
Joined: 2020-07-24 07:05

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#3 Post by GeNe64 »

Thanks for the hints.
I haven't monitored the CPU temperature yet, but I noticed that the server restarts under increased load.

Is it possible to load CPU firmware that I want to?
I have another machine with the same CPU and software that works perfectly, but it has an older firmware version. I'd like to test that as well.

I posted this on the Proxmox forum, but they didn't respond. Maybe they didn't notice, so I have to find a solution myself.

CwF
Posts: 1187
Joined: 2018-06-20 15:16
Has thanked: 2 times
Been thanked: 6 times

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#4 Post by CwF »

Perhaps check in the BIOS. Many have logging options; check them and turn them on.

GeNe64
Posts: 10
Joined: 2020-07-24 07:05

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#5 Post by GeNe64 »

CwF wrote:perhaps check in the bios. Many have log options, check them and turn them on.
I can't check the bios now because the server is located in a datacenter. I don't have physical access to it.

LE_746F6D617A7A69
Posts: 521
Joined: 2020-05-03 14:16

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#6 Post by LE_746F6D617A7A69 »

GeNe64 wrote:Is it possible to load CPU firmware that I want to?
Of course. There are at least 2 ways to do this:
1. Use apt-cache policy intel-microcode to view the available versions and then install an older version using apt-get install intel-microcode=<version>
2. Download an older version of the intel-microcode package from http://snapshot.debian.org/
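
After changing the package, it helps to confirm which microcode revision the CPU is actually running, typically after a reboot. A minimal sketch, assuming an x86 Linux system where `/proc/cpuinfo` carries a `microcode` line per logical CPU; the function name is made up for this example.

```python
import re
from pathlib import Path


def microcode_revisions(cpuinfo_text: str) -> set:
    """Collect the distinct 'microcode' values from /proc/cpuinfo-style text."""
    return set(re.findall(r"^microcode\s*:\s*(\S+)", cpuinfo_text, flags=re.MULTILINE))


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        # One value is expected; more than one would mean the logical CPUs disagree.
        print(microcode_revisions(cpuinfo.read_text()))
```

The kernel log quoted at the start of the thread reports `microcode d6`, so a successful downgrade should show a different revision here.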

Bulkley
Posts: 6173
Joined: 2006-02-11 18:35
Been thanked: 2 times

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#7 Post by Bulkley »

A weak power supply can drive you crazy. So can a flaky on-off switch.

I'd blow the dust out and then re-seat every electrical connection including boards and memory.

I once had a motherboard with a cold-solder joint causing intermittent freezes.

Sometimes you can find the problem while running with the cover off and very carefully tapping components with something that does not conduct electricity. A drinking straw, a plastic pencil or a fine wood dowel work. Be gentle. Set your machine to run something demanding like a movie and poke around among the works. If the machine quits while you are knocking around you know roughly where the problem is. I can't stress enough to be careful at this and if you are not comfortable with it don't do it.

CwF
Posts: 1187
Joined: 2018-06-20 15:16
Has thanked: 2 times
Been thanked: 6 times

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#8 Post by CwF »

Bulkley wrote:I'd blow... Be gentle..something demanding...knocking...roughly...problem is...don't do it.
ya, when they get to the bios...
! read much?

Bulkley
Posts: 6173
Joined: 2006-02-11 18:35
Been thanked: 2 times

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#9 Post by Bulkley »

CwF wrote:ya, when they get to the bios...
! read much?
Yes, the BIOS. I saw that. However I've seen resets caused by hardware malfunction. You probably have too. Both need to be checked.

CwF
Posts: 1187
Joined: 2018-06-20 15:16
Has thanked: 2 times
Been thanked: 6 times

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#10 Post by CwF »

Bulkley wrote: You probably have too
Oh ya, just saying somewhere 'in a datacenter' means it may not be touched. Time for the browser plugin tunneling into the public ipmi data connect to check the bios, and we'll just hope the roomba is around to clean it.
..just kidding..

GeNe64
Posts: 10
Joined: 2020-07-24 07:05

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#11 Post by GeNe64 »

Many thanks, I've downgraded the firmware (intel-microcode) to 3.20191115.2~deb10u1.
At least I no longer see these errors:

Code: Select all

Jul 23 16:30:10 E2S kernel: smpboot: CPU0: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz (family: 0x6, model: 0x9e, stepping: 0xc)
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: Machine check events logged
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: be00000000800400
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: TSC 0 ADDR 63de0dd1 MISC 63de0dd1
Jul 23 16:30:10 E2S kernel: mce: [Hardware Error]: PROCESSOR 0:906ec TIME 1595514604 SOCKET 0 APIC 0 microcode d6
A CPU temperature test (89°C) for 10 hrs passed.
Testing it now with the downgraded firmware...

GeNe64
Posts: 10
Joined: 2020-07-24 07:05

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#12 Post by GeNe64 »

Bulkley wrote: I'd blow the dust out and then re-seat every electrical connection including boards and memory.
The server was ordered as a dedicated server, I have remote access only.

GeNe64
Posts: 10
Joined: 2020-07-24 07:05

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#13 Post by GeNe64 »

Guys, I couldn't resolve the issues with the server above, so I ordered a new one with the same specs (Intel® Core™ i9-9900K etc.), but I'm still getting errors like these:

Code: Select all

Jul 30 03:01:16 E3S kernel: [21083.991177] mce: CPU8: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991178] mce: CPU9: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991179] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991179] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991180] mce: CPU13: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991181] mce: CPU14: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991182] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.991340] mce: CPU10: Package temperature above threshold, cpu clock throttled (total events = 1)
Jul 30 03:01:16 E3S kernel: [21083.992167] mce: CPU7: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992168] mce: CPU4: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992168] mce: CPU11: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992169] mce: CPU15: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992170] mce: CPU12: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992171] mce: CPU3: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992204] mce: CPU2: Core temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992205] mce: CPU2: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992208] mce: CPU5: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992209] mce: CPU13: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992210] mce: CPU0: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992210] mce: CPU8: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992211] mce: CPU1: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992212] mce: CPU9: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992213] mce: CPU6: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992213] mce: CPU14: Package temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.992235] mce: CPU10: Core temperature/speed normal
Jul 30 03:01:16 E3S kernel: [21083.995378] mce: CPU10: Package temperature/speed normal
Jul 30 03:50:03 E3S kernel: [24010.129044] perf: interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 79750
Jul 30 05:40:02 E3S kernel: [30609.900970] mce: [Hardware Error]: Machine check events logged
Jul 31 00:00:01 E3S rsyslogd:  [origin software="rsyslogd" swVersion="8.1901.0" x-pid="725" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Jul 31 03:01:12 E3S kernel: [107480.111168] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
Jul 31 03:01:12 E3S kernel: [107480.111168] mce: CPU8: Core temperature above threshold, cpu clock throttled (total events = 1)
Jul 31 03:01:12 E3S kernel: [107480.111169] mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111171] mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111172] mce: CPU10: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111173] mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111201] mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111203] mce: CPU14: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111204] mce: CPU15: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111205] mce: CPU11: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111210] mce: CPU8: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111210] mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111211] mce: CPU9: Package temperature above threshold, cpu clock throttled (total events = 2)
Jul 31 03:01:12 E3S kernel: [107480.111212] mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 2)
Then I lose access to my server.
Any tips?

cuckooflew
Posts: 680
Joined: 2018-05-10 19:34
Location: Some where out west

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#14 Post by cuckooflew »

The server was ordered as a dedicated server, I have remote access only.
I think I would be communicating with the provider of this server. If it is hardware, and it sounds like it is, then someone with physical access will need to do the "mechanic" part.

Deb-fan
Posts: 1042
Joined: 2012-08-14 12:27

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#15 Post by Deb-fan »

Only a random thought: the same challenges that Debian stable users face with newer hardware on the desktop surely hold true for other uses such as servers. Perhaps consider installing a newer kernel and newer firmware versions. Rather than downgrading, I would go the other way, which hopefully provides improved support for the chosen hardware. You should install some monitoring software on a server anyway. A dead simple way to rule out hardware and determine whether it is an OS misconfiguration: install another GNU/Linux distro such as Ubuntu on it and see whether that install shows the same quirks and problems.

If it runs smoother without displaying similar negative behavior, the hardware is fine and Debian is not set up right for that system.
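
On the monitoring point, even without a full monitoring stack a small script can record temperatures leading up to a crash. This is a minimal sketch, assuming the kernel exposes sensors under `/sys/class/hwmon` with `temp*_input` files in millidegrees Celsius (the usual hwmon convention); the function name and output format are made up for this example.

```python
import time
from pathlib import Path

HWMON_ROOT = Path("/sys/class/hwmon")  # kernel hardware-monitoring sysfs tree


def read_temperatures(root: Path = HWMON_ROOT) -> dict:
    """Return {"<chip>/<sensor file>": degrees Celsius} for every temp*_input file."""
    readings = {}
    for sensor in sorted(root.glob("hwmon*/temp*_input")):
        name_file = sensor.parent / "name"
        chip = name_file.read_text().strip() if name_file.exists() else sensor.parent.name
        try:
            millidegrees = int(sensor.read_text())
        except (OSError, ValueError):
            continue  # some sensors cannot be read; skip them
        readings[f"{chip}/{sensor.name}"] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    # One timestamped sample; run it from cron or a loop and append to a file.
    print(time.strftime("%Y-%m-%d %H:%M:%S"), read_temperatures())
```

Appending one sample per minute to a file on disk leaves a record that survives a reboot, so the last readings before a crash can be compared against the times in the kernel log.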

GeNe64
Posts: 10
Joined: 2020-07-24 07:05

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#16 Post by GeNe64 »

cuckooflew wrote:
The server was ordered as a dedicated server, I have remote access only.
I think I would be communicating with the provider of this server, if it is hardware, and it sounds like it is, then someone with physical access will need to do the "mechanic" part.
They've tested it and said it's ok.

GeNe64
Posts: 10
Joined: 2020-07-24 07:05

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#17 Post by GeNe64 »

Deb-fan wrote: Dead simple way to rule out hardware and determine if it is Os misconfig, install a gnu/nix distro like Ubuntu onto it, does that install show the same quirks and problems ?
I need Debian specifically, because I install Proxmox on it. When I run it idle (OS, Proxmox, and 10 idle VMs), it's fine and runs for 6+ days. If I start to load the VMs (CPU, SSD, network), it crashes or otherwise fails within 1/2 days on both servers, which are in different datacenters.

I can't find anything useful in the standard logs. Any suggestions regarding monitoring software?

Deb-fan
Posts: 1042
Joined: 2012-08-14 12:27

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#18 Post by Deb-fan »

Nope, I'm only vaguely aware of what Proxmox even is: a virtualization/container type thing. I'm badly lacking in knowledge about virtualization all around. :( So which tools to use or how to approach troubleshooting Proxmox is beyond me. Sorry. Though from what those techs apparently said to you and what you're describing, this is not a hardware problem but a software one. If Proxmox has help forums, and it likely does, I would spend time there and ask the people who use that software for help and pointers on running down issues and possible fixes.

LE_746F6D617A7A69
Posts: 521
Joined: 2020-05-03 14:16

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#19 Post by LE_746F6D617A7A69 »

GeNe64 wrote:
cuckooflew wrote:
The server was ordered as a dedicated server, I have remote access only.
I think I would be communicating with the provider of this server, if it is hardware, and it sounds like it is, then someone with physical access will need to do the "mechanic" part.
They've tested it and said it's ok.
The thermal throttling alerts say otherwise: the machine has a cooling problem, which suggests that, for example, it's too hot in the server room.
Temperatures from SMART could give a clearer picture of what is happening in that data center.
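
To see how widespread the throttling is, the kernel log lines quoted earlier can be tallied per CPU. A minimal sketch, assuming the message wording shown in this thread's logs; the regular expression and function name are made up for this example and may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Matches the throttle messages as they appear in the logs quoted in this thread.
THROTTLE_RE = re.compile(
    r"mce: CPU(\d+): (Core|Package) temperature above threshold, cpu clock throttled"
)


def count_throttle_events(log_lines) -> Counter:
    """Count throttle messages per (scope, cpu number) in kernel log lines."""
    counts = Counter()
    for line in log_lines:
        match = THROTTLE_RE.search(line)
        if match:
            counts[(match.group(2), int(match.group(1)))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Jul 30 03:01:16 E3S kernel: [21083.991177] mce: CPU8: Package temperature above threshold, cpu clock throttled (total events = 1)",
        "Jul 30 03:01:16 E3S kernel: [21083.992167] mce: CPU7: Package temperature/speed normal",
        "Jul 31 03:01:12 E3S kernel: [107480.111168] mce: CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)",
    ]
    for (scope, cpu), n in sorted(count_throttle_events(sample).items()):
        print(f"{scope} CPU{cpu}: {n}")
```

Feeding it the full kernel log (for example, the lines of /var/log/kern.log) shows whether the events cluster at particular times of day; both bursts quoted in the thread occur at 03:01.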

Deb-fan
Posts: 1042
Joined: 2012-08-14 12:27

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

#20 Post by Deb-fan »

Come on now. The server room is too hot in a couple of datacenters? Or is it incorrectly configured software, with processes hammering the CPUs and causing heat and crash issues? Which of these is more likely? And again: why ask this in a mostly desktop-oriented Debian GNU/Linux community rather than in the Proxmox or KVM ones? :)