mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby GeNe64 » 2020-08-01 15:30

cuckooflew wrote:
The server was ordered as a dedicated server, I have remote access only.

I think I would be communicating with the provider of this server, if it is hardware, and it sounds like it is, then someone with physical access will need to do the "mechanic" part.

They've tested it and said it's ok.
GeNe64
 
Posts: 10
Joined: 2020-07-24 07:05

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby GeNe64 » 2020-08-01 15:37

Deb-fan wrote:Only a random thought, same challenges faced by Debian stable users with newer hardware on (desktop), certainly has to hold true in any other use(servers.) Perhaps consider installing a newer kernel and firmware versions etc. Rather than downgrading would likely be going the other way, hopefully providing improved support for the chosen hardware. Should install some monitoring software onto a server anyway. Dead simple way to rule out hardware and determine if it is Os misconfig, install a gnu/nix distro like Ubuntu onto it, does that install show the same quirks and problems ?

If runs smoother/better w/o displaying similar negative behavior, hardware's fine, Debian's not setup right for that system.


I need Debian only so that I can install Proxmox on it. When I run it unloaded (OS, Proxmox and 10 idle VMs) it's OK and runs for 6+ days. If I start to load the VMs (CPU, SSD, network), it crashes or otherwise fails in 1/2 days, on both servers in different datacenters.

I can't find anything useful in the standard logs. Any suggestions regarding monitoring software?

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby Deb-fan » 2020-08-01 15:59

Nope... I'm only vaguely aware of what Proxmox even is: some kind of virtualization/container platform. I'm badly lacking in knowledge about virtualization all around. :( So which tools to use or how to approach troubleshooting Proxmox is beyond me. Sorry. Though from what those techs apparently told you and what you're describing, this sounds like a software problem, not hardware. If Proxmox has help forums, and it likely does, I would spend time there and ask the people who use that software for help and pointers on running down the issue and possible fixes.
Most powerful FREE tech-support tool on the planet * HERE. *
Deb-fan
 
Posts: 968
Joined: 2012-08-14 12:27

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby LE_746F6D617A7A69 » 2020-08-01 16:02

GeNe64 wrote:
cuckooflew wrote:
The server was ordered as a dedicated server, I have remote access only.

I think I would be communicating with the provider of this server, if it is hardware, and it sounds like it is, then someone with physical access will need to do the "mechanic" part.

They've tested it and said it's ok.

The thermal throttling alerts say otherwise: the machine has a cooling problem, which suggests that, for example, the server room is too hot.
Temperatures from SMART could give a clearer picture of what is happening in that data center.
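
As a rough way to check this from a remote shell, here is a minimal sketch that sums the kernel's per-core thermal throttle counters. It assumes an Intel x86 machine where the kernel exposes core_throttle_count under /sys/devices/system/cpu (the files may be missing on other CPUs or inside a VM); sum_counts is just a helper name made up for this example.

```shell
#!/bin/sh
# Sum integer values given one per line on stdin (hypothetical helper).
sum_counts() {
    awk '{ total += $1 } END { print total + 0 }'
}

# Per-core throttle event counters. If the files do not exist, the glob
# stays unexpanded and the unreadable path is skipped, giving a total of 0.
for f in /sys/devices/system/cpu/cpu[0-9]*/thermal_throttle/core_throttle_count; do
    [ -r "$f" ] || continue
    cat "$f"
done | sum_counts
```

A total that keeps growing while the VMs are under load would point at cooling. The SMART temperatures have to be read separately, for example with smartctl -A from smartmontools.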
Bill Gates: "(...) In my case, I went to the garbage cans at the Computer Science Center and I fished out listings of their operating system."
The_full_story and Nothing_have_changed
LE_746F6D617A7A69
 
Posts: 414
Joined: 2020-05-03 14:16

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby Deb-fan » 2020-08-01 16:23

Come on now. The server room is too hot in a couple of datacenters? Or is it incorrectly configured software with processes hammering the CPUs, causing heat and crashes? Which of these is more likely? And again... why ask this in a mostly desktop-oriented Debian community rather than in a Proxmox or KVM one? :)

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby LE_746F6D617A7A69 » 2020-08-01 16:34

Thermal throttling means insufficient cooling - this is a fact, not a guess.

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby CwF » 2020-08-01 20:20

GeNe64 wrote:Any tips?

...get your money back.
First off, that isn't exactly server grade hardware.
Most server grade stuff would never throttle due to core temp. The socket temp would be the trigger in a proper setup.
CwF
 
Posts: 814
Joined: 2018-06-20 15:16

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby LE_746F6D617A7A69 » 2020-08-02 00:11

CwF wrote:
GeNe64 wrote:Any tips?

...get your money back.
First off, that isn't exactly server grade hardware.
Most server grade stuff would never throttle due to core temp. The socket temp would be the trigger in a proper setup.
+1 ;)

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby GeNe64 » 2020-08-12 07:51

Finally, I've solved the issue by adding intel_idle.max_cstate=1 to the file /etc/default/grub:
Code:
GRUB_CMDLINE_LINUX_DEFAULT="consoleblank=0 intel_idle.max_cstate=1"

then running (as root)
Code:
# update-grub

and rebooting.
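
To confirm after the reboot that the setting took effect, something like the following sketch could be used. It assumes a Linux system with /proc/cmdline and, for the second check, that the intel_idle driver is in use and exposes its parameter under /sys/module; has_param is a hypothetical helper written for this example, not an existing tool.

```shell
#!/bin/sh
# has_param CMDLINE PARAM: succeed if PARAM appears as a whole word in CMDLINE.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Command line of the kernel that is running right now.
cmdline=$(cat /proc/cmdline 2>/dev/null || true)
if has_param "$cmdline" "intel_idle.max_cstate=1"; then
    echo "intel_idle.max_cstate=1 is active on the running kernel"
else
    echo "parameter not found on the running kernel command line"
fi

# The intel_idle driver also reports the limit it is using, when loaded.
p=/sys/module/intel_idle/parameters/max_cstate
if [ -r "$p" ]; then
    echo "intel_idle max_cstate: $(cat "$p")"
fi
```

If the parameter is missing, the usual suspects are a typo in /etc/default/grub or a forgotten update-grub.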

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby LE_746F6D617A7A69 » 2020-08-12 14:01

GeNe64 wrote:Finally, I've solved the issue by adding intel_idle.max_cstate=1
This makes no sense at all.
In your previous post you said that the connection was lost after a series of warnings saying that the critical temperature had been reached. In practice this means that the CPU nearly melted down (max Tjunction is 100 °C, and the threshold is 95 °C).
The kernel parameter intel_idle.max_cstate=1 limits the CPU to the C1 idle state, which effectively disables idle power saving - how could it help in this situation?
Yes, there was a problem with hard lockups in Bay Trail CPUs, where this option was used as a workaround - but this is not the case here.
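
To see what the parameter actually changes on a given machine, one could list the idle states the kernel exposes before and after setting it. This is only a sketch, assuming the standard cpuidle sysfs layout under /sys/devices/system/cpu/cpu0/cpuidle; list_cstates is a made-up helper name. With max_cstate=1 I would expect the deeper states to disappear from the list, but the exact names depend on the CPU.

```shell
#!/bin/sh
# Print "index name" for each idle state found in a cpuidle directory
# (hypothetical helper; takes the directory as $1).
list_cstates() {
    for d in "$1"/state[0-9]*; do
        # Skip the unexpanded glob or states without a readable name.
        [ -r "$d/name" ] || continue
        printf '%s %s\n' "${d##*/state}" "$(cat "$d/name")"
    done
}

list_cstates /sys/devices/system/cpu/cpu0/cpuidle
```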

I think that some other factors could have come into play here - like, for example, someone has "fixed" a problem with the air conditioning system by opening all the doors and windows in that not-so-cold room ;)

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby GeNe64 » 2020-08-12 14:26

LE_746F6D617A7A69 wrote:This makes no sense at all.
In your previous post you said that the connection was lost after a series of warnings saying that the critical temperature had been reached. In practice this means that the CPU nearly melted down (max Tjunction is 100 °C, and the threshold is 95 °C).
The kernel parameter intel_idle.max_cstate=1 limits the CPU to the C1 idle state, which effectively disables idle power saving - how could it help in this situation?
Yes, there was a problem with hard lockups in Bay Trail CPUs, where this option was used as a workaround - but this is not the case here.

I think that some other factors could have come into play here - like, for example, someone has "fixed" a problem with the air conditioning system by opening all the doors and windows in that not-so-cold room ;)

Yep, that's the problem. The bug is very strange and is described here: https://forum.proxmox.com/threads/rando ... 597/page-3
Nothing useful shows up in the logs, yet the server keeps crashing.
I tried to link the odd log messages (temperature, mce, etc.) to the crashes, but that got me nowhere.
It's a bug in Intel CPUs that can be fixed by adding intel_idle.max_cstate=1

Re: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:

Postby LE_746F6D617A7A69 » 2020-08-12 15:51

GeNe64 wrote:It's a bug in Intel CPUs that can be fixed by adding intel_idle.max_cstate=1
So it happened Again?! :shock:
Anyway, it's good to know...