[Hardware] Regular Freeze when computing

furby_goulag · #1 Post by **furby_goulag** » 2024-09-03 14:00

Hello everyone,

As this is my first post, do not hesitate to ask me to update it or move it if it is not in the correct sub-forum.
I consider myself a beginner on Debian.

I set up a server for CFD calculations for our small company, with Debian 12 on it.
It has dual Xeon CPUs and some RAM; it is second-hand hardware.

When I first set up the server a month ago, one RAM module did not seem to be detected by the BIOS. I swapped two RAM modules and it worked, so I set up memtest86 in GRUB and tested all the RAM for a few hours, with no errors.
But as you will see with the current bug, I still suspect the RAM.

It worked almost flawlessly at first, but now it freezes often (about once a day) when CPU usage is high (a simulation running).
What I do not understand is that memtest86 no longer works; it now fails with: error file "EFI/memtest86/BOOTX64.efi" not available. I am currently trying to solve this, as the RAM is my first suspect.

When the crashes started, I updated the system. I also read that the NVIDIA drivers could cause crashes, so I switched to the proprietary drivers.
But it did not change anything.

The logs I have are the following:

DEBIAN 12 crash log

sudo journalctl --since "1 hour ago"

Code: Select all

sept. 03 11:35:17 SIL3XHPC01 sshd[6037]: Accepted password for matthieu from 81.220.138.244 port 49916 ssh2
sept. 03 11:35:17 SIL3XHPC01 sshd[6037]: pam_unix(sshd:session): session opened for user matthieu(uid=1001) by (uid=0)
sept. 03 11:35:17 SIL3XHPC01 systemd-logind[1785]: New session 11 of user matthieu.
sept. 03 11:35:17 SIL3XHPC01 systemd[1]: Started session-11.scope - Session 11 of User matthieu.
sept. 03 11:35:17 SIL3XHPC01 sshd[6037]: pam_env(sshd:session): deprecated reading of user environment enabled
sept. 03 11:37:00 SIL3XHPC01 kernel: hugetlbfs: cs_solver (6656): Using mlock ulimits for SHM_HUGETLB is obsolete
sept. 03 11:38:09 SIL3XHPC01 dbus-daemon[3028]: [session uid=1001 pid=3028] Activating service name='org.gnome.gedit' requested by ':1.100' (uid=1001 pid=4415 comm="/usr/bin/nautilus --gapplication-service")
sept. 03 11:38:09 SIL3XHPC01 dbus-daemon[3028]: [session uid=1001 pid=3028] Successfully activated service 'org.gnome.gedit'
sept. 03 11:38:16 SIL3XHPC01 dbus-daemon[3028]: [session uid=1001 pid=3028] Activating service name='org.gnome.gedit' requested by ':1.100' (uid=1001 pid=4415 comm="/usr/bin/nautilus --gapplication-service")
sept. 03 11:38:16 SIL3XHPC01 dbus-daemon[3028]: [session uid=1001 pid=3028] Successfully activated service 'org.gnome.gedit'
sept. 03 11:39:36 SIL3XHPC01 dbus-daemon[3028]: [session uid=1001 pid=3028] Activating service name='org.gnome.gedit' requested by ':1.100' (uid=1001 pid=4415 comm="/usr/bin/nautilus --gapplication-service")
sept. 03 11:39:36 SIL3XHPC01 dbus-daemon[3028]: [session uid=1001 pid=3028] Successfully activated service 'org.gnome.gedit'
sept. 03 11:41:39 SIL3XHPC01 gnome-shell[3360]: Can't update stage views actor <unnamed>[<MetaWindowActorX11>:0x5584fe7dc750] is on because it needs an allocation.
sept. 03 11:41:39 SIL3XHPC01 gnome-shell[3360]: Can't update stage views actor <unnamed>[<MetaSurfaceActorX11>:0x5584f8fb99a0] is on because it needs an allocation.
sept. 03 11:41:45 SIL3XHPC01 dbus-daemon[3028]: [session uid=1001 pid=3028] Activating service name='org.gnome.gedit' requested by ':1.100' (uid=1001 pid=4415 comm="/usr/bin/nautilus --gapplication-service")
sept. 03 11:41:45 SIL3XHPC01 dbus-daemon[3028]: [session uid=1001 pid=3028] Successfully activated service 'org.gnome.gedit'
sept. 03 11:41:46 SIL3XHPC01 gnome-shell[3360]: Can't update stage views actor <unnamed>[<MetaWindowActorX11>:0x5584f80d7f30] is on because it needs an allocation.
sept. 03 11:41:46 SIL3XHPC01 gnome-shell[3360]: Can't update stage views actor <unnamed>[<MetaSurfaceActorX11>:0x5584f8fb9610] is on because it needs an allocation.
sept. 03 11:44:47 SIL3XHPC01 dbus-daemon[3028]: [session uid=1001 pid=3028] Activating service name='org.gnome.gedit' requested by ':1.100' (uid=1001 pid=4415 comm="/usr/bin/nautilus --gapplication-service")
sept. 03 11:44:47 SIL3XHPC01 dbus-daemon[3028]: [session uid=1001 pid=3028] Successfully activated service 'org.gnome.gedit'
sept. 03 11:44:48 SIL3XHPC01 gnome-shell[3360]: Can't update stage views actor <unnamed>[<MetaWindowActorX11>:0x5584fe7dc750] is on because it needs an allocation.
sept. 03 11:44:48 SIL3XHPC01 gnome-shell[3360]: Can't update stage views actor <unnamed>[<MetaSurfaceActorX11>:0x5584f8fb9d30] is on because it needs an allocation.
sept. 03 11:45:01 SIL3XHPC01 dbus-daemon[3028]: [session uid=1001 pid=3028] Activating service name='org.gnome.gedit' requested by ':1.100' (uid=1001 pid=4415 comm="/usr/bin/nautilus --gapplication-service")
sept. 03 11:45:01 SIL3XHPC01 dbus-daemon[3028]: [session uid=1001 pid=3028] Successfully activated service 'org.gnome.gedit'
sept. 03 11:45:02 SIL3XHPC01 gnome-shell[3360]: Can't update stage views actor <unnamed>[<MetaWindowActorX11>:0x5584f80d7f30] is on because it needs an allocation.
sept. 03 11:45:02 SIL3XHPC01 gnome-shell[3360]: Can't update stage views actor <unnamed>[<MetaSurfaceActorX11>:0x5584f8fb99a0] is on because it needs an allocation.
sept. 03 11:45:03 SIL3XHPC01 gnome-shell[3360]: Object .Gjs_ui_messageTray_Notification (0x5584f9255b40), has been already disposed — impossible to emit any signal on it. This might be caused by the object h>
sept. 03 11:45:03 SIL3XHPC01 gnome-shell[3360]: == Stack trace for context 0x5584f7734190 ==
sept. 03 11:45:03 SIL3XHPC01 gnome-shell[3360]: #0   5584f7ec5418 i   resource:///org/gnome/shell/ui/messageTray.js:493 (18e5e6821a10 @ 69)
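
Most of the excerpt above is desktop noise (dbus-daemon, gnome-shell), which makes it easy to miss lines from the kernel. As a rough sketch only, the following filter keeps the lines from chosen sources. It assumes the text was saved from journalctl in the default short format shown above; the function name `filter_journal` and the regular expression are illustrative, not part of any tool.

```python
import re

# Matches journalctl short-format lines such as:
#   "sept. 03 11:37:00 SIL3XHPC01 kernel: hugetlbfs: ..."
#   "sept. 03 11:35:17 SIL3XHPC01 sshd[6037]: Accepted ..."
# The pattern is an assumption based on the excerpt in this post.
LINE_RE = re.compile(
    r"^(?P<stamp>\S+ \d{2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<source>[^\[:\s]+)(?:\[\d+\])?: (?P<message>.*)$"
)


def filter_journal(text, sources=("kernel",)):
    """Return the lines of `text` whose source (e.g. 'kernel') is in `sources`."""
    kept = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group("source") in sources:
            kept.append(line)
    return kept
```

Since the machine freezes hard, the interesting lines are likely to be the last ones written before the freeze. If the journal is persistent, `journalctl -k -b -1` shows kernel messages from the previous boot, and its output could be passed through the same filter.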

~$ inxi -b

Code: Select all

System:
  Host: SIL3XHPC01 Kernel: 6.1.0-25-amd64 arch: x86_64 bits: 64 Console: pty pts/0 Distro: Debian
    GNU/Linux 12 (bookworm)
Machine:
  Type: Desktop System: HP product: HP Z8 G4 Workstation v: SBKPF,DWKSBLF
    serial: <superuser required>
  Mobo: HP model: 81C7 v: MVB 0C serial: <superuser required> UEFI: HP v: P60 v02.94
    date: 05/17/2024
CPU:
  Info: 2x 28-core Intel Xeon Platinum 8276 [MT MCP SMP] speed (MHz): avg: 1000 min/max: 1000/4000
Graphics:
  Device-1: NVIDIA GP107GL [Quadro P1000] driver: nvidia v: 535.183.01
  Display: server: X.org v: 1.21.1.7 with: Xwayland v: 22.1.9 driver: X: loaded: nvidia
    unloaded: fbdev,modesetting,nouveau,vesa gpu: nvidia tty: 208x30
  API: OpenGL Message: GL data unavailable in console. Try -G --display
Network:
  Device-1: Intel Ethernet I219-LM driver: e1000e
  Device-2: Intel Ethernet X722 driver: N/A
  Device-3: Intel Ethernet X722 for 1GbE driver: i40e
Drives:
  Local Storage: total: 25.55 TiB used: 776.18 GiB (3.0%)
Info:
  Processes: 980 Uptime: 37m Memory: 376.58 GiB used: 5.32 GiB (1.4%) Init: systemd
  target: graphical (5) Shell: Bash inxi: 3.3.26

If you have any clues it would be very helpful! Thanks

FG

#2 Post by **CwF** » 2024-09-03 14:59

Check or redo cpu cooler mounting.
Nouveau may be more stable if proprietary features are not needed.
memtester can be used instead, though it won't test all of the memory at once.
https://packages.debian.org/bookworm/memtester

furby_goulag · #3 Post by **furby_goulag** » 2024-09-03 16:36

OK, thanks. I also suspect temperature issues.
I will redo that.

I solved the memtest86 issue and it is currently running.

furby_goulag · #4 Post by **furby_goulag** » 2024-09-04 11:21

I solved the memtest86 issues and ran the whole series of tests. No errors detected.

So I opened the computer to check for issues. All fans are running correctly and everything seems to be connected correctly.
I moved the PC to a more open area for better cooling.

It ran for a few hours but then crashed again.

I don't have thermal paste with me right now, so I will wait before removing the two heatsinks.

Any ideas on how I can narrow my search, either with temperature monitoring or something else?

Thanks

#5 Post by **CwF** » 2024-09-04 14:35

If this is a sudden freeze with no auto reboot and no beeps, which lends weight to heat-related hardware failure theories, then do you happen to have Commander Tucker's laser temp gauge? Those temp gadgets that we decided could accurately measure people's temperature during covid are helpful for finding abnormal hot spots, just not on people. Check for any out-of-the-ordinary areas, like around the power regulator section and auxiliary chips that have small heatsinks. Use the eraser end of a pencil or something similar to nudge things, checking for looseness. Also consider the power supply itself.

Since you mentioned it is a dual socket big boy there is a tactic that often leads to rejection and not solution. In the bios of this type you should find some clocking controls, names and places vary, but you should be able to limit things like base duty clock, turbo parameters, memory clock, and cpu interconnect clock (QPI?). Slow it down and test again. Cut the memory and qpi clocks in half to start.

Big server boards are often designed for case-flow cooling and not desktop-style CPU-only cooling. Often there is no CPU fan, just a big heatsink, expecting 3-4+ case fans all in one direction, PS fan not included. Boards designed for a ducted case are difficult for the layman to transplant.

The best verification is alternate hardware running the same installation to verify the software as stable. Nvidia complicates simply moving the install, downgrade to nouveau first regardless of the destination if alternate hardware is a possibility for testing.

furby_goulag wrote: 2024-09-04 11:21 Any ideas on how I can narrow my search, either with temperature monitoring

Here's a fun one with info on desktop agnostic ways to find temps in any version;
https://forums.debian.net/viewtopic.php ... 70#p804870

Good luck!

furby_goulag · #6 Post by **furby_goulag** » 2024-09-04 21:04

Thanks @CwF

I created a small script to log the CPU and NVME disks temperatures every minute.
Here is an extract when running calculations:

Code: Select all

2024-09-04 22:37:55, CPU0 Temp: +89.0°C, CPU1 Temp: +90.0°C, nvme-pci-1500: +27.9°C, nvme-pci-9900: +26.9°C, CPU0 Load: 100.00%, CPU1 Load: 100.00%
2024-09-04 22:39:05, CPU0 Temp: +90.0°C, CPU1 Temp: +89.0°C, nvme-pci-1500: +27.9°C, nvme-pci-9900: +26.9°C, CPU0 Load: 100.00%, CPU1 Load: 100.00%
2024-09-04 22:40:15, CPU0 Temp: +89.0°C, CPU1 Temp: +88.0°C, nvme-pci-1500: +27.9°C, nvme-pci-9900: +26.9°C, CPU0 Load: 100.00%, CPU1 Load: 100.00%
2024-09-04 22:41:25, CPU0 Temp: +90.0°C, CPU1 Temp: +89.0°C, nvme-pci-1500: +27.9°C, nvme-pci-9900: +27.9°C, CPU0 Load: 100.00%, CPU1 Load: 100.00%
2024-09-04 22:42:35, CPU0 Temp: +90.0°C, CPU1 Temp: +88.0°C, nvme-pci-1500: +27.9°C, nvme-pci-9900: +26.9°C, CPU0 Load: 84.69%, CPU1 Load: 100.00%
2024-09-04 22:43:45, CPU0 Temp: +90.0°C, CPU1 Temp: +89.0°C, nvme-pci-1500: +27.9°C, nvme-pci-9900: +26.9°C, CPU0 Load: 100.00%, CPU1 Load: 100.00%
2024-09-04 22:44:55, CPU0 Temp: +90.0°C, CPU1 Temp: +89.0°C, nvme-pci-1500: +27.9°C, nvme-pci-9900: +26.9°C, CPU0 Load: 100.00%, CPU1 Load: 100.00%
2024-09-04 22:46:05, CPU0 Temp: +89.0°C, CPU1 Temp: +89.0°C, nvme-pci-1500: +27.9°C, nvme-pci-9900: +26.9°C, CPU0 Load: 100.00%, CPU1 Load: 100.00%
2024-09-04 22:47:15, CPU0 Temp: +90.0°C, CPU1 Temp: +89.0°C, nvme-pci-1500: +27.9°C, nvme-pci-9900: +26.9°C, CPU0 Load: 100.00%, CPU1 Load: 70.00%
2024-09-04 22:48:25, CPU0 Temp: +91.0°C, CPU1 Temp: +91.0°C, nvme-pci-1500: +27.9°C, nvme-pci-9900: +26.9°C, CPU0 Load: 100.00%, CPU1 Load: 100.00%

At full load both CPUs run at around 90°C, which seems high but not terrifying.
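
For anyone who wants to summarise a log like this, here is a minimal sketch that extracts the peak temperature per CPU and lists the entries above a chosen limit. It assumes the exact line format of the extract above; the function names `peak_cpu_temps` and `lines_above` are made up for illustration, and the 90°C limit in the usage note is only an example value, not a recommended threshold.

```python
import re

# Matches fields such as "CPU0 Temp: +89.0°C" from the log extract above.
ENTRY_RE = re.compile(r"(CPU\d+) Temp: ([+-]?\d+(?:\.\d+)?)°C")


def peak_cpu_temps(log_text):
    """Return a dict mapping each CPU name to its highest logged temperature."""
    peaks = {}
    for name, value in ENTRY_RE.findall(log_text):
        temp = float(value)
        if name not in peaks or temp > peaks[name]:
            peaks[name] = temp
    return peaks


def lines_above(log_text, limit_c):
    """Return the log lines where any CPU temperature is strictly above limit_c."""
    hot = []
    for line in log_text.splitlines():
        temps = [float(value) for _, value in ENTRY_RE.findall(line)]
        if temps and max(temps) > limit_c:
            hot.append(line)
    return hot
```

Calling `lines_above(text, 90.0)` on the extract would keep only the entries where a CPU exceeded 90°C, which makes it easier to compare the minutes before a freeze with normal operation.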

I happen to have a thermal camera, so I also did a check with it.

The target is on one of the CPU coolers; the hotspots above and below it are the RAM.

The CPU coolers are both big, and each is fitted with a fan.

Tomorrow I will remove the CPU heatsinks to check that they are correctly mounted with thermal paste.

FG

#7 Post by **CwF** » 2024-09-04 23:33

furby_goulag wrote: 2024-09-04 21:04 I happen to have a thermal camera so I had also done a check.

Excellent.
90°C for a Xeon under full load is fine. Some get there more easily than others, and most have headroom left at that temp. Memory does get hot! I assume the power input area and regulators are under control too? I am always skeptical of NVMe heat issues. You're on track.

For fun, add 'cat /sys/devices/system/cpu/cpu0/thermal_throttle/package_throttle_count'
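
The one-liner above only reads the counter for cpu0. As a sketch, assuming the usual Linux sysfs layout on Intel systems (a `thermal_throttle` directory per logical CPU, containing `package_throttle_count` and typically `core_throttle_count`), the counters for every CPU could be collected like this. The function names are illustrative.

```python
from pathlib import Path


def read_throttle_counts(cpu_root="/sys/devices/system/cpu"):
    """Collect all *_throttle_count values per logical CPU under cpu_root.

    Returns e.g. {"cpu0": {"core_throttle_count": 0, "package_throttle_count": 0}}.
    Counters that cannot be read or parsed are skipped.
    """
    counts = {}
    pattern = "cpu[0-9]*/thermal_throttle/*_throttle_count"
    for counter in sorted(Path(cpu_root).glob(pattern)):
        cpu = counter.parent.parent.name  # e.g. "cpu0"
        try:
            value = int(counter.read_text().strip())
        except (OSError, ValueError):
            continue
        counts.setdefault(cpu, {})[counter.name] = value
    return counts


def throttled(counts):
    """Return only the CPUs that report at least one non-zero counter."""
    return {
        cpu: values
        for cpu, values in counts.items()
        if any(v > 0 for v in values.values())
    }
```

An empty result from `throttled(read_throttle_counts())` after a long run at full load would suggest that thermal throttling has not been triggered since boot.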

furby_goulag · #8 Post by **furby_goulag** » 2024-09-05 06:03

CwF wrote: 2024-09-04 23:33 For fun, add 'cat /sys/devices/system/cpu/cpu0/thermal_throttle/package_throttle_count'

It returns 0.

The system has run for 12 hours under full load without issues.

NVMe temperatures seem fine.

furby_goulag · #9 Post by **furby_goulag** » 2024-09-05 20:52

Hi, I just unmounted the CPU coolers (at first glance they were clean and the thermal paste was not dry).
I cleaned everything and re-applied thermal paste.

Temperatures under full load are the same.

No crash today.

mrmazda · #10 Post by **mrmazda** » 2024-09-07 05:21

How old is this PSU? Is it known good? Adequate? If out of warranty, pop its cover and make sure it has no leaky or swollen electrolytic capacitors to cause instability.

Does your Quadro require use of a separate power supply? Is it connected?

Is there a motherboard GPU you could test with instead of using the Quadro? Inxi might not show a disabled GPU.

Also, Bookworm's inxi is ancient and broken. If you edit /etc/inxi.conf to enable the internal updater, you can update to the latest version, 3.3.36, using its -U switch as root user or with sudo. Also, -F instead of -b will show more information, and much more yet with -Faz.

Debian User Forums
