HP ProLiant ML350 G6 Hangs while booting

nottoday
Posts: 7
Joined: 2022-08-09 12:24

#1 Post by nottoday »

I have an HP ProLiant ML350 G6 server running Debian 11 (bullseye), and it occasionally hangs on boot. I do not have a graphical interface installed.

I've tried to install the firmware-linux-nonfree package, but it did not help.

I compared the boot logs from a boot that hung and one that succeeded. Apart from the failed boot's log stopping at some point, they don't seem to differ.

Does anyone have suggestions as to what might cause this, or ideas on how to diagnose the problem?

These are the logs for when it did not boot.

Code: Select all

journalctl -b -1 --priority 4

Code: Select all

-- Journal begins at Tue 2021-10-19 14:39:39 CEST, ends at Fri 2022-07-15 15:16:49 CEST. --
Jul 15 14:37:59 sever-Remi kernel: ACPI BIOS Warning (bug): Invalid length for FADT/Pm1aControlBlock: 32, using default 16 (20200925/tbfadt-669)
Jul 15 14:37:59 sever-Remi kernel: ACPI BIOS Warning (bug): Invalid length for FADT/Pm2ControlBlock: 32, using default 8 (20200925/tbfadt-669)
Jul 15 14:37:59 sever-Remi kernel: ACPI: SPCR: Unexpected SPCR Access Width.  Defaulting to byte size
Jul 15 14:37:59 sever-Remi kernel: DMAR-IR: This system BIOS has enabled interrupt remapping
                                   on a chipset that contains an erratum making that
                                   feature unstable.  To maintain system stability
                                   interrupt remapping is being disabled.  Please
                                   contact your BIOS vendor for an update
Jul 15 14:37:59 sever-Remi kernel: [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
Jul 15 14:37:59 sever-Remi kernel: Intel PMU driver.
Jul 15 14:37:59 sever-Remi kernel: core: CPUID marked event: 'bus cycles' unavailable
Jul 15 14:37:59 sever-Remi kernel: MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
Jul 15 14:37:59 sever-Remi kernel:   #5  #6  #7
Jul 15 14:37:59 sever-Remi kernel: ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
Jul 15 14:37:59 sever-Remi kernel: ERST: Failed to get Error Log Address Range.
Jul 15 14:37:59 sever-Remi kernel: ACPI Warning: SystemIO range 0x0000000000000928-0x000000000000092F conflicts with OpRegion 0x0000000000000920-0x000000000000092F (\SGPE) (20200925/utaddress-204)
Jul 15 14:37:59 sever-Remi kernel: lpc_ich: Resource conflict(s) found affecting gpio_ich
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:0:0: Power-on or device reset occurred
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:1:0: Power-on or device reset occurred
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:2:0: Power-on or device reset occurred
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:3:0: Power-on or device reset occurred
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:4:0: Power-on or device reset occurred
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:5:0: Power-on or device reset occurred
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:6:0: Power-on or device reset occurred
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:7:0: Power-on or device reset occurred
And these for when it did.

Code: Select all

journalctl -b 0 --priority 4

Code: Select all

-- Journal begins at Tue 2021-10-19 14:39:39 CEST, ends at Fri 2022-07-15 15:14:38 CEST. --
Jul 15 14:58:14 sever-Remi kernel: ACPI BIOS Warning (bug): Invalid length for FADT/Pm1aControlBlock: 32, using default 16 (20200925/tbfadt-669)
Jul 15 14:58:14 sever-Remi kernel: ACPI BIOS Warning (bug): Invalid length for FADT/Pm2ControlBlock: 32, using default 8 (20200925/tbfadt-669)
Jul 15 14:58:14 sever-Remi kernel: ACPI: SPCR: Unexpected SPCR Access Width.  Defaulting to byte size
Jul 15 14:58:14 sever-Remi kernel: DMAR-IR: This system BIOS has enabled interrupt remapping
                                   on a chipset that contains an erratum making that
                                   feature unstable.  To maintain system stability
                                   interrupt remapping is being disabled.  Please
                                   contact your BIOS vendor for an update
Jul 15 14:58:14 sever-Remi kernel: [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
Jul 15 14:58:14 sever-Remi kernel: Intel PMU driver.
Jul 15 14:58:14 sever-Remi kernel: core: CPUID marked event: 'bus cycles' unavailable
Jul 15 14:58:14 sever-Remi kernel: MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
Jul 15 14:58:14 sever-Remi kernel:   #5  #6  #7
Jul 15 14:58:14 sever-Remi kernel: ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
Jul 15 14:58:14 sever-Remi kernel: ERST: Failed to get Error Log Address Range.
Jul 15 14:58:14 sever-Remi kernel: ACPI Warning: SystemIO range 0x0000000000000928-0x000000000000092F conflicts with OpRegion 0x0000000000000920-0x000000000000092F (\SGPE) (20200925/utaddress-204)
Jul 15 14:58:14 sever-Remi kernel: lpc_ich: Resource conflict(s) found affecting gpio_ich
Jul 15 14:58:14 sever-Remi kernel: scsi 0:0:0:0: Power-on or device reset occurred
Jul 15 14:58:14 sever-Remi kernel: scsi 0:0:1:0: Power-on or device reset occurred
Jul 15 14:58:14 sever-Remi kernel: scsi 0:0:2:0: Power-on or device reset occurred
Jul 15 14:58:14 sever-Remi kernel: scsi 0:0:3:0: Power-on or device reset occurred
Jul 15 14:58:14 sever-Remi kernel: scsi 0:0:4:0: Power-on or device reset occurred
Jul 15 14:58:14 sever-Remi kernel: scsi 0:0:5:0: Power-on or device reset occurred
Jul 15 14:58:14 sever-Remi kernel: scsi 0:0:6:0: Power-on or device reset occurred
Jul 15 14:58:14 sever-Remi kernel: scsi 0:0:7:0: Power-on or device reset occurred
Jul 15 14:58:16 sever-Remi kernel: pcc_cpufreq_init: Too many CPUs, dynamic performance scaling disabled
Jul 15 14:58:16 sever-Remi kernel: pcc_cpufreq_init: Try to enable another scaling driver through BIOS settings
Jul 15 14:58:16 sever-Remi kernel: pcc_cpufreq_init: and complain to the system vendor
Jul 15 14:58:16 sever-Remi kernel: cpufreq: Can't use schedutil governor as dynamic switching is disallowed. Fallback to performance governor
Jul 15 14:58:16 sever-Remi kernel: cpufreq: Can't use schedutil governor as dynamic switching is disallowed. Fallback to performance governor
Jul 15 14:58:16 sever-Remi kernel: cpufreq: Can't use schedutil governor as dynamic switching is disallowed. Fallback to performance governor
Jul 15 14:58:16 sever-Remi kernel: cpufreq: Can't use schedutil governor as dynamic switching is disallowed. Fallback to performance governor
Jul 15 14:58:16 sever-Remi kernel: cpufreq: Can't use schedutil governor as dynamic switching is disallowed. Fallback to performance governor
Jul 15 14:58:16 sever-Remi kernel: cpufreq: Can't use schedutil governor as dynamic switching is disallowed. Fallback to performance governor
Jul 15 14:58:16 sever-Remi kernel: cpufreq: Can't use schedutil governor as dynamic switching is disallowed. Fallback to performance governor
Jul 15 14:58:16 sever-Remi kernel: cpufreq: Can't use schedutil governor as dynamic switching is disallowed. Fallback to performance governor
Jul 15 14:58:16 sever-Remi kernel: power_meter ACPI000D:00: Ignoring unsafe software power cap!
Jul 15 14:58:16 sever-Remi kernel: power_meter ACPI000D:00: hwmon_device_register() is deprecated. Please convert the driver to use hwmon_device_register_with_info().
Jul 15 14:58:16 sever-Remi kernel: ipmi_si 0000:01:04.6: Could not setup I/O space
Jul 15 14:58:16 sever-Remi kernel: spl: loading out-of-tree module taints kernel.
Jul 15 14:58:16 sever-Remi kernel: znvpair: module license 'CDDL' taints kernel.
Jul 15 14:58:16 sever-Remi kernel: Disabling lock debugging due to kernel taint
Jul 15 14:58:16 sever-Remi kernel: kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using workaround
Jul 15 14:58:21 sever-Remi NetworkManager[876]: <warn>  [1657889901.0283] ifupdown: interfaces file /etc/network/interfaces.d/* doesn't exist
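
One way to narrow down where a hung boot stopped is to take the last message it logged and look at what a successful boot logged right after that same message. The sketch below is only an illustration under assumptions: `next_after_hang` is a made-up helper name, and both inputs are assumed to be text files with one message per line and no timestamp prefix (for example saved with `journalctl -k -b -1 -o cat` for the previous boot and `journalctl -k -b 0 -o cat` for the current one).

```shell
#!/bin/sh
# Sketch: show what a good boot logged right after the point where a
# hung boot stopped. next_after_hang is a hypothetical helper, not part
# of any package.
# Usage: next_after_hang FAILED_LOG GOOD_LOG [COUNT]
next_after_hang() {
    failed_log=$1
    good_log=$2
    count=${3:-5}

    # Last message the hung boot managed to write.
    last_line=$(tail -n 1 "$failed_log")

    # Find the first identical message in the good boot and print the
    # COUNT messages that follow it. The message is passed through the
    # environment so that backslashes in it are not interpreted by awk.
    LAST_LINE="$last_line" awk -v n="$count" '
        found && shown < n { print; shown++ }
        !found && $0 == ENVIRON["LAST_LINE"] { found = 1 }
    ' "$good_log"
}
```

The match is on exact text, so it only works when the last message of the hung boot also appears verbatim in the good boot. Messages that embed changing values (PIDs, addresses) would need a looser comparison.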

bw123
Posts: 4015
Joined: 2011-05-09 06:02

#2 Post by bw123 »

I get warm-boot hangs occasionally. They seem to happen less frequently when nothing is plugged into USB.
There are several kernels available right now,
linux-image-5.10.0-13-amd64 through linux-image-5.10.0-16-amd64, and 5.18.2-1~bpo11+1,
so if it's a big problem I might try to find one that works reliably and stick with it for a bit.

P.S. Did you look into the "the BIOS has corrupted hw-PMU resources" thing?

#3 Post by nottoday »

Thanks for your response bw123,

I did look into the “corrupted hw-PMU” thing, but nothing helpful yet :(

I did try resetting the NVRAM, but again without result. So I guess I'll keep at it for now.

Sorry for the late response; I didn't have much time to dig into it.

Head_on_a_Stick
Posts: 14114
Joined: 2014-06-01 17:46
Location: London, England

#4 Post by Head_on_a_Stick »

Next time it happens post (a link to) the full journal. Filtering by priority can miss important context.

#5 Post by nottoday »

Head_on_a_Stick wrote: 2022-08-31 15:06 Next time it happens post (a link to) the full journal. Filtering by priority can miss important context.
I can't post it here because there's a limit of 60 000 characters, so I made a link.

https://docs.google.com/document/d/1NLH ... 0b7rM/edit

I'm sorry for such a late response again. I'm not giving the impression I would like to give. :(

cynwulf

#6 Post by cynwulf »

Research these ones:

Code: Select all

Jul 15 14:37:59 sever-Remi kernel: ACPI: SPCR: Unexpected SPCR Access Width.  Defaulting to byte size
Jul 15 14:37:59 sever-Remi kernel: DMAR-IR: This system BIOS has enabled interrupt remapping
                                   on a chipset that contains an erratum making that
                                   feature unstable.  To maintain system stability
                                   interrupt remapping is being disabled.  Please
                                   contact your BIOS vendor for an update
Jul 15 14:37:59 sever-Remi kernel: [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
Jul 15 14:37:59 sever-Remi kernel: Intel PMU driver.
Jul 15 14:37:59 sever-Remi kernel: core: CPUID marked event: 'bus cycles' unavailable
Jul 15 14:37:59 sever-Remi kernel: MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
Jul 15 14:37:59 sever-Remi kernel:   #5  #6  #7
Jul 15 14:37:59 sever-Remi kernel: ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
Jul 15 14:37:59 sever-Remi kernel: ERST: Failed to get Error Log Address Range.
Jul 15 14:37:59 sever-Remi kernel: ACPI Warning: SystemIO range 0x0000000000000928-0x000000000000092F conflicts with OpRegion 0x0000000000000920-0x000000000000092F (\SGPE) (20200925/utaddress-204)
Jul 15 14:37:59 sever-Remi kernel: lpc_ich: Resource conflict(s) found affecting gpio_ich
Consider a BIOS update.

I have two older rack servers (HP ProLiant DL380 G6), albeit running FreeBSD, rather than Linux:

I see the same two "FADT" warnings...

Code: Select all

ACPI APIC Table: <HP     ProLiant>
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 2 package(s) x 4 core(s)
random: unblocking device.
Firmware Warning (ACPI): Invalid length for FADT/Pm1aControlBlock: 32, using default 16 (20201113/tbfadt-850)
Firmware Warning (ACPI): Invalid length for FADT/Pm2ControlBlock: 32, using default 8 (20201113/tbfadt-850)
... but not the part quoted above and no issues.

#7 Post by nottoday »

cynwulf wrote: 2022-09-16 15:45 Consider a BIOS update.
I did update the BIOS, but the problem is still the same.

cynwulf wrote: 2022-09-16 15:45 Research these ones:
I'm still looking into it, but I've found nothing useful so far.

Thanks for the response.

#8 Post by cynwulf »

If you've investigated all of the ACPI bugs and found them to be benign, then that just leaves the point where boot hangs:
nottoday wrote: 2022-08-09 13:01

Code: Select all

Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:0:0: Power-on or device reset occurred
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:1:0: Power-on or device reset occurred
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:2:0: Power-on or device reset occurred
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:3:0: Power-on or device reset occurred
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:4:0: Power-on or device reset occurred
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:5:0: Power-on or device reset occurred
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:6:0: Power-on or device reset occurred
Jul 15 14:37:59 sever-Remi kernel: scsi 0:0:7:0: Power-on or device reset occurred
The device resets at that point, for what looks like all (?) devices connected to the SAS controller, don't seem normal to me. Have you researched that? It could be a driver or hardware problem. I would install a much newer kernel and do some reboots with that to see whether the same entries and the same hangs appear. If it persists, it could be anything from a bad power supply to various problems with the SAS enclosure, controller, cabling, etc.

#9 Post by nottoday »

I've tried upgrading to kernel 5.18.0-0.deb11.4-amd64, which did not help.

Code: Select all

Jul 15 14:37:59 sever-Remi kernel: ACPI: SPCR: Unexpected SPCR Access Width.  Defaulting to byte size
Jul 15 14:37:59 sever-Remi kernel: DMAR-IR: This system BIOS has enabled interrupt remapping
                                   on a chipset that contains an erratum making that
                                   feature unstable.  To maintain system stability
                                   interrupt remapping is being disabled.  Please
                                   contact your BIOS vendor for an update
Jul 15 14:37:59 sever-Remi kernel: [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)
Jul 15 14:37:59 sever-Remi kernel: Intel PMU driver.
Jul 15 14:37:59 sever-Remi kernel: core: CPUID marked event: 'bus cycles' unavailable
Jul 15 14:37:59 sever-Remi kernel: MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
Jul 15 14:37:59 sever-Remi kernel:   #5  #6  #7
Jul 15 14:37:59 sever-Remi kernel: ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
Jul 15 14:37:59 sever-Remi kernel: ERST: Failed to get Error Log Address Range.
Jul 15 14:37:59 sever-Remi kernel: ACPI Warning: SystemIO range 0x0000000000000928-0x000000000000092F conflicts with OpRegion 0x0000000000000920-0x000000000000092F (\SGPE) (20200925/utaddress-204)
Jul 15 14:37:59 sever-Remi kernel: lpc_ich: Resource conflict(s) found affecting gpio_ich
I'm not getting much wiser from googling the log lines above. Everything I can find says they are nothing important. So I think they have nothing to do with my problem, but I'm hesitant because I don't know exactly what it all means.
cynwulf wrote: 2022-09-20 12:18 If it persists, it could be anything from a bad power supply to various problems with the SAS enclosure, controller, cabling, etc.
If you're thinking of a hardware issue, I don't know how to troubleshoot that effectively, other than by replacing all the hardware piece by piece, which would be cumbersome and expensive.

#10 Post by cynwulf »

As you can see, all 8 drives in the SAS enclosure reset at the same time. When it does continue to boot, it's after about a 2-minute delay waiting for the arrays to come back online.

When it doesn't, it just hangs.

That points to the SAS enclosure or controller, rather than a specific drive. To me it looks suspiciously like a PSU issue, but that is by no means certain yet. How many PSUs in that model?

Is the problem more easily reproducible on cold boots, as opposed to warm reboots?
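
The roughly two-minute wait described above could be located in a saved journal by looking for the largest pause between consecutive messages. This is a minimal sketch, assuming the journal was printed with `journalctl -o short-unix`, which puts a seconds-since-epoch timestamp at the start of each line. `largest_gap` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report the longest pause between two consecutive journal lines.
# Reads `journalctl -o short-unix` style output on standard input.
# largest_gap is a hypothetical helper, not an existing tool.
largest_gap() {
    awk '
        # Only lines that start with a numeric epoch timestamp are used;
        # header lines and continuation lines of multi-line messages are skipped.
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ {
            if (seen && $1 - prev > max) {
                max = $1 - prev
                before = prev_line
                after = $0
            }
            prev = $1
            prev_line = $0
            seen = 1
        }
        END {
            if (after != "")
                printf "%.1f s gap\nbefore: %s\nafter:  %s\n", max, before, after
        }
    '
}
```

It could be used as `journalctl -b 0 -o short-unix | largest_gap`. This only helps for boots that completed, because a hung boot has no message after the stall, and the longest pause in a given boot is not necessarily the one after the SCSI resets.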

#11 Post by nottoday »

Sorry for the late response; I didn't have much time to look into it.
cynwulf wrote: 2022-09-27 14:45 How many PSUs in that model?
I only have one power supply installed, and I only own one.
cynwulf wrote: 2022-09-27 14:45 Is the problem more easily reproducible on cold boots, as opposed to warm reboots?
That's not my experience. At some point I did have the feeling that, if it hung, it was more likely to boot the next time if I pulled the power cord. I've tested cold vs. warm vs. power-cutoff boots (eight boots for each), but it does not seem to matter. It does seem to blow very hard when it boots after a power cutoff.
These are the results of my test.

Code: Select all

1. warm fail
2. cold pass
3. warm fail
4. cold pass
5. warm fail
6. cold pass
7. warm pass
8. warm pass
9. warm pass
10. warm pass
11. warm pass
12. cold fail
13. cold pass
14. cold pass
15. cold pass
16. cold pass
17. power-cutoff fail
18. power-cutoff fail
19. power-cutoff pass
20. power-cutoff pass
21. power-cutoff pass
22. power-cutoff fail
23. power-cutoff pass
24. power-cutoff pass
25. power-cutoff fail
I don't have the log files right now, but I'll try to post them this week.
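
A list like the one above can be summarised per boot type with a few lines of awk. This is a sketch under assumptions: `tally_boots` is a hypothetical helper name, and each input line is assumed to have exactly three fields (index, boot type, result), as in the list above.

```shell
#!/bin/sh
# Sketch: count failures per boot type from lines such as "12. cold fail".
# tally_boots is a hypothetical helper; it reads the list on standard input.
tally_boots() {
    awk '
        # Field 2 is the boot type, field 3 is "pass" or "fail".
        NF == 3 {
            total[$2]++
            if ($3 == "fail") fails[$2]++
        }
        END {
            for (kind in total)
                printf "%s %d/%d failed\n", kind, fails[kind], total[kind]
        }
    ' | sort
}
```

With the list saved to a file (say `results.txt`, a made-up name), `tally_boots < results.txt` prints `cold 1/8 failed`, `power-cutoff 4/9 failed` and `warm 3/8 failed`. Note that the list contains nine power-cutoff entries. With so few runs per category, the differences may well be chance.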

#12 Post by nottoday »

Sorry for my late response. The logs are at the following link.

https://drive.google.com/drive/folders/ ... sp=sharing
