Hang on boot sometimes on 5.8 kernels, but every boot on 5.9

Post by allnikol » 2020-11-04 03:00

Running Bullseye on a Sager laptop, I have a hang-during-boot issue, and I'm not sure where to start debugging.

For months it was intermittent (it would hang on roughly half of all boots), so a hard poweroff and a second boot would get me running when it occurred, and I was too lazy to track the problem down. After upgrading to kernel 5.9.0-1, however, it seems to happen on every boot. I tried ~10 times in a row, then reverted to kernel 5.8.0-2, at which point it went back to being intermittent. Obviously I don't want to be stuck on kernel 5.8 forever, so that workaround is not a long-term solution.

(Note: these are all Debian kernels pulled in by linux-image-amd64.)

How far the boot gets before hanging is inconsistent, which I suppose doesn't reveal much, since systemd's highly parallel boot doesn't run things in the same order from one boot to the next. Sometimes it even gets as far as the text console login prompt and then hangs: typing a username produces no output at the prompt, typing username, Enter, password, Enter has no effect, I can't switch VTs with Alt-Fn or Ctrl-Alt-Fn, Ctrl-Alt-Del does nothing, and a regular press of the power button does nothing. But if the boot succeeds and gets me to the lightdm login screen, I can always log in, and the system is then stable and can run for days or weeks at a time without a hiccup.

From watching the output during boot, a few things have jumped out at me:
There are several SKIP messages that refer to "Ordering cycle found". This seems a little weird in that they refer to services that are in default installed state with no custom config or tweaks to the systemd unit dependencies or anything. But I get them on a successful boot too, so I've been inclined to guess they're non-fatal and probably unimportant.

The following message looks kinda bad:
Code:
iwlwifi 0000:00:14.3: firmware: failed to load iwl-debug-yoyo.bin (-2)

But according to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=969264 and https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=966218, it's harmless and the fix from upstream simply suppresses the message.

This looks interesting:
Code:
xhci_hcd 0000:01:00.2: can't change power state from D3cold to D0 (config space inaccessible)
xhci_hcd 0000:01:00.2: can't change power state from D3hot to D0 (config space inaccessible)
xhci_hcd 0000:01:00.2: PCI post-resume error -19!
xhci_hcd 0000:01:00.2: HC died; Cleaning up

(Apologies for any typos in these messages; they're transcribed from a photo of the monitor.)
Apparently xhci_hcd has to do with USB3, but every reference I've found to those errors has been people complaining about USB problems on their systems. Once my system successfully boots, I haven't had any USB problems so far as I've noticed.

I don't know what logs get written to disk and when during boot (and don't get overwritten on a subsequent boot), so any guidance on what log files or journalctl incantations or whatever might shed some light on this would be highly welcome!

Re: Hang on boot sometimes on 5.8 kernels, but every boot on

Post by Head_on_a_Stick » 2020-11-04 17:43

Read
Code:
man journald.conf

That will tell you how to enable persistence for the systemd journal (hint: it's the Storage= option).

See also https://wiki.debian.org/DebianKernel/GitBisect

Re: Hang on boot sometimes on 5.8 kernels, but every boot on

Post by allnikol » 2020-11-04 19:44

Head_on_a_Stick wrote: Read
Code:
man journald.conf

That will tell you how to enable persistence for the systemd journal (hint: it's the Storage= option).

Thanks for that starting point.

Between that hint and some pointers from https://askubuntu.com/questions/765315/how-to-find-previous-boot-log-after-ubuntu-16-04-restarts, I have verified that persistent systemd journal storage is already enabled: it's set to Storage=auto and /var/log/journal already exists (which I believe is the current Debian default... or at least I don't remember creating it manually). Also, journalctl --list-boots shows entries going back months, with boot offsets going back as far as -61.

So it seems that I have a wealth of log data to work with, at least.
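
For anyone following along, the checks above can be sketched as two small shell helpers. This is only a sketch: the function names are made up for illustration, the default of 40 lines is arbitrary, and the directory defaults to /var/log/journal, which is where journald keeps the persistent journal.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of systemd.

# Succeeds if the persistent journal directory exists. With Storage=auto,
# journald only writes to disk when this directory is present.
journal_is_persistent() {
    [ -d "${1:-/var/log/journal}" ]
}

# Print the last COUNT entries of the boot OFFSET boots back
# (1 = previous boot, 2 = the one before that, ...).
boot_tail() {
    offset=$1
    count=${2:-40}
    journalctl -b "-$offset" -n "$count" --no-pager
}
```

After sourcing this, something like `journal_is_persistent && boot_tail 1 20` would show the last 20 entries of the previous boot.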



Head_on_a_Stick wrote: See also https://wiki.debian.org/DebianKernel/GitBisect

I think that might be a bit premature for my situation. Given the apparent absence of other people reporting this problem, wouldn't it be reasonable to guess that it stems from a configuration problem on my system, and not jump into debugging the kernel code itself quite yet? At this point, I would really like to figure out the last thing the system was trying to do when it hung. If I can make the hangs go away by blacklisting a module or disabling some service, that would narrow things down dramatically.

So far, though, I don't see any smoking gun in the logs. All the unsuccessful boots log as far as a message from systemd-journald recording "Time spent on flushing to /var/log/journal/(long string) is x ms for y entries", and end there. The immediately preceding entries vary from one failed boot to the next and don't appear (to my not-very-trained eye) to indicate anything that might be a fatal error. The logs for a successful boot are much longer. (Side note: hot damn the PIA daemon is verbose! Maybe I should set that to start with a quiet flag.)
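
The comparison described above (looking at where each boot's log ends) can be done in one pass with a small loop. This is a rough sketch, not a polished tool: the function name and the defaults of 5 boots and 3 lines are assumptions chosen for illustration, and it relies only on the journalctl options -b, -n, --no-pager and -o.

```shell
#!/bin/sh
# Hypothetical helper: show where each of the last N boots stopped logging.
last_entries_per_boot() {
    boots=${1:-5}     # how many previous boots to inspect
    lines=${2:-3}     # how many trailing entries to show for each
    i=1
    while [ "$i" -le "$boots" ]; do
        echo "=== boot -$i ==="
        # short-monotonic prints seconds since boot, which makes it easier
        # to compare how far each boot got before the log ends
        journalctl -b "-$i" -n "$lines" --no-pager -o short-monotonic
        i=$((i + 1))
    done
}
```

Calling `last_entries_per_boot 10 5` would print the final five entries of each of the ten boots before the current one, so failed and successful boots can be compared side by side.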

Perhaps interestingly, the xhci_hcd errors I mentioned do not appear in the journals for failed boots, though they do show on screen during those same failed boots. The journal for my current successful boot does contain those messages. Um. Might that imply that whatever is hanging the system coincides with the xhci_hcd stuff and is hanging the system up hard in the interval between showing those entries on screen and writing them to disk?
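
One way to check that observation across more boots is to count the xhci_hcd lines in each boot's kernel messages. A minimal sketch, assuming the messages are tagged with the literal string xhci_hcd as in the transcription above; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count kernel log lines mentioning xhci_hcd for the
# boot OFFSET boots back (0 = current boot, 1 = previous boot, ...).
xhci_lines_in_boot() {
    # -k restricts output to kernel messages for the selected boot.
    # grep -c exits non-zero when the count is 0, so "|| true" keeps the
    # function from reporting failure in that case; the 0 is still printed.
    journalctl -b "-${1:-0}" -k --no-pager | grep -c 'xhci_hcd' || true
}
```

For example, `xhci_lines_in_boot 0` covers the current boot and `xhci_lines_in_boot 1` the one before it. A count of 0 for a boot where the messages were visible on screen would be consistent with the hang happening before they reached the disk.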