Dazed and confused, but trying to continue

Linux Kernel, Network, and Services configuration.
CwF
Global Moderator
Posts: 3134
Joined: 2018-06-20 15:16
Location: Colorado
Has thanked: 63 times
Been thanked: 265 times

#1 Post by CwF »

To start at the end of my thinking: is there a mechanism for excluding Recommends according to machine type?

Yes, super general question! I just wondered whether laptop-detect, for example, is useful on desktops to flag that the machine does not have certain hardware, as well as its assumed function of telling packages that the machine does have it. It appears that package doesn't help the system past initial setup. So, are there any other packages I haven't found that help control hardware-related Recommends?

So, the back story. In the middle of it is this wonderfully meaningful error message:

kernel: Uhhuh. NMI received for unknown reason 30 on CPU 0.
kernel: Do you have a strange power saving mode enabled?
kernel: Dazed and confused, but trying to continue

It doesn't really show up if you stay in a DE GUI, but it will interrupt you every few minutes in a VT.

This is a very old issue. The reason can be 00, 20, 30, or something else. Googling will give thousands of irrelevant results. I think it can come from many sources.

The beginning of the story is my last upgrade, in late August, of the 'base' file I use for VMs. On one of the VM versions above I added midori. I like it and it will hopefully update to 'very useful' in a few months. With Recommends allowed, it brings in a few extras through webkit that lead to this error.

Removing libmm-glib0, which takes geoclue-2.0 and its Recommends with it, eliminated the error. I suppose it's doing some hardware scan for sensors I don't have, but I don't know.

So I'm wondering if there is a 'Not_smartphone_tablet_laptop-detect' package? Is this a udev detection issue? Or are developers simply gearing things to find the sensors without any consideration or consequence for not finding them?
Mottainai

mrmazda
Posts: 510
Joined: 2023-06-02 02:22
Has thanked: 14 times
Been thanked: 66 times

Re: Dazed and confused, but trying to continue

#2 Post by mrmazda »

I have one multiboot PC running on a last-iteration Pentium 4, with zero VMs. It has server-motherboard XGI video with no means to disable it in the BIOS, no PCIe or AGP slots, and no kernel graphics driver, so I use a PCI NVidia GPU. It repeatedly gets dazed and confused in Bullseye, Bookworm and Trixie, and in the Debian releases that came before, but not in Fedora, openSUSE or Mageia. I've never found anything about how to stop it from happening, only gobs of hits asking how. Obviously there must be something different about Debian's kernel construction, right?

CwF

Re: Dazed and confused, but trying to continue

#3 Post by CwF »

mrmazda wrote: 2024-10-11 00:31 Pentium 4, with zero VMs
It's been a while... Not sure. I do have a few notes, but I haven't seen the error in some time, and I never saw it when it wasn't caused by a VM setup. My best guess from memory is that it has something to do with power management and power states, so on bare metal ACPI/BIOS issues could perhaps trigger it. Try various settings in the BIOS, up to disabling power-saving features.

Are you running i386 by chance? As I remember, I saw it mostly with i386/i686 VMs. I never suffered through P4s; my dual P3Ts went for a decade until the good stuff started coming out.

Aki
Global Moderator
Posts: 4036
Joined: 2014-07-20 18:12
Location: Europe
Has thanked: 112 times
Been thanked: 532 times

Re: Dazed and confused, but trying to continue

#4 Post by Aki »

Hello,
CwF wrote: 2018-10-14 15:29 [..]
laptop-detect [..] are there any other packages I haven't found that help control hardware related recommends ?
[..]
The laptop-detect program is not currently used by the Debian Installer [0]. It returns 0 or 1 depending on whether it "senses" that it is running on a laptop (according to a list of ACPI computer models). It seems to be orphaned, and its detection is not very accurate.

The Debian Installer uses the discover program [1] to detect installed hardware.

Once the installation is complete, the user is responsible for installing the packages required to enable specific hardware features.
CwF wrote: 2018-10-14 15:29 [..] are developers simply gearing things to find the sensors without any consideration or consequence for not finding them?
I assume that developers don't want to mess with users' hardware, but sometimes things don't go as expected, especially on not-so-new hardware, or if the problem is difficult to reproduce.

Searching the Debian source code for your error, you are probably using a 5.x kernel, since the function unknown_nmi_error there exactly matches your error message:

  • sources / linux / 5.10.223-1 / arch / x86 / kernel / nmi.c at line 274:

    [..]
    unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
    {
    	int handled;
    
    	/*
    	 * Use 'false' as back-to-back NMIs are dealt with one level up.
    	 * Of course this makes having multiple 'unknown' handlers useless
    	 * as only the first one is ever run (unless it can actually determine
    	 * if it caused the NMI)
    	 */
    	handled = nmi_handle(NMI_UNKNOWN, regs);
    	if (handled) {
    		__this_cpu_add(nmi_stats.unknown, handled);
    		return;
    	}
    
    	__this_cpu_add(nmi_stats.unknown, 1);
    
    	pr_emerg("Uhhuh. NMI received for unknown reason %02x on CPU %d.\n",
    		 reason, smp_processor_id());
    
    	pr_emerg("Do you have a strange power saving mode enabled?\n");
    	if (unknown_nmi_panic || panic_on_unrecovered_nmi)
    		nmi_panic(regs, "NMI: Not continuing");
    
    	pr_emerg("Dazed and confused, but trying to continue\n");
    }
    NOKPROBE_SYMBOL(unknown_nmi_error);
    [..]
    
This code has been changed in later kernel versions.
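As a side note on the panic branch in the excerpt above: whether the kernel panics or merely prints "Dazed and confused" depends on two settings that, on x86, are normally exposed as the sysctl files read below. This is a minimal sketch, not kernel code. The helper names are hypothetical, and treating an unreadable file as "off" is an assumption of the sketch.

```c
#include <stdio.h>

/* Read one integer from a sysctl file. Returns -1 if the file is
 * missing or unreadable (for example, on a non-x86 system). */
int read_sysctl_int(const char *path)
{
    FILE *f = fopen(path, "r");
    int value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Mirrors the condition that guards nmi_panic() in unknown_nmi_error().
 * Assumption of this sketch: an unreadable knob (-1) counts as off. */
int unknown_nmi_would_panic(int unknown_nmi_panic, int panic_on_unrecovered_nmi)
{
    return unknown_nmi_panic > 0 || panic_on_unrecovered_nmi > 0;
}

/* Hypothetical helper: print the current policy for unknown NMIs. */
void report_unknown_nmi_policy(void)
{
    int a = read_sysctl_int("/proc/sys/kernel/unknown_nmi_panic");
    int b = read_sysctl_int("/proc/sys/kernel/panic_on_unrecovered_nmi");

    printf("unknown_nmi_panic=%d panic_on_unrecovered_nmi=%d -> %s\n",
           a, b,
           unknown_nmi_would_panic(a, b) ? "panic" : "log and continue");
}
```

Calling report_unknown_nmi_policy() from a small main() shows which branch a given machine would take. Seeing the "trying to continue" line in the log at all implies that both settings were off at that moment.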

The unknown_nmi_error function is then called from the function default_do_nmi, here:

  • sources / linux / 5.10.223-1 / arch / x86 / kernel / nmi.c at line 416:

    static noinstr void default_do_nmi(struct pt_regs *regs)
    {
    	unsigned char reason = 0;
    	int handled;
    	bool b2b = false;
    
    	/*
    	 * CPU-specific NMI must be processed before non-CPU-specific
    	 * NMI, otherwise we may lose it, because the CPU-specific
    	 * NMI can not be detected/processed on other CPUs.
    	 */
    
    	/*
    	 * Back-to-back NMIs are interesting because they can either
    	 * be two NMI or more than two NMIs (any thing over two is dropped
    	 * due to NMI being edge-triggered).  If this is the second half
    	 * of the back-to-back NMI, assume we dropped things and process
    	 * more handlers.  Otherwise reset the 'swallow' NMI behaviour
    	 */
    	if (regs->ip == __this_cpu_read(last_nmi_rip))
    		b2b = true;
    	else
    		__this_cpu_write(swallow_nmi, false);
    
    	__this_cpu_write(last_nmi_rip, regs->ip);
    
    	instrumentation_begin();
    
    	handled = nmi_handle(NMI_LOCAL, regs);
    	__this_cpu_add(nmi_stats.normal, handled);
    	if (handled) {
    		/*
    		 * There are cases when a NMI handler handles multiple
    		 * events in the current NMI.  One of these events may
    		 * be queued for in the next NMI.  Because the event is
    		 * already handled, the next NMI will result in an unknown
    		 * NMI.  Instead lets flag this for a potential NMI to
    		 * swallow.
    		 */
    		if (handled > 1)
    			__this_cpu_write(swallow_nmi, true);
    		goto out;
    	}
    
    	/*
    	 * Non-CPU-specific NMI: NMI sources can be processed on any CPU.
    	 *
    	 * Another CPU may be processing panic routines while holding
    	 * nmi_reason_lock. Check if the CPU issued the IPI for crash dumping,
    	 * and if so, call its callback directly.  If there is no CPU preparing
    	 * crash dump, we simply loop here.
    	 */
    	while (!raw_spin_trylock(&nmi_reason_lock)) {
    		run_crash_ipi_callback(regs);
    		cpu_relax();
    	}
    
    	reason = x86_platform.get_nmi_reason();
    
    	if (reason & NMI_REASON_MASK) {
    		if (reason & NMI_REASON_SERR)
    			pci_serr_error(reason, regs);
    		else if (reason & NMI_REASON_IOCHK)
    			io_check_error(reason, regs);
    #ifdef CONFIG_X86_32
    		/*
    		 * Reassert NMI in case it became active
    		 * meanwhile as it's edge-triggered:
    		 */
    		reassert_nmi();
    #endif
    		__this_cpu_add(nmi_stats.external, 1);
    		raw_spin_unlock(&nmi_reason_lock);
    		goto out;
    	}
    	raw_spin_unlock(&nmi_reason_lock);
    
    	/*
    	 * Only one NMI can be latched at a time.  To handle
    	 * this we may process multiple nmi handlers at once to
    	 * cover the case where an NMI is dropped.  The downside
    	 * to this approach is we may process an NMI prematurely,
    	 * while its real NMI is sitting latched.  This will cause
    	 * an unknown NMI on the next run of the NMI processing.
    	 *
    	 * We tried to flag that condition above, by setting the
    	 * swallow_nmi flag when we process more than one event.
    	 * This condition is also only present on the second half
    	 * of a back-to-back NMI, so we flag that condition too.
    	 *
    	 * If both are true, we assume we already processed this
    	 * NMI previously and we swallow it.  Otherwise we reset
    	 * the logic.
    	 *
    	 * There are scenarios where we may accidentally swallow
    	 * a 'real' unknown NMI.  For example, while processing
    	 * a perf NMI another perf NMI comes in along with a
    	 * 'real' unknown NMI.  These two NMIs get combined into
    	 * one (as described above).  When the next NMI gets
    	 * processed, it will be flagged by perf as handled, but
    	 * no one will know that there was a 'real' unknown NMI sent
    	 * also.  As a result it gets swallowed.  Or if the first
    	 * perf NMI returns two events handled then the second
    	 * NMI will get eaten by the logic below, again losing a
    	 * 'real' unknown NMI.  But this is the best we can do
    	 * for now.
    	 */
    	if (b2b && __this_cpu_read(swallow_nmi))
    		__this_cpu_add(nmi_stats.swallow, 1);
    	else
    		unknown_nmi_error(reason, regs);
    
    out:
    	instrumentation_end();
    }
    
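Based on the previous code, I assume the NMI is not triggered by I/O or PCI devices: a reason byte with the NMI_REASON_SERR or NMI_REASON_IOCHK bit set is routed to pci_serr_error or io_check_error, and unknown_nmi_error is reached only when neither bit is set.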
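To make that reasoning concrete, here is a small user-space sketch that applies the same bit tests to the reason values reported in this thread (00, 20, 30). The constants are copied from arch/x86/include/asm/mach_traps.h as I understand them, and the helper names are hypothetical. The sketch ignores the earlier NMI_LOCAL handlers and the swallow logic and looks only at the reason byte.

```c
#include <stdio.h>

/* Assumed to match arch/x86/include/asm/mach_traps.h. */
#define NMI_REASON_SERR  0x80
#define NMI_REASON_IOCHK 0x40
#define NMI_REASON_MASK  (NMI_REASON_SERR | NMI_REASON_IOCHK)

/* Hypothetical helper: name the function default_do_nmi() would call
 * for a given reason byte, following the same order of tests. */
const char *classify_nmi_reason(unsigned char reason)
{
    if (reason & NMI_REASON_MASK) {
        if (reason & NMI_REASON_SERR)
            return "pci_serr_error";
        if (reason & NMI_REASON_IOCHK)
            return "io_check_error";
    }
    return "unknown_nmi_error";
}

/* Print the reason in the same %02x form the kernel message uses. */
void print_nmi_reason(unsigned char reason)
{
    printf("reason %02x -> %s\n", reason, classify_nmi_reason(reason));
}
```

For example, print_nmi_reason(0x30) prints "reason 30 -> unknown_nmi_error", because 0x30 has neither bit 7 (SERR) nor bit 6 (IOCHK) set. The same holds for 00 and 20.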

Very recent kernel versions have added new logic to track non-maskable interrupts, but this is supported only by newer CPUs.

--
[0] Debian Bug report logs - #488386 - please provide an udeb
[1] https://packages.debian.org/bookworm/discover
[2] https://packages.debian.org/bookworm/libdiscover2
[3] https://www.intel.com/content/www/us/en ... -home.html
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org
⠈⠳⣄⠀

mrmazda

Re: Dazed and confused, but trying to continue

#5 Post by mrmazda »

In my case, I know about these NMIs because they are accompanied by annoying speaker beeps. Thus, if they indicate a real problem, I'd like to get the problem solved. If the NMIs are not of real consequence, I'd like to get the beeping eliminated.
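One way to check whether each beep really coincides with an NMI is to watch the NMI counters in /proc/interrupts. The following is a rough sketch, with hypothetical helper names, that sums the per-CPU counts on the "NMI:" line. It assumes the usual layout of that file (a label, one counter per CPU, then a description).

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on an "NMI:" line of /proc/interrupts.
 * Returns -1 if the line is not the NMI line. */
long sum_nmi_line(const char *line)
{
    const char *p = line;
    long total = 0;

    while (isspace((unsigned char)*p))
        p++;
    if (strncmp(p, "NMI:", 4) != 0)
        return -1;
    p += 4;

    for (;;) {
        char *end;

        while (*p == ' ' || *p == '\t')
            p++;
        /* Stop at the trailing description or at end of line. */
        if (!isdigit((unsigned char)*p))
            break;
        total += (long)strtoul(p, &end, 10);
        p = end;
    }
    return total;
}

/* Read /proc/interrupts and return the NMI total, or -1 on failure.
 * Limitation: a line longer than the buffer (very many CPUs) would
 * be counted only partially. */
long read_nmi_total(void)
{
    FILE *f = fopen("/proc/interrupts", "r");
    char line[4096];
    long total = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        long n = sum_nmi_line(line);
        if (n >= 0) {
            total = n;
            break;
        }
    }
    fclose(f);
    return total;
}
```

Calling read_nmi_total() before and after a beep and comparing the two values would show whether the count moved. Note that this counter covers all NMIs, not only the unknown ones.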

Aki

Re: Dazed and confused, but trying to continue

#6 Post by Aki »

mrmazda wrote: 2024-10-11 20:25 In my case, I know about these NMIs because they are accompanied by annoying speaker beeps. Thus, if they indicate a real problem, I'd like to get the problem solved. If the NMIs are not of real consequence, I'd like to get the beeping eliminated.
Your problem probably deserves its own discussion (in the Hardware sub-forum, I suppose).

Have you already started such a discussion?

CwF

Re: Dazed and confused, but trying to continue

#7 Post by CwF »

Aki wrote: 2024-10-11 19:52 Searching the Debian source code for your error, you are probably using a 5.x kernel, since it matches exactly your error message in the function unknown_nmi_error:
You may have missed that my post was some 6 years ago, so maybe kernel 4.9 or 4.19!
mrmazda wrote: 2024-10-11 20:25 I'd like to get the beeping eliminated.
As I stated, check in your BIOS...

I don't really care if this necro post is continued or moved, I've moved and haven't seen the error for years.

Have fun!
