Looking for advice on a kernel panic issue

rdghickman
Posts: 2
Joined: 2023-01-23 14:17

#1 Post by rdghickman »

I'm looking for a bit of advice regarding a kernel panic I'm seeing on Debian 11. As a disclaimer, this is currently running under virtualisation (VirtualBox), but I intend to test it on physical hardware as well.

Essentially, when using tc (traffic control) with a token bucket filter (tbf) and some network emulation (netem), I get relatively frequent lock-ups. I captured the kernel dump through a console on a serial port. The panic only happens while a TCP connection is established between the VMs. Curiously, if the remote VM's Ethernet is disconnected while the two VMs have a TCP connection open, the other VM (i.e. the one that still has its Ethernet connected) *will* crash; this seems to be 100% repeatable.

As I'm still trying to discern where this problem lies (i.e. virtualisation, the kernel, or something else), I'm open to suggestions and ideas on how to take this further. For reference, I have tried Debian kernels 5.10, 5.18, and 6.0 in the guest VMs, and it happens with both Windows and Linux hosts running the VM manager.

If anyone knows whether I should file a bug for this, or of a better place to post or discuss it, I would be grateful. The kernel panic text follows.

[  109.810702] BUG: kernel NULL pointer dereference, address: 0000000000000000
[  109.812987] #PF: supervisor read access in kernel mode
[  109.814030] #PF: error_code(0x0000) - not-present page
[  109.815078] PGD 3b45067 P4D 3b45067 PUD 0 
[  109.815889] Oops: 0000 [#1] SMP NOPTI
[  109.817372] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.0-20-amd64 #1 Debian 5.10.158-2
[  109.818715] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  109.820062] RIP: 0010:rb_next+0x0/0x50
[  109.820932] Code: 89 fe 4c 89 c7 48 89 14 24 e8 dc 66 71 00 48 8b 43 10 48 8b 14 24 49 89 d8 e9 ee fe ff ff 48 c7 07 01 00 00 00 c3 cc cc cc cc <48> 8b 17 48 39 d7 74 3d 48 8b 47 08 48 85 c0 74 20 49 89 c0 48 8b
[  109.824077] RSP: 0000:ffffb1c000003e68 EFLAGS: 00010246
[  109.825030] RAX: 0000000000000000 RBX: ffff8cdbc5628278 RCX: 0000000000000000
[  109.826125] RDX: ffff8cdbc5628140 RSI: ffff8cdbc4a00400 RDI: 0000000000000000
[  109.827176] RBP: ffff8cdbc4a00000 R08: 0000000000000000 R09: 0000000000000001
[  109.828214] R10: ffff8cdbc5628140 R11: ffffffff8ee060c0 R12: 0000000000000000
[  109.829363] R13: ffff8cdbc5628000 R14: ffff8cdbc5628280 R15: ffff8cdbc5628140
[  109.830407] FS:  0000000000000000(0000) GS:ffff8cdbfec00000(0000) knlGS:0000000000000000
[  109.831645] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  109.832623] CR2: 0000000000000000 CR3: 0000000003b2a006 CR4: 00000000000706f0
[  109.834099] Call Trace:
[  109.834731]  <IRQ>
[  109.835306]  htb_dequeue+0x7c1/0x840 [sch_htb]
[  109.836114]  __qdisc_run+0x88/0x560
[  109.836923]  net_tx_action+0x105/0x270
[  109.837659]  __do_softirq+0xc5/0x279
[  109.838446]  asm_call_irq_on_stack+0x12/0x20
[  109.839322]  </IRQ>
[  109.839908]  do_softirq_own_stack+0x37/0x50
[  109.840776]  irq_exit_rcu+0x92/0xc0
[  109.841491]  sysvec_apic_timer_interrupt+0x36/0x80
[  109.842372]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[  109.843320] RIP: 0010:mwait_idle+0x57/0x80
[  109.844118] Code: 89 d1 65 48 8b 04 25 c0 fb 01 00 0f 01 c8 48 8b 00 a8 08 75 33 0f 1f 44 00 00 0f 00 2d 0c 6d 50 00 31 c0 48 89 c1 fb 0f 01 c9 <65> 48 8b 04 25 c0 fb 01 00 3e 80 60 02 df c3 cc cc cc cc 0f ae f0
[  109.847096] RSP: 0000:ffffffff8ee03ec0 EFLAGS: 00000246
[  109.847965] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  109.848991] RDX: 0000000000000000 RSI: ffffffff8ee03e50 RDI: 00000019887e5976
[  109.850172] RBP: ffffffff8ee13940 R08: 0000000000000001 R09: 000000000007c000
[  109.851152] R10: 000000000007c000 R11: 0000000000000000 R12: 0000000000000000
[  109.852201] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  109.853216]  default_idle_call+0x3c/0xd0
[  109.853899]  do_idle+0x20c/0x2b0
[  109.854508]  cpu_startup_entry+0x19/0x20
[  109.855179]  start_kernel+0x574/0x599
[  109.855843]  secondary_startup_64_no_verify+0xb0/0xbb
[  109.856626] Modules linked in: sch_tbf sch_netem cls_u32 sch_htb intel_rapl_msr intel_rapl_common intel_pmc_core intel_powerclamp ghash_clmulni_intel aesni_intel libaes crypto_simd vmwgfx cryptd glue_helper rapl ttm drm_kms_helper pcspkr sg serio_raw joydev vboxguest evdev ac cec button drm fuse configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic hid_generic usbhid hid sd_mod t10_pi crc_t10dif crct10dif_generic ohci_pci ehci_pci ohci_hcd ahci libahci ehci_hcd libata usbcore virtio_net net_failover failover scsi_mod crct10dif_pclmul crct10dif_common crc32_pclmul psmouse crc32c_intel virtio_pci virtio_ring i2c_piix4 usb_common virtio battery video
[  109.865010] CR2: 0000000000000000
[  109.865766] ---[ end trace 45651f0b70afd64a ]---
[  109.866674] RIP: 0010:rb_next+0x0/0x50
[  109.867388] Code: 89 fe 4c 89 c7 48 89 14 24 e8 dc 66 71 00 48 8b 43 10 48 8b 14 24 49 89 d8 e9 ee fe ff ff 48 c7 07 01 00 00 00 c3 cc cc cc cc <48> 8b 17 48 39 d7 74 3d 48 8b 47 08 48 85 c0 74 20 49 89 c0 48 8b
[  109.870318] RSP: 0000:ffffb1c000003e68 EFLAGS: 00010246
[  109.871171] RAX: 0000000000000000 RBX: ffff8cdbc5628278 RCX: 0000000000000000
[  109.872297] RDX: ffff8cdbc5628140 RSI: ffff8cdbc4a00400 RDI: 0000000000000000
[  109.873458] RBP: ffff8cdbc4a00000 R08: 0000000000000000 R09: 0000000000000001
[  109.874513] R10: ffff8cdbc5628140 R11: ffffffff8ee060c0 R12: 0000000000000000
[  109.875580] R13: ffff8cdbc5628000 R14: ffff8cdbc5628280 R15: ffff8cdbc5628140
[  109.876647] FS:  0000000000000000(0000) GS:ffff8cdbfec00000(0000) knlGS:0000000000000000
[  109.877760] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  109.878673] CR2: 0000000000000000 CR3: 0000000003b2a006 CR4: 00000000000706f0
[  109.879745] Kernel panic - not syncing: Fatal exception in interrupt
[  109.880779] Kernel Offset: 0xc800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  109.882950] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---


Aki
Posts: 453
Joined: 2014-07-20 18:12
Location: Europe
Has thanked: 6 times
Been thanked: 59 times

Re: Looking for advice on a kernel panic issue

#2 Post by Aki »

Hello,

The kernel module involved is sch_htb, the Hierarchical Token Bucket (HTB) scheduler, so the error is somehow related to the testing you are doing (traffic control with a token bucket filter and some network emulation).
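
If it helps when reading similar traces, here is a minimal Python sketch of how the faulting function and the module-tagged frames can be pulled out of an oops. The function name `summarize_oops` and the regular expression are my own assumptions for illustration, not part of any kernel tooling, and the sample is just a few lines copied from the panic text above.

```python
import re

# Matches a stack frame such as "htb_dequeue+0x7c1/0x840 [sch_htb]".
# The trailing "[module]" part only appears for code in loadable modules.
FRAME_RE = re.compile(
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)

# Matches the "[  109.835306] " timestamp prefix that printk adds.
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def summarize_oops(text):
    """Return (faulting_function, [(function, module), ...]) for an oops.

    Hypothetical helper: the first "RIP:" line is taken as the faulting
    function, and only frames tagged with a module name are collected.
    """
    faulting = None
    module_frames = []
    for raw in text.splitlines():
        line = TIMESTAMP_RE.sub("", raw)
        match = FRAME_RE.search(line)
        if match is None:
            continue
        if line.startswith("RIP:"):
            if faulting is None:
                faulting = match.group("func")
        elif match.group("module"):
            module_frames.append((match.group("func"), match.group("module")))
    return faulting, module_frames


# A few lines copied from the panic text in post #1.
SAMPLE_OOPS = """\
[  109.820062] RIP: 0010:rb_next+0x0/0x50
[  109.834099] Call Trace:
[  109.835306]  htb_dequeue+0x7c1/0x840 [sch_htb]
[  109.836114]  __qdisc_run+0x88/0x560
"""

if __name__ == "__main__":
    print(summarize_oops(SAMPLE_OOPS))
```

For the sample lines, the only frame carrying a module tag is htb_dequeue in sch_htb, which is what points at HTB.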

As an example, here [1] you can find the current bug reports about HTB in the upstream Linux kernel bug tracker; you can also run other searches that are better suited to your needs.

It would be better to make the issue reliably reproducible before opening a bug report in the Debian BTS [2].

---
[1] https://bugzilla.kernel.org/buglist.cgi?quicksearch=HTB
[2] https://wiki.debian.org/reportbug
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org
⠈⠳⣄⠀

rdghickman

Re: Looking for advice on a kernel panic issue

#3 Post by rdghickman »

Perfect; as it turned out, this was just what I needed.

From browsing those earlier bug reports, I got enough of a clue to investigate, and I have managed to resolve the issue. Technically, it appears to be a misconfiguration of the qdisc tree: in this case, attaching a non-work-conserving child qdisc (tbf) beneath an htb parent, which expects its children to be work-conserving. This is actually invalid, but Linux doesn't stop you. Since htb expects a child with queued packets to hand one over, it dequeues blindly, and if the child returns nothing, it seems you get a NULL pointer dereference inside the interrupt handler. This seems pretty nasty, but I suppose it is "by design".

As such, I'm uncertain whether this is a bug or just a hazardous consequence of not fully understanding tc/qdisc, but thank you very much for pointing me in the right direction.
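
In case it is useful to anyone else, here is a rough Python sketch of how one might spot this layout in the output of `tc qdisc show dev <iface>`. It rests on several assumptions of mine: the helper name `find_risky_children` is made up, the set of qdisc kinds treated as non-work-conserving (tbf and netem) is my own guess rather than anything the kernel enforces, and the sample output is illustrative, not copied from my VMs.

```python
import re

# Assumption: qdisc kinds treated as non-work-conserving for this check.
NON_WORK_CONSERVING = {"tbf", "netem"}

# Parses lines such as:
#   qdisc htb 1: root refcnt 2 r2q 10 default 0x10 ...
#   qdisc tbf 10: parent 1:10 rate 1Mbit burst 4Kb lat 400ms
# An optional "dev <name>" field is tolerated for output covering all devices.
QDISC_RE = re.compile(
    r"^qdisc (?P<kind>\S+) (?P<handle>[0-9a-f]+): (?:dev \S+ )?"
    r"(?:root|parent (?P<parent>[0-9a-f]+):[0-9a-f]*)"
)


def find_risky_children(tc_output, parent_kind="htb"):
    """List (kind, handle, parent_handle) for risky qdiscs under parent_kind.

    Hypothetical helper: a child is flagged when its parent class belongs
    to a qdisc of parent_kind and its own kind is in NON_WORK_CONSERVING.
    """
    qdiscs = []
    for line in tc_output.splitlines():
        match = QDISC_RE.match(line.strip())
        if match:
            qdiscs.append(match.groupdict())

    # Map the major handle number (e.g. "1" for "1:") to the qdisc kind.
    kind_by_handle = {q["handle"]: q["kind"] for q in qdiscs}

    risky = []
    for q in qdiscs:
        if q["parent"] is None:
            continue  # root qdisc, nothing above it
        if (kind_by_handle.get(q["parent"]) == parent_kind
                and q["kind"] in NON_WORK_CONSERVING):
            risky.append((q["kind"], q["handle"] + ":", q["parent"] + ":"))
    return risky


# Illustrative output only; the handles and rates are invented.
SAMPLE_TC_OUTPUT = """\
qdisc htb 1: root refcnt 2 r2q 10 default 0x10 direct_packets_stat 0 direct_qlen 1000
qdisc tbf 10: parent 1:10 rate 1Mbit burst 4Kb lat 400ms
qdisc netem 20: parent 10:1 limit 1000 delay 50ms
"""

if __name__ == "__main__":
    for kind, handle, parent in find_risky_children(SAMPLE_TC_OUTPUT):
        print(f"{kind} {handle} sits directly under htb {parent}")
```

With the sample above, only the tbf qdisc is flagged, since the netem qdisc hangs off tbf rather than directly off an htb class. This is only a lint-style check based on my own diagnosis, so treat a clean result as a hint rather than a guarantee.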

Aki

Re: Looking for advice on a kernel panic issue

#4 Post by Aki »

Glad to have helped you. Happy Debian & happy hacking.
