[Testing - Trixie] Error Causing Crash

awptechnologies
Posts: 7
Joined: 2023-08-08 23:32

#1 Post by awptechnologies »

I'm not sure what this error in the log is. It caused the computer to lock up and I had to reset it. Can anyone explain this to me?

Code:

BUG: Bad page state in process node  pfn:329600
Aug 14 22:01:53 net1server kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x7f681c000 pfn:0x329600
Aug 14 22:01:56 net1server kernel: flags: 0x57ffffc0000100(active|node=1|zone=2|lastcpupid=0x1fffff)
Aug 14 22:01:59 net1server kernel: raw: 0057ffffc0000100 dead000000000100 dead000000000122 0000000000000000
Aug 14 22:02:00 net1server kernel: raw: 00000007f681c000 0000000000000000 00000000ffffffff 0000000000000000
Aug 14 22:02:00 net1server kernel: page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
Aug 14 22:02:00 net1server kernel: Modules linked in: dm_mod xt_REDIRECT nvidia_uvm(PO) nfsv3 nfs_acl ip_vs_rr xt_ipvs ip_vs veth vxlan ip6_udp_tunnel udp_tunnel xt_policy xt_mark xt_bpf xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_ad>
Aug 14 22:02:00 net1server kernel:  configfs efi_pstore nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci efivarfs qemu_fw_cfg ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic sd_mod t10_pi hid_generic sr_mod usbhid cdrom crc64_rocksoft hid crc64 crc_t10dif crct10dif_generic>
Aug 14 22:02:00 net1server kernel: CPU: 3 PID: 3510135 Comm: node Tainted: P           O       6.10.3-amd64 #1  Debian 6.10.3-1
Aug 14 22:02:00 net1server kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 4.2023.08-4 02/15/2024
Aug 14 22:02:00 net1server kernel: Call Trace:
Aug 14 22:02:00 net1server kernel:  <TASK>
Aug 14 22:02:00 net1server kernel:  dump_stack_lvl+0x64/0x80
Aug 14 22:02:00 net1server kernel:  bad_page+0x70/0x100
Aug 14 22:02:00 net1server kernel:  free_unref_page+0x323/0x4e0
Aug 14 22:02:00 net1server kernel:  migrate_misplaced_folio+0x395/0x410
Aug 14 22:02:00 net1server kernel:  __handle_mm_fault+0xaea/0x1060
Aug 14 22:02:00 net1server kernel:  handle_mm_fault+0x18d/0x320
Aug 14 22:02:00 net1server kernel:  do_user_addr_fault+0x177/0x6a0
Aug 14 22:02:00 net1server kernel:  exc_page_fault+0x7e/0x180
Aug 14 22:02:00 net1server kernel:  asm_exc_page_fault+0x26/0x30
Aug 14 22:02:00 net1server kernel: RIP: 0033:0x1fbd5d0
Aug 14 22:02:00 net1server kernel: Code: c3 66 0f 1f 84 00 00 00 00 00 49 39 dd 0f 84 8b 00 00 00 48 8b 75 c8 49 8b 04 24 48 83 c3 08 eb 89 66 0f 1f 84 00 00 00 00 00 <8b> 0e 83 f9 03 0f 85 a5 00 00 00 44 8b 6e 04 45 85 ed 0f 8e cd 00
Aug 14 22:02:00 net1server kernel: RSP: 002b:00007f6835fff920 EFLAGS: 00010293
Aug 14 22:02:00 net1server kernel: RAX: 00007f68280689c0 RBX: 0000000000000000 RCX: 00007f68180bc4b0
Aug 14 22:02:00 net1server kernel: RDX: 0000000000000000 RSI: 00007f681c0030d0 RDI: 00007f6835fff978
Aug 14 22:02:00 net1server kernel: RBP: 00007f6835fff960 R08: 00007f681c0030d0 R09: 000000002e15fa60
Aug 14 22:02:00 net1server kernel: R10: 0000000000000006 R11: 00007f6828000740 R12: 00007f6835fff978
Aug 14 22:02:00 net1server kernel: R13: 00007f6828068a00 R14: 000000002e0e4580 R15: 000000002e6ff8b8
Aug 14 22:02:00 net1server kernel:  </TASK>
Aug 14 22:02:00 net1server kernel: ------------[ cut here ]------------
Aug 14 22:02:00 net1server kernel: kernel BUG at mm/migrate.c:2658!
Aug 14 22:02:00 net1server kernel: Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
Aug 14 22:02:00 net1server kernel: CPU: 3 PID: 3510135 Comm: node Tainted: P    B      O       6.10.3-amd64 #1  Debian 6.10.3-1
Aug 14 22:02:00 net1server kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 4.2023.08-4 02/15/2024
Aug 14 22:02:00 net1server kernel: RIP: 0010:migrate_misplaced_folio+0x3ae/0x410
Aug 14 22:02:00 net1server kernel: Code: e0 ea 84 ad e8 73 44 f6 ff 4c 89 f7 e8 eb 3e f5 ff 45 31 ed 8b 44 24 1c 85 c0 75 10 48 8b 44 24 20 48 39 d8 0f 84 8e fd ff ff <0f> 0b 41 89 c4 65 4c 01 25 55 f7 24 54 49 8b 3e 48 c1 ef 36 e8 79
Aug 14 22:02:00 net1server kernel: RSP: 0000:ffffaa890dd4fd58 EFLAGS: 00010206
Aug 14 22:02:00 net1server kernel: RAX: ffffd13f0ca5fa08 RBX: ffffaa890dd4fd78 RCX: 0000000000000027
Aug 14 22:02:00 net1server kernel: RDX: 000000000000027f RSI: 0000000000000001 RDI: ffff92b4b4c4cc80
Aug 14 22:02:00 net1server kernel: RBP: ffff92b47ffd5000 R08: 0000000000000000 R09: 0000000000000003
Aug 14 22:02:00 net1server kernel: R10: ffffaa890dd4fa70 R11: ffffffffad6c1048 R12: 00000000000001c3
Aug 14 22:02:00 net1server kernel: R13: 0000000000000000 R14: ffffd13f0ca58000 R15: ffffaa890dd4fd78
Aug 14 22:02:00 net1server kernel: FS:  00007f6836000700(0000) GS:ffff92b475b80000(0000) knlGS:0000000000000000
Aug 14 22:02:00 net1server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 14 22:02:00 net1server kernel: CR2: 00007f681c0030d0 CR3: 00000002c2ee2000 CR4: 00000000000406f0
Aug 14 22:02:00 net1server kernel: Call Trace:
Aug 14 22:02:00 net1server kernel:  <TASK>
Aug 14 22:02:00 net1server kernel:  ? die+0x36/0x90
Aug 14 22:02:00 net1server kernel:  ? do_trap+0xdd/0x100
Aug 14 22:02:00 net1server kernel:  ? migrate_misplaced_folio+0x3ae/0x410
Aug 14 22:02:00 net1server kernel:  ? do_error_trap+0x6a/0x90
Aug 14 22:02:00 net1server kernel:  ? migrate_misplaced_folio+0x3ae/0x410
Aug 14 22:02:00 net1server kernel:  ? exc_invalid_op+0x50/0x70
Aug 14 22:02:00 net1server kernel:  ? migrate_misplaced_folio+0x3ae/0x410
Aug 14 22:02:00 net1server kernel:  ? asm_exc_invalid_op+0x1a/0x20
Aug 14 22:02:00 net1server kernel:  ? migrate_misplaced_folio+0x3ae/0x410
Aug 14 22:02:00 net1server kernel:  ? migrate_misplaced_folio+0x3c7/0x410
Aug 14 22:02:00 net1server kernel:  __handle_mm_fault+0xaea/0x1060
Aug 14 22:02:00 net1server kernel:  handle_mm_fault+0x18d/0x320
Aug 14 22:02:00 net1server kernel:  do_user_addr_fault+0x177/0x6a0
Aug 14 22:02:00 net1server kernel:  exc_page_fault+0x7e/0x180
Aug 14 22:02:00 net1server kernel:  asm_exc_page_fault+0x26/0x30
Aug 14 22:02:00 net1server kernel: RIP: 0033:0x1fbd5d0
Aug 14 22:02:00 net1server kernel: Code: c3 66 0f 1f 84 00 00 00 00 00 49 39 dd 0f 84 8b 00 00 00 48 8b 75 c8 49 8b 04 24 48 83 c3 08 eb 89 66 0f 1f 84 00 00 00 00 00 <8b> 0e 83 f9 03 0f 85 a5 00 00 00 44 8b 6e 04 45 85 ed 0f 8e cd 00
Aug 14 22:02:00 net1server kernel: RSP: 002b:00007f6835fff920 EFLAGS: 00010293
Aug 14 22:02:00 net1server kernel: RAX: 00007f68280689c0 RBX: 0000000000000000 RCX: 00007f68180bc4b0
Aug 14 22:02:00 net1server kernel: RDX: 0000000000000000 RSI: 00007f681c0030d0 RDI: 00007f6835fff978
Aug 14 22:02:00 net1server kernel: RBP: 00007f6835fff960 R08: 00007f681c0030d0 R09: 000000002e15fa60
Aug 14 22:02:00 net1server kernel: R10: 0000000000000006 R11: 00007f6828000740 R12: 00007f6835fff978
Aug 14 22:02:00 net1server kernel: R13: 00007f6828068a00 R14: 000000002e0e4580 R15: 000000002e6ff8b8
Aug 14 22:02:00 net1server kernel:  </TASK>
Aug 14 22:02:00 net1server kernel: Modules linked in: dm_mod xt_REDIRECT nvidia_uvm(PO) nfsv3 nfs_acl ip_vs_rr xt_ipvs ip_vs veth vxlan ip6_udp_tunnel udp_tunnel xt_policy xt_mark xt_bpf xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_ad>
Aug 14 22:02:00 net1server kernel:  configfs efi_pstore nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci efivarfs qemu_fw_cfg ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic sd_mod t10_pi hid_generic sr_mod usbhid cdrom crc64_rocksoft hid crc64 crc_t10dif crct10dif_generic>
Aug 14 22:02:00 net1server kernel: ---[ end trace 0000000000000000 ]---
Aug 14 22:02:00 net1server kernel: RIP: 0010:migrate_misplaced_folio+0x3ae/0x410
Aug 14 22:02:00 net1server kernel: Code: e0 ea 84 ad e8 73 44 f6 ff 4c 89 f7 e8 eb 3e f5 ff 45 31 ed 8b 44 24 1c 85 c0 75 10 48 8b 44 24 20 48 39 d8 0f 84 8e fd ff ff <0f> 0b 41 89 c4 65 4c 01 25 55 f7 24 54 49 8b 3e 48 c1 ef 36 e8 79
Aug 14 22:02:00 net1server kernel: RSP: 0000:ffffaa890dd4fd58 EFLAGS: 00010206
Aug 14 22:02:00 net1server kernel: RAX: ffffd13f0ca5fa08 RBX: ffffaa890dd4fd78 RCX: 0000000000000027
Aug 14 22:02:00 net1server kernel: RDX: 000000000000027f RSI: 0000000000000001 RDI: ffff92b4b4c4cc80
Aug 14 22:02:00 net1server kernel: RBP: ffff92b47ffd5000 R08: 0000000000000000 R09: 0000000000000003
Aug 14 22:02:00 net1server kernel: R10: ffffaa890dd4fa70 R11: ffffffffad6c1048 R12: 00000000000001c3
Aug 14 22:02:00 net1server kernel: R13: 0000000000000000 R14: ffffd13f0ca58000 R15: ffffaa890dd4fd78
Aug 14 22:02:00 net1server kernel: FS:  00007f6836000700(0000) GS:ffff92b475b80000(0000) knlGS:0000000000000000
Aug 14 22:02:00 net1server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 14 22:02:00 net1server kernel: CR2: 00007f681c0030d0 CR3: 00000002c2ee2000 CR4: 00000000000406f0

wizard10000
Global Moderator
Posts: 1146
Joined: 2019-04-16 23:15
Location: southeastern us
Has thanked: 120 times
Been thanked: 198 times

Re: [Testing - Trixie] Error Causing Crash

#2 Post by wizard10000 »

Logs you posted point to a kernel bug; might be wise to file a bug report.

Also, the fact that you're running Trixie in a VM is a fairly important piece of information; it would be most helpful to mention that in the future :)
we see things not as they are, but as we are.
-- anais nin

awptechnologies
Posts: 7
Joined: 2023-08-08 23:32

Re: [Testing - Trixie] Error Causing Crash

#3 Post by awptechnologies »

I have snapshots that run every night. Should I go back to the night before, or is the VM okay to keep running? I reset it and all seems fine as of right now. Also worth noting: my RAM usage in Proxmox was at basically 90 percent. Could this have happened because the VM ran out of memory? Another note: this occurred one hour after my live snapshot. I use Proxmox VE and Proxmox Backup Server, which backs up all six of my Debian Testing VMs every night at 9 pm. I run Docker Swarm and have an HA cluster. This is the only VM that encountered the error, and also the only VM that has an NVIDIA GPU passed through for Docker GPU acceleration. All six VMs reside on the same Proxmox host and only migrate if the host goes down. This is the first time this error has ever happened in the three years my cluster has been up.

awptechnologies
Posts: 7
Joined: 2023-08-08 23:32

Re: [Testing - Trixie] Error Causing Crash

#4 Post by awptechnologies »

Any ideas on this, or should I just keep running it? I'm just concerned this will happen again. To restore or not to restore...

Aki
Global Moderator
Posts: 4038
Joined: 2014-07-20 18:12
Location: Europe
Has thanked: 112 times
Been thanked: 533 times

Re: [Testing - Trixie] Error Causing Crash

#5 Post by Aki »

Hello,
awptechnologies wrote: 2024-08-17 17:07 Any ideas on this, or should I just keep running it? I'm just concerned this will happen again. To restore or not to restore...
As @wizard10000 already pointed out, you reported a kernel error message:

Code:

BUG: Bad page state in process node  pfn:329600
Aug 14 22:02:00 net1server kernel: page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
Aug 14 22:02:00 net1server kernel: Modules linked in: dm_mod xt_REDIRECT nvidia_uvm(PO) nfsv3 nfs_acl ip_vs_rr xt_ipvs ip_vs veth vxlan ip6_udp_tunnel udp_tunnel xt_policy xt_mark xt_bpf xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_ad>
Aug 14 22:02:00 net1server kernel:  configfs efi_pstore nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock vmw_vmci efivarfs qemu_fw_cfg ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic sd_mod t10_pi hid_generic sr_mod usbhid cdrom crc64_rocksoft hid crc64 crc_t10dif crct10dif_generic>
Aug 14 22:02:00 net1server kernel: CPU: 3 PID: 3510135 Comm: node Tainted: P           O       6.10.3-amd64 #1  Debian 6.10.3-1
Aug 14 22:02:00 net1server kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 4.2023.08-4 02/15/2024
The memory management code of your installed Linux kernel (version 6.10.3-amd64, Debian 6.10.3-1) tried to free a memory page used by the process named "node" (what is it?), but internal kernel checks found the page in an inconsistent state: flags that must be clear when a page is freed were still set. So the kernel reported it. Memory corruption is probably the cause of the VM freeze.

Why was the memory page in an inconsistent state? That is quite difficult to say.
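If you file a bug report, it can help to pull the key facts out of the log first. The following is only a minimal Python sketch, not an official tool: the function name `summarize_bad_page` and the regular expressions are my own assumptions, and the sample lines are copied from the log posted above.

```python
import re

# Lines copied from the log posted in this thread.
LOG = """\
BUG: Bad page state in process node  pfn:329600
Aug 14 22:02:00 net1server kernel: page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
Aug 14 22:02:00 net1server kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 4.2023.08-4 02/15/2024
"""


def summarize_bad_page(log):
    """Collect process, PFN, reason and hardware name from a bad-page report."""
    # Hypothetical patterns, written only against the lines shown above.
    patterns = {
        "process": r"Bad page state in process (\S+)",
        "pfn": r"Bad page state in process \S+\s+pfn:([0-9a-f]+)",
        "reason": r"page dumped because: (.+)",
        "hardware": r"Hardware name: (.+)",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log)
        if match:
            summary[key] = match.group(1).strip()
    return summary


summary = summarize_bad_page(LOG)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The hardware line is worth including in any report, since it shows at a glance that the system is a QEMU guest.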

Your kernel reports that it is "tainted" [1]:

Code:

Tainted: P           O 
where:
  • P - proprietary module was loaded
  • O - externally-built (“out-of-tree”) module was loaded
Therefore, a proprietary out-of-tree kernel module may be involved, too (nvidia_uvm is marked "(PO)" in your module list).
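As a rough illustration, the taint letters can be read off a log line with a few lines of Python. This is a sketch under assumptions: the helper `taint_letters` and its regular expression are hypothetical, and the table covers only the letters seen in this thread. The meaning of "B", which appears in the second trace of the first post, is taken from the kernel's tainted-kernels documentation [1].

```python
import re

# Only the taint letters that appear in this thread; the kernel defines more.
TAINT_MEANINGS = {
    "P": "proprietary module was loaded",
    "O": "externally-built (out-of-tree) module was loaded",
    "B": "bad page referenced or some unexpected page flags",
}


def taint_letters(log_line):
    """Return the taint letters found after 'Tainted:' in a kernel log line."""
    # Assumption: the taint field is followed by the kernel version,
    # which starts with a digit (as in the lines posted above).
    match = re.search(r"Tainted: ([A-Z ]+?)\s+\d", log_line)
    if not match:
        return []
    return [char for char in match.group(1) if char != " "]


first = "CPU: 3 PID: 3510135 Comm: node Tainted: P           O       6.10.3-amd64 #1  Debian 6.10.3-1"
second = "CPU: 3 PID: 3510135 Comm: node Tainted: P    B      O       6.10.3-amd64 #1  Debian 6.10.3-1"

for line in (first, second):
    letters = taint_letters(line)
    print(letters)
    for letter in letters:
        print(f"  {letter} - {TAINT_MEANINGS.get(letter, 'see [1]')}")
```

The "B" flag in the second trace is set by the bad-page report itself, so it is a consequence of the error rather than a hint about its cause.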

Furthermore, in your assumption quoted below:
awptechnologies wrote: 2024-08-17 17:07 [..] This is the first time this error has ever happened in the three years my cluster has been up.
you are probably not taking into account that Debian released the 6.10.3-1 Linux kernel for Debian Unstable about two weeks ago and for Debian Testing (Trixie) one week ago (2024-08-11):

Code:

linux (6.10.3-1) unstable; urgency=medium

  * New upstream stable update:
    https://www.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.10.2
    https://www.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.10.3
    - ext4: don't track ranges in fast_commit if inode has inlined data
      (Closes: #1039883)

  [ Salvatore Bonaccorso ]
  * [rt] Update to 6.10.2-rt14
    - Refresh patches and drop patches applied upstream
  * [arm64] drivers/net/ethernet/microsoft: Enable MICROSOFT_MANA as module
  * drivers/net/ethernet/pensando: Enable IONIC as module (Closes: #1041893)

  [ Vincent Blut ]
  * [arm64] drivers/phy/marvell: Enable PHY_MVEBU_CP110_UTMI as module
    (Closes: #1076934)

  [ Ben Hutchings ]
  * net: drop bad gso csum_start and offset in virtio_net_hdr (regression in
    6.10.3)
  * spi: spidev: Add missing spi_device_id for bh2228fv (regression in 6.10.3)

 -- Ben Hutchings <benh@debian.org>  Sun, 04 Aug 2024 22:10:58 +0200
 
So you have been running a new kernel version for a week. This could of course be in the causal chain, as it is the major change according to what you have reported so far.
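To put the timeline in numbers, here is a small Python sketch. The upload date comes from the changelog trailer above and the Testing date from this post; the year of the crash is an assumption, because the log timestamps ("Aug 14") do not include one.

```python
from datetime import date
from email.utils import parsedate_to_datetime

# Changelog trailer date of linux 6.10.3-1 (copied from the changelog above).
upload = parsedate_to_datetime("Sun, 04 Aug 2024 22:10:58 +0200").date()

# Date the package reached Debian Testing, as stated in this post.
testing = date(2024, 8, 11)

# Crash date from the log; the year 2024 is assumed.
crash = date(2024, 8, 14)

days_since_upload = (crash - upload).days
days_since_testing = (crash - testing).days

print(f"Crash happened {days_since_upload} days after the upload")
print(f"Crash happened {days_since_testing} days after the kernel reached Testing")
```

Under these assumptions the crash came only a few days after the new kernel became available in Testing, which fits with it being a recent change on that VM.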

Hope this helps. Please let me know.

EDIT: note: please modify the subject of the first post to add the error message from the kernel; e.g.:
[Testing - Trixie] Error "PAGE_FLAGS_CHECK_AT_FREE flag(s) set" causing crash of a virtual machine
--
[1] Tainted kernels
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org
⠈⠳⣄⠀
