INFO: rcu_sched self-detected stall on CPU

If none of the more specific forums is the right place to ask

INFO: rcu_sched self-detected stall on CPU

Postby prashanth272 » 2018-07-30 15:46

Hi,

Recently we are seeing the following stack trace in our Debian 8 based VM's

Jul 22 04:54:20 localhost kernel: [1068823.326261] INFO: rcu_sched self-detected stall on CPU { 0} (t=5250 jiffies g=54470612 c=54470611 q=899)
Jul 22 04:54:20 localhost kernel: [1068823.326716] sending NMI to all CPUs:
Jul 22 04:54:20 localhost kernel: [1068823.326744] NMI backtrace for cpu 0
Jul 22 04:54:20 localhost kernel: [1068823.326752] CPU: 0 PID: 77982 Comm: dnsmasq Not tainted 3.16.0-5-amd64 #1 Debian 3.16.51-3+deb8u1
Jul 22 04:54:20 localhost kernel: [1068823.326754] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015
Jul 22 04:54:20 localhost kernel: [1068823.326757] task: ffff880078334050 ti: ffff88005544c000 task.ti: ffff88005544c000
Jul 22 04:54:20 localhost kernel: [1068823.326759] RIP: 0010:[<ffffffff81053b96>] [<ffffffff81053b96>] native_write_msr_safe+0x6/0x10
Jul 22 04:54:20 localhost kernel: [1068823.326785] RSP: 0018:ffff88007c6039d0 EFLAGS: 00000046
Jul 22 04:54:20 localhost kernel: [1068823.326787] RAX: 0000000000000400 RBX: 0000000000000000 RCX: 0000000000000830
Jul 22 04:54:20 localhost kernel: [1068823.326788] RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000830
Jul 22 04:54:20 localhost kernel: [1068823.326788] RBP: ffffffff818e8000 R08: ffffffff818e8000 R09: 00000000000006b8
Jul 22 04:54:20 localhost kernel: [1068823.326789] R10: 61206f7420494d4e R11: 3a73555043206c6c R12: 0000000000000000
Jul 22 04:54:20 localhost kernel: [1068823.326790] R13: 000000000000d0ea R14: 0000000000000002 R15: 0000000000080000
Jul 22 04:54:20 localhost kernel: [1068823.326792] FS: 00007f57ca724700(0000) GS:ffff88007c600000(0000) knlGS:0000000000000000
Jul 22 04:54:20 localhost kernel: [1068823.326793] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 22 04:54:20 localhost kernel: [1068823.326794] CR2: 00000000029caa90 CR3: 000000007869a000 CR4: 0000000000160770
Jul 22 04:54:20 localhost kernel: [1068823.326841] Stack:
Jul 22 04:54:20 localhost kernel: [1068823.326843] ffffffff8104cc8d 0000000000000002 0000000000000086 ffff88007c6102a0
Jul 22 04:54:20 localhost kernel: [1068823.326845] ffffffff81858940 0000000000000000 ffffffff818e8560 0000000000000383
Jul 22 04:54:20 localhost kernel: [1068823.326847] ffffffff81858a00 ffffffff810496e3 ffff88007c6102a0 ffffffff810ca46a
Jul 22 04:54:20 localhost kernel: [1068823.326848] Call Trace:
Jul 22 04:54:20 localhost kernel: [1068823.326850] <IRQ>
Jul 22 04:54:20 localhost kernel: [1068823.326851] [<ffffffff8104cc8d>] ? __x2apic_send_IPI_mask+0xad/0xe0
Jul 22 04:54:20 localhost kernel: [1068823.326872] [<ffffffff810496e3>] ? arch_trigger_all_cpu_backtrace+0xc3/0x140
Jul 22 04:54:20 localhost kernel: [1068823.326883] [<ffffffff810ca46a>] ? rcu_check_callbacks+0x42a/0x670
Jul 22 04:54:20 localhost kernel: [1068823.326886] [<ffffffff810cbfb5>] ? timekeeping_update.constprop.9+0x35/0x70
Jul 22 04:54:20 localhost kernel: [1068823.326889] [<ffffffff810d4e50>] ? tick_sched_handle.isra.16+0x60/0x60
Jul 22 04:54:20 localhost kernel: [1068823.326899] [<ffffffff81078120>] ? update_process_times+0x40/0x70
Jul 22 04:54:20 localhost kernel: [1068823.326902] [<ffffffff810d4e10>] ? tick_sched_handle.isra.16+0x20/0x60
Jul 22 04:54:20 localhost kernel: [1068823.326904] [<ffffffff810d4e8c>] ? tick_sched_timer+0x3c/0x60
Jul 22 04:54:20 localhost kernel: [1068823.326910] [<ffffffff8108eb67>] ? __run_hrtimer+0x67/0x210
Jul 22 04:54:20 localhost kernel: [1068823.326912] [<ffffffff8108ef69>] ? hrtimer_interrupt+0xe9/0x220
Jul 22 04:54:20 localhost kernel: [1068823.326929] [<ffffffff81528e6b>] ? smp_apic_timer_interrupt+0x3b/0x50
Jul 22 04:54:20 localhost kernel: [1068823.326933] [<ffffffff81526c61>] ? apic_timer_interrupt+0x81/0x90
Jul 22 04:54:20 localhost kernel: [1068823.326941] [<ffffffff814a8844>] ? fib_table_lookup+0x2e4/0x390
Jul 22 04:54:20 localhost kernel: [1068823.326952] [<ffffffff814a4755>] ? fib_get_table+0x25/0x50
Jul 22 04:54:20 localhost kernel: [1068823.326954] [<ffffffff814b007c>] ? fib4_rule_action+0x3c/0x70
Jul 22 04:54:20 localhost kernel: [1068823.326961] [<ffffffff8144c4e3>] ? fib_rules_lookup+0xc3/0x160
Jul 22 04:54:20 localhost kernel: [1068823.326963] [<ffffffff814afe25>] ? __fib_lookup+0x45/0x80
Jul 22 04:54:20 localhost kernel: [1068823.326968] [<ffffffff814650f5>] ? ip_route_input_noref+0xa05/0xbc0
Jul 22 04:54:20 localhost kernel: [1068823.326980] [<ffffffff811c2ff7>] ? pollwake+0x67/0x70
Jul 22 04:54:20 localhost kernel: [1068823.326989] [<ffffffff8109b550>] ? wake_up_state+0x10/0x10
Jul 22 04:54:20 localhost kernel: [1068823.326994] [<ffffffff810abe17>] ? __wake_up_common+0x57/0x90
Jul 22 04:54:20 localhost kernel: [1068823.326997] [<ffffffff81497bba>] ? arp_process+0x38a/0x660
Jul 22 04:54:20 localhost kernel: [1068823.327003] [<ffffffff81419a46>] ? sock_def_readable+0x36/0x70
Jul 22 04:54:20 localhost kernel: [1068823.327009] [<ffffffff8150c8b0>] ? packet_rcv+0x2e0/0x410
Jul 22 04:54:20 localhost kernel: [1068823.327013] [<ffffffff814301e3>] ? __netif_receive_skb_core+0x563/0x770
Jul 22 04:54:20 localhost kernel: [1068823.327016] [<ffffffff8143046f>] ? netif_receive_skb_internal+0x1f/0x80
Jul 22 04:54:20 localhost kernel: [1068823.327017] [<ffffffff81430ef0>] ? napi_gro_receive+0xd0/0x100
Jul 22 04:54:20 localhost kernel: [1068823.327219] [<ffffffffa00c4eee>] ? vmxnet3_rq_rx_complete+0x58e/0x850 [vmxnet3]
Jul 22 04:54:20 localhost kernel: [1068823.327224] [<ffffffffa00c5305>] ? vmxnet3_poll_rx_only+0x35/0xb0 [vmxnet3]
Jul 22 04:54:20 localhost kernel: [1068823.327227] [<ffffffff814307f9>] ? net_rx_action+0x129/0x250
Jul 22 04:54:20 localhost kernel: [1068823.327229] [<ffffffff8106f461>] ? __do_softirq+0xf1/0x2d0
Jul 22 04:54:20 localhost kernel: [1068823.327230] [<ffffffff8106f875>] ? irq_exit+0x95/0xa0
Jul 22 04:54:20 localhost kernel: [1068823.327233] [<ffffffff81528e70>] ? smp_apic_timer_interrupt+0x40/0x50
Jul 22 04:54:20 localhost kernel: [1068823.327235] [<ffffffff81526c61>] ? apic_timer_interrupt+0x81/0x90
Jul 22 04:54:20 localhost kernel: [1068823.327236] <EOI>
Jul 22 04:54:20 localhost kernel: [1068823.327237] [<ffffffff814a4758>] ? fib_get_table+0x28/0x50
Jul 22 04:54:20 localhost kernel: [1068823.327241] [<ffffffff814b007c>] ? fib4_rule_action+0x3c/0x70
Jul 22 04:54:20 localhost kernel: [1068823.327242] [<ffffffff8144c4e3>] ? fib_rules_lookup+0xc3/0x160
Jul 22 04:54:20 localhost kernel: [1068823.327244] [<ffffffff814afe25>] ? __fib_lookup+0x45/0x80
Jul 22 04:54:20 localhost kernel: [1068823.327246] [<ffffffff81463df3>] ? __ip_route_output_key+0x613/0x900
Jul 22 04:54:20 localhost kernel: [1068823.327272] [<ffffffffa01a0c03>] ? __ext4_handle_dirty_metadata+0x83/0x1d0 [ext4]
Jul 22 04:54:20 localhost kernel: [1068823.327275] [<ffffffff8146a3a0>] ? ip_copy_metadata+0x160/0x160
Jul 22 04:54:20 localhost kernel: [1068823.327277] [<ffffffff81464347>] ? ip_route_output_flow+0x17/0x50
Jul 22 04:54:20 localhost kernel: [1068823.327278] [<ffffffff81493319>] ? udp_sendmsg+0x759/0x8e0
Jul 22 04:54:20 localhost kernel: [1068823.327285] [<ffffffff811d70c1>] ? __mark_inode_dirty+0x171/0x2b0
Jul 22 04:54:20 localhost kernel: [1068823.327289] [<ffffffff8141720b>] ? sock_sendmsg+0x8b/0xc0
Jul 22 04:54:20 localhost kernel: [1068823.327293] [<ffffffff81416f39>] ? move_addr_to_kernel.part.17+0x19/0x60
Jul 22 04:54:20 localhost kernel: [1068823.327295] [<ffffffff814178ef>] ? ___sys_sendmsg+0x29f/0x3a0
Jul 22 04:54:20 localhost kernel: [1068823.327297] [<ffffffff811bfdcf>] ? user_path_at_empty+0x5f/0x90



The machines are running as VM on VMware ESXi and they become unresponsive after running for a week or two.

Just a bit of background on these VM's, these are used a Virtual Router and provides the DHCP and DNS services to other VM's in the network. However these VM's have promiscuous mode enabled on their nics and will have high amount of traffic hitting them. I am not entirely sure on where to start looking at the issue.

From my analysis the issue looks like the amount of traffic hitting these VM's , but not entirely sure on how to optimise the performance of the VM. Once thing, we notice when the issue happens is high percentage of SI (software Interrupts) in the top output. It reaches around 25% and we see the VM becomes nearly unresponsive.

Upon further analysis I see that all all the interrupts are handled by all the CPU's. ( The VM has 4 vCPU's ). All the software interrupts ( for nics) are always handled by CPU0 even though smp_affinity is set to "f". Seems like this is more of a VMware issue as this happens when only using Vmxnet3 NicAdapter type. There is no issue with Software interrupts being handled by multiple CPU's when using E1000 nicAdapter.

Can someone please give some pointers based on the stack trace pasted above? I can share the full kern.log if required.

root@r-1056-VM:~# uname -a
Linux r-1056-VM 3.2.0-4-amd64 #1 SMP Debian 3.2.93-1 x86_64 GNU/Linux
prashanth272
 
Posts: 1
Joined: 2018-07-30 15:18

Return to General Questions

Who is online

Users browsing this forum: pendrachken and 11 guests

fashionable