Random blocking of writes to disk on a newer kernels

If none of the more specific forums is the right place to ask

Random blocking of writes to disk on a newer kernels

Postby Harkonnen » 2018-10-02 15:42

I run Debian stable (kernel 4.9.0-8). But today i tried to install a newer kernel from backports (4.18.0-bpo1) and run into a weird issue. I then tried 4.17.0-bpo3 and 4.17.0-bpo1 and this issue is present too.

So the issue is - at a seemingly random manner, once in few hours, the system becomes barely responsive, LA shoots into 50-100 range, wa at 90+%, syslog have following messages:
Code: Select all
    Oct  2 12:08:45 nikita kernel: [11480.032693] INFO: task md1_raid10:222 blocked for more than 120 seconds.
    Oct  2 12:08:45 nikita kernel: [11480.032698]       Tainted: P           O      4.18.0-0.bpo.1-amd64      #1 Debian 4.18.6-1~bpo9+1
    Oct  2 12:08:45 nikita kernel: [11480.032699] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Oct  2 12:08:45 nikita kernel: [11480.032721] md1_raid10      D    0   222      2 0x80000000
    Oct  2 12:08:45 nikita kernel: [11480.032725] Call Trace:
    Oct  2 12:08:45 nikita kernel: [11480.032733]  ? __schedule+0x3f5/0x880
    Oct  2 12:08:45 nikita kernel: [11480.032737]  schedule+0x32/0x80
    Oct  2 12:08:45 nikita kernel: [11480.032745]  md_super_wait+0x6e/0xa0 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032751]  ? remove_wait_queue+0x60/0x60
    Oct  2 12:08:45 nikita kernel: [11480.032757]  md_update_sb.part.61+0x4af/0x910 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032764]  md_check_recovery+0x312/0x530 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032769]  raid10d+0x64/0x1550 [raid10]
    Oct  2 12:08:45 nikita kernel: [11480.032773]  ? __schedule+0x3fd/0x880
    Oct  2 12:08:45 nikita kernel: [11480.032776]  ? schedule+0x32/0x80
    Oct  2 12:08:45 nikita kernel: [11480.032779]  ? schedule_timeout+0x1e5/0x350
    Oct  2 12:08:45 nikita kernel: [11480.032785]  ? md_thread+0x125/0x170 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032790]  md_thread+0x125/0x170 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032794]  ? remove_wait_queue+0x60/0x60
    Oct  2 12:08:45 nikita kernel: [11480.032797]  kthread+0xf8/0x130
    Oct  2 12:08:45 nikita kernel: [11480.032803]  ? md_rdev_init+0xc0/0xc0 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032805]  ? kthread_create_worker_on_cpu+0x70/0x70
    Oct  2 12:08:45 nikita kernel: [11480.032809]  ret_from_fork+0x35/0x40
    Oct  2 12:08:45 nikita kernel: [11480.032813] INFO: task md2_raid10:225 blocked for more than 120 seconds.
    Oct  2 12:08:45 nikita kernel: [11480.032815]       Tainted: P           O      4.18.0-0.bpo.1-amd64      #1 Debian 4.18.6-1~bpo9+1
    Oct  2 12:08:45 nikita kernel: [11480.032816] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Oct  2 12:08:45 nikita kernel: [11480.032818] md2_raid10      D    0   225      2 0x80000000
    Oct  2 12:08:45 nikita kernel: [11480.032821] Call Trace:
    Oct  2 12:08:45 nikita kernel: [11480.032824]  ? __schedule+0x3f5/0x880
    Oct  2 12:08:45 nikita kernel: [11480.032827]  schedule+0x32/0x80
    Oct  2 12:08:45 nikita kernel: [11480.032833]  md_super_wait+0x6e/0xa0 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032837]  ? remove_wait_queue+0x60/0x60
    Oct  2 12:08:45 nikita kernel: [11480.032842]  write_page+0x177/0x330 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032845]  ? __schedule+0x3fd/0x880
    Oct  2 12:08:45 nikita kernel: [11480.032849]  ? __percpu_ref_switch_mode+0xd0/0x170
    Oct  2 12:08:45 nikita kernel: [11480.032854]  md_update_sb.part.61+0x408/0x910 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032860]  md_check_recovery+0x312/0x530 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032864]  raid10d+0x64/0x1550 [raid10]
    Oct  2 12:08:45 nikita kernel: [11480.032868]  ? lock_timer_base+0x74/0x90
    Oct  2 12:08:45 nikita kernel: [11480.032871]  ? try_to_del_timer_sync+0x4d/0x80
    Oct  2 12:08:45 nikita kernel: [11480.032874]  ? del_timer_sync+0x35/0x40
    Oct  2 12:08:45 nikita kernel: [11480.032877]  ? schedule_timeout+0x177/0x350
    Oct  2 12:08:45 nikita kernel: [11480.032883]  ? md_thread+0x125/0x170 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032888]  md_thread+0x125/0x170 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032891]  ? remove_wait_queue+0x60/0x60
    Oct  2 12:08:45 nikita kernel: [11480.032894]  kthread+0xf8/0x130
    Oct  2 12:08:45 nikita kernel: [11480.032899]  ? md_rdev_init+0xc0/0xc0 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032902]  ? kthread_create_worker_on_cpu+0x70/0x70
    Oct  2 12:08:45 nikita kernel: [11480.032905]  ret_from_fork+0x35/0x40

It is then repeated for multiple processes. Any process that tries to write to disk end up in D state. Reading from disk is possible. After 3-4 minutes system returns to normal. There is no other error messages neither prior nor after this events. I dont have non raid disks in system so cant check if it is caused by raid subsystem or it is a victim of an actual cause. Network filesystems are unaffected.

Does anybody have any idea what it is and how it can be fixed or at least debugged? My google-fu failed to produce anything of value :( All similar problems points to hardware, usually usb, which is not the case here. Its not that i need 4.18 now, 4.9 works, but i would like to find the root of a problem before something forces me to a newer kernel...
Harkonnen
 
Posts: 19
Joined: 2011-08-31 12:52

Re: Random blocking of writes to disk on a newer kernels

Postby bw123 » 2018-10-02 16:48

Never saw that, what does memory use look like when it does that? plenty of mem/swap available?
User avatar
bw123
 
Posts: 3572
Joined: 2011-05-09 06:02
Location: TN_USA

Re: Random blocking of writes to disk on a newer kernels

Postby Harkonnen » 2018-10-02 22:28

No its not swap, more than half of 16GB were free, swap used = 0.
Harkonnen
 
Posts: 19
Joined: 2011-08-31 12:52

Re: Random blocking of writes to disk on a newer kernels

Postby Head_on_a_Stick » 2018-10-03 04:59

Harkonnen wrote:I run Debian stable (kernel 4.9.0-8). But today i tried to install a newer kernel from backports (4.18.0-bpo1) and run into a weird issue. I then tried 4.17.0-bpo3 and 4.17.0-bpo1 and this issue is present too.

Perhaps try going further back until you find when the bug appeared?

https://wiki.debian.org/DebianKernel/GitBisect
dbruce wrote:Ubuntu forums try to be like a coffee shop in Seattle. Debian forums strive for the charm and ambience of a skinhead bar in Bacau. We intend to keep it that way.
User avatar
Head_on_a_Stick
 
Posts: 8315
Joined: 2014-06-01 17:46
Location: /dev/chair

Re: Random blocking of writes to disk on a newer kernels

Postby Harkonnen » 2018-10-03 07:36

Head_on_a_Stick wrote:erhaps try going further back until you find when the bug appeared?


Well i probably do this eventually... 4.17 is the earliest one still in bpo repository, it was easy to test. GitBisect of a kernel takes a long time, especially considering that i have to wait for a problem to appear.
Harkonnen
 
Posts: 19
Joined: 2011-08-31 12:52

Re: Random blocking of writes to disk on a newer kernels

Postby Head_on_a_Stick » 2018-10-03 20:02

Harkonnen wrote:GitBisect of a kernel takes a long time

Yes but if you had bothered to actually read my link then you would have noticed https://snapshot.debian.org/package/linux/.
dbruce wrote:Ubuntu forums try to be like a coffee shop in Seattle. Debian forums strive for the charm and ambience of a skinhead bar in Bacau. We intend to keep it that way.
User avatar
Head_on_a_Stick
 
Posts: 8315
Joined: 2014-06-01 17:46
Location: /dev/chair

Re: Random blocking of writes to disk on a newer kernels

Postby Harkonnen » 2018-10-06 08:37

Some additional info. I have tested bpo kernels from archive and 4.16.16-2~bpo9+1 is (probably) unaffected. Also i tried to run affected kernels to find some way to reliably trigger this so i dont have to wait for a day to decide that kernel is good. I did not succeed at this, but there are some observations:
So far the longest time from reboot to first hit is ~6 hours.
The issue some time last shorter than 2 minutes, so it does not trigger hung_task_timeout messages, which makes it even harder to check if kernel is affected.
It does affect non raid disk.
The comment "Network filesystems are unaffected" was wrong.
Also it seems that sometimes even reads to some filesystems are blocked, but not to other.
Affected kernels spew "perf: interrupt took too long, lowering kernel.perf_event_max_sample_rate" from time to time (not at the same time as blocking), but i think this is related.
Harkonnen
 
Posts: 19
Joined: 2011-08-31 12:52

Re: Random blocking of writes to disk on a newer kernels

Postby Harkonnen » 2018-10-10 13:03

Before going to gitbisect i build the latest kernel (4.19-rc7) and it does not have this problem. So case closed, whatever it was it is already fixed.
Harkonnen
 
Posts: 19
Joined: 2011-08-31 12:52


Return to General Questions

Who is online

Users browsing this forum: No registered users and 7 guests

fashionable