Scheduled Maintenance: We are aware of an issue with Google, AOL, and Yahoo services as email providers which are blocking new registrations. We are trying to fix the issue and we have several internal and external support tickets in process to resolve the issue. Please see: viewtopic.php?t=158230

 

 

 

Random blocking of writes to disk on a newer kernels

If none of the specific sub-forums seem right for your thread, ask here.
Post Reply
Message
Author
Harkonnen
Posts: 24
Joined: 2011-08-31 12:52

Random blocking of writes to disk on a newer kernels

#1 Post by Harkonnen »

I run Debian stable (kernel 4.9.0-8). But today i tried to install a newer kernel from backports (4.18.0-bpo1) and run into a weird issue. I then tried 4.17.0-bpo3 and 4.17.0-bpo1 and this issue is present too.

So the issue is - at a seemingly random manner, once in few hours, the system becomes barely responsive, LA shoots into 50-100 range, wa at 90+%, syslog have following messages:

Code: Select all

    Oct  2 12:08:45 nikita kernel: [11480.032693] INFO: task md1_raid10:222 blocked for more than 120 seconds.
    Oct  2 12:08:45 nikita kernel: [11480.032698]       Tainted: P           O      4.18.0-0.bpo.1-amd64      #1 Debian 4.18.6-1~bpo9+1
    Oct  2 12:08:45 nikita kernel: [11480.032699] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Oct  2 12:08:45 nikita kernel: [11480.032721] md1_raid10      D    0   222      2 0x80000000
    Oct  2 12:08:45 nikita kernel: [11480.032725] Call Trace:
    Oct  2 12:08:45 nikita kernel: [11480.032733]  ? __schedule+0x3f5/0x880
    Oct  2 12:08:45 nikita kernel: [11480.032737]  schedule+0x32/0x80
    Oct  2 12:08:45 nikita kernel: [11480.032745]  md_super_wait+0x6e/0xa0 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032751]  ? remove_wait_queue+0x60/0x60
    Oct  2 12:08:45 nikita kernel: [11480.032757]  md_update_sb.part.61+0x4af/0x910 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032764]  md_check_recovery+0x312/0x530 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032769]  raid10d+0x64/0x1550 [raid10]
    Oct  2 12:08:45 nikita kernel: [11480.032773]  ? __schedule+0x3fd/0x880
    Oct  2 12:08:45 nikita kernel: [11480.032776]  ? schedule+0x32/0x80
    Oct  2 12:08:45 nikita kernel: [11480.032779]  ? schedule_timeout+0x1e5/0x350
    Oct  2 12:08:45 nikita kernel: [11480.032785]  ? md_thread+0x125/0x170 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032790]  md_thread+0x125/0x170 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032794]  ? remove_wait_queue+0x60/0x60
    Oct  2 12:08:45 nikita kernel: [11480.032797]  kthread+0xf8/0x130
    Oct  2 12:08:45 nikita kernel: [11480.032803]  ? md_rdev_init+0xc0/0xc0 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032805]  ? kthread_create_worker_on_cpu+0x70/0x70
    Oct  2 12:08:45 nikita kernel: [11480.032809]  ret_from_fork+0x35/0x40
    Oct  2 12:08:45 nikita kernel: [11480.032813] INFO: task md2_raid10:225 blocked for more than 120 seconds.
    Oct  2 12:08:45 nikita kernel: [11480.032815]       Tainted: P           O      4.18.0-0.bpo.1-amd64      #1 Debian 4.18.6-1~bpo9+1
    Oct  2 12:08:45 nikita kernel: [11480.032816] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Oct  2 12:08:45 nikita kernel: [11480.032818] md2_raid10      D    0   225      2 0x80000000
    Oct  2 12:08:45 nikita kernel: [11480.032821] Call Trace:
    Oct  2 12:08:45 nikita kernel: [11480.032824]  ? __schedule+0x3f5/0x880
    Oct  2 12:08:45 nikita kernel: [11480.032827]  schedule+0x32/0x80
    Oct  2 12:08:45 nikita kernel: [11480.032833]  md_super_wait+0x6e/0xa0 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032837]  ? remove_wait_queue+0x60/0x60
    Oct  2 12:08:45 nikita kernel: [11480.032842]  write_page+0x177/0x330 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032845]  ? __schedule+0x3fd/0x880
    Oct  2 12:08:45 nikita kernel: [11480.032849]  ? __percpu_ref_switch_mode+0xd0/0x170
    Oct  2 12:08:45 nikita kernel: [11480.032854]  md_update_sb.part.61+0x408/0x910 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032860]  md_check_recovery+0x312/0x530 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032864]  raid10d+0x64/0x1550 [raid10]
    Oct  2 12:08:45 nikita kernel: [11480.032868]  ? lock_timer_base+0x74/0x90
    Oct  2 12:08:45 nikita kernel: [11480.032871]  ? try_to_del_timer_sync+0x4d/0x80
    Oct  2 12:08:45 nikita kernel: [11480.032874]  ? del_timer_sync+0x35/0x40
    Oct  2 12:08:45 nikita kernel: [11480.032877]  ? schedule_timeout+0x177/0x350
    Oct  2 12:08:45 nikita kernel: [11480.032883]  ? md_thread+0x125/0x170 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032888]  md_thread+0x125/0x170 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032891]  ? remove_wait_queue+0x60/0x60
    Oct  2 12:08:45 nikita kernel: [11480.032894]  kthread+0xf8/0x130
    Oct  2 12:08:45 nikita kernel: [11480.032899]  ? md_rdev_init+0xc0/0xc0 [md_mod]
    Oct  2 12:08:45 nikita kernel: [11480.032902]  ? kthread_create_worker_on_cpu+0x70/0x70
    Oct  2 12:08:45 nikita kernel: [11480.032905]  ret_from_fork+0x35/0x40
It is then repeated for multiple processes. Any process that tries to write to disk end up in D state. Reading from disk is possible. After 3-4 minutes system returns to normal. There is no other error messages neither prior nor after this events. I dont have non raid disks in system so cant check if it is caused by raid subsystem or it is a victim of an actual cause. Network filesystems are unaffected.

Does anybody have any idea what it is and how it can be fixed or at least debugged? My google-fu failed to produce anything of value :( All similar problems points to hardware, usually usb, which is not the case here. Its not that i need 4.18 now, 4.9 works, but i would like to find the root of a problem before something forces me to a newer kernel...

User avatar
bw123
Posts: 4015
Joined: 2011-05-09 06:02
Has thanked: 1 time
Been thanked: 28 times

Re: Random blocking of writes to disk on a newer kernels

#2 Post by bw123 »

Never saw that, what does memory use look like when it does that? plenty of mem/swap available?
resigned by AI ChatGPT

Harkonnen
Posts: 24
Joined: 2011-08-31 12:52

Re: Random blocking of writes to disk on a newer kernels

#3 Post by Harkonnen »

No its not swap, more than half of 16GB were free, swap used = 0.

User avatar
Head_on_a_Stick
Posts: 14114
Joined: 2014-06-01 17:46
Location: London, England
Has thanked: 81 times
Been thanked: 132 times

Re: Random blocking of writes to disk on a newer kernels

#4 Post by Head_on_a_Stick »

Harkonnen wrote:I run Debian stable (kernel 4.9.0-8). But today i tried to install a newer kernel from backports (4.18.0-bpo1) and run into a weird issue. I then tried 4.17.0-bpo3 and 4.17.0-bpo1 and this issue is present too.
Perhaps try going further back until you find when the bug appeared?

https://wiki.debian.org/DebianKernel/GitBisect
deadbang

Harkonnen
Posts: 24
Joined: 2011-08-31 12:52

Re: Random blocking of writes to disk on a newer kernels

#5 Post by Harkonnen »

Head_on_a_Stick wrote:erhaps try going further back until you find when the bug appeared?
Well i probably do this eventually... 4.17 is the earliest one still in bpo repository, it was easy to test. GitBisect of a kernel takes a long time, especially considering that i have to wait for a problem to appear.

User avatar
Head_on_a_Stick
Posts: 14114
Joined: 2014-06-01 17:46
Location: London, England
Has thanked: 81 times
Been thanked: 132 times

Re: Random blocking of writes to disk on a newer kernels

#6 Post by Head_on_a_Stick »

Harkonnen wrote:GitBisect of a kernel takes a long time
Yes but if you had bothered to actually read my link then you would have noticed https://snapshot.debian.org/package/linux/.
deadbang

Harkonnen
Posts: 24
Joined: 2011-08-31 12:52

Re: Random blocking of writes to disk on a newer kernels

#7 Post by Harkonnen »

Some additional info. I have tested bpo kernels from archive and 4.16.16-2~bpo9+1 is (probably) unaffected. Also i tried to run affected kernels to find some way to reliably trigger this so i dont have to wait for a day to decide that kernel is good. I did not succeed at this, but there are some observations:
So far the longest time from reboot to first hit is ~6 hours.
The issue some time last shorter than 2 minutes, so it does not trigger hung_task_timeout messages, which makes it even harder to check if kernel is affected.
It does affect non raid disk.
The comment "Network filesystems are unaffected" was wrong.
Also it seems that sometimes even reads to some filesystems are blocked, but not to other.
Affected kernels spew "perf: interrupt took too long, lowering kernel.perf_event_max_sample_rate" from time to time (not at the same time as blocking), but i think this is related.

Harkonnen
Posts: 24
Joined: 2011-08-31 12:52

Re: Random blocking of writes to disk on a newer kernels

#8 Post by Harkonnen »

Before going to gitbisect i build the latest kernel (4.19-rc7) and it does not have this problem. So case closed, whatever it was it is already fixed.

Post Reply