[Software] Serious deadlock issue jbd2

Linux Kernel, Network, and Services configuration.
rontarplan
Posts: 17
Joined: 2023-05-11 07:46
Been thanked: 1 time

[Software] Serious deadlock issue jbd2

#1 Post by rontarplan »

Hello people.
I have just registered to be able to post, as I am having a serious problem with a Debian box and I am out of ideas.
I am running several Debian servers as storage servers. All of them do pretty much the same work and carry the same load; some use software RAID, some hardware RAID, and some are really close to each other in terms of hardware and configuration.
However, there is only one box that gives me one headache after another, while the others run totally fine without issues.

Here is the headache:

Code: Select all

INFO: task jbd2/sda5-8:625 blocked for more than 120 seconds.
Not tainted 5.10.0-22-amd64 #1 Debian 5.10.178-3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message."
This message indicates that task jbd2 which is the ext4 journaling task for the hw raid sda5 went into uninterruptible state and basically bricks the server as it never recovers from it. Any other task that wants to write to disks fails also after this message and I have to restart the server with file checking which takes hours. Usually after that the server runs fine for some time but no longer than 48 hours and then this message shows up again.
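For reference, a quick way to see which tasks are currently stuck in the uninterruptible (D) state looks roughly like this (just a sketch; the task names will differ):

Code: Select all

# List tasks in uninterruptible sleep (state D) and the kernel
# function they are currently blocked in
ps -eo pid,state,wchan:32,comm | awk '$2 == "D"'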

I have tried to set

Code: Select all

vm.dirty_background_ratio/bytes
to low values. I have also tried to set

Code: Select all

vm.dirty_ratio
to low and high values. It does not change a thing. The server itself and the RAID card have been replaced, and the drives have been checked. If it turned out that the server simply cannot handle the IOPS, that would be fine with me; however, since this is the only box that acts this way, and I even have a similar one with a software RAID that handles everything well, I am really wondering why this server cannot cope.
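For completeness, the kind of tuning I tried looks roughly like this (the exact values varied; these are only examples, not a recommendation):

Code: Select all

# Lower the background writeback threshold and the hard dirty limit
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10

# Persistent variants could go into a drop-in such as
# /etc/sysctl.d/90-writeback.conf and be applied with: sysctl --system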

Here is a trace from dmesg:

Code: Select all

task:jbd2/sda5-8     state:D stack:    0 pid:  625 ppid:     2 flags:0x00004000
Call Trace:
__schedule+0x282/0x870
schedule+0x46/0xb0
jbd2_journal_commit_transaction+0x163/0x1ad0 [jbd2]
? sched_clock+0x5/0x10
? sched_clock_cpu+0xc/0xb0
? record_times+0x15/0x90
? psi_group_change+0x41/0x220
? __switch_to_asm+0x3a/0x60
? __switch_to+0x114/0x440
? add_wait_queue_exclusive+0x70/0x70
? lock_timer_base+0x61/0x80
kjournald2+0xab/0x270 [jbd2]
? add_wait_queue_exclusive+0x70/0x70
? load_superblock.part.0+0xb0/0xb0 [jbd2]
kthread+0x11b/0x140
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x22/0x30
Could it even be a kernel issue? Or should I disable journaling? I don't know.
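If I were to disable the journal, my understanding is that it would have to be done on the unmounted filesystem, roughly like this (only a sketch, I have not actually tried it):

Code: Select all

# Journaling on ext4 can only be toggled while the filesystem is unmounted
umount /dev/sda5
tune2fs -O ^has_journal /dev/sda5   # remove the journal
e2fsck -f /dev/sda5                 # force a check afterwards
# tune2fs -O has_journal /dev/sda5  # would re-enable it later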

Any help to further debug this issue and maybe find the cause or even a solution would be appreciated.

Aki
Global Moderator
Global Moderator
Posts: 2823
Joined: 2014-07-20 18:12
Location: Europe
Has thanked: 69 times
Been thanked: 385 times

Re: [Software] Serious deadlock issue jbd2

#2 Post by Aki »

Hello,
rontarplan wrote: 2023-05-11 09:25 [..]

Code: Select all

INFO: task jbd2/sda5-8:625 blocked for more than 120 seconds.
Not tainted 5.10.0-22-amd64 #1 Debian 5.10.178-3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message."
This message indicates that task jbd2 which is the ext4 journaling task for the hw raid sda5 went into uninterruptible state and basically bricks the server as it never recovers from it. Any other task that wants to write to disks fails also after this message and I have to restart the server with file checking which takes hours. Usually after that the server runs fine for some time but no longer than 48 hours and then this message shows up again.
The cause could be the RAID (software RAID? hardware RAID?): perhaps it does not respond to the kernel in a timely manner for some reason.

By the way, have you checked the affected partition for free space?
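For example (checking free inodes as well as free blocks, since running out of either produces write failures):

Code: Select all

# Free space and free inodes on the affected filesystem
df -h /dev/sda5
df -i /dev/sda5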
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org
⠈⠳⣄⠀

rontarplan
Posts: 17
Joined: 2023-05-11 07:46
Been thanked: 1 time

Re: [Software] Serious deadlock issue jbd2

#3 Post by rontarplan »

Thanks for your reply and suggestion.

It is a hardware RAID and there is plenty of free space. I have the volumes sda1–sda5 on this array.
Regarding your question: my understanding is that if the RAID did not respond, it would affect jbd2/sda3 as well, which is the journaling task for the root partition.
I am currently recovering the server from the latest outage, so I have to wait until it happens again to check, but from previous failures I can tell that I was still able to read from and write to the root partition (sda3) and that jbd2/sda5 was the only task in an uninterruptible state.
Next time this happens I will check whether I can read/write to sda5 (see the commands below).
It would not bother me so much if jbd2/sda5 returned itself to a working state, but it seems to be waiting for something that never happens and therefore remains in that unusable state, leaving a reboot as the only option.
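For that read/write check I would use something like this (just a sketch; /mnt/data5 stands in for wherever sda5 is mounted):

Code: Select all

# Quick write/read probe on the affected filesystem (hypothetical mount point)
dd if=/dev/zero of=/mnt/data5/.write_test bs=4k count=1 oflag=direct,sync
dd if=/mnt/data5/.write_test of=/dev/null bs=4k count=1 iflag=direct
rm -f /mnt/data5/.write_test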

I need to stress again that I have another server which is pretty much a duplicate in terms of hardware, configuration, partitioning, load, etc., and that one runs without this issue. I don't understand why.

Aki
Global Moderator
Global Moderator
Posts: 2823
Joined: 2014-07-20 18:12
Location: Europe
Has thanked: 69 times
Been thanked: 385 times

Re: [Software] Serious deadlock issue jbd2

#4 Post by Aki »

Hello,
rontarplan wrote: 2023-05-11 12:19 It is a hardware raid [..] I am currently recovering the server from the latest outage [..]
Did the issue start after an uncontrolled shutdown of the hardware RAID due to a power outage?
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org
⠈⠳⣄⠀

rontarplan
Posts: 17
Joined: 2023-05-11 07:46
Been thanked: 1 time

Re: [Software] Serious deadlock issue jbd2

#5 Post by rontarplan »

Did the issue start after an uncontrolled shutdown of the hardware RAID due to a power outage?

No power outage, unfortunately. That would at least have offered a possible explanation.
The outage I was referring to was just the very same deadlock issue, which happened again today about 10 minutes after I had recovered the server from the previous deadlock and started it again. Everything seemed okay for roughly 10 minutes, then the deadlock occurred again. The server is basically unusable.

kent_dorfman766
Posts: 535
Joined: 2022-12-16 06:34
Location: socialist states of america
Has thanked: 57 times
Been thanked: 70 times

Re: [Software] Serious deadlock issue jbd2

#6 Post by kent_dorfman766 »

Check the SMART data on the physical disks that make up the RAID volume. I'm betting you have questionable hardware.

Aki
Global Moderator
Global Moderator
Posts: 2823
Joined: 2014-07-20 18:12
Location: Europe
Has thanked: 69 times
Been thanked: 385 times

Re: [Software] Serious deadlock issue jbd2

#7 Post by Aki »

Hello,
Regarding the kernel call trace: just before the call to the scheduler (at the top of the stack trace), it reports:

Code: Select all

[..]
jbd2_journal_commit_transaction+0x163/0x1ad0 [jbd2]
[..]
The operating system cannot commit the journal transaction to the file system.

What is your kernel version?
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org
⠈⠳⣄⠀

rontarplan
Posts: 17
Joined: 2023-05-11 07:46
Been thanked: 1 time

Re: [Software] Serious deadlock issue jbd2

#8 Post by rontarplan »

What is your kernel version?
It is 5.10.0-22-amd64

I have found an old thread referring to an older kernel showing a similar problem, but I cannot tell if it could be related somehow: https://www.spinics.net/lists/linux-ext4/msg53498.html
Is there any known bug that could cause my issue?

rontarplan
Posts: 17
Joined: 2023-05-11 07:46
Been thanked: 1 time

Re: [Software] Serious deadlock issue jbd2

#9 Post by rontarplan »

kent_dorfman766 wrote: 2023-05-11 15:45 Check the SMART data on the physical disks that make up the RAID volume. I'm betting you have questionable hardware.
Thanks for this suggestion.

The hosting provider had checked the drives a short time ago and did not find any errors, and I relied on that statement.
Nevertheless, I ran

Code: Select all

smartctl -d sat --all /dev/sgX
on every drive.
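(Roughly like this, assuming the controller exposes the member disks as /dev/sg1 through /dev/sg8; the actual device names depend on the setup:)

Code: Select all

# Query every pass-through device behind the RAID controller
# (the sg1..sg8 range is an assumption; adjust to your setup)
for dev in /dev/sg{1..8}; do
    echo "=== $dev ==="
    smartctl -d sat --all "$dev"
done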

SMART reported every drive as

Code: Select all

SMART overall-health self-assessment test result: PASSED
However, two drives showed some abnormal messages, and I wonder whether this could be causing my issue.

Drive 1:

Code: Select all

Error 1 occurred at disk power-on lifetime: 7740 hours (322 days + 12 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 00 00 00 00 00  Error: ICRC, ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 e0 00 b6 d4 40 00  21d+14:15:59.191  READ FPDMA QUEUED
  60 00 d8 00 b0 d4 40 00  21d+14:15:59.188  READ FPDMA QUEUED
  60 00 d0 00 ae d4 40 00  21d+14:15:59.171  READ FPDMA QUEUED
  60 00 c8 00 a8 d4 40 00  21d+14:15:59.169  READ FPDMA QUEUED
  60 00 c0 00 6e d3 40 00  21d+14:15:59.136  READ FPDMA QUEUED
Drive 2:

Code: Select all

SMART Error Log Version: 1
ATA Error Count: 55 (device log contains only the most recent five errors)
	CR = Command Register [HEX]
	FR = Features Register [HEX]
	SC = Sector Count Register [HEX]
	SN = Sector Number Register [HEX]
	CL = Cylinder Low Register [HEX]
	CH = Cylinder High Register [HEX]
	DH = Device/Head Register [HEX]
	DC = Device Command Register [HEX]
	ER = Error register [HEX]
	ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 55 occurred at disk power-on lifetime: 19974 hours (832 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 53 02 00 00 00 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 00      00:00:32.215  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00      00:00:32.214  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:00:32.214  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 00      00:00:32.214  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00      00:00:32.213  IDENTIFY DEVICE

Error 54 occurred at disk power-on lifetime: 19974 hours (832 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 53 46 00 00 00 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 03 46 00 00 00 a0 00      00:00:32.214  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 00      00:00:32.214  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00      00:00:32.213  IDENTIFY DEVICE
  25 00 08 f8 ff 3f e0 00      00:00:32.183  READ DMA EXT
  ef 10 02 00 00 00 a0 00      00:00:32.183  SET FEATURES [Enable SATA feature]

Error 53 occurred at disk power-on lifetime: 19974 hours (832 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 53 02 00 00 00 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 00      00:00:32.214  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00      00:00:32.213  IDENTIFY DEVICE
  25 00 08 f8 ff 3f e0 00      00:00:32.183  READ DMA EXT
  ef 10 02 00 00 00 a0 00      00:00:32.183  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00      00:00:32.182  IDENTIFY DEVICE

Error 52 occurred at disk power-on lifetime: 19974 hours (832 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 53 08 f8 ff 3f e0  Error: ABRT 8 sectors at LBA = 0x003ffff8 = 4194296

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 08 f8 ff 3f e0 00      00:00:32.183  READ DMA EXT
  ef 10 02 00 00 00 a0 00      00:00:32.183  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00      00:00:32.182  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:00:32.182  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 00      00:00:32.182  SET FEATURES [Enable SATA feature]

Error 51 occurred at disk power-on lifetime: 19974 hours (832 days + 6 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 53 02 00 00 00 a0  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ef 10 02 00 00 00 a0 00      00:00:32.183  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00      00:00:32.182  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00      00:00:32.182  SET FEATURES [Set transfer mode]
  ef 10 02 00 00 00 a0 00      00:00:32.182  SET FEATURES [Enable SATA feature]
  ec 00 00 00 00 00 a0 00      00:00:32.181  IDENTIFY DEVICE
All other drives had

Code: Select all

No Errors Logged
Do you think I should get these replaced?

kent_dorfman766
Posts: 535
Joined: 2022-12-16 06:34
Location: socialist states of america
Has thanked: 57 times
Been thanked: 70 times

Re: [Software] Serious deadlock issue jbd2

#10 Post by kent_dorfman766 »

Well, if you are using hardware RAID on the drives while concurrently running the smartd monitoring daemon, which occasionally kicks off self-tests, that would be a bad thing, because you don't want a SMART self-test starting while the array is active.
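For what it's worth, scheduled self-tests are driven by an -s directive in /etc/smartd.conf, so checking for (and dropping) such a schedule would look roughly like this (a sketch; whether smartd is even running on that box is an assumption):

Code: Select all

# Is a self-test schedule configured for smartd?
grep -n -- '-s ' /etc/smartd.conf

# Example of a line WITH a schedule (short test nightly, long test on Saturdays):
#   DEVICESCAN -a -s (S/../.././02|L/../../6/03)
# Removing the "-s (...)" part keeps monitoring but stops scheduled self-tests;
# afterwards restart the daemon (unit name may be smartd or smartmontools)
systemctl restart smartmontools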

I can't tell anything from the drive one error because I don't know how many hours are on the unit. I'm more interested in the

smartctl /dev/sd? output that shows the raw values... if it's even available. With a hardware RAID controller, you may not have raw disk access to the devices that make up the array.

rontarplan
Posts: 17
Joined: 2023-05-11 07:46
Been thanked: 1 time

Re: [Software] Serious deadlock issue jbd2

#11 Post by rontarplan »

kent_dorfman766 wrote: 2023-05-11 23:20 smartctl /dev/sd? output that shows the raw values... if it's even available. With a hardware RAID controller, you may not have raw disk access to the devices that make up the array.
It seems that this is the case: I cannot obtain raw disk access, only access through /dev/sgX. But that is also the way recommended by the hosting provider and by independent sources.
I have repeated the smartctl test from rescue mode and got the same errors on the very same drives as before.

Aki
Global Moderator
Global Moderator
Posts: 2823
Joined: 2014-07-20 18:12
Location: Europe
Has thanked: 69 times
Been thanked: 385 times

Re: [Software] Serious deadlock issue jbd2

#12 Post by Aki »

rontarplan wrote: 2023-05-12 00:37 [..] I have repeated the smartctl test from rescue mode and got the same errors on the very same drives as before.
Is it a hardware issue in the end?
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org
⠈⠳⣄⠀

kent_dorfman766
Posts: 535
Joined: 2022-12-16 06:34
Location: socialist states of america
Has thanked: 57 times
Been thanked: 70 times

Re: [Software] Serious deadlock issue jbd2

#13 Post by kent_dorfman766 »

At this point in the game I'd suggest removing the member disks from the RAID controller and SMART-testing them individually on a dedicated SATA channel.
If they test out OK, then I'd suggest rebuilding the array, but choosing ext3 and all default options, and then stress-testing the array with lots of superfluous disk I/O at the filesystem level.
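Such a stress run could be sketched with fio, for example (assuming fio is installed and the array is mounted at a hypothetical /mnt/array):

Code: Select all

# Sustained random-write load on the rebuilt array (hypothetical mount point)
fio --name=stress --directory=/mnt/array --rw=randwrite --bs=4k \
    --size=4G --numjobs=8 --iodepth=16 --ioengine=libaio \
    --time_based --runtime=600 --group_reporting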

rontarplan
Posts: 17
Joined: 2023-05-11 07:46
Been thanked: 1 time

Re: [Software] Serious deadlock issue jbd2

#14 Post by rontarplan »

So we have replaced the drives that had logged errors.
But I am not too optimistic that this will change anything.

In fact, the first try brought numerous errors:

Code: Select all

 INFO: task kworker/u24:2:73 blocked for more than 120 seconds.
       Not tainted 5.10.0-22-amd64 #1 Debian 5.10.178-3
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u24:2   state:D stack:    0 pid:   73 ppid:     2 flags:0x00004000
 Workqueue: writeback wb_workfn (flush-8:0)
 Call Trace:
  __schedule+0x282/0x870
  schedule+0x46/0xb0
  schedule_preempt_disabled+0xa/0x10
  __mutex_lock.constprop.0+0x2a1/0x460
  ext4_mb_initialize_context+0x158/0x1d0 [ext4]
  ext4_mb_new_blocks+0x148/0xf40 [ext4]
  ? ext4_cache_extents+0x5a/0xd0 [ext4]
  ? ext4_find_extent+0x3ec/0x450 [ext4]
  ext4_ext_map_blocks+0x861/0x18c0 [ext4]
  ? release_pages+0x3d8/0x460
  ? __pagevec_release+0x1c/0x60
  ext4_map_blocks+0x18e/0x5b0 [ext4]
  ext4_writepages+0x735/0xfd0 [ext4]
  ? sd_init_command+0x486/0xd30 [sd_mod]
  ? aac_queuecommand+0x17/0x30 [aacraid]
  ? scsi_queue_rq+0x323/0xa30 [scsi_mod]
  ? elv_rb_del+0x1f/0x30
  ? deadline_remove_request+0x55/0xc0
  ? fprop_fraction_percpu+0x2b/0x80
  do_writepages+0x34/0xc0
  ? wb_calc_thresh+0x4f/0x70
  __writeback_single_inode+0x39/0x2a0
  writeback_sb_inodes+0x20d/0x4a0
  __writeback_inodes_wb+0x4c/0xe0
  wb_writeback+0x1d8/0x2a0
  wb_workfn+0x2d9/0x4e0
  ? check_preempt_curr+0x2d/0x70
  ? ttwu_do_wakeup+0x17/0x140
  process_one_work+0x1b6/0x350
  worker_thread+0x53/0x3e0
  ? process_one_work+0x350/0x350
  kthread+0x11b/0x140
  ? __kthread_bind_mask+0x60/0x60
  ret_from_fork+0x22/0x30
 INFO: task kworker/u24:4:75 blocked for more than 120 seconds.
       Not tainted 5.10.0-22-amd64 #1 Debian 5.10.178-3
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u24:4   state:D stack:    0 pid:   75 ppid:     2 flags:0x00004000
 Workqueue: writeback wb_workfn (flush-8:0)
 Call Trace:
  __schedule+0x282/0x870
  schedule+0x46/0xb0
  schedule_preempt_disabled+0xa/0x10
  __mutex_lock.constprop.0+0x2a1/0x460
  ext4_mb_initialize_context+0x158/0x1d0 [ext4]
  ext4_mb_new_blocks+0x148/0xf40 [ext4]
  ? ext4_cache_extents+0x5a/0xd0 [ext4]
  ? ext4_find_extent+0x3ec/0x450 [ext4]
  ext4_ext_map_blocks+0x861/0x18c0 [ext4]
  ? release_pages+0x3d8/0x460
  ? __pagevec_release+0x1c/0x60
  ext4_map_blocks+0x18e/0x5b0 [ext4]
  ext4_writepages+0x735/0xfd0 [ext4]
  ? sd_init_command+0x486/0xd30 [sd_mod]
  ? aac_queuecommand+0x17/0x30 [aacraid]
  ? scsi_queue_rq+0x323/0xa30 [scsi_mod]
  ? elv_rb_del+0x1f/0x30
  ? deadline_remove_request+0x55/0xc0
  ? fprop_fraction_percpu+0x2b/0x80
  do_writepages+0x34/0xc0
  ? wb_calc_thresh+0x4f/0x70
  __writeback_single_inode+0x39/0x2a0
  writeback_sb_inodes+0x20d/0x4a0
  __writeback_inodes_wb+0x4c/0xe0
  wb_writeback+0x1d8/0x2a0
  wb_workfn+0x2d9/0x4e0
  ? check_preempt_curr+0x2d/0x70
  ? ttwu_do_wakeup+0x17/0x140
  process_one_work+0x1b6/0x350
  worker_thread+0x53/0x3e0
  ? process_one_work+0x350/0x350
  kthread+0x11b/0x140
  ? __kthread_bind_mask+0x60/0x60
  ret_from_fork+0x22/0x30
 INFO: task kworker/u24:5:76 blocked for more than 120 seconds.
       Not tainted 5.10.0-22-amd64 #1 Debian 5.10.178-3
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u24:5   state:D stack:    0 pid:   76 ppid:     2 flags:0x00004000
 Workqueue: writeback wb_workfn (flush-8:0)
 Call Trace:
  __schedule+0x282/0x870
  schedule+0x46/0xb0
  wait_transaction_locked+0x8a/0xd0 [jbd2]
  ? add_wait_queue_exclusive+0x70/0x70
  add_transaction_credits+0xd6/0x2a0 [jbd2]
  start_this_handle+0xfb/0x520 [jbd2]
  ? jbd2__journal_start+0xc3/0x1e0 [jbd2]
  ? kmem_cache_alloc+0xed/0x1f0
  jbd2__journal_start+0xfb/0x1e0 [jbd2]
  __ext4_journal_start_sb+0xfb/0x110 [ext4]
  ext4_writepages+0x2fc/0xfd0 [ext4]
  ? aac_scsi_cmd+0x6c9/0x1230 [aacraid]
  ? sd_init_command+0x486/0xd30 [sd_mod]
  ? aac_queuecommand+0x17/0x30 [aacraid]
  ? scsi_queue_rq+0x323/0xa30 [scsi_mod]
  ? elv_rb_del+0x1f/0x30
  ? deadline_remove_request+0x55/0xc0
  ? fprop_fraction_percpu+0x2b/0x80
  do_writepages+0x34/0xc0
  ? wb_calc_thresh+0x4f/0x70
  __writeback_single_inode+0x39/0x2a0
  writeback_sb_inodes+0x20d/0x4a0
  __writeback_inodes_wb+0x4c/0xe0
  wb_writeback+0x1d8/0x2a0
  wb_workfn+0x2d9/0x4e0
  ? check_preempt_curr+0x57/0x70
  ? ttwu_do_wakeup+0x17/0x140
  process_one_work+0x1b6/0x350
  worker_thread+0x53/0x3e0
  ? process_one_work+0x350/0x350
  kthread+0x11b/0x140
  ? __kthread_bind_mask+0x60/0x60
  ret_from_fork+0x22/0x30
 INFO: task kworker/u24:7:78 blocked for more than 120 seconds.
       Not tainted 5.10.0-22-amd64 #1 Debian 5.10.178-3
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u24:7   state:D stack:    0 pid:   78 ppid:     2 flags:0x00004000
 Workqueue: writeback wb_workfn (flush-8:0)
 Call Trace:
  __schedule+0x282/0x870
  schedule+0x46/0xb0
  schedule_preempt_disabled+0xa/0x10
  __mutex_lock.constprop.0+0x2a1/0x460
  ext4_mb_initialize_context+0x158/0x1d0 [ext4]
  ext4_mb_new_blocks+0x148/0xf40 [ext4]
  ? ext4_cache_extents+0x5a/0xd0 [ext4]
  ? ext4_find_extent+0x3ec/0x450 [ext4]
  ext4_ext_map_blocks+0x861/0x18c0 [ext4]
  ? release_pages+0x3d8/0x460
  ? __pagevec_release+0x1c/0x60
  ext4_map_blocks+0x18e/0x5b0 [ext4]
  ext4_writepages+0x735/0xfd0 [ext4]
  ? sd_init_command+0x486/0xd30 [sd_mod]
  ? aac_queuecommand+0x17/0x30 [aacraid]
  ? scsi_queue_rq+0x323/0xa30 [scsi_mod]
  ? elv_rb_del+0x1f/0x30
  ? deadline_remove_request+0x55/0xc0
  ? fprop_fraction_percpu+0x2b/0x80
  do_writepages+0x34/0xc0
  ? wb_calc_thresh+0x4f/0x70
  __writeback_single_inode+0x39/0x2a0
  writeback_sb_inodes+0x20d/0x4a0
  __writeback_inodes_wb+0x4c/0xe0
  wb_writeback+0x1d8/0x2a0
  wb_workfn+0x2d9/0x4e0
  ? check_preempt_curr+0x57/0x70
  ? ttwu_do_wakeup+0x17/0x140
  process_one_work+0x1b6/0x350
  worker_thread+0x53/0x3e0
  ? process_one_work+0x350/0x350
  kthread+0x11b/0x140
  ? __kthread_bind_mask+0x60/0x60
  ret_from_fork+0x22/0x30
 INFO: task kworker/u24:8:79 blocked for more than 120 seconds.
       Not tainted 5.10.0-22-amd64 #1 Debian 5.10.178-3
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u24:8   state:D stack:    0 pid:   79 ppid:     2 flags:0x00004000
 Workqueue: writeback wb_workfn (flush-8:0)
 Call Trace:
  __schedule+0x282/0x870
  schedule+0x46/0xb0
  wait_transaction_locked+0x8a/0xd0 [jbd2]
  ? add_wait_queue_exclusive+0x70/0x70
  add_transaction_credits+0xd6/0x2a0 [jbd2]
  start_this_handle+0xfb/0x520 [jbd2]
  ? jbd2__journal_start+0xc3/0x1e0 [jbd2]
  ? kmem_cache_alloc+0xed/0x1f0
  jbd2__journal_start+0xfb/0x1e0 [jbd2]
  __ext4_journal_start_sb+0xfb/0x110 [ext4]
  ext4_writepages+0x2fc/0xfd0 [ext4]
  ? _cond_resched+0x16/0x50
  ? __getblk_gfp+0x27/0x60
  ? ext4_mark_iloc_dirty+0x731/0xa10 [ext4]
  ? fprop_fraction_percpu+0x2b/0x80
  do_writepages+0x34/0xc0
  ? wb_calc_thresh+0x4f/0x70
  __writeback_single_inode+0x39/0x2a0
  writeback_sb_inodes+0x20d/0x4a0
  __writeback_inodes_wb+0x4c/0xe0
  wb_writeback+0x1d8/0x2a0
  wb_workfn+0x2d9/0x4e0
  ? check_preempt_curr+0x2d/0x70
  ? ttwu_do_wakeup+0x17/0x140
  process_one_work+0x1b6/0x350
  worker_thread+0x53/0x3e0
  ? process_one_work+0x350/0x350
  kthread+0x11b/0x140
  ? __kthread_bind_mask+0x60/0x60
  ret_from_fork+0x22/0x30
 INFO: task kworker/u24:11:82 blocked for more than 120 seconds.
       Not tainted 5.10.0-22-amd64 #1 Debian 5.10.178-3
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:kworker/u24:11  state:D stack:    0 pid:   82 ppid:     2 flags:0x00004000
 Workqueue: writeback wb_workfn (flush-8:0)
 Call Trace:
  __schedule+0x282/0x870
  schedule+0x46/0xb0
  schedule_preempt_disabled+0xa/0x10
  __mutex_lock.constprop.0+0x2a1/0x460
  ext4_mb_initialize_context+0x158/0x1d0 [ext4]
  ext4_mb_new_blocks+0x148/0xf40 [ext4]
  ? ext4_cache_extents+0x5a/0xd0 [ext4]
  ? ext4_find_extent+0x3ec/0x450 [ext4]
  ext4_ext_map_blocks+0x861/0x18c0 [ext4]
  ? release_pages+0x3d8/0x460
  ? __pagevec_release+0x1c/0x60
  ext4_map_blocks+0x18e/0x5b0 [ext4]
  ext4_writepages+0x735/0xfd0 [ext4]
  ? sd_init_command+0x486/0xd30 [sd_mod]
  ? aac_queuecommand+0x17/0x30 [aacraid]
  ? scsi_queue_rq+0x323/0xa30 [scsi_mod]
  ? elv_rb_del+0x1f/0x30
  ? deadline_remove_request+0x55/0xc0
  ? fprop_fraction_percpu+0x2b/0x80
  do_writepages+0x34/0xc0
  ? wb_calc_thresh+0x4f/0x70
  __writeback_single_inode+0x39/0x2a0
  writeback_sb_inodes+0x20d/0x4a0
  __writeback_inodes_wb+0x4c/0xe0
  wb_writeback+0x1d8/0x2a0
  wb_workfn+0x2d9/0x4e0
  ? check_preempt_curr+0x2d/0x70
  ? ttwu_do_wakeup+0x17/0x140
  process_one_work+0x1b6/0x350
  worker_thread+0x53/0x3e0
  ? process_one_work+0x350/0x350
  kthread+0x11b/0x140
  ? __kthread_bind_mask+0x60/0x60
  ret_from_fork+0x22/0x30
 INFO: task jbd2/sda5-8:614 blocked for more than 120 seconds.
       Not tainted 5.10.0-22-amd64 #1 Debian 5.10.178-3
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:jbd2/sda5-8     state:D stack:    0 pid:  614 ppid:     2 flags:0x00004000
 Call Trace:
  __schedule+0x282/0x870
  schedule+0x46/0xb0
  jbd2_journal_commit_transaction+0x320/0x1ad0 [jbd2]
  ? add_wait_queue_exclusive+0x70/0x70
  ? lock_timer_base+0x61/0x80
  kjournald2+0xab/0x270 [jbd2]
  ? add_wait_queue_exclusive+0x70/0x70
  ? load_superblock.part.0+0xb0/0xb0 [jbd2]
  kthread+0x11b/0x140
  ? __kthread_bind_mask+0x60/0x60
  ret_from_fork+0x22/0x30
 INFO: task ext4lazyinit:640 blocked for more than 120 seconds.
       Not tainted 5.10.0-22-amd64 #1 Debian 5.10.178-3
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:ext4lazyinit    state:D stack:    0 pid:  640 ppid:     2 flags:0x00004000
 Call Trace:
  __schedule+0x282/0x870
  schedule+0x46/0xb0
  wait_transaction_locked+0x8a/0xd0 [jbd2]
  ? add_wait_queue_exclusive+0x70/0x70
  add_transaction_credits+0xd6/0x2a0 [jbd2]
  start_this_handle+0xfb/0x520 [jbd2]
  ? jbd2__journal_start+0x91/0x1e0 [jbd2]
  ? kmem_cache_alloc+0xed/0x1f0
  jbd2__journal_start+0xfb/0x1e0 [jbd2]
  __ext4_journal_start_sb+0xfb/0x110 [ext4]
  ext4_init_inode_table+0xf3/0x390 [ext4]
  ? schedule_timeout+0x93/0x150
  ext4_lazyinit_thread+0x42a/0x5b0 [ext4]
  ? ext4_sync_fs+0x200/0x200 [ext4]
  kthread+0x11b/0x140
  ? __kthread_bind_mask+0x60/0x60
  ret_from_fork+0x22/0x30
I am retrying now with

Code: Select all

echo 256 > /sys/block/sda/queue/read_ahead_kb
sysctl -w vm.dirty_background_bytes=104857600
sysctl -w vm.dirty_bytes=511870912
just in case it has something to do with the default settings (which I don't really believe, since the other servers work fine with the defaults...).
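To make sure the values actually took effect, I am verifying them like this:

Code: Select all

# Verify the current writeback and read-ahead settings
sysctl vm.dirty_background_bytes vm.dirty_bytes
cat /sys/block/sda/queue/read_ahead_kb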

rontarplan
Posts: 17
Joined: 2023-05-11 07:46
Been thanked: 1 time

Re: [Software] Serious deadlock issue jbd2

#15 Post by rontarplan »

Watching

Code: Select all

iostat -dxm 1
watch -n 0.1 'cat /proc/meminfo | egrep "(Dirty|Writeback)"'
dmesg -Tw
Any idea what else I should keep an eye on?

rontarplan
Posts: 17
Joined: 2023-05-11 07:46
Been thanked: 1 time

Re: [Software] Serious deadlock issue jbd2

#16 Post by rontarplan »

So the server is running and the load has been ramped up to the targeted state.
At the moment the server runs as it should, and there is no indication of an imminent failure. But the same was true in the past, when it never ran longer than 48 hours before breaking.
So let's see how long it lasts this time.

Aki
Global Moderator
Global Moderator
Posts: 2823
Joined: 2014-07-20 18:12
Location: Europe
Has thanked: 69 times
Been thanked: 385 times

Re: [Software] Serious deadlock issue jbd2

#17 Post by Aki »

@rontarplan: do you have any updates?
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org
⠈⠳⣄⠀

rontarplan
Posts: 17
Joined: 2023-05-11 07:46
Been thanked: 1 time

Re: [Software] Serious deadlock issue jbd2

#18 Post by rontarplan »

Aki wrote: 2023-05-22 16:22 @rontarplan: do you have any updates?
Hello, and yes. It's been a wild ride:

The hosting provider performed another hardware check and offered to move the disks to a different server once more, in order to exclude any remaining hardware issues.
When the disks had been moved to the new server, the RAID performed a rebuild :shock: . During the rebuild, one disk that had never shown any SMART errors before died.
It got replaced.
I noticed that the rebuild was now much faster than when the other disks were replaced earlier.
After it completed, I put the server back into production as before. Since then I have not had any failures. But I am still wary, if a little more optimistic.

Maybe that third replaced disk, the one that never gave any SMART indication of imminent failure, caused it all. But I am still monitoring.

rontarplan
Posts: 17
Joined: 2023-05-11 07:46
Been thanked: 1 time

Re: [Software] Serious deadlock issue jbd2

#19 Post by rontarplan »

Haha. It just happened again, shortly after I wrote my message saying I was optimistic:

Code: Select all

INFO: task jbd2/sda5-8:623 blocked for more than 120 seconds.
      Not tainted 5.10.0-22-amd64 #1 Debian 5.10.178-3
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:jbd2/sda5-8     state:D stack:    0 pid:  623 ppid:     2 flags:0x00004000
Call Trace:
 __schedule+0x282/0x870
 schedule+0x46/0xb0
 jbd2_journal_commit_transaction+0x163/0x1ad0 [jbd2]
 ? update_load_avg+0x7a/0x5d0
 ? newidle_balance+0x282/0x3d0
 ? __switch_to_asm+0x3a/0x60
 ? __switch_to+0x114/0x440
 ? add_wait_queue_exclusive+0x70/0x70
 ? lock_timer_base+0x61/0x80
 kjournald2+0xab/0x270 [jbd2]
 ? add_wait_queue_exclusive+0x70/0x70
 ? load_superblock.part.0+0xb0/0xb0 [jbd2]
 kthread+0x11b/0x140
 ? __kthread_bind_mask+0x60/0x60
 ret_from_fork+0x22/0x30
Load turned out to be funny too:

Code: Select all

load average: 7918.04, 7920.83, 7920.76
But this time the writeback setting might be the cause. If

Code: Select all

sysctl -w vm.dirty_bytes=511870912
was too low, it might block any process that wants to write to the array for too long. So I shall try a higher setting, or maybe even go back to the default settings.
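As far as I understand, going back to the defaults just means writing the ratio knobs again (the usual kernel defaults are 10/20), since setting a ratio disables the corresponding *_bytes value:

Code: Select all

# Either raise vm.dirty_bytes, or restore the ratio-based defaults;
# writing the ratios makes the kernel ignore the *_bytes variants again
sysctl -w vm.dirty_background_ratio=10
sysctl -w vm.dirty_ratio=20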

Aki
Global Moderator
Global Moderator
Posts: 2823
Joined: 2014-07-20 18:12
Location: Europe
Has thanked: 69 times
Been thanked: 385 times

Re: [Software] Serious deadlock issue jbd2

#20 Post by Aki »

Hello,
rontarplan wrote: 2023-05-23 13:16 Haha. It just happened again, shortly after I wrote my message saying I was optimistic:
[..]
Load turned out to be funny too:

Code: Select all

load average: 7918.04, 7920.83, 7920.76
What is the memory (RAM/swap) usage under such high loads?
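For example, a quick snapshot taken while the load is that high would already tell a lot:

Code: Select all

# Memory/swap snapshot and per-second activity while the load is high
free -h
vmstat 1 5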
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Debian - The universal operating system
⢿⡄⠘⠷⠚⠋⠀ https://www.debian.org
⠈⠳⣄⠀
