[Software] NFS client mounted over InfiniBand freezes periodically

nahso4
Posts: 1
Joined: 2024-12-31 05:08

[Software] NFS client mounted over InfiniBand freezes periodically

#1 Post by nahso4 »

Hello,
I have a cluster of 8 machines: 7 compute nodes (g0[1-7]) and 1 management node (mgt). The management node has a shared directory, /share, which is mounted on all compute nodes over InfiniBand with RDMA. However, some clients freeze at random. I enabled NFS client debug logging with the following commands:

Code: Select all

rpcdebug -m rpc -s all
rpcdebug -m nfs -s all
After that, journalctl -fl shows:

Code: Select all

...
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0007 highest_used=2 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=000f highest_used=3 slotid=3
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 3 highest_used_slotid 2
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: NFS call  test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=9813091 slotid=2 max_slotid=2 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0007 highest_used=2 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=000f highest_used=3 slotid=3
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 3 highest_used_slotid 2
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: nfs41_handle_recallable_state_revoked: Recallable state revoked on server mgt-ib!
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: NFS reply lookup: 0
Dec 31 12:37:29 g02 kernel: NFS: nfs_update_inode(0:44/118169504428 fh_crc=0x51ea6285 ct=1 info=0x427e7f)
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/146245935125), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/146245935125), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/118169504428), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: open file(results/H2O_run-123_train.txt)
Dec 31 12:37:29 g02 kernel: NFS call  test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=23814709 slotid=1 max_slotid=1 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: nfs41_handle_recallable_state_revoked: Recallable state revoked on server mgt-ib!
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
Dec 31 12:37:29 g02 kernel: NFS reply lookup: -2
Dec 31 12:37:29 g02 kernel: NFS: dentry_delete(intel64_lin/x86_64, 80c)
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/512), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/515), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/120259161769), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/113817081017), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/115970221607), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/36507614869), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/38655871530), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/40808747028), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/512), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/515), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/120259161769), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/113817081017), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/115970221607), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/36507614869), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/38655871530), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/40808747028), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: atomic_open(0:44/40808747028), libc.so.6
Dec 31 12:37:29 g02 kernel: NFS call  test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0000 highest_used=4294967295 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=53210473 slotid=0 max_slotid=0 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
...
(mgt-ib is the address of the management node.)
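
The flags=0x00000040 value that repeats in the log is a bitmask of NFSv4.1 SEQUENCE status flags. As a minimal sketch, assuming the bit assignments of the SEQ4_STATUS_* constants in RFC 5661, the shell function below names the bits that are set. The function name decode_seq4_flags is hypothetical and is not part of any NFS tool.

```shell
# Decode an NFSv4.1 SEQUENCE status flags bitmask into SEQ4_STATUS_* names.
# Assumption: bit order follows the SEQ4_STATUS_* constants in RFC 5661.
decode_seq4_flags() (
    value=$(( $1 ))   # accepts decimal or 0x-prefixed hex
    bit=1
    out=""
    for name in \
        CB_PATH_DOWN CB_GSS_CONTEXTS_EXPIRING CB_GSS_CONTEXTS_EXPIRED \
        EXPIRED_ALL_STATE_REVOKED EXPIRED_SOME_STATE_REVOKED \
        ADMIN_STATE_REVOKED RECALLABLE_STATE_REVOKED LEASE_MOVED \
        RESTART_RECLAIM_NEEDED CB_PATH_DOWN_SESSION BACKCHANNEL_FAULT \
        DEVID_CHANGED DEVID_DELETED
    do
        # Append the name when the current bit is set in the value.
        if [ $(( value & bit )) -ne 0 ]; then
            out="${out:+$out }SEQ4_STATUS_$name"
        fi
        bit=$(( bit * 2 ))
    done
    printf '%s\n' "$out"
)

decode_seq4_flags 0x00000040
```

For the value in the log this prints SEQ4_STATUS_RECALLABLE_STATE_REVOKED, which agrees with the "Recallable state revoked on server mgt-ib!" messages.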

Restarting the network with

Code: Select all

systemctl restart networking
does not help. The only ways to resolve the problem are to reboot the node, or to kill the tasks on the node and remount the shared directory.
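
To notice a freeze without hanging the shell that checks for it, the probe can be run under a time limit. This is only a sketch: it assumes GNU coreutils (timeout, stat -t), the helper name nfs_mount_responsive is hypothetical, and a stat answered from cached attributes may still succeed while the server is unreachable.

```shell
# Return 0 if a stat on the given path completes within the time limit.
# $1: path on the mount, $2: time limit in seconds (default 5).
# KILL is used because a process blocked on a hard NFS mount may ignore TERM.
nfs_mount_responsive() {
    timeout -s KILL "${2:-5}" stat -t -- "$1" >/dev/null 2>&1
}

# Example call on a compute node (not executed here):
#   nfs_mount_responsive /share 5 || echo "/share did not answer within 5s"
```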

Here is my nfs.conf on the management node:

Code: Select all

#
# This is a general configuration for the
# NFS daemons and tools
#
[general]
pipefs-directory=/run/rpc_pipefs
#
[nfsrahead]
# nfs=15000
# nfs4=16000
#
[exports]
# rootdir=/export
#
[exportfs]
# debug=0
#
[gssd]
# verbosity=0
# rpc-verbosity=0
# use-memcache=0
# use-machine-creds=1
# use-gss-proxy=0
# avoid-dns=1
# limit-to-legacy-enctypes=0
# context-timeout=0
# rpc-timeout=5
# keytab-file=/etc/krb5.keytab
# cred-cache-directory=
# preferred-realm=
# set-home=1
# upcall-timeout=30
# cancel-timed-out-upcalls=0
#
[lockd]
# port=0
# udp-port=0
#
[exportd]
# debug="all|auth|call|general|parse"
# manage-gids=n
# state-directory-path=/var/lib/nfs
# threads=1
# cache-use-ipaddr=n
# ttl=1800
[mountd]
# debug="all|auth|call|general|parse"
manage-gids=y
# descriptors=0
# port=0
# threads=1
# reverse-lookup=n
# state-directory-path=/var/lib/nfs
# ha-callout=
# cache-use-ipaddr=n
# ttl=1800
#
[nfsdcld]
# debug=0
# storagedir=/var/lib/nfs/nfsdcld
#
[nfsdcltrack]
# debug=0
# storagedir=/var/lib/nfs/nfsdcltrack
#
[nfsd]
# debug=0
threads=16
# host=
# port=0
# grace-time=90
# lease-time=90
udp=y
# tcp=y
# vers3=y
# vers4=y
# vers4.0=y
# vers4.1=y
# vers4.2=y
rdma=y
rdma-port=20049

[statd]
# debug=0
# port=0
# outgoing-port=0
# name=
# state-directory-path=/var/lib/nfs/statd
# ha-callout=
# no-notify=0
#
[sm-notify]
# debug=0
# force=0
# retry-time=900
# outgoing-port=
# outgoing-addr=
# lift-grace=y
#
[svcgssd]
# principal=
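
With rdma=y and rdma-port=20049 set as above, one way to confirm that nfsd really opened the RDMA listener is to read /proc/fs/nfsd/portlist on the server. The sketch below assumes that file lists one "transport port" pair per line (for example "rdma 20049"); the helper name rdma_listener_present is hypothetical.

```shell
# Succeed if portlist-style input on stdin contains an rdma listener on port $1.
# Assumption: each input line has the form "<transport> <port>".
rdma_listener_present() {
    awk -v port="$1" '$1 == "rdma" && $2 == port { found = 1 } END { exit !found }'
}

# Example call on the management node (not executed here):
#   rdma_listener_present 20049 < /proc/fs/nfsd/portlist && echo "RDMA listener is up"
```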
The mount command is:

Code: Select all

mount -o rdma,port=20049 mgt-ib:/share /share
cat /etc/fstab on g02 shows:

Code: Select all

mgt-ib:/share           /share          nfs4            rw,sync,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=172.16.7.2,local_lock=none,addr=172.16.7.200   0 0
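
Both the mount command and the fstab entry request RDMA, so it can be worth checking which options the kernel actually applied. A small sketch, assuming findmnt from util-linux is available on the client; the helper name mount_opt is hypothetical.

```shell
# Print the value of one option from a comma-separated mount option string.
# $1: option string, $2: option name. Fails if the option is absent.
# Flag-style options without "=" (such as "hard") print an empty line.
mount_opt() (
    printf '%s\n' "$1" | tr ',' '\n' |
        awk -F= -v key="$2" '$1 == key { print $2; found = 1 } END { exit !found }'
)

# Example call on g02 (not executed here):
#   mount_opt "$(findmnt -no OPTIONS /share)" proto
```

If the last command prints anything other than rdma, the mount is not using the RDMA transport despite the requested options.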
cat /etc/exports on the management node shows:

Code: Select all

/share 172.16.7.0/24(rw,sync,no_subtree_check,insecure,no_root_squash)
nfsstat -s on the management node shows:

Code: Select all

Server rpc stats:
calls      badcalls   badfmt     badauth    badclnt
340861724   0          0          0          0

Server nfs v4:
null             compound
27        0%     340876261 99%

Server nfs v4 operations:
op0-unused       op1-unused       op2-future       access           close
0         0%     0         0%     0         0%     94072660  7%     93096428  6%
commit           create           delegpurge       delegreturn      getattr
8524      0%     1382      0%     0         0%     92195288  6%     232114827 17%
getfh            link             lock             lockt            locku
21463569  1%     0         0%     1452      0%     0         0%     947       0%
lookup           lookup_root      nverify          open             openattr
11986903  0%     0         0%     0         0%     93272170  6%     0         0%
open_conf        open_dgrd        putfh            putpubfh         putrootfh
0         0%     34        0%     338905383 25%     0         0%     35        0%
read             readdir          readlink         remove           rename
11054156  0%     692804    0%     74575     0%     9568      0%     1600      0%
renew            restorefh        savefh           secinfo          setattr
0         0%     0         0%     1848      0%     0         0%     293884    0%
setcltid         setcltidconf     verify           write            rellockowner
0         0%     0         0%     0         0%     10738467  0%     0         0%
bc_ctl           bind_conn        exchange_id      create_ses       destroy_ses
0         0%     4         0%     56        0%     36        0%     22        0%
free_stateid     getdirdeleg      getdevinfo       getdevlist       layoutcommit
594       0%     0         0%     0         0%     0         0%     0         0%
layoutget        layoutreturn     secinfononam     sequence         set_ssv
0         0%     0         0%     1         0%     341016756 25%     0         0%
test_stateid     want_deleg       destroy_clid     reclaim_comp     allocate
2102813   0%     0         0%     15        0%     29        0%     0         0%
copy             copy_notify      deallocate       ioadvise         layouterror
247       0%     0         0%     0         0%     0         0%     0         0%
layoutstats      offloadcancel    offloadstatus    readplus         seek
0         0%     0         0%     0         0%     0         0%     162       0%
write_same
0         0%
and nfsstat -c on g02 (after the remount) shows:

Code: Select all

Client rpc stats:
calls      retrans    authrefrsh
89832614   0          89830228

Client nfs v4:
null             read             write            commit           open
5         0%     3607936   4%     381364    0%     6036      0%     3380554   3%
open_conf        open_noat        open_dgrd        close            setattr
0         0%     24548294 27%     0         0%     27912591 31%     153       0%
fsinfo           renew            setclntid        confirm          lock
12        0%     0         0%     0         0%     0         0%     16        0%
lockt            locku            access           getattr          lookup
0         0%     15        0%     28497     0%     175939    0%     2094388   2%
lookup_root      remove           rename           link             symlink
4         0%     726       0%     102       0%     0         0%     0         0%
create           pathconf         statfs           readlink         readdir
53        0%     8         0%     0         0%     121       0%     3325      0%
server_caps      delegreturn      getacl           setacl           fs_locations
20        0%     27658681 30%     0         0%     0         0%     0         0%
rel_lkowner      secinfo          fsid_present     exchange_id      create_session
0         0%     0         0%     0         0%     9         0%     6         0%
destroy_session  sequence         get_lease_time   reclaim_comp     layoutget
4         0%     615       0%     1         0%     5         0%     0         0%
getdevinfo       layoutcommit     layoutreturn     secinfo_no       test_stateid
0         0%     0         0%     0         0%     0         0%     34034     0%
free_stateid     getdevicelist    bind_conn_to_ses destroy_clientid seek
18        0%     0         0%     0         0%     3         0%     0         0%
allocate         deallocate       layoutstats      clone
0         0%     0         0%     0         0%     0         0%
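
The nfsstat tables are easier to compare when sorted by call count. The sketch below assumes the layout shown above, where a row of operation names is followed by a row of "count percent%" pairs; the helper name top_nfs_ops is hypothetical.

```shell
# List the busiest operations from nfsstat output read on stdin.
# $1: number of rows to show (default 5).
top_nfs_ops() {
    awk '
        # A row containing "%" holds the counts for the names on the previous row.
        /%/ {
            n = split(prev, names, " ")
            for (i = 1; i <= n; i++) print $(2 * i - 1), names[i]
            next
        }
        { prev = $0 }
    ' | sort -rn | head -n "${1:-5}"
}

# Example call on a client (not executed here):
#   nfsstat -c | top_nfs_ops 3
```

Applied to the client table above, the three busiest operations are close (27912591), delegreturn (27658681) and open_noat (24548294).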
I installed only the following packages to enable InfiniBand: rdma-core, infiniband-diags, ibutils, opensm. All nodes run the same system:

Code: Select all

Linux version 6.1.0-26-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30)
How can I solve this? Thank you.

arzgi
Posts: 1777
Joined: 2008-02-21 17:03
Location: Finland
Has thanked: 1 time
Been thanked: 103 times

Re: [Software] NFS client mounted over InfiniBand freezes periodically

#2 Post by arzgi »

Hello and welcome!

I used NFS only once, over two decades ago, so I don't remember much about it. The Debian Wiki is a good resource for any Debian user, and it has a page on this topic: https://wiki.debian.org/NFSServerSetup.
