I have a cluster of 8 machines: 7 compute nodes (g0[1-7]) and 1 management node (mgt). The management node has a shared directory, /share, which is mounted on all compute nodes over InfiniBand with RDMA. Some clients freeze at random, so I enabled NFS client debug logging with the following commands:
rpcdebug -m rpc -s all
rpcdebug -m nfs -s all
The client log on g02 then shows:
...
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0007 highest_used=2 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=000f highest_used=3 slotid=3
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 3 highest_used_slotid 2
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: NFS call test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=9813091 slotid=2 max_slotid=2 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0007 highest_used=2 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=000f highest_used=3 slotid=3
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 3 highest_used_slotid 2
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: nfs41_handle_recallable_state_revoked: Recallable state revoked on server mgt-ib!
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: NFS reply lookup: 0
Dec 31 12:37:29 g02 kernel: NFS: nfs_update_inode(0:44/118169504428 fh_crc=0x51ea6285 ct=1 info=0x427e7f)
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/146245935125), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/146245935125), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/118169504428), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: open file(results/H2O_run-123_train.txt)
Dec 31 12:37:29 g02 kernel: NFS call test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=23814709 slotid=1 max_slotid=1 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0003 highest_used=1 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0007 highest_used=2 slotid=2
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 2 highest_used_slotid 1
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: NFS reply test_stateid: succeeded, 0
Dec 31 12:37:29 g02 kernel: nfs41_handle_recallable_state_revoked: Recallable state revoked on server mgt-ib!
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 1 highest_used_slotid 0
Dec 31 12:37:29 g02 kernel: nfs41_sequence_process: Error 0 free the slot
Dec 31 12:37:29 g02 kernel: nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
Dec 31 12:37:29 g02 kernel: NFS reply lookup: -2
Dec 31 12:37:29 g02 kernel: NFS: dentry_delete(intel64_lin/x86_64, 80c)
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/512), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/515), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/120259161769), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/113817081017), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/115970221607), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/36507614869), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/38655871530), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/40808747028), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/512), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/515), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/120259161769), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/113817081017), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/115970221607), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/36507614869), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/38655871530), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: permission(0:44/40808747028), mask=0x81, res=0
Dec 31 12:37:29 g02 kernel: NFS: atomic_open(0:44/40808747028), libc.so.6
Dec 31 12:37:29 g02 kernel: NFS call test_stateid 0000000084ebf388
Dec 31 12:37:29 g02 kernel: --> nfs41_call_sync_prepare data->seq_server 00000000d7156190
Dec 31 12:37:29 g02 kernel: --> nfs4_alloc_slot used_slots=0000 highest_used=4294967295 max_slots=30
Dec 31 12:37:29 g02 kernel: <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
Dec 31 12:37:29 g02 kernel: encode_sequence: sessionid=1735361691:246077031:23:0 seqid=53210473 slotid=0 max_slotid=0 cache_this=0
Dec 31 12:37:29 g02 kernel: nfs41_handle_sequence_flag_errors: "mgt-ib" (client ID 9b846f6767d6aa0e) flags=0x00000040
...
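For anyone reading the log above, the flags value and the slot bitmaps can be decoded by hand. The following is a minimal Python sketch, not part of the original setup. The flag names follow the SEQ4_STATUS_* bits defined for the SEQUENCE operation in RFC 8881; the helper names `decode_seq_flags` and `used_slot_ids` are made up for illustration, and the sketch assumes that `used_slots` is printed as a hexadecimal bitmap, which matches how the values change in the log.

```python
# SEQ4_STATUS_* bits returned in an NFSv4.1 SEQUENCE reply (RFC 8881).
SEQ4_STATUS_FLAGS = {
    0x0001: "CB_PATH_DOWN",
    0x0002: "CB_GSS_CONTEXTS_EXPIRING",
    0x0004: "CB_GSS_CONTEXTS_EXPIRED",
    0x0008: "EXPIRED_ALL_STATE_REVOKED",
    0x0010: "EXPIRED_SOME_STATE_REVOKED",
    0x0020: "ADMIN_STATE_REVOKED",
    0x0040: "RECALLABLE_STATE_REVOKED",
    0x0080: "LEASE_MOVED",
    0x0100: "RESTART_RECLAIM_NEEDED",
    0x0200: "CB_PATH_DOWN_SESSION",
    0x0400: "BACKCHANNEL_FAULT",
    0x0800: "DEVID_CHANGED",
    0x1000: "DEVID_DELETED",
}


def decode_seq_flags(flags):
    """Return the names of all status bits set in a SEQUENCE reply."""
    return [name for bit, name in SEQ4_STATUS_FLAGS.items() if flags & bit]


def used_slot_ids(bitmap_hex):
    """Return the slot numbers set in a used_slots=... value (assumed hex)."""
    bitmap = int(bitmap_hex, 16)
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Values taken from the log excerpt.
print(decode_seq_flags(0x00000040))
print(used_slot_ids("000f"))
print(hex(4294967295))
```

With the values from the log, flags=0x00000040 decodes to RECALLABLE_STATE_REVOKED, which agrees with the "Recallable state revoked on server mgt-ib!" lines. The value highest_used=4294967295 is 0xffffffff, an unsigned -1, which presumably just means that no slot is in use at that moment.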
Restarting network with
systemctl restart networking
Here is my nfs.conf on the management node:
#
# This is a general configuration for the
# NFS daemons and tools
#
[general]
pipefs-directory=/run/rpc_pipefs
#
[nfsrahead]
# nfs=15000
# nfs4=16000
#
[exports]
# rootdir=/export
#
[exportfs]
# debug=0
#
[gssd]
# verbosity=0
# rpc-verbosity=0
# use-memcache=0
# use-machine-creds=1
# use-gss-proxy=0
# avoid-dns=1
# limit-to-legacy-enctypes=0
# context-timeout=0
# rpc-timeout=5
# keytab-file=/etc/krb5.keytab
# cred-cache-directory=
# preferred-realm=
# set-home=1
# upcall-timeout=30
# cancel-timed-out-upcalls=0
#
[lockd]
# port=0
# udp-port=0
#
[exportd]
# debug="all|auth|call|general|parse"
# manage-gids=n
# state-directory-path=/var/lib/nfs
# threads=1
# cache-use-ipaddr=n
# ttl=1800
[mountd]
# debug="all|auth|call|general|parse"
manage-gids=y
# descriptors=0
# port=0
# threads=1
# reverse-lookup=n
# state-directory-path=/var/lib/nfs
# ha-callout=
# cache-use-ipaddr=n
# ttl=1800
#
[nfsdcld]
# debug=0
# storagedir=/var/lib/nfs/nfsdcld
#
[nfsdcltrack]
# debug=0
# storagedir=/var/lib/nfs/nfsdcltrack
#
[nfsd]
# debug=0
threads=16
# host=
# port=0
# grace-time=90
# lease-time=90
udp=y
# tcp=y
# vers3=y
# vers4=y
# vers4.0=y
# vers4.1=y
# vers4.2=y
rdma=y
rdma-port=20049
[statd]
# debug=0
# port=0
# outgoing-port=0
# name=
# state-directory-path=/var/lib/nfs/statd
# ha-callout=
# no-notify=0
#
[sm-notify]
# debug=0
# force=0
# retry-time=900
# outgoing-port=
# outgoing-addr=
# lift-grace=y
#
[svcgssd]
# principal=
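Since most of the file above is commented-out defaults, it can help to list only the settings that are actually active. The sketch below is one way to do that with Python's `configparser`, under the assumption that this particular file is close enough to INI syntax to be parsed that way (nfs.conf is not guaranteed to be INI-compatible in general). `NFS_CONF_SAMPLE` is a shortened copy of the file above and `active_settings` is a hypothetical helper name.

```python
import configparser

# Shortened copy of the nfs.conf shown above (comments kept on purpose).
NFS_CONF_SAMPLE = """\
[general]
pipefs-directory=/run/rpc_pipefs
[gssd]
# verbosity=0
[mountd]
# debug="all|auth|call|general|parse"
manage-gids=y
# threads=1
[nfsd]
# debug=0
threads=16
# lease-time=90
udp=y
# tcp=y
rdma=y
rdma-port=20049
"""


def active_settings(text):
    """Return {section: {key: value}} for sections with uncommented keys."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {
        section: dict(parser.items(section))
        for section in parser.sections()
        if parser.items(section)
    }


for section, values in active_settings(NFS_CONF_SAMPLE).items():
    print(section, values)
```

For the file in this post, the only non-default settings this would report are pipefs-directory in [general], manage-gids=y in [mountd], and threads=16, udp=y, rdma=y and rdma-port=20049 in [nfsd].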
The mount command on the compute nodes:
mount -o rdma,port=20049 mgt-ib:/share /share
The resulting mount entry on a compute node:
mgt-ib:/share /share nfs4 rw,sync,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=172.16.7.2,local_lock=none,addr=172.16.7.200 0 0
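To check which options the client actually negotiated, the mount entry above can be split into its fields. This is a small Python sketch, assuming the entry has the six whitespace-separated fields shown; `parse_mount_entry` is a made-up helper name. The conversion of timeo relies on the nfs(5) definition of timeo as tenths of a second.

```python
# The mount entry from the post, split over several string literals.
MOUNT_ENTRY = (
    "mgt-ib:/share /share nfs4 rw,sync,vers=4.2,rsize=1048576,wsize=1048576,"
    "namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,"
    "clientaddr=172.16.7.2,local_lock=none,addr=172.16.7.200 0 0"
)


def parse_mount_entry(line):
    """Split a mount entry into device, mount point, type and an options dict."""
    device, mountpoint, fstype, options, _dump, _passno = line.split()
    parsed = {}
    for item in options.split(","):
        key, sep, value = item.partition("=")
        # Flags such as "hard" or "rw" carry no value; store True for them.
        parsed[key] = value if sep else True
    return device, mountpoint, fstype, parsed


device, mountpoint, fstype, opts = parse_mount_entry(MOUNT_ENTRY)
print(fstype, opts["vers"], opts["proto"], opts["port"])
print("timeo in seconds:", int(opts["timeo"]) / 10)
```

For this entry the sketch confirms an NFSv4.2 mount over proto=rdma on port 20049, mounted hard, with a 60 second timeout and retrans=2.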
The export entry on the management node:
/share 172.16.7.0/24(rw,sync,no_subtree_check,insecure,no_root_squash)
nfsstat output on the management node:
Server rpc stats:
calls       badcalls    badfmt      badauth     badclnt
340861724   0           0           0           0

Server nfs v4:
null             compound
27          0%   340876261  99%

Server nfs v4 operations:
op0-unused       op1-unused       op2-future       access           close
0           0%   0           0%   0           0%   94072660    7%   93096428    6%
commit           create           delegpurge       delegreturn      getattr
8524        0%   1382        0%   0           0%   92195288    6%   232114827  17%
getfh            link             lock             lockt            locku
21463569    1%   0           0%   1452        0%   0           0%   947         0%
lookup           lookup_root      nverify          open             openattr
11986903    0%   0           0%   0           0%   93272170    6%   0           0%
open_conf        open_dgrd        putfh            putpubfh         putrootfh
0           0%   34          0%   338905383  25%   0           0%   35          0%
read             readdir          readlink         remove           rename
11054156    0%   692804      0%   74575       0%   9568        0%   1600        0%
renew            restorefh        savefh           secinfo          setattr
0           0%   0           0%   1848        0%   0           0%   293884      0%
setcltid         setcltidconf     verify           write            rellockowner
0           0%   0           0%   0           0%   10738467    0%   0           0%
bc_ctl           bind_conn        exchange_id      create_ses       destroy_ses
0           0%   4           0%   56          0%   36          0%   22          0%
free_stateid     getdirdeleg      getdevinfo       getdevlist       layoutcommit
594         0%   0           0%   0           0%   0           0%   0           0%
layoutget        layoutreturn     secinfononam     sequence         set_ssv
0           0%   0           0%   1           0%   341016756  25%   0           0%
test_stateid     want_deleg       destroy_clid     reclaim_comp     allocate
2102813     0%   0           0%   15          0%   29          0%   0           0%
copy             copy_notify      deallocate       ioadvise         layouterror
247         0%   0           0%   0           0%   0           0%   0           0%
layoutstats      offloadcancel    offloadstatus    readplus         seek
0           0%   0           0%   0           0%   0           0%   162         0%
write_same
0           0%
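The server counters above are easier to compare once they are in a dictionary. The following Python sketch parses the alternating name/value lines of the operations table; `NFSSTAT_SAMPLE` is a hand-copied subset of the output above and `parse_op_counts` is a hypothetical helper, so treat it as an illustration rather than a general nfsstat parser.

```python
import re

# A few name/value line pairs copied from the server output above.
NFSSTAT_SAMPLE = """\
op0-unused op1-unused op2-future access close
0 0% 0 0% 0 0% 94072660 7% 93096428 6%
commit create delegpurge delegreturn getattr
8524 0% 1382 0% 0 0% 92195288 6% 232114827 17%
lookup lookup_root nverify open openattr
11986903 0% 0 0% 0 0% 93272170 6% 0 0%
test_stateid want_deleg destroy_clid reclaim_comp allocate
2102813 0% 0 0% 15 0% 29 0% 0 0%
"""


def parse_op_counts(text):
    """Map each operation name to its call count."""
    counts = {}
    names = []
    for line in text.splitlines():
        # Value lines consist of "<count> <percent>%" pairs.
        values = re.findall(r"(\d+)\s+\d+%", line)
        if values:
            counts.update(zip(names, map(int, values)))
        else:
            names = line.split()
    return counts


counts = parse_op_counts(NFSSTAT_SAMPLE)
ratio = counts["delegreturn"] / counts["open"]
print(f"delegreturn per open: {ratio:.3f}")
print("test_stateid calls:", counts["test_stateid"])
```

On these numbers the server has handled almost as many DELEGRETURN operations as OPEN operations (a ratio of about 0.99). That is only an observation from the counters, not a diagnosis, but it may be relevant next to the "Recallable state revoked" messages in the client log, since delegations are recallable state.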
nfsstat output on a compute node:
Client rpc stats:
calls       retrans     authrefrsh
89832614    0           89830228

Client nfs v4:
null             read             write            commit           open
5           0%   3607936     4%   381364      0%   6036        0%   3380554     3%
open_conf        open_noat        open_dgrd        close            setattr
0           0%   24548294   27%   0           0%   27912591   31%   153         0%
fsinfo           renew            setclntid        confirm          lock
12          0%   0           0%   0           0%   0           0%   16          0%
lockt            locku            access           getattr          lookup
0           0%   15          0%   28497       0%   175939      0%   2094388     2%
lookup_root      remove           rename           link             symlink
4           0%   726         0%   102         0%   0           0%   0           0%
create           pathconf         statfs           readlink         readdir
53          0%   8           0%   0           0%   121         0%   3325        0%
server_caps      delegreturn      getacl           setacl           fs_locations
20          0%   27658681   30%   0           0%   0           0%   0           0%
rel_lkowner      secinfo          fsid_present     exchange_id      create_session
0           0%   0           0%   0           0%   9           0%   6           0%
destroy_session  sequence         get_lease_time   reclaim_comp     layoutget
4           0%   615         0%   1           0%   5           0%   0           0%
getdevinfo       layoutcommit     layoutreturn     secinfo_no       test_stateid
0           0%   0           0%   0           0%   0           0%   34034       0%
free_stateid     getdevicelist    bind_conn_to_ses destroy_clientid seek
18          0%   0           0%   0           0%   3           0%   0           0%
allocate         deallocate       layoutstats      clone
0           0%   0           0%   0           0%   0           0%
Kernel version:
Linux version 6.1.0-26-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30)