Random Timeouts (intermittent connectivity)

Post by System_Error_Message » 2020-09-09 09:11

Hi, I've been diagnosing this issue for days now. I've been using Debian on my file server for a month, and since the start I've had intermittent connectivity issues. I've gone through my network and run as many tests as possible, and the network and NIC seem entirely fine.

I tried pinging around for a day and here are the results:

Pinging a network device: 0 packet loss; 2% packet loss for wifi APs (from ethernet).
Attaching a 2nd cable to the Debian server and pinging both IPs: 7% packet loss, at the same time for both NICs (ruling out an issue with the NIC).
Pinging the Debian server from the router: same result, but the timeouts don't happen at the same time as from the laptop via ethernet.
Checked switches and routers for network errors: there were a few, but they had no correlation to the timeouts (a timeout did not increase the error stats).
Pinging out from the Debian server: no packet loss.
Pinging the Debian server from the attached switch: all packets from the switch lost.
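
To make the "at the same time" comparison less of an eyeball job, two ping logs can be lined up by timestamp. A rough Python sketch, assuming both pings were run with iputils `ping -D` (which prefixes each line with an epoch timestamp) at the default one-second interval. The names `lost_times` and `shared_fraction` are made up for illustration.

```python
import re

# Reply lines from `ping -D` look like:
# [1599642660.123456] 64 bytes from 192.168.1.10: icmp_seq=7 ttl=64 time=0.31 ms
REPLY = re.compile(r"^\[(\d+)\.\d+\].*icmp_seq=(\d+)")


def lost_times(log_text, interval=1.0):
    """Estimate the epoch seconds at which probes went unanswered.

    A gap in icmp_seq between two consecutive replies is mapped back to
    approximate send times, assuming one probe every `interval` seconds.
    Gaps across an icmp_seq wraparound (after 65535) are not detected.
    """
    replies = []
    for line in log_text.splitlines():
        match = REPLY.match(line)
        if match:
            replies.append((int(match.group(1)), int(match.group(2))))

    lost = set()
    for (t1, s1), (_t2, s2) in zip(replies, replies[1:]):
        for seq in range(s1 + 1, s2):
            lost.add(round(t1 + (seq - s1) * interval))
    return lost


def shared_fraction(lost_a, lost_b, tolerance=1):
    """Fraction of losses in A that have a loss in B within `tolerance` seconds."""
    if not lost_a:
        return 0.0
    hits = sum(1 for t in lost_a if any(abs(t - u) <= tolerance for u in lost_b))
    return hits / len(lost_a)
```

A shared fraction near 1.0 for two NICs on the same box points away from a single NIC, while a low value between two vantage points (router vs. laptop) suggests the drops are not one global event.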

I'm at a loss here and the intermittent issues really bug me. When I'm connected via SSH I get pauses before anything happens. If I watch a video it gets stuck from time to time, for one to a few seconds each time, which is very annoying. It's worse when I'm transferring files, because it causes pauses there as well, and the main purpose of having a fast file server is that it is faster than transferring to external media for backing up and distributing files.

I get less packet loss from the internet than this. Before this I had openSUSE running, but I had to switch because I needed to install applications that weren't made to run on openSUSE (I don't know why software would be made for CentOS but not openSUSE). There were no issues or such timeouts running openSUSE.

Post by System_Error_Message » 2020-09-30 06:45

bump

This issue is really really annoying. It causes disruptions during file transfers, causes huge initial web page load delays, and gets annoying when entering commands and executing them in SSH, and in other applications too.

Post by sickpig » 2020-09-30 08:49

System_Error_Message wrote:but there were no issues or such timeouts running opensuse.

All other things being equal, it could be a firmware issue.
Check what firmware was installed for your NIC in openSUSE and see whether your Debian installation has a different one.

Post by System_Error_Message » 2020-10-09 02:35

I tried firmware updates, but the only things it detected were the motherboard and the SATA chip. The NIC is an extension datacenter card. It did reduce the timeouts to between 1-3%, but that's still significant. I'm finding the same problem on other servers: public web servers are timing out at a rate of 1% in the same way, and wifi routers with updated firmware are also timing out at the same rate. I still sometimes get 5 timeouts all at once. Before, it was timing out at a rate of 7%.

Something is very wrong here, and I can't tolerate any timeouts with this file server, as it's not just about throughput: it's also operating as a web cache since the drives peak at 1GB/s, which significantly helps with load times, and it's good for caching updates. Imagine if your router timed out occasionally as well. My router is pretty much proprietary rather than one of the usual Linux-based routers, so it doesn't suffer from the same problem and is very responsive.

Post by LE_746F6D617A7A69 » 2020-10-09 12:48

@System_Error_Message
You haven't provided any information about Your HW (like mobo, cpu, NIC model), network layout/setup, router/switch type used, and no logs.
Your textual description of a problem is useless for analysis of the issue.
Bill Gates: "(...) In my case, I went to the garbage cans at the Computer Science Center and I fished out listings of their operating system."
The_full_story and Nothing_have_changed

Post by System_Error_Message » 2020-10-11 04:34

LE_746F6D617A7A69 wrote:@System_Error_Message
You haven't provided any information about Your HW (like mobo, cpu, NIC model), network layout/setup, router/switch type used, and no logs.
Your textual description of a problem is useless for analysis of the issue.

Let me know what logs you need and I'll gather them. As for the hardware, it's an older AMD Piledriver FX-8320 on a socket AM3+ Gigabyte 9xx-series board with 2x8GB of unbuffered ECC RAM (running memtest shows ECC). I'm using 1 drive for boot/OS/software and 5 other drives in software RAID 5, plus a Mellanox SFP+ card and the onboard NIC.

I used to get an error about the onboard NIC firmware not being found, but running fwupd helped halve the timeouts and got the NIC detected. That was after I accidentally changed the permissions for / and managed to fix most things. However, this problem was happening even before that.

On other systems: I have a TP-Link AC1200 wifi router showing the same symptom, random timeouts (at most 1%) when pinging, but I haven't checked whether it affects network forwarding, since only 1 out of 3 is showing such symptoms (the most updated one).

On another server I get the same random timeouts (<1%), and on a public shared server as well. Both run Xeons, SSDs, CentOS 7 and the usual hosting stuff, and both show the same timeout frequency: very good ping, then suddenly a timeout. On the rented dedicated server, DirectAdmin keeps showing connection issues, but this is a totally different system.

So trying to update the firmware did mostly solve the problem, and I am currently testing the long-term stability of my file server by slowly transferring files over VPN while pinging it. I'll also do a large file transfer on the LAN and see if it dips randomly.
In my network I'm using a MikroTik SFP+ switch which connects to a CCR1036 with SFP+ ports. No problems here, as both are very responsive and fast at any related task. If there is a timeout but other functions aren't affected, then I guess this is fine since ICMP usually takes a back seat; however, it does affect bufferbloat scores.

The timeouts seem to be a problem with every newer Linux system, but I haven't tested openSUSE yet as I've been going through reinstalls.

Post by Head_on_a_Stick » 2020-10-11 14:39

System_Error_Message wrote:onboard NIC

^ This is not enough information. Use the -nn switch for the 'lspci' command, that will show the vendor & product IDs that can identify the device unambiguously (read the man page for more on this).

You could probably fix this yourself by entering the card ID into a search engine; if you do that, then please be sure to explain your eventual method here to help others with that card.

System_Error_Message wrote:i accidentally changed the permissions for /

Erm, that sounds bad. You should probably restore the system from the backup you made before messing up your permissions. It's the only way to be sure.
Black Lives Matter

Debian buster-backports ISO image: for new hardware support

Post by LE_746F6D617A7A69 » 2020-10-11 15:26

I have read all Your posts again, trying to filter out some useful information, but it looks like a complete mess to me.
However, there are a few significant facts which You mentioned:
System_Error_Message wrote: attaching a 2nd cable from the debian server and pinging both ips, 7% packet loss at the same time for both NICs (ruling out issue with NIC)
System_Error_Message wrote: I get less packet loss from internet than this.
System_Error_Message wrote: On another server i get the same random timeouts (<1%) and on a public shared server as well, both showing the same timeout frequency where you have very good ping, then suddenly a timeout which run xeons, SSDs centOS 7
The central part of all those issues is Your switch (it can't be a problem with the ISP or the NICs, nor with the OS)

And this:
System_Error_Message wrote:Pinging out from the debian server, no packet loss.
pinging the debian server from the attached switch, all packets from the switch lost.
Unless You have blocked ICMP traffic on Debian, it means that the firmware on the switch has a bug.

Logs: I don't think that You'll find anything useful, because the problem does not seem to be related to the Debian system. ICMP ping is completely useless for diagnosing network problems - Wireshark is the right tool for that.
In any case, however, it's good to check the kernel logs for issues related to network interfaces, driver crashes, etc.

-----------------------------
System_Error_Message wrote:Imagine if your router timed out occasionally as well. My router is pretty much proprietary rather than the usual linux based routers so doesnt suffer from the same problem and is very responsive.
Yes, I can imagine that - that's why the first thing I do after buying a new router is replace that shitty proprietary firmware with Open Source, Linux-based solutions.
Thanks to this, I'm rebooting my routers only after a firmware upgrade, which means that uptimes range from months to years ... ;)

System_Error_Message wrote:The problem of timeouts seem to be a problem with every newer linux system
You have said that You have the same problem with CentOS 7. Besides, the whole internet runs on Linux and nobody has reported such problems AFAIK ;)

Post by System_Error_Message » 2020-10-14 04:20

LE_746F6D617A7A69 wrote:I have read all Your posts again, trying to filter out some useful information, but it looks like a complete mess for me.
However, there are few significant facts which You mentioned:
System_Error_Message wrote:attaching a 2nd cable from the debian server and pinging both ips, 7% packet loss at the same time for both NICs (ruling out issue with NIC)
System_Error_Message wrote:I get less packet loss from internet than this.
System_Error_Message wrote:On another server i get the same random timeouts (<1%) and on a public shared server as well, both showing the same timeout frequency where you have very good ping, then suddenly a timeout which run xeons, SSDs centOS 7
The central part of all those issues is Your switch (it can't be a problem with ISP, NICs and not with the OS)

And this:
System_Error_Message wrote:Pinging out from the debian server, no packet loss.
pinging the debian server from the attached switch, all packets from the switch lost.
Unless You have blocked ICMP traffic on Debian, it means that the firmware on the switch has a bug.

If you did read, I mentioned that I did the firmware update as suggested and it fixed most of the issues, reducing packet drops enough that I was able to reliably transfer large files over VPN remotely. I would say the problem is mostly solved. What's left is an itty tiny bit of figuring out why there are still some packet drops, which isn't a problem with the network, since I was able to reliably transfer files with no connection issues over both FTP and Samba. SSH would survive 5 timeouts in a row, but the other 2 would not: they would dip in speed or reconnect. Before posting here I ruled out other problems.

I am now able to ping the file server without issue from the switch it is attached to, which previously couldn't see it even though other network devices could (other switches and routers could communicate with the server despite not being directly connected). I also didn't block ICMP traffic. Prioritising it, though, lets you cheat ratings on internet speed tests for things like bufferbloat. I had a hard time explaining to others why Asus router QoS rated poorly on bufferbloat despite their auto QoS functioning well: ICMP can be deprioritised safely as a QoS measure.
LE_746F6D617A7A69 wrote:Logs: I don't think that You'll find anything useful, because the problem does not seem to be related to Debian system. ICMP ping is completely useless for diagnostic of network problems - Wireshark is the right tool for that.
In any case however, it's good to check the kernel logs for issues related to network interfaces, driver crashes, etc.

ICMP is what you start with when you diagnose network problems. The issue with using Wireshark is that the server has so much traffic that the capture buffer would quickly fill up, so I need a filter to apply. Sure, I know what you mean in terms of looking at the packets at a lower level.

-----------------------------
System_Error_Message wrote:Imagine if your router timed out occasionally as well. My router is pretty much proprietary rather than the usual linux based routers so doesnt suffer from the same problem and is very responsive.
Yes, I can imagine that - that's why the first thing I do after buying a new router is a replacement of that shitty proprietary firmware with Open Source, Linux-based solutions.
Thanks to this, I'm rebooting my routers only after firmware upgrade, what means that uptimes are ranging from months to years ... ;)

System_Error_Message wrote:The problem of timeouts seem to be a problem with every newer linux system
You have said that You have the same problem with CentOS 7. Besides, the whole internet runs on Linux and nobody have reported such problems AFAIN ;)
I like open source as well, but a lot of TP-Links use OpenWrt as their base and lock it down so you can't simply flash it or use it like any other open source system. Open source is great if you intend to go into the source code, but it has no advantage over proprietary when you're just using it. Sure, it means someone else can fix or spot an issue, but that doesn't help for brands like TP-Link that have many hardware revisions and wifi chips with compatibility issues that require reboots every so often. And of course I pretty much hate that my MikroTik router won't run other things, because with an all-in-one box doing other network-related tasks, having a faster router CPU is a nice thing, as is not having to rely on hardware acceleration. There is absolutely nothing wrong with my network; I only need to figure out why internet traffic seems slow for some, because it does drop quite a bit of bad traffic.

The internet is having a problem with this, but only for a few of us is the problem really bad; for everyone else it is just mild:
https://github.com/kubernetes/kops/issues/8224
https://tech.xing.com/a-reason-for-unex ... d041cf7e02
A 1% timeout rate usually goes unnoticed, but you can test for yourself by running ping for a whole day.
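
A day-long ping produces too much output to read by eye. A minimal Python sketch for summarising it, assuming the usual iputils reply lines (`64 bytes from ...: icmp_seq=N ...`) saved to a file. `summarize_ping` is a hypothetical helper name, and since icmp_seq wraps at 65536, a full day at one probe per second would need to be split into chunks first.

```python
import re

SEQ = re.compile(r"icmp_seq=(\d+)")


def summarize_ping(ping_output):
    """Summarise loss from saved ping output using gaps in icmp_seq."""
    # Only real replies count: "Destination Host Unreachable" lines also
    # carry an icmp_seq but do not contain "bytes from".
    seqs = set()
    for line in ping_output.splitlines():
        match = SEQ.search(line)
        if match and "bytes from" in line:
            seqs.add(int(match.group(1)))
    if not seqs:
        return {"probes": 0, "lost": 0, "loss_pct": 0.0, "longest_burst": 0}

    ordered = sorted(seqs)
    # Probes between the first and last reply seen; losses before the first
    # or after the last reply are invisible to this method.
    probes = ordered[-1] - ordered[0] + 1
    lost = probes - len(ordered)
    # Longest run of consecutive missing sequence numbers.
    longest = max((b - a - 1 for a, b in zip(ordered, ordered[1:])), default=0)
    return {
        "probes": probes,
        "lost": lost,
        "loss_pct": 100.0 * lost / probes,
        "longest_burst": longest,
    }
```

The `longest_burst` figure is the one to watch for the "5 timeouts all at once" pattern, since an evenly spread 1% and a 1% made of bursts feel very different in SSH.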

Head_on_a_Stick wrote:
System_Error_Message wrote:onboard NIC

^ This is not enough information. Use the -nn switch for the 'lspci' command, that will show the vendor & product IDs that can identify the device unambiguously (read the man page for more on this).

You could probably fix this yourself by entering the card ID into a search engine, if you do that then please be sure to explain your eventual method here to help others with that card.

System_Error_Message wrote:i accidentally changed the permissions for /

Erm, that sounds bad. You should probably restore the system from the backup you made before messing up your permissions. It's the only way to be sure.

I had a parallel install in a VM that I got the permissions from, so I did a scripted permission restore with a package reinstall and then just changed the remaining 777s to 755s.
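
The last step (hunting down leftover 777s) could look roughly like the sketch below. It is only a guess at the approach, not the script actually used: it lists candidates and changes nothing, and `find_mode_777` is a hypothetical name.

```python
import os
import stat


def find_mode_777(root):
    """Yield paths under `root` whose mode bits are exactly 0777."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)
            if stat.S_ISLNK(info.st_mode):
                # Symlinks report 0777 on Linux regardless; skip them.
                continue
            # S_IMODE keeps the setuid/setgid/sticky bits, so a sticky
            # directory such as /tmp (1777) does not match here.
            if stat.S_IMODE(info.st_mode) == 0o777:
                yield path
```

Reviewing the list before changing anything matters because regular files that were never executable would normally want 644 rather than 755.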

Result of lspci -nn
Code:
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD/ATI] RD9x0/RX980 Host Bridge [1002:5a14] (rev 02)
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD/ATI] RD890S/RD990 I/O Memory Management Unit (IOMMU) [1002:5a23]
00:02.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GFX port 0) [1002:5a16]
00:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP Port 4) [1002:5a1c]
00:0a.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP Port 5) [1002:5a1d]
00:0d.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RD890/RD9x0/RX980 PCI to PCI bridge (PCI Express GPP2 Port 0) [1002:5a1e]
00:11.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode] [1002:4391] (rev 40)
00:12.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:12.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:13.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:13.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller [1002:4385] (rev 42)
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller [1002:439d] (rev 40)
00:14.4 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge [1002:4384] (rev 40)
00:14.5 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller [1002:4399]
00:15.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0) [1002:43a0]
00:15.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1) [1002:43a1]
00:15.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] SB900 PCI to PCI bridge (PCIE port 2) [1002:43a2]
00:16.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller [1002:4397]
00:16.2 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller [1002:4396]
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0 [1022:1600]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1 [1022:1601]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2 [1022:1602]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3 [1022:1603]
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4 [1022:1604]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5 [1022:1605]
01:00.0 Ethernet controller [0200]: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] [15b3:6750] (rev b0)
02:00.0 USB controller [0c03]: Etron Technology, Inc. EJ168 USB 3.0 Host Controller [1b6f:7023] (rev 01)
03:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller [1b4b:9172] (rev 11)
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GT200b [GeForce GTX 285] [10de:05e3] (rev a1)
05:0e.0 FireWire (IEEE 1394) [0c00]: VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller [1106:3044] (rev c0)
06:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 06)
07:00.0 USB controller [0c03]: Etron Technology, Inc. EJ168 USB 3.0 Host Controller [1b6f:7023] (rev 01)
08:00.0 SATA controller [0106]: Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller [1b4b:9172] (rev 11)
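
For picking the NIC IDs out of a listing like this one, a small Python sketch, assuming the `lspci -nn` line layout shown above (slot, class name, `[class id]`, device name, `[vendor:device]`). The function name `network_devices` is made up for illustration; PCI base class 02 is the network controller class.

```python
import re

# slot, class name, [class id]: device name [vendor:device]
LINE = re.compile(
    r"^(?P<slot>\S+) (?P<cls>.+?) \[(?P<cls_id>[0-9a-f]{4})\]: "
    r"(?P<name>.+) \[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def network_devices(lspci_text):
    """Return (slot, 'vendor:device', name) for every network controller."""
    found = []
    for line in lspci_text.splitlines():
        match = LINE.match(line.strip())
        # Class IDs starting with 02 are network controllers (0200 = Ethernet).
        if match and match.group("cls_id").startswith("02"):
            pci_id = match.group("vendor") + ":" + match.group("device")
            found.append((match.group("slot"), pci_id, match.group("name")))
    return found
```

On the output above this should single out the Mellanox card (`15b3:6750`) and the onboard Realtek (`10ec:8168`), which are the IDs to feed into a search engine.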

Old GPU, but at least it has every display connector.
Using fwupd to update and then rebooting fixed a lot of issues. I was having trouble with the Debian installer for UEFI installs, so I had to switch back to legacy. I did repartition the drive, so there was no trace left of any older openSUSE installs. The main reason I used Debian here was that I couldn't get OpenLiteSpeed to run on openSUSE despite compiling it from source. So for me the biggest connectivity problem was fixed by running fwupd to update, then running an upgrade with apt-get before rebooting.

However, the random timeouts are something to investigate. It may not be a problem anymore now that it happens 1% of the time, but it seems to be a trend I am noticing on many different devices; to rule out network issues, I can for instance ping between servers in the same datacenter. It's annoying when it used to be that you only got the odd lost packet in a day rather than a 1% rate.