  1. #1
    Just Joined!
    Join Date
    Sep 2014
    Posts
    4

    TCP connexion failure, SYNs but no SYNACKs


    In our product, we are seeing a TCP connexion issue on localhost and wonder if anyone has any ideas.

    On our affected customer systems this happens anywhere from about once in every few hundred to once in every several thousand connexion requests. It has proved almost impossible to reproduce the problem in house, where we have managed just a handful of failures in tens of thousands of connexions.

    What we see in a packet capture of an affected connexion is the client sending a SYN to the server. No SYNACK comes back. After a timeout the kernel retransmits the SYN on the client's behalf, with the same result; after six tries it gives up and times out the connexion attempt. During this time other connexions between the client and server are made successfully. [Both the client and server use non-blocking sockets and support multiple parallel connexions.] This indicates that it is not the server stalling and failing to accept connexions. netstat -nato also shows the affected sockets sitting in SYN_SENT across the retries.
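
    From the client's point of view the failure looks roughly like the sketch below (Python, purely for illustration - this is not our actual code): a non-blocking connect() that never becomes writable, after which SO_ERROR reports ETIMEDOUT once the kernel's SYN retries are exhausted (with the default net.ipv4.tcp_syn_retries of 6 that is roughly two minutes).

        import errno, select, socket

        def try_connect(port, timeout=130.0):
            # Non-blocking connect to the local server; connect_ex normally
            # returns EINPROGRESS and the kernel sends the SYN for us.
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.setblocking(False)
            err = s.connect_ex(("127.0.0.1", port))
            if err not in (0, errno.EINPROGRESS):
                s.close()
                return err
            # The socket becomes writable when the SYNACK arrives, or when the
            # kernel finally gives up; SO_ERROR then tells us which it was.
            _, writable, _ = select.select([], [s], [], timeout)
            if writable:
                err = s.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            else:
                err = errno.ETIMEDOUT  # our own timeout expired first
            s.close()
            return err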

    The proportion of failures does not seem to be significantly affected by load. However it does take an hour or so from when the applications start until the problem first appears.

    We do not see any ListenOverflows or ListenDrops in /proc/net/netstat at the relevant times, which again makes it look like something other than a simple connect queueing issue.

    We do see the TCPAbortOnTimeout counter increment more around the times the problem is occurring, though it is not very clear cut, and perhaps the same for TCPDeferAcceptDrop and DelayedACKLost. The server does not use TCP_DEFER_ACCEPT. Note that both the client and server also handle/make extensive external network connexions (both are in fact proxies), so there is a lot of noise in the statistics from that.
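
    For reference, this is roughly how those counters can be sampled and diffed over time (a small Python sketch, not our monitoring code): /proc/net/netstat is pairs of lines, one of field names and one of values, and the TcpExt pair carries ListenOverflows, ListenDrops, TCPAbortOnTimeout and friends.

        def tcpext_counters(path="/proc/net/netstat"):
            # The file is pairs of lines: a header of field names followed by
            # a line of values, one pair per section ("TcpExt:", "IpExt:", ...).
            counters = {}
            with open(path) as f:
                lines = f.read().splitlines()
            for names, values in zip(lines[0::2], lines[1::2]):
                if names.startswith("TcpExt:"):
                    counters = dict(zip(names.split()[1:],
                                        map(int, values.split()[1:])))
            return counters

        c = tcpext_counters()
        print({k: c.get(k, 0)
               for k in ("ListenOverflows", "ListenDrops", "TCPAbortOnTimeout")})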

    The problem started to occur when we released a new version of our product where we moved from kernel 3.2.33 (custom configuration) to 3.13.0 (slightly modified Ubuntu configuration). We have tried the latest Ubuntu patches for the 3.13.0 kernel without any benefit. The networking side of the client and server did not change.

  2. #2
    Linux Guru
    Join Date
    Dec 2013
    Posts
    1,556
    I've read about TCP window scaling and TCP timestamps sometimes causing problems like this, and turning those options off has been a solution for some people. It seems to happen most often where there is a faulty firewall in between. A quick way to experiment with the settings is sketched below.
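
    Something like this (Python for illustration, run as root on a test box; it is the same knobs as sysctl net.ipv4.tcp_window_scaling / tcp_timestamps, and they are global, so don't leave them off on a production machine):

        def set_sysctl(name, value):
            # Equivalent to "sysctl -w net.ipv4.<name>=<value>".
            with open("/proc/sys/net/ipv4/" + name, "w") as f:
                f.write(str(value))

        set_sysctl("tcp_window_scaling", 0)  # disable window scaling
        set_sysctl("tcp_timestamps", 0)      # disable TCP timestamps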

  3. #3
    Linux Newbie
    Join Date
    Jun 2012
    Location
    SF Bay area
    Posts
    202
    It sounds like you can reproduce the problem in house, albeit at a much lower frequency. Even so, as long as you can automate the testing you can crank out the tens of thousands of connections needed to debug the behavior. So I'd suggest running packet captures on both the client and server, run tests until you hit one of the failures, then look at what's happening at the network level from BOTH sides. And if you can get packet traces on the switch/router (or a system getting a mirror of the ports) as well, that would be good too.
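
    A driver loop along these lines would do (a rough Python sketch just to show the shape of it; the port number is a placeholder for whatever your server listens on, and the 5 second timeout is deliberately shorter than the kernel's own SYN-retry limit so each failure gets logged quickly):

        import socket, time

        SERVER = ("127.0.0.1", 8080)   # placeholder - use your server's port

        failures = 0
        for i in range(50000):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.settimeout(5.0)          # flag a missing SYNACK quickly
            try:
                s.connect(SERVER)
            except socket.timeout:
                failures += 1
                print("connect %d timed out at %s" % (i, time.strftime("%H:%M:%S")))
            finally:
                s.close()
        print("failures:", failures)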

  5. #4
    Just Joined!
    Join Date
    Sep 2014
    Posts
    4
    Quote Originally Posted by cnamejj View Post
    It sounds like you can reproduce the problem in house, albeit at a much lower frequency. Even so, as long as you can automate the testing you can crank out the tens of thousands of connections needed to debug the behavior. So I'd suggest running packet captures on both the client and server, run tests until you hit one of the failures, then look at what's happening at the network level from BOTH sides. And if you can get packet traces on the switch/router (or a system getting a mirror of the ports) as well, that would be good too.
    We have packet captures from customer machines - that is how we determined the lack of SYNACKs in the first place. We have also managed to get a capture of an in house failure which showed the same thing. As the issue is on localhost, packet captures on switches and routers are not likely to be useful.

    Anyway we have more details now. The server socket is already in CLOSE_WAIT when the new connexion attempt is made to it. We suspect the server is not calling close() quickly enough when the client closes its end. This puts the server end into CLOSE_WAIT and the client end into FIN_WAIT2. The FIN_WAIT2 times out and the client end closes completely, leaving the client port free for reuse. If the client then reuses that port to talk to the server, the server ignores the SYNs because it treats them as out-of-sequence traffic for the original connexion, which it still has in CLOSE_WAIT.
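
    The first half of that sequence is easy to demonstrate in isolation (a throwaway Python sketch, nothing to do with our actual client and server): accept a connexion, never close() the accepted socket, and watch it sit in CLOSE_WAIT while the client side moves to FIN_WAIT2.

        import socket, subprocess, time

        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("127.0.0.1", 0))        # port 0: kernel picks a free port
        srv.listen(5)
        port = srv.getsockname()[1]

        cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        cli.connect(("127.0.0.1", port))
        conn, _ = srv.accept()
        cli.close()                        # client sends FIN, ends up in FIN_WAIT2
        time.sleep(1)                      # 'conn' is never close()d -> CLOSE_WAIT

        # Typically shows one socket in CLOSE-WAIT and one in FIN-WAIT-2.
        out = subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout
        print("\n".join(l for l in out.splitlines() if ":%d" % port in l))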

    So the question now is what has changed between those two kernels that would cause the change in the server's behaviour. It could be something like a change to epoll affecting when/how the server gets to know about the client closing the connexion. We have also seen the issue on a 3.10 kernel, which narrows the range a bit.
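
    To illustrate the sort of notification we mean (a hedged Python sketch purely for the idea; Squid's event loop is of course its own C code): an epoll-driven server learns about the peer's close via EPOLLRDHUP, or via a read that returns zero bytes, and if it never acts on that the socket stays in CLOSE_WAIT.

        import select, socket

        def handle_one(listen_sock):
            conn, _ = listen_sock.accept()
            conn.setblocking(False)
            ep = select.epoll()
            # EPOLLRDHUP is raised when the peer shuts down its sending side;
            # an EPOLLIN with recv() returning b"" means the same thing.
            ep.register(conn.fileno(), select.EPOLLIN | select.EPOLLRDHUP)
            for fd, events in ep.poll(60):
                if events & (select.EPOLLRDHUP | select.EPOLLHUP):
                    conn.close()            # skip this and CLOSE_WAIT lingers
                elif events & select.EPOLLIN and conn.recv(4096) == b"":
                    conn.close()
            ep.close()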

    If it helps, the server is Squid. This behaviour probably would not be noticed in its normal use case. It is only because we are hitting it repeatedly from the same client that its delaying the close becomes so noticeable.

  6. #5
    Linux Guru
    Join Date
    Dec 2013
    Posts
    1,556
    The server's kernel should send an ACK for the original connection, because the sequence numbers would appear to belong to it. If the client got that ACK it would reply with a RST and the stale socket would be cleared. Are there any firewalls between the client and server that might be dropping the out-of-sequence response?

  7. #6
    Just Joined!
    Join Date
    Sep 2014
    Posts
    4


    Quote Originally Posted by gregm View Post
    The server's kernel should send an ACK for the original connection, because the sequence numbers would appear to belong to it. If the client got that ACK it would reply with a RST and the stale socket would be cleared. Are there any firewalls between the client and server that might be dropping the out-of-sequence response?
    No firewall - it is all localhost.

    Anyway we think we have got to the bottom of this.

    We do a slightly hacky connexion to the server, Squid, which involves some small patches to Squid. We had missed something which, in certain unusual circumstances (an https request to a domain that did not have a DNS entry), caused Squid to go into a 'should never happen' case where it silently abandoned the connexion without closing it. The client closed its end and the server end ended up in CLOSE_WAIT. The client timed out its various WAIT states, but the server's CLOSE_WAIT does not time out.
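
    Incidentally, the leaked descriptors are easy to spot once you know to look for them: loopback sockets stuck in CLOSE_WAIT in /proc/net/tcp. A quick Python sketch of one way to check (state 08 is CLOSE_WAIT in the kernel's encoding, and 0100007F is 127.0.0.1 in its little-endian hex form):

        def loopback_close_wait(path="/proc/net/tcp"):
            stuck = []
            with open(path) as f:
                next(f)                     # skip the header line
                for line in f:
                    fields = line.split()
                    local, remote, state = fields[1], fields[2], fields[3]
                    if state == "08" and local.startswith("0100007F"):
                        stuck.append((int(local.split(":")[1], 16),    # local port
                                      int(remote.split(":")[1], 16)))  # remote port
            return stuck

        print(loopback_close_wait())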

    Eventually the client reused the same port at its end. Under the older kernels a bug meant that the server side would accept the new connexion (possibly a window bug, as originally suggested by gregm). However the newer kernels correctly ignore the incoming SYN as being out of sequence for the old connexion, and we get the sporadic connexion failure that was the original symptom. We have not located the precise change in the kernel but have narrowed it to somewhere between 3.5 and 3.8.

    As you might guess, this has not been easy to find. It has taken several people working on it for several days.

    Thank you to everyone who made suggestions.

  8. #7
    Linux Newbie
    Join Date
    Jun 2012
    Location
    SF Bay area
    Posts
    202
    Unless you've modified the kernel to allow it, or possibly by doing tricky things with routing or "iptables"-like functionality, Squid isn't capable of closing a connection without sending the appropriate TCP/IP packets to the other side.

    If that is happening, meaning an established connection is abandoned by the server without anything being sent to the client, then something very odd is going on IMO.

    I have seen something like that happen when there are poorly designed load balancers in the middle of the client/server connections. But without some network level appliance in the middle it shouldn't be possible for an application to violate the TCP/IP protocol by not at least trying to notify the other side that the connection was terminated.

  9. #8
    Just Joined!
    Join Date
    Sep 2014
    Posts
    4
    Quote Originally Posted by cnamejj View Post
    Unless you've modified the kernel to allow it, or possibly by doing tricky things with routing or "iptables" like functionality, Squid isn't capable of closing a connection without sending the appropriate TCP/IP packets to the other side.
    Squid is not closing the connexion. It is keeping the file descriptor open and forgetting about it.
