  1. #1
    Just Joined!
    Join Date
    Nov 2007
    Posts
    4

    Burst of TCP traffic taking too long


    I'm seeing bursts of TCP traffic occasionally take a long time to complete in my application. In my test I have a distributed application running on 4 nodes, each connected to a Gigabit switch. Each node is running Fedora Core 2 with kernel 2.6.16. The TCP bursts occur when an external timer triggers the application on each node to send TCP data, at the same time, to the other nodes. Each node has 3 TCP connections to each other node, so during the burst each node is trying to send TCP data on a total of 9 connections while at the same time receiving data on 9 connections. All messages sent and received during the burst are 32 KBytes each. On a Gigabit switch this burst requires very little bandwidth and normally completes in several milliseconds, as it should. Occasionally, the entire burst takes several HUNDRED milliseconds to complete, and this is the problem I'm trying to solve.

    I've captured instances of the long bursts using Ethereal. After analyzing specific TCP connections, I see that the sender sends all TCP packets for the 32K message within a millisecond, but the receiver does not ACK all of them immediately. Then the sender starts retransmitting the packets that have not been ACKed. In the middle of retransmitting there is a period of hundreds of milliseconds with no traffic for that connection, after which the remaining packets are retransmitted. I see traffic for other connections during the dead time, so I know the entire interface is not dead. During the burst, Ethereal shows many packets flagged as Duplicate ACK, ACKed Lost Segment, Out of Order, or Previous Segment Lost. I'm not sure if these are "normal" for TCP or maybe part of the problem.

    When I run the same test on one node instead of 4, I do not see any bursts taking a long time. For this test I ran 4 instances of the application on one node, so all TCP traffic used the loopback interface. This tells me the problem is not in my application, but maybe in the kernel's TCP stack or the Ethernet driver (e1000)? I'm assuming the problem is not in the Gigabit switch.

    I've started playing with tuning the TCP parameters using sysctl, but have not been able to resolve the problem. Please help!
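
    As a sanity check on the "several milliseconds" figure in the first post, the wire time for one node's outgoing share of the burst can be estimated from the numbers given (9 connections, 32 KBytes each, 1 Gbit/s). This is only a rough sketch: it assumes 1 KByte = 1024 bytes, ignores header overhead and switch latency, and `burst_wire_time_us` is a made-up helper name.

```shell
#!/bin/sh
# Rough wire-time estimate for one node's outgoing share of the burst.
# Figures from the post: 9 connections, 32 KBytes per message, 1 Gbit/s link.
# Assumptions: 1 KByte = 1024 bytes; TCP/IP/Ethernet header overhead ignored.
burst_wire_time_us() {
    conns=$1        # number of outgoing connections
    msg_bytes=$2    # bytes sent per connection
    link_bps=$3     # link speed in bits per second
    awk -v c="$conns" -v m="$msg_bytes" -v r="$link_bps" \
        'BEGIN { printf "%d\n", (c * m * 8 * 1000000) / r }'
}

# Prints the estimate in microseconds (truncated to a whole number).
burst_wire_time_us 9 32768 1000000000
```

    That comes to roughly 2.4 ms per node, so a burst lasting hundreds of milliseconds is about two orders of magnitude slower than the link alone would explain.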

  2. #2
    Linux Enthusiast
    Join Date
    Aug 2006
    Location
    Portsmouth, UK
    Posts
    539
    TBH, tweaking TCP parameters usually just makes things worse...

    Have a look at the ring buffer sizes on your NICs:

    Code:
    ethtool -g ethX
    man ethtool
    RHCE #100-015-395
    Please don't PM me with questions as no reply may offend, that's what the forums are for.

  3. #3
    Just Joined!
    Join Date
    Nov 2007
    Posts
    4
    ethX ring params:

    Pre-set maximums:
    RX: 4096
    RX Mini: 0
    RX Jumbo: 0
    TX: 4096
    Current hardware settings:
    RX: 256
    RX Mini: 0
    RX Jumbo: 0
    TX: 256

    Do you suggest setting both the RX and TX to 4096?

  4. #4
    Linux Enthusiast
    Join Date
    Aug 2006
    Location
    Portsmouth, UK
    Posts
    539
    Absolutely, see how it works for you.

    If you get good results you can make the changes persistent by adding the ethtool commands to /etc/rc.local. Don't forget to add a service network restart afterwards as well though.
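
    For anyone following along, here is a minimal sketch of reading the limits out of `ethtool -g` before raising the rings. The sample text is the output pasted in post #3; `ring_maximums` is a hypothetical helper name, and the parsing assumes the section headings shown there.

```shell
#!/bin/sh
# Read `ethtool -g` output on stdin and print the preset maximum RX and TX
# ring sizes as "RX TX". Only the "Pre-set maximums" section is considered;
# "RX Mini:" and "RX Jumbo:" are skipped because their first field is "RX".
ring_maximums() {
    awk '
        /Pre-set maximums/          { in_max = 1; next }
        /Current hardware settings/ { in_max = 0 }
        in_max && $1 == "RX:"       { rx = $2 }
        in_max && $1 == "TX:"       { tx = $2 }
        END                         { print rx, tx }
    '
}

# Output copied from post #3 of this thread.
sample='Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 256'

printf '%s\n' "$sample" | ring_maximums

# On a real host (as root), the values would then be applied with:
#   ethtool -G ethX rx 4096 tx 4096
```

    On a live machine the same function can be fed directly with `ethtool -g ethX | ring_maximums`.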

  5. #5
    Just Joined!
    Join Date
    Nov 2007
    Posts
    4
    Changed both TX and RX ring buffer sizes to 4096, but it made no difference.
    Any other suggestions?

  6. #6
    Linux Enthusiast
    Join Date
    Aug 2006
    Location
    Portsmouth, UK
    Posts
    539
    Daft question, but did you restart the network service after the change?

  7. #7
    Just Joined!
    Join Date
    Nov 2007
    Posts
    4
    Sure did. Why did you suggest increasing the ring buffers? Are there any tools I can use to monitor these buffers to see if they're filling up? Or if the kernel network queues are filling up? I've been using netstat to monitor things ("netstat -i ethX -c 10" and "netstat -sc"), but I'm not seeing any dropped packets or anything else that explains the problem.

  8. #8
    Linux Enthusiast
    Join Date
    Aug 2006
    Location
    Portsmouth, UK
    Posts
    539
    Increasing the ring buffers means that your NICs can hold more data while waiting for "someapp" to process what it has already received, before they have to start dropping packets.

    One thing you'll usually notice with larger buffers is a drop in the number of interrupt requests generated by network traffic.

    I'm not aware of any monitoring tools for the ring buffers, it's more of a suck it and see...
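
    On the monitoring question from post #7, the kernel does keep per-interface drop and FIFO-overrun counters in /proc/net/dev, and `ethtool -S ethX` prints driver-level statistics (the counter names vary by driver, so check what e1000 actually reports on your kernel). The sketch below pulls a few counters out of /proc/net/dev; `iface_drops` is a hypothetical helper, and the field positions assume the usual 16-column layout of that file.

```shell
#!/bin/sh
# Print receive/transmit drop counters for one interface, reading
# /proc/net/dev-formatted text on stdin.
# Assumed column order after the interface name:
#   RX: bytes packets errs drop fifo frame compressed multicast
#   TX: bytes packets errs drop fifo colls carrier compressed
iface_drops() {
    awk -v iface="$1" '
        # The name can be glued to the first counter ("eth0:1234"),
        # so turn the colon into a space before comparing fields.
        { sub(/:/, " ") }
        $1 == iface { print "rx_drop=" $5, "rx_fifo=" $6, "tx_drop=" $13 }
    '
}

# Example: show the counters for the loopback interface, if /proc is available.
if [ -r /proc/net/dev ]; then
    iface_drops lo < /proc/net/dev
fi
```

    Running it before and after a slow burst and comparing the numbers would show whether the NIC or the kernel dropped anything in between. If all counters stay flat, the loss is probably happening somewhere else.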
