- 11-29-2007 #1 Just Joined!
- Join Date
- Nov 2007
- Posts
- 4
Burst of TCP traffic taking too long
I'm seeing bursts of TCP traffic occasionally take a long time to complete in my application. In my test, a distributed application runs on 4 nodes, each connected to a Gigabit switch. Each node runs Fedora Core 2 with kernel 2.6.16. The TCP bursts occur when an external timer triggers the application on every node to send TCP data to the other nodes at the same time. Each node has 3 TCP connections to each other node, so during a burst each node sends TCP data on 9 connections while receiving data on 9 connections. Every message sent and received during the burst is 32 KBytes. On a Gigabit switch this burst needs very little bandwidth and normally completes within several milliseconds, as it should. Occasionally, though, the entire burst takes several HUNDRED milliseconds to complete, and this is the problem I'm trying to solve.
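As a rough sanity check on the "several milliseconds" figure, the per-node burst can be worked out from the numbers above. This is only a back-of-the-envelope sketch: it assumes the nominal 1 Gbit/s line rate and ignores Ethernet, IP and TCP header overhead, and the function name is made up for illustration.

```python
# Back-of-the-envelope estimate of how long one node's outgoing burst
# occupies the wire. Figures come from the post; header overhead is
# ignored (assumption), so the real number is slightly higher.

LINK_BITS_PER_SEC = 1_000_000_000   # Gigabit Ethernet, nominal rate
MESSAGE_BYTES = 32 * 1024           # one 32 KByte message
CONNECTIONS_OUT = 9                 # 3 peers x 3 connections each


def burst_wire_time_ms(message_bytes: int, connections: int, link_bps: int) -> float:
    """Time to serialize one node's outgoing burst onto the link, in ms."""
    total_bits = message_bytes * connections * 8
    return total_bits / link_bps * 1000.0


if __name__ == "__main__":
    ms = burst_wire_time_ms(MESSAGE_BYTES, CONNECTIONS_OUT, LINK_BITS_PER_SEC)
    total = MESSAGE_BYTES * CONNECTIONS_OUT
    print(f"{total} bytes per node, about {ms:.2f} ms on the wire")
```

Under those assumptions each node pushes 294912 bytes, which is a little under 2.4 ms of wire time, so a burst lasting hundreds of milliseconds is roughly two orders of magnitude slower than the link itself would explain.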
I've captured instances of the long bursts using Ethereal. Analyzing specific TCP connections, I see that the sender sends all TCP packets for the 32K message within a millisecond, but the receiver does not ACK all of them immediately. The sender then starts retransmitting the packets that have not been ACKed. In the middle of the retransmissions there is a period of hundreds of milliseconds with no traffic on that connection, after which the remaining packets are retransmitted. I see traffic for other connections during the dead time, so I know the interface as a whole is not dead. During the burst, Ethereal flags many packets as Duplicate ACK, ACKed Lost Segment, Out of Order, or Previous Segment Lost. I'm not sure whether these are "normal" for TCP or part of the problem.
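One possible reading of the dead time, offered as a guess rather than a diagnosis: when duplicate ACKs do not recover a lost segment, the sender has to wait for its retransmission timer, and that timer backs off exponentially. The sketch below is a simplified model with no RTT estimation. The 200 ms floor and 120 s ceiling correspond to the Linux kernel constants TCP_RTO_MIN and TCP_RTO_MAX; the function name and the starting RTO are illustrative assumptions.

```python
# Simplified model of TCP retransmission-timeout backoff.
# Assumption: the timer starts at some initial RTO, is clamped to the
# Linux floor/ceiling, and doubles after every timeout.

RTO_MIN_MS = 200        # Linux TCP_RTO_MIN (HZ/5)
RTO_MAX_MS = 120_000    # Linux TCP_RTO_MAX (120 s)


def rto_schedule_ms(initial_rto_ms: int, retries: int) -> list:
    """Return the timeout waited before each successive retransmission."""
    rto = max(RTO_MIN_MS, min(initial_rto_ms, RTO_MAX_MS))
    schedule = []
    for _ in range(retries):
        schedule.append(rto)
        rto = min(rto * 2, RTO_MAX_MS)   # exponential backoff, capped
    return schedule


if __name__ == "__main__":
    # On a LAN the measured RTT is well under 1 ms, so the floor dominates.
    print(rto_schedule_ms(initial_rto_ms=1, retries=3))
```

If this is what is happening, a single timeout on a LAN connection costs at least 200 ms, and a second one on the same segment costs 400 ms more, which would be in the same range as the gaps seen in the capture.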
When I run the same test on one node instead of 4, I do not see any bursts taking a long time. For this test I ran 4 instances of the application on one node, so all TCP traffic used the loopback interface. This tells me the problem is not in my application, but maybe in the kernel's TCP stack or the Ethernet driver (e1000)? I'm assuming the problem is not in the Gigabit switch.
I've started playing with tuning the TCP parameters using sysctl, but have not been able to resolve the problem. Please help!
- 11-30-2007 #2 Linux Enthusiast
- Join Date
- Aug 2006
- Location
- Portsmouth, UK
- Posts
- 539
TBH, tweaking TCP parameters usually just makes things worse...
Have a look at the ring buffer sizes on your NICs (see man ethtool):
Code:
ethtool -g ethX
RHCE #100-015-395
Please don't PM me with questions as no reply may offend, that's what the forums are for.
- 11-30-2007 #3 Just Joined!
- Join Date
- Nov 2007
- Posts
- 4
EthX ring params:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 256
Do you suggest setting both the RX and TX to 4096?
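For a sense of scale, here is a rough count of how many frames one incoming burst amounts to, compared with the 256-entry RX ring shown above. It is only a sketch: the 1460-byte MSS assumes a 1500-byte MTU with no TCP options, it assumes one ring descriptor per frame, it ignores the ACKs that arrive for outgoing data, and it treats the worst case where nothing is drained from the ring while the burst arrives.

```python
import math

# Assumption: 1500-byte MTU and no TCP options, so 1460 payload bytes
# per segment. With TCP timestamps enabled this would be 1448.
MSS_BYTES = 1460


def frames_per_message(message_bytes: int, mss: int = MSS_BYTES) -> int:
    """Number of full-size TCP segments needed to carry one message."""
    return math.ceil(message_bytes / mss)


def frames_per_burst(message_bytes: int, connections: int, mss: int = MSS_BYTES) -> int:
    """Data frames arriving at one node if every connection delivers one message."""
    return frames_per_message(message_bytes, mss) * connections


if __name__ == "__main__":
    rx_ring = 256                       # current RX setting from ethtool -g
    burst = frames_per_burst(32 * 1024, 9)
    print(f"{burst} data frames per burst vs {rx_ring} RX descriptors")
```

With these assumptions a burst is 207 data frames, which sits uncomfortably close to a 256-entry ring once ACKs and other traffic are added, and comfortably inside a 4096-entry one.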
- 11-30-2007 #4 Linux Enthusiast
- Join Date
- Aug 2006
- Location
- Portsmouth, UK
- Posts
- 539
Absolutely, see how it works for you.
If you get good results, you can make the changes persistent by adding the ethtool commands to /etc/rc.local. Don't forget to add a service network restart afterwards as well, though.
- 11-30-2007 #5 Just Joined!
- Join Date
- Nov 2007
- Posts
- 4
Changed both TX and RX ring buffer sizes to 4096, but it made no difference.

Any other suggestions?
- 11-30-2007 #6 Linux Enthusiast
- Join Date
- Aug 2006
- Location
- Portsmouth, UK
- Posts
- 539
Daft question, but did you restart the network service after the change?
- 11-30-2007 #7 Just Joined!
- Join Date
- Nov 2007
- Posts
- 4
Sure did. Why did you suggest increasing the ring buffers? Are there any tools I can use to monitor these buffers to see if they're filling up? Or to see if the kernel's network queues are filling up? I've been using netstat to monitor things ("netstat -i ethX -c 10" and "netstat -sc"), but I'm not seeing any dropped packets or anything else that explains the problem.
- 12-01-2007 #8 Linux Enthusiast
- Join Date
- Aug 2006
- Location
- Portsmouth, UK
- Posts
- 539
Increasing the ring buffers means that your NICs can hold/accept more data while waiting for "someapp" to process what it has already received, before having to drop packets.
One thing you'll usually notice with larger buffers is that the number of interrupt requests generated by network traffic drops.
I'm not aware of any monitoring tools for the ring buffers; it's more a case of suck it and see....
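Short of a dedicated ring-buffer monitor, one rough way to look for symptoms is to snapshot the per-interface counters in /proc/net/dev before and after a burst and compare them. The sketch below assumes the standard 16-column layout of that file; the function names are made up for illustration, and which counter a given driver increments when its RX ring is exhausted varies, so treat the "drop" and "fifo" columns as hints to check rather than proof. Where the driver supports it, `ethtool -S ethX` gives more detailed, driver-specific counters.

```python
import time


def parse_net_dev(text: str) -> dict:
    """Parse /proc/net/dev content into {iface: {counter_name: value}}."""
    stats = {}
    for line in text.splitlines()[2:]:          # first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:                    # unexpected layout, skip
            continue
        values = [int(f) for f in fields[:16]]
        # Columns 0-7 are receive counters, 8-15 are transmit counters.
        stats[name.strip()] = {
            "rx_packets": values[1],
            "rx_errs": values[2],
            "rx_drop": values[3],
            "rx_fifo": values[4],
            "tx_packets": values[9],
            "tx_errs": values[10],
            "tx_drop": values[11],
            "tx_fifo": values[12],
        }
    return stats


def counter_deltas(before: dict, after: dict, iface: str) -> dict:
    """Difference between two snapshots for one interface."""
    return {key: after[iface][key] - before[iface][key] for key in after[iface]}


def sample_deltas(iface: str, interval_s: float = 10.0,
                  path: str = "/proc/net/dev") -> dict:
    """Take two snapshots interval_s apart and return what changed."""
    with open(path) as f:
        before = parse_net_dev(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_net_dev(f.read())
    return counter_deltas(before, after, iface)
```

Calling `sample_deltas("eth0")` around a test run (with "eth0" replaced by the real interface name) and seeing `rx_drop` or `rx_fifo` move while the slow bursts occur would point at the receive path; counters that stay flat would suggest looking elsewhere.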