Burst of TCP traffic taking too long
I'm seeing bursts of TCP traffic occasionally take a long time to complete in my application. My test setup is a distributed application running on 4 nodes, each connected to a Gigabit switch. Each node runs Fedora Core 2 with kernel 2.6.16. A burst occurs when an external timer triggers the application on every node to send TCP data to the other nodes at the same time. Each node has 3 TCP connections to each of the other nodes, so during a burst each node is sending on 9 connections while also receiving on 9 connections. Every message sent or received during the burst is 32 KB. On a Gigabit switch the burst needs very little bandwidth and normally completes within several milliseconds, as it should. Occasionally, though, the entire burst takes several HUNDRED milliseconds to complete, and this is the problem I'm trying to solve.
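To put a number on "very little bandwidth", here is a rough back-of-the-envelope sketch. It assumes one 32 KB message per connection per burst, 32 KB meaning 32 * 1024 bytes, and a link rate of exactly 1e9 bit/s. It ignores Ethernet/IP/TCP header overhead, ACK traffic, and switch latency, so it is only a lower bound. The function name is made up for illustration.

```python
def burst_wire_time_ms(connections, message_bytes, link_bits_per_s):
    """Lower bound on the time to put one burst on the wire, in milliseconds.

    Ignores Ethernet/IP/TCP header overhead, ACK traffic, and switch latency.
    """
    total_bits = connections * message_bytes * 8
    return total_bits / link_bits_per_s * 1000.0


# Assumptions: one 32 KB message per connection, 32 KB = 32 * 1024 bytes,
# and a link rate of exactly 1e9 bit/s.
per_node_ms = burst_wire_time_ms(connections=9,
                                 message_bytes=32 * 1024,
                                 link_bits_per_s=1_000_000_000)
print(f"{per_node_ms:.2f} ms per node, per direction")  # prints 2.36 ms ...
```

Under those assumptions each node needs roughly 2.4 ms of wire time in each direction, which matches the "several milliseconds" seen in the normal case and is about two orders of magnitude below the slow bursts.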
I've captured instances of the long bursts with Ethereal. Looking at individual TCP connections, I see that the sender transmits all packets of the 32 KB message within a millisecond, but the receiver does not ACK all of them immediately. The sender then starts retransmitting the packets that have not been ACKed. Partway through the retransmissions there is a period of hundreds of milliseconds with no traffic at all on that connection, after which the remaining packets are retransmitted. I see traffic on other connections during this dead time, so I know the interface as a whole is not dead. During the burst, Ethereal flags many packets as Duplicate ACK, ACKed Lost Segment, Out of Order, or Previous Segment Lost. I'm not sure whether these are "normal" for TCP or part of the problem.
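A minimal sketch for cross-checking the capture against the kernel's own counters, assuming the nodes expose the usual Linux /proc/net/snmp file. It parses the "Tcp:" header/value line pair and reports how much RetransSegs grew between two snapshots. The function names are made up for illustration, and RetransSegs is a system-wide counter, so it cannot attribute retransmissions to a single connection.

```python
def parse_tcp_counters(snmp_text):
    """Return the Tcp counters from /proc/net/snmp text as a dict of ints.

    The file holds a pair of lines per protocol: a header line of field
    names followed by a line of values, both prefixed with "Tcp:".
    """
    tcp_lines = [line.split()[1:] for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp header/value pair found")
    names, values = tcp_lines[0], tcp_lines[1]
    return {name: int(value) for name, value in zip(names, values)}


def read_tcp_counters(path="/proc/net/snmp"):
    """Take one snapshot of the Tcp counters (Linux only)."""
    with open(path) as f:
        return parse_tcp_counters(f.read())


def retransmits_between(before, after):
    """Segments retransmitted system-wide between two snapshots."""
    return after["RetransSegs"] - before["RetransSegs"]
```

Calling read_tcp_counters() just before and just after a burst and passing both results to retransmits_between() would show whether the slow bursts coincide with a jump in retransmitted segments. The same line also carries an RtoMin field, in milliseconds, which may be worth comparing against the length of the dead period.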
When I run the same test on one node instead of 4, I do not see any bursts taking a long time. For that test I ran 4 instances of the application on a single node, so all TCP traffic went over the loopback interface. This suggests the problem is not in my application, but perhaps in the kernel's TCP stack or the Ethernet driver (e1000). I'm assuming the problem is not in the Gigabit switch.
I've started tuning the TCP parameters with sysctl, but have not been able to resolve the problem. Please help!
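For anyone comparing settings across the four nodes, here is a small sketch that dumps the current TCP sysctl values. It assumes the standard Linux layout where each net.ipv4.tcp_* setting is a file under /proc/sys/net/ipv4. The function names are made up for illustration, and which entries exist depends on the kernel version.

```python
import glob
import os


def read_tcp_sysctls(base="/proc/sys/net/ipv4"):
    """Return {name: value} for every readable tcp_* entry under base."""
    settings = {}
    for path in sorted(glob.glob(os.path.join(base, "tcp_*"))):
        try:
            with open(path) as f:
                # Collapse tabs/newlines so multi-value entries fit on one line.
                settings[os.path.basename(path)] = " ".join(f.read().split())
        except OSError:
            # Some entries may not be readable without root; skip them.
            continue
    return settings


def format_sysctls(settings):
    """Render the settings in a "net.ipv4.name = value" style."""
    return "\n".join(f"net.ipv4.{name} = {value}"
                     for name, value in sorted(settings.items()))
```

Running print(format_sysctls(read_tcp_sysctls())) on each node and diffing the output is one way to confirm that all nodes are tuned identically before and after a change.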