I have some C code with a "main" that starts 2 threads. The main routine sets up a socket to another box, connects, and waits for a message. One thread is a "heartbeat" thread: it sleeps and wakes every 30 seconds to send a message. The second thread is started by an incoming message to hash some files, and then sends the response to the connected box. A mutex is initialized in the main routine, and both threads lock it before calling "send".
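To make the setup concrete, here is a minimal sketch of the locking pattern described above. It is not the actual code: the names `send_lock` and `send_all_locked` are hypothetical, and the loop on partial sends is an assumption about how the large message is written.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical name: the one mutex shared by the heartbeat and hash threads. */
static pthread_mutex_t send_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical helper: send the whole buffer while holding the mutex,
 * looping because send() may accept fewer bytes than requested.
 * Returns 0 on success, -1 on error (errno is set by send()).
 */
static int send_all_locked(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t left = len;
    int rc = 0;

    assert(buf != NULL || len == 0);

    pthread_mutex_lock(&send_lock);
    while (left > 0) {
        ssize_t n = send(fd, p, left, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            rc = -1;            /* real error, give up */
            break;
        }
        p += n;
        left -= (size_t)n;
    }
    pthread_mutex_unlock(&send_lock);
    return rc;
}
```

With this pattern, the heartbeat thread waits on the mutex for as long as the hash thread is blocked inside send() on the large message.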

The problem is that sometimes the socket gets "hung" and the data doesn't get sent; eventually the connection times out and closes. For reference, the socket hangs on the hash message, not the heartbeat. The heartbeat message is 8 bytes, whereas the hash message is approximately 3,000,000 bytes.

A netstat at the time of the "hang" shows data in the Send-Q for the connection between the 2 boxes, but a tcpdump shows no traffic going between them.

Any advice on what to look for or look at? I'm not sure what to try next. I have tried various compile options and tried stopping the heartbeat thread before sending the hash message, but nothing changes the problem. I should note that this failure happens ALL the time on one box, while the same code works fine on other boxes. It appears to be timing related, but I can't say exactly what the timing or condition is.

The failure is being seen on RedHat 8.0 w/ 2.4-18-14, but the same code also works on other boxes running this kernel. In addition, the same code is running on RedHat 7.2 w/ 2.4-20 and on Solaris boxes.

Any help is greatly appreciated! THANKS!