occasional socket select() hang when using zero timeout
We've seen a recent post about a socket select() hang for 10 minutes using a non-zero timeout. However, we are experiencing an occasional socket select() hang for approximately 71 minutes when using a zero timeout on a non-blocking UDP socket.
We're using multi-CPU, multi-core servers from Aberdeen, which are basically repackaged Supermicro servers. We're running CentOS v5.2:
2.6.18-92.el5 #1 SMP Tue Jun 10 18:49:47 EDT 2008 i686 i686 i386 GNU/Linux
rpm -qa kernel\* | sort
Occasionally, when a non-blocking UDP socket is polled using the select() system call with a zeroed timeval structure, we note that the select() stalls for approximately 71 minutes. We wish to respond quickly when packets appear spontaneously on this socket, but the opposite socket very, very rarely spontaneously transmits a packet. It is common for no packet to be spontaneously transmitted to this socket for many hours.
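For reference, the polling pattern described above looks roughly like the following. This is a minimal sketch, not our production code: the function names open_nonblocking_udp() and poll_once() are made up for illustration, and binding to the loopback address on an ephemeral port is an assumption made only so the example is self-contained.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: create a non-blocking UDP socket bound to
 * 127.0.0.1 on an ephemeral port. Returns the descriptor, or -1. */
int open_nonblocking_udp(void)
{
    struct sockaddr_in addr;
    int flags;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;

    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                      /* let the kernel pick a port */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: one zero-timeout poll.
 * Returns 1 if a datagram is waiting, 0 if not, -1 on error. */
int poll_once(int fd)
{
    fd_set readfds;
    struct timeval tv;
    int rc;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    /* Re-zero the timeout before every call: on Linux, select() may
     * modify the timeval it is given. */
    tv.tv_sec = 0;
    tv.tv_usec = 0;

    rc = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}
```

With no traffic on the socket, poll_once() is expected to return 0 immediately. The stall we see is this call failing to return for roughly 71 minutes.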
We find it quite coincidental that 0xFFFFFFFF in usec resolution equals 71 minutes, 35 seconds. We hypothesized that the usec component of the zeroed timeval structure provided to select() is occasionally being decremented to 0xFFFFFFFF (or the equivalent in "jiffies") prior to the OS testing if it is equal to zero. Thus, we incur a 71 minute, 35 second timeout. An examination of the kernel source doesn't appear to support our hypothesis, but it is nonetheless quite a coincidence.
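The arithmetic behind that figure can be checked with a few lines of C. This is only an illustration of the coincidence, assuming a 32-bit unsigned microsecond counter that wraps when decremented from zero; it says nothing about what the kernel actually does. The names wrapped_zero() and split_usec() are made up for the example.

```c
#include <stdint.h>

struct duration {
    unsigned minutes;
    unsigned seconds;
    unsigned micros;
};

/* Value a 32-bit unsigned counter holds after being decremented from zero. */
uint32_t wrapped_zero(void)
{
    uint32_t usec = 0;
    usec -= 1;              /* unsigned arithmetic wraps modulo 2^32 */
    return usec;            /* 0xFFFFFFFF */
}

/* Split a microsecond count into whole minutes, seconds, and microseconds. */
struct duration split_usec(uint64_t usec)
{
    struct duration d;
    uint64_t total_sec = usec / 1000000u;

    d.micros  = (unsigned)(usec % 1000000u);
    d.minutes = (unsigned)(total_sec / 60u);
    d.seconds = (unsigned)(total_sec % 60u);
    return d;
}
```

Calling split_usec(wrapped_zero()) gives 71 minutes, 34 seconds, and 967295 microseconds, which rounds to the 71 minutes, 35 seconds quoted above.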
We call select() on this socket at quite a high rate (e.g. 500 Hz) and this problem might occur once or twice over 12 hours. It is apparently quite sensitive to precisely when the select() function is called in relation to other kernel activities.
We have searched the Red Hat bug list, the CentOS forum, the LinuxQuestions forum, and this site, and have not found any similar complaints about select() with a zeroed timeout. Has anyone else observed this behavior? Is there a remedy other than avoiding zero timeouts or putting a watchdog on threads that might perform zero-timeout select() calls? Our product also employs a library that may perform zero-timeout select() calls, so we'd prefer an OS-level solution. We didn't notice anything in the CentOS v5.3 release notes to indicate that such a problem has been recognized and addressed. Despite that, we intend to update one of our test resources to v5.3 and give that a try. It looks as if the select() implementation has been substantially rewritten since the 2.6.18 version.
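In case it helps anyone trying to reproduce this, one way to catch the stall is to time each zero-timeout select() call against the monotonic clock and log any call that takes longer than expected. The sketch below is a diagnostic only, not a remedy: it reports the stall after select() finally returns, so a real watchdog would still need a separate thread. The name timed_zero_select() is made up for the example, and older glibc versions may need -lrt at link time for clock_gettime().

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/select.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: perform one zero-timeout select() for readability
 * on fd and store the wall time the call took, in seconds, in
 * *elapsed_sec. Returns the select() result unchanged. */
int timed_zero_select(int fd, double *elapsed_sec)
{
    fd_set readfds;
    struct timeval tv;
    struct timespec before, after;
    int rc;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);
    tv.tv_sec = 0;
    tv.tv_usec = 0;

    clock_gettime(CLOCK_MONOTONIC, &before);
    rc = select(fd + 1, &readfds, NULL, NULL, &tv);
    clock_gettime(CLOCK_MONOTONIC, &after);

    *elapsed_sec = (double)(after.tv_sec - before.tv_sec)
                 + (double)(after.tv_nsec - before.tv_nsec) / 1e9;
    return rc;
}
```

A caller polling at 500 Hz could flag any call whose elapsed time exceeds, say, one second, and record the timestamp so it can be correlated with other activity on the machine.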
We don't have a good feel for whether this problem is due to a unique interaction between CentOS v5.2 and our particular Aberdeen server hardware. If it isn't peculiar to our hardware, we'd have thought there would already be plenty of posts about this issue online. On the other hand, despite the vast number of Linux installations, we suppose it's possible for a problem such as this to go unnoticed for an extended period of time. It manifests very infrequently given the number of opportunities, and one would only notice it if the socket being polled using select() with a zeroed timeout only very, very rarely receives packet traffic. Otherwise, the select() would return due to the reception of that traffic.