I've discovered why a couple of machines keep dying. They are pure processing machines running a small custom C program. The program does some work and writes some result files. A script on the Windows machine that sent the work to this Linux server polls once a second to check whether the result file exists, so it knows when the job has finished.

It turns out that Samba fails to find the file, but its memory usage grows with each request and is never released, so it eventually consumes all the memory on the system. Once memory is exhausted the box stops responding: you can't ssh in, log in locally, or do anything else.

The log shows that the kernel realises it's out of memory and starts killing processes such as ssh to try to get memory back. I'm not sure how it chooses what to kill; perhaps the process taking up the most memory? Can anyone confirm that?

What I really want to know is how to stop Linux from being DoSed when it runs out of memory, and how to constrain programs so they can't exhaust memory in the first place. ulimit springs to mind, but I've never used it.

Can any veterans out there help me out?