Random CentOS Crashing
Firstly, apologies if I've posted this in the wrong place; I really don't know what the root cause of the issue is.
Bear with me, because this is a tough one to explain.
Basically, we have 7 Dell PowerEdge quad-cores (used as game servers); some are on CentOS 4.4 and others on CentOS 5 (all running the latest kernels). Completely randomly, they have started to "crash". When I say crash, I mean this: we cannot SSH to them (connection refused), everything that was running on them stops, but they still ping. We've tried hooking them all up to KVMs, and when the problem occurs the screen fills with weird text (I'll try and get a copy of it when it happens again). The only way to resolve the issue is to physically reboot the machine.
I tried upgrading some of the machines to CentOS 5, but the problem still occurs. It is an absolute nightmare. We don't know if it's some form of malicious attack, and there's nothing out of the ordinary (like spikes) on our network graphs. We've got iptables running with really strict rules (I've tried disabling it, but as usual the problem still occurs). It could be an exploit in the kernel, or in the game servers we run, that's causing the machines to crash; I really don't know. There's nothing in the logs at all, nor anything that shows it's being directly caused by a user/attacker. Whatever it is, it's causing us a huge amount of grief, because whatever we do to try and fix it doesn't work.
We've also been looking at common factors, e.g. the machines all have the same onboard NIC. Could someone be using an exploit in the NIC drivers to crash our machines?
Any assistance is greatly appreciated!
This is quite interesting... First of all, I would rule out a software update that you may have done and forgotten about. Provided that's not the case, the next thing I would check is the network.
You don't say how these machines are interconnected or how they reach the outside world. Please post some information about this, because speculating is very difficult without knowing how the network is set up.
Going now to the scenario of a malicious attack these are some facts:
1. Not seeing suspicious logs does not mean that a system is not compromised: a cracker who knows what they are doing would never leave traces behind. Logs and timestamps can easily be changed.
2. Do you know about rootkits? A rootkit is a piece of software that attaches to the kernel and hides processes and files (actually, it can hide pretty much everything). If you had a rootkit installed on your machines, the attack could be really stealthy! There are rootkit detectors like rkhunter and chkrootkit.
So, if you suspect an attack, take one of the machines offline and examine it. Maybe install a rootkit detector and see what it comes up with...
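As a minimal sketch, a wrapper like this (my own, not part of either tool) runs whichever of those scanners is installed; `--check --skip-keypress` is rkhunter's documented non-interactive invocation, but do verify the flags against the versions you install:

```shell
#!/bin/sh
# Sketch: run whichever rootkit detectors are installed on this box.
# Run as root on the suspect machine so the scanners can read everything.
run_rootkit_scans() {
    found=0
    if command -v rkhunter >/dev/null 2>&1; then
        # --check: full scan; --skip-keypress: don't pause between tests
        rkhunter --check --skip-keypress
        found=1
    fi
    if command -v chkrootkit >/dev/null 2>&1; then
        chkrootkit
        found=1
    fi
    [ "$found" -eq 1 ] || echo "no rootkit scanner installed"
}

# run_rootkit_scans    # uncomment to run on the suspect box
```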
There are many things to speculate about when it comes to security, you need to think out of the box...
And one more thing... Are your servers headless? If not, when the system "crashes", can you use each box at the console? Can you log in?
Try to eliminate possibilities: check the networking side first and post your findings. Maybe someone on here will pick up on something that you have not noticed. Then move on to more drastic methods...
I hope this helps. If you have more information, post it; I find this interesting and would like to help you with it.
Hi Nautilus, thanks very much for the reply.
I've run rkhunter on all the machines; all are clean.
The machines are in different racks, connected to different switches, which are in turn connected to our core switch. We have a primary transit feed from a large provider whose office is just down the corridor from ours; I don't think the issue lies with them.
All machines had been last updated several weeks ago via yum.
Will add more info when I think of it.
When a machine crashes we can get in through our KVM, yes. But obviously not through SSH.
It's just happened again; this is what I saw:
When I got in through the KVM, none of the services were running.
service sshd status
"sshd dead but pid file exists"
service vsftpd status
"vsftpd dead but subsys locked"
Just restarting all the services fixed the problem. I have a feeling it won't be long until the machine crashes again, though. This is so bizarre. Nothing has changed over the past few days that would cause all these crashes.
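For what it's worth, the "dead but pid file exists" state can be detected mechanically. A rough sketch (the function name is mine, not from any init script) that checks whether the process recorded in a pid file is still alive:

```shell
#!/bin/sh
# Sketch: report whether the process named in a pid file is alive.
# "dead but pid file exists" is exactly the state the init scripts report.
# Run as root (or the service owner), since kill -0 on another user's
# process fails with EPERM and would look like a dead process.
check_pidfile() {
    pidfile=$1
    if [ ! -f "$pidfile" ]; then
        echo "no pid file"
        return 1
    fi
    pid=$(cat "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        echo "running (pid $pid)"
    else
        echo "dead but pid file exists (stale pid $pid)"
        return 1
    fi
}

# e.g. check_pidfile /var/run/sshd.pid
```

Cron-running something like this would at least tell you *when* the services die, which narrows the window you have to search in the logs.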
Someone must be very clever in their malicious attack OR there must be an exploit in one of the games that we host. I can't put it down to anything else.
You can search for the latest exploits there.
Also, you might want to run a vulnerability assessment with Nessus:
Just bear in mind that the report will only cover exploits that were disclosed at least 7 days ago, not more recent ones; otherwise you need to buy a license to get all the existing exploit checks.
When this nightmare is over, maybe you should consider using an IDS or IPS (intrusion detection/prevention system).
Do all the machines go down at the same time?
Right, okay. And no, they go down individually, at random times.
Okay, we think we've figured this one out.
It seems our primary transit provider is having serious problems with their network, whereby packets are becoming corrupted en route to our equipment. When these dodgy packets reach our machines, they crash the NIC and in turn cause a kernel panic. The machine then either reboots (and doesn't do it properly) or doesn't reboot at all, leaving services like sshd in a dodgy state, even though the machine is still pingable.
This should also explain why we have never had the problem occur twice on the same machine within a matter of minutes; it takes a while (sometimes hours) for the game servers we host on the machine to become popular/busy again, thus increasing the network traffic and the chance of one of these corrupt packets reaching the machine.
We tried keeping the machines on but with all the processes stopped, and we found that they did not crash. So unless I've got this totally wrong, it seems we may well have found the cause of the problem!
Am I making sense? Or does everything coincidentally piece together for the wrong reason...?
What you are saying does make sense, but it is just a scenario like the others until you prove it... First of all, I believe a kernel panic should be logged in /var/log/messages (or /var/log/syslog on some distributions). Is there any log like that? Does it look like the panic comes from the network driver?
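As a quick sketch of that log check (the grep patterns are my guesses at typical kernel panic/NIC-driver wording, not strings known to be on these boxes):

```shell
#!/bin/sh
# Sketch: scan a syslog-format file for kernel panic / oops / NIC
# driver messages. The patterns are guesses at common kernel wording;
# adjust the interface name to match your hardware.
scan_log() {
    logfile=$1
    grep -Ei 'kernel panic|oops|eth[0-9]+.*(error|timeout|hang|reset)' "$logfile"
}

# e.g. scan_log /var/log/messages
```

If a hard panic never makes it to disk, there will be nothing to find here, which is itself useful information.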
One more thing that would give you proof: run a packet sniffer like tcpdump or Wireshark and capture these malformed packets. Then you can go ahead and report it to your transit provider.
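A capture along those lines might look like this; the interface name and packet count are assumptions for illustration. tcpdump's verbose decode flags checksum failures (e.g. "bad cksum"), but note that hardware checksum offload on the sending NIC can make locally generated packets look corrupt in a capture, so capturing on a separate box or a mirrored switch port is more trustworthy:

```shell
#!/bin/sh
# Sketch: capture full packets so checksum failures show up in the
# verbose decode, and save the capture as evidence for the provider.
# Interface name and packet count are assumptions for illustration.
capture_sample() {
    iface=${1:-eth0}
    if ! command -v tcpdump >/dev/null 2>&1; then
        echo "tcpdump not available"
        return 1
    fi
    # -s 0: full snap length; -vv: verbose decode (prints bad checksums);
    # -c 1000: stop after 1000 packets; -w: write a pcap file
    tcpdump -i "$iface" -s 0 -vv -c 1000 -w corrupt-sample.pcap
}

# e.g. capture_sample eth0    (run as root)
```

The saved pcap can then be opened in Wireshark or handed directly to the transit provider as evidence.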
This is what I would do. If you get proof by some other means, please let us know...