- 01-20-2012 #1
- Join Date
- Jan 2012
application segfault causing system crash?
RHEL AS 6.0, unknown patch level.
We are application admins (I am a former Linux admin) who, due to outsourcing, no longer have root access to our servers. Recently we had a server hang (or crash; the offshore Linux admins cannot agree on which it was). It seems to have coincided with a ping failure, which in turn seems to have coincided with a segfault from the application.
Ok, so a couple of things about that:
My understanding is that ICMP echo is handled at the adapter level for modern network adapters, so the absence of a ping response indicates a network failure or a NIC failure, not a system hang or crash. Feel free to disagree.
The application log files are on an NFS share serviced by the previously mentioned NIC. (I know this is not good practice and I wouldn't have done it this way, but I did not have a say in the matter.) It seems reasonable to get a segfault as a result of the application having a file open when the remotely mounted file system went away due to NIC failure. Again, feel free to disagree.
At this point, things are looking bad for the NIC. But the Linux admins sent the system log files to tech support of the server manufacturer (to whom we do not have direct access), who immediately pointed at the segfault as evidence that the application caused the crash.
An application segfault caused a system crash? I'm not sure I believe this. I would have dug into this a little more but am not allowed. And so, I ask the opinion of the Linux community. Sorry the information is so skimpy. Email discussions with the admins tend to follow a 24 hour cycle, which makes it time consuming to have a conversation, and they are either reluctant or unable to share enough information to tell what was going on.
Upon re-reading this, it sounds like a thinly disguised rant on outsourcing, and I did not mean it that way. It was more for background on why we have so little information to work with. This is a production environment, has crashed many times in the last two weeks, and we're getting desperate. Yes, we're working the issue through the app tech support also.
Any insight would be appreciated.
- 01-20-2012 #2
- Join Date
- Jan 2012
We've analyzed what log files we've been given access to, and have compiled the following order of incidents based on time stamps:
1) Filesystem clustering software on system 1 reports system 2 down.
2) Ping fails on system 2
3) We are alerted that the application is down on system 2
3a) Some confusion is caused by offshore admins not understanding the nature of load balanced web applications; they insist that the application is up, we insist that one of two clustered servers is down. This takes some time to resolve.
4) Report from offshore that NIC is down (occurs during effort to resolve 3a)
5) System is rebooted, but application fails to restart due to segfault (this is the segfault that offshore admins insist started the problem, but clearly it happened late in the sequence)
6) Offshore admins later report NIC is up (but are unable to tell us why or what happened), cluster reports both nodes up, application starts successfully.
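An ordering like the one above can be rebuilt mechanically when the entries come from several sources. The following is only a minimal Python 3 sketch: the timestamps, source names, and message wording in `EVENTS` are invented placeholders, not entries from the actual logs, and `build_timeline` and `happened_before` are hypothetical helper names.

```python
from datetime import datetime

# Placeholder events for illustration only. These timestamps and messages
# are invented; they are NOT taken from the real log files.
EVENTS = [
    ("system2 syslog", "2000-01-01 00:40:00", "application segfault on restart"),
    ("system1 cluster log", "2000-01-01 00:00:00", "node system2 reported down"),
    ("monitoring", "2000-01-01 00:00:10", "ping to system2 failed"),
    ("admin email", "2000-01-01 00:20:00", "NIC reported down"),
]


def build_timeline(events):
    """Parse each timestamp and return the events sorted oldest first."""
    parsed = [
        (datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), source, message)
        for source, stamp, message in events
    ]
    return sorted(parsed, key=lambda event: event[0])


def happened_before(timeline, first, second):
    """True if the first message containing `first` precedes the first
    message containing `second` in the sorted timeline."""
    position = {}
    for index, (_, _, message) in enumerate(timeline):
        for key in (first, second):
            if key in message and key not in position:
                position[key] = index
    return (
        first in position
        and second in position
        and position[first] < position[second]
    )


if __name__ == "__main__":
    timeline = build_timeline(EVENTS)
    for when, source, message in timeline:
        print(when.isoformat(sep=" "), "[" + source + "]", message)
    print("ping failed before segfault:",
          happened_before(timeline, "ping", "segfault"))
```

This only works if the clocks on the systems involved are reasonably in sync; otherwise timestamps from different hosts cannot be compared directly.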
It looks to me like everything is pointing at the NIC or (less likely) a flaky switch port or (even less likely) a problem in that motherboard slot. We feel that there's enough evidence and enough recurrence (this is the most recent of five or six incidents in the last two weeks) to warrant swapping the NIC. Is there anything I've overlooked?
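Some NIC evidence can usually be gathered without root, because link state and error counters are exposed as world-readable files under `/sys/class/net/<interface>/`. The sketch below is Python 3 and reads a few of them. The interface name `eth0` and the helper names `read_sysfs` and `nic_snapshot` are assumptions for illustration, and whether every file is present depends on the kernel and driver.

```python
import os


def read_sysfs(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        # Missing file, permission problem, or the kernel refusing the read
        # (for example, `carrier` on an administratively down interface).
        return None


def nic_snapshot(iface, base="/sys/class/net"):
    """Collect link state and error/drop counters for one interface."""
    root = os.path.join(base, iface)
    snapshot = {
        "operstate": read_sysfs(os.path.join(root, "operstate")),
        "carrier": read_sysfs(os.path.join(root, "carrier")),
    }
    for counter in ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped"):
        value = read_sysfs(os.path.join(root, "statistics", counter))
        # Keep None when the counter is absent or not a plain number.
        snapshot[counter] = int(value) if value and value.isdigit() else None
    return snapshot


if __name__ == "__main__":
    # "eth0" is an assumed interface name; substitute the real one.
    print(nic_snapshot("eth0"))
```

The counters start over after a reboot, so a single reading taken after the system comes back says little. Recording a snapshot periodically to local disk (not the NFS share) would show whether errors climb before the next incident.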
- 01-20-2012 #3
- Join Date
- Apr 2009
- I can be found either 40 miles west of Chicago, or in a galaxy far, far away.
Swap out the entire system - don't play "Whack-a-Mole" with production hardware. The savings from replacing one part at a time do not justify the risk. I.e., the lost productivity if you guess wrong will cost a lot more than just swapping out the hardware. Then properly fix the failing gear so you have a backup available in case system 1 goes down in the future.
Sometimes, real fast is almost as good as real time.
Just remember, Semper Gumbi - always be flexible!