Thread: server shutting down ????
Suse 10 - Asus / AMD
Linux version 2.6.13-15.12-default (geeko@buildhost) (gcc version 4.0.2 20050901 (prerelease) (SUSE Linux)) #1 Thu Aug 24 11:23:58 UTC 2006
It locks up at random times. It is powered on but unresponsive at the tty, via ssh, etc.
Usually it happens over the weekend: I come in on Monday morning and it is locked up. Maybe it gets lonely.
I have so far :
blew out the system
reseated the cpu and mem sticks
new heat sink grease ( bios did not show it running very hot right after reboot )
a memory test - walking bits +++
I thought the power cord seemed odd, smaller than most, and it did flicker when I just touched it, so I replaced it
In the warn logs I have this entry several times prior to last entry:
Normal free:6772kB min:3756kB low:4692kB high:5632kB active:543436kB inactive:262292kB present:901120kB pages_scanned:17031822 all_unreclaimable? yes
I do not quite follow it.
there is nothing in the faillog
Any ideas where I should start looking?
After I reboot, it is up and running with no problems and can run 7 to 16 days with no errors.
There are approximately 150 users in the network with 20 ip based cameras.
I only have the one internal dns server.
At first pass - I think you may have a memory leak somewhere. That is what the final report is talking about.
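Free memory in the Normal zone has gone down to about 6.6 MB (free:6772kB), which is only about 3 MB above the min watermark (min:3756kB), and the kernel says it could not reclaim any more (all_unreclaimable? yes). That points to a lot of memory (on the heap) not being released or freed by the acquiring processes.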
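To make that log line easier to follow, here is a minimal Python sketch that splits it into its fields and works out how far "free" sits above the "min" watermark. The function name `parse_zone_line` is just something I made up for illustration, and the only input is the line quoted from the warn log above.

```python
import re

# The warn-log line quoted earlier in the thread, copied verbatim.
LOG_LINE = (
    "Normal free:6772kB min:3756kB low:4692kB high:5632kB "
    "active:543436kB inactive:262292kB present:901120kB "
    "pages_scanned:17031822 all_unreclaimable? yes"
)

def parse_zone_line(line):
    """Return (zone_name, {field: size_in_kB}) for one zone report line.

    Hypothetical helper: only the fields that carry a kB suffix are kept.
    """
    zone = line.split()[0]
    sizes = {key: int(value) for key, value in re.findall(r"(\w+):(\d+)kB", line)}
    return zone, sizes

zone, sizes = parse_zone_line(LOG_LINE)
headroom_kb = sizes["free"] - sizes["min"]
print(f"{zone} zone: free {sizes['free']} kB, min {sizes['min']} kB, "
      f"headroom {headroom_kb} kB")
```

If the headroom keeps shrinking between reboots, that would fit the leak theory; if it only shows up right before a lockup, something may be grabbing memory suddenly.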
Have you enabled core dumps with the bash ulimit command? Alternatively, if the system is still somewhat running but not really responding, have you tried the "SysRq key" functions to capture data on what the issue(s) may be? See /usr/src/linux/Documentation/sysrq.txt for the details. It is very possible that SysRq will respond even though you don't see any visible response on the screen. You can usually still use the keyboard if toggling Caps Lock makes the LED change.
If toggling Caps Lock doesn't toggle the LED, you've got a complete lockup, which may be simply because the system sees no memory left to use.
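As a rough way to check whether core dumps are currently allowed, the sketch below reads the core file size limit through Python's standard `resource` module (Unix only). It reports the limit for the process that runs it, so it only tells you about daemons started from the same environment; the helper names are my own.

```python
import resource

def core_dump_limit():
    """Return the (soft, hard) core file size limits for this process."""
    return resource.getrlimit(resource.RLIMIT_CORE)

def describe(limit):
    """Render a limit value from getrlimit() in readable form."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit} bytes"

soft, hard = core_dump_limit()
# A soft limit of 0 means no core file is written when a process crashes.
print("core soft limit:", describe(soft))
print("core hard limit:", describe(hard))
```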
A very quick fix, though not one you want to leave in place, that may help you catch the offending application is to increase the swap space (x10, say). Generally you just add another swap partition or, if that is not doable, a swap file. Better still, add another drive and use it for additional swap space. Modify /etc/init.d/boot.local to add this drive to the available swap space by adding the command "swapon /dev/diskdrivename".
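To see whether the extra swap is actually being eaten over time, something like this sketch could be run now and then. It reads the SwapTotal and SwapFree fields from /proc/meminfo; the function names are hypothetical, and it prints zeros on a system without /proc/meminfo.

```python
def parse_meminfo(text):
    """Turn the text of /proc/meminfo into a {field: value} dict."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])  # sizes are reported in kB
    return info

def swap_used_kb(info):
    """Swap in use, in kB, or 0 if the fields are missing."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as fh:
            info = parse_meminfo(fh.read())
    except OSError:
        info = {}  # not a Linux system, or /proc is not mounted
    print("swap total:", info.get("SwapTotal", 0), "kB")
    print("swap used: ", swap_used_kb(info), "kB")
```

Swap use that climbs steadily from one day to the next, with nobody on the machine, would back up the leak idea.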
Have you updated the system with all the security and application fixes on a regular basis ? If not, why not?
I especially think you should update to the latest SuSE 10.0 kernel. Also you may not be aware that OpenSUSE 10.0 release has many problems that were only finally fixed in OpenSUSE 10.3 IMHO.
Are you able to see which processes dominate what's running over the weekend with no one around? For example, leave a console running "top" with the applicable columns like "WaitChan & Flags" showing, as this is a great tool for determining which process is doing what, or which processes are waiting on an event that maybe never comes.
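If leaving "top" on a console is not practical, a small script run from cron could log the biggest memory users instead. This is only a sketch: it reads the Name and VmRSS lines from /proc/<pid>/status, the function names are made up, and you would still need to redirect its output to a log file yourself.

```python
import os

def rss_kb(status_text):
    """Extract VmRSS in kB from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS line

def name_of(status_text):
    """Extract the process name from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            parts = line.split(None, 1)
            return parts[1].strip() if len(parts) > 1 else "?"
    return "?"

def snapshot(proc_dir="/proc"):
    """Return [(rss_kb, pid, name), ...] sorted with the largest first."""
    rows = []
    if not os.path.isdir(proc_dir):
        return rows
    for entry in os.listdir(proc_dir):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_dir, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # the process exited while we were scanning
        rss = rss_kb(text)
        if rss is not None:
            rows.append((rss, int(entry), name_of(text)))
    return sorted(rows, reverse=True)

if __name__ == "__main__":
    for rss, pid, name in snapshot()[:5]:
        print(f"{rss:>10} kB  pid {pid:<6} {name}")
```

Comparing Friday's output with Monday's should show whether one process keeps growing.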
I assume the system HAS a UPS which has been serviced regularly and whose batteries are exchanged every two years (min)?
Is your system secure? If so, how do you come to that conclusion?
What have you done, or what has been done, to ensure system security? Even if the system is only used in-house, has Yast Security got the internal net under protection as well? If not, why not? Any user can, and usually does, introduce crazy stuff via web sites, floppy/USB sticks etc., so be aware of this.
Have you got smartctl running on your drives? Maybe the disk drive is developing errors as it gets older, which only show over time. It is certainly worth a look with this tool.
Maybe the camera image collector or its controlling program is leaking, as it's probably the only thing running 24/7 (i.e. the application is not releasing used memory or heap space over time). If you're running Zoneminder, that is one program I have seen develop such leakage, as I have seven cameras watching possums at night and birdlife during the day. You could run a "ddd" session on this application, or run it via strace to capture a hanging system call.
Otherwise, you will have to compose a list of applicable issues, starting with the hardware (things like voltages, temperatures, disk activity logs, and printing the console output directly to another PC so as to log all activity over the w/end), then the software processes as somewhat outlined above. Investigate each and every one, one step at a time.
Hope this assists you.
Thanks for the reply, it does help a lot.
I have not been able to keep up with the servers as I would like. I am a teacher and this is a side thing that has grown greatly. Enough excuses though.
I would like to pull it off line and bring it up to 10.3, redo the services on it. I may get the chance yet or it may force me to it.
It was a wonder to me because this box has been in service for so long with little changes, uptimes of 150+ days. This seems 'out of the blue' of sorts.
It is on a 1 year old UPS. I do take it down and blow it out, check fans, filters and such, do a mem test. It is in a room with an AC unit to keep the room cool.
I had been trying to determine if this was a hardware issue or software. The one log entry was the only weird event, besides locking up, I could find so far.
You have mentioned tests I never knew so I will begin there.
Thanks again for the help.
You probably do not need to upgrade the complete distro if it is doing the job and you're happy with it. Linux is not like Windoze, where you have to keep it all up to date otherwise applications just don't work anymore.
Unless there is a need to upgrade, you needn't.
Just connect a good fast internet line to the ethernet port, run YAST - Online Update and everything will be done (OS wise) automagically for you. You will have to look at each application's home web site to determine if those applications need upgrading. YAST can probably help you out on this too. I run 10.3 and haven't touched 10.0 since it was released, so sadly I cannot add much to assist on the application side.
*** By the Way ****
I had one client @ Sydney Uni - Chem Lab that had a 12 year old server still running on a really old Slackware distro (Kernel was 1.2.8 I recall). But because it was only an internal machine with a secure router/firewall in front of it, it just served the students for all those years without any problems (until they turned off the power for only the second time in ten years, whereupon the PSU blew on the next powerup, the fans stopped working and the disk drive crashed). Linux wise, it still worked wonderfully. It even took them two days to find it, as it was working away under the stairs in a small, not-too-easy-to-get-to cupboard.
If you have further questions just ask.
School is out, I have more time than I did and I see a slip in my server logs. I had added a cron job for my NTP server on this machine.
*/15 * * * * root ntpd -s ntpdate -s time.nist.gov
That has been recent enough to be in the area of shut downs.
I am doing the patches on the machine also, I have several to get.
I am going to remove the cron job for a time ( no pun ) to see if it has anything to do with it.
I have read several articles. Do you have a good idea how often you 'need' to update the time server? It seemed most thought it had to be often to prevent drift. I may have done myself in here.