- Join Date
- Aug 2007
Troubleshooting server load/Learn more about a running process
I have two general questions about server load. I'm still relatively new and self-taught in this area. The other day one of our servers (running Solaris) ran into some problems and I had a heck of a time troubleshooting what was going on. I eventually discovered the cause of the problem, but I ended up getting lucky rather than actually taking logical steps to track down the problem and solve it.
I ran top and saw that the server load was around 3. I believe that tells me how many processes are waiting in the queue. I've run top since that time and see this number stays about the same, so I believe this is normal for this particular server.
I also saw that there was a Perl process running in the list of processes within top. I took a wild stab and killed this process. While the load average did not change, users reported that lag decreased and other services were running efficiently again. This is where I have two questions:
First, I'm not sure I fully understand what the CPU % value next to a process means? I saw that perl process was showing 98-100% and had a time of 20:00. Does the CPU% value reflect how much of the processor is being used by this process? I can't imagine it does, because if you take into account other processes, the totals would add up to over 100%.
Second, while top shows me there is a perl process running, I had trouble finding more detail about what actual perl process was running. What script was running? Was it a simple command run at the command line; was it one of our scripts or had the system been hacked; had one of our users tested a perl script that was caught in a loop? I suppose my question here is, how does one go about tracking down a process like this -- to see it's not only a perl process, but the actual command/script that is running? One of our users later told me he was running a "search and replace" perl script that was changing millions of lines in a file... so I found out what was going on, but don't know if I would have if he hadn't come forward.
hi and welcome culley
A few words about load first:
It shows how many processes are using the cpu or waiting to be executed.
But you need to see that in context of your hardware.
A load of 20 for a single core CPU is bad.
Load 20 on a 24 core system is money well spent.
That is, if the load is spread evenly among the cores. If somehow only one of the cores does all the work, then there is need for investigation.
Call top again and press "1".
At least on Linux, you will see the load of each core.
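To put the load number in the context of the hardware, as described above, one rough approach is to divide it by the number of cores. This is only a sketch: the function name load_per_core is made up, and the commented usage line assumes a Linux box with /proc/loadavg and nproc.

```shell
# load_per_core LOAD CORES: print the load average divided by the core count.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Linux example (assumes /proc/loadavg and nproc are available):
# load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

With the numbers from above, load_per_core 20 1 gives 20.00 (bad), while load_per_core 20 24 gives 0.83 (fine, as long as the work is spread across the cores).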
Now for your example.
You have seen it high in the top ranks.
Most probably you know what a pid is, as you have killed the perl program.
So to do some investigation, you can also issue:
ps auwwwx -H
This will reveal how it was called.
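If you already have the pid from top, a narrower query can be easier to read than the full process tree. A minimal sketch, assuming your ps accepts the POSIX-style -o and -p options; the function name show_proc and the pid in the usage line are made up.

```shell
# show_proc PID: print owner, CPU share, accumulated CPU time and the
# command line (including the script name) for a single process.
show_proc() {
    ps -o pid,user,pcpu,time,args -p "$1"
}

# Hypothetical usage, with a pid taken from top:
# show_proc 12345
```

On Solaris, ps may cut the argument list short; as far as I know, pargs <PID> is the tool to get the full one there.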
Does it listen to ports?
Judging from the previous command, look at the mountpoint where the open files are:
iostat <DEVICE> 1
If the device is busy, that means the perl program is not waiting for the CPU so much as for the hard disk.
Now you would have enough data to go to the user and tell him his program is too heavy right now, as it disturbs other users.
a) does it have to be so heavy?
b) does it need to run now? One could think of a cronjob in the night.
c) does it have to run on *that* machine? Maybe a dedicated machine is appropriate.
As the sysadmin, you could also think of using limits.
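One possible form of such a limit is a cap on CPU time, set with the shell's ulimit builtin. This is a sketch under assumptions: it needs a shell whose ulimit supports -t (bash, for example), and the function name run_capped, the 600 second figure and the script name are all made up.

```shell
# run_capped SECONDS COMMAND [ARGS...]: run a command with a CPU-time limit.
# The function body is a subshell, so the limit does not stick to the
# calling shell.
run_capped() (
    secs=$1
    shift
    ulimit -t "$secs"
    "$@"
)

# Hypothetical usage: allow the job at most 600 seconds of CPU time.
# run_capped 600 perl search_and_replace.pl bigfile.txt
```

Once the job has used up its CPU seconds, the kernel signals it, which by default ends the process. Note this limits CPU time only; a job that mostly waits for the disk will not be stopped by it.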
For a general overview of what is going on, you can also use vmstat.
Should the machine swap heavily (again hardware dependent, but constant swapping at several MByte/s is bad), then action is needed.
Call top and press capital M to sort processes by memory consumption.
Kill the ones with high numbers.
io (bi and bo) gives a summary of blocks in and blocks out, i.e. reads and writes.
If the hard disks are busy, the system will appear slow to users.
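To spot swapping without staring at the numbers, the output can be filtered. A rough sketch, assuming the Linux (procps) vmstat, which prints si and so swap columns next to bi and bo; the function name swap_active is made up. On systems where the columns are named differently (Solaris among them) it will simply print nothing.

```shell
# Read vmstat output on stdin and print the samples that show swap traffic.
# Column positions are looked up from the header line, since layouts vary.
swap_active() {
    awk '
        # Until the header is seen: look for the si and so column names.
        !si {
            for (i = 1; i <= NF; i++) {
                if ($i == "si") si = i
                if ($i == "so") so = i
            }
            next
        }
        # Data line: report it if anything was swapped in or out.
        ($si + $so) > 0 { print "swapping: si=" $si " so=" $so }
    '
}

# Hypothetical usage: ten samples, one second apart.
# vmstat 1 10 | swap_active
```

Keep in mind that the first sample vmstat prints is an average since boot, so a single hit on the first line says little about what is happening right now.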
Last edited by Irithori; 04-02-2010 at 12:37 PM.
You must always face the curtain with a bow.
Many thanks! I did not know about vmstat or iostat. I also continue to forget to utilize lsof in this way.
And, thanks for the mention of limit. That was one thing I was going to try and do, set a limit on his process so that it wasn't given such a high priority. His script does not warrant such a high usage on the server.
Again, thanks very much for such a detailed and thought out reply!
You are welcome.
Of course, every case is a bit different, so learning tools is key.
Off the top of my head: tcpdump, iptraf, fuser, strace; then the usual tools to mangle the output: grep, awk, perl, tee, pee, tail, sort, cut, and hundreds more.