- 11-27-2011 #1 Just Joined!
- Join Date
- Mar 2008
- Posts
- 7
RHEL 3 and AMD Devil chips, I think, help me please
Hello People
I have an unusual problem that I have been looking at for ages. I think I am finally getting close to the cause, but my test to verify it has failed me and I am not sure I tested in the correct manner.
We have several servers (all of our really important DB infrastructure) running on various really old HP machines. All the servers that have the problem are dual-processor AMD Opteron 252 boxes. Only the RHEL 3 servers running on these chips have this problem; newer CPUs, and VMs on newer CPUs, have never had it. The problem is that these servers reboot at random intervals with absolutely no OS logging to explain why. The only indication that something fishy has gone on is in the iLO log, where you get the messages below:
Informational iLO 11/24/2011 11:03 11/24/2011 11:03 1 On-board clock set; was 11/23/2011 22:03:41.
Informational iLO 11/24/2011 10:59 11/24/2011 10:59 1 Server power restored.
Caution iLO 11/24/2011 10:59 11/24/2011 10:59 1 Server reset.
Informational iLO 11/23/2011 21:59 11/23/2011 21:59 1 On-board clock set; was 11/24/2011 10:59:48.
Informational iLO 11/18/2011 15:33 11/18/2011 15:33 1 Server power restored.
Caution iLO 11/18/2011 15:33 11/18/2011 15:33 1 Server reset.
As you can see, it appears to be switching between local time and UTC; local time is GMT+12.
We always see this entry when we get this type of unexplained reboot. The hardware clock (hwclock) is set to UTC, and we have hard-set that in the startup scripts.
We do not get anything like this on hardware that does not have those CPUs, but we get it on p-Class blades and DLs alike.
I have logged a call with HP, but they were not much help. While working on another problem on Windows, though, I came across the TSC drift issue on AMD CPUs.
I cannot post a URL, but google "AMD TSC drift Redhat" and the second result will lead you to what I am talking about.
Before I attempt to apply any of these recommendations, I have been tasked with replicating the problem in our DEV Oracle cluster first (which has never experienced this issue), on the same type of hardware and OS.
I compiled a C program that constantly calls gettimeofday and ran it half a dozen times, pinning it to a different CPU each time, but I still could not replicate the problem.
I have had many problems with the AMD quad- and six-core CPUs with Linux and databases, and I am sure that the AMD CPU factory is run by the devil. I also think that maybe I have not tested this as well as many of the much more gifted geeks out there would.
Has anybody got any ideas, or has anyone ever had a similar issue or dealt with the TSC time drift issue on RHEL 3?
2.4.21-27.ELsmp i686 athlon i386 GNU/Linux
- 11-28-2011 #2 Linux Guru
- Join Date
- Apr 2009
- Location
- I can be found either 40 miles west of Chicago, or in a galaxy far, far away.
- Posts
- 8,975
Well, this is a very out-of-date operating system, but I understand that legacy systems may still require running a 2.4 kernel (kind of like continuing to run Windows 3.1 today...) - sigh. If I read your post correctly, it seems to be related to clock drift, correct? If so, have you tried running NTP to keep the clocks synchronized to global time properly? The source can either be an Internet clock source, such as NIST, or a local server that is similarly sync'd.
Sometimes, real fast is almost as good as real time.
Just remember, Semper Gumbi - always be flexible!
- 11-28-2011 #3 Just Joined!
- Join Date
- Mar 2008
- Posts
- 7
Thanks Rubberman
No, this is not an NTP issue. Because of the iLO logs I believe it has something to do with time, but more likely due to a bug than to configuration. We have spent a lot of time looking at NTP and can rule out misconfiguration. OS logs show that the time is accurate right up to the point where they end suddenly. A 12-hour movement is a lot of drift too, I might add. If you have a look at this excerpt:
Informational iLO 11/24/2011 10:59 11/24/2011 10:59 1 Server power restored.
Caution iLO 11/24/2011 10:59 11/24/2011 10:59 1 Server reset.
Informational iLO 11/23/2011 21:59 11/23/2011 21:59 1 On-board clock set; was 11/24/2011 10:59:48.
This all happened at 11/24/2011 10:59. As you can see, the clock has jumped around 13 hours. I forgot about daylight saving, so local time is UTC+13.
Once it comes back up
Informational iLO 11/24/2011 11:03 11/24/2011 11:03 1 On-board clock set; was 11/23/2011 22:03:41.
The clock is reset. This is not time drift; it looks like a conflict between local time and system time.
This TSC time drift issue seems like a possibility worth investigating.
Yes, I know this is an old OS, but... the hardware is even uglier. We lost a whole bunch of these at an alternate datacenter due to a power surge, and HP could not provide replacement motherboards. They just shipped us "refurbished" (rub the motherboard on the carpet, then put it in a static bag) parts, and we ended up losing an entire 3-server system that still has not been replaced 2 years on. Making do with what we have is a fact of life here, unfortunately.
We have the same RHEL version on other servers, physical and virtual, but this issue only ever happens on servers running these chips, both in blades and in the pizza boxes.
- 11-29-2011 #4 Linux Guru
- Join Date
- Apr 2009
- Location
- I can be found either 40 miles west of Chicago, or in a galaxy far, far away.
- Posts
- 8,975
Ok. Had to ask...
It sounds like you are convinced it is a hardware flaw then, correct? Seems like it's time to replace the affected servers then. New hardware of the same capacity should not be a major investment I would think. In fact, it would probably cost less than the engineering time you have already invested in figuring out the problem root cause.