  1. #1
    Just Joined!
    Join Date
    Apr 2010
    Posts
    4

    RedHat Server power off every 2 weeks

    Hello community,

    I am having a very critical problem.

    I have 2 IBM X 3950 servers running RedHat AS 4 with Cluster Suite. I have not had a license for updates for a year.

    For some unknown reason, both servers shut down every two weeks on Thursday, and I haven't found a solution. The last entries in the messages log on both nodes are these:

    Apr 16 01:38:58 sjocsprodb2 kernel: tg3: eth0: Link is down.
    Apr 16 01:39:07 sjocsprodb2 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
    Apr 16 01:39:07 sjocsprodb2 kernel: tg3: eth0: Flow control is off for TX and off for RX.
    Apr 16 01:39:08 sjocsprodb2 kernel: tg3: eth0: Link is down.
    Apr 16 01:39:20 sjocsprodb2 kernel: CMAN: removing node sjocsprodb1 from the cluster : Missed too many heartbeats
    Apr 16 01:39:20 sjocsprodb2 fenced[4722]: sjocsprodb1 not a cluster member after 0 sec post_fail_delay
    Apr 16 01:39:20 sjocsprodb2 fenced[4722]: fencing node "sjocsprodb1"
    Apr 16 01:39:30 sjocsprodb2 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
    Apr 16 01:39:30 sjocsprodb2 kernel: tg3: eth0: Flow control is off for TX and off for RX.
    Apr 16 01:40:28 sjocsprodb2 kernel: tg3: eth0: Link is down.
    Apr 16 01:40:53 sjocsprodb2 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
    Apr 16 01:40:53 sjocsprodb2 kernel: tg3: eth0: Flow control is off for TX and off for RX.
    Apr 16 01:40:54 sjocsprodb2 kernel: tg3: eth0: Link is down.
    Apr 16 01:41:02 sjocsprodb2 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
    Apr 16 01:41:02 sjocsprodb2 kernel: tg3: eth0: Flow control is off for TX and off for RX.
    Apr 16 01:59:57 sjocsprodb2 syslogd 1.4.1: restart.
    Apr 16 01:59:57 sjocsprodb2 syslog: syslogd startup succeeded
    *****************************************************************

    Apr 16 01:39:20 sjocsprodb1 kernel: CMAN: removing node sjocsprodb2 from the cluster : Missed too many heartbeats
    Apr 16 01:39:20 sjocsprodb1 fenced[4739]: sjocsprodb2 not a cluster member after 0 sec post_fail_delay
    Apr 16 01:39:20 sjocsprodb1 fenced[4739]: fencing node "sjocsprodb2"
    Apr 16 01:39:20 sjocsprodb1 clurgmgrd: [5130]: <warning> Link for eth0: Not detected
    Apr 16 01:39:20 sjocsprodb1 clurgmgrd: [5130]: <warning> No link on eth0...
    Apr 16 01:39:20 sjocsprodb1 clurgmgrd[5130]: <notice> status on ip "158.58.17.34" returned 1 (generic error)
    Apr 16 01:39:29 sjocsprodb1 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
    Apr 16 01:39:29 sjocsprodb1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
    Apr 16 01:40:28 sjocsprodb1 kernel: tg3: eth0: Link is down.
    Apr 16 01:40:51 sjocsprodb1 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
    Apr 16 01:40:51 sjocsprodb1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
    Apr 16 01:40:52 sjocsprodb1 kernel: tg3: eth0: Link is down.
    Apr 16 01:41:02 sjocsprodb1 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
    Apr 16 01:41:02 sjocsprodb1 kernel: tg3: eth0: Flow control is off for TX and off for RX.
    Apr 16 02:00:41 sjocsprodb1 syslogd 1.4.1: restart.
    Apr 16 02:00:41 sjocsprodb1 syslog: syslogd startup succeeded


    I would be grateful for any comments.
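
The timeline in these entries can be summarized mechanically. The following is a minimal sketch in Python and is not something posted in the thread: it counts link-down events, measures the delay from the first link-down to the fencing message, and measures the silent gap before syslogd restarts. The names `parse_line` and `summarize` are hypothetical, and the year 2010 is an assumption, since classic syslog lines carry no year. The sample lines are copied from the sjocsprodb2 excerpt above.

```python
import re
from datetime import datetime

# Matches classic syslog lines such as:
#   Apr 16 01:39:08 sjocsprodb2 kernel: tg3: eth0: Link is down.
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")

# A subset of the sjocsprodb2 entries quoted in the post.
SAMPLE = """\
Apr 16 01:38:58 sjocsprodb2 kernel: tg3: eth0: Link is down.
Apr 16 01:39:08 sjocsprodb2 kernel: tg3: eth0: Link is down.
Apr 16 01:39:20 sjocsprodb2 fenced[4722]: fencing node "sjocsprodb1"
Apr 16 01:40:28 sjocsprodb2 kernel: tg3: eth0: Link is down.
Apr 16 01:40:54 sjocsprodb2 kernel: tg3: eth0: Link is down.
Apr 16 01:41:02 sjocsprodb2 kernel: tg3: eth0: Link is up at 100 Mbps, full duplex.
Apr 16 01:59:57 sjocsprodb2 syslogd 1.4.1: restart.
"""


def parse_line(line, year=2010):
    """Return (timestamp, host, message), or None if the line does not match.

    The year is an assumption: syslog timestamps do not include one.
    Month names are parsed with %b, which assumes an English/C locale.
    """
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
    return stamp, match.group(2), match.group(3)


def summarize(lines, year=2010):
    """Summarize link flaps, fencing delay, and the gap before syslogd restarts."""
    events = [e for e in (parse_line(line, year) for line in lines) if e]
    downs = [t for t, _, msg in events if "Link is down" in msg]
    fences = [t for t, _, msg in events if "fencing node" in msg]
    restarts = [t for t, _, msg in events
                if msg.startswith("syslogd") and "restart" in msg]

    summary = {"link_down": len(downs)}
    if downs and fences:
        summary["first_down_to_fence_s"] = (fences[0] - downs[0]).total_seconds()
    if restarts:
        before = [t for t, _, _ in events if t < restarts[0]]
        if before:
            # Time with no log output at all before the logger came back.
            summary["silent_gap_s"] = (restarts[0] - max(before)).total_seconds()
    return summary


if __name__ == "__main__":
    print(summarize(SAMPLE.splitlines()))
```

On this excerpt the sketch reports four link-down events, 22 seconds from the first link-down to the fencing message, and 1135 seconds (about 19 minutes) of silence before syslogd starts again.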

  2. #2
    Linux Guru coopstah13
    Join Date
    Nov 2007
    Location
    NH, USA
    Posts
    3,149
    Perhaps there is a cron job somewhere? Check the root user's crontab and the /etc/cron* directories.
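
One way to act on this suggestion is to filter crontab entries down to those that could fire on a Thursday during the 01:00 hour, the hour seen in the log excerpts in post #1. The sketch below is only an illustration: the function names and the sample crontab lines are hypothetical, and it handles only the common field forms (`*`, lists, ranges, `/step`, and three-letter weekday names). Standard cron syntax has no direct "every two weeks" field, so a weekly entry whose script skips alternate weeks would also show up here as a candidate.

```python
DOW_NAMES = {"sun": 0, "mon": 1, "tue": 2, "wed": 3, "thu": 4, "fri": 5, "sat": 6}


def field_can_match(field, value, names=None):
    """Return True if a cron field can match value.

    Supports *, comma lists, ranges, and /step. Only used here for the hour
    and day-of-week fields, whose ranges both start at 0.
    """
    names = names or {}
    for part in field.lower().split(","):
        base, _, step = part.partition("/")
        step = int(step) if step else 1
        if base == "*":
            low, high = 0, value
        elif "-" in base:
            start, end = base.split("-", 1)
            low, high = int(names.get(start, start)), int(names.get(end, end))
        else:
            low = high = int(names.get(base, base))
        if low <= value <= high and (value - low) % step == 0:
            return True
    return False


def thursday_candidates(lines, hour=1, weekday=4):
    """Return crontab entries that could run on Thursday in the given hour."""
    hits = []
    for line in lines:
        text = line.strip()
        # Skip blanks, comments, and @reboot/@daily style shortcuts.
        if not text or text.startswith(("#", "@")):
            continue
        fields = text.split()
        if len(fields) < 6:  # five time fields plus a command
            continue
        _minute, hour_f, dom_f, _month, dow_f = fields[:5]
        try:
            hour_ok = field_can_match(hour_f, hour)
            dow_ok = field_can_match(dow_f, weekday, DOW_NAMES)
        except (ValueError, ZeroDivisionError):
            continue  # not a schedule line (for example an odd variable assignment)
        # When day-of-month is restricted, cron runs the job if either the
        # day-of-month or the day-of-week matches, so keep those entries too.
        if hour_ok and (dow_ok or dom_f != "*"):
            hits.append(text)
    return hits


# Hypothetical entries, for illustration only.
SAMPLE_CRONTAB = [
    "# comment",
    "MAILTO=root",
    "39 1 * * 4 /usr/local/bin/example-job",
    "0 3 * * * /usr/local/bin/nightly",
    "*/5 * * * * /usr/local/bin/poll",
    "15 0-2 * * thu,sun root /usr/local/bin/other",
]

if __name__ == "__main__":
    for entry in thursday_candidates(SAMPLE_CRONTAB):
        print(entry)
```

To use it on a real system, feed it the lines of `crontab -l` output for root and of the files under the /etc/cron* directories.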

  3. #3
    Just Joined!
    Join Date
    Apr 2010
    Posts
    4
    Thanks for the tip, but I have checked that and there is no cron job.

  4. #4
    Linux Guru Irithori
    Join Date
    May 2009
    Location
    Munich
    Posts
    2,099
    You operate a failover system.
    The two servers watch each other via a heartbeat packet every X seconds.

    What happens is:
    1) *Somehow* the network fails.
    It flaps for an unknown reason (eth0 down, eth0 up, etc.)
    Maybe
    - faulty network card
    - faulty switch
    - someone tripping over a cable
    - etc.

    2) The heartbeat packets can no longer be received.

    3) So each server thinks the other one is unreliable.

    4) There is something called STONITH (Shoot The Other Node In The Head).
    Apart from the ... interesting name, it is a mechanism to ensure a clean, deterministic state.
    That means: a faulty machine that is turned off does less harm than a live one.
    In your logs, this is the fenced "fencing node" line that appears on each server.

    So you might want to check the logs and network cards for errors.
    I also remember that Broadcom NICs behave unreliably with Linux under heavy load. But that information might be outdated; I have had no experience with them in the last 1.5 years.

    If you have the chance, maybe you could try a pair of Intel NICs?
    You must always face the curtain with a bow.
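
The sequence described in this post (lost heartbeats, then each node deciding to fence its peer) can be sketched as a toy simulation. This is a simplified model for illustration only, not how CMAN or fenced are implemented: the node names, the 5-second heartbeat interval, and the threshold of 3 missed heartbeats are all hypothetical values.

```python
def simulate(outage, interval=5, max_missed=3, duration=60):
    """Simulate two nodes that fence each other after too many missed heartbeats.

    outage: (start, end) in seconds; heartbeats sent in this window are lost.
    Returns a list of (time, node, action) tuples.
    All timing values are hypothetical, not real cluster defaults.
    """
    peers = {"node1": "node2", "node2": "node1"}
    missed = {name: 0 for name in peers}
    decided = set()  # nodes that have already issued a fence request
    log = []

    for t in range(0, duration + 1, interval):
        link_up = not (outage[0] <= t < outage[1])
        for name, peer in peers.items():
            if name in decided:
                continue
            # A received heartbeat resets the counter; a lost one increments it.
            missed[name] = 0 if link_up else missed[name] + 1
            if missed[name] >= max_missed:
                decided.add(name)
                log.append((t, name, f"fence {peer}"))
    return log


if __name__ == "__main__":
    # A 20-second outage: both nodes reach the threshold at the same moment.
    print(simulate(outage=(10, 30)))
    # A 10-second flap: the counters reset before the threshold is reached.
    print(simulate(outage=(10, 20)))
```

With these values, the longer outage makes both nodes issue a fence request at t=20, mirroring the two logs in post #1 where each server fences the other at the same second. The shorter flap causes no fencing at all.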

  5. #5
    Just Joined!
    Join Date
    Apr 2010
    Posts
    4
    Irithori, thank you for the comments.

    You are correct, this is a failover environment. I understand the terms perfectly: the fenced service shuts down the node to prevent data corruption while the system is available.

    The thing is, this problem occurs exactly every two weeks, on Thursday at 1:39 am, always at the same time, and the failures have been consecutive. I have applied the latest BIOS and tz-data updates and still get the same result. I have netdump running, and nothing in the logs points to a faulty Ethernet card.

    Thanks

  6. #6
    Linux Guru Irithori
    Join Date
    May 2009
    Location
    Munich
    Posts
    2,099
    Just a thought.
    Is there *maybe* a full backup running at that time every 2 weeks?
    Or something else that can cause high network load on eth0?

    Do you have a monitoring/trend tool like munin or xymon in place?
    Something that gives you CPU load / memory / number of processes / network throughput / and <whatnot> in nice RRD graphs?

    This could indicate whether something special is happening every two weeks (or not).

    Or you spend a night at the datacenter and watch whether someone rips cables out.
    You must always face the curtain with a bow.
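
Where no trend tool is in place, the per-interface counters that Linux exposes in /proc/net/dev can be sampled directly. The sketch below parses that file's text into a dictionary and reports the nonzero error counters for one interface. The function names are hypothetical, and the counter values in the sample are made up for illustration; they are not taken from the servers in this thread.

```python
# Column order of /proc/net/dev: eight receive counters, then eight transmit.
RX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo", "frame",
             "compressed", "multicast")
TX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo", "colls",
             "carrier", "compressed")

# Counters that are often worth a look when a link misbehaves.
WATCHED = (("rx", "errs"), ("rx", "frame"), ("tx", "errs"), ("tx", "carrier"))

# Made-up sample in the /proc/net/dev layout (values are hypothetical).
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0
  eth0:123456 789 3 0 0 1 0 0 654321 987 0 0 0 0 2 0
"""


def parse_net_dev(text):
    """Parse /proc/net/dev text into {iface: {"rx": {...}, "tx": {...}}}."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)  # tolerates "eth0:123" with no space
        values = [int(v) for v in data.split()]
        if len(values) < 16:
            continue
        stats[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, values[:8])),
            "tx": dict(zip(TX_FIELDS, values[8:16])),
        }
    return stats


def nonzero_error_counters(stats, iface):
    """Return the watched error counters for iface that are not zero."""
    counters = stats[iface]
    return {f"{direction}.{key}": counters[direction][key]
            for direction, key in WATCHED if counters[direction][key]}


if __name__ == "__main__":
    print(nonzero_error_counters(parse_net_dev(SAMPLE), "eth0"))
```

On a live system, the same function can be given the contents of /proc/net/dev read at regular intervals, so that counters just before 01:39 can be compared with those from a quiet night.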

  7. #7
    Just Joined!
    Join Date
    Apr 2010
    Posts
    4
    No full backups run at that time; I have checked everything. Actually, the entries in the messages file that show the network going down appear because the server powers itself off. I have a very good monitoring application that reports CPU load, memory, everything, even disk load, and we can't see anything abnormal. The only thing that has changed is that we no longer have the RedHat license for up2date. That's it, nothing else has changed.
