  1. #1
    Just Joined!
    Join Date
    Apr 2010
    Posts
    15

    Cluster Network Down


    Hey guys, I ran into a problem with my cluster's network the other day and was hoping someone could chime in with troubleshooting help. We have a 20-node cluster (1 head node, 19 compute nodes) connected through a gigabit ethernet switch. All compute nodes mount /home and /opt from the head node. At some point I realized that I was unable to ssh to the cluster. Upon further inspection, the head node itself seemed to be functioning just fine (I attached a monitor and keyboard to it directly). All compute nodes were running fine too, except that they showed an error about the head node not responding. However, I also found that I was unable to ssh to any node from any other node. So it appears that the ethernet network has gone down. The switch is still on, but all of the port lights are flashing simultaneously at a regular interval.

    This has happened a couple of times now, and each time the only way I've gotten the cluster back up and running is to brute-force shut everything down and restart.

    Really, I'm looking for suggestions on how to figure out what went wrong and why, because each time this happens all of the calculations on the cluster are killed. So obviously I want to avoid this in the future...

    Thanks in advance!

  2. #2
    Linux Guru Lazydog
    Join Date
    Jun 2004
    Location
    The Keystone State
    Posts
    2,677
    Sounds like the problem is the switch and not the cluster. What type of switch is it?

    Regards
    Robert

    Linux
    The adventure of a life time.

    Linux User #296285
    Get Counted

  3. #3
    Just Joined!
    Join Date
    Apr 2010
    Posts
    15
    I initially thought it was the switch as well. It is a 50-port gigabit switch (SMC8150L2). I'm just not very familiar with how to troubleshoot switches. Any info would definitely be appreciated...

  5. #4
    Linux Guru Lazydog
    Join Date
    Jun 2004
    Location
    The Keystone State
    Posts
    2,677
    You have access to the switch and can log into it?
    Who is the maker of this switch?


  6. #5
    Just Joined!
    Join Date
    Apr 2010
    Posts
    15
    The switch is made by SMC and I can indeed access it. It is a managed 50-port 10/100/1000 gigabit switch. When I access the switch, what kinds of things should I be looking for?

    What say ye?

  7. #6
    Linux Guru Lazydog
    Join Date
    Jun 2004
    Location
    The Keystone State
    Posts
    2,677
    Look at the logs and see if there is anything in there around the times that you lose the cluster.
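
    If the switch log can be exported as text, narrowing it to the minutes around an outage makes it easier to read. The following is only a rough sketch: it assumes each log line begins with a `YYYY-MM-DD HH:MM:SS` timestamp, which may not match what the SMC8150L2 actually writes, and the function name `lines_near` and the sample lines are made up for illustration.

```python
from datetime import datetime, timedelta


def lines_near(log_lines, outage_time, window_minutes=10):
    """Return the log lines stamped within +/- window_minutes of outage_time.

    Assumption: every useful line starts with 'YYYY-MM-DD HH:MM:SS'.
    Lines that do not parse that way are skipped.
    """
    window = timedelta(minutes=window_minutes)
    matches = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # no leading timestamp in the assumed format
        if abs(stamp - outage_time) <= window:
            matches.append(line)
    return matches


# Placeholder lines, not real switch output.
sample = [
    "2010-04-12 02:10:00 port 7 link up",
    "2010-04-12 03:14:55 port 12 link down",
    "not a timestamped line",
    "2010-04-12 05:00:00 system ok",
]
outage = datetime(2010, 4, 12, 3, 15, 0)

for line in lines_near(sample, outage):
    print(line)
```

    Adjust the timestamp format string and the window to whatever the exported log really looks like, and use the time you first noticed ssh failing as `outage_time`.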

