Cluster Network Down
Hey guys, I ran into a problem with my cluster's network the other day and was hoping someone could chime in with troubleshooting help. We have a 20-node cluster (1 head node, 19 compute nodes) connected through a gigabit Ethernet switch. All compute nodes mount /home and /opt from the head node. At some point I realized that I was unable to ssh to the cluster. Upon further inspection, the head node itself seemed to be functioning just fine (I attached a monitor and keyboard to it directly). All compute nodes were also running fine, except that they showed an error about the head node not responding. However, I also found that I was unable to ssh to any node from any other node, so it appears that the Ethernet network has gone down. The switch is still on, but all of the port lights are flashing simultaneously at a regular interval.
This has happened a couple of times now, and each time the only way I've gotten the cluster back up and running is to brute-force shut everything down and restart.
Really, I'm looking for suggestions on how to figure out what went wrong and why, because each time this happens all of the calculations on the cluster are killed. So obviously I want to avoid this in the future...
Thanks in advance!
Sounds like the problem is the switch and not the cluster. What type of switch is it?
I initially thought it was the switch as well. It is a 50-port gigabit switch (SMC8150L2). I'm just not very familiar with how to troubleshoot switches. Any info would definitely be appreciated...
You have access to the switch and can log into it?
Who is the maker of this switch?
The switch is made by SMC and I can indeed access it. It is a managed 50-port 10/100/1000 gigabit switch. Once I'm logged in to the switch, what kinds of things should I be looking for?
What say ye?
Look at the logs and see if there is anything in there around the times that you lose the cluster.
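If the switch lets you save its log as plain text, something like the following might help narrow it down to the entries around each outage. This is only a rough sketch: it assumes each log line starts with a timestamp in the form YYYY-MM-DD HH:MM:SS, which may not match what your switch actually writes, so adjust TIMESTAMP_FORMAT and TIMESTAMP_LENGTH to fit. The function name lines_near_outage and the sample entries are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumed (hypothetical) timestamp layout at the start of each saved log line.
# Change both constants to match the format your switch really uses.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
TIMESTAMP_LENGTH = 19  # length of a string like "2024-01-31 14:05:00"


def lines_near_outage(log_lines, outage_time, window_minutes=10):
    """Return log lines stamped within +/- window_minutes of outage_time."""
    window = timedelta(minutes=window_minutes)
    matches = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        except ValueError:
            # Line does not begin with a parseable timestamp; skip it.
            continue
        if abs(stamp - outage_time) <= window:
            matches.append(line.rstrip("\n"))
    return matches


if __name__ == "__main__":
    # Placeholder entries, not real switch output.
    sample = [
        "2024-01-31 13:00:00 placeholder entry A",
        "2024-01-31 14:02:00 placeholder entry B",
        "2024-01-31 14:20:00 placeholder entry C",
    ]
    # Time you noticed the cluster was unreachable (example value).
    noticed = datetime(2024, 1, 31, 14, 5, 0)
    for entry in lines_near_outage(sample, noticed):
        print(entry)
```

To run it against a saved log, read the file with open(path).readlines() and pass the result in as log_lines. Note down the time on the head node each time you lose the cluster so you have something to compare against, and check that the switch clock and the head node clock agree, otherwise the window will be off.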