- 09-13-2010 #1 Just Joined!
- Join Date: Sep 2010
- Posts: 1
Intermittent connectivity issues with ROCKS on a compute cluster
I have a cluster set up with a head node and compute nodes running TORQUE and MOAB. The distro is ROCKS 5.3 on Red Hat EL5. For the past couple of weeks I've been having connectivity problems. Every couple of hours the network connectivity seems to just stop working: sometimes it comes back in 10-15 minutes, sometimes I have to reboot the machine. I have SAMBA set up, and during an outage the network drive mounted on my Windows PC won't respond (often causing Windows Explorer to crash) and I can't open a new PuTTY session. If I already have a PuTTY window open, I can run basic commands like "ls" and "cd", but qstat and pbsnodes don't work. If I'm logged into the head node through PuTTY, I can still ssh into one of the compute nodes. Eventually the PuTTY window crashes too. Through all of this, I can ping the server just fine.
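Since ping keeps working while new SSH and SAMBA connections fail, it might help to timestamp exactly when each service stops accepting TCP connections. Below is a minimal sketch of a probe that could run from another machine on the network with Python 3. The port numbers are assumptions (22 for sshd, 445 for smbd, 15001 as the usual pbs_server default) and may not match this setup; the function names are made up for illustration.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Attempt one TCP connect and return (ok, detail)."""
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected in %.3fs" % (time.time() - start)
    except OSError as exc:
        # Covers refused connections, timeouts and unreachable networks.
        return False, "%s: %s" % (type(exc).__name__, exc)


def probe_all(host, ports, timeout=3.0):
    """Probe every named port once; return {name: (ok, detail)}."""
    return {name: probe(host, port, timeout) for name, port in ports.items()}


def format_line(results):
    """Render one timestamped status line, e.g. '... smb=DOWN ssh=up'."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    parts = ["%s=%s" % (name, "up" if ok else "DOWN")
             for name, (ok, _detail) in sorted(results.items())]
    return stamp + " " + " ".join(parts)


def monitor(host, ports, rounds, interval=60.0):
    """Print one status line per round, sleeping `interval` seconds between."""
    for _ in range(rounds):
        print(format_line(probe_all(host, ports)), flush=True)
        time.sleep(interval)
```

Something like `monitor("wantsh01", {"ssh": 22, "smb": 445, "pbs_server": 15001}, rounds=1440)` would log once a minute for a day. The resulting timestamps can then be lined up against /var/log/messages to see what the head node was doing when the services dropped.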
The SAMBA logs were reporting all sorts of problems:
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(42)
INTERNAL ERROR: Signal 7 in pid 9816 (3.0.33-3.15.el5_4)
[2010/09/10 03:51:29, 0] smbd/close.c:close_directory(430)
close_directory: Could not get share mode lock for Pao
Please read the Trouble-Shooting section of the Samba3-HOWTO
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(44)
From: [CAN'T POST URLS -- it's the samba help pages]
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(41)
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(45)
[2010/09/10 03:51:29, 0] lib/fault.c:fault_report(42)
[2010/09/10 03:51:29, 0] lib/util.c:smb_panic(1655)
INTERNAL ERROR: Signal 7 in pid 8475 (3.0.33-3.15.el5_4)
PANIC (pid 9816): internal error
Please read the Trouble-Shooting section of the Samba3-HOWTO
[2010/09/10 03:51:30, 0] lib/util.c:log_stack_trace(1759)
[2010/09/10 03:51:30, 0] lib/fault.c:fault_report(44)
I turned off SAMBA but still have the same problems. /var/log/messages contained this:
Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1
Sep 10 10:38:02 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1
bnx2i is some sort of driver for the Broadcom network card. I updated the Broadcom multi-function drivers and the firmware but still have problems. One thing I couldn't get working was the bnx2i iSCSI offload driver: I ran into version issues with the RPMs. I've run MEMTEST and a couple of hardware diagnostic checks and can't find any problems. Here's /var/log/messages from when I rebooted the machine. Note that I hosed the X server somehow, and I'm not really worried about fixing that.
Sep 13 04:49:25 wantsh01 gdm[3930]: Failed to start X server several times in a short time period; disabling display :0
Sep 13 04:49:29 wantsh01 mountd[3527]: Caught signal 15, un-registering and exiting.
Sep 13 04:52:12 wantsh01 kernel: Memory for crash kernel (0x0 to 0x0) notwithin permissible range
Sep 13 04:52:12 wantsh01 kernel: PCI: BIOS Bug: MCFG area at e0000000 is not E820-reserved
Sep 13 04:52:12 wantsh01 kernel: PCI: Not using MMCONFIG.
Sep 13 04:52:13 wantsh01 kernel: intel_rng: FWH not detected
Sep 13 04:52:13 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 13 04:52:13 wantsh01 kernel: bnx2i: dev eth0 does not support iscsi
Sep 13 04:52:13 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1
Sep 13 04:52:13 wantsh01 kernel: bnx2i: dev eth1 does not support iscsi
Sep 13 04:52:13 wantsh01 named[3028]: the working directory is not writable
Sep 13 04:52:19 wantsh01 sshd[3428]: error: Bind to port 22 on 0.0.0.0 failed: Address already in use.
Sep 13 04:52:19 wantsh01 xinetd[3445]: /etc/xinetd.d/RCS is not a regular file. It is being skipped.
Sep 13 04:52:24 wantsh01 smartd[3926]: Problem creating device name scan list
Sep 13 04:52:24 wantsh01 smartd[3926]: Problem creating device name scan list
Sep 13 04:52:24 wantsh01 smartd[3926]: In the system's table of devices NO devices found to scan
Sep 13 04:52:31 wantsh01 gdm[4042]: gdm_slave_xioerror_handler: Fatal X error - Restarting :0
Sep 13 04:52:40 wantsh01 gdm[4188]: gdm_slave_xioerror_handler: Fatal X error - Restarting :0
Sep 13 04:52:49 wantsh01 gdm[4210]: gdm_slave_xioerror_handler: Fatal X error - Restarting :0
Sep 13 04:53:19 wantsh01 gdm[3940]: Failed to start X server several times in a short time period; disabling display :0
Sep 13 04:53:32 wantsh01 dhcpd: receive_packet failed on eth0: Network is down
Sep 13 04:53:33 wantsh01 kernel: bnx2i: dev eth0 does not support iscsi
Sep 13 04:53:33 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth0
Sep 13 04:53:37 wantsh01 kernel: bnx2i: dev eth1 does not support iscsi
Sep 13 04:53:37 wantsh01 kernel: bnx2i: iSCSI not supported, dev=eth1
Sep 13 04:54:50 wantsh01 snmpd[3379]: c64 32 bit check failed
Sep 13 04:55:20 wantsh01 snmpd[3379]: looks like a 64bit wrap, but prev!=new
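With this much noise in the log, a small parser could help separate the repeated bnx2i and gdm lines from entries that actually mention the network going down. The sketch below assumes the classic syslog layout seen above (timestamp, host, tag with optional PID, message). The function names are hypothetical, and the only default keyword is "Network is down", taken from the dhcpd line.

```python
import re
from collections import Counter

# Matches lines such as:
#   Sep 13 04:53:32 wantsh01 dhcpd: receive_packet failed on eth0: Network is down
#   Sep 13 04:52:19 wantsh01 sshd[3428]: error: Bind to port 22 ...
SYSLOG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+"
    r"(?P<tag>[^\s:\[]+)(?:\[\d+\])?:\s+(?P<msg>.*)$"
)


def parse(line):
    """Return a dict with stamp, host, tag, msg, or None if the line differs."""
    match = SYSLOG_RE.match(line.strip())
    return match.groupdict() if match else None


def count_by_tag(lines):
    """Count parsed entries per program tag (kernel, sshd, dhcpd, ...)."""
    counts = Counter()
    for line in lines:
        record = parse(line)
        if record is not None:
            counts[record["tag"]] += 1
    return counts


def find_keywords(lines, keywords=("Network is down",)):
    """Return parsed entries whose message contains any of the keywords."""
    hits = []
    for line in lines:
        record = parse(line)
        if record is not None and any(k in record["msg"] for k in keywords):
            hits.append(record)
    return hits
```

It could be fed the real file with `find_keywords(open("/var/log/messages"))`. The matching timestamps would show whether the "Network is down" entries line up with the times the connections drop.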
Thanks for any help, I'd really appreciate some advice.

