Hi, I am trying to set up a high-availability service in a two-node cluster with Red Hat Cluster Suite, and I am using CentOS 5.3 as the nodes' OS.

Node1: 10.4.0.4
Node2: 10.4.0.5

Luci is installed on Node2.
Both nodes have started cman, clvmd, qdiskd, rgmanager and ricci.

Currently I am setting up a web service in my cluster, which comprises three resources:
1. A GFS file system on shared storage
2. An IP address
3. A startup script in /etc/init.d

This is what I experienced: the service starts on the node I tell it to start on, but after around 30 seconds it kills itself. I have tested that the service works if I run it as an individual service, without using luci.

I have read the following about cluster services on the Internet:

Monitoring and failover of cluster services can be done using simple scripts/binaries. Cluster can/need to receive 3 responses from those scripts: start, stop and status. If the script is called with the status parameter, it should have exit status 0 for OK and exit status 1 for not OK.
So, if we want to have apache as a clustered service, we can use its default init script (under /etc/init.d path).
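
To make the passage above concrete, here is a minimal sketch of how I understand such a script would look. It is not my actual script: the PID file path, the `sleep` command standing in for the daemon, and the function names are all assumptions for illustration only.

```shell
#!/bin/sh
# Sketch of an init script answering start, stop and status.
# ASSUMPTIONS: the PID file path and the "sleep" stand-in daemon are
# placeholders for illustration, not part of any real service.
PIDFILE="${PIDFILE:-/tmp/myscript-demo.pid}"

is_running() {
    # Running means: the PID file exists and that process is still alive.
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
}

do_start() {
    if is_running; then
        return 0                      # already running counts as success
    fi
    sleep 300 >/dev/null 2>&1 &       # placeholder for the real daemon
    echo $! > "$PIDFILE"
}

do_stop() {
    if is_running; then
        kill "$(cat "$PIDFILE")"
    fi
    rm -f "$PIDFILE"
    return 0                          # stopping a stopped service is success
}

do_status() {
    # Exit status 0 for OK, 1 for not OK, as the passage describes.
    if is_running; then return 0; else return 1; fi
}

main() {
    case "$1" in
        start)  do_start ;;
        stop)   do_stop ;;
        status) do_status ;;
        *)      echo "Usage: myscript {start|stop|status}" >&2; return 2 ;;
    esac
}

# Dispatch only when an action is given, so the file can also be sourced.
if [ $# -gt 0 ]; then
    main "$@"
fi
```

Saved as an init script, this would be called with `start`, `stop` or `status` as its first argument, and the exit code of each call is presumably what the cluster sees.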


So these are my worries: it says that the script needs to respond to the “start”, “stop” and “status” calls, and my script only responds to “start” and “stop”, but not “status”! I need to make sure I understand how a cluster service works, since I observed messages about returning 1 (failure) in /var/log/messages:


Sep 10 10:02:12 HA2 clurgmgrd[8769]: <notice> Stopping service service:MYservice1
Sep 10 10:02:14 HA2 luci[6194]: Unable to retrieve batch 398076251 status from 10.4.0.5:11111: module scheduled for execution
Sep 10 10:02:19 HA2 clurgmgrd: [8769]: <err> script:myscript: stop of /etc/init.d/myscript failed (returned 1)
Sep 10 10:02:19 HA2 clurgmgrd[8769]: <notice> stop on script "myscript" returned 1 (generic error)
Sep 10 10:02:19 HA2 clurgmgrd[8769]: <alert> Marking service:MYservice1 as 'disabled', but some resources may still be allocated!
Sep 10 10:02:19 HA2 clurgmgrd[8769]: <notice> Service service:MYservice1 is disabled
Sep 10 10:02:20 HA2 luci[6194]: Unable to retrieve batch 398076251 status from 10.4.0.5:11111: Unable to disable failed service MYservice1 before starting it: clusvcadm failed to stop MYservice1:
Sep 10 10:04:43 HA2 clurgmgrd[8769]: <notice> Starting disabled service service:MYservice1
Sep 10 10:04:44 HA2 luci[6194]: Unable to retrieve batch 2051321004 status from 10.4.0.5:11111: module scheduled for execution
Sep 10 10:04:45 HA2 avahi-daemon[4933]: Registering new address record for 10.4.0.7 on eth0.
Sep 10 10:04:50 HA2 luci[6194]: Unable to retrieve batch 2051321004 status from 10.4.0.5:11111: module scheduled for execution
Sep 10 10:04:51 HA2 clurgmgrd: [8769]: <err> script:myscript: start of /etc/init.d/myscript failed (returned 1)
Sep 10 10:04:51 HA2 clurgmgrd[8769]: <notice> start on script "myscript" returned 1 (generic error)
Sep 10 10:04:51 HA2 clurgmgrd[8769]: <warning> #68: Failed to start service:MYservice1; return value: 1
Sep 10 10:04:51 HA2 clurgmgrd[8769]: <notice> Stopping service service:MYservice1
Sep 10 10:04:55 HA2 luci[6194]: Unable to retrieve batch 2051321004 status from 10.4.0.5:11111: module scheduled for execution
Sep 10 10:05:30 HA2 last message repeated 6 times
Sep 10 10:06:32 HA2 last message repeated 11 times
Sep 10 10:07:01 HA2 last message repeated 5 times
Sep 10 10:07:01 HA2 clurgmgrd: [8769]: <err> script:myscript: stop of /etc/init.d/myscript failed (returned 1)
Sep 10 10:07:01 HA2 clurgmgrd[8769]: <notice> stop on script "myscript" returned 1 (generic error)
Sep 10 10:07:01 HA2 avahi-daemon[4933]: Withdrawing address record for 10.4.0.7 on eth0.
Sep 10 10:07:06 HA2 luci[6194]: Unable to retrieve batch 2051321004 status from 10.4.0.5:11111: module scheduled for execution
Sep 10 10:07:11 HA2 clurgmgrd[8769]: <crit> #12: RG service:MYservice1 failed to stop; intervention required
Sep 10 10:07:11 HA2 clurgmgrd[8769]: <notice> Service service:MYservice1 is failed
Sep 10 10:07:11 HA2 clurgmgrd[8769]: <crit> #13: Service service:MYservice1 failed to stop cleanly
Sep 10 10:07:12 HA2 luci[6194]: Unable to retrieve batch 2051321004 status from 10.4.0.5:11111: clusvcadm start failed to start MYservice1:
Sep 10 10:09:14 HA2 qdiskd[13333]: <warning> qdisk cycle took more than 1 second to complete (1.010000)




It seems to first attempt to stop the service, and then it says the stop has failed because the script returned 1. So I fear that it checks whether the start or stop call was successful by calling the service’s status, which my script doesn’t support. If so, it will always think a start or stop has failed and decide to withdraw the resources from the service. That might explain why I could sometimes start my service for 30 seconds before it killed itself.
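
One way to check this by hand would be to call the script with each action and look at the exit code it gives back, since that code is what the log lines above report. Below is a rough sketch of such a check; the helper name `report_exit_codes` is hypothetical, and the default path is the one shown in the log.

```shell
#!/bin/sh
# Print the exit status an init script returns for each requested action.
# SCRIPT defaults to the path from the log; override it to test another.
SCRIPT="${SCRIPT:-/etc/init.d/myscript}"

report_exit_codes() {
    script="$1"
    shift
    for action in "$@"; do
        rc=0
        "$script" "$action" >/dev/null 2>&1 || rc=$?
        echo "$action returned $rc"
    done
}

# Example, run as root on the node:
#   report_exit_codes "$SCRIPT" start status stop stop
```

If the second `stop` in that example, or the `status` call, reports a non-zero code, that would at least be consistent with the "returned 1" lines in the log.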

I hope I am wrong, but can anyone kindly explain to me how a cluster service really works in RHCS?