Hi all,
I successfully set up a GlassFish 3.1 (b43) cluster on RHEL5 with two nodes, each node running one instance. I tested mod_jk versions 1.2.26, 1.2.28 and 1.2.31, both prebuilt and manually compiled. We plan to deploy an application that will receive small updates every hour from 300000 clients. We deployed a sample HTTP application that shows the instance name to make sure the load balancer works. I followed this post, but made changes to worker.properties because failover was not working:

tiainen: Load balancing with Glassfish 3.1 and Apache
http://tiainen.sertik.net/2011/03/load-balancing-with-glassfish-31-and.html

jk.conf
LoadModule jk_module modules/mod_jk.so
JkWorkersFile /etc/httpd/worker.properties
JkShmFile /var/log/httpd/mod_jk.shm
JkLogFile /var/log/httpd/mod_jk.log
JkLogLevel info
JkLogStampFormat "[%a %b %d %H:%M:%S %Y] "
JkOptions +ForwardKeySize +ForwardURICompat -ForwardDirectories
JkRequestLogFormat "%w %V %T"
# redirect traffic to loadbalancer
JkMount /* loadbalancer

worker.properties
worker.list=loadbalancer
# default properties for workers
worker.template.type=ajp13
worker.template.port=28009
worker.template.lbfactor=50
worker.template.connection_pool_timeout=600
worker.template.socket_keepalive=1
worker.template.socket_timeout=120
# properties for worker1
worker.worker1.reference=worker.template
worker.worker1.host=glassfishin01
#worker.worker1.lbfactor=50
# properties for worker2
worker.worker2.reference=worker.template
worker.worker2.host=glassfishin02
#worker.worker2.lbfactor=50
# properties for loadbalancer
worker.loadbalancer.type=lb
worker.loadbalancer.sticky_session=False
worker.loadbalancer.reply_timeout=30000
worker.loadbalancer.balance_workers=worker1,worker2
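
The names in balance_workers have to match the worker definitions exactly, so a stray space or typo there is worth ruling out. Below is a minimal sketch of such a check, not part of mod_jk. The function names are hypothetical, and it assumes that each worker is declared with its own worker.<name>.host line and that the list is split on commas and whitespace.

```python
import re


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def undefined_balance_workers(props, lb_name="loadbalancer"):
    """Return balance_workers entries that have no worker.<name>.host line.

    Assumption: members are separated by commas and/or whitespace.
    """
    members = props.get("worker.%s.balance_workers" % lb_name, "")
    names = [n for n in re.split(r"[,\s]+", members) if n]
    return [n for n in names if "worker.%s.host" % n not in props]


if __name__ == "__main__":
    sample = """
    worker.worker1.host=glassfishin01
    worker.worker2.host=glassfishin02
    worker.loadbalancer.type=lb
    worker.loadbalancer.balance_workers=worker1,worker2
    """
    # An empty list means every balanced worker has a host definition.
    print(undefined_balance_workers(parse_properties(sample)))
```

To check the real file, read /etc/httpd/worker.properties and pass its contents to parse_properties.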

What works:
1. The DAS and the Glassfish instances work as expected
2. The loadbalancing works just fine
3. Failover works ONLY if I stop or restart the instance from the DAS
The problem:
If I restart the OS of a node, failover breaks. The instance is detected as down, and while it stays down things seem fine. Once the failed instance boots up again, one of the two instances (at random) stops working. Sometimes I have to restart the cluster and httpd to get things going. mod_jk seems to treat the two kinds of failover differently, and it wrongly detects one or both instances as down.
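
To see whether an instance's AJP listener is really answering after a reboot, independently of what mod_jk reports, one option is to send it an AJP13 CPing and wait for the CPong. This is a minimal sketch only. The function name ajp_ping is hypothetical, the host names and port 28009 come from the configuration above, and it is an assumption that the GlassFish AJP listener replies to CPing at all.

```python
import socket

# AJP13 CPing: magic 0x1234, payload length 1, message type 10.
CPING = b"\x12\x34\x00\x01\x0a"
# AJP13 CPong: magic "AB", payload length 1, message type 9.
CPONG = b"AB\x00\x01\x09"


def ajp_ping(host, port=28009, timeout=5.0):
    """Return True if host:port answers an AJP13 CPing with a CPong."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(CPING)
            reply = b""
            # Read until the full 5-byte reply has arrived or the peer closes.
            while len(reply) < len(CPONG):
                chunk = sock.recv(len(CPONG) - len(reply))
                if not chunk:
                    break
                reply += chunk
            return reply == CPONG
    except OSError:
        # Refused, reset, unreachable, or timed out.
        return False
```

Calling ajp_ping("glassfishin01") and ajp_ping("glassfishin02") from the Apache host right after a node reboots would show whether the backend or mod_jk's view of it is at fault.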

[Tue Apr 26 09:07:18 2011] [29639:3085998688] [error] ajp_connection_tcp_get_message::jk_ajp_common.c (1011): (worker2) can't receive the response message from tomcat, network problems or tomcat (192.168.3.204:28009) is down (errno=104)
[Tue Apr 26 09:07:18 2011] [29639:3085998688] [error] ajp_get_reply::jk_ajp_common.c (1766): (worker2) Tomcat is down or refused connection. No response has been sent to the client (yet)
[Tue Apr 26 09:07:18 2011] [29639:3085998688] [info] ajp_service::jk_ajp_common.c (2186) (worker2) sending request to tomcat failed (recoverable), (attempt=1)
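
To see which worker mod_jk is complaining about over time, the log can be tallied per worker and log level. This is a minimal sketch that assumes the line layout shown in the excerpt above; the regular expression and the function name are illustrative, not anything provided by mod_jk.

```python
import re
from collections import Counter

# Matches e.g. "[error] ajp_get_reply::jk_ajp_common.c (1766): (worker2) ..."
# The colon after the source line number is treated as optional.
LINE_RE = re.compile(r"\[(\w+)\]\s+\S+\s+\(\d+\):?\s+\((\w+)\)")


def count_by_worker(lines):
    """Count log entries per (worker, level) pair; skip lines that don't match."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            level, worker = match.groups()
            counts[(worker, level)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[Tue Apr 26 09:07:18 2011] [29639:3085998688] [error] "
        "ajp_get_reply::jk_ajp_common.c (1766): (worker2) Tomcat is down "
        "or refused connection. No response has been sent to the client (yet)",
    ]
    for (worker, level), n in sorted(count_by_worker(sample).items()):
        print(worker, level, n)
```

For the real log, pass an open file such as open("/var/log/httpd/mod_jk.log") to count_by_worker.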



Can you please suggest where to look for the problem?
Best regards,
Todor