Some hosts are unreachable until I add and remove a host route for them
I've got a really strange problem that I can hardly get to grips with, but let me describe my overall situation first:
I've got a couple of hosts. Some are physical and some are virtual machines running on e.g. Linux KVM or VirtualBox based servers. These hosts are spread over two networks, let's call them n1 and n2, which are connected via OpenVPN (which breaks down pretty often, but that's another story). To be honest, I doubt that the OpenVPN and virtualization (and related bridging) setup is really the cause (I'm 98% sure it isn't)...
- n1 is part of my university's subnet and thus has public IP addresses (141.13.*.*). The (dedicated) OpenVPN "gateway server" (call it 'U') runs inside a virtual machine. There's also a separate virtual machine for the firewall/routing/DNS/DHCP services in that network.
- n2 is a private network and uses private IP addresses (192.168.*.*). The routing and OpenVPN services are also handled by a virtual machine in this network (call it 'H').
- Routes are set on each network (i.e. on the routers / VPN servers) so that every host in n1 can reach every host in n2 and vice versa. Everything works fine ... well, nearly fine...
- A, B and C are hosts in n1,
- X, Y and Z are hosts in n2.
Sometimes, in fact quite regularly, Z is able to reach (ping, ssh, etc.) A and B, but no connection to C can be established, i.e. it times out. Interestingly, X or Y can still connect to and ping C, so this is probably not related to any of the routers or VPN servers involved. Even more interesting is the fact that running
route add -host C gw H; route del -host C
on Z (while leaving all other hosts untouched) instantly restores communication from Z to C, i.e. ping, ssh, etc. work again. (I found this phenomenon by pure accident.)
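To make the workaround easier to repeat, and to capture what the kernel decides before and after it, here is a minimal sketch of a helper I could run on Z. It assumes that iproute2's `ip` is installed alongside `route`; `poke_route`, `run` and `DRY_RUN` are names made up for illustration, and the two arguments stand for the real addresses of C and H. It only wraps the commands above and is not a fix.

```shell
#!/bin/sh
# Hypothetical helper (names are illustrative, not from any existing tool):
# repeat the add/remove host-route workaround and show the kernel's
# routing decision for the target before and after.
# With DRY_RUN=1 the commands are only printed, not executed.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: poke_route <target> <gateway>
poke_route() {
    target=$1
    gateway=$2
    run ip route get "$target"                     # routing decision before
    run route add -host "$target" gw "$gateway"    # add temporary host route
    run route del -host "$target"                  # remove it again
    run ip route get "$target"                     # routing decision after
}
```

Calling `DRY_RUN=1 poke_route C H` just prints the four commands. Without `DRY_RUN` it needs root, and comparing the two `ip route get` outputs might show whether the kernel picked a different route or gateway before the workaround.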
It doesn't really matter whether the connection source or target is a physical or a virtual host. Maybe it's just in my mind, but I have a feeling that this mainly happens after the OpenVPN connection was down and I had to restart it (manually), so that n1 was unreachable for a while. It looks as if Z (i.e. a service running on it) tried to connect to C during that period and was, of course, unsuccessful.
Is there any feature or bug in the kernel which makes it remember such unsuccessful connection attempts? It certainly looks that way, at least...
(The system on Z is Debian Wheezy running on a 3.2.0-2-amd64 kernel.)
This problem is driving me crazy. Can anyone shed some light on it? What can I do to get rid of it?
Thanks in advance!