DNS issue with Failover IP from Hetzner

  • As promised, here goes my answer:

  • Full disclosure: I'm not working for Hetzner, but I have worked (and still work) for companies that colocate hardware at Hetzner.

  • If the location in your profile is correct and you need hands-on support: I'm based in the same city and could offer a hand, or two.

  • For all the people who have never dealt with Hetzner: they filter network access, which means, especially regarding their failover IPs (IPs which can be moved between machines to provide some sort of high availability), that traffic directed to a specific IP is routed to a specific MAC address.

  • If one wants to change the target (the machine) the traffic is directed to, one has to send a POST request to an API which is served via HTTPS. The API then validates the authentication (a username and a corresponding password) and the request itself, and, if both are valid, propagates the new config to various routers in the network. This technique is similar to the one used by OVH, a big provider based in France.
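To give a rough idea, such a re-routing request might look like the sketch below. The endpoint and parameter name are from my memory of Hetzner's Robot webservice docs, so verify them against the current documentation; all credentials and IPs are placeholders.

```shell
# All values below are placeholders -- substitute your own.
API_USER="robot-user"          # Robot webservice credentials
API_PASS="secret"
FAILOVER_IP="203.0.113.10"     # the failover IP to re-route
NEW_TARGET="198.51.100.2"      # main IP of the machine that should receive traffic
API_URL="https://robot-ws.your-server.de/failover/${FAILOVER_IP}"

# POST the new target (echoed here instead of executed --
# remove the leading "echo" for real use):
echo curl -u "${API_USER}:${API_PASS}" \
    -d "active_server_ip=${NEW_TARGET}" \
    "$API_URL"
```

The API answers with the updated routing information, so a script can check whether the switch was accepted.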

  • Caveat: although people use these IPs to provide some sort of high availability (as written) for their machines / services, the propagation of the new routing config takes some time, sometimes up to ~ 60 seconds. This means, for example with automatic failover: if the machine the traffic is currently routed to goes down, the traffic simply gets dropped (which people will notice) until the new routing config is in place.
  • So much for the introduction; let's have a look at your specific problem:
  • As pointed out in the comments / chat, auto eth0:0 sets up your failover IP on the interface eth0:0 as soon as the network gets started, normally at boot time. You've got two machines with the same configuration, so the same IP is active on two different machines at once (which isn't a no-go per se, but leads to the situation you're currently dealing with). Just a note: the syntax you're using, aliasing the same interface multiple times, is deprecated (but still works). The "new way", which simply assigns multiple IPs to one interface, is described in the Debian wiki (this link) as well.
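For illustration, the two styles side by side in /etc/network/interfaces (addresses are placeholders); current ifupdown accepts multiple iface stanzas for the same physical interface:

```
# Deprecated alias style:
auto eth0:0
iface eth0:0 inet static
    address 203.0.113.10
    netmask 255.255.255.255

# "New" style: a second stanza for eth0 itself,
# which just adds another IP to the interface:
iface eth0 inet static
    address 203.0.113.10/32
```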
  • So: you've got the IP assigned locally on both machines at the same time. curl in your test case does the following: it resolves the given domain name to an IP, and then tries to connect to this IP on port 443. Because this IP is assigned locally and therefore reachable, the packets never get sent out to the network. If nginx (as in your test case) is not running locally at this time, you simply get "connection refused", which is totally fine and valid: "the IP is local, so let's send the traffic there". The kernel will never hand the packets to some router which may have the information: "traffic directed to this IP should go to that machine".
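You can watch the kernel make exactly that decision with ip route get: for any locally assigned address it reports a "local" route. Demonstrated here with the loopback address (assigned on every machine); your failover IP behaves the same way while it is assigned locally.

```shell
# Ask the kernel which route it would pick for a destination.
# For an address assigned to the machine itself the answer is a
# "local" route: packets are delivered internally, never sent out.
ip route get 127.0.0.1
# For your setup you'd check the failover IP instead, e.g.:
#   ip route get 203.0.113.10   (203.0.113.10 is a placeholder)
# As long as the IP is assigned locally, the output starts with "local".
```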
  • Now... actually I'm not entirely sure what you're after. Do you only want to understand what's happening? If so, I've tried to describe it above. Do you want to find / implement a way which "solves" this situation? If the latter, here are some thoughts:
  • Solution 1: remove the directive auto eth0:0 (but leave the rest of the eth0:0 configuration in place) from /etc/network/interfaces. With that, the IP won't be assigned to the machine at boot. Assigning it becomes your task (or the task of a script): run ifup eth0:0 (and, again, maybe talk to the API to ensure the traffic gets routed to the correct machine).
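A minimal take-over script under these assumptions (the eth0:0 stanza exists but has no auto line; the API endpoint, credentials and IPs are placeholders, and both commands are only echoed so nothing happens by accident):

```shell
#!/bin/sh
# Sketch of a manual take-over: assign the failover IP locally,
# then re-route it to this machine via the API. Remove the leading
# "echo" from both commands for real use.
set -eu

FAILOVER_IP="203.0.113.10"   # placeholder: the failover IP
MY_MAIN_IP="198.51.100.2"    # placeholder: this machine's main IP
API_URL="https://robot-ws.your-server.de/failover/${FAILOVER_IP}"

# 1. Bring up the alias interface carrying the failover IP.
echo ifup eth0:0

# 2. Tell the routers to send traffic for the failover IP here.
echo curl -u "user:pass" -d "active_server_ip=${MY_MAIN_IP}" "$API_URL"
```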
  • Solution 2, aka "automate all the things": don't do manual failover, but implement a system which does it automatically, using heartbeats between both machines to check their health. Multiple solutions exist for this, for example the Virtual Router Redundancy Protocol and (full disclosure: my personal favorite, which I've been using in production for years for tasks like this) corosync and pacemaker, the de facto standard for setting up clusters providing high availability under Linux. (Also, have a look at this.) If you want to try the latter way, the fine folks at Kumina developed (and published) a resource agent some years ago for exactly this situation at Hetzner. The resource agent takes care of updating the routing information by talking to the API.
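For a rough idea of what this looks like in pacemaker, here is a crm configuration sketch. The resource agent name and its parameters below are illustrative placeholders, not the actual names from the Kumina agent — check its README for the real ones.

```
# crm configure -- sketch only; agent and parameter names are
# placeholders, the IP is a documentation address.
primitive p_failover_ip ocf:example:hetzner-failover-ip \
    params ip="203.0.113.10" \
    op monitor interval="30s"
# pacemaker starts the resource on one node; when it migrates, the
# agent updates Hetzner's routing via the API.
```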
  • To come to an end (for now): I'm not entirely sure what you're after. I've tried to describe the root cause of the problem you're facing right now, and to present some thoughts on possible solutions. In case I didn't get what you're trying to do, things are left unclear, or you've got additional questions: please give feedback, I'm glad to help (or at least to try).
  • (Besides: could you please move your configs etc. into your post, to keep everything in one place, so this question can help other people in the future?)