(or warm standby architecture with IP failover in the Cloud)

Background and the Problem

As long as you have stateless servers that can run in active-active -configuration (that’s kind of build into stateless-nes) it is trivial to build highly available architectures by placing a load balancer in front of the servers. However when migrating legacy services to public cloud, it is not uncommon to run into software architectures that just don’t work with load balancing.

Typical problem is statefull server running a service that can be active only on single instance at the time with clients using an IP address to connect (not DNS name) because DNS caching would make failover too slow or unreliable. So you would need a static IP that could failover to active server.

IP failover may not sound too difficult at first, attach an elastic network interface (ENI) to your instance and if the server stops responding, detach and re-attach it to next instance with some Lambda -magic 🦄. But what if instances are in different AZs?

Amazon VPC for On-Premises Network Engineers

Best practice for highly available service is to have servers in multiple AZs, but ENI (and subnet) are AZ specific. As servers would be in different subnets, it is not possible to have the same private IP assigned to both of them, right? (don’t even start thinking about using public elastic IP and exposing servers to internet just because of IP failover)


Fortunately there is an easy way to setup IP failover between availabity zones.

  • Assign virtual IP to both instances interfaces. Actual command might vary a bit depending on your OS/distribution but on Ubuntu 18 below cmd will add as secondary IP for eth0 interface.
# ip addr add dev eth0 label eth0:1

  • Modify route table(s) for subnet(s) to route to the serverA.

  • DONE!

Now let’s verify from an instance in the same VPC that connection to is routed to serverA. For demonstration I’m using Nginx webserver but this could be any application listening to TCP/UDP traffic on instance.

$ curl -s | grep title
<title>Welcome to nginx @serverA!</title>

For IP failover demo, lets start pinging …

$ ping
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=0.578 ms
64 bytes from icmp_seq=2 ttl=64 time=0.503 ms
64 bytes from icmp_seq=19 ttl=64 time=0.431 ms
64 bytes from icmp_seq=20 ttl=64 time=0.570 ms

At this point I edit VPC route table and change to serverB.

64 bytes from icmp_seq=21 ttl=64 time=1.04 ms
64 bytes from icmp_seq=22 ttl=64 time=0.981 ms
64 bytes from icmp_seq=28 ttl=64 time=0.951 ms
64 bytes from icmp_seq=29 ttl=64 time=0.943 ms
--- ping statistics ---
29 packets transmitted, 29 received, 0% packet loss, time 28443ms

Ping response time increased from 0.5ms to 1.0ms after switching from serverA to serverB. This is because the client and serverA are both in AZ-A but serverB is in AZ-B. To be sure is really routed to serverB lets check I get different response than earlier …

$ curl -s | grep title
<title>Welcome to nginx @serverB!</title>

Yes, is now routed to serverB and no ping packets were lost during the failover!


Above example works only because

  1. Both client and servers are within the same VPC. If the client would be outside of VPC, e.g. in on-prem network routing wouldn’t work as VPC would have refused packets addressed outside VPC CIRD (I’d love to be proven wrong)

  2. Failover IP is outside of VPC CIDR -range. This may first sound strange but it is requirement for this solution to work. I was hoping the same trick of modifying route tables would also allow routing traffic between partially overlapping VPCs but as it turns out, it is not possible to create a more specific route for network/address inside VPC CIDR than default local -route.

Similar works also in Google Cloud, and I think you can build something similar in Azure too(?)