Scaling NAT Gateways

AWS NAT Gateway is the service you need to reach the internet from the private subnets of a VPC. There is nothing wrong with the service itself; actually, it is great that you don’t have to run a critical piece of network infrastructure on self-managed EC2 instances. And a price of less than $0.05/hour is a pretty good deal for all the effort it saves.

The problem appears when you start scaling things up. A single NAT Gateway costs about $0.05/hour (rounding up and excluding traffic), and the best practice is to have one per AZ. Deploy it into a VPC spanning 3 AZs and the total cost is $0.15/VPC/hour. Multiply that by 3 tiers (dev, test and prod) for each application and you have $0.45/application/hour. And because sharing accounts or VPCs between applications is not recommended, this becomes $45/hour for 100 applications. Now the cost of internet access is almost $400,000 a year!
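The arithmetic above can be reproduced with a few lines of Python. This is only a sketch using the rounded $0.05/hour figure from the text; the function and parameter names are mine, and traffic charges are left out just like in the paragraph above.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def nat_cost_per_year(nat_hourly_usd, azs, tiers, applications):
    """Yearly NAT Gateway cost with one gateway per AZ in every VPC."""
    vpcs = tiers * applications            # one VPC per tier per application
    hourly = nat_hourly_usd * azs * vpcs   # hourly charge only, no traffic
    return hourly * HOURS_PER_YEAR


yearly = nat_cost_per_year(nat_hourly_usd=0.05, azs=3, tiers=3, applications=100)
print(f"${yearly:,.0f} per year")
```

Swap in the actual hourly rate of your region to get a number closer to your own bill.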

I don’t think many will scale that far, but long before that you start thinking about trade-offs, like deploying only a single NAT gateway for non-critical environments, to reduce the cost. However, it would be nice to have your cake (= an HA network setup) and eat it too (= keep the cost under control) while scaling applications and environments.

Saving opportunity

For a long time I thought there was no way around this. Even if I built a shared VPC to host the NAT gateways, I would still have to pay a similar price for each Transit Gateway attachment to connect the subnets to the shared NATs. Until I saw this discussion on the AWS Community Builders Slack. It sounded almost too good to be true. I had to test this myself …

So I deployed a Transit Gateway and attached it to 2 subnets in a VPC.

A day later I went to check my Cost Explorer. Yes! It is as the documentation says: a TGW attachment is charged per VPC. It doesn’t matter how many AZs or subnets it is attached to.

If the NAT Gateway cost for an HA network setup used to be

$0.05/NAT/hour * 3 AZs * VPCs

It is now reduced to

$0.05/NAT/hour * 3 AZs + $0.05/TGW attachment/hour * (VPCs + 1)

The + 1 is the attachment for the shared VPC that hosts the NAT gateways. For the 300 VPCs in my example, the difference would be $394,200 vs $133,152 per year, i.e. a saving of about $260,000 each year. The numbers are not 100% accurate and depend a bit on which region, or how many regions, you have deployed services in, but you could expect to save 1/2 to 2/3 of your current spend on NAT Gateways. Not bad at all, even if you don’t have 300 VPCs.
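To see how the saving behaves with fewer VPCs, the two formulas above can be put side by side. This is a rough sketch: the $0.05/hour rates are the rounded figures from this post, the function names are made up for the example, and data processing charges for both NAT Gateway and Transit Gateway are ignored.

```python
HOURS_PER_YEAR = 24 * 365


def per_vpc_nat_hourly(vpcs, azs=3, nat_rate=0.05):
    """Hourly cost when every VPC has its own NAT gateway in each AZ."""
    return nat_rate * azs * vpcs


def shared_nat_hourly(vpcs, azs=3, nat_rate=0.05, tgw_rate=0.05):
    """Hourly cost with one shared set of NAT gateways behind a Transit Gateway.

    Every VPC needs one TGW attachment, plus one for the shared VPC itself.
    """
    return nat_rate * azs + tgw_rate * (vpcs + 1)


for vpcs in (2, 3, 30, 300):
    before = per_vpc_nat_hourly(vpcs) * HOURS_PER_YEAR
    after = shared_nat_hourly(vpcs) * HOURS_PER_YEAR
    print(f"{vpcs:>3} VPCs: ${before:>9,.0f} vs ${after:>9,.0f} per year")
```

With these assumed rates the two setups cost the same at 2 VPCs, and the shared setup is cheaper from 3 VPCs onwards. Traffic charges are not included, so treat the break-even point as indicative only.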

Cheaper but better

The best part is that it isn’t just cheaper but also better. If you used to save on NAT costs by deploying NAT gateways in just one (or two) AZs, you can now upgrade your setup to cover all 3 AZs with independent NAT gateways. Better HA, for the same or less spend.

As a side effect, you now also have a place to deploy centralized egress controls for all internet traffic, which will make your network security team very happy :-)

Caveat

To connect VPCs to shared NATs through a Transit Gateway, the VPC CIDRs must not overlap. This is a basic requirement for connecting VPCs, but if you start from a bunch of disconnected VPCs, they may well use the same IP ranges and can’t be connected. Or actually they can, but the solution would involve private NAT Gateways that will eat all your savings. You would still get the option to deploy centralized egress controls, though.
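Before planning the migration it is worth checking which of your existing VPCs would collide. Here is a minimal sketch using Python's standard ipaddress module; the VPC names and CIDR ranges are invented for illustration, and in practice you would feed in the CIDRs exported from your own accounts.

```python
from ipaddress import ip_network
from itertools import combinations


def overlapping_cidrs(vpc_cidrs):
    """Return pairs of VPC names whose CIDR blocks overlap."""
    networks = {name: ip_network(cidr) for name, cidr in vpc_cidrs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical inventory: VPC name -> primary CIDR
vpcs = {
    "app1-dev": "10.0.0.0/16",
    "app1-prod": "10.1.0.0/16",
    "app2-dev": "10.0.128.0/20",
}
print(overlapping_cidrs(vpcs))  # [('app1-dev', 'app2-dev')]
```

Any pair reported here cannot be routed through the same Transit Gateway route table as is, so those VPCs would need re-addressing or the private NAT Gateway workaround mentioned above.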