Optimizing NATGW Costs by 90%
Stop shoveling truckloads of money to AWS
Introduction
To optimize cloud costs, many organizations focus on tooling, broader architectural changes, vendor migrations, and/or contract renegotiations. While these may bring positive returns, some of the greatest returns can come from the simplest changes. One such change relates to a small component that sits at the heart of most organizations' infrastructure footprint.
This component is known as a NAT Gateway (NATGW), short for Network Address Translation Gateway, and it is responsible for sending traffic to, and receiving replies from, the public internet. The NAT Gateway does more than just send packets out into the world, however: it translates between the private IP space into which your organization deploys its services and the public IP space we all share to access applications like YouTube or this website!
Sidenote: It’s likely you get your home internet from an ISP, and just as your organization might have a NAT Gateway, your ISP may run an even bigger version called Carrier-grade NAT (CGNAT)!
A cloud networking primer
In the cloud, you’ll typically start by provisioning a virtual private cloud (VPC): a virtual network that acts as an isolated environment for your cloud infrastructure. To illustrate the process, imagine you’re starting a company and you want each building (VPC) to have its own phone number. As a visual guide, let’s pretend we’re setting up an account in us-east-1 (an AWS region) and the “phone number” we’re going to give our VPC is 192.168.0.0/16.
Our basic VPC primer
The next step in preparing the VPC is to divide it into subnets, each of which is a segment of the VPC's address range. Continuing with our analogy, envision assigning each floor a unique extension number (e.g., floor 1 -> extension 1000) that streamlines communication. Now, let's create two subnets, one public and one private, to enhance your network's security and accessibility.
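To make the subnet split concrete, here is a minimal sketch using Python's standard `ipaddress` module. It uses 192.168.0.0/16, the canonical network address for the example's /16 range. The /24 subnet size and the variable names are assumptions for illustration; the post does not specify them.

```python
import ipaddress

# The VPC's address range (the "phone number" from the analogy).
vpc = ipaddress.ip_network("192.168.0.0/16")

# Carve the VPC into /24 blocks and take the first two.
# The /24 size is an assumption; any prefix longer than /16 would work.
blocks = vpc.subnets(new_prefix=24)
public_subnet = next(blocks)
private_subnet = next(blocks)

print(public_subnet)                   # 192.168.0.0/24
print(private_subnet)                  # 192.168.1.0/24
print(private_subnet.subnet_of(vpc))   # True
print(public_subnet.overlaps(private_subnet))  # False
```

Each subnet is a non-overlapping slice of the VPC's range, which is what lets you attach different routing rules to each one.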
Subnets
We’ve set up two subnets: one public and one private.
Within these two subnets, we have some special rules. Private subnets provide an isolation layer that keeps resources from being accidentally exposed to the public internet. Resources deployed within them don't carry a public IP address. Picture this subnet as a secure extension line connected to a prominent corporate number (e.g. extension 1458). On the other hand, resources in public subnets can have public IP addresses, akin to having a direct line to the CEO's office.
Let’s create two EC2 instances
Internet Gateways and NAT Gateways
For these instances to communicate with resources out on the public internet, each needs a gateway. The public subnet (akin to a company’s phone system that is equipped to call out) needs a route to an Internet Gateway, which is attached to the VPC, to allow resources to communicate with the public internet. If a resource has a public IP address it can communicate through the Internet Gateway. In the private subnet (akin to a company's internal phone system) a resource needs a NAT GW, which itself sits in a public subnet, to reach the Internet Gateway and gain access to the internet.
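The two traffic paths can be sketched as a small model. This is not an AWS API: the `Subnet` and `Instance` classes, the `egress_path` function, and the hop names are all hypothetical, chosen only to show which gateway each kind of instance passes through.

```python
from dataclasses import dataclass


@dataclass
class Subnet:
    name: str
    default_route: str  # where internet-bound traffic is sent: "igw" or "natgw"


@dataclass
class Instance:
    name: str
    subnet: Subnet
    has_public_ip: bool


def egress_path(instance: Instance) -> list[str]:
    """Return the hops an outbound packet takes to reach the internet."""
    route = instance.subnet.default_route
    if route == "igw":
        # The Internet Gateway only serves resources that have a public IP.
        if not instance.has_public_ip:
            raise ValueError(f"{instance.name} has no public IP")
        return [instance.name, "internet-gateway", "internet"]
    if route == "natgw":
        # The NAT GW translates the private address, then uses the IGW itself.
        return [instance.name, "nat-gateway", "internet-gateway", "internet"]
    raise ValueError(f"unknown route target: {route}")


public = Subnet("public", default_route="igw")
private = Subnet("private", default_route="natgw")

web = Instance("web", public, has_public_ip=True)
worker = Instance("worker", private, has_public_ip=False)

print(egress_path(web))     # ['web', 'internet-gateway', 'internet']
print(egress_path(worker))  # ['worker', 'nat-gateway', 'internet-gateway', 'internet']
```

The extra hop on the private path is the one that gets billed, which is where the next section picks up.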
The traffic path of instances communicating with resources in the open internet.
What are the costs of NATGWs?
There are two primary cost drivers for NAT GWs: a fixed hourly cost and a variable cost per gigabyte processed by the NAT GW. Internet gateways are cost-free. List prices for NAT GWs in us-east-1 are $0.045/hr ($32.85 for a 730-hour month) and $0.045/GB for traffic passing through the NAT GW. The minimum cost to deploy one of these is therefore $32.85/month. If you want a high-availability solution, you’ll need a NAT GW, each billed hourly, in every availability zone where your services are present. For most regions that means deploying three NAT GWs at $32.85/month each, or almost $100 a month before you send the first gigabyte.
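As a rough sketch, the pricing model above reduces to one formula: gateways × hourly rate × hours, plus gigabytes × per-GB rate. The function name and the 730-hour month are assumptions (730 hours is what makes $0.045/hr come out to $32.85); the rates are the us-east-1 list prices quoted above.

```python
HOURLY_RATE = 0.045     # USD per NAT GW per hour (us-east-1 list price quoted above)
PER_GB_RATE = 0.045     # USD per GB passing through a NAT GW
HOURS_PER_MONTH = 730   # assumption: average month length used for billing estimates


def natgw_monthly_cost(gateways: int, gb_processed: float) -> float:
    """Estimate the monthly NAT GW bill: fixed hourly charge plus per-GB charge."""
    fixed = gateways * HOURLY_RATE * HOURS_PER_MONTH
    variable = gb_processed * PER_GB_RATE
    return fixed + variable


# One gateway, no traffic: the minimum cost of having a NAT GW at all.
print(f"{natgw_monthly_cost(1, 0):.2f}")  # 32.85

# Three gateways (one per availability zone), still no traffic.
print(f"{natgw_monthly_cost(3, 0):.2f}")  # 98.55
```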
The outrageous costs of NAT GWs
Example costs for an organization
Let’s create a hypothetical example: SaaSCo is a solution provider with three distinct product lines. Each product line has three environments (or accounts): one for development, one for staging (pre-production), and one for production. All accounts are deployed into us-east-1 across all six availability zones, with one NAT GW per availability zone. Development and staging each average roughly 1 TB a month per gateway, while production does 25 TB per gateway.
With our hypothetical example above, each product line costs over $8,000 a month!
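The SaaSCo figure can be reproduced with a short calculation. This sketch assumes one NAT GW per availability zone, a 730-hour month, and 1 TB = 1,024 GB; those assumptions are ours, and the variable names are illustrative.

```python
HOURLY_RATE = 0.045     # USD per NAT GW per hour
PER_GB_RATE = 0.045     # USD per GB passing through a NAT GW
HOURS_PER_MONTH = 730   # assumption: average month length
GB_PER_TB = 1024        # assumption: binary terabytes
GATEWAYS_PER_ENV = 6    # one NAT GW in each of the six availability zones

# Monthly traffic per gateway, in TB, for each environment of one product line.
TB_PER_GATEWAY = {"development": 1, "staging": 1, "production": 25}


def environment_cost(tb_per_gateway: float) -> float:
    """Monthly NAT GW cost for one environment (all of its gateways)."""
    fixed = GATEWAYS_PER_ENV * HOURLY_RATE * HOURS_PER_MONTH
    variable = GATEWAYS_PER_ENV * tb_per_gateway * GB_PER_TB * PER_GB_RATE
    return fixed + variable


costs = {env: environment_cost(tb) for env, tb in TB_PER_GATEWAY.items()}
product_line_total = sum(costs.values())

for env, cost in costs.items():
    print(f"{env:12s} ${cost:,.2f}")
print(f"{'total':12s} ${product_line_total:,.2f}")
# development  $473.58
# staging      $473.58
# production   $7,109.10
# total        $8,056.26
```

Nearly all of the bill comes from the per-gigabyte charge on production traffic, not from the hourly fee.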
What are the alternative solutions?
Before the managed NAT GW service existed, organizations ran NAT instances. These are regular compute instances running standard Linux services that enable them to perform a nearly identical function to NAT GWs. This is the next money-saving option for you. If you run NAT instances, you’re charged the base hourly cost of the instance (a fixed monthly cost) rather than a fee for every gigabyte processed.
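To see why the fixed-cost model matters, here is a hedged comparison sketch. The NAT GW rates are the us-east-1 list prices quoted earlier; the instance hourly price is a placeholder input, not an AWS list price, and the function names are hypothetical. Data transfer charges that apply either way are left out.

```python
HOURS_PER_MONTH = 730   # assumption: average month length
NATGW_HOURLY = 0.045    # USD per NAT GW per hour
NATGW_PER_GB = 0.045    # USD per GB passing through a NAT GW


def natgw_monthly(gb_processed: float) -> float:
    """One NAT GW: hourly charge plus a per-GB charge."""
    return NATGW_HOURLY * HOURS_PER_MONTH + NATGW_PER_GB * gb_processed


def nat_instance_monthly(instance_hourly: float) -> float:
    """One NAT instance: only the instance's hourly price, whatever the traffic."""
    return instance_hourly * HOURS_PER_MONTH


def monthly_savings(gb_processed: float, instance_hourly: float) -> float:
    """Positive when the NAT instance is cheaper than the NAT GW."""
    return natgw_monthly(gb_processed) - nat_instance_monthly(instance_hourly)


# Placeholder instance price of $0.05/hr; substitute the real price of
# whichever instance type you would run.
print(f"{monthly_savings(1024, 0.05):.2f}")       # 42.43
print(f"{monthly_savings(25 * 1024, 0.05):.2f}")  # 1148.35
```

Because the instance cost does not grow with traffic, the gap widens as volume increases, provided the chosen instance can actually carry that volume.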
AWS actually has a very interesting comparison of NAT gateways vs NAT instances. We think there is a bit of a bias here so here’s our take on it:
The holy grail: Alternat
Alternat is an open-source solution that leverages NAT instances to provide a more cost-effective option with minimal maintenance burden out of the box. From Alternat’s description we can quickly learn what makes it a fantastic drop-in solution for your organization:
Self-provisioned NAT instances in Auto Scaling Groups
Standby NAT Gateways with health checks and automated failover, facilitated by a Lambda function
Vanilla Amazon Linux 2 AMI (no AMI management requirement)
Optional use of SSM for connecting to the NAT instances
Max instance lifetimes (no long-lived instances!) with automated failover
A Terraform module to set everything up
Compatibility with the default naming convention used by the open source terraform-aws-vpc Terraform module
If you’re interested in the architecture of Alternat, please check out their repository, which is full of gems like the architecture diagram below.
Alternat’s architecture overview
Revisiting our earlier example, if we replace our NAT GWs with Alternat we have a few more options at our disposal. Our development and staging environments can use basic EC2 instances, while production gets an instance sized for its traffic.
Overall, leveraging Alternat would reduce monthly costs from $8,056 to $700, a ~90% reduction.
If you found this helpful, please let us know! The Glenrose Group is a select set of expert cloud practitioners who love to tame these wild, out-of-control infrastructure bills. Mention this newsletter to get a free consult.
Elizabeth Flowers
Founder & Chief Cloud Scientist @ Glenrose Group