The large cloud zombie resource tax

Look, you can't have these six roaming around your P&L.

Nobody look at the P&L this quarter.

Introduction

A major benefit of working with a widely diverse client group is that we can identify common sets of problems and misconfigurations. Of course, each client faces unique challenges and constraints, but clients share many issues. These commonalities have become part of our day-zero bill review process, and today we will share them with you.

Resource Rightsizing

The cloud features an array of resources. One of the three basic primitives (and often the most expensive) is compute. The cloud promised increased simplicity by eliminating the need to order hardware and manage complexity, but the real-world experience is more complicated. Teams often lack the level of platform maturity to have dynamic resource rightsizing through automation. The result is that resource rightsizing must become a part of a monthly or quarterly review.

Additionally, capacity constraints can make moving large amounts of compute a multi-month process. Those Arm processors you’re waiting on from your favorite cloud provider (Graviton, Ampere, Arm) can take weeks to months to come in if you’re a client running at scale. Planning ahead and factoring in the lead time required to procure and deploy the necessary compute resources becomes vital.

Instance Generation

One of the simplest quick wins in optimizing cloud infrastructure is finding instance generation upgrades. Most clients don’t keep a compendium of instance equivalency charts on their bedside table, so these upgrades often go unnoticed. Among our clients, finding them has become known as "the easiest thing they've ever done, tied with GP2 upgrades." Upgrading to the latest generation of instances typically improves performance and can result in significant cost savings.
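As a rough sketch of how older generations might be flagged from a fleet inventory, the snippet below parses instance type names and compares them against a table of newest generations. The `LATEST_GENERATION` table and its values are hypothetical placeholders that you would maintain yourself from your provider's documentation, and the regular expression only covers common names of the form `m4.large`.

```python
import re

# Hypothetical, hand-maintained table: family letter -> newest generation you
# are willing to run. The numbers are placeholders, not an official chart.
LATEST_GENERATION = {"m": 6, "c": 6, "r": 6, "t": 3}

# Matches common names such as "m4.large" or "r5dn.2xlarge".
_TYPE_RE = re.compile(r"^([a-z]+)(\d+)([a-z-]*)\.(\w+)$")


def parse_instance_type(instance_type):
    """Split 'm4.large' into (family, generation, attributes, size), or None."""
    match = _TYPE_RE.match(instance_type)
    if match is None:
        return None
    family, generation, attributes, size = match.groups()
    return family, int(generation), attributes, size


def upgrade_candidates(instance_types, latest=LATEST_GENERATION):
    """Return the instance types that run below the newest known generation."""
    candidates = []
    for instance_type in instance_types:
        parsed = parse_instance_type(instance_type)
        if parsed is None or parsed[0] not in latest:
            continue  # unknown naming scheme or family: leave for manual review
        if parsed[1] < latest[parsed[0]]:
            candidates.append(instance_type)
    return candidates
```

Anything the parser cannot classify is skipped on purpose, so the output is a list of likely candidates for review and not a migration plan.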

Workload Utilization Targets

A frequent client question is “How do we know if we have optimal workload utilization targets?” Giving case-specific advice presents a challenge without using more permanent resources like databases, but we use a five-tier classification to evaluate appropriate utilization. The targets are:

  • Bronze: Below 50% utilization - This level of utilization may indicate untapped resources. Potential cost savings can be realized by resizing instances or consolidating workloads.

  • Silver: 50-60% utilization - A moderate level of utilization that balances performance and cost, appropriate for non-critical workloads with some room for growth.

  • Gold: 70-80% utilization - This target range is ideal for most production workloads, providing a good balance between performance, cost, and resource availability.

  • Platinum: 80-85% utilization - A superior level of utilization, indicative of a strong engineering culture that seeks to maximize utilization.

  • Diamond: Over 90% utilization - The highest level of utilization, reserved for extremely demanding workloads where performance is the top priority.
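The tiers above can be sketched as a small lookup. The function name is our own illustration, and two choices here are assumptions: a reading of exactly 80% is assigned to Platinum, and readings in the ranges the list does not name (between 60% and 70%, and between 85% and 90%) return `None`.

```python
def utilization_tier(pct):
    """Map an average utilization percentage to one of the five tiers."""
    if not 0 <= pct <= 100:
        raise ValueError("utilization must be between 0 and 100")
    if pct < 50:
        return "Bronze"
    if pct <= 60:
        return "Silver"
    if 70 <= pct < 80:
        return "Gold"
    if 80 <= pct <= 85:  # assumption: the shared 80% boundary goes to Platinum
        return "Platinum"
    if pct > 90:
        return "Diamond"
    return None  # ranges not named in the tier list
```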

Instance Storage

Instance storage costs frequently rank among the top five cost drivers for our clients. Despite this, many clients overlook this entire expense category, assuming that there is little room for optimization. In fact, there are several strategies to optimize instance storage to save money and improve overall efficiency. Here are the two items we look for first:

Base Volume Size

A key to achieving optimal instance storage costs is the base volume size. Selecting the appropriate volume type and size produces the best balance between performance and cost. Many clients leverage the default size of 8 GB or 16 GB with average volume utilization sitting at less than 1%, which is inefficient and costly. We’ve seen some clients realize significant savings by reducing root volume sizes down to 1 GB for compute-optimized workloads that don’t utilize disks.
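One way to spot oversized root volumes is to compare used space against provisioned space. The sketch below assumes you have already collected usage figures (for example from a monitoring agent) into dictionaries with the hypothetical keys `volume_id`, `size_gb`, and `used_gb`. The 10% threshold and the 2x headroom factor are illustrative defaults, not recommendations.

```python
import math


def oversized_volumes(volumes, max_utilization=0.10):
    """Return (volume_id, utilization) pairs for volumes below the threshold."""
    flagged = []
    for volume in volumes:
        utilization = volume["used_gb"] / volume["size_gb"]
        if utilization < max_utilization:
            flagged.append((volume["volume_id"], utilization))
    return flagged


def suggested_size_gb(used_gb, headroom=2.0, minimum_gb=1):
    """Suggest a size: used space times a headroom factor, rounded up."""
    return max(minimum_gb, math.ceil(used_gb * headroom))
```

EBS volumes cannot be shrunk in place, so acting on a suggestion generally means building a new image or launch configuration with the smaller root volume.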

GP3

GP3 volumes provide a balance of price and performance, making them suitable for most use cases. They offer a baseline performance of 3,000 IOPS and 125 MB/s throughput, which can be scaled independently to meet your specific requirements. By choosing the correct size and performance settings for your GP3 volumes, you can optimize storage costs without sacrificing performance.

If you’re running GP2, upgrade to GP3. It’s an effortless upgrade with zero downtime, and most customers can drop their instance storage costs by 20%. We have yet to find a valid reason for clients to be running GP2.
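To estimate what a GP2 to GP3 migration might save, a sketch like the following can be run over a volume inventory. The input dictionaries mirror the `VolumeId`, `VolumeType`, and `Size` fields found in the `Volumes` entries of an EC2 `describe_volumes` response. The per-GB monthly prices are assumptions that vary by region and over time, so check current pricing before relying on the result.

```python
# Assumed list prices in USD per GB-month; verify against current pricing.
GP2_PRICE = 0.10
GP3_PRICE = 0.08
GP3_BASELINE_IOPS = 3000  # baseline quoted in the text above


def gp3_migration_report(volumes, gp2_price=GP2_PRICE, gp3_price=GP3_PRICE):
    """Summarize GP2 volumes and the estimated monthly storage savings."""
    candidates, needs_review, savings = [], [], 0.0
    for volume in volumes:
        if volume["VolumeType"] != "gp2":
            continue
        candidates.append(volume["VolumeId"])
        savings += volume["Size"] * (gp2_price - gp3_price)
        # GP2 baseline IOPS scale at 3 per GiB, so large volumes may exceed the
        # GP3 baseline and need extra provisioned IOPS after migrating.
        if volume["Size"] * 3 > GP3_BASELINE_IOPS:
            needs_review.append(volume["VolumeId"])
    return {
        "candidates": candidates,
        "needs_iops_review": needs_review,
        "estimated_monthly_savings": round(savings, 2),
    }
```

Volumes listed under `needs_iops_review` are still candidates. They simply deserve a look at performance settings first, since any additional provisioned IOPS would reduce the savings.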

Orphans & Zombies

Orphaned and zombie resources are two classes of resource waste that can significantly impact the efficiency and cost of your cloud infrastructure. Identifying and eliminating these resources is crucial, as is devising a strategy to prevent recurrence.

  • Orphaned resources were once attached to or used by other resources but are now disconnected and are no longer in use. They continue to incur charges and consume resources without providing any value.

  • Zombie resources are outdated or unused resources that still exist in your infrastructure, consuming storage space and incurring costs without serving any purpose.

Sources of orphaned resources:

Remove unattached volumes

Unattached EBS volumes are orphaned resources that are no longer connected to any instances but still incur charges. Identifying and deleting these volumes can lead to significant cost savings and a smaller infrastructure footprint.
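A minimal sketch of the identification step, assuming the input is a list of dictionaries shaped like the `Volumes` entries of an EC2 `describe_volumes` response (for example, fetched with boto3). The helper names are our own. A volume in the `available` state with no attachments is treated as orphaned.

```python
def unattached_volumes(volumes):
    """Return volumes that are in the 'available' state with no attachments."""
    return [
        volume
        for volume in volumes
        if volume.get("State") == "available" and not volume.get("Attachments")
    ]


def total_size_gib(volumes):
    """Sum the provisioned size of the given volumes in GiB."""
    return sum(volume["Size"] for volume in volumes)
```

Reviewing the list, and snapshotting anything that might still be needed, before deleting is a sensible precaution because volume deletion is not reversible.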

Release unattached Elastic IPs

Unattached Elastic IPs are orphaned resources that are not associated with any running instances. Releasing these IPs can help you optimize costs and free up valuable IP addresses for future use.
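The same idea applies to Elastic IPs. This sketch assumes a list of dictionaries shaped like the `Addresses` entries of an EC2 `describe_addresses` response. The function name is hypothetical. An address with no association, instance, or network interface is treated as unattached.

```python
def unassociated_addresses(addresses):
    """Return Elastic IP records that are not associated with anything."""
    return [
        address
        for address in addresses
        if not (
            address.get("AssociationId")
            or address.get("InstanceId")
            or address.get("NetworkInterfaceId")
        )
    ]
```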

Sources of zombie resources:

Remove outdated snapshots

Outdated or unused EBS snapshots continue to consume storage space and add to your costs. For example, while investigating an incident with a client, we discovered that they captured snapshots of troubled VMs. However, they never actually cleared these snapshots, which led to thousands of dollars wasted on storing completely useless snapshots.
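A simple age-based filter is one way to surface snapshots like these. The sketch below assumes dictionaries shaped like the `Snapshots` entries of an EC2 `describe_snapshots` response, where `StartTime` is a timezone-aware `datetime`. The 90-day cutoff is an arbitrary illustration. Pick a retention period that matches your own policy.

```python
from datetime import datetime, timedelta, timezone


def stale_snapshots(snapshots, max_age_days=90, now=None):
    """Return snapshots started before the cutoff, oldest first."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    old = [snapshot for snapshot in snapshots if snapshot["StartTime"] < cutoff]
    return sorted(old, key=lambda snapshot: snapshot["StartTime"])
```

Age alone does not prove a snapshot is useless, so the output is best treated as a review queue. A lifecycle policy that expires incident snapshots automatically would address the recurrence side.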

Remove unused images

Unused images, like Amazon Machine Images (AMIs), are another type of zombie resource. These are custom images that were created for launching instances but are no longer needed. Identifying and de-registering unused AMIs can help reduce storage costs and simplify your infrastructure management.
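As a first pass, unused AMIs can be found by comparing the images you own with the images your instances were launched from. This sketch assumes `images` is a list shaped like the `Images` entries of `describe_images` and `instances` is a flattened list of instance records that each carry an `ImageId`. The function name is our own.

```python
def unused_images(images, instances):
    """Return IDs of images that no listed instance was launched from."""
    in_use = {instance["ImageId"] for instance in instances}
    # Caveat: launch templates and auto scaling configurations can also
    # reference an image, so check those before de-registering anything.
    return [image["ImageId"] for image in images if image["ImageId"] not in in_use]
```

De-registering an AMI does not remove the snapshots that back it, so those need to be cleaned up separately to realize the storage savings.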

If you found this helpful, please let us know! The Glenrose Group is a select set of expert cloud practitioners who love to tame these wild, out-of-control infrastructure bills. Let’s optimize your bills next.


Elizabeth Flowers
Founder & Chief Cloud Scientist @ Glenrose Group