When valuing any business, one of the most important factors is the cost of goods sold (COGS). For every dollar the business makes, how many dollars does it cost to deliver?

For a traditional business, there are many ways to minimize COGS. You could optimize your supply chain, find cheaper raw materials, or negotiate better rates with your suppliers. But all of these take time.

In the age of the cloud, we’ve entered a new reality. Your costs might grow by 10x overnight as a result of a sudden increase in volume or a one-line config change.

In September 2018, things reached fever pitch. We were contemplating a new funding round, and all of our business metrics looked great—revenue growth, customer churn, and new product attach rates were all trending in the right direction. All of them, that is, except our gross margin.

As one board member put it: “Your gross margin is a bit of a black eye on the whole economics of the business.”

To remedy this, in the six months between July and January, we managed to reduce our infrastructure cost by 30%, while simultaneously increasing traffic volume by 25% over the same period.

Two years ago, we shared the details of how we shaved $1m off our AWS bill. This is the story of how we improved our gross margin by 20% in 90 days.

Cost drivers and monitoring

Before digging into how we achieved such dramatic savings, I’ll quickly outline how we think about our infrastructure costs.

Segment provides an API to help companies easily move their data from the places where it originates (sources e.g. web, mobile, server or cloud app) to the places where it’s valuable (destinations e.g. Google Analytics, Mixpanel, Intercom, etc.). We generally model our cost as the cost to process one million events, split across five different product areas:

  1. Core product - the cost to ingest one million events and process them with Kafka.

  2. Streaming destinations - the cost to reliably transmit one million events.

  3. Cloud sources - the cost to pull one million events from APIs like Salesforce and Stripe.

  4. Warehouses - the cost to load one million rows into a customer’s warehouse.

  5. Personas - the cost to power one million events with Personas, our customer profile and audience management product.

For each of these products, we’ve tagged the raw AWS infrastructure which powers the product (EC2 instances, EBS volumes, RDS instances, network traffic, etc). Our first step was to look product by product, and understand where we could get the biggest wins.
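The driver math itself is simple: group tagged spend by product, then divide by event volume. A minimal sketch of the calculation (the tag names and dollar figures here are illustrative, not our real numbers):

```go
package main

import "fmt"

// BillingItem is a hypothetical line item from a tagged billing report:
// each AWS resource carries a product tag, so spend can be grouped.
type BillingItem struct {
	Product string // e.g. "core", "warehouses", "personas"
	CostUSD float64
}

// CostPerMillion computes the headline driver for one product:
// dollars spent per one million events processed.
func CostPerMillion(items []BillingItem, product string, events int64) float64 {
	var total float64
	for _, it := range items {
		if it.Product == product {
			total += it.CostUSD
		}
	}
	return total / (float64(events) / 1e6)
}

func main() {
	items := []BillingItem{
		{Product: "core", CostUSD: 12000},
		{Product: "core", CostUSD: 3000},
		{Product: "warehouses", CostUSD: 8000},
	}
	// $15,000 of tagged "core" spend over 500M events = $30 per million.
	fmt.Println(CostPerMillion(items, "core", 500_000_000))
}
```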

So we turned to the age-old tool of strategic finance teams and corporate accountants: the humble spreadsheet. 

For each project, we’d estimate the dollar amount saved, the cost driver, and then assign an ‘owner’ to drive the project to completion.

All told, our ~50 or so projects added up to saving more than $5 million per year. Here are the top 10 or so projects worth writing about. 

NSQ reduction

For the past five years, we’ve used NSQ to power the majority of our pipeline. It’s acted as the distributed queue to help us process hundreds of billions of messages per month. It’s the backbone that allowed our services to fail without losing data.

NSQ shines in terms of simplicity. It’s incredibly easy to run, has a low memory footprint, and doesn’t require any central coordination from a system like Zookeeper or Consul.

Unfortunately, the overall cost of NSQ boiled over this past year. To understand why, we’ll give you a flavor of how NSQ works.

NSQ is distributed as a single Go binary. You run it as a standalone process (nsqd), and it exposes a server interface. 

Writers can publish messages to the binary, which are kept in memory until they exceed a high-water mark. Readers subscribe to a different endpoint on the server to read messages from the daemon.

                      /---- Reader A
  Writer -> nsqd --<
                      \---- Reader B

The recommended way to run NSQ is by running a single nsqd binary on each of your instances. 

Each nsqd registers itself with an nsqlookupd, a discovery service shipped with NSQ. When readers discover which nsqd instances to connect to, they first make a request to the nsqlookupd and then connect to the full set of nsqd nodes.

The result is a ‘mesh’ network between readers and writers. Effectively, every reader connects to every nsqd that it has discovered.

The primary benefit of this architecture is availability in the face of network partitions. Writers can almost always publish to their local instance (excluding a failure with storage), while a network partition will only halt the reader’s ability to process the pipeline. 

It also gives us some measure of isolation; even if a single nsqd fails, the rest of the pipeline can continue processing data.

The last benefit is simplicity. Running standalone binaries is incredibly straightforward, but we were making the trade-off in terms of infrastructure cost. 

These costs fell into three major buckets: CPU and memory, disk, and network.

CPU and memory cost

The first cost issue was that we had to ‘reserve’ space for the nsqd process on each host. Remember, in our setup, the nsqd daemon is co-located with the publisher, so we run one on each host. In addition to this daemon, we’ll typically run 50-100 other bin-packed containers, to take full advantage of all cores and the host’s single-tenancy.

In normal circumstances, the nsqd may be doing minimal work (a happy queue is an empty queue). But in an outage, we need to guarantee our process will be able to handle the backlog. 

We ended up reserving 8gb and 8 vCPUs for each instance (20% of the instance). While these CPUs sat idle during normal operation, we had to reserve enough for the maximum work we’d expect NSQ to do during an outage.

Disk cost

The second cost concern came from NSQ’s spillover mechanics. When messages exceed the in-memory high watermark, NSQ writes that data to disk. In the case of an extended outage, we want to guarantee we have enough capacity on disk.
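The spillover behavior can be modeled in a few lines. This is a toy sketch, not NSQ's implementation (the real in-memory limit is nsqd's --mem-queue-size option):

```go
package main

import "fmt"

// Toy model of NSQ-style spillover: messages queue in memory until a
// high-water mark, then overflow to a disk-backed queue.
type Queue struct {
	highWaterMark int      // max messages held in memory
	memory        [][]byte // in-memory queue
	diskBytes     int      // bytes spilled to disk
}

func (q *Queue) Publish(msg []byte) {
	if len(q.memory) < q.highWaterMark {
		q.memory = append(q.memory, msg)
		return
	}
	q.diskBytes += len(msg) // overflow: this write now costs disk capacity
}

func main() {
	q := &Queue{highWaterMark: 2}
	for i := 0; i < 5; i++ {
		q.Publish(make([]byte, 1024)) // five 1 KiB messages
	}
	// Two messages stay in memory; three spill to disk (3072 bytes).
	fmt.Println(len(q.memory), q.diskBytes)
}
```

At ten thousand 1 KiB messages per second, roughly 36gb spills to disk for every hour of an outage, so a 500gb drive buys around 14 hours of backlog.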

When a single instance is processing tens of thousands of messages per second, data can quickly add up. We gave each host a 500gb GP2 EBS drive to compensate, adding $50 per month in cost for each compute instance we’d run, regardless of whether we actually needed the disk.

Network cost

The last and most insidious cost came from the cross-network traffic. In our setup, readers simply read from a randomly connected nsqd instance.

We want Segment to be highly available, so we run instances across three availability zones (AZs). Amazon promises that each AZ is located in a physically isolated data center, so downtimes related to power outages should be independent. 

Our auto-scaling groups automatically balance instances between the zones. Running cross-AZ lets us continue to process data, even in the event of downtime from a single AZ.

However, this means that, on average, two-thirds of our reads cross an availability zone boundary, and we pay for transferring that data (we’ll go into more depth on the pricing down below).

Migrating

We discussed several ways to cut costs with NSQ. We could move from the co-located model to a centralized cluster of NSQ instances, which would reduce the per-instance ‘spacer’ and the expensive disk drives. We could lower disk space (everywhere except at the ‘front door’) and have the system apply back pressure all the way through.

In the end, we decided to take two separate approaches unique to the architecture requirements of our pipeline. 

Inbound: the inbound part of the Segment pipeline is a core part of the product. It consists of our “tracking API”, the front door API we use to collect hundreds of thousands of messages per second.

One of the core requirements of this API is that it must achieve near-perfect availability. If the API goes down for any reason, we end up dropping data. We can sacrifice consistency by delaying data delivery, but availability is the #1 guarantee of the tracking API. 

The fact that nsqd can run co-located on an instance and is isolated from all other nodes makes it appealing for this super-high availability use case. Even in the case of a network partition, one of our inbound instances will happily continue to accept data, so long as the load balancer continues to send it requests. We don’t have to worry about leader election, coordination, or any other sort of ‘split-brain’ behavior.

To solve the network traffic problem on the inbound pipeline, we instead made our NSQ installation ‘zone-aware’. 

When a request comes in, the tracking API service publishes it to its local nsqd instance. Same zone, we’re good.

Then, when a reader connects, instead of connecting directly to the nsqlookupd discovery service, the reader connects to a proxy. The proxy has two jobs: one is to cache lookup requests, and the other is to return only in-zone nsqd instances for zone-aware clients.

Our forwarders that read from NSQ are then configured as one of these zone-aware clients. We run three copies of the service (one for each zone), and then have each send traffic only to the service in its zone. 

Using this intra-AZ routing and removing intermediate load balancers in favor of service discovery saved us nearly $11k in monthly spend.
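The proxy's zone filter is the interesting part. Here is a sketch of the idea (the Zone field is an assumption for illustration; in practice the proxy has to derive the zone, e.g. from each producer's broadcast address):

```go
package main

import "fmt"

// Producer mirrors the useful parts of an nsqlookupd lookup response.
// The Zone field is an assumption for illustration, not part of the
// real nsqlookupd response.
type Producer struct {
	BroadcastAddress string
	Zone             string
}

// FilterByZone keeps only the nsqd producers in the reader's own AZ,
// falling back to the full set so a zone outage doesn't strand readers.
func FilterByZone(producers []Producer, zone string) []Producer {
	var local []Producer
	for _, p := range producers {
		if p.Zone == zone {
			local = append(local, p)
		}
	}
	if len(local) == 0 {
		return producers
	}
	return local
}

func main() {
	ps := []Producer{
		{BroadcastAddress: "10.0.1.5:4150", Zone: "us-west-2a"},
		{BroadcastAddress: "10.0.2.9:4150", Zone: "us-west-2b"},
	}
	// A zone-aware client in us-west-2a only sees its in-zone nsqd.
	fmt.Println(FilterByZone(ps, "us-west-2a"))
}
```

The fallback matters: returning an empty producer list during a zone failure would trade a network-cost optimization for an availability incident.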

Everywhere else: for the other places of the pipeline, we decided to migrate our services to use Kafka.

We’d already used Kafka in a handful of places in our pipeline, but to eliminate our costly drives, NSQ spacers, and cross-network traffic, we’d have to migrate a bunch of our services to use Kafka.

That meant new client libraries, new messaging semantics, and new checkpointing. We used our open source Kafka-go library to provide a fast, idiomatic approach for connecting to our Kafka clusters. To get intra-AZ routing, we added zone-affinity to its consumer groups as well.

This work was big enough to warrant its own section, so we’ll cover more on the service discovery these tools use below.

Kafka optimization

When we decided to go all-in on moving our pipeline to Kafka, we began evaluating parts of our Kafka setup to make it as efficient as possible.

Consolidating clusters

An extremely quick win was consolidating our pipelines to use a single Kafka cluster.

In the early days of using Kafka, we had split clusters by product area and workload. This provided a better isolation story but meant a lot of our clusters ran with a ton of headroom.

At a minimum, we require each cluster to run with a replication factor of three for fault tolerance. For the small clusters, that meant we had a lot of CPU and disk space just waiting around. We moved these into a single set of auto-scaling groups per AZ to get significantly better utilization.

I3 Instances

We previously wrote about how we optimized our Kafka setup. Traditionally, we’ve run on the same c5.9xlarge instances as all the rest of our workloads.

In February of 2017, AWS launched new I3 instances, equipped with NVMe (non-volatile memory express) SSD chips. 

Instead of communicating over SATA like a traditional SSD, NVMe uses PCIe to transfer data back and forth, delivering high performance at a low cost. More importantly, I3 instances use local instance storage rather than EBS, meaning they can connect locally over a high-speed interface rather than network-attached storage.

For a write-heavy workload like Kafka, this means we see 10gbps disk bandwidth, rather than the 875mbps provided by regular EBS-optimized C-class instances.

We found the sweet spot to be running i3.8xlarge instances for our Kafka brokers. Moving to larger instance types meant less impact when we lost a single broker instance, while still providing fewer noisy neighbors, better network I/O, and more consolidated disk space.

Analytics.js tree-shaking

For context, Analytics.js is the JavaScript bundle we give all our customers to help them collect data. Notably, each customer gets a different copy depending on which destinations they have enabled. Some have a copy of Analytics.js which has Google Analytics, while others may get Mixpanel. To do this, we had a custom build process that would effectively ‘template’ the pieces of integration specific code in or out.

The full file looked something like this:

... dependencies ...
... dependencies ...
... dependencies ...
if (enabled('google-analytics')) { ... include GA code ... }
if (enabled('mixpanel')) { ... include Mixpanel code ... }
if (enabled('intercom')) { ... include Intercom code ... }
...

While we’d be sure to include only the adapters the customer was using, we ran into one unfortunate problem. No matter what integrations users had enabled, they’d still receive the full set of dependencies. 

Critically, the differences between those dependencies were often immaterial, but the result was that we were bundling similar code multiple times.

So we implemented ‘tree-shaking’, in other words, removing dead code. This works via the standard yarn install, which installs all dependencies, and then ‘shakes’ the dependency tree to drop packages nothing uses. If Google Analytics depends on the validator package, but the user doesn’t use Google Analytics, we’ll omit validator from the bundle.

Critically, this reduced the size of Analytics.js by 30%. The script is loaded billions of times per month from our CloudFront distributions, so a 30% smaller bundle translated directly into 30% less network transfer.

Rewriting the core validation in Go

A lot has been written about migrating different production services over to Go. Segment was no different, and migrated almost every service in Segment’s core data pipeline to Go, from our front door tracking API to the service which reliably retries various requests. 

Every service that is, except our internal validation service, which was written in node.js.

To give you a quick sense of the validation service, it was designed to:

  • Validate incoming messages to ensure they meet our message format.

  • Resolve an API key to the particular ‘source id’ it corresponds to.

  • Publish these messages to NSQ.
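A hedged sketch of those three jobs in Go (the types, names, and lookup are illustrative, not our production code):

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a minimal stand-in for an incoming tracking call.
type Message struct {
	WriteKey string
	Type     string // "track", "identify", ...
}

// sources is a hypothetical key-to-source lookup; in production this
// would be a cache-backed store, not a map.
var sources = map[string]string{"wk_123": "src_abc"}

// Validate mirrors the service's three jobs: check the message format,
// resolve the API key to a source id, and hand the message to the queue.
func Validate(m Message, publish func(sourceID string, m Message) error) error {
	if m.Type == "" {
		return errors.New("invalid message: missing type")
	}
	sourceID, ok := sources[m.WriteKey]
	if !ok {
		return errors.New("unknown write key")
	}
	return publish(sourceID, m)
}

func main() {
	err := Validate(
		Message{WriteKey: "wk_123", Type: "track"},
		func(sourceID string, m Message) error {
			fmt.Println("publish to", sourceID) // would publish to NSQ here
			return nil
		},
	)
	fmt.Println(err)
}
```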

Each container was allocated one full vCPU and 4gb of memory. Publishing 200,000 messages per second required about 800 of these containers, with each container processing 250 messages per second.

We knew we could do better.

Now, each of the new Go services gets the same allocation (1vCPU, 4gb of memory) but, critically, the logic has been totally rewritten and optimized. We have a lot more experience optimizing high-throughput Go code, with tools like pprof, flamegraphs, and friends.

Today, we’re processing 220,000 messages per second, across just 340 containers. This means our throughput per container is more than double at nearly 650 messages per second, cutting our spending for this component in half. 

Additionally, we now run three instances of these services, one across each of our availability zones to avoid cross-zone data transfer. Which brings us nicely onto our next point.

Getting a handle on data transfer costs

Data transfer is one of the biggest costs of running on the cloud. Network transfer is often independent of any particular service, and easy to ignore. In our experience, it ends up becoming a “tragedy of the commons” – everybody’s pain, but nobody’s problem.

But left unchecked, it can have massive repercussions.

For those unfamiliar, AWS charges for network traffic based upon data volume in a few different ways. 

  • Inter-region and internet traffic costs $0.02/GB to send and receive.

  • Inter AZ traffic costs $0.01/GB to send and receive.

  • Traffic within an AZ is free.
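To see how quickly this adds up, here is a back-of-envelope calculation (the traffic volume is hypothetical):

```go
package main

import "fmt"

// CrossAZCostUSD estimates monthly inter-AZ transfer cost. With n zones
// and naive routing, (n-1)/n of traffic crosses a zone boundary, and AWS
// bills $0.01/GB on each side, i.e. $0.02/GB effective.
func CrossAZCostUSD(gbPerMonth float64, zones int) float64 {
	crossFraction := float64(zones-1) / float64(zones)
	return gbPerMonth * crossFraction * 0.02
}

func main() {
	// A hypothetical 500 TB/month flowing through a 3-AZ pipeline:
	// two-thirds crosses a zone, costing roughly $6,667/month.
	fmt.Printf("$%.0f/month\n", CrossAZCostUSD(500_000, 3))
}
```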

In our case, the cost of data transfer constituted nearly 1/6th of our bill. Yikes.

Here’s the breakdown of costs from October 2018 through February 2019 (axes removed):

To lower our network transfer, we kicked off three major initiatives:

1. Replacing ALBs with service discovery

Throughout Segment’s history, most of our infrastructure relied on a very simple method of service discovery. We typically use Route53 to assign internal DNS names to Amazon-run Application Load Balancers (ALBs).

These ALBs then forward requests to the backing service. It means we don’t have to worry about running our own service discovery and let the load balancers handle things like spread and health checks.

As we started digging into our bill, we found that some of these load balancers created an enormous expense due to the bytes they transferred. We were spending tens of thousands of dollars per month just on the requests to the load balancers themselves.

A second downside of the DNS to ALB approach is that clients will hit any IP in the ALB, whether that IP is in-zone or not. 

As a result, two-thirds of all requests to an ALB were likely to cross the network boundary of an availability zone, because the client will hit a random IP that comes back from the DNS query.

We knew that to reduce our cost, we had to switch to a smarter service discovery solution, which could do in-zone routing and remove our reliance on ALBs.

Enter Consul.

For those who might be unfamiliar, Consul is the Swiss Army knife for distributed systems here at Segment. It supports distributed locks, a KV store, and built-in primitives for service discovery.

We started eliminating the ALBs by using three pieces of infrastructure: Registrator, Consul, and Consul-aware clients.

  • Registrator runs on each host and connects to the Docker daemon to query the associated services running on that host.

  • Registrator then syncs all that data into Consul, tagging services with the AZ they run in.

  • Clients perform a lookup within Consul, aware of what AZ they are running in.

  • Clients then prefer to connect to services within their own AZ.

To achieve this, clients use the EC2 metadata API to detect the zone they are running in. 

For each request, the client applies a load balancing algorithm taking into account the network topology of clients and servers. This helps distribute the load evenly across the fleet. 

Balancing accounts for the number and placement of clients and servers, and ensures an imbalanced distribution doesn’t create a hot spot in the zone with the highest client-to-server ratio.

By taking this approach, we managed to cut the cost of the ALBs themselves, as well as the inter AZ network cost.

2. NATs and public subnets

We also decided to do an audit of how our networking setup was running overall. 

Traditionally, we’ve put services in private subnets as an extra layer of security. Even when the VPC limits a service from being contacted by the internet, having an extra L4 boundary can help limit inbound traffic.

Typically, services talking to the internet will have to traverse a NAT Gateway to get to the public subnet, and then an Internet Gateway to get to the public internet. The problem is that traversing these NATs costs money, both for cross AZ traffic and for instance throughput.

The quickest win we identified was using VPC endpoints for services like S3. VPC endpoints are a drop-in replacement for the public APIs supported by many Amazon services and, critically, they don’t count against your public network traffic. 

We replaced all of our S3 calls through these NATs with custom VPC endpoints, leading to additional cost savings. 

If this seems like incredibly low hanging fruit… it was. We hadn’t made significant changes to these requests in some time, and had been hitting the public endpoints since VPC endpoints were first introduced.

3. Kafka in-AZ routing

Switching from NSQ to Kafka reduced disk space usage, but the migration also had a favorable impact on our AZ routing.

Instead of having random queues and clients spread across each availability zone, we built on Kafka’s consumer groups API to create 'AZ-affinity' for certain brokers and partitions. Consumers would first choose to read partitions in their own AZ, reducing the need for cross-AZ traffic. 

As of Kafka 0.10.0.0, metadata responses from a broker include the rack each broker runs in.

The client can then discover its own AZ from the EC2 Metadata API, and decide to consume from partitions in the same availability zone.
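A toy version of that rack-aware assignment: each partition goes to a consumer in the same rack when one exists, otherwise to any consumer. The production balancer in kafka-go is more careful about spreading load across the group, but the core idea is the same:

```go
package main

import "fmt"

// Consumer and Partition are simplified views of a consumer group and
// topic metadata; since Kafka 0.10.0.0 brokers report their rack, which
// we set to the AZ.
type Consumer struct {
	ID   string
	Rack string
}

type Partition struct {
	ID   int
	Rack string
}

// Assign gives each partition to a consumer in the same rack when one
// exists, round-robining within the rack; partitions with no in-rack
// consumer fall back to the whole group.
func Assign(parts []Partition, consumers []Consumer) map[string][]int {
	byRack := make(map[string][]Consumer)
	for _, c := range consumers {
		byRack[c.Rack] = append(byRack[c.Rack], c)
	}
	assignment := make(map[string][]int)
	next := make(map[string]int) // per-rack round-robin cursor
	for _, p := range parts {
		candidates := byRack[p.Rack]
		if len(candidates) == 0 {
			candidates = consumers // no in-rack consumer: any will do
		}
		c := candidates[next[p.Rack]%len(candidates)]
		next[p.Rack]++
		assignment[c.ID] = append(assignment[c.ID], p.ID)
	}
	return assignment
}

func main() {
	parts := []Partition{{0, "us-west-2a"}, {1, "us-west-2b"}, {2, "us-west-2a"}}
	consumers := []Consumer{{"c1", "us-west-2a"}, {"c2", "us-west-2b"}}
	// c1 reads partitions 0 and 2 (its own AZ); c2 reads partition 1.
	fmt.Println(Assign(parts, consumers))
}
```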

We’ve since contributed this fix upstream to the open-source library as well (you can find the PR here). Making this behavior the easy default continues to give us the ability to reduce our inter-AZ costs, without any work from the developer.

Cost-saving product wins

While many of the wins came from pure infrastructure improvements, with no changes to functionality, we found several places where our systems were carrying out redundant work.

Failing sources

Segment collects data from various sources and sends that data to destinations. We act as a pipe for this data—reliably delivering it and handling whatever transforms and filters need to happen on the data.

After running some initial analysis, we reached a startling conclusion. 

More than 6,000 of these sources were not connected to any destinations at all, or were sending data to destinations with expired credentials! We were effectively paying for data infrastructure piping straight to /dev/null.

We kicked off a campaign to reduce the total number of sources without any attached destinations. We started by sending customers a notice asking if they wanted to keep their source running for any reason. By doing so, we managed to eliminate more than $20,000 per month in cost by not delivering these messages.

Unfiltered messages

We also noticed duplication within our infrastructure. 

Data is sent through Segment to analytics tools via two different means. 

In one scenario, we transmit data client-side, directly from the browser. If one of our users is using Google Analytics, we help wrap all of those calls to the Google Analytics JavaScript object for them. This helps ensure the user’s session data looks identical. 

In the second scenario, we proxy the data through our servers. This can happen for any number of reasons—to help reduce the page weight, get better data fidelity, or because the data is produced out-of-band as some part of the server-code. 

We send both sets of data to our servers to give users a full archive of all the data they have collected. Any data not already sent from the client is then delivered via our server-to-server APIs.

In the simplified diagram below:

  1. Requests come into our API.

  2. Those messages are persisted to Kafka.

  3. A consumer reads from Kafka to determine where messages should be delivered (message A should go to Google Analytics, message B should go to Salesforce).

  4. Our Centrifuge system ensures reliable delivery and traffic absorption in the case of failures.

  5. A proxy service transforms data into the relevant API calls for each of our destinations.

Traditionally, the decision point for that data delivery only happened at the last possible moment, at the ‘proxy’ step. 

                                            -> proxy -> google
     API -> Kafka -> Consumer -> Centrifuge -> proxy -> mixpanel
                                            -> proxy -> intercom
                                  |------ ($$$) ------|

That meant we were doing a bunch of extra bookkeeping, writes, and duplicate work for messages which had already been delivered. We were spending a ton on messages which had already been sent!

Instead, we moved the logic into the ‘consumer’ step of our pipeline. This would short-circuit all of the expensive bookkeeping and avoid all the duplicate work. 
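The short-circuit itself is conceptually simple, even if porting it was not. A sketch (field and function names are illustrative): filter out the destinations the browser already delivered to before any downstream bookkeeping happens:

```go
package main

import "fmt"

// Message is illustrative: SentFromDevice lists destinations the browser
// already delivered to client-side.
type Message struct {
	ID             string
	SentFromDevice []string
}

// Deliverable drops destinations that were already hit client-side, so
// downstream steps never do bookkeeping for them.
func Deliverable(m Message, destinations []string) []string {
	already := make(map[string]bool, len(m.SentFromDevice))
	for _, d := range m.SentFromDevice {
		already[d] = true
	}
	var out []string
	for _, d := range destinations {
		if !already[d] {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	m := Message{ID: "msg_1", SentFromDevice: []string{"google-analytics"}}
	// Only Mixpanel still needs a server-side delivery.
	fmt.Println(Deliverable(m, []string{"google-analytics", "mixpanel"}))
}
```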

We’ll note that porting the logic was a significant effort. The proxy is written in node, while the consumer is written in Go, and the logic branches on several different fields, which makes the control flow complicated to trace.

If you’re noticing a theme here, you’re right. We were overpaying for areas of duplication or unnecessary work. We were only able to eliminate that work by closely examining each product area, and seeing if we could tighten the guarantees it was making. 

Lessons learned

Measure first

You can’t manage what you can’t measure. By systematically identifying all areas of cost savings, we were able to pick the lowest-hanging fruit with the biggest savings. It helped us figure out what would move the needle instead of making vague guesses at performance improvements. We used a combination of CloudHealth, the AWS billing CSV, and Tableau to get the level of fidelity we needed.

Make a plan

Once we identified the areas where savings could be made, actually putting a plan in place to make sure the work got done was key. We had everyone responsible for a cost-cutting effort meet every week. Each team started with the lowest-hanging fruit and steadily worked towards reducing their cost centers. Our SRE team coordinated the effort and ensured our different cost savings efforts landed by the deadline we’d set.

Build a repeatable monitoring process

Once we had invested heavily in reducing our cost, we wanted to make sure it didn’t regress. It wouldn’t do us any good to have to repeat the whole cost-cutting process six months in the future.

To get the ongoing visibility we need, we built a set of repeatable pricing drivers, calculated daily. The entire cost pipeline feeds into our Redshift instance, and we get daily monitoring on our “cost drivers”, visualized in Tableau. 

If a single cost driver ever spikes by more than a threshold percentage in a day, it automatically emails the team so that we can quickly respond and fix the regression. We no longer wait weeks or months to notice a cost spike.
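The check behind those alerts can be as simple as a day-over-day percentage comparison per driver (the threshold and numbers here are illustrative):

```go
package main

import "fmt"

// Spiked flags a cost driver whose day-over-day increase exceeds the
// threshold percentage, which is when the owning team gets an email.
func Spiked(yesterday, today, thresholdPct float64) bool {
	if yesterday == 0 {
		return today > 0
	}
	return (today-yesterday)/yesterday*100 > thresholdPct
}

func main() {
	fmt.Println(Spiked(1000, 1300, 25)) // true: +30% day over day
	fmt.Println(Spiked(1000, 1100, 25)) // false: only +10%
}
```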

Draw the impact through the business

Our finance teams told us that every improvement we make to our gross margin drives up the value of the company significantly. You can see this phenomenon reflected in the public markets. Here’s the data from an internal slide deck we snapshotted in mid-2019.

You can see an immediate correlation between Revenue Multiple (x-axis) and Gross Margin (y-axis). 

Some of these companies are decidedly “SaaS” businesses with SaaS gross margins (Zoom, Cloudflare, Datadog), and they have gone on to trade with exceptionally high revenue multiples. The rest have fared…less well. 

Companies that reach 70% to 80% gross margins are heavily rewarded for their unit economics and command of their markets. It’s the sort of company that we’d like to become.

The result

After all was said and done, we managed to increase our gross margin by 20% over 90 days.

That said, there’s more work still left to do. Our current instance utilization is still below target (40% vs 70%). We still have a significant number of EBS snapshots that aren’t getting garbage collected. But overall, we’re proud of what we managed to accomplish.

Instead of creating a one-time fix, we’ve now put the systems and monitoring in place to repeatedly forecast the areas of biggest spend and keep saving across our pipeline in the future.

If you run similar infrastructure, we’re hoping you can take these ideas and apply them to your own business, keeping costs low, one day at a time.

Huge thanks to the entire gross margin team: Achille Roussel, Travis Cole, Steve van Loben Sels, Aliya Dossa, Josh Curl, John Boggs, Albert Strasheim, Alan Braithwaite, Alex Berkenkamp, Lauren Reeder, Rakesh Nair, Julien Fabre, Jeremy Jackins, Dean Karn, David Birdsong, Gerhard Esterhuizen, Alexandra Noonan, Rick Branson, Arushi Bajaj, Anastassia Bobokalonova, Andrius Vaskys, Parsa Shabani, Scott Cruwys, Ray Jenkins, Sandy Smith, Udit Mehta, Tido Carriero, Peter Richmond, Daniel St. Jules, Colin King, Tyson Mote, Prateek Srivastava, Netto Farah, Roland Warmerdam, Matt Shwery, Geoffrey Keating for editing, and thanks to Jarrod Bryan for creating these ✨ illustrations.