Benjamin Yolken on September 15th 2020

Today, we're releasing kubeapply, a lightweight tool for git-based management of Kubernetes configs. Here's why.

Evan Johnson on August 28th 2017

The way companies manage application secrets is critical. Even today, improper secrets management has resulted in an astonishing number of high profile breaches.

Having internet-facing credentials is like leaving your house key under a doormat that millions of people walk over daily. Even if the secrets are hard to find, it’s a game of hide and seek that you will eventually lose.

At Segment we centrally and securely manage our secrets with AWS Parameter Store, lots of Terraform configuration, and chamber.

While tools like Vault, Credstash, and Confidant have gotten a lot of buzz recently, Parameter Store is consistently overlooked when discussing secrets management. 

After using Parameter Store for a few months, we’ve come to a different conclusion: if you are running all of your infrastructure on AWS and you are not using Parameter Store to manage your secrets, then you are crazy! This post has all of the information you need to get up and running with Parameter Store in production.

Service Identity

Before diving into Parameter Store itself, it’s worth briefly discussing how service identity works within an AWS account.

At Segment, we run hundreds of services that communicate with one another, AWS APIs, and third party APIs. The services we run have different needs and should only have access to systems that are strictly necessary. This is called the principle of least privilege’.

As an example, our main webserver should never have access to read security audit logs. Without giving containers and services an identity, it's not possible to protect and restrict access to secrets with access control policies.

Our services identify themselves using IAM roles.  From the AWS docs:

An IAM role … is an AWS identity with permission policies that determine what the identity can and cannot do in AWS.

IAM roles can be assumed by almost anything: AWS users, running programs, lambdas, or ec2 instances. They all describe what the user or service can and cannot do. 

For example, our IAM roles for instances have write-only access to an s3 bucket for appending audit logs, but prevent deletion and reading of those logs.

How do containers get their role securely?

A requirement of using ECS is that all container instances run the EC2 Container Service Agent (ecs-agent). The agent runs as a container itself, orchestrating the other containers on the host and providing an API they can communicate with. It acts as the central nervous system that decides which containers are scheduled on which instances, as well as which IAM role credentials should be provided to a given container.

In order to function properly, the ecs-agent runs an HTTP API that must be accessible to the other containers that are running in the cluster. The API itself is used for healthchecks, and injecting credentials into each container. 

To make this API available inside a container on the host, an iptables rule is set on the host instance. This iptables rule forwards traffic destined for a “magic” IP address (169.254.170.2) to the ecs-agent container.

Before ecs-agent starts a container, it first fetches credentials for the container’s task role from the AWS credential service. The ecs-agent next sets the credentials key ID, a UUID, as the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable inside the container when it is started. 

From inside the container, this variable holds a relative path of the form /v2/credentials/<uuid>.

Using this relative URI and UUID, containers fetch AWS credentials from the ecs-agent over HTTP. 

One container cannot access the authentication credentials to impersonate another container because the UUID is sufficiently difficult to guess.
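
To make this concrete, here is a rough sketch (in Go) of what fetching those credentials looks like from inside a container. The 169.254.170.2 endpoint and the response fields follow AWS’s documented conventions, but the struct and error handling here are illustrative rather than lifted from the ecs-agent:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// taskCredentials mirrors the fields returned by the ecs-agent's credential
// endpoint (an illustrative struct, not the agent's own type).
type taskCredentials struct {
	AccessKeyID     string `json:"AccessKeyId"`
	SecretAccessKey string `json:"SecretAccessKey"`
	Token           string `json:"Token"`
	Expiration      string `json:"Expiration"`
}

func main() {
	// The agent injects a per-container relative URI containing the secret UUID.
	relativeURI := os.Getenv("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")

	// 169.254.170.2 is the "magic" address that iptables forwards to the ecs-agent.
	resp, err := http.Get("http://169.254.170.2" + relativeURI)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var creds taskCredentials
	if err := json.NewDecoder(resp.Body).Decode(&creds); err != nil {
		panic(err)
	}
	fmt.Println("got temporary task role credentials that expire at", creds.Expiration)
}
```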

Additional security details

As heavy ECS users we did find security foot-guns associated with ECS task roles. 

First, it’s very important to realize that any container that can access the EC2 metadata service can assume any other task role on the system. If you aren’t careful, it’s easy to overlook that a given container can circumvent access control policies and gain access to unauthorized systems.

The two ways a container can access the metadata service are a) by using host networking and b) via the docker bridge.

When a container is run with --network='host', it is able to connect to the EC2 metadata service using its host’s network. Setting the ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST variable to false in the ecs.config prevents containers from running with this privilege.

Additionally, it’s important to block access to the metadata service IP address over the docker bridge using iptables. The IAM task role documentation recommends preventing access to the EC2 metadata service with this specific rule.

The principle of least privilege is always important to keep in mind when building a security system. Setting ECS_DISABLE_PRIVILEGED to true in the host’s ecs.config can prevent privileged Docker containers from being run and causing other more nuanced security problems.

Now that we’ve established how containers prove their identity and obtain credentials, we can move on to the second part of the equation: Parameter Store.

AWS Parameter Store

Parameter Store is an AWS service that stores strings. It can store secret data and non-secret data alike. Secrets stored in parameter store are “secure strings”, and encrypted with a customer specific KMS key.

Under the hood, a lot happens when a service requests secure strings from AWS Parameter Store:

  1. ecs-agent requests the host instance’s temporary credentials.

  2. The ecs agent continuously generates temporary credentials for each ecs task role running on ECS, using an undocumented service called ECS ACS.

  3. When the ecs agent starts each task it sets a secret UUID in the environment of the container.

  4. When the task needs its task role credentials, it requests them from the ecs-agent API and authenticates with the secret UUID.

  5. The ecs task requests its secrets from the parameter store using the task role credentials.

  6. Parameter store transparently decrypts these secure strings before returning them to the ecs task.

Using roles with Parameter Store is especially nice because it doesn’t require maintaining additional authentication tokens, which would create additional headaches and yet more secrets to manage!

Parameter Store IAM Policies

Each role that accesses the Parameter Store requires the ssm:GetParameters permission. “SSM” stands for “Simple Systems Manager”, and is how AWS denotes Parameter Store operations.

The ssm:GetParameters permission is what we use to enforce access control and protect one service’s secrets from another. Segment gives each service an IAM role that only grants access to secrets matching the format {{service_name}}/*.

In addition to the access control policies, Segment uses a dedicated AWS KMS key to encrypt secure strings within the Parameter Store. Each IAM role is granted a small set of KMS permissions in order to decrypt the secrets it stores in Parameter Store.
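
To make the pattern concrete, here is a small sketch of how a service running under its task role might read everything in its own namespace, using the aws-sdk-go SSM client. The /nginx/ path is illustrative, and this isn’t our production tooling (chamber handles this for us):

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ssm"
)

func main() {
	// Credentials come from the task role via the ecs-agent; no extra tokens to manage.
	sess := session.Must(session.NewSession())
	svc := ssm.New(sess)

	// Read every secret under this service's namespace. The IAM policy restricts
	// the task role to its own prefix, so other services' paths are off limits.
	// (Pagination is omitted for brevity.)
	out, err := svc.GetParametersByPath(&ssm.GetParametersByPathInput{
		Path:           aws.String("/nginx/"), // illustrative service namespace
		Recursive:      aws.Bool(true),
		WithDecryption: aws.Bool(true), // KMS decryption happens transparently server-side
	})
	if err != nil {
		panic(err)
	}
	for _, p := range out.Parameters {
		fmt.Println(*p.Name) // values are decrypted secure strings; don't print them
	}
}
```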

Of course, creating all of these pieces of boilerplate by hand quickly becomes tiresome. Instead, we automate the creation and configuration of these roles via Terraform modules.  

Automating service identity and policies

Segment has a small Terraform module that abstracts away the creation of a unique IAM role, load balancers, DNS records, autoscaling, and CloudWatch alarms. Our nginx load balancer, for example, is defined with this service module in just a handful of lines of configuration.

Under the hood, the task role given to each service has all of the IAM policies we previously listed, restricting access to the parameter store by the value in the name field. No configuration required.

Additionally, developers have the option to override which secrets their service has access to by providing a “secret label”. This secret label will replace their service name in their IAM policy. If nginx were to need the same secrets as an HAProxy instance, the two services can share credentials by using the same secret label.

Parameter Store in production

All Segment employees authenticate with AWS using aws-vault, which can securely store AWS credentials in the macOS keychain or in an encrypted file for Linux users.

Segment has several AWS accounts. Engineers can interact with each account by using the aws-vault command and executing commands locally with their AWS credentials populated in their environment.

This is great for AWS APIs, but the AWS CLI leaves a bit to be desired when it comes to interacting with parameter store. For that, we use Chamber.

Using Chamber with Parameter Store

Chamber is a CLI tool that we built internally to allow developers and code to communicate with Parameter Store in a consistent manner. 

By allowing developers to use the same tools that run in production, we decrease the number of differences between code running in development, staging, and production.

Chamber works out of the box with aws-vault, and has only a few key subcommands:

  • exec - executes a command after loading secrets into the environment.

  • history - shows the history of changes to a secret in the parameter store.

  • list - lists the names of all secrets in a secret namespace.

  • write - writes a secret to the Parameter Store.

Chamber leverages Parameter Store’s built in search and history mechanisms to implement the list and history subcommands. All strings stored in Parameter Store are automatically versioned as well, so we gain a built-in audit trail. 

The subcommand used to fetch secrets from the Parameter Store is exec, and developers typically wrap it with aws-vault, running something along the lines of aws-vault exec development -- chamber exec loadbalancers -- nginx.

In a command like this, chamber is executed with the credentials and permissions of the employee in the development account, and it fetches the secrets associated with loadbalancers from Parameter Store. After chamber populates the environment, it runs the nginx server.

Running chamber in production

In order to populate secrets in production, chamber is packaged inside our docker containers as a binary and is set as the entrypoint of the container. Chamber will pass signals to the program it executes in order to allow the program to gracefully handle them.

Making our main website chamber-ready amounted to a small diff: adding the chamber binary to the image and prefixing the existing entrypoint with a chamber exec call.

Services that don’t run in Docker can also use chamber to populate the environment before creating configuration files out of templates, running daemons, and so on. We simply need to wrap their command with the chamber executable, and we’re off to the races.

Auditing

As a last piece of our security story, we want to make sure that every action we described above is logged and audited. Fortunately for us, all access to the AWS Parameter Store is logged with CloudTrail.

This makes keeping a full audit trail for all parameters simple and inexpensive. It also makes building custom alerting and audit logging straightforward.

CloudTrail makes it possible to determine exactly what secrets are used and can make discovering unused secrets or unauthorized access to secrets possible.

AWS logs all Parameter Store access for free as a CloudTrail management event. Most Security Information and Event Management (SIEM) solutions can be configured to watch and read data from S3.

Takeaways

By using Parameter Store and IAM, we were able to build a small tool that gives us all of the properties that were most important to us in a secrets management system, with none of the management overhead. In particular, we get:

  • Secrets protected at rest with strong encryption.

  • Strong access control policies.

  • Audit logs of authentication and access history.

  • A great developer experience.

Best of all, these features are made possible with only a little configuration and zero service management. 

Secrets management is very challenging to get right. Many products have been built to manage secrets, but none fit the use cases needed by Segment better than Parameter Store.

Daniel Fuente on July 26th 2017

Today we’re excited to open source the various pieces of our logging pipeline. We’ve released a rate-limiting-syslog proxy, a journald fanout service, and a cloudwatch logs CLI. To understand how they work in concert, read on.

If you’re running a production system which doesn’t have logs or metrics, you’re flying blind. But if your logs are getting dropped… how does one find out?

A few months ago, this started happening to us at Segment. 

It started gradually. Every few weeks, someone on the Engineering team would ping the #engineering channel and ask why they weren’t seeing log messages for a given container. 

At first we thought these missing logs were the result of ephemeral failures. 

But, soon, we’d start seeing containers that seemed to be working based on all of our metrics–but not logging a thing. No output in our log aggregators. Nothing in docker logs. No trace whatsoever that the program was running. 

Here’s the story of how we dug through our logging pipeline, got to a root cause, and open-sourced our tooling for logging with Docker, ECS, Cloudwatch, and Go.

System Overview

Before diving in, it’s worth taking a cursory look at how logging works at Segment. 

Currently, we’re pushing multiple TB of log data per day into our logging pipeline. Some services log frequently with request-level information, while others contain only important audits.

In order to get good utilization, we also make heavy use of service bin-packing, often running hundreds of containers on a single host.

To get an idea of how our logging pipeline fits together, let’s first take a look at the high level architecture of our logging setup, following how logs propagate through a given host.

Containers run via the docker daemon are configured to log straight to journald. Journald is the default logger that ships with systemd and recent Ubuntu distributions (we use Xenial in production), and collects system-wide logs across our pipeline. 

From Journald, we run a daemon on the host that tails our logs and handles our fanout to different logging providers (more on that shortly).

This design covers a few core tenets:

  • structured logging – clients should log structured fields to query by, not just plain text

  • use system-wide journald – rather than buffering in memory, we log to journald (which writes directly to disk) and then tail logs from there

  • multi-destination adapter – instead of coupling to a single provider, allow multi-destination log fanout (via our open-sourced ecs-logs)

Here’s what each tenet looks like in more detail:

Structured logging

At the application level, we encourage our developers to use structured logs. To make structured logging as easy as possible, we’ve created logging helper libraries for our two main languages (Go and JS) which provide consistent formatters. 

For Go, we use  ecs-logs-go which provides a set of logrus (or events) compatible formatters and outputs. For JS, we have ecs-logs-js which provides a logger built on winston to do the same. 

By using these libraries, we ensure that logging from applications built in either language look identical.

This uniformity in log structure helps us build tooling for routing and querying logs without needing to worry about where the logs originated.
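
As a rough illustration of what structured, field-based logging looks like in application code, here is a generic Go sketch using logrus with its stock JSON formatter; our services use the ecs-logs-go formatters instead, but the shape is the same:

```go
package main

import (
	log "github.com/sirupsen/logrus"
)

func main() {
	// Emit JSON with structured fields instead of free-form text.
	log.SetFormatter(&log.JSONFormatter{})

	// Downstream tooling can filter on these fields regardless of which
	// service or language produced the log line.
	log.WithFields(log.Fields{
		"service":     "checkout", // illustrative field names
		"request_id":  "abc-123",
		"duration_ms": 42,
	}).Info("request completed")
}
```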

Use system-wide journald

From the application, we write our logs into journald via the journald Docker logging driver. This plays well with our structured logging approach, as journald easily allows you to do structured logging by attaching metadata to logs. 

journald logs first to a local file rather than a remote server.  That means if there’s a network partition or we lose instance connectivity, we can still reboot the instance and identify the root cause for the problem. 

Furthermore, because journald logs information from all systemd processes, the logging tooling on a host is consistent. It preserves the behavior of journalctl -u <service-name> as well as docker logs <container-id>, allowing us to debug once logged into a host in the case of a complex problem.

The Docker logging driver also adds metadata about which specific container each log line comes from, which is often very valuable debugging information and allows us to separate the logs coming from different containers.

Multi-destination adapter

Our log forwarder, ecs-logs, tails journald on each host and forwards those logs to downstream aggregators. This forwarder is also responsible for taking unstructured logs from legacy applications, or other open source applications we run, and converting them to our structured log format.

We have experimented with a few log aggregators; internally we use Loggly, LogDNA and AWS Cloudwatch Logs. Loggly and LogDNA are good for fulfilling an indexed ‘search’ use case, where you need to answer a ‘needle in the haystack’ type query. 

Cloudwatch is great for tailing and auditing. Even if you persist logs indefinitely, it’s almost an order of magnitude cheaper than any other logging provider we’ve seen–and it integrates nicely with a variety of AWS tooling. 

Between the three of them, we’re able to handle different use cases with different retentions, from different users inside the company. 

The case of the missing logs

This setup has worked well for us, but we noticed that sometimes we'd be missing a few log lines. The problem was sporadic, but affected all services on a host when it did happen. 

After digging in, we managed to isolate the root cause: systemd-journal wasn’t keeping up with our logging throughput.

To understand what was going on, it’s worth looking at the way that journald is implemented. journald runs as a single process on each host. When set as the logging client for Docker, all logs are sent from the docker daemon directly to journald.

As we run tens or hundreds of containers on a given host, we were seeing times where a single host’s journald process would be completely CPU-bound. 

Fortunately, journald allows you to set a rate limit in its configuration, via the RateLimitInterval and RateLimitBurst settings in journald.conf.

This means that for a given process, journald will accept up to the burst number of messages in the specified interval (200,000 every minute in our configuration).

Unfortunately for us, journald applies rate limits based on systemd service, not per process. 

To understand why this is an issue, know that we run all of our applications in Docker containers via ECS, and that the Docker daemon itself runs as a single systemd unit (docker.service).

We have one Docker process, run as a single systemd process. And that Docker process manages all other services running on a host. 

What that means is that all logs from all containers on a host come from the exact same systemd service. What’s more, these log lines are dropped silently when any process causes us to exceed the ratelimit.

With this setup, one misbehaving container on a host can cause logs from all other containers on that host to be dropped without any sort of notification whatsoever.

In order for us to be able to guarantee the same log reliability to any application running on our infrastructure, we needed a way to apply this rate limiting per container, rather than globally.

Our first thought was to apply this rate limiting at the docker logging driver level. 

However, none of the docker drivers provide this capability out of the box. We could write our own logging driver, but that would mean compiling it into docker and maintaining a fork–something we didn’t want to do (note: the latest docker version has added support for dynamically loading logging drivers, which might be a possibility in the future).

Instead, we created a small proxy (rate-limiting-log-proxy) to sit between the Docker daemon and journald. This proxy presents itself as a syslog server, so that we can use the built in Docker syslog driver to forward logs to it. 

The proxy is a tiny golang binary, with extremely low memory footprint and CPU overhead. You can grab it from github or install it straight from docker hub. 

Once the logs have been shipped to our proxy, we use the log tag (container ID), to rate limit messages from individual containers. Messages that obey the rate limit are forwarded along to journald, while messages from noisy containers will be dropped to fit within the limit.

Here’s a detailed look at the handle method. Notice that we first pull out the message tag, then look up the container by ID to pull its full info from the Docker daemon. 

After each message has been tagged, we see whether the message has exceeded the limit. If it hasn’t, we log it. The rate-limiting uses a simple token-bucket approach to drop messages. 
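
The real handler lives in the rate-limiting-log-proxy repo; the sketch below is a simplified Go version of the same idea, keeping a token bucket per container tag and only forwarding messages that fit under the limit. The types and bucket refill policy here are illustrative:

```go
package proxy

import (
	"sync"
	"time"
)

// bucket is a simple token bucket: capacity tokens, refilled every interval.
type bucket struct {
	tokens   int
	lastFill time.Time
}

type limiter struct {
	mu       sync.Mutex
	buckets  map[string]*bucket // keyed by container ID (the syslog tag)
	capacity int
	interval time.Duration
}

func newLimiter(capacity int, interval time.Duration) *limiter {
	return &limiter{buckets: make(map[string]*bucket), capacity: capacity, interval: interval}
}

// allow reports whether a message from the given container fits under its limit.
func (l *limiter) allow(containerID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	b, ok := l.buckets[containerID]
	if !ok {
		b = &bucket{tokens: l.capacity, lastFill: time.Now()}
		l.buckets[containerID] = b
	}
	// Refill the bucket once per interval rather than continuously (keeps the sketch simple).
	if time.Since(b.lastFill) >= l.interval {
		b.tokens = l.capacity
		b.lastFill = time.Now()
	}
	if b.tokens <= 0 {
		return false // noisy container: drop the message
	}
	b.tokens--
	return true
}

// handle sketches the per-message flow: tag -> rate check -> forward or drop.
// The real proxy also resolves the container's full info from the Docker daemon.
func handle(l *limiter, containerID, message string, forward func(string)) {
	if l.allow(containerID) {
		forward(message) // well-behaved containers always reach journald
	}
}
```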

Now, regardless of noisy neighbors, well behaving containers will always get their logs into journald (and from there, into the rest of our logging pipeline).

What’s more, instead of silently dropping log lines like the journald ratelimit does, we wanted to give developers a much clearer heads up when their logs are getting dropped.

If you’re logging too much, the proxy now writes a warning into your container’s log output telling you that your logs are being rate limited.

It’s now much easier to understand where and why your logs may be getting dropped, as opposed to the silent failures we encountered earlier. 

Where performance is concerned, Go performs admirably (as per usual). The log-proxy is capable of handling hundreds of thousands of messages per second, all with minimal CPU and memory overhead. 

Even with hundreds of containers running on a single host, logging tens of thousands of lines per second, CPU hovers at 2-3% and memory is capped to 10mb. 

There’s still some work we could do here. From our early profiling, the majority of the CPU is being consumed by the syslog server parsing code. But this is a step which can easily be optimized.

If you’d like to use this yourself, the code is freely available on GitHub, and the docker image is runnable today via Docker Hub.

To get up and running, first run the image as a standalone syslog server, listening on a unixgram socket. 

And then update the Docker logging configuration for a given service so that the syslog driver points at the proxy’s socket.

And voila! Rate-limited logs tagged with your service and container information!

Useful once usable

We’d successfully managed to debug where logs were being dropped, but our job wasn’t over yet. Logs aren’t actually useful until they are usable.

On that note, the biggest issues we had here were with Cloudwatch. The Cloudwatch logs UI is clunky and slow for our common use cases, and the aws-cli tool is hard to use effectively for day-to-day debugging.

To ease this process for developers, we created cwlogs, a more user-friendly CLI for accessing Cloudwatch logs. 

When using the rate-limiting-log-proxy and ecs-logs, Cloudwatch logs are grouped by “ECS service”. Within each service, individual containers log to a given “stream”. This allows you to filter either by service-wide logs or by a particular running instance of the program.

In turn, we designed cwlogs around our most common logging use cases:

  • Listing the available log streams for a given service

  • Fetching/tailing logs from a particular service

  • Fetching/tailing logs from a particular ECS Task (stream)

  • Formatting log output for analysis with common CLI tooling (grep, jq, etc.)

The formatting string passed to cwlogs is a Go template string, so you can easily tune the ‘verbosity’ of your log output (more details can be found in the cwlogs repo). Because we have a consistent schema across all of our logs, it’s easy for our developers to construct format strings to pull out the data they need and know that it will exist.

Our cwlogs CLI also supports fuzzy “did you mean” matching of various streams in case you forget the exact name of the service you’re trying to query. 

Accessing the right logs from the terminal now only takes a few seconds. No more complicated login process, or tens of clicks to identify the log streams you’re interested in.

EOF

While all the usual advice when it comes to logging still applies (use structured logs, log straight to disk, adapt to common interchange formats, etc)–we’ve found the biggest piece still missing from most implementations is clear rate limits.

Many logging providers don’t provide service-level rate limits at all, or will silently drop messages when those limits are applied. At best, this causes confusion. At worst, it may cost you hours of debugging headaches. 

At the end of the day, the I/O on a machine is a shared resource. If you’re stacking hundreds of containers on a given host, that I/O can start to disappear quickly. By introducing rate-limiting with clear indication when it is invoked, we’ve been able to spot problematic services much more quickly, and adjust their log volume accordingly. 

If you’d like to try rate-limiting-log-proxy, you can find the source on GitHub and the image on Docker Hub. Same goes for the cwlogs CLI to access Cloudwatch. If you do end up kicking the tires, let us know what you think. Happy (rate-limited) logging!

Amir Abu Shareb on June 29th 2017

The single requirement of all data pipelines is that they cannot lose data. Data can usually be delayed or re-ordered–but never dropped. 

To satisfy this requirement, most distributed systems guarantee at-least-once delivery. The techniques to achieve at-least-once delivery typically amount to: “retry, retry, retry”. You never consider a message ‘delivered’ until you receive a firm acknowledgement from the consumer.

But as a user, at-least-once delivery isn’t really what I want. I want messages to be delivered once. And only once.

Unfortunately, achieving anything close to exactly-once delivery requires a bullet-proof design. Each failure case has to be carefully considered as part of the architecture–it can’t be “bolted on” to an existing implementation after the fact. And even then, it’s pretty much impossible to have messages only ever be delivered once. 

In the past three months we’ve built an entirely new de-duplication system to get as close as possible to exactly-once delivery, in the face of a wide variety of failure modes. 

The new system is able to track 100x the number of messages of the old system, with increased reliability, at a fraction of the cost. Here’s how. 

The problem

Most of Segment’s internal systems handle failures gracefully using retries, message re-delivery, locking, and two-phase commits. But, there’s one notable exception: clients that send data directly to our public API.

Clients (particularly mobile clients) have frequent network issues, where they might send data, but then miss the response from our API.

Imagine, you’re riding the bus, booking a room off your iPhone using HotelTonight. The app starts uploading usage data to Segment’s servers, but you suddenly pass through a tunnel and lose connectivity. Some of the events you’ve sent have already been processed, but the client never receives a server response. 

In these cases, clients retry and re-send the same events to Segment’s API, even though the server has technically already received those exact messages.

From our server metrics, approximately 0.6% of events that are ingested within a 4-week window are duplicate messages that we’ve already received. 

This error rate might sound insignificant. But for an e-commerce app generating billions of dollars in revenue, a 0.6% discrepancy can mean the difference between a profit and a loss of millions. 

De-duplicating our messages

So we understand the meat of the problem–we have to remove duplicate messages sent to the API. But how?

Thinking through the high-level API for any sort of dedupe system is simple. In rough pseudocode, we could represent it as the following:
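
(A sketch in Go; the Store and Topic interfaces below are placeholders, not our actual implementation.)

```go
package dedupe

// Store remembers which message IDs we have seen inside the de-duplication
// window. Topic is a minimal stand-in for a Kafka producer/consumer pair.
type Store interface {
	Seen(id string) bool
	Mark(id string)
}

type Message struct {
	ID   string
	Body []byte
}

type Topic interface {
	Publish(m Message) error
	Commit(m Message) error // acknowledge the input offset
}

// Dedupe re-publishes only the messages we have not seen before.
func Dedupe(in []Message, store Store, out Topic) error {
	for _, m := range in {
		if store.Seen(m.ID) {
			// Duplicate: skip publishing, but still commit so we move past it.
			if err := out.Commit(m); err != nil {
				return err
			}
			continue
		}
		// New message: publish, remember its ID, and commit
		// (the real system uses the output topic itself as the commit point).
		if err := out.Publish(m); err != nil {
			return err
		}
		store.Mark(m.ID)
		if err := out.Commit(m); err != nil {
			return err
		}
	}
	return nil
}
```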

For each message in our stream, we first check if we’ve seen that particular message, keyed by its id (which we assume to be unique). If we’ve seen a message before, discard it. If it’s new, we re-publish the message and commit the message atomically. 

To avoid storing all messages for all time, we keep a ‘de-duplication window’–defined as the time duration to store our keys before we expire them. As messages fall outside the window, we age them out. We want to guarantee that there exists only a single message with a given ID sent within the window.

The behavior here is easy to describe, but there are two aspects which require special attention: read/write performance and correctness.

We want our system to be able to de-duplicate the billions of events passing through our pipeline–and do so in a way that is both low-latency and cost efficient. 

What’s more, we want to ensure the information about which events we’ve seen is written durably so we can recover from a crash, and that we never produce duplicate messages in our output.

Architecture

To achieve this, we’ve created a ‘two-phase’ architecture which reads off Kafka, and de-duplicates all events coming in within a 4-week window.

The dedupe high-level architecture

Kafka topology

To understand how this works, we’ll first look at the Kafka stream topology. All incoming API calls are split up as individual messages, and read off a Kafka input topic. 

First, each incoming message is tagged with a unique messageId, generated by the client. In most cases this is a UUIDv4 (though we are considering a switch to ksuids). If a client does not supply a messageId, we’ll automatically assign one at the API layer.

We don’t use vector clocks or sequence numbers because we want to reduce the client complexity. Using UUIDs allows anyone to easily send data to our API, as almost every major language supports it.

Individual messages are logged to Kafka for durability and replay-ability. They are partitioned by messageId so that we can ensure the same messageId will always be processed by the same consumer.

This is an important piece when it comes to our data processing. Instead of searching a central database for whether we’ve seen a key amongst hundreds of billions of messages, we’re able to narrow our search space by orders of magnitude simply by routing to the right partition. 
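
As an illustration, here is how producing messages keyed on the messageId might look with the sarama Go client and its hash partitioner (the client choice, broker address, and topic name are assumptions made for this sketch); hashing on the key is what keeps a given messageId pinned to a single partition:

```go
package main

import (
	"log"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()
	cfg.Producer.Return.Successes = true
	// Hash the message key so the same messageId always maps to the same partition.
	cfg.Producer.Partitioner = sarama.NewHashPartitioner

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg) // placeholder broker
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	messageID := "9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d" // e.g. a client-generated UUIDv4
	partition, offset, err := producer.SendMessage(&sarama.ProducerMessage{
		Topic: "ingest",                        // placeholder input topic
		Key:   sarama.StringEncoder(messageID), // partitioning key
		Value: sarama.ByteEncoder([]byte(`{"event":"Page Viewed"}`)),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("messageId %s routed to partition %d at offset %d", messageID, partition, offset)
}
```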

The dedupe “worker” is a Go program which reads off the Kafka input partitions. It is responsible for reading messages, checking whether they are duplicates, and if they are new, sending them to the Kafka output topic. 

In our experience, the worker and Kafka topology are both extremely easy to manage. We no longer have a set of large Memcached instances which require failover replicas. Instead we use embedded RocksDB databases which require zero coordination, and get us persistent storage for an extremely low cost. More on that now!

The RocksDB worker

Each worker stores a local RocksDB database on its local EBS hard drive. RocksDB is an embedded key-value store developed at Facebook, and is optimized for incredibly high performance.

Whenever an event is consumed from the input topic, the consumer queries RocksDB to determine whether we have seen that event’s messageId.

If the message does not exist in RocksDB, we add the key to RocksDB and then publish the message to the Kafka output topic.  

If the message already exists in RocksDB, the worker will not publish it to the output topic; it simply updates the offset of the input partition, acknowledging that it has processed the message.

Performance

In order to get high performance from our database, we have to satisfy three query patterns for every event that comes through:

  1. detecting existence of random keys that come in, but likely don’t exist in our DB. These may be found anywhere within our keyspace.

  2. writing new keys at a high write throughput

  3. aging out old keys that have passed outside of our ‘de-duplication window’

In effect, we have to constantly scan the entire database, append new keys, and age out old keys. And ideally, it happens all within the same data model.

Our database has to satisfy three very separate query patterns

Generally speaking, the majority of these performance gains come from our database performance–so it’s worth understanding the internals that make RocksDB perform so well. 

RocksDB is a log-structured merge-tree (LSM) database–meaning that it is constantly appending new keys to a write-ahead-log on disk, as well as storing the sorted keys in-memory as part of a memtable.

Keys are sorted in-memory as part of a memtable

Writing keys is an extremely fast process. New items are journaled straight to disk in append-only fashion (for immediate persistence and failure recovery), and the data entries are sorted in-memory to provide a combination of fast search and batched writes. 

Whenever enough entries have been written to the memtable, it is persisted to disk as an SSTable (sorted-string table). Since the strings have already been sorted in memory, they can be flushed directly to disk. 

The current memtable is flushed to disk as an SSTable at Level 0

In our production logs, we can see these flushes happening as each memtable fills up.

Each SSTable is immutable–once it has been created, it is never changed–which is what makes writing new keys so fast. No files need to be updated, and there is no write amplification. Instead, multiple SSTables at the same ‘level’ are merged together into a new file during an out-of-band compaction phase. 

When individual SSTables at the same level are compacted, their keys are merged together, and then the new file is promoted to the next higher level.

Looking through our production logs, we can see an example of these compaction jobs. In this case, job 41 is compacting 4 level 0 files, and merging them into a single, larger, level 1 file. 

After a compaction completes, the newly merged SSTables become the definitive set of database records, and the old SSTables are unlinked.

If we log onto a production instance, we can see this write-ahead-log being updated–as well as the individual SSTables being written, read, and merged. 

The log and the most recent SSTable dominate the I/O

If we look at the SSTable statistics from production, we can see that we have four total ‘levels’ of files, with larger and larger files found at each higher level.

RocksDB keeps indexes and bloom filters for each SSTable stored in the SSTable itself–and these are loaded into memory. The filters and indexes are queried to find a particular key, and the relevant SSTable data is then loaded into memory and cached on an LRU basis.

In the vast majority of cases, we see new messages–which makes our dedupe system the textbook use case for bloom filters. 

Bloom filters will tell us whether a key is ‘possibly in the set’, or ‘definitely not in the set’. To do this, the bloom filter sets the bits produced by several hash functions for every element that has been added. If all of the bits for a key’s hash functions are set, the filter will return that the message is ‘possibly in the set’.

Querying for w in our bloom filter, when our set contains {x, y, z}. Our bloom filter will return ‘not in set’ as one of the bits is not set.

If the response is ‘possibly in the set’, then RocksDB can query the raw data from our SSTables to determine whether the item actually exists within the set. But in most cases, we can avoid querying any SSTables whatsoever, since the filter will return a ‘definitely not in the set’ response. 
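
RocksDB’s bloom filters live inside its SSTable format, but the core idea fits in a few lines of Go. The toy filter below uses double hashing over FNV and is purely illustrative:

```go
package bloom

import "hash/fnv"

// Filter is a toy bloom filter: m bits and k probes derived from two FNV
// hashes (double hashing). Purely illustrative, not RocksDB's own filter.
type Filter struct {
	bits []uint64
	m    uint64
	k    uint64
}

func New(m, k uint64) *Filter {
	return &Filter{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

func (f *Filter) hashes(key []byte) (uint64, uint64) {
	h1 := fnv.New64a()
	h1.Write(key)
	h2 := fnv.New64()
	h2.Write(key)
	return h1.Sum64(), h2.Sum64()
}

// Add sets the k bits for the key.
func (f *Filter) Add(key []byte) {
	a, b := f.hashes(key)
	for i := uint64(0); i < f.k; i++ {
		bit := (a + i*b) % f.m
		f.bits[bit/64] |= 1 << (bit % 64)
	}
}

// MightContain returns false only when the key is definitely not in the set;
// true means "possibly in the set" and the caller must check the real data.
func (f *Filter) MightContain(key []byte) bool {
	a, b := f.hashes(key)
	for i := uint64(0); i < f.k; i++ {
		bit := (a + i*b) % f.m
		if f.bits[bit/64]&(1<<(bit%64)) == 0 {
			return false
		}
	}
	return true
}
```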

When we query RocksDB, we issue a MultiGet for all of the relevant messageIds  we’d like to query. We issue these as part of a batch for performance, and to avoid many concurrent locking operations. It also allows us to batch the data coming from Kafka and generally avoid random writes in favor of sequential ones. 

This answers the question of how the read/write workload gets good performance–but there’s still the question of how stale data is aged out. 

Deletion: size-bound, not time-bound

With our de-dupe process, we had to decide whether to bound our system by a strict ‘de-duplication window’, or by the total database size on disk.

To avoid the system suddenly falling over and breaking de-duplication for all customers, we decided to limit by size rather than by a set time window. This allows us to set a max size for each RocksDB instance, and deal with sudden spikes or increases in load. The side-effect is that this can lower the de-duplication window to under 24 hours, at which point it will page our on-call engineer.

We periodically age out old keys from RocksDB to keep it from growing to an unbounded size. To do this, we keep a secondary index of the keys based upon sequence number, so that we can delete the oldest received keys first.  

Rather than using the RocksDB TTL, which would require that we keep a fixed TTL when opening the database–we instead delete objects ourselves using the sequence number for each inserted key.

Because the sequence number is stored as a secondary index, we can query for it quickly, and ‘mark’ it as being deleted. Here’s our deletion function, when passed a sequence number. 
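
Our actual deletion code is tied to our worker, but its shape is roughly the sketch below: walk the sequence-number index from the oldest entry up to the given sequence number, and delete both the index entry and the messageId key it points at. The KV interface and key prefixes here are illustrative stand-ins for the RocksDB bindings:

```go
package dedupe

import "encoding/binary"

// KV is an illustrative stand-in for our RocksDB bindings: an ordered
// key-value store with prefix iteration and batched deletes.
type KV interface {
	// ScanPrefix returns key/value pairs whose keys start with prefix, in key order.
	ScanPrefix(prefix []byte) ([][2][]byte, error)
	DeleteBatch(keys [][]byte) error
}

// decodeSeq reads the big-endian sequence number encoded after the "seq/" prefix.
func decodeSeq(key []byte) uint64 {
	return binary.BigEndian.Uint64(key[len("seq/"):])
}

// deleteOlderThan removes every messageId whose insertion sequence number is
// at or below seq. Keys under "seq/" form the secondary index: the key encodes
// the sequence number and the value holds the messageId it points at.
func deleteOlderThan(db KV, seq uint64) error {
	entries, err := db.ScanPrefix([]byte("seq/"))
	if err != nil {
		return err
	}
	var doomed [][]byte
	for _, kv := range entries {
		key, messageID := kv[0], kv[1]
		if decodeSeq(key) > seq {
			break // the index is ordered by sequence number; everything after is newer
		}
		// Tombstone both the index entry and the primary messageId key.
		doomed = append(doomed, key, append([]byte("id/"), messageID...))
	}
	return db.DeleteBatch(doomed)
}
```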

To continue ensuring write speed, RocksDB doesn’t immediately go back and delete a key (remember these SSTables are immutable!). Instead, RocksDB will append a ‘tombstone’ which then gets removed as part of the compaction process. Thus, we can age out quickly with sequential writes, and avoid thrashing our memory by removing old items.

Ensuring Correctness

We’ve now discussed how we ensure speed, scale, and low-cost searching across billions of messages. The last remaining piece is how we ensure correctness of the data in various failure modes. 

EBS-snapshots and attachments

To ensure that our RocksDB instances are not corrupted by a bad code push or underlying EBS outage, we take periodic snapshots of each of our hard drives. While EBS is already replicated under the hood, this step guards against the database becoming corrupted from some underlying mechanism. 

If we need to cycle an instance–the consumer can be paused, and the associated EBS drive detached and then re-attached to the new instance. So long as we keep the partition ID the same, re-assigning the disk is a fairly painless process that still guarantees correctness. 

In the case of a worker crash, we rely on RocksDB’s built-in write-ahead-log to ensure that we don’t lose messages. Messages are not committed from the input topic unless we have a guarantee that RocksDB has persisted the message in the log. 

Reading the output topic

You may notice that up until this point, there has been no ‘atomic’ step which allows us to ensure that we’ve delivered messages just once. It’s possible that our worker could crash at any point: writing to RocksDB, publishing to the output topic, or acknowledging the input messages.

We need a ‘commit’ point that is atomic–and ensures that it covers the transaction for all of these separate systems. We need some “source of truth” for our data. 

That’s where reading from the output topic comes in. 

If the dedupe worker crashes for any reason or encounters an error from Kafka, when it re-starts it will first consult the “source of truth” for whether an event was published: the output topic.

If a message was found in the output topic but not in RocksDB (or vice-versa), the dedupe worker will make the necessary repairs to bring RocksDB and the output topic back in sync. In essence, we’re using the output topic as both our write-ahead-log and our end source of truth, with RocksDB checkpointing and verifying it.

In Production

We’ve now been running our de-dupe system in production for 3 months, and are incredibly pleased with the results. By the numbers, we have:

  • 1.5 TB worth of keys stored on disk in RocksDB

  • a 4-week window of de-duplication before aging out old keys

  • approximately 60B keys stored inside our RocksDB instances

  • 200B messages passed through the dedupe system

The system has generally been fast, efficient, and fault tolerant–as well as extremely easy to reason about. 

In particular, our v2 system has a number of advantages over our old de-duplication system.

Previously we stored all of our keys in Memcached and used Memcached’s atomic CAS (check-and-set) operator to set keys if they didn’t exist. Memcached served as the commit point and ‘atomicity’ for publishing keys. 

While this worked well enough, it required a large amount of memory to fit all of our keys. Furthermore, we had to decide between accepting the occasional Memcached failures, or doubling our spend with high-memory failover replicas. 

The Kafka/RocksDB approach allows us to get almost all of the benefits of the old system, with increased reliability. To sum up the biggest wins:

Data stored on disk: keeping a full set of keys or full indexing in-memory was prohibitively expensive. By moving more of the data to disk, and leveraging various level of files and indexes, we were able to cut the cost of our bookkeeping by a wide margin. We are able to push the failover to cold storage (EBS) rather than running additional hot failover instances. 

Partitioning: of course, in order to narrow our search space and avoid loading too many indexes in memory, we need a guarantee that certain messages are routed to the right workers. Partitioning upstream in Kafka allows us to consistently route these messages so we can cache and query much more efficiently. 

Explicit age-out: with Memcached, we would set a TTL on each key to age them out, and then rely on the Memcached process to handle evictions. This caused us to exhaust our memory in the face of large batches of data, and spike the Memcached CPU in the face of a large number of evictions. By having the client handle key deletion, we’re able to fail gracefully by shortening our ‘window of deduplication’.

Kafka as the source of truth: to truly avoid duplication in the face of multiple commit points, we have to use a source of truth that’s common to all of our downstream consumers. Using Kafka as that ‘source of truth’ has worked amazingly well. In the case of most failures (aside from Kafka failures), messages will either be written to Kafka, or they won’t. And using Kafka ensures that published messages are delivered in-order, and replicated on-disk across multiple machines, without needing to keep much data in memory.

Batching reads and writes: by making batched I/O calls to Kafka and RocksDB, we’re able to get much better performance by leveraging sequential reads and writes. Instead of the random access we had before with Memcached, we’re able to achieve much better throughput by leaning into our disk performance, and keeping only the indexes in memory. 

Overall, we’ve been quite happy with the guarantees provided by the de-duplication system we’ve built. Using Kafka and RocksDB as the primitives for streaming applications has started to become more and more the norm. And we’re excited to continue building atop these primitives to build new distributed applications. 


Thanks to Rick Branson, Calvin French-Owen, Fouad Matin, Peter Reinhardt, Albert Strasheim, Josh Ma and Alan Braithwaite for providing feedback around this post.

Rick Branson on June 7th 2017

Today we’re releasing ksuid, a Golang library for unique ID generation. It borrows core ideas from the ubiquitous UUID standard, adding time-based ordering and more friendly representation formats. In doing the research that went into this library, we uncovered a compelling story that we wanted to share with a larger audience.
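
Before getting into that story, here is what using the library looks like; a minimal sketch assuming the segmentio/ksuid package’s New and String helpers:

```go
package main

import (
	"fmt"

	"github.com/segmentio/ksuid"
)

func main() {
	// A KSUID combines a timestamp with random payload bits and encodes the
	// result as a short, lexicographically sortable string.
	id := ksuid.New()
	fmt.Println(id.String()) // prints a URL-safe, time-ordered identifier
}
```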

Ever since two or more machines found themselves exchanging information on a network, they’ve needed a way to uniquely identify things.

The first networks that resemble our contemporary ideas began with the construction of the first telephone exchanges in the 1870s. Before this crucial tipping point, telecom wires were entirely point-to-point links. While amazing at the time, they were expensive, inflexible, and unreliable. It even resulted in an almost comical explosion of copper lines zig-zagging above the roads of major cities.

While telegraphs were mainly used for important governmental and business communications, telephones were an extreme luxury. Compared to the speed of the telegraph, tying up an expensive copper line for a quick chat was incredibly frivolous. The key invention that brought telecommunications to the masses was the switchboard, enabling the creation of exchanges, and in turn vastly increased the utility of these lines. With it brought the first unique identifier in a network: the telephone number.

Fast forward many decades later to the advent of the networked computer. Suddenly the granularity of things had to become orders of magnitude finer. 

Until this tipping point, data sent across telecom wires had been ephemeral — the networks were just a conduit. Now, it became routine to store and retrieve data on demand, and thus the titanic explosion of data that has inundated the world ever since. Given these new capabilities, the things on a network shifted from physical machines to logical pieces of data.

These networks needed a way to uniquely address these pieces of data. The old systems of central control during the telecom age just wouldn’t scale. This is a mathematical inevitability as a network’s capacity for storage and retrieval increases linearly with size. This scale also brings with it a bit of chaos — failures and machine ephemerality move from a yak shaving problem to the routine. Data no longer lives in one place, it flows freely across the network.

A Networking Event for Computing

This brings us to the 1980s. At this point in time, using a computer to share data actually meant sharing a physical computer. Institutions exchanged information using minicomputers and powerful mainframes with hundreds or thousands of dumb terminals. 

In other words, data was colocated with computation. While PCs had revolutionized computing, they lacked networking capabilities, and therefore were very fancy calculators. 

Founded in 1980, Apollo Computer was one of the first companies to enter the nascent workstation market. Workstations were really the first networked computers. It sounds kind of ridiculous to use this term, but worth stating that at this time most of the networking technology we take for granted had yet to blossom. And in stark contrast to the mainframe world, data and compute was distributed across many interconnected computers. Thus the idea of distributed computing entered the mainstream.

Like its contemporary, Sun Microsystems, Apollo was truly full-stack. Everything had to be built from scratch as the hardware and software of that era was not designed for the use cases they imagined. The asynchrony of networks and the demanding nature of these tasks required vastly more capable computers. Multi-tasking, security controls, networking, and mass storage were all too expensive or impractical to include in PCs at the time. However, they were considered table stakes for the vision of the workstation.

Despite an impressive technological boom in the workstation market, all of these vendors ran into the same road block: few developers knew anything about networks. In order to make a business case for their pricey workstations, they needed a programming environment. Developers needed a way to easily build applications that could fully exploit the networking capabilities of their respective products.

Apollo’s answer was their Network Computing System (NCS). NCS borrowed ideas from object-oriented programming and was built around Remote Procedure Calls (RPC). While now mostly obsolete, this approach achieved the end result Apollo was hoping for: any developer knew how to call a function, and object-oriented was the programming paradigm du jour.

In an article published in Network World in 1989 about RPC, one MIS Director at Burlington Coat Factory made a particularly salient observation: “It takes a good programmer only a day or so to learn how to build distributed applications using RPCs.” Cha-ching. That year Apollo sold to Hewlett-Packard for a whopping $476 million USD, almost a billion dollars when adjusted for inflation.

Things (objects, interfaces, operations [methods], etc), or “entities” in NCS terminology, all needed unique identities to be addressed in a networked environment. In the standard Von Neumann architecture this is trivial: the memory or mass storage address easily serves this purpose. In a distributed computing model which allows many computers to operate independently, this becomes non-trivial. The scale of their use case meant that coordination across the network was off the table — it was just too slow and prone to failure. 

NCS introduced the concept of the UID (Universal IDentifier), which served as the unique primary identity for entities. UIDs are 64-bit numbers that combine a monotonic clock with a unique host ID permanently embedded in the hardware of all of their workstations. Under this scheme, identifiers could be generated thousands of times per second at each host and remain globally unique for all time with no scaling bottleneck. The only point of coordination was at Apollo’s factories — where the machines were permanently branded with their respective identifiers.

The First UUID

When Apollo began to approach standardizing the principles of NCS as the Network Computing Architecture (NCA), it became clear that the existing UID design was insufficient. Apollo wanted all the workstation vendors to standardize on NCA, and while they all embedded host IDs in their workstations, the bit size varied from vendor to vendor. 

Apollo used 20-bits, good enough for around a million machines. While laughable at today’s scale, to run out, Apollo would have to sell more than $10 billion worth of hardware to a market that was an order of magnitude smaller.

NCA introduced UUIDs, which built on the UID design, but accommodated a broader range of vendors by extending the number space to 128-bits. Thus the UUID was born. This concept was so useful that even after NCA became a distant memory and RPC fell out of fashion, the UUID remained popular, eventually being standardized by ISO, IETF, and ITU.

Readers somewhat familiar with UUIDs will recognize that their contents were quite a bit different than the popular UUID Version 4 that is most used today. NCA UUIDs had a 48-bit timestamp, 16 reserved bits, an 8-bit indicator for network address family, and a 56-bit host ID. All told, pretty similar in concept to the Version 1 UUIDs defined in the current-day IETF standard.

All of this history sparked my curiosity about the implementations, and fortunately I was able to dig up fragments of the original Apollo NCS source code online. If you’re like me, looking at source code that is many decades old usually makes for a good time. The first peculiar thing I noticed about this code was the use of the dollar-sign ($) in identifiers like variable and function names.

It turns out that NCS uses a language called “Domain C” introduced by Apollo as part of their “Domain/OS” operating system. Courtesy of Bitsavers, I was able to find a reference manual from 1988 in PDF format. Domain C extends ANSI C in several ways, most importantly here that it allows $ to appear after the first character in any identifier.

In contemporary times, dollar-signs are used as syntax for variables in uncool programming languages, currency in economics, and to adorn the name of self-aggrandi$ing musicians. To understand their actual purpose in the now extinct world of Apollo Computer, it took digging through the trace amounts of code and documentation that still remains. 

After poring through what could be found, a rather anti-climactic conclusion is reached. While not explicitly stated, it seems that this is just a coding convention. Whatever comes before _$ identifies a particular module. The _$t indicates the “default type” such as the uuid_$t used above. It also appears it was useful as a way to easily pick out which identifiers were part of libraries that followed Apollo’s programming style. It was a bit disorienting that Apollo would make extensions to C just to suit a particular coding style.

I digress.

NCA UUIDs became the basis for the standardized Version 1 UUIDs. Reiterating from before: they included a high-precision timestamp and the hardware-based unique host identifier. It almost goes without saying that system clocks can’t be used to reliably generate unique sequence numbers as clock skew or even chance can cause repeated timestamps. Apollo addressed this by using a global file (literally /tmp/last_uuid) to coordinate across processes.

The file necessarily had to be globally-writable by any user. While not particularly secure, Apollo sold end-user workstations used on somewhat trusted networks, so it was at least a somewhat reasonable decision. This technique carried forward into the IETF specification for UUIDs, which describes saving the generator state (the last timestamp and clock sequence) to stable storage.

The source code for a current-day implementation of DCE somewhat surprisingly comes from Apple. They appear to use it mostly to communicate with Microsoft systems like Active Directory and Windows-based File Servers. This implementation, which bears the Open Software Foundation copyright, places the actual working stable store behind a preprocessor flag called UUID_NONVOLATILE_CLOCK.

I couldn’t find any code on the Internet that actually implements the non-volatile clock for DCE RPC’s UUID generation. However, libuuid, included in the package repositories of most Linux distributions, does include a non-volatile UUID clock implementation that can be inspected. Similar to NCS, it uses a file for monotonicity, but places it at a more sensible /var/lib/libuuid/clock.txt. It does attempt to manage the permissions in a slightly more sane way, but the same security caveats apply.

Both the NCS and the libuuid implementations spin to acquire a lock on the state file. This is fertile ground for some really nasty problems.

What libuuid does bring to the table is a daemon, somewhat confusingly named uuidd, which adds some safety. uuidd can make a strong guarantee if one abides by its rules. Combined with the assumed uniqueness of Ethernet MAC addresses, this delivers a pretty strong guarantee within a given distributed system.

In practice, however, this is all quite a bit to ask. The file-based synchronization has a large number of problematic failure cases. The daemon-based solution is better, but it never really caught on. It is exceedingly rare to use a system that comes with it configured out-of-the-box.

It also turns out that MAC addresses are not actually globally unique as they can be user-modified. Their inclusion in UUIDs also poses a threat to privacy and security. Given their opaque nature, developers tend to not realize that UUIDs could come with machine-identifying information. The creator of the Melissa virus that impacted Windows in the late 90s was identified using the MAC address from a UUID found in the virus’ code. As the untrustworthy Internet became the dominant networking platform, UUID generation which depended on trust became obsolete. All of these concerns have led most to abandon leveraging hardware identifiers in UUIDs.

In fact, the default paths in libuuid avoid time-based UUIDs on any system which provides a pseudo-random number generating block device at /dev/(u?)random, which has been available on popular UNIX variants since the 1990s. This has been a factor in the rise of UUID Version 4, which contains only random data: 122-bits of it. The simplicity of implementation has driven its ubiquity.

When Worlds Collide

When I first came across these random Version 4 UUIDs, the threat of a collision was concerning. While UUIDs should not be used in such a way that collisions would create a security threat, as a developer I would like some level of confidence that my own systems aren’t going to trip over themselves. The bad news is that UUID generation still requires some small amount of trust.

The most important aspect of collision safety is the source of entropy. Consider two common cases: a modern version of Linux deployed into a trusted cloud computing environment, and an untrusted mobile device. In the Linux on cloud case, we’re provided with a cryptographically secure Pseudorandom Number Generator (PRNG) in the form of /dev/urandom. This is the “cryptographer-approved” and non-blocking source of entropy. It blends several sources like the “noise” generated by hardware interrupts and I/O activity metadata with a cryptographic function.

However, on a mobile device, almost anything goes: mobile devices cannot be trusted. While most of these are just as good as what’s available in the scenario above, it’s routine that the PRNG source on these devices isn’t very random at all. Given that there’s no way to certify the quality of these, it’s a big gamble to bet on mobile PRNGs. ID generation on low-trust mobile devices is an interesting and active area of academic research[1].

Even in an environment with a trustworthy PRNG device, implementation bugs can lead to collisions. In one particular case, an obscure bug in how OpenSSL deals with process forking led to a high collision rate for a pure PHP UUID library. This might sound a bit overboard, but it’s probably worth testing your UUID implementation for obvious collision-creating bugs. It’s more common than you might think. At the system level, dieharder is one of many well-regarded tools for analyzing the quality of a system’s PRNG.

Given the proper environment, there’s a vanishingly low risk of collision, bordering on the infeasible. To be able to depend on this uniqueness, a large enough number space must be used to make it much more likely that other extremely uncommon events will occur far sooner than a collision.

With 122 random bits, it would take a petabyte-sized pile of UUIDs (2^46) to even bring the chance of a collision into the realm of feasibility, around 1 in 50 billion. It would take billions of petabytes of UUIDs to make it likely.

The threat of an implementation bug or misconfiguration is vastly more important than the threat of random collision. Those concerned with UUID collision in a properly-configured system would find their time better spent pondering the impact of far more probable events, like solar flares, thermonuclear war, and alien invasion, on their systems. Just make sure your systems are properly configured.

Time Is On Our Side

In some cases the timestamp component of UUID Version 1 is actually quite useful. The first time I ran into this was in Apache Cassandra, where they’re called “TimeUUID.” In Cassandra, TimeUUIDs are sortable by timestamp, quite useful when needing to roughly order by time. The implementation swaps some of the random bits with a timestamp and a host identifier. The host ID is derived from the node’s IP addresses, which also form the unique identifier in a Cassandra cluster. 

The implementation has suffered from weaknesses that can compromise uniqueness in the face of clock skew (see CASSANDRA-11991). More importantly, host-identifiable information is embedded in the UUIDs, which, if we’ve learned anything in the past, is not a great idea. Even if these IDs are derived from local network addresses, security best practices discourage actively exposing this information to the outside world, even indirectly.

Flakey Friends

The ability to sort IDs by time was the primary motivation behind Twitter’s Snowflake, which popularized the concept of k-ordering by timestamp[2]. Twitter needed a way to sort piles of arbitrary tweets by creation time without global coordination. Embedding a timestamp in the ID provides this functionality without the overhead of an additional timestamp field.

K-ordering is a more precise way of saying roughly sorted. In Snowflake, a large amount of the design was driven by the need to fit these IDs into a 64-bit number space. This includes requiring dedicated ID-generation servers that use a separate strong coordination mechanism (ZooKeeper) to assign host IDs and store sequence checkpoints.

Inspired by Snowflake, the team at Boundary released Flake in early 2012. It also uses dedicated ID-generation server processes, but does not require a strong coordination mechanism. Flake is similar to UUID Version 1 in that it uses a much larger 128-bit number space and a 48-bit host identifier derived from the hardware address to protect against overlap in a distributed environment.

It primarily differs from UUID Version 1 in that it’s structured for lexicographic ordering. The bits of a Flake ID are arranged in such a way that users can expect that they will be ordered by their timestamp regardless of where they’re written. In contrast, Cassandra must implement specific collation logic to get the same behavior from their TimeUUID. 

Unfortunately it appears that Flake ID can expose host identification information to end users as this information is embedded in the generated IDs. While the implementation provided defends against clock skew situations, their uniqueness property does depend heavily on the forward movement of the wall clock.

One noteworthy feature included in Flake is base62 encoding, which provides a much more “portable” representation than UUID. The string representation of UUIDs is one of its weaker features. This may seem trivial, but the inclusion of the dash (-) character makes them less usable. An example of this is when UUIDs are indexed by a search engine, where the dashes will likely be interpreted as token delimiters. The base62 encoding avoids this pitfall and retains the lexicographic ordering properties of the binary encoding.

The Best of Both Worlds

When implementing an internal system at Segment, the team began using UUID Version 4 for generating unique identifiers. It was simple and required no additional dependencies.

After a few weeks, a requirement to order these identifiers by time emerged. This requirement wasn’t strict: the initial purpose was to enable log archiving to Amazon S3 where they are keyed by ranges of message identifiers. The existing UUIDs would have resulted in random dispersion of messages with no natural grouping property. However, if we could take advantage of the arrow of time, it would result in a natural grouping and a feasible number of objects in S3.

Thus KSUID was born. KSUID is an abbreviation for K-Sortable Unique IDentifier. It combines the simplicity and security of UUID Version 4 with the lexicographic k-ordering properties of Flake. KSUID makes some trade-offs to achieve these goals, but we believe these to be reasonable for both our use cases and many others out there.

KSUIDs are larger than UUIDs and Flake IDs, weighing in at 160 bits. They consist of a 32-bit timestamp and a 128-bit randomly generated payload. The uniqueness property does not depend on any host-identifiable information or the wall clock. Instead it depends on the improbability of random collisions in such a large number space, just like UUID Version 4. To reduce implementation complexity, the 122-bits of UUID Version 4 are rounded up to 128-bits, making it 64-times more collision resistant as a bonus, even when the additional 32-bit timestamp is not taken into account.

The timestamp provides 1-second resolution, which we found to be acceptable for a broad range of use cases. If a higher resolution timestamp is desired, payload bits can be traded for more timestamp bits. While high-resolution timestamp support is not included in our implementation, it is backwards compatible. Any implementation which uses 32-bit timestamps can safely work with KSUIDs that use higher resolution timestamps. 

A “custom” epoch is used that ensures >100 years of useful life. The epoch offset (14e8) was also chosen to be easily remembered and quickly singled out by human eyes.

KSUID provides two fixed-length encodings: a 20-byte binary encoding and a 27-character base62 encoding. The lexicographic ordering property is provided by encoding the timestamp using big endian byte ordering. The base62 encoding is tailored to map to the lexicographic ordering of characters in terms of their ASCII order.
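To make that layout concrete, here is a minimal sketch in Go of how a 20-byte KSUID could be constructed (an illustration of the format described above, not the library’s actual code):

    package main

    import (
        "crypto/rand"
        "encoding/binary"
        "fmt"
        "time"
    )

    // ksuidEpoch is the custom epoch offset (14e8 seconds) described above.
    const ksuidEpoch = 1400000000

    // newKSUID sketches the 20-byte layout: a 32-bit big-endian timestamp
    // followed by 128 bits of cryptographically random payload.
    func newKSUID() ([20]byte, error) {
        var id [20]byte
        ts := uint32(time.Now().Unix() - ksuidEpoch)
        binary.BigEndian.PutUint32(id[:4], ts)
        if _, err := rand.Read(id[4:]); err != nil {
            return id, err
        }
        return id, nil
    }

    func main() {
        id, err := newKSUID()
        if err != nil {
            panic(err)
        }
        // Because the timestamp is big endian, the raw bytes sort by creation time.
        fmt.Printf("%x\n", id)
    }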

Fixed-length encodings result in simpler and safer implementations. As a small bonus they are sometimes more efficient, such as in SQL databases, where variable-length data types result in additional storage overhead. Regardless of the chosen format, KSUIDs can be lexicographically ordered by time. The string representation is entirely alphanumeric, which avoids the problem of tokenized dashes in UUIDs.

Our Implementation

Today we’re open sourcing our KSUID implementation, which is written in Go. It implements popular idiomatic interfaces, making it easy to integrate into an existing codebase and use with other Go libraries. It comes packaged with a command-line tool for generating and inspecting KSUIDs.
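Usage looks roughly like the following; the function names here are from memory and may not match the package exactly, so treat this as a sketch and check the repository’s README for the real API:

    package main

    import (
        "fmt"

        "github.com/segmentio/ksuid"
    )

    func main() {
        // Generate a new KSUID and print its 27-character base62 representation.
        id := ksuid.New()
        fmt.Println(id.String())
    }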

Acknowledgements

This post would not have been possible without the efforts of Bitsavers, which gathered and archived the source material related to Apollo Computer. Thanks to Albert Strasheim, Calvin French-Owen, Evan Johnson, Peter Reinhardt, and Tido Carriero for their insightful comments and feedback.

References

[1] P. Jesus, C. Baquero, and P. Almeida: ID Generation in Mobile Environments (2006)
[2] T. Altman, Y. Igarashi: Roughly sorting: sequential and parallel approach (1989)



Kevin Burke on May 17th 2017

Recently we shared the techniques we used to save more than a million dollars annually on our AWS bill. While we went into detail about the various problems and solutions, the most common question we heard was: "I know I’m spending a ton on AWS, but how do I actually break that into understandable pieces?" 

At face value, this sounds like a fairly straightforward problem. 

You can easily split your spend by AWS service per month and call it a day. Ten thousand dollars of EC2, one thousand to S3, five hundred dollars to network traffic, etc. But what’s still missing is a synthesis of which products and engineering teams are dominating your costs. 

Then, add in the fact that you may have hundreds of instances and millions of containers that come and go. Soon, what started as a simple analysis problem quickly becomes unimaginably complex.

In this follow-up post, we’d like to share details on the toolkit we used. Our hope is to offer up a few ideas to help you analyze your AWS spend, no matter whether you’re running only a handful of instances, or tens of thousands.

Grouping by ‘product areas’

If you’re operating AWS at scale–it’s likely that you’ve hit two major problems.

First, it’s difficult to notice if one part of the engineering team suddenly starts spending a lot more than it used to. 

Our AWS bill is six figures per month, and the charges for each AWS component change rapidly. In a given week, we might deploy five new services, optimize our DynamoDB throughput, and add hundreds of customers. In this environment it’s easy to overlook that a single team spent $20,000 more on EC2 this month than they did last month.

Second, it can be difficult to predict how much new customers will cost. 

As background, Segment offers a single API which can send analytics data to any number of third-party tools, data warehouses, S3, or internal data pipelines. 

While customers are good at predicting how much traffic they will have and the products they’d like to use, we’ve historically had trouble translating this usage information to a dollar figure. Ideally we’d like to be able to say "1 million new API calls will cost us $X so we should make sure we are charging at least $Y."

Our solution to these problems was to bucket our infrastructure into what we dubbed ‘product areas’. In our case, these product areas are loosely defined as:

  1. integrations (the code that sends data from Segment to various analytics providers)

  2. API (the service that receives the data customer libraries send to Segment)

  3. warehouses (the pipeline that loads Segment data into a customer's data warehouse)

  4. website and CDN

  5. internal (shared support logic for the four above)

In scoping the project, we realized it would be next to impossible to measure everything. So instead, we decided to target a percentage of the costs in the bill, say, 80%, and try to get that measurement working end-to-end. 

It's better to deliver business value analyzing 80% of the bill than to shoot for 100%, get bogged down in the collection step, and never deliver any results. Shooting for 80% completeness (being willing to say "it's good enough") ended up saving us again and again from rabbit-holing into analysis that didn’t meaningfully impact our spend.

Gather, then analyze

To break out costs by product area, we needed to gather three sources of billing data, which we then had to join together:

  1. the AWS billing CSV - the CSV generated by AWS to provide the full billing line items

  2. tagged AWS resources – resources which could be tagged within the billing CSV

  3. untagged resources – services like EBS and ECS that required custom pipelines to tag usage with ‘product areas’

Once we calculated the product areas for each of these pieces of data, we could load them into Redshift for analysis.

1. The AWS Billing CSV

The place to start to understand your spend is the AWS Billing CSV. You can enable a setting in the billing portal and Amazon will write a CSV with detailed billing information to S3 every day.

By detailed, I mean VERY detailed. Here is a typical billing row:

That row is a charge for a whopping $0.00000001, or one one-millionth of a penny, for DynamoDB storage on a single table between 3AM and 4AM on February 7th. There are about six million rows in our billing CSV for a typical month. (Unfortunately, most cost more than a millionth of a penny.)

We use Heroku's awsdetailedbilling tool to copy the billing data from S3 to Redshift. This was a good first step, but we didn't have a great way to correlate a specific AWS cost with our own product areas (e.g. whether a given instance-hour is used for the integrations or warehouses product areas).

What’s more, about 60% of the bill is consumed by EC2. Despite being the lion’s share of the cost, understanding how a given EC2 instance mapped to a product area was impossible with the data provided by the billing CSV.

There’s a good reason why we couldn’t just use instance names to determine product areas. Instead of running a single process per host, we make heavy use of ECS (Elastic Container Service), to stack hundreds of containers on a host and achieve much higher utilization. 

Unfortunately, Amazon bills only for the EC2 instance costs, so we had zero visibility into the costs of the containers running on an instance: how many containers we were running at a typical time, how much of the pool we were using, and how many CPU and memory units we were using.

Even worse, information about container auto-scaling isn’t reflected anywhere in the billing CSV. To get this data for analysis, we had to write our own tooling to gather and then process it. I’ll cover how this pipeline works in the following sections.

Still, the AWS Billing CSV will provide very good granular usage data that will become the basis for our analysis. We just need to associate that data with our product areas.

Note: This problem isn’t going away either. Billing by the instance-hour is going to be a bigger and bigger problem from a "what am I spending money on?" perspective, since more companies are running fleets of containers across a set of instances, with tools like ECS, Kubernetes and Mesos. In a slight twist of irony, Amazon has had this same problem for years - each EC2 instance is a Xen virtual machine, running on the same bare metal machine as other instances.

2. Cost data from tagged AWS resources

The most important and readily available data comes from ‘tagged’ AWS resources.

Out of the box, the AWS billing CSV doesn’t include any tags in its analysis. As such, it’s impossible to discern how one EC2 instance or bucket might be used vs another.

However, you can enable certain tags to appear alongside your line item costs using cost allocation tags.

These tags are officially supported by many AWS resources: S3 buckets, DynamoDB tables, and so on. You can toggle a setting in the AWS billing console to make a cost allocation tag show up in the CSV. After a day or so, your chosen tag (we chose product_area) will start showing up as a new column next to the associated resources in the detailed billing CSV.

If you are doing nothing else, start by using cost allocation tags to tag your infrastructure. It’s essentially ‘free’ and requires zero infrastructure to run.

After we enabled cost allocation tags, we had two challenges: 1) tagging all of the existing infrastructure, and 2) ensuring that any new resources would automatically have tags.

Tagging your existing infrastructure

Tagging your existing infrastructure is pretty easy: for a given AWS product, query Redshift for the resources with the highest costs, bug people in Slack until they tell you how those resources should be tagged, and stop when you've tagged 90% or more of the resources by cost.

However, enforcing that new resources stay tagged requires some automation and tooling. 

To do this, we use Terraform. In most cases, Terraform's configuration supports adding the same cost allocation tags that you can add via the AWS console. Here's an example Terraform configuration for a S3 bucket:
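A representative snippet looks something like this (the bucket name and tag value are illustrative):

    resource "aws_s3_bucket" "segment_logs" {
      bucket = "segment-example-logs"
      acl    = "private"

      tags {
        product_area = "warehouses"
      }
    }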

Though Terraform provided the base configuration, we wanted to verify that every time someone wrote resource "aws_s3_bucket" into a Terraform file, they included a product_area tag. 

Fortunately, Terraform configurations are written in HCL (Hashicorp Configuration Language), which ships with a comment-preserving configuration parser. So we wrote a checker that walks every Terraform file looking for taggable resources lacking a product_area tag.
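Our checker leans on that HCL parser; as a rough approximation of what the test does, here is a simplified, string-matching stand-in (the taggable resource list and error wording are illustrative):

    package main

    import (
        "fmt"
        "io/ioutil"
        "os"
        "path/filepath"
        "strings"
    )

    // taggable lists resource types that must carry a product_area tag.
    var taggable = map[string]bool{
        "aws_s3_bucket":      true,
        "aws_dynamodb_table": true,
        "aws_instance":       true,
    }

    func main() {
        ok := true
        filepath.Walk(".", func(path string, info os.FileInfo, err error) error {
            if err != nil || !strings.HasSuffix(path, ".tf") {
                return nil
            }
            data, err := ioutil.ReadFile(path)
            if err != nil {
                return err
            }
            // Split the file into chunks, one per resource declaration.
            for _, chunk := range strings.Split(string(data), `resource "`)[1:] {
                typ := strings.SplitN(chunk, `"`, 2)[0]
                if taggable[typ] && !strings.Contains(chunk, "product_area") {
                    fmt.Printf("%s: %s is missing a product_area tag\n", path, typ)
                    ok = false
                }
            }
            return nil
        })
        if !ok {
            os.Exit(1)
        }
    }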

We set up continuous integration for the repo with Terraform configs, and then added these checks, so the tests will fail if anyone tries to check in a tag-able resource that's not tagged with a product area. 

This isn't perfect - the tests are finicky, and people can still technically create untagged resources directly in the AWS console, but it's good enough for now–the easiest way to provision new infrastructure is via Terraform.

Rolling up cost allocation tag data

Once you've tagged resources, accounting for them is fairly simple.

  1. Find the product_area tags for each resource, so you have a map of resource id => product area tags.

  2. Sum the unblended costs for each resource

  3. Sum those costs by product area, and write the result to a rollup table.

    SELECT sum(unblended_cost) FROM awsbilling.line_items WHERE statement_month = $1 AND product_name='Amazon DynamoDB';
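To get the per-product-area rollup from step 3, the same table can be grouped by the cost allocation tag column. A sketch, assuming the product_area tag is exposed as a user_product_area column in Redshift (the exact column name depends on how the CSV is loaded):

    SELECT user_product_area, sum(unblended_cost)
    FROM awsbilling.line_items
    WHERE statement_month = $1
    GROUP BY user_product_area;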

You might also want to break out data by AWS product - we have two separate tables, one for Segment product areas, and one for AWS products.

We were able to account for about 35% of the bill using traditional cost allocation tags.

Analyzing Reserved Instances

This approach works great for tagged, on-demand instances. But in some cases, you may have paid AWS up front for a ‘reservation’. Reservations guarantee a certain amount of capacity, in exchange for up-front payment at a lower fixed rate.

In our case, this means several large charges that show up in the December 2016 billing CSV need to be amortized across each month in the year. 

To properly account for these costs, we wanted to use the unblended cost that was incurred in the desired time period. The query looks like this:

Subscription costs take the form "$X0000 of DynamoDB," so they are impossible to attribute to a single resource or product area. 

Instead, we sum the per-resource costs by product area and then amortize the subscription costs according to the percentages. If the warehouses pipeline used 60% of our EC2 compute costs, we assume it used 60% of the reservation as well. 

This isn't perfect. If a large percentage of your bill is reserved up front, this amortization strategy will be distorted by small changes in the on-demand costs. In that case you'll want to amortize based on the usage for each resource, which is more difficult to sum than the costs.

3. Cost data from untagged AWS resources

While tagging instances and DynamoDB tables is great, other AWS resources don't support cost allocation tags. These resources required that we build a Rube Goldberg-style workflow to get the cost data into Redshift.

The two biggest untagged resource groups we had to deal with were ECS and EBS.

ECS

ECS is constantly scaling our services up and down, depending on how many containers a given service needs. It’s also responsible for re-balancing and bin-packing containers across individual instances.

ECS starts containers on hosts based upon “CPU and memory reservation”. A given service indicates how many CPU shares it requires, and ECS will either put new containers on a host with capacity, or scale up the number of instances to add more capacity. 

None of these ECS actions are directly reflected within our AWS Billing CSV–but ECS is still responsible for triggering the auto-scaling for each of our instances. 

Put simply, we wanted to understand what ‘slice’ of each machine a given container was using, but the billing CSV only gives us ‘whole unit’ breakdown by instance.

To determine the cost of a given service, we built our own pipeline that makes use of the following pieces:

  1. Set up a Cloudwatch subscription any time an ECS task gets started or stopped.

  2. Push the relevant data (Service name, CPU/memory usage, starting or stopping, EC2 instance ID) from the event to Kinesis Firehose (to aggregate individual events).

  3. Push the data from Kinesis Firehose to Redshift.

Once all of the task start/stop/size data is in Redshift, we multiply the amount of time a given ECS task ran (say, 120 seconds) by the number of CPU units it used on that machine (up to 4096 - this info is available in the task definition), to get a number of CPU-seconds for each service that ran on the instance. 

The total bill for the instance is then divided across services according to the number of CPU-seconds each one used.
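Here is a small sketch of that allocation step in Go, with hypothetical service names and an illustrative hourly instance cost:

    package main

    import "fmt"

    // allocateInstanceCost splits an instance's cost across the services that
    // ran on it, proportional to the CPU-seconds each service used.
    func allocateInstanceCost(instanceCost float64, cpuSeconds map[string]float64) map[string]float64 {
        total := 0.0
        for _, s := range cpuSeconds {
            total += s
        }
        costs := make(map[string]float64, len(cpuSeconds))
        if total == 0 {
            return costs
        }
        for svc, s := range cpuSeconds {
            costs[svc] = instanceCost * s / total
        }
        return costs
    }

    func main() {
        // CPU-seconds = task runtime in seconds * CPU units, as described above.
        usage := map[string]float64{
            "integrations": 120 * 1024,
            "api":          40 * 1024,
        }
        fmt.Println(allocateInstanceCost(0.10, usage)) // 0.10 = illustrative hourly cost
    }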

It's not a perfect method. EC2 instances aren't running at 100% capacity all the time, and the excess currently gets divided across the services running on the instance, which may or may not be the right culprits for that overhead. But (and you may recognize this as a common theme in this post), it's good enough.

Additionally, we want to map the right product area for each ECS service. However we can't tag those services in AWS because ECS doesn't support cost allocation tags.

Instead we added a product_area key to the Terraform module for each ECS service. This key doesn't lead to any metadata being sent to AWS, but it does feed a script that reads the product_area keys for each service.

That script then publishes the service name => b64encoded product area mappings to DynamoDB on every new push to the master branch. 

Finally, our tests then validate that each new service has been tagged with a product area.

EBS

Elastic Block Storage (EBS) also makes up a significant portion of our bill. EBS volumes are typically attached to an EC2 instance, and for accounting purposes it makes sense to count the EBS volume costs together with the EC2 instance. However, the AWS billing CSV doesn't show you which EBS volume was attached to which instance.

We again used Cloudwatch for this - we subscribe to any "volume attached" or "volume unattached" events, and then record the EBS => EC2 mappings in a DynamoDB table. 

We can then add EBS volume costs to the relevant EC2 instances before accounting for ECS costs.

Combining data across accounts

So far we’ve talked about all of our costs within the context of a single AWS account. However, this doesn’t actually reflect our AWS setup, which is spread across different physical AWS accounts.

We use an ops account not only for consolidated, cross-account billing, but to help provide a single access point for engineers making changes to production. We separate staging from production to ensure that an API call which might, say, delete a DynamoDB table, can be run safely with the appropriate checks. 

Of these accounts, prod dominates the cost–but our staging costs are still a significant percentage of the overall AWS bill. 

Where this gets tricky is when we need to write the data about ECS services in the stage realm to the production Redshift cluster. 

To write across accounts, we needed to allow the Cloudwatch subscription handlers to assume a role in production that can write to Firehose (for ECS) or to DynamoDB (for EBS). These are tricky to set up because you have to add the correct permissions to the right role in the staging account (sts.AssumeRole) and in the production account, and any mistake will lead to a confusing permission error.

For us, this means that we don't have a staging realm for our accounting code, since the accounting code in stage is writing to the production database.

While it’s possible to add a second service in stage that subscribes to the same data but doesn't write it, we decided that we can swallow the occasional problems with the stage accounting code.

Rolling up the statistics

Finally we have all of the pieces we need to run proper analysis: 

  1. tagged resources in the AWS billing CSV

  2. data about when every ECS event started and stopped

  3. a mapping between ECS service names and the relevant product areas

  4. a mapping between EBS volumes and the instances they are attached to

To roll all of this up for the analytics team, I broke out the analysis by AWS product. For each AWS product, I totaled the costs for each Segment product area.

The data gets rolled up into three different tables:

  1. Total costs for a given ECS service in a given month

  2. Total costs for a given product area in a given month

  3. Total costs for a (AWS product, Segment product area) in a given month. For example, "The warehouses product area used $1000 worth of DynamoDB last month."

The total costs for a given product area look like this:

And the costs for an AWS product combined with Segment product area look like this:

For each of these tables, we have a finalized table that contains the finalized numbers for each month, and a rollup append-only table that writes new data for a month as it updates every day. A unique identifier in the rollup table identifies a given run, so you can sum the AWS bill by finding all of the rows in a given run.

Finalized data effectively becomes our golden ‘source of truth’ that we use for top-level metrics and board reporting. Rollup tables are used to monitor our spend over the course of the month.

Note: AWS does not "finalize" your bill until several days after the end of the month, so any sort of logic that marks the billing record as complete when the month flips over is incorrect. You can detect when the bill becomes "final" because the invoice_id field in the billing CSV will be an integer instead of the word "Estimated".

A few last gotchas

Before closing, we realized that there are a few places where a little bit of preparation and knowledge could have saved us a lot of time. In no particular order, they are:

  • Scripts that aggregate data or copy it from one place to another are infrequently touched and often under-monitored. As an example, we had a script that copied the Amazon billing CSV from one S3 bucket to another, but it failed on the 27th-28th of each month because the Lambda handler doing the copying ran out of memory as the CSV got large. It took a while to notice this, because the Redshift database had a lot of data and the right-ish numbers for each month. We’ve since added monitoring to the Lambda function to ensure that it runs without errors.

  • Be sure these scripts are well documented, especially with information about how they are deployed and what configuration they need. Link to the source code in other places where they are referenced - for example, any place you pull data out of an S3 bucket, link to the script that puts the data in the bucket. Also consider putting a README in the S3 bucket root.

  • Redshift queries can be really slow without optimization. Consult with the Redshift specialist at your company, and think about the queries you need, before creating new tables in Redshift. In our case we were missing the right sortkey on the billing CSV tables. You cannot add sortkeys after you create the table, so if you don't do it up front you have to create a second table with the right keys, send writes to that one and then copy all the data over.

  • Using the right sortkeys took the query portion of the rollup run from about 7 minutes to 10-30 seconds.

  • Initially we planned to run the rollup scripts on a schedule - Cloudwatch would trigger an AWS Lambda function a few times a day. However the run length was variable (especially when it involved writing data to Redshift) and exceeded the maximum Lambda timeout, so we moved it to an ECS service instead. 

  • We chose Javascript for the rollup code initially because it runs on Lambda and most of the other scripts at the company were in Javascript. If I had realized I was going to need to switch it to ECS, I would have chosen a language with better support for 64 bit integer addition, and parallelization and cancellation of work.

  • Any time you start writing new data to Redshift, the data in Redshift changes (say, new columns are added), or you fix integrity errors in the way the data is analyzed, add a note in the README with the date and information about what changed. This will be extremely helpful to your data analysis team.

  • The blended costs are not useful for this type of analysis - stick to the unblended costs, which show what AWS actually charged you for a given resource.

  • There are 8 or 9 rows in the billing CSV that don't have an Amazon product name attached. These represent the total invoice amount, but throw off any attempt to sum the unblended costs for a given month. Be sure to exclude these before trying to sum costs.

The bottom line

As you might imagine, getting visibility into your AWS bill takes a large amount of work–both in terms of custom tooling and identifying expensive resources within AWS.

The biggest win we’ve found comes from making it easy to continuously estimate your spend rather than running the occasional ‘one-time-analysis’.

To do that, we’ve automated all of the data collection, enforced tagging within Terraform and our CI, and educated the entire engineering team on how to properly tag their infrastructure.

Rather than sitting within a PDF,  all of our data is continuously updated within Redshift. If we want to answer new questions or generate new reports, we can instantly get results via a new SQL query. 

Additionally we’ve exported that data into an Excel model so we can estimate exactly how much a new customer will cost. And we can also see if a single service or a single product area is suddenly costing a lot more, before that causes too much of a hit to our bottom line.

While it may not exactly mirror your infrastructure, hopefully this case study will be useful for helping you get a better sense of your costs and manage them as you scale!

Joe Christiani on May 11th 2017

In a scrappy B2B startup, user feedback is super valuable, but guerrilla research won’t cut it when you need a more targeted group of users. The Segment Design team found the users we needed and developed an automated process for recruiting participants and coordinating interviews using our own product and a few integrated applications.

Guerrilla research includes a range of fast and inexpensive techniques for designers and UX researchers (often the same person) to observe how users engage with products in the wild. Rather than recruiting participants, these ad hoc experiments are usually done on friends, peers, or strangers in coffee shops. But what if the users of your product are other businesses? Even worse, what if your product is (gasp) technically complex for a niche audience? The ‘check-out-my-mixtape’ methods that work well for consumer-facing products will get you a lot of blank stares.  

Sourcing users for research and feedback is particularly challenging for the Design Team at Segment because our product is deeply technical. While the customer feedback we gather through success tickets (Zendesk) and NPS surveys is valuable, we needed a way to explore our users’ behaviors and needs in more depth.

To do this, we designed an automated workflow to recruit participants for a UX research program and to coordinate testing and feedback sessions throughout the product development process. This article explains how we developed this workflow and describes the final system, which you can borrow from and improve.

Our First Attempt

Preliminary Goals:

  1. Find users willing to give us feedback.

  2. Interview users to better understand how they were using our product.

  3. Identify pain points in the current experience of using our app.

  4. Do all of the above at regular intervals without spamming our customers! 

Preliminary Solution:

Our initial process was inspired by Mesosphere, who wrote about their experience bootstrapping a UX research program. We used Customer.io to email new users, asking if they would be willing to join our User Experience program. 

Since we shared a Customer.io account with other teams at Segment, we could specify that only users who had not been recently contacted by other teams would receive the emails. (Sorry, thirsty UX researchers: empathy for your users includes not spamming them.) Recipients of the opt-in email would fill out a Google Form, which recorded their email in a spreadsheet.  We would then periodically email our pool of opted-in users with invitations to remote or in-person research.

The upside of this process was that it didn’t take very long to set up, and the Google Suite tools were free.  But as the pool of participants grew, it became clear that the time-intensive nature of manually managing the spreadsheet and sending emails wasn’t scalable. Recalling the proverb, ‘Physician, heal thyself,’ we took a Design Thinking approach and treated our user research program like any other user experience challenge.

Version 1.1: Automating the Workflow

We want to spend our time researching and designing, not sourcing users and coordinating sessions - Segment Design Team

For this iteration, we took a much more rigorous design thinking approach, treating the process holistically and considering both the researcher and the participant as users as we iterated.

Jobs To Be Done:

  1. As a product designer, when I am exploring or validating an idea, I want to be able to interact with users so I can learn more about them and incorporate this understanding into the product.

  2. When I am doing user research and testing, I want to be able to find users easily, so I can spend my time learning more about them rather than coordinating the process.

  3. When I am engaging in user research as a participant, I want to be able to give feedback quickly and easily, so I can move on to my primary responsibilities.

Problems: 

The first iteration of our user research recruiting and coordinating process required too much manual input on both sides. The opt-in experience for our participants was not ideal, since we were sending an email which, somewhat paradoxically, requested that they enter their email in a Google form.  The pain points on our side centered on the way email addresses of participants and the records of when they had been contacted were trapped in spreadsheets and not accessible in other tools.

What worked well from the last iteration:

Allowing users to explicitly opt in to our program made sure that we weren’t spamming people who weren’t interested in participating.

Solution:

We began to explore ways to streamline the recruitment and coordination process. When we mentioned the project to our developer counterparts, they were aghast at the existing manual process. Apparently repetition is the Comic Sans of engineering.  

As we ideated on how to automate the workflow, a crazy idea emerged: What if we used our product, a platform for customer data, to collect and manage other kinds of data? 

Sidebar: We did use our own product as part of this solution. No, this is not a sales pitch. Yes, there’s a free plan that should let you achieve this workflow.

Roll your own UX Research recruitment system in 20 minutes

1. Fill the top of the funnel

We continued to use Customer.io to send a triggered email to new users who signed up for Segment and ask if they’d like to opt in to the UX Research program.

2. Tag users who opt in

We designed a landing page and launched it with WebFlow. When users reached the page, we used Segment to assign an attribute to their user IDs with a tracking API. In this case, we assigned the attribute ux_research_opt_in=true to the user ID. If users chose to opt out of the program, we simply changed this attribute to ux_research_opt_in=false to remove them from the UX Research program without unsubscribing them from all Segment emails.

Pro tip: If you’re a designer who isn’t also a developer, get some help from a friend with this step. If you’re a designer who doesn’t know any developers, you might have some trouble shipping products.

3. Invite participants to sessions

Using Segment meant that the attribute assigned to the participant’s user ID was available in hundreds of third party tools. Having the opt-in attribute associated with user IDs allowed us to have consistent cohorts across various channels, in this case Customer.io for emails and Intercom for in-app messaging.

We planned to contact participants no more than once a month and sent them Calendly links so they could self-schedule remote or in-person sessions at their convenience for the subsequent two weeks.

Pro tip: Create an email alias for your UX Research program. Otherwise replies and out-of-office messages will end up in your inbox.  This also allows you to create a research and testing calendar that the entire Design Team or broader organization can view.

In addition to the remote and in-person sessions, we also sent out surveys and exercises like card sorting and tree testing asynchronously via Verify and Optimal Workshop.  This was important because it allowed users who live in different time zones across the world to share their perspective.

4. Show some love

Finally, we sent thank you gifts to our participants using Printfection, which lets customers select from various Segment swag and handles all aspects of fulfillment.

The overall flow ended up looking like this:

From an end-user perspective, an invitation email allows people to easily opt in to the program by entering their email address. They then receive monthly invitations to participate in specific feature tests or research modules with the option to schedule either remote or in-person sessions. Then, all they have to do is attend the session itself. Everything from the anti-spam opt-in process and self-scheduling to the customizable gifts is designed to give our participants flexibility and control over how often and when they choose to give us feedback.

We’d love to hear from folks tackling similar challenges in user research and design for B2B products.  And if you’d like to participate in our User Experience Program, just sign up for a free Segment account and we’ll be in touch— automatically.

Calvin French-Owen on April 20th 2017

The barrier holding back most open source projects is surprisingly mundane. It’s not test coverage. It’s not performance. It’s not code quality.

No, adoption often comes down to a single question: “is the tool easy to install and use?” 

Do I have to re-compile the project from source? Do I have to wait 15 minutes while I fetch hundreds of dependencies? Do I have to download a new package manager? Do I really need a picture of Guy Fieri?

Most open source authors recognize that configuration and setup is a barrier. And yet, we continue to underestimate how large of a barrier it is. The most successful projects make it as easy as possible to install and deploy a tool out of the box.

And some projects nail it; the massive adoption of Bootstrap, Redis, and Rails is largely due to their ‘out-of-the-box’ nature. And we have package managers like yarn, brew, and docker to thank for such an easy setup.

But hundreds of projects don’t have that “zero-to-60-in-5-seconds” approach.

That got me wondering: if package managers have revolutionized how easy it is to install and run software locally, why is it so much harder to get full applications into production? Why is installing a node module trivial, but running full services with a webserver, database, and job queue so difficult?

Where is our “package manager for the cloud?”

Managed vs Packaged

Before imagining what a cloud package manager might look like, it’s worth defining some terms. In particular, differentiating packaged infrastructure from managed infrastructure.

Packaging gives the user a consistent means of installing and then running software. Docker images, Debian packages, VMs, and dependency managers all come to mind here. The promise of the package manager is that you install it once, and then run all other code through it. 

Managed infrastructure is different. While it is packaged in various ways–the core value proposition is that you don’t handle operations. Instead, hard drive, machine, and system failures are off-loaded to a cloud provider. 

As a client, you agree to respect the limits set by whomever is maintaining the infrastructure. And in return, you get a set of guarantees related to uptime, SLOs, and fault-tolerance.

Managed infrastructure is appealing because it reduces the expertise needed to run complex software. It certainly won’t excuse poor architecture decisions or bad query patterns. But it is incredible that with a single click, you can create a high-throughput database which automatically handles re-balancing, backups, and scaling.

By using managed infrastructure, you’ve effectively ‘leased’ the expertise of site reliability engineers from Google or Amazon, and are willing to pay a small margin on top of the hardware costs to outsource a large portion of that skillset.

What’s most interesting about cloud primitives like RDS or Cloud Spanner is that outsourcing this expertise puts maintenance in the hands of, well, experts. We’re not talking about your usual “cheap outsourced devshop that churns out PHP apps.” Instead, startups get to borrow from internal teams at Amazon and Google–arguably some of the most experienced engineers on the planet when it comes to running large, multi-tenant infrastructure. 

That’s why, for the vast majority of use cases, managed infrastructure just makes sense. It’s not only cheaper, but ultimately more reliable as well.

Of course, leveraging ‘expert knowledge’ is only half the benefit. There’s also the ability to leverage complex (and proprietary) distributed systems like Dynamo or Bigtable. Building your own auto-scaling version of either system requires millions of dollars in R&D. But with the cloud, you can rent hourly access to them for pennies on the dollar.

It’s a trend that is continuing to grow in popularity, and it doesn’t seem to show any signs of slowing.

Unsurprisingly, both packaged and managed software have been undergoing a bit of a renaissance when it comes to new tech. And in particular, local development has been completely re-invented in the last 5 years with the advent of Docker and high-level package managers.

The Rising Tide of Packaged Infrastructure

Ever since the rise of containers and Docker, packaging has come a long way.

It’s now relatively trivial to spin up an entire environment that runs only within containers. 

Running individual containers is straightforward, simply grab the image you want and supply the right configuration:
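For example, spinning up a single Redis container is a one-liner (the image tag and port mapping here are illustrative):

    $ docker run -d --name redis -p 6379:6379 redis:3.2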

But what if you have an entire application run across multiple containers that you would like to coordinate? 

As a trivial example, suppose we want to run a Wordpress installation that connects to a database.

One way to do this would be using Docker-compose. You can see that we lay out our two services, one for our DB: 
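A sketch of that service, along the lines of the official WordPress example and with illustrative credentials, under the services: key of a docker-compose.yml:

    version: '2'
    services:
      db:
        image: mysql:5.7
        environment:
          MYSQL_ROOT_PASSWORD: example
          MYSQL_DATABASE: wordpress
          MYSQL_USER: wordpress
          MYSQL_PASSWORD: wordpress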

And one for our Wordpress site:
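This one sits under the same services: key; the port mapping and credentials are again illustrative:

      wordpress:
        image: wordpress:latest
        depends_on:
          - db
        ports:
          - "8080:80"
        environment:
          WORDPRESS_DB_HOST: db:3306
          WORDPRESS_DB_USER: wordpress
          WORDPRESS_DB_PASSWORD: wordpress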

If those are both defined in our docker-compose.yml file, we can then run both containers on any Docker infrastructure using:
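For a detached run, that is just:

    $ docker-compose up -d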

Or we could use Kubernetes Pods, which would even get us encrypted secrets automatically injected into the container. We can even grab the full YAML provided in the examples to boot the resources we need.

Because every cloud provider, container startup, and grad student is writing their own scheduler these days, there’s really no limit to coordinating containers. And while each implementation comes with a certain set of trade-offs and quirks, they all do a pretty good job of orchestrating many different services together.

But here’s the weird thing: we’ve had good packaging for a long time and we’ve started to see really nice ways to coordinate infrastructure we manage–and yet, there’s still not a lot of good tools for configuring managed infrastructure. 

Who Manages the Managed?

In the not-so-distant past, there was a reasonable solution to quickly launching managed services: the Heroku Button.

With one click, you could instantly get applications running within your Heroku account, along with whatever databases or queues you’d need. No obscure cron jobs, kernel tuning, or arcane init scripts required.

And that was great. IF you run your infrastructure on Heroku. But if you don’t, it’s a totally different story. 

If you’re running on AWS or GCE, now you also have to set up the proper security and network permissions, pay for additional data transfer outside your AWS account, and make sure that Heroku is actually scaling properly. Oh, and it probably uses totally separate monitoring and logging pipelines.

Unless you end up booting that infrastructure as a one-off or totally separate service, there’s still a lot of work to be done to mold an open source project to run successfully within your infrastructure. We’ve done this three or four times now for Segment–and it’s invariably annoying.

As more companies have transitioned off Heroku and onto the bigger clouds, the simplicity and speed of the Heroku button has really fallen by the wayside. And the need for cross-cloud managed infrastructure has appeared. 

So let’s explore the offerings on the market. Each is attacking this problem from a slightly different angle, at different layers of the stack.

An Example: Discourse

Before looking at the different stacks, I’d like to quickly introduce Discourse as a sort of prototypical app that can really benefit from managed infrastructure. 

Discourse is an open source platform for community discussion that you can run within your own infrastructure. 

Users wanting to run Discourse require three separate processes:

  • a Ruby on Rails server running Discourse on a single machine

  • a Postgres server

  • a Redis server

If this is an internal tool under relatively light load, that’s probably it. But if Discourse is running with a large number of clients, we’ll also need:

  • a load balancer

  • auto-scaling rules for the web service

And oh yeah, we’ll probably want some way of backing up that Postgres instance so we don’t drop everything accidentally, as well as a nice way to scale up the number of clients and the size of the database.

Now it’s starting to sound like we either require a full-fledged ops person to spin up this infrastructure, or a nice way of outsourcing it to managed products.

So how can we quickly provision it? 

There are a few major players who all seem well-positioned to tackle this problem: Terraform, Kubernetes, and a handful of cloud startups. Each has a different strategic position, so I’d like to explore how each tool can provide cross-cloud managed infrastructure.

The Terraform Solution

The product with the most cross-cloud adoption to date is Hashicorp’s Terraform.

First, a 30-second intro of Terraform: Terraform is a CLI tool for managing cloud resources. You can use it to provision, and then change the configuration of load balancers, auto-scaling groups, instances, and more. 

Terraform uses a static DSL to create resources. It then tracks these resources in a “state file” and creates “diffs” between your desired state and the current state of your infrastructure. 

Creating instances is simple. As a quick example, the configuration for a bastion resource might look like this:
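A sketch of such a resource (the AMI and subnet are passed in as variables here, and the names are illustrative):

    resource "aws_instance" "bastion" {
      ami           = "${var.bastion_ami}"
      instance_type = "t2.micro"
      subnet_id     = "${var.public_subnet_id}"

      tags {
        Name = "bastion"
      }
    }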

It has a type (aws_instance) and an identifier (bastion), as well as a bunch of attributes that are passed in as configuration. Whenever we plan and apply, any changes will be modified using the appropriate AWS API to update our resource.

What’s more interesting is that groups of resources can be created together using modules. Modules are re-usable pieces of configuration that will automatically create and manage all the resources within it. 

Modules effectively serve as a higher-level API to collections of managed infrastructure. Just what we needed!

As an example for Discourse, we could imagine that the repo itself could package a Terraform folder:

And then anyone interested in booting up Discourse within their own infrastructure could simply reference the module and pass in their own VPC ID.
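Hypothetically, that reference might look like this (the module source path is made up for illustration):

    module "discourse" {
      source = "github.com/discourse/discourse//terraform/aws"
      vpc_id = "${var.vpc_id}"
    }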

Internally this module would then create the internal resources that it needs:
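A heavily abbreviated sketch of what those internals might contain (attribute values are illustrative):

    resource "aws_launch_configuration" "discourse" {
      image_id      = "${var.discourse_ami}"
      instance_type = "${var.web_instance_type}"
      # ...
    }

    resource "aws_autoscaling_group" "discourse" {
      launch_configuration = "${aws_launch_configuration.discourse.id}"
      min_size             = 2
      max_size             = 10
      # ...
    }

    resource "aws_db_instance" "discourse" {
      engine                  = "postgres"
      instance_class          = "db.m4.large"
      allocated_storage       = 100
      backup_retention_period = 7
      # ...
    }

    resource "aws_elasticache_cluster" "discourse" {
      engine    = "redis"
      node_type = "cache.m3.medium"
      # ...
    }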

Under the hood, it would give us:

  • auto-scaling instances with an AMI running discourse

  • a managed RDS instance running postgres with hourly backups

  • a managed elasticache instance running redis

And voila! We have our production infrastructure, ready to go! By referencing the module, and passing in our required infrastructure–Terraform will automatically boot all the pieces that we need.

The best part here is that modules can support multiple providers just by shipping different Terraform configurations.

Need the GCE version instead? If the author has built an adapter for it, just reference a new path:
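Again, a purely hypothetical path:

    module "discourse" {
      source = "github.com/discourse/discourse//terraform/gce"
      # ...
    }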

Still want our good ol’ Heroku button? No problem, Terraform supports that too (with managed addons):

Under the hood, this might look something like this:
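A sketch using Terraform’s Heroku provider (the app name and addon plans are illustrative):

    resource "heroku_app" "discourse" {
      name   = "discourse-example"
      region = "us"
    }

    resource "heroku_addon" "postgres" {
      app  = "${heroku_app.discourse.name}"
      plan = "heroku-postgresql:standard-0"
    }

    resource "heroku_addon" "redis" {
      app  = "${heroku_app.discourse.name}"
      plan = "heroku-redis:premium-0"
    }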

It solves both the problems of running a service within your own infrastructure and relying on managed infrastructure for any team, regardless of whether they use Heroku or not. 

The Kubernetes Solution

Where Terraform is primarily used to provision the ‘base infrastructure’ and machine images, Kubernetes lives at a different part of the stack. It’s a scheduler and service orchestrator–designed to coordinate services both locally and in the cloud.

As a user, you first describe services within Kubernetes, and then install a Kubelet (the Kubernetes agent) on each host. Kubernetes will then determine how to optimally utilize the cluster such that multiple containers are run and exposed on each machine.

In short, Kubernetes is a scheduler that runs applications and handles service discovery, load balancing, and configuration–all across a cluster of machines.

Ordinarily I wouldn’t even put Terraform and Kubernetes in the same category–since Kubernetes is focused on orchestrating infrastructure that you are responsible for running yourself. It’s not really about booting up managed infrastructure at all. 

However, Kubernetes does support one important cloud primitive: managed load balancers. When defining your service, you can specify that Kubernetes should provision a load balancer to send traffic to it:
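A minimal Service manifest of type LoadBalancer (names and ports are illustrative):

    apiVersion: v1
    kind: Service
    metadata:
      name: discourse
    spec:
      type: LoadBalancer
      selector:
        app: discourse
      ports:
        - port: 80
          targetPort: 3000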

This load balancer will use the managed load balancers within any of the big clouds: AWS, GCE, Azure, and Openstack to name a few.

By combining primitives like the LoadBalancer with simple package management using Helm charts, Kubernetes is starting to become the common ‘substrate’ that any application can build upon.

The CLI command to install our discourse app might look like this:
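With Helm installed and a chart for Discourse available, that is roughly (the chart path is hypothetical):

    $ helm install --name discourse ./discourse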

As a developer who wants to support Helm, all we’d have to do is add the corresponding Helm charts: 
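That boils down to a conventional chart directory checked into the repo, roughly:

    discourse/
      Chart.yaml          # chart name and version
      values.yaml         # default image tags, instance sizes, etc.
      templates/
        deployment.yaml   # the Rails web processes
        service.yaml      # the LoadBalancer service shown above
        postgres.yaml
        redis.yaml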

And we’d allow anyone running Kubernetes to immediately run Discourse within their infrastructure.

What makes this approach interesting is the fact that Kubernetes has been gaining so much traction at the application level. It feels like it would be relatively doable to build a Kubernetes service type of Database or ObjectStore that maps to RDS or S3. It could be the trojan horse that slowly begins to spread Kubernetes pods as the ‘unit’ of managed hosting.

That said, there’s an immense amount of work involved to gain the sort of provider coverage that Terraform has. Even with the tight application integration that Kubernetes can provide, moving to support the managed infrastructure of just the major clouds would still be a major undertaking. 

The Managed Managed Solution?

As some last food for thought–it seems like there’s a big opportunity here for a startup that manages to smooth over the wrinkles and inconsistencies between various cloud providers. 

Imagine if you really could run your infrastructure dynamically on whichever cloud was cheapest. Or whichever provided the lowest latency. 

It effectively turns the decision of which cloud to use from an annual procurement to a real-time auction. 

There are a handful of startups doing interesting things when it comes to making managed infrastructure more user-friendly:

Convox: Convox provides a CLI and layer for working over AWS. It’s an open source tool where the fundamental abstraction is ‘applications’, just like Heroku. But unlike Heroku, Convox is built entirely atop an AWS pipeline, leveraging ECS for scheduling, Cloudwatch for logs, and Lambda to coordinate scheduled jobs. Because apps and logs are the core abstraction, Convox isn’t tied to the underlying infrastructure, though it certainly leverages AWS utilities heavily.

Skyliner: Like Convox, Skyliner launched more recently as a nice UI and deploy pipeline that layers over AWS. Their whole pitch is that they will give you a ‘best practice’ AWS setup that’s extremely easy to interface with. While they don’t yet support other clouds, being the ‘owner’ of the customer opens them up to start moving and building higher level primitives that can work across clouds. They ‘own’ the customer in the sense that the user really only interfaces with Skyliner, not AWS.

Zeit: Zeit’s Now takes the most radical approach (and the most similar to  ‘serverless’) of these three providers. Zeit provides the user the ability to upload a node module that it will run and host using only a single API command. Zeit abstracts almost all managed infrastructure away from the user, so they aren’t really sure what cloud they are running upon, or what resources are being consumed under the hood. It’s less about managing cloud infrastructure, and more about solving the problem of running code with applications as the focus. 

We’ve still yet to see these approaches gain more widespread adoption. But it’s still early days for most of them. 

The Second-order Cycle

Now, amongst all the rosy ideas of a universal package manager for the cloud, I intentionally skipped past the elephant in the room. Creating ‘one package manager for any cloud’ is incredibly difficult because cloud providers are incentivized to lock-in customers to their platform.

And overcoming that incentive is tough. It’s a good part of the reason why each cloud provider maintains idiosyncratic APIs, workflows, and specialized tooling. It certainly makes the job of a ‘blanket’ API that papers over these inconsistencies difficult at best.

This barrier is particularly acute with pieces of technology that are “hard” to develop. I’m talking about the R&D-heavy efforts behind proprietary platforms like DynamoDB, Cloud Spanner, BigQuery and Redshift.

But that said, there’s a second-order cycle at play with these sorts of expensive pieces of proprietary software. In many cases, we see databases and streaming systems mature along the following milestones:

  1. the software is developed internally at a larger tech company (Google, FB, etc)

  2. the software architecture is published as an academic paper

  3. an open-sourced reference implementation is released

  4. commercial and cloud support follow the tool’s user adoption

You can see this cycle again and again with the likes of Hadoop, Cassandra, HBase, Kafka, and Kubernetes. All these tools were born out of companies initially, and then gained widespread adoption over time, either as Apache projects or sponsored directly by the company that conceived them.

As a textbook example, take Hadoop. 

Hadoop’s journey from open source project to managed infrastructure hits the following milestones [2]:

And Hadoop isn’t the only example: we’ve started to see the emergence of hosted Kafka on Heroku and Spark on AWS.

It’s software in this cycle that does stand a fighting chance to become cross-cloud infrastructure. Databases like Redis, MySQL, and Postgres are so ubiquitous that they have effectively become table stakes that cloud providers must offer as managed products.

We’ll continue to see these sorts of products emerge as new technology gains widespread adoption. It would come as no surprise if we started seeing hosted Kubernetes deployments emerging outside of just GCE.

Now, there’s certainly still a lot of ground to be made up when it comes to booting up infrastructure “anywhere” and running applications at a higher level of abstraction than instances and networks. But in a world where it’s so easy to boot similar types of infrastructure in different clouds, there’s a huge opportunity to pave over all those differences and create a unified abstraction layer across all of them. 

Instead of treating infrastructure at the machine or even container level, I’m expecting that we’ll start seeing infrastructure where “applications” are the core unit of abstraction. And I’m excited to see how that world plays out.


[1]: Kafka is a great example of this. Simple architecture combined with extremely high throughput. The Kubernetes architecture is another one that allows people to build on it quickly. But these seem to be the exception rather than the rule.

[2]: https://en.wikipedia.org/wiki/Apache_Hadoop#Timeline

Peter Reinhardt on April 12th 2017

Running QA tests for Segment’s UI was taking way too long. Sure, we had strong component-level tests for our UI kit. But to test our whole app we needed to painstakingly poke around looking for oddities.

Manual testing like this is extremely time-consuming, and you can easily miss accidental, small visual differences that degrade the user experience. Shipping even the smallest of these bugs to production then creates an even costlier bug reporting cycle that involves customers and the support team. That’s no good, we wanted a better way!

So we began experimenting with perceptual diffing. Perceptual diffing compares screenshots of two releases pixel by pixel, and then highlights the differences.

This article explains exactly what perceptual diffing looks like and how to set up perceptual diffing easily with Nightmare and Niffy — a new open-source library we’re releasing today.

Let’s Play a Game… Can you Spot the Regression?

Below is a real release of Segment’s UI from September 2015. This is a screenshot of our “Workspaces” page on staging (left) and production (right):

Can you see the regression?

Well, there are actually two regressions! And I didn’t see either of them when I was testing this manually in 2015.

This is where perceptual diffing comes in: it highlights every pixel change. Here’s what Niffy sees:

As you can see, perceptual diffing makes both regressions immediately obvious:

  1. The lock icon is missing from the bottom paragraph of text.

  2. The “Enterprise Plan” text under the “Segment” workspace has been replaced with “Business Plan” (broken logic that should standardize the naming).

That said, not all perceptual diffing highlights are regressions. If you ship an update to part of the product, the diffing will go nuts with red highlights. But that’s a good thing! Perceptual diffing really shines by catching bugs on all the other views, where you expect to see zero changes.

Implementation

When we first heard about perceptual diffing Somewhere on the Internet™, we were quite intrigued. Demos like the one above felt extremely promising for reducing our manual testing burden, and we wanted to get this working for Segment. But as we researched the available tools, they fell into two groups: (1) hosted tools like VisualPing, which are designed for change detection on public static sites, and (2) open source tools like pdiff, which are aging and also work best on public static sites. Neither group was the right fit for us, because these tools couldn’t navigate into our app, click around, and test workflows.

So we decided to build a lightweight perceptual diffing layer on top of Nightmare, our browser automation library. It’s called Niffy and we’ll show you how to use both Nightmare and Niffy below.

The Basics with Nightmare

Perceptual diffing has three main steps:

  1. Capture screenshots of pages and views in your app.

  2. Diff two sets of screenshots and produce a diff-highlight.

  3. Trigger these capture and diff steps at the appropriate moment in the release process.

Capture

Nightmare makes it easy to capture a screenshot. Here’s a fully-functional example:
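In spirit it’s just a few chained calls; the URL and output path below are placeholders rather than anything specific to Segment:

```js
// capture.js – a minimal Nightmare screenshot capture (sketch; URL and path are placeholders)
const Nightmare = require('nightmare');

const nightmare = Nightmare({ show: false });

nightmare
  .goto('https://segment.com')           // load the page
  .wait('body')                          // wait until the DOM is ready
  .screenshot('/tmp/niffy/homepage.png') // write a PNG of the current viewport
  .end()                                 // tear down the browser instance
  .then(() => console.log('captured homepage'))
  .catch((err) => console.error('capture failed:', err));
```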

But capturing static URLs is not that interesting. Where Nightmare really shines is more complex interactions and app states. For example, you likely want to (1) log in, (2) navigate to some part of the app, (3) open up a modal, and then take screenshots to make sure core workflows are tested. 

Here’s a working example you can copy+paste and run:
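The shape of that script is roughly the following; every URL, selector, and credential here is a hypothetical placeholder you would swap for your own app’s:

```js
// workflow.js – log in, navigate, open a modal, then screenshot (sketch;
// all routes, selectors, and credentials below are hypothetical placeholders)
const Nightmare = require('nightmare');

const nightmare = Nightmare({ show: false });

nightmare
  .goto('https://app.example.com/login')
  .type('input[name="email"]', 'qa@example.com')            // fill in the login form
  .type('input[name="password"]', process.env.QA_PASSWORD)  // supplied via the environment
  .click('button[type="submit"]')
  .wait('.dashboard')                                        // wait for the post-login view
  .goto('https://app.example.com/settings')                  // navigate deeper into the app
  .click('.open-invite-modal')                               // open a modal
  .wait('.invite-modal')
  .screenshot('/tmp/niffy/invite-modal.png')                 // capture the resulting state
  .end()
  .then(() => console.log('captured invite modal'))
  .catch((err) => console.error('workflow failed:', err));
```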

With simple Nightmare scripts like this you’re able to get the UI into complex states and easily capture screenshots.

Diff

Once you have matched screenshots of two versions of the same UI, you need a way to generate a highlighted difference. The naive solution is to take the difference of the pixel values and display that, but this turns out to be unreadable: you just get a giant black image. If you instead overlay the two screenshots at averaged opacity, you still just get a few randomly colored pixels here and there:

So we dug more closely into other perceptual diffing tools, and then approximately copied what they do: copy over equivalent pixels with partial transparency, and make mismatched pixels red (you can see the exact diffing algorithm we use in Niffy here).
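To make that strategy concrete (as an illustration of the idea, not Niffy’s actual code), here’s what such a pixel loop over two same-sized PNGs could look like with the pngjs package:

```js
// diff.js – rough illustration of the highlight strategy (not Niffy's exact code):
// matching pixels are copied over with partial transparency, mismatches become solid red.
// Assumes both screenshots have identical dimensions.
const fs = require('fs');
const { PNG } = require('pngjs');

function diff(pathA, pathB, outPath) {
  const a = PNG.sync.read(fs.readFileSync(pathA));
  const b = PNG.sync.read(fs.readFileSync(pathB));
  const out = new PNG({ width: a.width, height: a.height });

  // walk the RGBA buffer four bytes (one pixel) at a time
  for (let i = 0; i < a.data.length; i += 4) {
    const same =
      a.data[i] === b.data[i] &&
      a.data[i + 1] === b.data[i + 1] &&
      a.data[i + 2] === b.data[i + 2];

    if (same) {
      // keep the original pixel, but faded, so the layout stays recognizable
      out.data[i] = a.data[i];
      out.data[i + 1] = a.data[i + 1];
      out.data[i + 2] = a.data[i + 2];
      out.data[i + 3] = 64; // partial transparency
    } else {
      // flag the mismatch in solid red
      out.data[i] = 255;
      out.data[i + 1] = 0;
      out.data[i + 2] = 0;
      out.data[i + 3] = 255;
    }
  }

  fs.writeFileSync(outPath, PNG.sync.write(out));
}
```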

Trigger

We’ve looked at several different triggers for doing this perceptual diffing. There are a few challenges:

  1. Where do you (reliably) store screenshots of sequential versions?

  2. When exactly is your new release deployed to staging and ready to be diffed?

The answers to these questions are pretty different depending on each company’s cloud provider, continuous integration environment, and deployment process. We’ve found so far that the simplest trigger is to run the diffing manually (make test), comparing staging and production. This is the method we outline next with Niffy.

Niffy Makes this Simple

Niffy is designed to bundle up the capture and diff steps into a library that can be easily used in a mocha test. Niffy exposes the internal Nightmare instance so that you can do arbitrary clicking, typing, checkboxing, etc. before you take your diffing screenshot (see Logged In example below).

Here’s the output of our Niffy tests run at time-of-writing:

All you need to do is run those open /tmp/niffy/…  commands to see immediately what broke…

First, it looks like our Settings Overview page got a big update!

And second, we’re seeing an error alert on staging on the Settings Move Source page that we should fix in our staging environment for better testing:

Setting up Niffy

To help you get started with Niffy, here’s an abbreviated snippet from the diffing test suite (test/index.js) we use on Segment itself (and there’s a ready-made example test suite in the Niffy repo that you can run with make test):
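As a sketch of what such a suite looks like (the hosts and paths here are placeholders; it assumes Niffy’s test/end helpers and mocha-generators for the generator-style tests):

```js
// test/index.js – abbreviated sketch of a Niffy + mocha suite
// (hosts and paths are placeholders, modeled loosely on the Niffy README pattern)
require('mocha-generators').install(); // lets mocha run the generator-based tests below

const Niffy = require('niffy');

describe('perceptual diffs: production vs. staging', function () {
  let niffy;

  before(function () {
    // first argument is the "base" host, second is the host under test
    niffy = new Niffy('https://example.com', 'https://staging.example.com', { show: false });
  });

  it('/', function* () {
    yield niffy.test('/'); // screenshots the path on both hosts and diffs them
  });

  it('/pricing', function* () {
    yield niffy.test('/pricing');
  });

  after(function* () {
    yield niffy.end(); // shut down the underlying Nightmare instance
  });
});
```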

With Makefile:
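A minimal version might be as simple as this sketch (it assumes mocha is installed as a local dev dependency):

```make
# run the perceptual diff suite with `make test`
test:
	@node_modules/.bin/mocha

.PHONY: test
```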

And test/mocha.opts:
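Here the main thing is a generous timeout, since driving a real browser is slow; the values below are illustrative:

```
--timeout 60000
--slow 30000
```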

And package.json:
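Something like the following, with the version ranges purely illustrative:

```json
{
  "name": "perceptual-diff-tests",
  "private": true,
  "scripts": {
    "test": "mocha"
  },
  "devDependencies": {
    "mocha": "^3.0.0",
    "mocha-generators": "^1.2.0",
    "niffy": "^1.0.0"
  }
}
```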

To get started with perceptual diffing, head over to the Niffy repo, or use Nightmare directly. And lastly, if you like building software to solve complicated business problems like this, we’re hiring! Or if you’re working on open source full-time, check out our Open Fellowship!

Fouad Matin on March 23rd 2017

Today we’re proud to announce the Segment Open Fellowship. The Fellowship is a three month long program supporting three to five open-source developers with $8k per month to focus full-time on their project, no other strings attached.

Since the very beginning, open source has played a critical role in Segment’s journey. As we’ve grown, we’ve kept creating useful tools and utilities that we open source to give back to the community that gave us our start. As Segment scales, we’d like to keep scaling our support for the open-source community, even beyond our internal needs.

We’ll award grants to 3-5 developers to work on an open-source project for 3 months. They can work out of our office in SF or remotely.

The primary goal of the fellowship is to enable participants to fully dedicate themselves to a project for a few months. We’re hoping to give them a chance to speed the adoption of a new, fast-growing project. Or maybe help them build some long-awaited key features of a library that’s already widely used. Or perhaps even jump-start an entirely new idea altogether.

It’s certainly an experiment, but one which we believe has the potential to benefit developers all over the world. And we couldn’t be more excited to help support progress in the open-source community.

Get the details and apply here →

Know someone who should apply? Let them know on Twitter or on Facebook

Thank you Stripe for the Open Source Retreat and Google for Summer of Code as inspiration for this program.
