

Benjamin Yolken on September 15th 2020

Today, we're releasing kubeapply, a lightweight tool for git-based management of Kubernetes configs. Here's why.


Achille Roussel, Rick Branson on March 14th 2017

For an early startup, using the cloud isn’t even a question these days. No RFPs, provisioning orders, or physical shipments of servers. Just the promise of getting up and running on “infinitely scalable” compute power within minutes.

But, the ability to provision thousands of dollars worth of infrastructure with a single API call comes with a very large hidden cost. And it’s something you won’t find on any pricing page.

Because outsourcing infrastructure is so damn easy (RDS, Redshift, S3, etc), it’s easy to fall into a cycle where the first response to any problem is to spend more money.

And if your startup is trying to move as quickly as possible, the company may soon be staring at a five, six, or seven figure bill at the end of every month.

At Segment, we found ourselves in a similar situation near the end of last year. We were hitting the classic startup scaling problems, and our costs were starting to grow a bit too quickly. So we decided to focus on reducing the primary contributor: our AWS bill.

After three months of focused work, we managed to cut our AWS bill by over one million dollars annually. Here is the story of how we did it.

Cash rules everything around me

Before diving in, it’s worth explaining the business reasons that really pushed us to build discipline around our infrastructure costs.

The costs for most SaaS products tend to find economies of scale early. If you are just selling software, distribution is essentially free, and you can support millions of users after the initial development. But the cost for infrastructure-as-a-service products (like Segment) tends to grow linearly with adoption. Not sub-linearly.

As a concrete example: a single Salesforce server supports thousands or millions of users, since each user generates a handful of requests per second. A single Segment container, on the other hand, has to process thousands of messages per second–all of which may come from a single customer.

By the end of Q3 2016, two thirds of our cost of goods sold (COGS) was the bill from AWS. Here’s the graph of the spend on a monthly basis, normalized against our May spend.

Our infrastructure cost was unacceptably high, and starting to impact our efforts to create a sustainable long-term business. It was time for a change.

Getting a lay of the land

If the first step in cost reduction is “admitting you have a problem”, the second is “identifying potential savings.” And with AWS, that turns out to be a surprisingly hard thing to do.

How do you determine the costs of an environment that is billed hourly with blended annual commits, auto-scaling instances, and bandwidth costs?

There are plenty of tools out there that promise to help optimize your infrastructure spend, but let’s get this out of the way: there is no magic bullet.

In our case, this meant digging through the bill line-by-line and scrutinizing every single resource.

To do this, we enabled AWS Detailed billing. It dumps the full raw logs of instance-hours, provisioned databases, and all other resources into S3. In turn, we then imported that data into Redshift using Heroku’s AWSBilling worker for further analysis.

It was a messy dataset, but some deep analysis netted a list of the top ~15 problem areas, which totaled up to around 40% of our monthly bill.

Some issues were fairly pedestrian: hundreds of large EBS drives, along with over-provisioned cache and RDS instances, all relics left over from incidents of increased load that had not been sized back down.

But some issues required clear investment and dedicated engineering effort to solve. Of these, there were three fixes which stood out to us above all else:

  • DynamoDB hot shards ($300,000 annually)

  • Service auto-scaling ($60,000 annually)

  • Bin-packing and consolidating instance types ($240,000 annually)

The long-tail of cost reductions accounted for the remaining $400,000/year. And while there were a handful of lessons from eliminating those pieces, we’ll focus on the top three.

DynamoDB hot shards

Segment makes heavy use of DynamoDB for various parts of our processing pipeline. Dynamo is Amazon’s hosted version of Cassandra–it’s a NoSQL database that acts as a combination K/V and document store. It has support for secondary indexes to do multiple queries and scans efficiently, and abstracts away the underlying partitioning and replication schemes.

The Dynamo pricing model works in terms of throughput. As a user, you pay for a certain capacity on a given table (in terms of reads and writes per second), and Dynamo will throttle any reads or writes that go over your capacity. At face value, it feels like a fairly straightforward model: the more you pay, the more throughput you get.

However, correctly provisioning the throughput required is a bit more nuanced, and requires understanding what’s going on under the hood.

According to the official documentation, DynamoDB splits tables into partitions based upon a consistent hashing scheme.

Under the hood, that means that all writes for a given key will go to the same server and same partition.

Now, common sense says reads and writes should be distributed uniformly across partitions. You don’t want a hot partition or a single server constantly overloaded with writes while your other servers sit idle.

Unfortunately, we were seeing a ton of throttling, even though we’d provisioned significantly more capacity on our DynamoDB tables than we were actually consuming.

To understand the upstream events, here’s roughly what our Dynamo setup looks like:

We have a bunch of unpartitioned, randomly distributed queues that are read by multiple consumers. These objects are then written into Dynamo. If Dynamo slowed down, it would cause the entire queue to back up. And what’s more, we would have to increase throughput capacity far more significantly than the required write throughput in order to drain the queue.

What had us confused was that our keys are partitioned by the ID of the end user being tracked. And with hundreds of millions of users tracked per day, the write load should have been distributed evenly across the keyspace. We’d followed the exact recommendation from the AWS documentation.

So why was Dynamo still getting throttled? It appeared there were two answers.

The first was the fact that the throughput pricing for Dynamo actually dictates the number of partitions, rather than the total throughput.

It’s easy to overlook, but the Amazon DynamoDB docs state the following when it comes to partitions:

A single partition can hold approximately 10 GB of data, and can support a maximum of 3,000 read capacity units or 1,000 write capacity units.

The implication here is that you aren’t really paying for total throughput so much as for partition count. If a few keys saturate a single partition, doubling the table’s capacity only splits that hot partition in two rather than scaling its throughput linearly. And even then, any single key is still limited to the throughput of one partition.
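To make that concrete, here’s a back-of-the-envelope sketch of the partition math as the docs of that era described it. The real partitioning logic is internal to Amazon, so treat these numbers as rough estimates rather than guarantees:

    // Rough partition estimate, per the documented limits above:
    // ~10GB of data, 3,000 RCUs, or 1,000 WCUs per partition.
    function estimatePartitions(tableSizeGB, readCapacity, writeCapacity) {
      const byCapacity = Math.ceil(readCapacity / 3000 + writeCapacity / 1000)
      const bySize = Math.ceil(tableSizeGB / 10)
      return Math.max(byCapacity, bySize)
    }

    // e.g. a 500GB table provisioned at 1,000 reads/sec and 10,000 writes/sec
    const partitions = estimatePartitions(500, 1000, 10000) // => 50
    // The provisioned writes are spread across partitions, so each one only
    // absorbs ~10,000 / 50 = 200 writes/sec. A single hot key gets throttled
    // long before the table-level capacity is exhausted.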

When we talked with the AWS team, their internal monitoring told a different story than our imagined ‘uniform distribution’. And it explained why we were seeing throughput far below what we had provisioned:

This is a heatmap they provided of the total partitions, along with the key pressure on each. The Y-axis maps partitions (we had 647 partitions on this table) and the X-axis marks time over the course of the hour. More frequently accessed ‘hot’ partitions show up as red, while partitions that aren’t accessed show up as blue.

Vertical, non-blue lines are good–they indicate that a bulk load happened and was evenly spread across the keyspace, maximizing our throughput. However, if you look down at the 19th partition, you can see a thin streak of red:

Uh oh. We’d found our smoking gun: a single hot partition.

It was clear something needed to be done. The heat map they provided was a major key, but its granularity is at the partition level, not the key level. And unfortunately, there’s no out-of-the-box way provided to identify hot keys (hint hint!).

So we dreamt up a simple hack to give us the data we needed: anytime DynamoDB throttled us, we logged the key. We temporarily reduced the table’s provisioned capacity to induce the throttling, then aggregated the logs together and extracted the top keys.
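Here’s a minimal sketch of the idea using the Node aws-sdk (the table and key names are hypothetical, not our production schema):

    const AWS = require('aws-sdk')
    const dynamo = new AWS.DynamoDB.DocumentClient()

    // Write the event as usual, but log the partition key whenever Dynamo
    // throttles us, so the worst offenders can be aggregated later.
    function writeEvent(item) {
      return dynamo.put({ TableName: 'events', Item: item }).promise()
        .catch(function (err) {
          if (err.code === 'ProvisionedThroughputExceededException') {
            console.log('throttled key:', item.userId)
          }
          throw err
        })
    }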

The findings? A number of keys that were the result of, shall-we-say, “creative” uses of Segment.

Here’s an example of what we were seeing:

Spot the issue?

At a certain time every day, it appeared as though a daily automated test ran against our production API, resulting in a burst of hundreds of thousands of events attached to a single userId (literally user_id in this case). And that userId was either set statically or incorrectly interpolated.

While we can fix bugs in our own code, we can’t control our customers.

It was clear from examining each case that there was no value in properly handling this data, so we built a set of blocked keys (“userId”, “user_id”, “#{user_id}” and variants) from the throttling logs. Over a few days we slowly decreased the provisioned capacity, blocking any newly discovered badly behaved keys, until eventually we had reduced capacity by 4x.
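In spirit, the blocklist is nothing more than a set lookup in front of the write path. The keys below are the offenders called out above; everything else is illustrative:

    // Keys discovered via the throttle logs; drop them before they hit Dynamo.
    const BLOCKED_KEYS = new Set(['userId', 'user_id', '#{user_id}']) // ...and variants

    function shouldWrite(item) {
      return !BLOCKED_KEYS.has(item.userId)
    }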

Of course, fixing individual partitions and blacklisting keys is only half the battle. We’re in the process of moving from NSQ to Kafka, which will give us proper partitioning upstream of Dynamo and ensure that we batch writes efficiently, merging changes on a small subset of servers rather than spreading writes globally.

Service auto-scaling

A little bit of background on our stack: Segment adopted a micro-service architecture early on. We were among the first users of ECS (EC2 Container Service) for container orchestration, and Terraform for managing all of our AWS resources.

ECS manages all of our container scheduling. It’s a hosted AWS service, which requires each instance to run a local ECS-agent. You submit jobs to the ECS API, and it communicates with the agent running on each host to determine which containers should run on which instances.

When we first started using ECS, it was easy to auto-scale instances, but there was no convenient way to auto-scale individual containers.

The recommended approach was to build a frankensteinian pipeline of Cloudwatch alerts which would trigger a Lambda function that updated the ECS API. But in May 2016, the ECS team launched first class auto-scaling for services.

The approach is fairly simple: it’s effectively the same as the Cloudwatch-and-Lambda pipeline described above, but with far fewer moving parts.

Step one: set limits on CPU and memory thresholds for the ECS service.

It takes about 30 seconds to do, and then the service will automatically scale the number of tasks up and down in relation to the amount of resources it’s using.

Step two: we enabled our instances to scale based upon the desired ECS resource allocation. That means if a cluster no longer had enough CPU or memory to place a given task, AWS would automatically add a new instance to the auto-scaling-group (ASG).
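The console makes this a few clicks, but the same thing can be wired up through the Application Auto Scaling API. A rough sketch with the JavaScript aws-sdk follows; the cluster, service, and role names are placeholders, and the thresholds are examples rather than our production values:

    const AWS = require('aws-sdk')
    const autoscaling = new AWS.ApplicationAutoScaling()

    // Register the ECS service's desired count as a scalable target...
    autoscaling.registerScalableTarget({
      ServiceNamespace: 'ecs',
      ResourceId: 'service/my-cluster/my-service',
      ScalableDimension: 'ecs:service:DesiredCount',
      MinCapacity: 2,
      MaxCapacity: 50,
      RoleARN: 'arn:aws:iam::123456789012:role/ecs-autoscale-role' // placeholder
    }).promise()
      // ...then attach a policy that a CloudWatch CPU or memory alarm can trigger.
      .then(() => autoscaling.putScalingPolicy({
        PolicyName: 'scale-up-on-cpu',
        ServiceNamespace: 'ecs',
        ResourceId: 'service/my-cluster/my-service',
        ScalableDimension: 'ecs:service:DesiredCount',
        PolicyType: 'StepScaling',
        StepScalingPolicyConfiguration: {
          AdjustmentType: 'ChangeInCapacity',
          Cooldown: 60,
          StepAdjustments: [{ MetricIntervalLowerBound: 0, ScalingAdjustment: 1 }]
        }
      }).promise())
      .catch(console.error)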

How are the results?

In practice this works really well (as modeled by our API containers):

Our traffic load pretty closely follows the U.S. peaks and troughs (with a large rise at 9:00am EST). Because we only see about 60% of peak traffic at night and on weekends, adding auto-scaling has let us save substantially without having to worry about sudden traffic spikes.

The additional benefit has been automatically scaling down after over-provisioning to deal with excess load. We no longer have to run at 2x the capacity, since the capacity is set dynamically. Which brings us to the last improvement: bin packing.

Bin packing and consolidating instance types

We’d long contemplated switching to bigger instances, and then packing them with containers. But until we started on “project benjamin” (the internal name for our cost-cutting effort), we didn’t have a clear plan to get there.

There’s been a lot written about getting better performance from running on bigger virtual hosts. The general argument is that you can get less steal from noisy neighbors if you are the only one on a physical machine. And there’s more likelihood that you will be the sole VM on a physical machine if you are running at the largest possible instance size.

There’s a handful of additional benefits as well: fewer hosts means a lower cost of per-host monitoring and quicker image rollouts.

Moreover, if you are using the same instance type (big or small) you can get a much cheaper bill using reserved instances. Reserved instances are nearly 40% off the per-hour price, but require an annual commit.
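To put rough numbers on it (the pricing here is purely hypothetical): a fleet of ten instances at $1.00 per hour on demand runs about $87,600 a year, while the same fleet at a ~40% reserved discount comes to roughly $52,600, provided you’re confident you’ll still want those instances in twelve months.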

So, we realized it was in our best interest to start consolidating the instances we were running on and building an army of c4.8xlarges (our workload is largely compute and I/O bound). But to get there, we first had to move off elastic load balancers (ELBs) and onto the new application load balancers (ALBs).

To understand what moving to ALBs gives us vs the classic ELB, it’s worth talking through how they work under the hood.

From our best estimation, ELBs are essentially built atop an army of small, auto-scaling instances running HAProxy.

When using ECS with ELBs, each container runs on a single host port specified by the service definition. The ELB then connects to that port and forwards traffic to each instance.

This has three major ramifications:

  1. If you want to run more than one service on a given host, each service must listen on a unique port so they don’t collide.

  2. You cannot run two containers of the same service on a single host because they will collide and attempt to listen on the same port. (no bin packing)

  3. If you have n running containers, you must keep n+1 hosts available to deploy new containers (assuming that you want to maintain 100% healthy containers during deploys).

In short, using ELBs in combination with ECS required us to over-provision instances and stack only a few services per instance. Hello cost city, population: us.

Fortunately for us, the port collision problem was solved with the introduction of the ALB.

The ALB allows ECS to set ports dynamically for individual containers, and then pack as many containers as can fit onto a given instance. Additionally, the ALB uses a mesh routing system vs individual hosts, meaning that it does not need to be ‘pre-warmed’ and can scale automatically to meet traffic demands.
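The mechanics boil down to the task definition: with an ALB target group, you can leave the host port as 0 and let ECS pick an ephemeral port per container. A sketch with the JavaScript aws-sdk (names, image, and sizes are placeholders):

    const AWS = require('aws-sdk')
    const ecs = new AWS.ECS()

    ecs.registerTaskDefinition({
      family: 'my-service',
      containerDefinitions: [{
        name: 'my-service',
        image: 'example/my-service:1.2.3',
        memory: 256,
        // hostPort 0 = dynamic port mapping, so many copies of the same
        // container can be packed onto one instance without colliding.
        portMappings: [{ containerPort: 3000, hostPort: 0, protocol: 'tcp' }]
      }]
    }).promise().catch(console.error)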

In some cases, we’re currently packing 100-200 containers per instance. It’s dramatically increased our utilization and cut the number of instances required to run our infrastructure (at the same time as we 4x’d api volume).

Utilization over time

Easy by default with Terraform

Of course, it’s easy to cut costs with these sorts of focused ‘one-time’ efforts. The hardest part of maintaining solid margins is systematically keeping costs low as your team and product scale. Otherwise, we knew we would be doomed to repeat the process in another 6 months.

To do that, we had to make the easy way, the right way. Whenever a member of the eng team wanted to add a new service, we had to ensure that it would get all of our efficiency measures for free without extra boilerplate or configuration.

That’s where Terraform comes in. It’s the configuration language we use at Segment to provision and apply changes to our production infrastructure.

As part of our efforts, we created the following modules to give our teammates a high-level set of primitives that are “efficient by default”. They don’t have to supply any extra configuration, and they’ll automatically get the following by using our modules:

  • Clusters, which configure an Auto Scaling Group linked to an ECS cluster.

  • Services to set up ECS services that are exposed behind an ALB (Application Load Balancer).

  • Workers to set up ECS services that consume jobs from queues but don’t expose a remote API.

  • Auto-Scaling as a default behavior for all hosts and containers running on the infrastructure.

If you’re curious about how they fit together, you can check out our open-sourced version on Github: The Segment Stack. It contains all of these pieces out of the box, and will soon support per-service autoscaling automatically.

Takeaways

After being in the weeds for three months, we managed to hit our goal. We eliminated over $1 million in annual spend from our AWS bill, and increased our average utilization by 20%.

While we hoped to share some insights behind a few of the very specific issues we encountered in our effort to reduce costs, there are a few bigger takeaways that should be useful for anyone looking to increase the efficiency of their infrastructure:

Efficient By Default: It’s important that efficiency efforts aren’t just a rule book or a one-time strategy. While cost management does require ongoing vigilance, the most important investment is to prevent problems from occurring in the first place. The easy-mode should be efficient. We accomplished this by providing an environment and building blocks in Terraform that made services efficient by default.

However, this extends beyond configuration tools, and includes picking infrastructure that simplifies capacity planning. S3 is notoriously great at this: it requires zero up-front capacity planning. When considering a SQL database, where the team might otherwise have picked MySQL or PostgreSQL, consider something like Amazon’s Aurora, which automatically scales disk capacity in 10GB increments and eliminates the need to plan capacity ahead of time. After this project, efficiency became our default, and is now part of how our infrastructure is planned.

Auto-scaling: During this effort we found that auto-scaling was incredibly important for efficiency, but not only for the obvious reason of scaling along with demand. In practice, engineers would configure their service to give them a few months of headroom before they had to re-evaluate their capacity allocation. This meant that services were actually being allocated far above their weekly peak requirements. That configuration itself is often imperfect, and wastes precious engineering time tuning these settings. At this point, we’d say that ubiquitous auto-scaling is a practical requirement for a micro-services architecture. It’s relatively easy to manage capacity for a monolithic system, but with dozens of services, this becomes a nightmare.

Elbow Grease: There are some tools that aid with cloud efficiency efforts, but in practice it requires serious effort from the engineering team. Don’t fall for vendor hype. Only you know your systems, your requirements, your financial objectives, and thus the right trade-offs to make. Tools can make this process easier, but they’re no magic bullet.


For any growing startup, cost management is a discipline that has to be built over time. And like security, or policies, it’s often far easier to institute the earlier you start measuring it.

Now that all is said and done, we’re glad cost-management and measurement is a muscle we’ve started exercising early. And it should continue to have compounding effects as we continue to scale and grow.

Peter Reinhardt on February 27th 2017

Nightmare is a browser automation library for node.js, designed to be much simpler and easier to use than Phantomjs. We originally built Nightmare to create integration logos with 99Designs Tasks before they had an API, and we still use it in Sherlock. But the vast majority of Nightmare developers—now 55k+ downloads per month—use it for web UI testing and crawling.

This article is a quick introduction to using Nightmare for web UI testing. It uses Mocha as the testing framework, but you could similarly use Jest.

Overview

Nightmare’s API methods are designed to mimic real user actions:

  • .goto(url)

  • .type(elementSelector, text)

  • .click(elementSelector)

This makes testing with Nightmare very similar to how a human tester would navigate, click and type into your actual web app. In the next few sections we’ll dive into how to set up your repo, then how to test page loads, submitting forms, and interacting with an app.

Repo Setup

First we need to install mocha and nightmare, and make sure our basic test harness is working.

Starting on the command line in your repo folder, install mocha and nightmare as dev dependencies with npm and create a test/ directory.

In test/test.js you can get started with:

    const Nightmare = require('nightmare')
    const assert = require('assert')

    describe('Load a Page', function() {
      // Recommended: 5s locally, 10s to remote server, 30s from airplane ¯\_(ツ)_/¯
      this.timeout('30s')

      let nightmare = null
      beforeEach(() => {
        nightmare = new Nightmare()
      })

      describe('/ (Home Page)', () => {
        it('should load without error', done => {
          // your actual testing urls will likely be `http://localhost:port/path`
          nightmare.goto('https://gethoodie.com')
            .end()
            .then(function (result) {
              done()
            })
            .catch(done)
        })
      })
    })

Add mocha as the test script to your package.json:

"scripts": { "test": "mocha" }

Finally, to test this complete setup you can run npm test on the command line…

    npm test

      Load a Page
        ✓ should load a web page (12223ms)

      1 passing (12s)

Loading a Page

Most web products have a set of public pages used for documentation, support, marketing, authentication and signup. Here’s how you can test that these pages load successfully:

    describe('Public Pages', function() {
      // Recommended: 5s locally, 10s to remote server, 30s from airplane ¯\_(ツ)_/¯
      this.timeout('30s')

      let nightmare = null
      beforeEach(() => {
        nightmare = new Nightmare()
      })

      describe('/ (Home Page)', () => {
        it('should load without error', done => {
          // your actual testing urls will likely be `http://localhost:port/path`
          nightmare.goto('https://gethoodie.com')
            .end()
            .then(function (result) {
              done()
            })
            .catch(done)
        })
      })

      describe('/auth (Login Page)', () => {
        it('should load without error', done => {
          nightmare.goto('https://gethoodie.com/auth')
            .end()
            .then(result => {
              done()
            })
            .catch(done)
        })
      })
    })

Submitting a Form

This example tests that Hoodie’s login function fails with bad credentials. It’s always worth testing failed states as well as successful states. 🤖

    describe('Login Page', function () {
      this.timeout('30s')

      let nightmare = null
      beforeEach(() => {
        // show true lets you see wth is actually happening :)
        nightmare = new Nightmare({ show: true })
      })

      describe('given bad data', () => {
        it('should fail', done => {
          nightmare
            .goto('https://gethoodie.com/auth')
            .on('page', (type, message) => {
              if (type == 'alert') done()
            })
            .type('.login-email-input', 'notgonnawork')
            .type('.login-password-input', 'invalid password')
            .click('.login-submit')
            .wait(2000)
            .end()
            .then()
            .catch(done)
        })
      })
    })

Using the App

This example is more involved, and includes signing up with text fields, select fields, and clicking and waiting through a flow that spans multiple pages.

    describe('Using the App', function () {
      this.timeout('60s')

      let nightmare = null
      beforeEach(() => {
        // show true lets you see wth is actually happening :)
        nightmare = new Nightmare({ show: true })
      })

      describe('signing up and finishing setup', () => {
        it('should work without timing out', done => {
          nightmare
            .goto('https://gethoodie.com/auth')
            .type('.signup-email-input', 't'+Math.round(Math.random()*100000)+'@test.com')
            .type('.signup-password-input', 'valid password')
            .type('.signup-password-confirm-input', 'valid password')
            .click('.signup-submit')
            .wait(2000)
            .select('.sizes-jeans-select', '30W x 30L')
            .select('.sizes-shoes-select', '9.5')
            .click('.sizes-submit')
            .wait('.shipit') // this selector only appears on the catalog page
            .end()
            .then(result => {
              done()
            })
            .catch(done)
        })
      })
    })

All Together Now

The final example ties all these together into a cleanly formatted test/test.js:

    const Nightmare = require('nightmare')
    const assert = require('assert')

    describe('UI Flow Tests', function() {
      this.timeout('60s')

      let nightmare = null
      beforeEach(() => {
        nightmare = new Nightmare({ show: true })
      })

      describe('Public Pages', function() {
        describe('/ (Home Page)', () => {
          it('should load without error', done => {
            // your actual testing urls will likely be `http://localhost:port/path`
            nightmare.goto('https://gethoodie.com')
              .end()
              .then(function (result) {
                done()
              })
              .catch(done)
          })
        })

        describe('/auth (Login Page)', () => {
          it('should load without error', done => {
            nightmare.goto('https://gethoodie.com/auth')
              .end()
              .then(result => {
                done()
              })
              .catch(done)
          })
        })
      })

      describe('Login Page', function () {
        describe('given bad data', () => {
          it('should fail', done => {
            nightmare
              .goto('https://gethoodie.com/auth')
              .on('page', (type, message) => {
                if (type == 'alert') done()
              })
              .type('.login-email-input', 'notgonnawork')
              .type('.login-password-input', 'invalid password')
              .click('.login-submit')
              .wait(2000)
              .end()
              .then()
              .catch(done)
          })
        })
      })

      describe('Using the App', function () {
        describe('signing up and finishing setup', () => {
          it('should work without timing out', done => {
            nightmare
              .goto('https://gethoodie.com/auth')
              .type('.signup-email-input', 'test+'+Math.round(Math.random()*1000000)+'@test.com')
              .type('.signup-password-input', 'valid password')
              .type('.signup-password-confirm-input', 'valid password')
              .click('.signup-submit')
              .wait(2000)
              .select('.sizes-jeans-select', '30W x 30L')
              .select('.sizes-shoes-select', '9.5')
              .click('.sizes-submit')
              .wait('.shipit') // this selector only appears on the catalog page
              .end()
              .then(result => {
                done()
              })
              .catch(done)
          })
        })
      })
    })

If you have additional questions or want to join the 90+ people who have contributed to Nightmare, head over to the Github repo. Happy testing.

Peter Reinhardt on October 19th 2016

At Segment, focus is one of our four core values. But it was difficult for team members to focus in the office, so in June we ran an internal team survey about what helps and hurts focus. The results showed that “chatter and noise” was one of the biggest culprits for distraction around the office. “Slack group channels” came in second.

These answers left us with two difficult questions: how do you solve a noise problem in an open floor plan? And where is the noise even coming from?

To get to the bottom of it, I decided to build an iOS app to collect decibel levels from around the office. We found that noise levels varied widely throughout the office, and using the new data, we changed the office layout to increase our ability to focus. Numerically speaking, the increased focused time (as measured by survey) has been equivalent to hiring 10–15 teammates. And beyond the numbers, it feels great to focus more. 😃

Where is the noise coming from?

At first we thought we were just being a bit too chatty. But demanding “be quiet!” is horrible in a collaborative work environment. We also noticed something odd: when people walked into our office they’d often say, “Wow! This is one of the quietest offices I’ve ever been to.” Of course, the survey said the opposite… that the office was noisy and distracting.

This discrepancy was particularly confusing because our office is an open floor plan. Sound ought to travel well around such a big open space. It was strange that we had two widely divergent stories around the quietness and loudness of the office.

Here’s a picture showing the high ceilings, plants and open layout:

And here’s our floorplan showing the lack of walls… lots of ways for sound to bounce around (view above shown in blue):

The conflicting anecdotal stories from visitors and teammates were perplexing. To get to the bottom of things, we needed hard data. And what better way to collect that data than the ambient sensors called iPhones already sitting around our office? So I built an app to record decibel levels in different areas and give us some real data to understand the situation.

The iOS app is tiny: it uses the Apple AVAudioRecorder class’s level-metering to passively collect and report average and maximum ambient decibel levels every 10 seconds to our server monitoring tool Datadog. We’ve open-sourced our Decibel noise-recording app for you to use. Just add your Datadog API key and off you go.

Originally I planned to ask a bunch of people to install it on their phones around the office. But then our VP Engineering had a much better idea: deploy it on the iPads we have outside every conference room.

Below you can see our data collection points (iPad minis) as red dots outside each room:

The graph below shows measurements from August 31, 2016, clearly showing spikes of noise in the office throughout the day (the absolute values are arbitrary, but the data is good for relative comparison):

Among other things, you can see a full 10 dB difference between the quietest and loudest parts of the office! 10 dB feels roughly “twice as loud,” so this is a big difference!

Finally, here are the average noise level results (plus some manual interpolation of the sparse data collection points) overlaid on the floor plan of the office (red loudest, green quietest):

This resolved the mystery. The front of the office, where visitors hang out, was twice as quiet (10dB quieter) than the areas where people work. Both sets of anecdotal stories were right. When I showed this graph to Tony, one of our security guards, he said “Oh yeah, it’s WAY quieter up at the front of the office, even at night.”

What to do about it?

We can’t immediately ditch our open floor plan (although we’re looking at various options for our next office.) But this new noise level data gave us an obvious way to reduce distractions: the teams needing the quietest work area (engineering, product and design) should move to the quietest part of the office.

Last month we made the big move. The teams needing the most verbal collaboration — Segment’s sales, support, and marketing teams — moved to the naturally louder parts of the office. The teams needing the most quiet — engineering, product, and design — moved to the quietest parts of the office.

We’re still dialing in parts of the office that became a bit cramped, but a re-run of our original focus survey with the team showed that total focus time had increased from 45% to 60% of time in the office! In a purely numerical sense, you could equate that to hiring 10–15 people. It also feels awesome to focus.

The combination of survey and noise data has been incredibly helpful in iterating towards a more productive office, and we still have lots of ideas to test. For example, we’ve started an experiment with some teams in a new “war room” layout, and will keep looking for other ways to optimize the environment to be productive and fun, measuring results as we go. We’d love to hear your results if you’ve experimented and measured in your own space.

Lauren Venell on October 11th 2016

When it comes to your app, size makes a difference. Bigger apps have fewer downloads, worse reviews, and a harder time penetrating the international market. We measured the exact impact of increased app size, shown below. We’ve also included learnings on how to prevent bloat in your own app.

Peter Reinhardt on October 5th 2016

A few months ago we were at a conference in Half Moon Bay talking with a general manager at a large mobile company, and he said, “One of my projects for the next six months is to reduce SDK bloat in all our apps.” Six months is a substantial investment! So we asked why this mattered so much to him.

He gave three reasons:

  1. They’re up against the 100MB app size cellular download limit, and they need to find ways to reduce the size of their app so that they don’t take a hit on installs.

  2. Occasionally their SDK vendors have bugs that crash their app, giving them bad reviews and lowering installs.

  3. Often their SDK vendors are unprepared for major iOS version upgrades, blocking key releases and big announcements.

While it makes intuitive sense that a large app download size would demotivate people and reduce installs, we wanted to dig deeper into these reasons and really assess the quantitative impact. Are we sure app size reduces installs? How big is the effect?

So, over the last few months we’ve done some extensive research into these issues… and in this article we’re going to share some new experimental results on how app size affects install rate.

How to gain 147MB in just 17 days…

To measure the impact of increased app size, we needed to buy a small app, with no active marketing activities but significant steady downloads. Then we needed to increase the app’s size, leaving everything else constant, and observe the impact on app install rate. To the best of our abilities, this would simulate the impact of SDK bloat, or anything else that just makes an app bulky.

So, we bought the Mortgage Calculator Free iOS app through some friends in the YC Founders network. It was a minuscule 3MB, had a steady pattern of organic installs (~50 installs per day for several years), and had no active marketing activities.

Then, without making any additional changes, we bloated the app from 3MB to 99MB, 123 MB and then finally 150MB, observing the isolated impact on install rate with each change in app size. In the real world, app sizes can increase substantially with the addition of seemingly simple things, like an SDK, an explainer video, a bunch of fonts, or a beautiful background picture for your loading screen. For the purposes of our experiment, we just bloated the app with a ton of hidden album art from our engineering team’s favorite artist.

To quantitatively measure the impact of each successive bloating, we looked at data provided directly by Apple in iTunes Analytics; specifically conversion from “Product Page Views” to “App Units” (or colloquially “installs”/”install rate”).

…and lose 66% of your installs

With the larger app sizes, we saw substantial losses in product page to app install rate. In particular, there was a substantial drop around the cellular download limit (~100MB), above which Apple does not let users download the app over 3G or 4G. (Note: the conversion rate can be greater than 100% due to installs direct from search results that skip product page view.)

From these results we estimate a linear decrease in install conversion rate below the cellular download limit from 3–99MB at 0.45% per MB. Above the cellular download limit we estimate a linear decrease in install rate of 0.32% per MB. To our best estimate, the gap between the two lines is covered by a 10% instantaneous install rate drop across the cellular download limit. (Although Apple says the cellular download limit is 100MB, we found in practice that a 101MB IPA did not trigger the cellular download block… the actual limit was somewhere between 101MB and 123MB and varied depending on the exact build.)
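(For a quick sanity check on that fit: the jump from 3MB to 99MB spans 96MB, and 96 × 0.45% per MB comes out to roughly a 43% drop in installs.)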

Increasing the size of our app from 3MB to 99MB reduced installs by 43%, and the increase to 150MB reduced installs by 66% in total. For mobile companies striving for growth, an increase in app size is extremely costly.

How to destroy an app

In an attempt to be proper growth scientists, we tried to replicate the experiment by returning the app to its original 3MB size (plus other intermediate sizes) and re-measuring the install rate. Unfortunately, as a result of our earlier bloating, the app attracted several critical ratings and reviews, which stick around forever:

In our measurements, the app’s growth appears to be semi-permanently damaged and we only saw a minor rebound to 59% install conversion rate for the 16 days the 3MB version was available. (As a side note you can see these customers cite 140MB and 181MB as the download sizes… the true download size varies depending on the customer’s device and OS version.)

What makes an app big

As an example, we randomly chose to inspect the NBC Sports app that’s been popular and featured on the App Store during the Olympic games. The app is 90.5MB in total. By far the biggest part of the app’s size is images, accounting for 51% of the entire app’s size, with code (23%), fonts (16%) and video (9%) accounting for most of the rest.

Images were both in the raw app package and in the asset catalog (hidden away inside assets.car). There are huge numbers of local station & team logos, startup screens, and soccer field layouts.

Within the “code” category (which is a single encrypted file) are around a dozen different SDKs, a handful of which we could detect and estimate for size contribution. SDKs contributed roughly 3.5MB to the total app size, with an estimated impact of -1.54% install rate.

Overall, there’s an incredible amount of low-hanging fruit in optimizing an app’s size. One particularly honorable mention is the (presumably accidental) inclusion of the Adobe VideoHeartbeat SDK docs (955 kb) in the production IPA package downloaded through iTunes.

The impact of SDKs

While most SDKs are small individually, the net impact of installing many SDKs in your app is a significant increase to the app’s size, and therefore a meaningful negative impact on install rate, not to mention engineering time & maintenance. SDK vendors can be unprepared for new iOS releases, which can block key releases and announcements. Or, SDKs can occasionally introduce bugs, and even a single crash can mean a would-be user will never use your app again.

Therefore, it makes sense to limit the number of SDKs you bundle into your app in order to optimize for performance. But you still need to make sure you’re tracking and collecting key lifecycle events. Segment built an easy, lightweight solution called the Native Mobile Spec. The Native Mobile Spec automatically collects events that allow you to measure top mobile metrics without any tracking code.

Building your best possible app

Now that we conclusively know how to destroy an app, we also know how to improve one. Here are a few key steps that you can take to make sure your app is ready for holiday season:

  1. Use a mobile app size calculator to find out the exact size of your app.

  2. Review our guide for getting your app ready for launch, for a complete set of tips and advice on how to make your app great.

  3. Read about how Segment can help mobile teams.

A version of this blog post originally appeared on Recode.

Andy Jiang on July 28th 2016

At Segment, we’re working hard to make our mobile SDKs the best possible collection options for your analytics data. An SDK can make your data more durable, minimize data transfer, and optimize your app’s battery usage. In this article, we’ll take you through what happens under the hood as a piece of data flows through our iOS and Android SDKs from a button handler in your app all the way through to our API.

The Initialization

When you initialize Segment’s iOS and Android library, it will automatically start tracking a few app lifecycle events that are part of our Native Mobile Spec.

  • Application Installed — User installs your application for the first time.

  • Application Opened — User opens your application.

  • Application Updated — User updates to a newer version of your app.

The automatic lifecycle tracking allows you to focus on building your app and thinking less about your analytics tracking plan.

Now that we’ve initialized the library, we’re ready for user interaction.

The Tap

Let’s take a look at a fictitious mobile app, Bluth’s Banana Stand, and see what happens with our SDKs when the user completes an order.

Listening on the purchase button handler, you’ll fire off a track call:
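The exact call depends on the platform, but for illustration here’s the shape of it using Segment’s Node library, analytics-node, whose track method mirrors the mobile SDKs (the write key, userId, and properties below are placeholders):

    const Analytics = require('analytics-node')
    const analytics = new Analytics('YOUR_WRITE_KEY')

    // Fired from the purchase button handler: record who did what,
    // along with whatever properties matter for the order.
    analytics.track({
      userId: 'user_123',
      event: 'Order Completed',
      properties: { revenue: 4.99, currency: 'USD' }
    })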

It’s safe to invoke this call directly from your button handler, as our SDKs automatically dispatch it to a background queue thread.

These queues represent various threads on the mobile device. The .track() event is moved to the background queue so the UI queue can continue to operate for the user.

Inside the Queue

Once dispatched to the background thread, the event is immediately written to the device’s disk to maximize the chances of deliverability. In a world where power can be lost at any time, this method maximizes the durability of your data. (You can read about why we chose QueueFile for reliable request batching on Android).

The Road to Segment’s API

On the way out of the queue and onto Segment’s tracking API, Segment’s SDKs use two different strategies to optimize your app’s bandwidth and power usage:

  1. Batching — Combining multiple events into each outbound request allows us to minimize the number of times we power up the wireless hardware

  2. Compression — Compressing each request with gzip allows us to drastically reduce the amount of bandwidth used

Before the request goes out, the event is batched together with other events. Our SDKs minimize battery use with auto-batching: powering up the radio for every single call wastes battery life, so events are sent in batches instead.

Visual representation of batched events, with the green circle representing a .track() event.

We batch our requests to minimize the number of times we access the underlying hardware. This allows us to send fewer requests, spaced further apart, and therefore lets the wireless hardware remain offline for the majority of the time.

Next, we’ll compress the outbound request body with gzip, reducing the total size drastically.
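Here’s a toy Node sketch of those two steps together, just to show the shape of the savings (the events and batch size are made up; the real SDKs flush on count and time thresholds):

    const zlib = require('zlib')

    // Pretend we've queued up 100 events on disk.
    const queued = []
    for (let i = 0; i < 100; i++) {
      queued.push({ event: 'Order Completed', properties: { revenue: 4.99 } })
    }

    // 1. Batching: drain the queue into a single request body.
    const body = JSON.stringify({ batch: queued })

    // 2. Compression: gzip the body before it goes over the radio.
    const compressed = zlib.gzipSync(body)

    console.log('raw bytes:', Buffer.byteLength(body))
    console.log('gzipped bytes:', compressed.length)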

In our testing, we sent one thousand .track() calls without the SDK (directly to our API) and with the SDK. The amount of data on the wire without the SDK was 1.1mb and with the SDK was 63kb — roughly a 17x saving in bandwidth. We’re able to accomplish this via intelligent batching and aggressive data “middle-out compression” technologies.

These are the bandwidth impacts realized with our SDKs’ batching and compression.

The battery story is even better—we’re able to reduce the wasted energy by almost 3x from 56% overhead to 20% overhead. The net result of this is low average energy impact on the app and much more efficient battery usage—and happier users!

These are the battery enhancements our SDKs enable. (Xcode).

Finally, the request leaves the mobile device and makes its way to our servers.

Trace Completed

By tracing the SDKs functionality from tap to API, we’re able to see how a few simple data transfer strategies can significantly strengthen your mobile analytics stack:

  1. Increase durability — Immediately persist every message to a disk-backed queue, preventing you from losing data in the event of battery or network loss.

  2. Auto-Retries — If the network is spotty or not available, our SDKs will retry transferring the batch until the request is successful. This drastically improves data deliverability.

  3. Reduce Network Usage — Each batch is gzip compressed, decreasing the amount of bytes on the wire by 10x-20x.

  4. Save Battery — Because of data batching and compression, Segment’s SDKs reduce energy overhead by 2-3x which means longer battery life for your app’s users.

These features would not have been possible without the help of our incredibly supportive open source community. With 80+ contributors, 550+ pull requests, 240+ releases, our iOS and Android libraries just kept improving. Thanks to their work, more than 3,000 apps now collect customer data using Segment’s mobile SDKs every day.

We can’t wait to keep improving the mobile analytics ecosystem alongside you. If you have any ideas, we’re all ears! Send a pull request or email us at friends@segment.com. 😃

Want to learn about and grow your mobile users? Sign up today.

Interested in building infrastructural components for mobile apps and SDKs? We’re hiring.

Calvin French-Owen on June 23rd 2016

As part of our push to open up what’s going on internally at Segment – we’d like to share how we run our CI builds. Most of our approaches follow standard practices, but we wanted to share a few tips and tricks we use to speed up our build pipeline.

Powering all of our builds are CircleCI, Github, and Docker Hub. Whenever there’s a push to Github, the repository triggers a build on CircleCI. If that build is a tagged release, and passes the tests, we build an image for that container.

The image is then pushed to Docker Hub, and is ready to be deployed to our production infrastructure.

CircleCI and Travis CI

Before going any further, I’d like to talk about the elephant in the room: Travis CI. Pretty much every discussion of CI tools has some debate around using Travis CI vs CircleCI. Both are sleek, hosted, and responsive. And both are incredibly easy to use.

Honestly, we love both tools. We use Travis CI for a lot of our open source libraries, while CircleCI powers many of our private repos. Both products work extremely well and do a fantastic job meeting our needs.

However, there’s one feature we really love about CircleCI: SSH access.

Most of the time, we don’t have any problems configuring our test environments. But when we do, the ability to SSH into the container running your code is invaluable.

To put this in perspective, Segment runs our entire infrastructure with hundreds of different microservices. Each one is pulled from a different repo, and runs with a few different dependencies via docker-compose (more on that later).

Most of our CI is relatively standard, but occasionally setting up a service with a fresh environment requires some custom work. It’s in a new repo, and will require its own set of dependencies and build steps. And that’s where being able to run commands from within the environment you’re testing against is so handy – you can tweak configuration right on the box.

No more hundreds of “fixing CI” commits!

Dotfiles

To work with all of these different repos, we wanted to make it trivially easy to set up a repo so that it has CI enabled. We have three different circle commands we use regularly, which are shared amongst our common dotfiles. First, there’s circle(), which sets up all the right environment variables and automatically enables our Slack notifications.

Additionally, we have a circle.open() command which automatically opens the test results from your CLI in your browser.

And finally there’s the circle.badge() command to automatically add a badge to a repo.

Shared Scripts

Now, given that we have hundreds of repos, we have the task of keeping all the testing scripts in-sync across repositories whenever we make changes to our circle.yml files.

Maintaining the same behavior across a few hundred repos is annoying, but we’ve decided we’d rather trade abstraction problems (hard to solve) for investing more heavily in tooling (generally easier).

For that, we use a common set of scripts inside a shared git repo. The scripts are pulled down every time the test runs, and handle the shared packaging and deployments. Each service’s circle.yml file looks something like this:

It means that if we change our deploy scheme, we only have to update the code in one place rather than updating each individual repo’s circle.yml. We can then reference different scripts depending on what sorts of builds we need within the individual service repo.

Docker Containers

Finally, the entire build process wouldn’t be possible without Docker containers. Containers have greatly simplified how we push code to production, test against our internal services, and develop locally.

When testing our services, we make use of docker-compose.yml files to run our tests. That way, a given service can actually test against the exact same images in CI as are running in production. It reduces the need for mocks or stubs.

What’s more, when the images are built by CI–we can pull those same images down and run them locally as well.

To actually build that code and push it to production, CircleCI will first run the tests, then check whether the build is a tagged release. For any tagged releases, we have CircleCI build the container via a Dockerfile, then tag it and push the deploy to Docker Hub.

Instead of using latest everywhere, we explicitly deploy the tagged image to Docker Hub, along with the major version (1.x) and minor version (1.2.x).

This way, we’re able to specify rollbacks to a specific version when we need it – or deploy the latest build of a certain release branch if we don’t need a specific version (useful for local development and in docker-compose files).

The code to do this is relatively straightforward. First, we detect the versions:

And then we build, tag, and push our docker images:

Once our images are pushed to Docker Hub, we’re guaranteed to have the right version of the code built so that we can deploy it to production and run it inside ECS.

Thanks to containers, our CI pipeline gives us much better confidence when deploying our microservices to production.

Coming Full Circle

So there you have it: our CI build pipeline, heavily powered by Github, CircleCI, and Docker.

While we’re constantly trying to find ways to make the entire pipeline a bit more seamless, we’ve been happy with the low maintenance, parallelization and isolation provided by using third-party tools.

On that note, if you’re managing a large number of repos, we’d love to hear about your own techniques for managing your build pipeline. Drop us a note by email (friends@segment) or on Twitter!

P.S. We’ll be at CircleCI’s office hours next Thursday (June 30) in San Francisco. Please join us for a special talk on Google’s AMP project!

Calvin French-Owen on June 15th 2016

AWS is the default for running production infrastructure. It’s cheap, scalable, and flexible to whatever configuration you’d like to run on top of it. But that flexibility comes with a cost: it makes AWS endlessly configurable.

You can build whatever you want on top of AWS, but that means it’s difficult to know whether you’re taking the right approach. Pretty much every startup we talk with has the same question: What’s the right way to setup our infrastructure?

To help solve that problem, we’re excited to open source the Segment AWS Stack. It’s our first pass at building a collection of Terraform modules for creating production-ready architecture on AWS. It’s largely based on the service architecture we use internally to process billions of messages every month, but built solely on AWS.

The steps are incredibly simple. Add 5 lines of Terraform, run terraform apply, and you’ll have your base infrastructure up and running in just three minutes.

It’s like a mini-Heroku that you host yourself. No magic, just AWS.

Batteries Included

Our major goals with Stack are:

  • to provide a good set of defaults for production infrastructure

  • make the AWS setup process incredibly simple

  • allow users to easily customize the core abstractions and run their own infrastructure

To achieve those goals, Stack is built with Hashicorp’s Terraform.

Terraform provides a means of configuring infrastructure as code. You write code that represents things like EC2 instances, S3 buckets, and more–and then use Terraform to create them.

Terraform manages the state of your infrastructure internally by building a dependency graph of which resources depend on one another, and then applies only the “diff” of changes to your production environment. This makes changing your infrastructure incredibly seamless, because Terraform already knows which resources have to be re-created and which can remain untouched.

Terraform provides easy-to-use, high level abstractions for provisioning cloud infrastructure, but also exposes the low-level AWS resources for custom configuration. This low-level access provides a marvelous “escape hatch” for truly custom needs.

To give you a flavor of what the setup process looks like, run terraform apply against this basic configuration:

It will automatically create all of the following:

Networking: Stack includes a new VPC, with public and private subnets. All routing tables, Internet Gateways, NAT Gateways, and basic security groups are automatically provisioned.

Auto-scaling default cluster: Stack ships with an auto-scaling group and basic lifecycle rules to automatically add new instances to the default cluster as they are needed.

ECS configuration: in Stack, all services run atop ECS. Simply create a new service, and the auto-scaling default cluster will automatically pick it up. Each instance ships with Docker and the latest ecs-agent.

CloudWatch logging & metrics: Stack sends all container logs to CloudWatch. Because all requests between services go through ELBs, metrics around latency and status codes are automatically collected as well.

Bastion: Stack also includes a bastion host for manual SSH access to your cluster. Besides the public services, it’s the only instance exposed to the outside world and acts as the “jump point” for manual access.


This basic setup uses the stack module as a unit, but Terraform can also reference the components of Stack individually.

That means that you can reference any of the internal modules that the stack uses, while continuing to use your own custom networking and instance configuration.

Want to only create Stack services, but bring your own VPC? Just source the service module and pass in your existing VPC ID. Don’t need a bastion and want custom security groups? Source only the vpc and cluster modules to set up only the default networking.

You’re free to take the pieces you want and leave the rest.

If you’d like to dig into more about how this works in-depth, and each of the modules individually, check out the Architecture section of the Readme.

Now, let’s walk through how to provision a new app and add our internal services.

Walkthrough

Note: this walkthrough assumes you have an AWS account and Terraform installed. If not, first get the pre-requisites from the requirements section.

For this tutorial, we’ll reference the pieces of the demo app we’ve built: Pingdummy, a web-based uptime monitoring system.

All of the Docker images we use in this example are public, so you can try them yourself!

The Pingdummy infrastructure runs a few different services to demonstrate how services can be deployed and integrated using Stack.

  • the pingdummy-frontend is the main webpage users hit to register and create healthchecks. It uses the web-service module to run as a service that is publicly accessible to the internet.

  • the pingdummy-beacon is an internal service which makes requests to other third-party services, and responds with their status. It uses the service module, and is not internet facing. (though here it’s used for example purposes, this service could eventually be run in many regions for HA requests)

  • the pingdummy-worker is a worker which periodically sends requests to the pingdummy-beacon service. It uses the worker module as it only needs a service definition, not a load balancer.

  • an RDS instance used for persistence

First, you’ll want to add a Terraform file to define all of the pieces of your infrastructure on AWS. Start by creating a terraform.tf file in your project directory.

Then, copy the basic stack setup to it:

And then use the Terraform CLI to actually apply the infrastructure:

This will create all the basic pieces of infrastructure we described in the first section.

Note: for managing Terraform’s remote state with more than a single user, we recommend configuring the remote state to use Terraform Enterprise or S3. You can use our pingdummy repo’s Makefile as an example.

Now we’ll add RDS as our persistence layer. We can pull the rds module from Stack, and then reference the outputs of the base networking and security groups we’ve already created. Terraform will automatically interpolate these and set up a dependency graph to re-create the resources if they change.

We’ll need to run plan and apply again to create the new resources:

And presto! Our VPC now has an RDS cluster to use for persistence, managed by Terraform.

Now that we have our persistence and base layers setup, it’s time to add the services that run the Pingdummy app.

We can start with the internal beacon service for our health-checks. This service listens on port 3001 and makes outbound HTTP requests to third-parties to check if a given URL is responding properly.

We’ll need to use the service module, which creates an internal service that sits behind an ELB. That ELB will be automatically addressable at beacon.stack.local, and ECS will automatically add the service containers to the ELB once they pass the health check.

Next, we’ll add the pingdummy-worker service. It is responsible for making requests to our internal beacon service.

As you can see, we’ve used the worker module since this program doesn’t need a load balancer or DNS name. We can also pass custom configuration to it via environment variables or command line flags. In this case, it’s passed the address of the beacon service.

Finally, we can add our pingdummy-frontend web app which will be Internet-accessible. This will use the web-service module so that the ELB can serve requests from the public subnet.

In order to make the frontend work, we need a few extra pieces of configuration beyond just what the base web-service module provides.

We’ll first need to add an SSL certificate that’s been uploaded to AWS. Sadly, there’s no terraform configuration for doing this (it requires a manual step), but you can find instructions in the AWS docs.

From there, we can tell our module that we’d like it to be accessible on the public subnets and security groups and be externally facing. The stack module creates all of these individually, so we can simply pass them in and we’ll be off to the races.
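Putting that together, here’s a sketch of the frontend definition (the inputs, image name, and port are illustrative):

```hcl
module "frontend" {
  source = "github.com/segmentio/stack//web-service"
  name   = "pingdummy-frontend"
  image  = "segment/pingdummy-frontend"   # hypothetical image name
  port   = 3000                           # hypothetical container port

  # The certificate uploaded to AWS in the previous step.
  ssl_certificate_id = "${var.ssl_certificate_arn}"

  # Externally facing: use the public subnets and external security groups
  # that the base stack module created (output names illustrative).
  cluster         = "${module.stack.cluster}"
  subnet_ids      = "${module.stack.external_subnets}"
  security_groups = "${module.stack.external_elb}"
}
```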

Finally, run the plan and apply commands one more time.

And we’re done! Just like that, we have a multi-AZ microservice architecture running on vanilla AWS.

Looking in the AWS console, you should see logs streaming into CloudWatch from our brand new services. And whenever a request is made to the service, you should see HTTP metrics on each of the service ELBs.

To deploy new versions of these services, simply change the versions in the Terraform configuration, then re-apply. New task definitions will be created and the appropriate containers will be cycled with zero downtime.
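For example, assuming the service module exposes a version-style input for the image tag (check the module docs for the exact name), a deploy is just a one-line change followed by another apply:

```hcl
module "beacon" {
  source = "github.com/segmentio/stack//service"
  # ... other inputs as above ...
  version = "1.1.0"   # bump the tag, then run terraform plan && terraform apply
}
```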

There are a few other pieces you’ll need to add, which you can see examples of in the main Pingdummy Terraform file. Keep in mind that the example is a dummy app, and is not how we’d recommend doing things like security groups or configuration in production. We’ll have more on that to come :).

One More Thing…

Additionally, we’re excited to open source a few other pieces that were involved in releasing the Stack:

Amir Abu Shareb created terraform-docs, a command-line tool to automatically generate documentation for Terraform modules. You can think of it as the godoc of the Terraform world, automatically extracting inputs, outputs, and module usage in an easily consumable format.

We use terraform-docs to build all of the module reference documentation for Stack.

Achille Roussel created ecs-logs, an agent for sending logs from journald to CloudWatch. It provides all the built-in logging for Stack, and makes sure to create a log group for each service and a single log stream per container.

Go Forth, and Stack

It’s our hope that this post gave you a brief look at the raw power of what can be achieved with the AWS APIs these days. The ease of Terraform paired with the flexibility and scale of AWS is an extremely powerful combination.

Stack is a “first pass” of what combining these technologies can achieve. It’s by no means finished, and only provides the foundation for many of the ideas that we’ve put into production. Additionally, we’re trying some new experiments around log drivers and instances (reflected by the 0.1 tag) which we think will pay off in the future.

Nonetheless, we’ve open sourced Stack today as a first step toward gathering as much community wisdom as possible around running infrastructure atop AWS.

In that vein, we’ll happily accept pull requests for new modules that fall within the spirit of the project. It’s our goal to provide the community with a good set of Terraform modules that provide sane defaults and simpler abstractions on top of the raw AWS infrastructure.

So go ahead and try out the Stack today, and please let us know what you think!


Part of the Segment infrastructure team hacking on The Segment Stack: Amir Abu Shareb, Rick Branson, Calvin French-Owen, Kevin Lo, and Achille Roussel. Open sourced at a team-offsite in Amsterdam.

Prateek Srivastava on June 14th 2016

Segment’s mobile SDKs are designed to track behavioral data from your app and translate and route that data to hundreds of downstream integrations. One of the SDK’s core tasks is to upload behavioral data to our servers. Since every network request requires your app to power up the device’s radio, uploading this data in real-time can quickly drain a battery.

To minimize the impact of the SDK on battery life, we queue these behavioral events and upload them in batches periodically. This results in 3x less battery drain than an implementation without batching.

Our Android queueing system is built on QueueFile, which was developed by Square. In this article we’re going to run through our queueing requirements, some of the traditional solutions, and explain in detail why we chose to build on QueueFile.

Our Queueing Requirements

Queues are a deceptively simple concept, but there are two main considerations that make them complicated in practice: durability and atomicity.

  1. Durability - An element that has been added to the queue survives until it is explicitly removed, even across process restarts. An in-memory queue is easy to implement, but it lacks durability — events are queued until the process dies, and are lost thereafter.

  2. Atomicity - An element is either added to the queue or it isn’t; the queue won’t ever be in a partial/invalid state. For our purposes, we needed to save events to disk as quickly and reliably as possible, and then worry about uploading them later.

We were looking for a solution that would guarantee these contracts would be honored even in the face of process deaths or system crashes. Such scenarios are inevitable on mobile, e.g. the user could lose battery power in the middle of an operation, or the operating system could kill your application to reclaim memory.

Traditional Solutions

File: The most obvious way to persist data to disk is to use a plain old Java file. However, writing to a file is not atomic (in most cases), and it’s easy to end up with a corrupted file. A trivial implementation would also keep the entire queue in memory, which is not ideal for devices that might go offline for a long time.

AtomicFile: AtomicFile (also available from the support library) is a simple helper that can guarantee atomic writes on a regular file. It guarantees atomicity by creating a backup of the original file before writing any changes, and waiting until the changes are completely written to disk before deleting the backup. As long as the backup file exists, the original file is considered to be invalid. However, AtomicFile requires the caller to keep track of the corrupted state and recover from it.

SharedPreferences: The simplest way to persist data onto disk is by using SharedPreferences from the Android framework. Although designed to store small key-value pairs, it can certainly be used to store any data in a pinch. The SharedPreferences class is a high level wrapper around its own implementation of an atomic file and an in-memory cache. Writes are committed to a map in memory first, and then saved to disk. However, writing to SharedPreferences is not durable. SharedPreferences silently swallows any disk write error, leaving the memory cache and disk out of sync, with callers left guessing about the result of an operation.

SQLite: The typical solution to designing a reliable disk queue on Android is to layer it on top of a SQLite database. SQLite works great for large datasets that need multithreaded access, but being a fully featured database, it is designed with optimizations for advanced querying and not for the purposes of a queue. SQLite is complex with many moving parts, and this made us wary of relying on it to back such a critical section of our code.

Why We Chose QueueFile

Although all of the above solutions could have been coerced to work for us, none of them were designed specifically to be used as queues. Luckily, Square had also run into this problem for accepting payments, and they built QueueFile to accomplish this. QueueFile guarantees that all operations are atomic, writes are durable, and is designed to survive process and even system-level crashes.

It was a solution tailor-made for our use case. Adding and removing elements from QueueFile is ordered first in first out (FIFO) and takes constant time, both of which are a nice bonus. QueueFile’s tiny size also makes it perfect for us to embed in our SDK.
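To give a flavor of the API, here’s a minimal sketch of enqueueing and draining events with QueueFile — the package and method names follow Square’s tape 1.x library as we understand it, so check the project’s docs for the exact signatures:

```java
import com.squareup.tape.QueueFile;

import java.io.File;
import java.io.IOException;

public class QueueFileExample {
  public static void main(String[] args) throws IOException {
    QueueFile queue = new QueueFile(new File("events.queue"));

    // Enqueue: serialize the event and append it to the tail of the file.
    queue.add("{\"event\":\"App Opened\"}".getBytes("UTF-8"));

    // Drain: read from the head, upload, and only remove once the upload succeeds.
    while (!queue.isEmpty()) {
      byte[] payload = queue.peek();
      upload(payload);   // hypothetical batched upload step
      queue.remove();
    }

    queue.close();
  }

  private static void upload(byte[] payload) {
    // Placeholder for the network call that ships a batch to the server.
  }
}
```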

Core Concepts of QueueFile

There are a few guarantees that the filesystem provides:

  • renaming a file is an atomic operation

  • fsync is durable

  • segment writes are atomic

QueueFile is particularly clever about the way it stores and updates data — Bob Lee has an excellent talk on the subject. QueueFile consists of a 16-byte file header followed by a series of items called elements, laid out as a circular buffer. In the diagrams below, the grey area (not to scale) represents empty space in the file.

The file header consists of four 4-byte integers that represent the length of the file, the number of elements in the file, and pointers to the locations of the first and last elements. Since the length of the file header is only 16 bytes (smaller than the size of a segment), changes to the file header are atomic. QueueFile relies on this by making modifications to the file visible only when the header is committed as well.

The components of a file header.
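As a rough sketch (this is not Square’s implementation — just an illustration of reading four big-endian 4-byte integers, with the field order taken from the description above):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;

// Illustrative reader for a 16-byte header laid out as four 4-byte ints:
// file length, element count, offset of the first element, offset of the last.
final class HeaderSketch {
  final int fileLength;
  final int elementCount;
  final int firstOffset;
  final int lastOffset;

  private HeaderSketch(ByteBuffer buf) {
    this.fileLength = buf.getInt();
    this.elementCount = buf.getInt();
    this.firstOffset = buf.getInt();
    this.lastOffset = buf.getInt();
  }

  static HeaderSketch read(RandomAccessFile file) throws IOException {
    byte[] raw = new byte[16];
    file.seek(0);
    file.readFully(raw);
    return new HeaderSketch(ByteBuffer.wrap(raw)); // big-endian by default
  }
}
```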

Each element itself is comprised of a 4-byte element header that stores the length of the element, and the element data (variable length) itself.

The components of an element.

Adding Data

Consider adding an element to the QueueFile below.

QueueFile first writes (and fsyncs) the element and its length. Notice that the file header remains unchanged. If writing the second element fails, the QueueFile is still left in a valid state since the header hasn’t been updated (even though the new data may be on disk). Upon restart, the QueueFile would still report that the queue contains only a single element.

When the write operation completes successfully, the header is committed (and fsynced) as well. If updating the header fails, then the QueueFile remains the same as above, and the change is aborted. Otherwise, we’ve successfully added our data!

Calling fsync after every write prevents the filesystem from reordering our writes, and makes the transaction durable. This ensures that the QueueFile is never left in an invalid or corrupted state.

Removing Data

Removing and clearing do the reverse. Let’s start from the result of our previous operation.

QueueFile writes (and fsyncs) the header first. If committing the header fails, then the change is simply aborted and the QueueFile remains the same as above. When the header is written successfully, the removal is committed, and the queue now has a size of 1 (even though the removed data is still on disk).

QueueFile goes one step further and zeroes out the removed data, which leaves us just with the element we added previously.

In Conclusion

QueueFile has been a key part of the queueing system we’ve built, which powers data collection for over 1,400 Android apps running on more than 200 million devices. While the incredibly small size of the library helps prevent SDK bloat, its simplicity does also create a few limitations. For instance, QueueFile is limited to a size of 1GB, and it cannot be used by multiple processes concurrently. Neither was a deal breaker for us — we don’t want your queued analytics data using too much disk space, and we can create separate queues for different processes — but both are good to be aware of.

We’re working on porting this approach over to iOS, and have some ideas to expand QueueFile to work across a broader range of platforms and filesystems, like iOS and desktop apps. If you’re interested in helping us build infrastructural components like this for mobile apps and SDKs, we’re hiring.
