Go back to Blog


Tim French on April 29th 2021

Say goodbye to long customer support wait times and lengthy back-and-forth with Segment Personas and Twilio Flex.

All Engineering articles

Peter Reinhardt on February 27th 2017

Nightmare is a browser automation library for node.js, designed to be much simpler and easier to use than Phantomjs. We originally built Nightmare to create integration logos with 99Designs Tasks before they had an API, and we still use it in Sherlock. But the vast majority of Nightmare developers—now 55k+ downloads per month—use it for web UI testing and crawling.

This article is a quick introduction to using Nightmare for web UI testing. It uses Mocha as the testing framework, but you could similarly use Jest.


Nightmare’s API methods are designed to mimic real user actions:

  • .goto(url)

  • .type(elementSelector, text)

  • .click(elementSelector)

This makes testing with Nightmare very similar to how a human tester would navigate, click and type into your actual web app. In the next few sections we’ll dive into how to set your repo, then how to test page loads, submitting forms, and interacting with an app.

Repo Setup

First we need to install mocha and nightmare, and make sure our basic test harness is working.

Starting on the command line in your repo folder…

In test/test.js you can get started with:

const Nightmare = require('nightmare') const assert = require('assert')

describe('Load a Page', function() { // Recommended: 5s locally, 10s to remote server, 30s from airplane ¯\_(ツ)_/¯ this.timeout('30s')

let nightmare = null beforeEach(() => { nightmare = new Nightmare() })

describe('/ (Home Page)', () => { it('should load without error', done => { // your actual testing urls will likely be `http://localhost:port/path` nightmare.goto('https://gethoodie.com') .end() .then(function (result) { done() }) .catch(done) }) }) })

Add mocha as the test script to your package.json:

"scripts": { "test": "mocha" }

Finally, to test this complete setup you can run npm test on the command line…

npm test > Load a Page > ✓ should load a web page (12223ms) > 1 passing (12s)

Loading a Page

Most web products have a set of public pages used for documentation, support, marketing, authentication and signup. Here’s how you can test that these pages load successfully:

describe('Public Pages', function() { // Recommended: 5s locally, 10s to remote server, 30s from airplane ¯\_(ツ)_/¯ this.timeout('30s')

let nightmare = null beforeEach(() => { nightmare = new Nightmare() })

describe('/ (Home Page)', () => { it('should load without error', done => { // your actual testing urls will likely be `http://localhost:port/path` nightmare.goto('https://gethoodie.com') .end() .then(function (result) { done() }) .catch(done) }) })

describe('/auth (Login Page)', () => { it('should load without error', done => { nightmare.goto('https://gethoodie.com/auth') .end() .then(result => { done() }) .catch(done) }) }) })

Submitting a Form

This example tests that Hoodie’s login function fails with bad credentials. It’s always worth testing failed states as well as successful states. 🤖

describe('Login Page', function () { this.timeout('30s')

let nightmare = null beforeEach(() => { // show true lets you see wth is actually happening :) nightmare = new Nightmare({ show: true }) })

describe('given bad data', () => { it('should fail', done => { nightmare .goto('https://gethoodie.com/auth') .on('page', (type, message) => { if (type == 'alert') done() }) .type('.login-email-input', 'notgonnawork') .type('.login-password-input', 'invalid password') .click('.login-submit') .wait(2000) .end() .then() .catch(done) }) }) })

Using the App

This example is more involved, and includes signing up with text fields, select fields, and clicking and waiting through a flow that spans multiple pages.

describe('Using the App', function () { this.timeout('60s')

let nightmare = null beforeEach(() => { // show true lets you see wth is actually happening :) nightmare = new Nightmare({ show: true }) })

describe('signing up and finishing setup', () => { it('should work without timing out', done => { nightmare .goto('https://gethoodie.com/auth') .type('.signup-email-input', 't'+Math.round(Math.random()*100000)+'@test.com') .type('.signup-password-input', 'valid password') .type('.signup-password-confirm-input', 'valid password') .click('.signup-submit') .wait(2000) .select('.sizes-jeans-select', '30W x 30L') .select('.sizes-shoes-select', '9.5') .click('.sizes-submit') .wait('.shipit') // this selector only appears on the catalog page .end() .then(result => { done() }) .catch(done) }) }) })

All Together Now

The final example ties all these together into a cleanly formatted test/test.js:

const Nightmare = require('nightmare') const assert = require('assert')

describe('UI Flow Tests', function() { this.timeout('60s')

let nightmare = null beforeEach(() => { nightmare = new Nightmare({ show: true }) })

describe('Public Pages', function() { describe('/ (Home Page)', () => { it('should load without error', done => { // your actual testing urls will likely be `http://localhost:port/path` nightmare.goto('https://gethoodie.com') .end() .then(function (result) { done() }) .catch(done) }) }) describe('/auth (Login Page)', () => { it('should load without error', done => { nightmare.goto('https://gethoodie.com/auth') .end() .then(result => { done() }) .catch(done) }) }) })

describe('Login Page', function () { describe('given bad data', () => { it('should fail', done => { nightmare .goto('https://gethoodie.com/auth') .on('page', (type, message) => { if (type == 'alert') done() }) .type('.login-email-input', 'notgonnawork') .type('.login-password-input', 'invalid password') .click('.login-submit') .wait(2000) .end() .then() .catch(done) }) }) })

describe('Using the App', function () { describe('signing up and finishing setup', () => { it('should work without timing out', done => { nightmare .goto('https://gethoodie.com/auth') .type('.signup-email-input', 'test+'+Math.round(Math.random()*1000000)+'@test.com') .type('.signup-password-input', 'valid password') .type('.signup-password-confirm-input', 'valid password') .click('.signup-submit') .wait(2000) .select('.sizes-jeans-select', '30W x 30L') .select('.sizes-shoes-select', '9.5') .click('.sizes-submit') .wait('.shipit') // this selector only appears on the catalog page .end() .then(result => { done() }) .catch(done) }) }) }) })

If you have additional questions or want to join the 90+ people who have contributed to Nightmare, head over to the Github repo. Happy testing.

Peter Reinhardt on October 19th 2016

At Segment, focus is one of our four core values. But it was difficult for team members to focus in the office, so in June we ran an internal team survey about what helps and hurts focus. The results showed that “chatter and noise” was one of the biggest culprits for distraction around the office. “Slack group channels” came in second.

These answers left us with two difficult questions: how do you solve a noise problem in an open floor plan? And where is the noise even coming from?

To get to the bottom of it, I decided to build an iOS app to collect decibel levels from around the office. We found that noise levels varied widely throughout the office, and using the new data, we changed the office layout to increase our ability to focus. Numerically speaking, the increased focused time (as measured by survey) has been equivalent to hiring 10–15 teammates. And beyond the numbers, it feels great to focus more. 😃

Where is the noise coming from?

At first we thought we were just being a bit too chatty. But demanding “be quiet!” is horrible in a collaborative work environment. We also noticed something odd: when people walked into our office they’d often say, “Wow! This is one of the quietest offices I’ve ever been to.” Of course, the survey said the opposite… that the office was noisy and distracting.

This discrepancy was particularly confusing because our office is an open floor plan. Sound ought to travel well around such a big open space. It was strange that we had two widely divergent stories around the quietness and loudness of the office.

Here’s a picture showing the high ceilings, plants and open layout:

And here’s our floorplan showing the lack of walls… lots of ways for sound to bounce around (view above shown in blue):

The conflicting anecdotal stories from visitors and teammates were perplexing. To get to the bottom of things, we needed hard data. And what better way to collect that data than the ambient sensors called iPhones already sitting around our office? So I built an app to record decibel levels in different areas and give us some real data to understand the situation.

The iOS app is tiny: it uses the Apple AVAudioRecorder class’s level-metering to passively collect and report average and maximum ambient decibel levels every 10 seconds to our server monitoring tool Datadog. We’ve open-sourced our Decibel noise-recording app for you to use. Just add your Datadog API key and off you go.

Originally I planned to ask a bunch of people to install it on their phones around the office. But then our VP Engineering had a much better idea: deploy it on the iPads we have outside every conference room.

Below you can see our data collection points (iPad minis) as red dots outside each room:

The graph below shows measurements from August 31, 2016, clearly showing spikes of noise in the office throughout the day (the absolute values are arbitrary, but the data is good for relative comparison):

Among other things, you can see a full 10 dB difference between the quietest and loudest parts of the office! 10 dB feels roughly “twice as loud,” so this is a big difference!

Finally, here are the average noise level results (plus some manual interpolation of the sparse data collection points) overlaid on the floor plan of the office (red loudest, green quietest):

This resolved the mystery. The front of the office where visitors hang out was twice as quiet (10dB quieter) compared to the areas where the people work. Both sets of anecdotal stories were right. When I showed this graph to Tony, one of our security guards, he said “Oh yeah, it’s WAY quieter up at the front of the office, even at night.”

What to do about it?

We can’t immediately ditch our open floor plan (although we’re looking at various options for our next office.) But this new noise level data gave us an obvious way to reduce distractions: the teams needing the quietest work area (engineering, product and design) should move to the quietest part of the office.

Last month we made the big move. The teams needing the most verbal collaboration — Segment’s sales, support, and marketing teams — moved to the naturally louder parts of the office. The teams needing the most quiet — engineering, product, and design — moved to the quietest parts of the office.

We’re still dialing in parts of the office that became a bit cramped, but a re-run of our original focus survey with the team showed that total focus time had increased from 45% to 60% of time in the office! In a purely numerical sense, you could equate that to hiring 10–15 people. It also feels awesome to focus.

The combination of survey and noise data has been incredibly helpful in iterating towards a more productive office, and we still have lots of ideas to test. For example, we’ve started an experiment with some teams in a new “war room” layout, and will keep looking for other ways to optimize the environment to be productive and fun, measuring results as we go. Would love to share results if you’ve experimented and measured results in your own space.

Lauren Venell on October 11th 2016

When it comes to your app, size makes a difference. Bigger apps have fewer downloads, worse reviews, and a harder time penetrating the international market. We measured the exact impact of increased app size, shown below. We’ve also included learnings on how to prevent bloat in your own app.

Peter Reinhardt on October 5th 2016

A few months ago we were at a conference in Half Moon Bay talking with a general manager at a large mobile company, and he said, “One of my projects for the next six months is to reduce SDK bloat in all our apps.” Six months is a substantial investment! So we asked why this mattered so much to him.

He gave three reasons:

  1. They’re up against the 100MB app size cellular download limit, and they need to find ways to reduce the size of their app so that they don’t take a hit on installs.

  2. Occasionally their SDK vendors have bugs that crash their app, giving them bad reviews and lowering installs.

  3. Often their SDK vendors are unprepared for major iOS version upgrades, blocking key releases and big announcements.

While it makes intuitive sense that a large app download size would demotivate people and reduce installs, we wanted to dig deeper into these reasons and really assess the quantitative impact. Are we sure app size reduces installs? How big is the effect?

So, over the last few months we’ve done some extensive research into these issues… and in this article we’re going to share some new experimental results on how app size affects install rate.

How to gain 147MB in just 17 days…

To measure the impact of increased app size, we needed to buy a small app, with no active marketing activities but significant steady downloads. Then we needed to increase the app’s size, leaving everything else constant, and observe the impact on app install rate. To the best of our abilities, this would simulate the impact of SDK bloat, or anything else that just makes an app bulky.

So, we bought the Mortgage Calculator Free iOS app through some friends in the YC Founders network. It was a minuscule 3MB, had a steady pattern of organic installs (~50 installs per day for several years), and had no active marketing activities.

Then, without making any additional changes, we bloated the app from 3MB to 99MB, 123 MB and then finally 150MB, observing the isolated impact on install rate with each change in app size. In the real world, app sizes can increase substantially with the addition of seemingly simple things, like an SDK, an explainer video, a bunch of fonts, or a beautiful background picture for your loading screen. For the purposes of our experiment, we just bloated the app with a ton of hidden album art from our engineering team’s favorite artist.

To quantitatively measure the impact of each successive bloating, we looked at data provided directly by Apple in iTunes Analytics; specifically conversion from “Product Page Views” to “App Units” (or colloquially “installs”/”install rate”).

…and lose 66% of your installs

With the larger app sizes, we saw substantial losses in product page to app install rate. In particular, there was a substantial drop around the cellular download limit (~100MB), above which Apple does not let users download the app over 3G or 4G. (Note: the conversion rate can be greater than 100% due to installs direct from search results that skip product page view.)

From these results we estimate a linear decrease in install conversion rate below the cellular download limit from 3–99MB at 0.45% per MB. Above the cellular download limit we estimate a linear decrease in install rate of 0.32% per MB. To our best estimate, the gap between the two lines is covered by a 10% instantaneous install rate drop across the cellular download limit. (Although Apple says the cellular download limit is 100MB, we found in practice that a 101MB IPA did not trigger the cellular download block… the actual limit was somewhere between 101MB and 123MB and varied depending on the exact build.)

Increasing the size of our app from 3MB to 99MB reduced installs by 43%, and the increase to 150MB reduced installs by 66% in total. For mobile companies striving for growth, an increase in app size is extremely costly.How to destroy an app

In an attempt to be proper growth scientists, we tried to replicate the experiment by returning the app to its original 3MB size (plus other intermediate sizes) and re-measuring the install rate. Unfortunately, as a result of our earlier bloating, the app attracted several critical ratings and reviews, which stick around forever:

In our measurements, the app’s growth appears to be semi-permanently damaged and we only saw a minor rebound to 59% install conversion rate for the 16 days the 3MB version was available. (As a side note you can see these customers cite 140MB and 181MB as the download sizes… the true download size varies depending on the customer’s device and OS version.)

What makes an app big

As an example, we randomly chose to inspect the NBC Sports app that’s been popular and featured on the App Store during the Olympic games. The app is 90.5MB in total. By the far the biggest part of app’s size is images, accounting for 51% of the entire app’s size, with code (23%), fonts (16%) and video (9%) accounting for most of the rest.

Images were both in the raw app package and in the asset catalog (hidden away inside assets.car). There are huge numbers of local station & team logos, startup screens, and soccer field layouts.

Within the “code” category (which is a single encrypted file) are around a dozen different SDKs, a handful of which we could detect and estimate for size contribution. SDKs contributed roughly 3.5MB to the total app size, with an estimated impact of -1.54% install rate.

Overall, there’s an incredible amount of low-hanging fruit in optimizing an app’s size. One particularly honorable mention is the (presumably accidental) inclusion of the Adobe VideoHeartbeat SDK docs (955 kb) in the production IPA package downloaded through iTunes.

The impact of SDKs

While most SDKs are small individually, the net impact of installing many SDKs in your app is a significant increase to the app’s size, and therefore a meaningful negative impact on install rate, not to mention engineering time & maintenance. SDK vendors can be unprepared for new iOS releases, which can block key releases and announcements. Or, SDKs can occasionally introduce bugs, and even a single crash can mean a would-be user will never use your app again.

Therefore, it makes sense to limit the number of SDKs you bundle into your app in order to optimize for performance. But you still need to make sure you’re tracking and collecting key lifecycle events. Segment built an easy, lightweight solution called the Native Mobile Spec. The Native Mobile Spec automatically collects events that allow you measure top mobile metrics without any tracking code.

Building your best possible app

Now that we conclusively know how to destroy an app, we also know how to improve one. Here are a few key steps that you can take to make sure your app is ready for holiday season:

  1. Use a mobile app size calculator to find out the exact size of your app.

  2. Review our guide for getting your app ready for launch, for a complete set of tips and advice on how to make your app great.

  3. Read about how Segment can help mobile teams.

A version of this blog post originally appeared on Recode.

Andy Jiang on July 28th 2016

At Segment, we’re working hard to make our mobile SDKs the best possible collection options for your analytics data. An SDK can make your data more durable, minimize data transfer, and optimize your app’s battery usage. In this article, we’ll take you through what happens under the hood as a piece of data flows through our iOS and Android SDKs from a button handler in your app all the way through to our API.

The Initialization

When you initialize Segment’s iOS and Android library, it will automatically start tracking a few app lifecycle events that are part of our Native Mobile Spec.

  • Application Installed — User installs your application for the first time.

  • Application Opened — User opens your application.

  • Application Updated — User updates to a newer version of your app.

The automatic lifecycle tracking allows you to focus on building your app and thinking less about your analytics tracking plan.

Now that we’ve initialized the library, we’re ready for user interaction.

The Tap

Let’s take a look at a fictitious mobile app, Bluth’s Banana Stand, and see what happens with our SDKs when the user completes an order.

Listening on the purchase button handler, you’ll fire off a track call:

It’s safe to invoke this call directly from your button handler, as our SDKs automatically dispatch it to a background queue thread.

These queues represent various threads on the mobile device. The .track() event is moved to the background queue so the UI queue can continue to operate for the user.

Inside the Queue

Once dispatched to the background thread, the event is immediately written to the device’s disk to maximize the chances of deliverability. In a world where power can be lost at any time, this method maximizes the durability of your data. (You can read about why we chose QueueFile for reliable request batching on Android).

The Road to Segment’s API

On the way out of the queue and onto Segment’s tracking API, Segment’s SDKs use two different strategies to optimize your app’s bandwidth and power usage:

  1. Batching — Combining multiple events into each outbound request allows us to minimize the amount of times we power up the wireless hardware

  2. Compression — Compressing each request with gzip which allows us to drastically reduce the amount of bandwidth used

Before the request is sent outbound, the event is now batched together with other events before being sent to our servers. Our SDKs minimize battery use with auto-batching, since powering up the radio on every call can waste battery life. Auto-batching allows our SDKs to send events in batches, without powering up the radio on every call.

Visual representation of batched events, with the green circle representing a .track() event.

We batch our requests to minimize the number of times we access the underlying hardware. This allows us to send fewer requests, spaced further apart, and therefore allow wireless hardware to remain offline for majority of the time.

Next, we’ll compress the outbound request body with gzip, reducing the total size drastically.

In our testing, we sent one thousand .track() calls without the SDK (directly to our API) and with the SDK. The amount of data on the wire without the SDK was 1.1mb and with the SDK is 63kb — roughly 17x saving in bandwidth. We’re able to accomplish this via intelligent batching and aggressive data “middle-out compression” technologies.

These are the bandwidth impacts realized with our SDKs’ batching and compression.

The battery story is even better—we’re able to reduce the wasted energy by almost 3x from 56% overhead to 20% overhead. The net result of this is low average energy impact on the app and much more efficient battery usage—and happier users!

These are the battery enhancements our SDKs enable. (Xcode).

Finally, the request leaves the mobile device and makes its way to our servers.

Trace Completed

By tracing the SDKs functionality from tap to API, we’re able to see how a few simple data transfer strategies can significantly strengthen your mobile analytics stack:

  1. Increase durability — Immediately persist every message to a disk-backed queue, preventing you from losing data in the event of battery or network loss.

  2. Auto-Retries — If the network is spotty or not available, our SDKs will retry transferring the batch until the request is successful. This drastically improves data deliverability.

  3. Reduce Network Usage — Each batch is gzip compressed, decreasing the amount of bytes on the wire by 10x-20x.

  4. Save Battery — Because of data batching and compression, Segment’s SDKs reduce energy overhead by 2-3x which means longer battery life for your app’s users.

These features would not have been possible without the help of our incredibly supportive open source community. With 80+ contributors, 550+ pull requests, 240+ releases, our iOS and Android libraries just kept improving. Thanks to their work, more than 3,000 apps now collect customer data using Segment’s mobile SDKs every day.

We can’t wait to keep improving the mobile analytics ecosystem alongside you. If you have any ideas, we’re all ears! Send a pull request or email us at friends@segment.com. 😃

Want to learn about and grow your mobile users? Sign up today.

Interested in building infrastructural components for mobile apps and SDKs? We’re hiring.

Calvin French-Owen on June 23rd 2016

As part of our push to open up what’s going on internally at Segment – we’d like to share how we run our CI builds. Most of our approaches follow standard practices, but we wanted to share a few tips and tricks we use to speed up our build pipeline.

Powering all of our builds are CircleCIGithub, and Docker Hub. Whenever there’s a push to Github, the repository triggers a build on CircleCI. If that build is a tagged release, and passes the tests, we build an image for that container.

The image is then pushed to Docker Hub, and is ready to be deployed to our production infrastructure.

CircleCI and Travis CI

Before going any further, I’d like to talk about the elephant in the room: Travis CI. Pretty much every discussion of CI tools has some debate around using Travis CI vs CircleCI. Both are sleek, hosted, and responsive. And both are incredibly easy to use.

Honestly, we love both tools. We use Travis CI for a lot of our open source libraries, while CircleCI powers many of our private repos. Both products work extremely well and do a fantastic job meeting our needs.

However, there’s one feature we really love about CircleCI: SSH access.

Most of the time, we don’t have any problems configuring our test environments. But when we do, the ability to SSH into the container running your code is invaluable.

To put this in perspective, Segment runs our entire infrastructure with hundreds of different microservices. Each one is pulled from a different repo, and runs with a few different dependencies via docker-compose (more on that later).

Most of our CI is relatively standard, but occasionally setting up a service with a fresh environment requires some custom work. It’s in a new repo, and will require it’s own set of dependencies and build steps. And that’s where being able to run commands from within the environment you’re testing against is so handy – you can tweak configuration right on the box.

No more hundreds of “fixing CI” commits!


To work with all of these different repos, we wanted to make it trivially easy to setup a repo so that it has CI enabled. We have three different circle commands we use regularly, which are shared amongst our common dotfiles. First, there’s circle() which sets up all the right environment variables and automatically enabled our slack notifications.

Additionally, we have a circle.open() command which automatically opens the test results right from your CLI in your browser

And finally there’s the circle.badge() command to automatically add to a badge to a repo

Shared Scripts

Now given the fact that we have hundreds of repos, we have the task of keeping all the testing scripts and repositories in-sync when we make changes to our circle.yml files.

Maintaining the same behavior across a few hundred repos is annoying, but we’ve decided we’d rather trade abstraction problems (hard to solve) for investing more heavily in tooling (generally easier).

For that, we use a common set of scripts inside a shared git repo. The scripts are pulled down every time the test runs, and handle the shared packaging and deployments. Each service’s circle.yml file looks something like this:

It means that if we change our deploy scheme, we only have to update the code in one place rather than updating each individual repo’s circle.yml. We can then reference different scripts depending on what sorts of builds we need within the individual service repo.

Docker Containers

Finally, the entire build process wouldn’t be possible without Docker containers. Containers have greatly simplified how we push code to production, test against our internal services, and develop locally.

When testing our services, we make use of docker-compose.yml files to run our tests. That way, a given service can actually test against the exact same images in CI as are running in production. It reduces the need for mocks or stubs.

What’s more, when the images are built by CI–we can pull those same images down and run them locally as well.

To actually build that code and push it to production, CircleCI will first run the tests, then check whether the build is a tagged release. For any tagged releases, we have CircleCI build the container via a Dockerfile, then tag it and push the deploy to Docker Hub.

Instead of using latest everywhere, we explicitly deploy the tagged image to Docker Hub, along with the major version (1.x) and minor version (1.2.x).

This way, we’re able to specify rollbacks to a specific version when we need it – or deploy the latest build of a certain release branch if we don’t need a specific version (useful for local development and in docker-compose files).

The code to do this is relatively straightforward, first we detect the versions:

And then we build, tag, and push our docker images:

Once our images are pushed to Docker Hub, we’re guaranteed to have the right version of the code built so that we can deploy it to production and run it inside ECS.

Thanks to containers, our CI pipeline gives us much better confidence when deploying our microservices to production.

Coming Full Circle

So there you have it: our CI build pipeline, heavily powered by Github, CircleCI, and Docker.

While we’re constantly trying to find ways to make the entire pipeline a bit more seamless, we’ve been happy with the low maintenance, parallelization and isolation provided by using third-party tools.

On that note, if you’re managing a large number of repos, we’d love to hear about your own techniques for managing your build pipeline. Drop us a note by email (friends@segment) or on Twitter!

P.S. We’ll be at CircleCI’s office hours next Thursday (June 30) in San Francisco. Please join us for a special talk on Google’s AMP project!

Calvin French-Owen on June 15th 2016

AWS is the default for running production infrastructure. It’s cheap, scalable, and flexible to whatever configuration you’d like to run on top of it. But that flexibility comes with a cost: it makes AWS endlessly configurable.

You can build whatever you want on top of AWS, but that means it’s difficult to know whether you’re taking the right approach. Pretty much every startup we talk with has the same question: What’s the right way to setup our infrastructure?

To help solve that problem, we’re excited to open source the Segment AWS Stack. It’s our first pass at building a collection of Terraform modules for creating production-ready architecture on AWS. It’s largely based on the service architecture we use internally to process billions of messages every month, but built solely on AWS.

The steps are incredibly simple. Add 5 lines of Terraform, run terraform apply, and you’ll have your base infrastructure up and running in just three minutes.

It’s like a mini-Heroku that you host yourself. No magic, just AWS.

Batteries Included

Our major goals with Stack are:

  • to provide a good set of defaults for production infrastructure

  • make the AWS setup process incredibly simple

  • allow users to easily customize the core abstractions and run their own infrastructure

To achieve those goals, Stack is built with Hashicorp’s Terraform.

Terraform provides a means of configuring infrastructure as code. You write code that represents things like EC2 instances, S3 buckets, and more–and then use Terraform to create them.

Terraform manages the state of your infrastructure internally by building a dependency graph of which resources depend on one another:

and then applies only the “diff” of changes to your production environment. Terraform makes changing your infrastructure incredibly seamless because it already knows which resources have to be re-created and which can remain untouched.

Terraform provides easy-to-use, high level abstractions for provisioning cloud infrastructure, but also exposes the low-level AWS resources for custom configuration. This low-level access provides a marvelous “escape hatch” for truly custom needs.

To give you a flavor of what the setup process looks like, run terraform applyagainst this basic configuration:

It will automatically create all of the following:

Networking: Stack includes a new VPC, with public and private subnets. All routing tables, Internet Gateways, NAT Gateways, and basic security groups are automatically provisioned.

Auto-scaling default cluster: Stack ships with an auto-scaling group and basic lifecycle rules to automatically add new instances to the default cluster as they are needed.

ECS configuration: in Stack, all services run atop ECS. Simply create a new service, and the auto-scaling default cluster will automatically pick it up. Each instance ships with Docker and the latest ecs-agent.

CloudWatch logging & metrics: Stack sends all container logs to CloudWatch. Because all requests between services go through ELBs, metrics around latency and status codes are automatically collected as well.

Bastion: Stack also includes a bastion host for manual SSH access to your cluster. Besides the public services, it’s the only instance exposed to the outside world and acts as the “jump point” for manual access.

This basic setup uses the stack module as a unit, but Terraform can also reference the components of Stack individually.

That means that you can reference any of the internal modules that the stack uses, while continuing to use your own custom networking and instance configuration.

Want to only create Stack services, but bring your own VPC? Just source the service module and pass in your existing VPC ID. Don’t need a bastion and want custom security groups? Source only the vpc and cluster modules to set up only the default networking.

You’re free to take the pieces you want and leave the rest.

If you’d like to dig into more about how this works in-depth, and each of the modules individually, check out the Architecture section of the Readme.

Now, let’s walkthrough how to provision a new app and add our internal services.


Note: this walkthrough assumes you have an AWS account and Terraform installed. If not, first get the pre-requisites from the requirements section.

For this tutorial, we’ll reference the pieces of the demo app we’ve built: Pingdummy, a web-based uptime monitoring system.

All of the Docker images we use in this example are public, so you can try them yourself!

The Pingdummy infrastructure runs a few different services to demonstrate how services can be deployed and integrated using Stack.

  • the pingdummy-frontend is the main webpage users hit to register and create healthchecks. It uses the web-service module to run as a service that is publicly accessible to the internet.

  • the pingdummy-beacon is an internal service which makes requests to other third-party services, and responds with their status. It uses the servicemodule, and is not internet facing. (though here it’s used for example purposes, this service could eventually be run in many regions for HA requests)

  • the pingdummy-worker is a worker which periodically sends requests to the pingdummy-beacon service. It uses the worker module as it only needs a service definition, not a load balancer.

  • an RDS instance used for persistence

First, you’ll want to add a Terraform file to define all of the pieces of your infrastructure on AWS. Start by creating a terraform.tf file in your project directory.

Then, copy the basic stack setup to it:

And then use the Terraform CLI to actually apply the infrastructure:

This will create all the basic pieces of infrastructure we described in the first section.

Note: for managing Terraform’s remote state with more than a single user, we recommend configuring the remote state to use Terraform Enterprise or S3. You can use our pingdummy repo’s Makefile as an example.

Now we’ll add RDS as our persistence layer. We can pull the rds module from Stack, and then reference the outputs of the base networking and security groups we’ve already created. Terraform will automatically interpolate these and set up a dependency graph to re-create the resources if they change.

Again, we’ll need to run plan and apply again to create the new resources:

And presto! Our VPC now has an RDS cluster to use for persistence, managed by Terraform.

Now that we have our persistence and base layers setup, it’s time to add the services that run the Pingdummy app.

We can start with the internal beacon service for our health-checks. This service will listens on port 3001 and makes outbound HTTP requests to third-parties to check if a given URL is responding properly.

We’ll need to use the service module which creates an internal service that sits behind an ELB. That ELB will be automatically addressable at beacon.stack.local,and ECS will automatically add the service containers to the ELB once they pass the health check.

Next, we’ll add the pingdummy-worker service. It is responsible for making requests to our internal beacon service.

As you can see, we’ve used the worker module since this program doesn’t need a load balancer or DNS name. We can also pass custom configuration to it via environment variables or command line flags. In this case, it’s passed the address of the beacon service.

Finally, we can add our pingdummy-frontend web app which will be Internet-accessible. This will use the web-service module so that the ELB can serve requests from the public subnet.

In order to make the frontend work, we need a few extra pieces of configuration beyond just what the base web-service module provides.

We’ll first need to add an SSL certificate that’s been uploaded to AWS. Sadly, there’s no terraform configuration for doing this (it requires a manual step), but you can find instructions in the AWS docs.

From there, we can tell our module that we’d like it to be accessible on the public subnets and security groups and be externally facing. The stack module creates these all individually, so we can merely pass them in and we’ll be off to the races.

Finally, run the plan and apply commands one more time:

And we’re done! Just like that, we have a multi-AZ microservice architecture running on vanilla AWS.

Looking in the AWS console, you should see logs streaming into CloudWatch from our brand new services. And whenever a request is made to the service, you should see HTTP metrics on each of the service ELBs.

To deploy new versions of these services, simply change the versions in the Terraform configuration, then re-apply. New task definitions will be created and the appropriate containers will be cycled with zero downtime.

There’s a few other pieces you’ll need to add, which you can see examples for in the main Pingdummy terraform file. Keep in mind that the example is a dummy app, and is not how we’d recommend doing things like security groups or configuration in production. We’ll have more on that in terraform to come :).

One More Thing…

Additionally, we’re excited to open source a few other pieces that were involved in releasing the Stack:

Amir Abu Shareb created terraform-docs, a command-line tool to automatically generate documentation for Terraform modules. You can think of it as the godoc of the Terraform world, automatically extracting inputs, outputs, and module usage in an easily consumable format.

We use terraform-docs to build all of the module reference documentation for Stack.

Achille Roussel created ecs-logs, an agent for sending logs from journald to CloudWatch. It provides all the built-in logging for Stack, and makes sure to create a log group for each service and a single log stream per container.

Go Forth, and Stack

It’s our hope that this post gave you a brief look at the raw power of what can be achieved with the AWS APIs these days. The ease of Terraform paired with the flexibility and scale of AWS is an extremely powerful combination.

Stack is a “first pass” of what combining these technologies can achieve. It’s by no means finished, and only provides the foundation for many of the ideas that we’ve put into production. Additionally, we’re trying some new experiments around log drivers and instances (reflected by the 0.1 tag) which we think will pay off in the future.

Nonetheless, we’ve open sourced Stack today as the first step to gather as much community wisdom around running infrastructure atop AWS.

In that vein, we’ll happily accept pull requests for new modules that fall within the spirit of the project. It’s our goal to provide the community with a good set of Terraform modules that provide sane defaults and simpler abstractions on top of the raw AWS infrastructure.

So go ahead and try out the Stack today, and please let us know what you think!

Part of the Segment infrastructure team hacking on The Segment Stack: Amir Abu Shareb, Rick Branson, Calvin French-Owen, Kevin Lo, and Achille Roussel. Open sourced at a team-offsite in Amsterdam.

Prateek Srivastava on June 14th 2016

Segment’s mobile SDKs are designed to track behavioral data from your app and translate and route that data to hundreds of downstream integrations. One of the SDK’s core tasks is to upload behavioral data to our servers. Since every network request requires your app to power up the device’s radio, uploading this data in real-time can quickly drain a battery.

To minimize the impact of the SDK on battery life, we queue these behavioral events and upload them in batches periodically. This results in 3x less battery drain over an implementation without batching.

Our Android queuing system is built on QueueFile, which was developed by Square. In this article we’re going to run though our queueing requirements, some of traditional solutions, and explain in detail why we chose to build on Queue File.

Our Queueing Requirements

Queues are a deceptively simple concept, but there are a two main considerations that make it complicated in practice: durability and atomicity.

  1. Durability - An element that has been added to the queue will survive permanently. An in-memory queue is easy to implement, but it lacks durability — events are queued until the process dies, and lost thereafter.

  2. Atomicity - An element is either added to the queue or isn’t; the queue won’t ever be in an partial/invalid state. For our purposes, we needed to save events to disk as quickly and reliably as possible, and then worry about uploading them later.

We were looking for a solution that would guarantee these contracts would be be honored even in the face of process deaths or system crashes. Such scenarios are inevitable on mobile, e.g. the user could lose battery power in the middle of an operation, or the operating system could kill your application to reclaim memory.

Traditional Solutions

File: The most obvious way to persist data to disk is to use a plain old Java file. However, writing to a file is not atomic (in most cases), and it’s easy to run into cases of a corrupted file. A trivial implementation would also keep the entire queue in memory, which is not ideal for devices that might go offline for a long time.

AtomicFile: AtomicFile (also available from the support library) is a simple helper that can guarantee atomic writes on a regular file. It guarantees atomicity by creating a backup of the original file before writing any changes, and waiting until the write is completely written to disk before deleting the backup. As long as the backup file exists, the original file is considered to be invalid. However, AtomicFile requires the caller to maintain track of the corrupted state and restore itself from one.

SharedPreferences: The simplest way to persist data onto disk is by using SharedPreferences from the Android framework. Although designed to store small key-value pairs, it can certainly be used to store any data in a pinch. The SharedPreferences class is a high level wrapper around its own implementation of an atomic file and an in-memory cache. Writes are committed to a map in memory first, and then saved to disk. However, writing to SharedPreferences is not durable. SharedPreferences silently swallows any disk write error, leaving the memory cache and disk out of sync, with callers left guessing about the result of an operation.

SQLite: The typical solution to designing a reliable disk queue on Android is to layer it on top a SQLite database. SQLite works great for large datasets that need multithreaded access, but being a fully featured database, it is designed with optimizations for advanced querying and not for the purposes of a queue. SQLite is complex with many moving parts, and this made us wary of relying on it to back such a critical section of our code.

Why We Chose QueueFile

Although all of the above solutions could have been coerced to work for us, none of them were designed specifically to be used as queues. Luckily, Square had also run into this problem for accepting payments, and they built QueueFile to accomplish this. QueueFile guarantees that all operations are atomic, writes are durable, and is designed to survive process and even system-level crashes.

It was a solution tailor-made for our use case. Adding and removing elements from QueueFile is ordered first in first out (FIFO) and takes constant time, both of which are a nice bonus. QueueFile’s tiny size also makes it perfect for us to embed in our SDK.

Core Concepts of QueueFile

There are a few guarantees that the filesystem provides:

  • renaming a file is an atomic operation

  • fsync is durable

  • segment writes are atomic

QueueFile is particularly clever about the way it stores and updates data — Bob Lee has an excellent talk on the subject. QueueFile consists of a 16 byte file header, and a series of items called elements, as a circular buffer. The grey area (not to scale) represents empty space in the file.

The file header consists of four 4-byte integers that represent the length of the file, the number of elements in the file, and a pointer to the location of the first and last elements. Since the length of the file header is only 16 bytes (smaller than the size of a segment), changes to the file header are atomic. QueueFile relies on this by making modifications to the file visible only when the header is committed as well.

The components of a file header.

Each element itself is comprised of a 4-byte element header that stores the length of the element, and the element data (variable length) itself.

The components of an element.

Adding Data

Consider adding an element to the QueueFile below.

QueueFile first writes (and fsyncs) the element and its length. Notice that the file header remains unchanged. If writing the second element fails, the QueueFile is still left in a valid state since the header hasn’t been updated (even though the new data may be on disk). Upon restart, the QueueFile would still report that queue contains only a single element.

When the write operation completes successfully, the header is committed (and fsynced) as well. If updating the header fails, then the QueueFile remains the same as above, and the change is aborted. Otherwise, we’ve successfully added our data!

Calling fsync after every write prevents the filesystem from reordering our writes, and makes the transaction durable. This ensures that the QueueFile is never left in an invalid or corrupted state.

Removing Data

Removing and clearing do the reverse. Let’s start from the result of our previous operation.

QueueFile writes (and fsyncs) the header first. If committing the header fails, then the change is simply aborted and the QueueFile remains the same as above. When the header is written successfully, the removal is committed, and now has a size of 1 (even though the data is on disk).

QueueFile goes one step further and zeroes out the removed data, which leaves us just with the element we added previously.

In Conclusion

QueueFile has been a key part of the queueing system we’ve built which powers data collection for over 1,400 Android apps running on more than 200 million devices. While the incredibly small size of the library helps prevent SDK bloat, its simplicity does also create a few limitations. For instance, QueueFile is limited to a size of 1GB, and it cannot be used by multiple processes concurrently. Neither were deal breakers for us — we don’t want your queued analytics data using too much disk space and we can create separate queues for different processes — but are good to be aware of.

We’re working on porting this approach over to iOS and have some ideas to expand QueueFile to work on a broader range of filesystems like iOS and desktop apps. If you’re interested in helping us build infrastructural components like this for mobile apps and SDKs, we’re hiring.

Stephen Mathieson, Calvin French-Owen on May 26th 2016

For the past year, we’ve been heavy users of Amazon’s EC2 Container Service (ECS). It’s given us an easy way to run and deploy thousands of containers across our infrastructure.

ECS gives us free Cloudwatch metrics, automatic healthchecks, and scheduling across hundreds of nodes. It’s radically simplified how we deploy code to production. In short, it’s pretty awesome.

But ECS had one major shortcoming: navigating the AWS Console is a massive pain. It’s hard to find which containers are running and what version of an image is deployed.

That’s why we built Specs

Specs gives you a high-level window into your ECS clusters and services.

It provides a handy search box to find particular services in production, and then helps diagnose what the service’s exact state. Specs lets us know if a container isn’t able to be placed due to insufficient capacity or if all containers in a service are dead.

It gives the entire development team a better picture of how our production environment is configured.

We run Specs behind an internal Google oAuth server, so any engineer can view it even if they don’t have an IAM account. It’s allowed a lot more of the team to debug issues without bothering the core infra team.

Getting started with Specs couldn’t be easier. Assuming you’re running Docker and have your AWS credentials exported, just run the following command:

You’ll get convenient dashboards with zero configuration.

If you’d like to help contribute, please check out the github repo. We’re excited to add features like log tailing and a real-time events feed to make working with ECS a truly seamless experience.

Become a data expert.

Get the latest articles on all things data, product, and growth delivered straight to your inbox.