Go back to Blog


Calvin French-Owen on June 5th 2020

This blog should not be construed as legal advice. Please discuss with your counsel what you need to do to comply with the GDPR, CCPA, and other similar laws.

Under the GDPR and CCPA, any company which serves users in the EU or users in California must allow its users to request that their data is either deleted or suppressed.

  • Deletion all identifying info related to the user must be properly deleted.

  • Suppression the user should be able to specify where their data is used and sent (e.g. for a marketing, advertising, or product use case).

When you get a deletion request, it doesn’t just mean deleting a few rows of data in your database. It’s your responsibility to purge data about your users from all of your tools – email, advertising, and push notifications.

Typically, this process is incredibly time-consuming. We have seen companies create custom JIRA workflows, in-depth checklists, and other manual work to comply with the law. 

In this article we’ll show you how to automate and easily respect user privacy by:

  • Managing consent with our open source consent manager.

  • Issuing DSAR (Data Subject Access Requests) on behalf of your users.

  • Federating those requests to downstream tools.

Let's dive in.

Step 1: Set up a Javascript source and identify calls

If you haven’t already, you’ll want to be sure you have a source data setup on your website, and collecting your user data through Segment.

The easiest way to do this is via our Javascript, and analytics.identify calls.

// when a user first logs in, identify them with name and email analytics.identify('my-user-id', { email: 'jkim@email.com', firstName: 'Jane', lastName: 'Kim' })

Generally, we recommend you first:

  • Generate user ID in your database a user ID should never change! It’s best to generate these in your database, so they can stay constant even if a user changes their email address. We’ll handle anonymous IDs automatically.

  • Collect the traits you have you don’t have to worry about collecting all traits with every call. We’ll automatically merge them for you, so just collect what you have.

  • Start with messaging if you’re trying to come up with a list of traits to collect, start with email personalization. Most customers start by collecting email, first and last name, age, phone, role, and company info so they can send personalized emails or push notifications.

Once you’ve collected data, you’re ready to start your compliance efforts.

Step 2: Enable the open-source consent manager

Giving users the ability to control what personal data is collected is a huge part of any privacy compliance regime. 

We’ve built an open source drop-in consent manager that automatically works with Analytics.js.

Adding it in is straightforward.

Updating the snippet

First, you’ll want to remove the two lines from your analytics.js snippet.

analytics.load("<Your Write Key") // <-- delete meanalytics.page() // <-- delete me

These will automatically be called by the consent manager.

Add in your config

We’ve included some boilerplate configuration, which dictates when the consent manager is shown and what the text looks like. You’ll want to add this somewhere and customize it to your liking.

You’ll also want to add a target container for the manager to load. <div id="target-container"></div>

You can and should also customize this to your liking.

Load the consent manager

Finally, we’re ready to load the consent manager.

<script  src="https://unpkg.com/@segment/consent-manager@5.0.0/standalone/consent-manager.js"  defer></script>

Once you’re done, it should look like this.

Great, now we can let users manage their preferences! They can opt-in to all data collection, or just the portion they want to. 

Step 3: Collecting deletion requests

Now it’s time to allow users to delete their data. The simplest way to do this is to start an Airtable sheet to keep track of user requests, and then create a form from it.

At a minimum, you’ll want to have columns for:

  • The user identifier – either an email or user ID.

  • A confirmation if your page is public (making sure the user was authenticated).

  • A checkbox indicating that the deletion was submitted.

From there, we can automatically turn it into an Airtable form to collect this data.

To automate this you can use our GDPR Deletion APIs. You can automatically script these so that you don’t need to worry about public form submissions. We’ve done this internally at Segment. 

Tip: Make sure deletions are guarded by some sort of confirmation step, or only accessible when the user is logged in.

Step 4: Issuing deletions and receipts

Now we’re ready to put it all together. We can issue deletion requests within Segment for individual users.

This will remove user records from:

  • Segment archives

  • Your warehouses and data lakes

  • Downstream destinations that support deletion

To do so, simply go to the deletion manager under Workspace Settings > End User Privacy.

This will allow you to make a new request by ID.

Simply select “New Request”, and enter the user ID from your database.

This will automatically kick off deletions in any end tools which support them. You’ll see receipts in Segment indicating that these deletions went through.

As your different destinations begin processing this data, they will send you notifications as well.

And just like that, we’ve built deletion and suppression into our pipeline, all with minimal work!

Wrapping up

Here’s what we’ve accomplished in this article. We’ve:

  • Collected our user data thoughtfully and responsibly by asking for consent with the Segment open source consent manager.

  • Accepted deletion requests via Airtable or the Segment deletion API.

  • Automated that deletion in downstream tools with the deletion requests.

Try this recipe for yourself...

Get help implementing this use case by talking with a Segment Team member or by signing up for a free Segment workspace here.

All Engineering articles

Peter Reinhardt on October 5th 2016

A few months ago we were at a conference in Half Moon Bay talking with a general manager at a large mobile company, and he said, “One of my projects for the next six months is to reduce SDK bloat in all our apps.” Six months is a substantial investment! So we asked why this mattered so much to him.

He gave three reasons:

  1. They’re up against the 100MB app size cellular download limit, and they need to find ways to reduce the size of their app so that they don’t take a hit on installs.

  2. Occasionally their SDK vendors have bugs that crash their app, giving them bad reviews and lowering installs.

  3. Often their SDK vendors are unprepared for major iOS version upgrades, blocking key releases and big announcements.

While it makes intuitive sense that a large app download size would demotivate people and reduce installs, we wanted to dig deeper into these reasons and really assess the quantitative impact. Are we sure app size reduces installs? How big is the effect?

So, over the last few months we’ve done some extensive research into these issues… and in this article we’re going to share some new experimental results on how app size affects install rate.

How to gain 147MB in just 17 days…

To measure the impact of increased app size, we needed to buy a small app, with no active marketing activities but significant steady downloads. Then we needed to increase the app’s size, leaving everything else constant, and observe the impact on app install rate. To the best of our abilities, this would simulate the impact of SDK bloat, or anything else that just makes an app bulky.

So, we bought the Mortgage Calculator Free iOS app through some friends in the YC Founders network. It was a minuscule 3MB, had a steady pattern of organic installs (~50 installs per day for several years), and had no active marketing activities.

Then, without making any additional changes, we bloated the app from 3MB to 99MB, 123 MB and then finally 150MB, observing the isolated impact on install rate with each change in app size. In the real world, app sizes can increase substantially with the addition of seemingly simple things, like an SDK, an explainer video, a bunch of fonts, or a beautiful background picture for your loading screen. For the purposes of our experiment, we just bloated the app with a ton of hidden album art from our engineering team’s favorite artist.

To quantitatively measure the impact of each successive bloating, we looked at data provided directly by Apple in iTunes Analytics; specifically conversion from “Product Page Views” to “App Units” (or colloquially “installs”/”install rate”).

…and lose 66% of your installs

With the larger app sizes, we saw substantial losses in product page to app install rate. In particular, there was a substantial drop around the cellular download limit (~100MB), above which Apple does not let users download the app over 3G or 4G. (Note: the conversion rate can be greater than 100% due to installs direct from search results that skip product page view.)

From these results we estimate a linear decrease in install conversion rate below the cellular download limit from 3–99MB at 0.45% per MB. Above the cellular download limit we estimate a linear decrease in install rate of 0.32% per MB. To our best estimate, the gap between the two lines is covered by a 10% instantaneous install rate drop across the cellular download limit. (Although Apple says the cellular download limit is 100MB, we found in practice that a 101MB IPA did not trigger the cellular download block… the actual limit was somewhere between 101MB and 123MB and varied depending on the exact build.)

Increasing the size of our app from 3MB to 99MB reduced installs by 43%, and the increase to 150MB reduced installs by 66% in total. For mobile companies striving for growth, an increase in app size is extremely costly.How to destroy an app

In an attempt to be proper growth scientists, we tried to replicate the experiment by returning the app to its original 3MB size (plus other intermediate sizes) and re-measuring the install rate. Unfortunately, as a result of our earlier bloating, the app attracted several critical ratings and reviews, which stick around forever:

In our measurements, the app’s growth appears to be semi-permanently damaged and we only saw a minor rebound to 59% install conversion rate for the 16 days the 3MB version was available. (As a side note you can see these customers cite 140MB and 181MB as the download sizes… the true download size varies depending on the customer’s device and OS version.)

What makes an app big

As an example, we randomly chose to inspect the NBC Sports app that’s been popular and featured on the App Store during the Olympic games. The app is 90.5MB in total. By the far the biggest part of app’s size is images, accounting for 51% of the entire app’s size, with code (23%), fonts (16%) and video (9%) accounting for most of the rest.

Images were both in the raw app package and in the asset catalog (hidden away inside assets.car). There are huge numbers of local station & team logos, startup screens, and soccer field layouts.

Within the “code” category (which is a single encrypted file) are around a dozen different SDKs, a handful of which we could detect and estimate for size contribution. SDKs contributed roughly 3.5MB to the total app size, with an estimated impact of -1.54% install rate.

Overall, there’s an incredible amount of low-hanging fruit in optimizing an app’s size. One particularly honorable mention is the (presumably accidental) inclusion of the Adobe VideoHeartbeat SDK docs (955 kb) in the production IPA package downloaded through iTunes.

The impact of SDKs

While most SDKs are small individually, the net impact of installing many SDKs in your app is a significant increase to the app’s size, and therefore a meaningful negative impact on install rate, not to mention engineering time & maintenance. SDK vendors can be unprepared for new iOS releases, which can block key releases and announcements. Or, SDKs can occasionally introduce bugs, and even a single crash can mean a would-be user will never use your app again.

Therefore, it makes sense to limit the number of SDKs you bundle into your app in order to optimize for performance. But you still need to make sure you’re tracking and collecting key lifecycle events. Segment built an easy, lightweight solution called the Native Mobile Spec. The Native Mobile Spec automatically collects events that allow you measure top mobile metrics without any tracking code.

Building your best possible app

Now that we conclusively know how to destroy an app, we also know how to improve one. Here are a few key steps that you can take to make sure your app is ready for holiday season:

  1. Use our mobile app size calculator to find out the exact size of your app.

  2. Review our guide for getting your app ready for launch, for a complete set of tips and advice on how to make your app great.

  3. Read about how Segment can help mobile teams.

A version of this blog post originally appeared on Recode.

Andy Jiang on July 28th 2016

At Segment, we’re working hard to make our mobile SDKs the best possible collection options for your analytics data. An SDK can make your data more durable, minimize data transfer, and optimize your app’s battery usage. In this article, we’ll take you through what happens under the hood as a piece of data flows through our iOS and Android SDKs from a button handler in your app all the way through to our API.

The Initialization

When you initialize Segment’s iOS and Android library, it will automatically start tracking a few app lifecycle events that are part of our Native Mobile Spec.

  • Application Installed — User installs your application for the first time.

  • Application Opened — User opens your application.

  • Application Updated — User updates to a newer version of your app.

The automatic lifecycle tracking allows you to focus on building your app and thinking less about your analytics tracking plan.

Now that we’ve initialized the library, we’re ready for user interaction.

The Tap

Let’s take a look at a fictitious mobile app, Bluth’s Banana Stand, and see what happens with our SDKs when the user completes an order.

Listening on the purchase button handler, you’ll fire off a track call:

It’s safe to invoke this call directly from your button handler, as our SDKs automatically dispatch it to a background queue thread.

These queues represent various threads on the mobile device. The .track() event is moved to the background queue so the UI queue can continue to operate for the user.

Inside the Queue

Once dispatched to the background thread, the event is immediately written to the device’s disk to maximize the chances of deliverability. In a world where power can be lost at any time, this method maximizes the durability of your data. (You can read about why we chose QueueFile for reliable request batching on Android).

The Road to Segment’s API

On the way out of the queue and onto Segment’s tracking API, Segment’s SDKs use two different strategies to optimize your app’s bandwidth and power usage:

  1. Batching — Combining multiple events into each outbound request allows us to minimize the amount of times we power up the wireless hardware

  2. Compression — Compressing each request with gzip which allows us to drastically reduce the amount of bandwidth used

Before the request is sent outbound, the event is now batched together with other events before being sent to our servers. Our SDKs minimize battery use with auto-batching, since powering up the radio on every call can waste battery life. Auto-batching allows our SDKs to send events in batches, without powering up the radio on every call.

Visual representation of batched events, with the green circle representing a .track() event.

We batch our requests to minimize the number of times we access the underlying hardware. This allows us to send fewer requests, spaced further apart, and therefore allow wireless hardware to remain offline for majority of the time.

Next, we’ll compress the outbound request body with gzip, reducing the total size drastically.

In our testing, we sent one thousand .track() calls without the SDK (directly to our API) and with the SDK. The amount of data on the wire without the SDK was 1.1mb and with the SDK is 63kb — roughly 17x saving in bandwidth. We’re able to accomplish this via intelligent batching and aggressive data “middle-out compression” technologies.

These are the bandwidth impacts realized with our SDKs’ batching and compression.

The battery story is even better—we’re able to reduce the wasted energy by almost 3x from 56% overhead to 20% overhead. The net result of this is low average energy impact on the app and much more efficient battery usage—and happier users!

These are the battery enhancements our SDKs enable. (Xcode).

Finally, the request leaves the mobile device and makes its way to our servers.

Trace Completed

By tracing the SDKs functionality from tap to API, we’re able to see how a few simple data transfer strategies can significantly strengthen your mobile analytics stack:

  1. Increase durability — Immediately persist every message to a disk-backed queue, preventing you from losing data in the event of battery or network loss.

  2. Auto-Retries — If the network is spotty or not available, our SDKs will retry transferring the batch until the request is successful. This drastically improves data deliverability.

  3. Reduce Network Usage — Each batch is gzip compressed, decreasing the amount of bytes on the wire by 10x-20x.

  4. Save Battery — Because of data batching and compression, Segment’s SDKs reduce energy overhead by 2-3x which means longer battery life for your app’s users.

These features would not have been possible without the help of our incredibly supportive open source community. With 80+ contributors, 550+ pull requests, 240+ releases, our iOS and Android libraries just kept improving. Thanks to their work, more than 3,000 apps now collect customer data using Segment’s mobile SDKs every day.

We can’t wait to keep improving the mobile analytics ecosystem alongside you. If you have any ideas, we’re all ears! Send a pull request or email us at friends@segment.com. 😃

Want to learn about and grow your mobile users? Sign up today.

Interested in building infrastructural components for mobile apps and SDKs? We’re hiring.

Calvin French-Owen on June 23rd 2016

As part of our push to open up what’s going on internally at Segment – we’d like to share how we run our CI builds. Most of our approaches follow standard practices, but we wanted to share a few tips and tricks we use to speed up our build pipeline.

Powering all of our builds are CircleCIGithub, and Docker Hub. Whenever there’s a push to Github, the repository triggers a build on CircleCI. If that build is a tagged release, and passes the tests, we build an image for that container.

The image is then pushed to Docker Hub, and is ready to be deployed to our production infrastructure.

CircleCI and Travis CI

Before going any further, I’d like to talk about the elephant in the room: Travis CI. Pretty much every discussion of CI tools has some debate around using Travis CI vs CircleCI. Both are sleek, hosted, and responsive. And both are incredibly easy to use.

Honestly, we love both tools. We use Travis CI for a lot of our open source libraries, while CircleCI powers many of our private repos. Both products work extremely well and do a fantastic job meeting our needs.

However, there’s one feature we really love about CircleCI: SSH access.

Most of the time, we don’t have any problems configuring our test environments. But when we do, the ability to SSH into the container running your code is invaluable.

To put this in perspective, Segment runs our entire infrastructure with hundreds of different microservices. Each one is pulled from a different repo, and runs with a few different dependencies via docker-compose (more on that later).

Most of our CI is relatively standard, but occasionally setting up a service with a fresh environment requires some custom work. It’s in a new repo, and will require it’s own set of dependencies and build steps. And that’s where being able to run commands from within the environment you’re testing against is so handy – you can tweak configuration right on the box.

No more hundreds of “fixing CI” commits!


To work with all of these different repos, we wanted to make it trivially easy to setup a repo so that it has CI enabled. We have three different circle commands we use regularly, which are shared amongst our common dotfiles. First, there’s circle() which sets up all the right environment variables and automatically enabled our slack notifications.

Additionally, we have a circle.open() command which automatically opens the test results right from your CLI in your browser

And finally there’s the circle.badge() command to automatically add to a badge to a repo

Shared Scripts

Now given the fact that we have hundreds of repos, we have the task of keeping all the testing scripts and repositories in-sync when we make changes to our circle.yml files.

Maintaining the same behavior across a few hundred repos is annoying, but we’ve decided we’d rather trade abstraction problems (hard to solve) for investing more heavily in tooling (generally easier).

For that, we use a common set of scripts inside a shared git repo. The scripts are pulled down every time the test runs, and handle the shared packaging and deployments. Each service’s circle.yml file looks something like this:

It means that if we change our deploy scheme, we only have to update the code in one place rather than updating each individual repo’s circle.yml. We can then reference different scripts depending on what sorts of builds we need within the individual service repo.

Docker Containers

Finally, the entire build process wouldn’t be possible without Docker containers. Containers have greatly simplified how we push code to production, test against our internal services, and develop locally.

When testing our services, we make use of docker-compose.yml files to run our tests. That way, a given service can actually test against the exact same images in CI as are running in production. It reduces the need for mocks or stubs.

What’s more, when the images are built by CI–we can pull those same images down and run them locally as well.

To actually build that code and push it to production, CircleCI will first run the tests, then check whether the build is a tagged release. For any tagged releases, we have CircleCI build the container via a Dockerfile, then tag it and push the deploy to Docker Hub.

Instead of using latest everywhere, we explicitly deploy the tagged image to Docker Hub, along with the major version (1.x) and minor version (1.2.x).

This way, we’re able to specify rollbacks to a specific version when we need it – or deploy the latest build of a certain release branch if we don’t need a specific version (useful for local development and in docker-compose files).

The code to do this is relatively straightforward, first we detect the versions:

And then we build, tag, and push our docker images:

Once our images are pushed to Docker Hub, we’re guaranteed to have the right version of the code built so that we can deploy it to production and run it inside ECS.

Thanks to containers, our CI pipeline gives us much better confidence when deploying our microservices to production.

Coming Full Circle

So there you have it: our CI build pipeline, heavily powered by Github, CircleCI, and Docker.

While we’re constantly trying to find ways to make the entire pipeline a bit more seamless, we’ve been happy with the low maintenance, parallelization and isolation provided by using third-party tools.

On that note, if you’re managing a large number of repos, we’d love to hear about your own techniques for managing your build pipeline. Drop us a note by email (friends@segment) or on Twitter!

P.S. We’ll be at CircleCI’s office hours next Thursday (June 30) in San Francisco. Please join us for a special talk on Google’s AMP project!

Calvin French-Owen on June 15th 2016

AWS is the default for running production infrastructure. It’s cheap, scalable, and flexible to whatever configuration you’d like to run on top of it. But that flexibility comes with a cost: it makes AWS endlessly configurable.

You can build whatever you want on top of AWS, but that means it’s difficult to know whether you’re taking the right approach. Pretty much every startup we talk with has the same question: What’s the right way to setup our infrastructure?

To help solve that problem, we’re excited to open source the Segment AWS Stack. It’s our first pass at building a collection of Terraform modules for creating production-ready architecture on AWS. It’s largely based on the service architecture we use internally to process billions of messages every month, but built solely on AWS.

The steps are incredibly simple. Add 5 lines of Terraform, run terraform apply, and you’ll have your base infrastructure up and running in just three minutes.

It’s like a mini-Heroku that you host yourself. No magic, just AWS.

Batteries Included

Our major goals with Stack are:

  • to provide a good set of defaults for production infrastructure

  • make the AWS setup process incredibly simple

  • allow users to easily customize the core abstractions and run their own infrastructure

To achieve those goals, Stack is built with Hashicorp’s Terraform.

Terraform provides a means of configuring infrastructure as code. You write code that represents things like EC2 instances, S3 buckets, and more–and then use Terraform to create them.

Terraform manages the state of your infrastructure internally by building a dependency graph of which resources depend on one another:

and then applies only the “diff” of changes to your production environment. Terraform makes changing your infrastructure incredibly seamless because it already knows which resources have to be re-created and which can remain untouched.

Terraform provides easy-to-use, high level abstractions for provisioning cloud infrastructure, but also exposes the low-level AWS resources for custom configuration. This low-level access provides a marvelous “escape hatch” for truly custom needs.

To give you a flavor of what the setup process looks like, run terraform applyagainst this basic configuration:

It will automatically create all of the following:

Networking: Stack includes a new VPC, with public and private subnets. All routing tables, Internet Gateways, NAT Gateways, and basic security groups are automatically provisioned.

Auto-scaling default cluster: Stack ships with an auto-scaling group and basic lifecycle rules to automatically add new instances to the default cluster as they are needed.

ECS configuration: in Stack, all services run atop ECS. Simply create a new service, and the auto-scaling default cluster will automatically pick it up. Each instance ships with Docker and the latest ecs-agent.

CloudWatch logging & metrics: Stack sends all container logs to CloudWatch. Because all requests between services go through ELBs, metrics around latency and status codes are automatically collected as well.

Bastion: Stack also includes a bastion host for manual SSH access to your cluster. Besides the public services, it’s the only instance exposed to the outside world and acts as the “jump point” for manual access.

This basic setup uses the stack module as a unit, but Terraform can also reference the components of Stack individually.

That means that you can reference any of the internal modules that the stack uses, while continuing to use your own custom networking and instance configuration.

Want to only create Stack services, but bring your own VPC? Just source the service module and pass in your existing VPC ID. Don’t need a bastion and want custom security groups? Source only the vpc and cluster modules to set up only the default networking.

You’re free to take the pieces you want and leave the rest.

If you’d like to dig into more about how this works in-depth, and each of the modules individually, check out the Architecture section of the Readme.

Now, let’s walkthrough how to provision a new app and add our internal services.


Note: this walkthrough assumes you have an AWS account and Terraform installed. If not, first get the pre-requisites from the requirements section.

For this tutorial, we’ll reference the pieces of the demo app we’ve built: Pingdummy, a web-based uptime monitoring system.

All of the Docker images we use in this example are public, so you can try them yourself!

The Pingdummy infrastructure runs a few different services to demonstrate how services can be deployed and integrated using Stack.

  • the pingdummy-frontend is the main webpage users hit to register and create healthchecks. It uses the web-service module to run as a service that is publicly accessible to the internet.

  • the pingdummy-beacon is an internal service which makes requests to other third-party services, and responds with their status. It uses the servicemodule, and is not internet facing. (though here it’s used for example purposes, this service could eventually be run in many regions for HA requests)

  • the pingdummy-worker is a worker which periodically sends requests to the pingdummy-beacon service. It uses the worker module as it only needs a service definition, not a load balancer.

  • an RDS instance used for persistence

First, you’ll want to add a Terraform file to define all of the pieces of your infrastructure on AWS. Start by creating a terraform.tf file in your project directory.

Then, copy the basic stack setup to it:

And then use the Terraform CLI to actually apply the infrastructure:

This will create all the basic pieces of infrastructure we described in the first section.

Note: for managing Terraform’s remote state with more than a single user, we recommend configuring the remote state to use Terraform Enterprise or S3. You can use our pingdummy repo’s Makefile as an example.

Now we’ll add RDS as our persistence layer. We can pull the rds module from Stack, and then reference the outputs of the base networking and security groups we’ve already created. Terraform will automatically interpolate these and set up a dependency graph to re-create the resources if they change.

Again, we’ll need to run plan and apply again to create the new resources:

And presto! Our VPC now has an RDS cluster to use for persistence, managed by Terraform.

Now that we have our persistence and base layers setup, it’s time to add the services that run the Pingdummy app.

We can start with the internal beacon service for our health-checks. This service will listens on port 3001 and makes outbound HTTP requests to third-parties to check if a given URL is responding properly.

We’ll need to use the service module which creates an internal service that sits behind an ELB. That ELB will be automatically addressable at beacon.stack.local,and ECS will automatically add the service containers to the ELB once they pass the health check.

Next, we’ll add the pingdummy-worker service. It is responsible for making requests to our internal beacon service.

As you can see, we’ve used the worker module since this program doesn’t need a load balancer or DNS name. We can also pass custom configuration to it via environment variables or command line flags. In this case, it’s passed the address of the beacon service.

Finally, we can add our pingdummy-frontend web app which will be Internet-accessible. This will use the web-service module so that the ELB can serve requests from the public subnet.

In order to make the frontend work, we need a few extra pieces of configuration beyond just what the base web-service module provides.

We’ll first need to add an SSL certificate that’s been uploaded to AWS. Sadly, there’s no terraform configuration for doing this (it requires a manual step), but you can find instructions in the AWS docs.

From there, we can tell our module that we’d like it to be accessible on the public subnets and security groups and be externally facing. The stack module creates these all individually, so we can merely pass them in and we’ll be off to the races.

Finally, run the plan and apply commands one more time:

And we’re done! Just like that, we have a multi-AZ microservice architecture running on vanilla AWS.

Looking in the AWS console, you should see logs streaming into CloudWatch from our brand new services. And whenever a request is made to the service, you should see HTTP metrics on each of the service ELBs.

To deploy new versions of these services, simply change the versions in the Terraform configuration, then re-apply. New task definitions will be created and the appropriate containers will be cycled with zero downtime.

There’s a few other pieces you’ll need to add, which you can see examples for in the main Pingdummy terraform file. Keep in mind that the example is a dummy app, and is not how we’d recommend doing things like security groups or configuration in production. We’ll have more on that in terraform to come :).

One More Thing…

Additionally, we’re excited to open source a few other pieces that were involved in releasing the Stack:

Amir Abu Shareb created terraform-docs, a command-line tool to automatically generate documentation for Terraform modules. You can think of it as the godoc of the Terraform world, automatically extracting inputs, outputs, and module usage in an easily consumable format.

We use terraform-docs to build all of the module reference documentation for Stack.

Achille Roussel created ecs-logs, an agent for sending logs from journald to CloudWatch. It provides all the built-in logging for Stack, and makes sure to create a log group for each service and a single log stream per container.

Go Forth, and Stack

It’s our hope that this post gave you a brief look at the raw power of what can be achieved with the AWS APIs these days. The ease of Terraform paired with the flexibility and scale of AWS is an extremely powerful combination.

Stack is a “first pass” of what combining these technologies can achieve. It’s by no means finished, and only provides the foundation for many of the ideas that we’ve put into production. Additionally, we’re trying some new experiments around log drivers and instances (reflected by the 0.1 tag) which we think will pay off in the future.

Nonetheless, we’ve open sourced Stack today as the first step to gather as much community wisdom around running infrastructure atop AWS.

In that vein, we’ll happily accept pull requests for new modules that fall within the spirit of the project. It’s our goal to provide the community with a good set of Terraform modules that provide sane defaults and simpler abstractions on top of the raw AWS infrastructure.

So go ahead and try out the Stack today, and please let us know what you think!

Part of the Segment infrastructure team hacking on The Segment Stack: Amir Abu Shareb, Rick Branson, Calvin French-Owen, Kevin Lo, and Achille Roussel. Open sourced at a team-offsite in Amsterdam.

Prateek Srivastava on June 14th 2016

Segment’s mobile SDKs are designed to track behavioral data from your app and translate and route that data to hundreds of downstream integrations. One of the SDK’s core tasks is to upload behavioral data to our servers. Since every network request requires your app to power up the device’s radio, uploading this data in real-time can quickly drain a battery.

To minimize the impact of the SDK on battery life, we queue these behavioral events and upload them in batches periodically. This results in 3x less battery drain over an implementation without batching.

Our Android queuing system is built on QueueFile, which was developed by Square. In this article we’re going to run though our queueing requirements, some of traditional solutions, and explain in detail why we chose to build on Queue File.

Our Queueing Requirements

Queues are a deceptively simple concept, but there are a two main considerations that make it complicated in practice: durability and atomicity.

  1. Durability - An element that has been added to the queue will survive permanently. An in-memory queue is easy to implement, but it lacks durability — events are queued until the process dies, and lost thereafter.

  2. Atomicity - An element is either added to the queue or isn’t; the queue won’t ever be in an partial/invalid state. For our purposes, we needed to save events to disk as quickly and reliably as possible, and then worry about uploading them later.

We were looking for a solution that would guarantee these contracts would be be honored even in the face of process deaths or system crashes. Such scenarios are inevitable on mobile, e.g. the user could lose battery power in the middle of an operation, or the operating system could kill your application to reclaim memory.

Traditional Solutions

File: The most obvious way to persist data to disk is to use a plain old Java file. However, writing to a file is not atomic (in most cases), and it’s easy to run into cases of a corrupted file. A trivial implementation would also keep the entire queue in memory, which is not ideal for devices that might go offline for a long time.

AtomicFile: AtomicFile (also available from the support library) is a simple helper that can guarantee atomic writes on a regular file. It guarantees atomicity by creating a backup of the original file before writing any changes, and waiting until the write is completely written to disk before deleting the backup. As long as the backup file exists, the original file is considered to be invalid. However, AtomicFile requires the caller to maintain track of the corrupted state and restore itself from one.

SharedPreferences: The simplest way to persist data onto disk is by using SharedPreferences from the Android framework. Although designed to store small key-value pairs, it can certainly be used to store any data in a pinch. The SharedPreferences class is a high level wrapper around its own implementation of an atomic file and an in-memory cache. Writes are committed to a map in memory first, and then saved to disk. However, writing to SharedPreferences is not durable. SharedPreferences silently swallows any disk write error, leaving the memory cache and disk out of sync, with callers left guessing about the result of an operation.

SQLite: The typical solution to designing a reliable disk queue on Android is to layer it on top a SQLite database. SQLite works great for large datasets that need multithreaded access, but being a fully featured database, it is designed with optimizations for advanced querying and not for the purposes of a queue. SQLite is complex with many moving parts, and this made us wary of relying on it to back such a critical section of our code.

Why We Chose QueueFile

Although all of the above solutions could have been coerced to work for us, none of them were designed specifically to be used as queues. Luckily, Square had also run into this problem for accepting payments, and they built QueueFile to accomplish this. QueueFile guarantees that all operations are atomic, writes are durable, and is designed to survive process and even system-level crashes.

It was a solution tailor-made for our use case. Adding and removing elements from QueueFile is ordered first in first out (FIFO) and takes constant time, both of which are a nice bonus. QueueFile’s tiny size also makes it perfect for us to embed in our SDK.

Core Concepts of QueueFile

There are a few guarantees that the filesystem provides:

  • renaming a file is an atomic operation

  • fsync is durable

  • segment writes are atomic

QueueFile is particularly clever about the way it stores and updates data — Bob Lee has an excellent talk on the subject. QueueFile consists of a 16 byte file header, and a series of items called elements, as a circular buffer. The grey area (not to scale) represents empty space in the file.

The file header consists of four 4-byte integers that represent the length of the file, the number of elements in the file, and a pointer to the location of the first and last elements. Since the length of the file header is only 16 bytes (smaller than the size of a segment), changes to the file header are atomic. QueueFile relies on this by making modifications to the file visible only when the header is committed as well.

The components of a file header.

Each element itself is comprised of a 4-byte element header that stores the length of the element, and the element data (variable length) itself.

The components of an element.

Adding Data

Consider adding an element to the QueueFile below.

QueueFile first writes (and fsyncs) the element and its length. Notice that the file header remains unchanged. If writing the second element fails, the QueueFile is still left in a valid state since the header hasn’t been updated (even though the new data may be on disk). Upon restart, the QueueFile would still report that queue contains only a single element.

When the write operation completes successfully, the header is committed (and fsynced) as well. If updating the header fails, then the QueueFile remains the same as above, and the change is aborted. Otherwise, we’ve successfully added our data!

Calling fsync after every write prevents the filesystem from reordering our writes, and makes the transaction durable. This ensures that the QueueFile is never left in an invalid or corrupted state.

Removing Data

Removing and clearing do the reverse. Let’s start from the result of our previous operation.

QueueFile writes (and fsyncs) the header first. If committing the header fails, then the change is simply aborted and the QueueFile remains the same as above. When the header is written successfully, the removal is committed, and now has a size of 1 (even though the data is on disk).

QueueFile goes one step further and zeroes out the removed data, which leaves us just with the element we added previously.

In Conclusion

QueueFile has been a key part of the queueing system we’ve built which powers data collection for over 1,400 Android apps running on more than 200 million devices. While the incredibly small size of the library helps prevent SDK bloat, its simplicity does also create a few limitations. For instance, QueueFile is limited to a size of 1GB, and it cannot be used by multiple processes concurrently. Neither were deal breakers for us — we don’t want your queued analytics data using too much disk space and we can create separate queues for different processes — but are good to be aware of.

We’re working on porting this approach over to iOS and have some ideas to expand QueueFile to work on a broader range of filesystems like iOS and desktop apps. If you’re interested in helping us build infrastructural components like this for mobile apps and SDKs, we’re hiring.

Stephen Mathieson, Calvin French-Owen on May 26th 2016

For the past year, we’ve been heavy users of Amazon’s EC2 Container Service (ECS). It’s given us an easy way to run and deploy thousands of containers across our infrastructure.

ECS gives us free Cloudwatch metrics, automatic healthchecks, and scheduling across hundreds of nodes. It’s radically simplified how we deploy code to production. In short, it’s pretty awesome.

But ECS had one major shortcoming: navigating the AWS Console is a massive pain. It’s hard to find which containers are running and what version of an image is deployed.

That’s why we built Specs

Specs gives you a high-level window into your ECS clusters and services.

It provides a handy search box to find particular services in production, and then helps diagnose what the service’s exact state. Specs lets us know if a container isn’t able to be placed due to insufficient capacity or if all containers in a service are dead.

It gives the entire development team a better picture of how our production environment is configured.

We run Specs behind an internal Google oAuth server, so any engineer can view it even if they don’t have an IAM account. It’s allowed a lot more of the team to debug issues without bothering the core infra team.

Getting started with Specs couldn’t be easier. Assuming you’re running Docker and have your AWS credentials exported, just run the following command:

You’ll get convenient dashboards with zero configuration.

If you’d like to help contribute, please check out the github repo. We’re excited to add features like log tailing and a real-time events feed to make working with ECS a truly seamless experience.

Lipei Wang on May 23rd 2016

Python, one of the most popular scripting languages, is also one of the most preferred tools for data analysis and visualization. In addition to the broader Python developer community, there is also a significant group that uses Python to analyze data, draw actionable insights, and make decisions.

With its extensive collection of helper libraries and platforms, Python is a great tool for quick, iterative data exploration. Python’s set of libraries includes everything from visualization to statistical analysis, making it convenient for its users to jump into the data and begin identifying patterns.

Together with the ability to iterate quickly in data and statistical analysis, there are great open source tools on managing data pipelines and workflows. A growing community of analysts are finding new ways of using Python to crunch numbers and understand their data.

We had a chance to catch up with the Chief Analyst at Mode, Benn Stancil, and ask him about the importance of Python, how to use it in day-to-day analysis, and to tell us about some key features of Mode’s new product, Mode Python Notebooks. Mode Python Notebooks is a hosted solution that allows analysts to use Python for exploratory analysis.

So Benn, you’re a huge SQL guy. We love your writing on finding A-ha momentsand discovering growth engines with SQL. What makes you interested in using Python?

I’m interested in Python for the same reasons I like SQL: It gives me the power and flexibility to answer any question. The community is great and adoption is on the rise.

There are many easy to use Python libraries to make data exploration convenient and immediate. This allows for iterative data analysis. With Python, you can really chase your curiosities down the rabbit hole.

Lastly, Python’s utility and flexibility allows it to be used for a variety of tasks within the data science stack. For example, Luigi and Airflow both allow for managing data pipelines and workflows in Python. By completing exploratory analysis in Python, there can be times where the work carries over into production.

What are the most popular use cases for SQL as opposed to Python?

SQL is designed to query and extract data from a database. It’s a necessary first step to get the data into a usable format. For instance, SQL allows you to easily join several data sets to create a table that you can explore further.

SQL isn’t really designed for manipulating or transforming data in certain ways. Higher level data manipulation that is common with data science, such as statistical analysis, regressions, trend lines, and working with time series data, isn’t easy in SQL.

Despite these limitations, because SQL is necessary for extracting data, it’s still commonly used for complex operations. The query below, which calculates quantiles for different series in a data, is something I’ve used versions of many times.

When would Python come in?

Python has a ton of libraries (e.g. Pandas, StatsModel, and SciPy) that are designed for statistical and mathematical analysis. The libraries also do a great job of abstracting away the details so that you don’t need to calculate all the underlying math by hand. Moreover, you can get your results immediately, so you can use Python iteratively to explore your data.

Rather than saying “I want to do a regression analysis” and sitting down for half an hour figuring out where to begin in SQL, the Python libraries make it so that you can just run the analysis, see the results, and continue exploring the path your curiosity takes you down. With Python, there is not much lag between inspiration and action. With SQL, on the other hand, I often think twice before going down a path that may or may not be fruitful.

For example, I’d only write the query above if I really knew I wanted to present the quantiles of that dataset. Because the entire thing can be accomplished with the one line of Python below, I’d do it much earlier in my analytical process–and may discover something I wasn’t looking for as a result.

Another way to think about the difference between Python and SQL is that Python allows you to start with one large table, from which you branch off different analyses in different directions. One avenue of inspiration can bring you to another avenue and to another avenue. The speed and flexibility of analysis makes it easy to go down many exploratory paths.

Those kinds of analyses sound very different. Why combine SQL and Python in one place?

Because SQL and Python each have individual strengths and weaknesses. Tying the languages together gives analysts the best of both worlds.

First, SQL is needed to build the data set into a final table that has all of the necessary attributes. Then, from this large data set, you can use Python to spin off deeper analysis.

What would it take for a SQL analyst to learn Python?

Like many skills, the best way to learn how to use Python for analysis is by diving in to work on a problem you’re interested in, care about, and with which you’re somewhat familiar.

When you work on something that you’re interested in, you tend to go deeper. You uncover something in the core problem that piques your interest, and you want to learn more by analyzing the data set in a different way. You begin asking more and more questions. This curiosity can push you further than you would go otherwise, and a is a source of a lot of real learning.

You also should work on data that you’re somewhat familiar with, so you know when you do something wrong. You’ll have better instincts about what’s going on and what to expect. Compare this to when you’re working on data about which you know nothing—like flower petal sizes (an unusually popular data set found in many Python examples). If your analysis concludes that “all of these flowers have two centimeter-long petals” and you have no idea whether that is reasonable, you may just assume it’s right and move on.

Mode is also releasing a new Python tutorial that aims to help SQL users learn how and where to integrate Python into their workflow. In addition, the Python tutorials provide problems that are familiar to those in business settings, instead of academic problems.

How would you expect learning Python to help benefit analysts in their job and their careers?

Learning Python definitely can augment an analyst’s skill set.

An analyst needs to communicate the business value through data. One part of the job is to find the insights from the data, but the more effective job is also to include the right context and narrative around the insights that can compel your teammates towards action. And since using data and analytics to make decisions is becoming more important in the workplace, the role of the analyst to deliver comprehensive analysis is more important than ever.

Is it easy to get started with Python? What kind of setup and tooling do you need?

Until recently, getting started with Python for analysis requires installing a few things—Python, several main statistics and data analysis libraries, and Notebooksoftware that’ll run the analysis locally on your computer. Then, you’d have to run the Notebook, starting a local server to execute your Python commands.

The results generated from your commands on Notebooks will exist on your desktop. In order to share it, you basically download a Python Notebook HTML file (which has a mix of code and results) and send that around. In order to easily parse the results from the HTML, your colleagues would have to open that file in a browser. And at that point, unless they have Python set up too, the code isn’t re-executable.

When you run Python locally, you’re limited to the power of your computer, you have to leave your computer open to run it, and running scripts can slow other things down. By running it remotely, you can run it from any machine and you can run something, close your computer and walk away, and still have your results waiting when you get back.

With Mode Python Notebooks, all of that setup and hosting is taken care of for you. If you’re using Mode to query an existing database, there’s a tab for a Python Notebook. You can open it up and the query results table will be automatically populated. And after generating plots, time series analysis, summary statistics, etc., in the Notebook, it’s easy to curate the results into an instantly shareable report anyone can re-run. Moreover, you don’t have to run it on your local machine, which saves your computer’s processing power and memory. Finally, setting it all up is as easy as setting up a database connection.

Thanks for your time, Benn!

If you’re interested in learning how to run data analysis in Python, check out Mode’s new Python tutorial. And, if you’re a Segment customer looking to run Python scripts on your event and cloud app data check out Mode Python Notebooks.

Calvin French-Owen on May 16th 2016

Since Segment’s first launch in 2012, we’ve used queues everywhere. Our API queues messages immediately. Our workers communicate by consuming from one queue and then publishing to another. It’s given us a ton of leeway when it comes to dealing with sudden batches of events or ensuring fault tolerance between services.

We first started out with RabbitMQ, and a single Rabbit instance handled all of our pubsub. Rabbit had lot of nice tooling around message delivery, but it turned out to be really tough to cluster and scale as we grew. It didn’t help that our client library was a bit of a mess, and frequently dropped messages (anytime you have to do a seven-way handshake for a protocol that’s not TLS… maybe re-think the tech you’re using).

So in January 2014, we started the search for a new queue. We evaluated a few different systems: DarnerRedisKestrel, and Kafka (more on that later). Each queue had different delivery guarantees, but none of them seemed both scalable and operationally simple. And that’s when NSQ entered the picture… and it’s worked like a charm.

As of today, we’ve pushed 750 billion messages through NSQ, and we’re adding around 150,000 more every second.

Core Concepts

Before discussing how NSQ works in practice, it’s worth understanding how the queue is architected. The design is so simple, it can be understood with only a few core concepts:

topics - a topic is the logical key where a program publishes messages. Topics are created when programs first publish to them.

channels - channels group related consumers and load balance between them–channels are the “queues” in a sense. Every time a publisher sends a message to a topic, that message is copied into each channel that consumes from it. Consumers will read messages from a particular channel and actually create the channel on the first subscription. Channels will queue messages (first in memory, and spill over to disk) if no consumers are reading from them.

messages - messages form the backbone of our data flow. Consumers can choose to finish messages, indicating they were processed normally, or requeue them to be delivered later. Each message contains a count for the number of delivery attempts. Clients should discard messages which pass a certain threshold of deliveries or handle them out of band.

NSQ also runs two programs during operation:

nsqd - the nsqd daemon is the core part of NSQ. It’s a standalone binary that listens for incoming messages on a single port. Each nsqd node operates independently and doesn’t share any state. When a node boots up, it registers with a set of nsqlookupd nodes and broadcasts which topics and channels are stored on the node.

Clients can publish or read from the nsqd daemon. Typically publishers will publish to a single, local nsqd. Consumers read remotely from the connected set of nsqd nodes with that topic. If you don’t care about adding more nodes dynamically, you can run nsqds standalone.

nsqlookupd – the nsqlookupd servers work like consul or etcd, only without coordination or strong consistency (by design). Each one acts as an ephemeral datastore that individual nsqd nodes register to. Consumers connect to these nodes to determine which nsqd nodes to read from.

The Message Lifecycle

Let’s walk through a more concrete example of how this works in practice.

NSQ recommends co-locating publishers with their corresponding nsqd instances. That means even in the face of a network partition, messages are stored locally until they are read by a consumer. What’s more, publishers don’t need to discover other nsqd nodes–they can always publish to the local instance.

First, a publisher sends a message to its local nsqd. To do this, it first opens up a connection, and then sends a PUB command with the topic and message body. In this case, we publish our messages to the events topic to be fanned out to our different workers.

The events topic will copy the message and queue it in each of the channels linked to the topic. In our case, there are three channels, one of them being the archives channel. Consumers will take these messages and upload them to S3.

Messages in each of the channels will queue until a worker consumes them. If the queue goes over the in-memory limit, messages will be written to disk.

The nsqd nodes will first broadcast their location to the nsqlookupds. Once they are registered, workers will discover all the nsqd nodes with the events topic from the lookup servers.

Then each worker subscribes to each nsqd host, indicating that it’s ready to receive messages. We don’t need a fully connected graph here, but we do need to ensure that individual nsqd instances have enough consumers to drain their messages (or else the channels will queue up).

Separate from the client library, here’s an example of what our message handling code might look like:

If the third party fails for any reason, we can handle the failure. In this snippet, we have three behaviors:

  • discard the message if it’s passed a certain number of delivery attempts

  • finish the message if it’s been processed successfully

  • requeue the message to be delivered later if an error occurs

As you can see, the queue behavior is both simple and explicit.

In our example, it’s easy to reason that we’ll tolerate MAX_DELIVERY_ATTEMPTS * BACKOFF_TIME minutes of failure from our integration before discarding messages.

At Segment, we keep statsd counters for message attempts, discards, requeues, and finishes to ensure that we’re achieving a good QoS. We’ll alert for services any time the number of discards exceeds the thresholds we’ve set.

In Practice

In production, we run nsqd daemons on pretty much all of our instances, co-located with the publishers that write to them. There are a few reasons NSQ works so well in practice:

Simple protocol - this isn’t a huge issue if you already have a good client library for your queue. But, it can make a massive difference if your existing client libraries are buggy or outdated.

NSQ has a fast binary protocol that is easy to implement with just a few days of work. We built our own pure-JS node driver, (at the time, only a coffeescript driver existed) which has been stable and reliable.

Easy to run - NSQ doesn’t have complicated watermark settings or JVM-level configuration. Instead, you can configure the number of in-memory messages and the max message size. If a queue fills up past this point, the messages will spill over to disk.

Distributed - because NSQ doesn’t share information between individual daemons, it’s built for distributed operation from the beginning. Individual machines can go up and down without affecting the rest of the system. Publishers can publish locally, even in the face of network partitions.

This ‘distributed-first’ design means that NSQ can essentially scale foreverNeed more throughput? Add more nsqds!

The only shared state is kept in the lookup nodes, and even those don’t require a “global view of the universe.” It’s trivial to set up configurations where certain nsqds register with certain lookups. The only key is that the consumers can query to get the complete set.

Clear failure cases - NSQ sets up a clear set of trade-offs around components which might fail, and what that means for both deliverability and recovery.

I’m a firm believer in the principle of least surprise, particularly when it comes to distributed systems. Systems fail, we get it. But systems that fail in unexpected ways are impossible to build upon. You end up ignoring most failure cases because you can’t even begin to account for them.

While they might not be as strict of guarantees as a system like Kafka can provide, the simplicity of operating NSQ makes the failure conditions exceedingly apparent.

UNIX-y tooling - NSQ is a good general purpose tool. So it’s not surprising that the included utilities are multi-purpose and composable.

In addition to the TCP protocol, NSQ provides an HTTP interface for simple cURL-able maintenance operations. It ships with binaries for piping from the CLItailing a queuepiping from one queue to another, and HTTP pubsub.

There’s even an admin dashboard for monitoring and pausing queues (including the sweet animated counter above!) that wraps the HTTP API.

What’s missing?

As I mentioned, that simplicity doesn’t come without trade-offs:

No replication - unlike other queues, NSQ doesn’t provide any sort of replication or clustering. This is part of what makes running it so simple, but it does force some hard guarantees on reliability for published messages.

We partially get around this by lowering the file sync time (configurable via a flag) and backing our queues with EBS. But there’s still the possibility that a queue could indicate it’s published a message and then die immediately, effectively losing the write.

Basic message routing - with NSQ, topics and channels are all you get. There’s no concept of routing or affinity based upon on key. It’s something that we’d love to support for various use cases, whether it’s to filter individual messages, or route certain ones conditionally. Instead we end up building routing workers, which sit in-between queues and act as a smarter pass-through filter.

No strict ordering - though Kafka is structured as an ordered log, NSQ is not. Messages could come through at any time, in any order. For our use case, this is generally okay as all the data is timestamped, but doesn’t fit cases which require strict ordering.

No de-duplication - Aphyr has talked extensively in his posts about the dangers of timeout-based systems. NSQ also falls into this trap by using a heartbeat mechanism to detect whether consumers are alive or dead. We’ve previously written about various reasons that would cause our workers to fail heartbeat checks, so there has to be a separate step in the workers to ensure idempotency.

Simplicity Works

As you can see, the underlying motif behind all of these benefits is simplicity. NSQ is a simple queue, which means that it’s easy to reason about and easy to spot bugs. Consumers can handle their own failure cases with confidence about how the rest of the system will behave.

In fact, simplicity was one of the primary reasons we decided to adopt NSQ in the first place (along with many of our other software choices). The payoff has been a queueing layer that has performed flawlessly even as we’ve increased throughput by several orders of magnitude.

Today, we face a more complex future. More and more of our workers require a stricter set of reliability and ordering guarantees than NSQ can easily provide.

We plan to start swapping out NSQ for Kafka in those pieces of the infrastructure and get much better at running the JVM in production. There’s a definite trade-off with Kafka; we’ll have to shoulder a lot more operational complexity ourselves. On the other hand, having a replicated, ordered log would provide much better behavior for many of our services.

But for any workers which fit NSQ’s trade-offs, it’s served us amazingly well. We’re looking forward to continuing to build on its rock-solid foundation.

Thanks to:

TJ Holowaychuk for creating nsq.js and helping drive its adoption at Segment.

Jehiah Czebotar and Matt Reiferson for building NSQ and reading drafts of this article.

Julian Gruber, Garrett Johnson, and Amir Abu Shareb for maintaining nsq.js.

Tejas Manohar, Vince Prignano, Steven Miller, Tido Carriero, Andy Jiang, Achille Roussel, Peter Reinhardt, Stephen Mathieson, Brent Summers, Nathan Houle, Garrett Johnson, and Amir Abu Shareb for giving feedback on this article.

Calvin French-Owen on March 15th 2016

I recently jumped back into frontend development for the first time in months, and I was immediately struck by one thing: everything had changed.

When I was more active in the frontend community, the changes seemed minor. We’d occasionally make switches in packaging (RequireJS → Browserify), or frameworks (Backbone → Components). And sometimes we’d take advantage of new Node/v8 features. But for the most part, the updates were all incremental.

It’s only after taking a step back and then getting back into it that I realized how much Javascript fatigue there really is.

Years ago, a friend and I discussed choosing the ‘right’ module system for his company. At the time, he was leaning towards going with RequireJS–and I urged him to look at Browserify or Component (having just abandoned RequireJS ourselves).

We talked again last night, and he said that he’d chosen RequireJS. By now, his company had built a massive codebase around it –- “I guess we bet on the wrong horse there.”

But that got me thinking, with all the Javascript churn there’s really no right horse to bet on. You either have to continually change your tech stack to keep up with the times, or be content with the horse you’ve got. And as a developer who’s trying to build things for people, it’s difficult to know when it makes sense to invest in new tech.

Like it or not, Javascript is evolving at a faster rate than any widespread language in the history of computing.

Most of the tools we use today didn’t even really exist a year ago: React, JSX, Flux, Redux, ES6, Babel, etc. Even setting up a ‘modern’ project requires installing a swath of dependencies and build tools that are all new. No other language does anything remotely resembling that kind of thing. It’s enough to even warrant a “State of the Art” post so everyone knows what to use.

So, why does Javascript change so much__?

A long strange journey

To answer that question, let’s take a step back through history. The exact relationship between ECMAScript and Javascript is a bit complex, but it sets the stage for a lot of today’s changes.

As it turns out, there are a ton of interesting dynamics at play: corporate self-interest, cries for open standards, Mozilla pushing language development, lagging IE releases that dominate the market, and tools that pave over a lot of the fragmentation.

The Browser Wars

Javascript first appeared as part of Netscape Navigator in March, 1996. The creation story is now the stuff of legend, where Brendan Eich was tasked with creating a language to run in the browser as a one-man-dev-team. And in just ten days, he delivered the first working version of Javascript for Navigator.

To make sure they weren’t missing the boat, Microsoft then implemented their own version of Javascript, dubbed JScript to avoid trademark issues (they would continue calling the language “JScript” until 2009). It was a reverse-engineered version of Javascript, designed to be compatible with Netscape’s implementation.

After the initial release in November of 1996, Eich pushed Javascript to be designed as a formal standard and spec with the ECMA commmittee, so that it would gain even broader adoption. The Netscape team submitted it for official review in 1997.

Despite the submission to become an ECMA defined language, Javascript continued iterating as part of its own versions, adding new features and extensions to the language.

As Eich moved from Netscape to Mozilla, the Firefox teams owned Javascript development, consistently releasing additional specs for developers willing to work with experimental features. Many of those features have since made into later versions of ECMAScript-driving mainstream adoption.


ECMAScript has a weird and fascinating history of its own, largely broken up into different ‘major’ editions.

ECMA is an international standards organization (originally the European Computer Manufacturers Association), and they handle a number of specs, including C#, JSON, and Dart.

When choosing a name for this new browser language, ECMA had to decide between Javascript, Jscript, or an entirely new name altogether. So they took the diplomatic route, and adopted the name ECMAScript as the ‘least common denominator’ between the two.

The first edition of ECMA-262 (262 is the id number given to the language spec) was then released in June 1997, to bridge the gap between Javascript and Jscript and establish a plan for standardizing updates going forward. It was designed to be compatible with Javascript 1.3, which had added callapply, and variousDate functions.

ES2 was released one year later in June 1998, but the changes were purely cosmetic. They had been added to comply with a corresponding ISO version of the spec without changing the specification for the language itself, so it’s typically omitted from compatibility tables.

ES3, released in December 1999, marked the first major additions to the language. It added support for regexes, .bind, and try/catch error handling. ES3 became the “standard” for what the majority of browsers would support, underscored by the popularity of older IE browsers. Even in February 2012 (13 years after the ES3 release!!), ES3 IE browsers still had over 20% of the browser market.

Then work began on ES4, a version that was designed to revolutionize Javascript. Ideas for classes, generators, modules, and many of the features that ended up in ES6 were introduced in what was dubbed the ‘harmony’ spec. The committee first published a draft of the spec in October 2007, aiming to have a final release the following year. It was supposed to be the modernized “Javascript 2.0” that would reduce a lot of the cruft of the first implementation.

But just as the spec was nearing completion, trouble struck! On the one hand Microsoft’s IE platform architect Chris Wilson, argued that the changes would effectively “break the web”. He stated that the ES4 changes were so backwards incompatible that they would significantly hurt adoption.

On the other hand, Brendan Eich, now the Mozilla CTO, argued for the changes. In an open letter to Chris Wilson, he objected to the fact that Microsoft was just now withdrawing support for a spec which had been in the works for years.

After a lot of internal friction, the committee finally set aside its differences. They actually abandoned ES4 altogether and skipped straight to creating ES5 as an extension the language features found in ES3! They added specifications for many of the features already added to Javascript 1.x by Mozilla and planned to release later that year. There was no ES4 release.

After meeting in Oslo, Brendan Eich sent a message to es-discuss (still the primary point for language development) outlining a plan for a near-term incremental release, along with a bigger reform to the language spec that would be known as ‘harmony’.. ES5 was finalized and published in December 2009.

With ES5 finally out the door, the committee started moving full-speed ahead on the next set of language extensions which they’d hoped to start incoporating nearly a decade earlier.

By now, browser-makers and committee members alike had seen the danger of adding a giant release without testing the features. So drafts of ES6 have been published frequently since 2011, and available behind flags (--harmony) in Node and many of the major browsers. Additionally transpilers had become a part of the modern build chain (more on that later), allowing developers to use the bleeding-edge of the spec, without worrying about breaking old browsers. It was eventually released in June 2015.

So we have:

  • Javascript versions driven by Netscape and then Mozilla (championed by Eich)

  • ECMAScript versions driven by the planning committee to standardize the language

  • JScript versions released by Microsoft which effectively mimicked Javascript and added its own extensions

With experiments and language additions to each dialect pushing the ES standard ahead.


Of course, we can’t talk just about the full history of JS without mentioning the platform it’s running on: browsers.


Javascript is the only language in history that has so many dominant runtimes, all fighting for their piece of the market share. And it makes total sense, given how ‘bundled’ browsers are. If you build an alternative Python runtime, it’s hard to profit unless you add some sort of proprietary add-ons. But with browsers, the sky is the limit: tracking usage statistics, the ability to set default search engines, email defaults, and even advertising.

And the Javascript arms race certainly wasn’t helped by the collapse of the ES4 spec, or IE6-8’s longstanding vicegrip on the market. Though most of the 2000’s, Javascript runtimes remained non-standard and stagnant.


But an interesting thing happened when Chrome first emerged onto the scene in December, 2008. While Chrome had numerous features that made it technically interesting (tabs as processes, a new fast js runtime, and others), perhaps the thing that made it most revolutionary was its release schedule.

Chrome shipped from day-1 with auto-updates, first tested from a widespread number of early adopters using the dev and beta channels. If you wanted to use the bleeding edge, you could easily do so. Nowadays, updates to the stable channel happen every six weeks, with thousands of users testing and automatically reporting bugs they encounter on the dev and beta channels.

It meant that Chrome was able to ship updates far more frequently than its competitors. Where the other browsers would tell users to update every 6-12 months, Chrome updated the three channels weekly, monthly, and released updates on the stable channel every six weeks.

This put enormous pressure on the other browser makers to follow suit, causing a rapid increase to iterate and release new language features. Gone are the days of the not-to-spec IE Javascript implementations. As the ES6 drafts started to finalize, browsers have steadily been adding those features and fixing bugs with their implementations.

Most of the features are still not “safe” to use and hidden behind flags, but the fact that browsers will continue to auto-update means that the velocity of new JS features as increased dramatically in the last few years.

Builders and Shims

Imagine now that you’re a developer just learning JS, and knowing none of that history. It’d be pretty much impossible to keep track of which features are supported in what browsers, what’s to spec and what’s not, and what parts of the language should even be used.

There’s an entire book on just the ‘good parts’ of Javascript, borne of it’s convoluted past. And before MDN, the predominant resources for getting an intro to Javascript just barely scratched the surface.

Realizing this was a problem, John Resig released the first version of JQuery in 2006. It was a drop-in library designed to pave over a lot of the inconsistencies between the different browsers. And for the most part, developers became comfortable with always bundling jQuery into their projects. It became the de facto library for DOM manipulation, AJAX requests, and more.

But working with many different scripts was tedious. They had to be loaded in a certain order, or else bundle duplicated dependencies many times over. Individual libraries would overwrite the global scope, causing conflicts and monkey-patching default implementations.

So James Burke created RequireJS in 2009, spawned out of his work with Dojo. He created a module loader designed to specify dependencies, and load them asynchonously in the browser. It was one of the first frameworks that actually introduced the idea of a “module system” for isolating pieces of code and loading them asynchronously–dubbed AMD (Asynchronous Module Definition).

While AMD was easy to load on the fly, it was also verbose. It essentially added dependency injection for every single library you’d want to load:

The simplicity had the benefit that the browser could load libraries on-the-fly, but it was a lot of overhead for the programmer to really understand.

In 2009, Ryan Dahl started working on a project to run Javascript as a server. He liked the simplicity of working in a single-threaded environment, and leveraged the non-blocking I/O of v8. That project became Node.js.

Since Node ran on the server, making requests for additional scripts was cheap. The scripts were cached, so it was as easy as reading and parsing an additional file.

Instead of using AMD modules, Node popularized the CommonJS format (the competing module format at the time), using synchronous require statements for loading dependencies. It mirrored the same way that other languages worked, grabbing dependencies synchronously and then caching them. Isaac Schuetler built npm around it as the de facto way to manage and install dependencies in Node. Soon, everyone was writing node scripts that looked like this:

Yet the frontend still lagged. Developers wanted to require and manage Javascript dependencies with the same sort of first class CommonJS support we were getting in Node. The TC39 committee heard these complaints, but a fully-baked module system would take time to develop.

So a new set of tools appeared on the frontend developer’s toolchain. Bower for pure dependency management, Browserify and Component/Duo for actually building scripts and bundling them together. Webpack promised a new build system that also handled CSS, and Grunt and Gulp for actually orchestrating them all and making them play nicely together.


Builders didn’t quite take it far enough on their own. Once people were comfortable with the idea of adding convenience shims and libraries, they started exploring compiling other languages down to Javascript.

After all, there are hundreds of programming languages that can run on the server, but there’s only one widely-supported language for the browser.

Coffeescript was the first major player, created by Jeremy Ashkenas in 2009. Prompted by the surge of programmers building Ruby on Rails apps, Coffeescript provided a way to program in Javascript that looked like Ruby. It compiled down to Javascript, and using our handy build tools, could remove many of the gotchas around vanilla js.

But it didn’t end there.

Once transpilers became a standard a part of the build chain, we had opened the Pandora’s box of what could be compiled to Javascript.

We started with existing languages that compiled to JS. Google’s Inbox was built with GWT (Google Web Toolkit)–a toolkit that compiled Java down to Javascript. Clojurescript compiled Clojure to Javascript. Emscripten took C/C++ code and compiled it down to a much more efficent browser runtime using asm.js.

And now, there’s even entirely separate languages–built with compilation to Javascript as a first-class citizen. Dart and Typescript both allow users to run code in the browser both natively and compiled to JS. Elm was a new take on Javascript designed to compile down both visual elements and execute Javascript code.

Perhaps most interesting, transpilers provided a way to use tomorrow’s ES versions, today! Using tools like BabelGnode, and Regenerator, we could transform our ES6 features for the existing browser ecosystem. JSX allowed us to write HTML in a way that combined DOM composition and Javascript a bit more naturally–with all the complexity hidden behind a straightforward transpiler.

Nearly 10 years after Chris Wilson first thought that ES4 would break the web–Babel and other transpilers have acted as the “missing link” to provide backwards compatibility. Babel has worked it’s way into the standard Javascript toolchain, and now most builders include it as a first-class plugin.

Betting on the Right Horse

The history of the language should indicate that, above all, one thing is clear: there’s an enormously high rate of change in the Javascript world.

I didn’t even go into changes in frameworks (EmberBackboneAngularReact) or how we structure asynchronous programming (Callbacks, Promises, Iterators, Async/Await). But those have all churned relatively regularly over the years.

The most interesting thing about having a language where everyone is comfortable with build tools, transpilers, and syntax additions is that the language can advance at an astonishing rate. Really the only thing holding back new language features is consensus. As quickly as people agree and implement the spec, there’s code to support it.

It makes me wonder whether there will ever be another language like Javascript. It’s such a unique environment, where the implementations in the wild lag so far behind the spec, that it just invites new specs as quickly as the community can think of them. Some of them even become first-class citizens, and are adopted by browsers.

Javascript development doesn’t show any signs of stopping either. And why would it? There’s never been a time with more tooling and first-class JS support.

I predict we’ll continue to see the cycle of rapid iteration for the next few years at least. Companies will be forced to update their codebase, or be happy with the horse they have.

Become a data expert.

Get the latest articles on all things data, product, and growth delivered straight to your inbox.