Benjamin Yolken, Julien Fabre on October 29th 2020
Lauren Venell on October 11th 2016
When it comes to your app, size makes a difference. Bigger apps have fewer downloads, worse reviews, and a harder time penetrating the international market. We measured the exact impact of increased app size, shown below. We’ve also included learnings on how to prevent bloat in your own app.
Peter Reinhardt on October 5th 2016
A few months ago we were at a conference in Half Moon Bay talking with a general manager at a large mobile company, and he said, “One of my projects for the next six months is to reduce SDK bloat in all our apps.” Six months is a substantial investment! So we asked why this mattered so much to him.
He gave three reasons:
They’re up against the 100MB app size cellular download limit, and they need to find ways to reduce the size of their app so that they don’t take a hit on installs.
Occasionally their SDK vendors have bugs that crash their app, giving them bad reviews and lowering installs.
Often their SDK vendors are unprepared for major iOS version upgrades, blocking key releases and big announcements.
While it makes intuitive sense that a large app download size would demotivate people and reduce installs, we wanted to dig deeper into these reasons and really assess the quantitative impact. Are we sure app size reduces installs? How big is the effect?
So, over the last few months we’ve done some extensive research into these issues… and in this article we’re going to share some new experimental results on how app size affects install rate.
To measure the impact of increased app size, we needed to buy a small app, with no active marketing activities but significant steady downloads. Then we needed to increase the app’s size, leaving everything else constant, and observe the impact on app install rate. To the best of our abilities, this would simulate the impact of SDK bloat, or anything else that just makes an app bulky.
So, we bought the Mortgage Calculator Free iOS app through some friends in the YC Founders network. It was a minuscule 3MB, had a steady pattern of organic installs (~50 installs per day for several years), and had no active marketing activities.
Then, without making any additional changes, we bloated the app from 3MB to 99MB, 123 MB and then finally 150MB, observing the isolated impact on install rate with each change in app size. In the real world, app sizes can increase substantially with the addition of seemingly simple things, like an SDK, an explainer video, a bunch of fonts, or a beautiful background picture for your loading screen. For the purposes of our experiment, we just bloated the app with a ton of hidden album art from our engineering team’s favorite artist.
To quantitatively measure the impact of each successive bloating, we looked at data provided directly by Apple in iTunes Analytics; specifically conversion from “Product Page Views” to “App Units” (or colloquially “installs”/”install rate”).
With the larger app sizes, we saw substantial losses in product page to app install rate. In particular, there was a substantial drop around the cellular download limit (~100MB), above which Apple does not let users download the app over 3G or 4G. (Note: the conversion rate can be greater than 100% due to installs direct from search results that skip product page view.)
From these results we estimate a linear decrease in install conversion rate below the cellular download limit from 3–99MB at 0.45% per MB. Above the cellular download limit we estimate a linear decrease in install rate of 0.32% per MB. To our best estimate, the gap between the two lines is covered by a 10% instantaneous install rate drop across the cellular download limit. (Although Apple says the cellular download limit is 100MB, we found in practice that a 101MB IPA did not trigger the cellular download block… the actual limit was somewhere between 101MB and 123MB and varied depending on the exact build.)
Increasing the size of our app from 3MB to 99MB reduced installs by 43%, and the increase to 150MB reduced installs by 66% in total. For mobile companies striving for growth, an increase in app size is extremely costly.How to destroy an app
In an attempt to be proper growth scientists, we tried to replicate the experiment by returning the app to its original 3MB size (plus other intermediate sizes) and re-measuring the install rate. Unfortunately, as a result of our earlier bloating, the app attracted several critical ratings and reviews, which stick around forever:
In our measurements, the app’s growth appears to be semi-permanently damaged and we only saw a minor rebound to 59% install conversion rate for the 16 days the 3MB version was available. (As a side note you can see these customers cite 140MB and 181MB as the download sizes… the true download size varies depending on the customer’s device and OS version.)
As an example, we randomly chose to inspect the NBC Sports app that’s been popular and featured on the App Store during the Olympic games. The app is 90.5MB in total. By the far the biggest part of app’s size is images, accounting for 51% of the entire app’s size, with code (23%), fonts (16%) and video (9%) accounting for most of the rest.
Images were both in the raw app package and in the asset catalog (hidden away inside assets.car). There are huge numbers of local station & team logos, startup screens, and soccer field layouts.
Within the “code” category (which is a single encrypted file) are around a dozen different SDKs, a handful of which we could detect and estimate for size contribution. SDKs contributed roughly 3.5MB to the total app size, with an estimated impact of -1.54% install rate.
Overall, there’s an incredible amount of low-hanging fruit in optimizing an app’s size. One particularly honorable mention is the (presumably accidental) inclusion of the Adobe VideoHeartbeat SDK docs (955 kb) in the production IPA package downloaded through iTunes.
While most SDKs are small individually, the net impact of installing many SDKs in your app is a significant increase to the app’s size, and therefore a meaningful negative impact on install rate, not to mention engineering time & maintenance. SDK vendors can be unprepared for new iOS releases, which can block key releases and announcements. Or, SDKs can occasionally introduce bugs, and even a single crash can mean a would-be user will never use your app again.
Therefore, it makes sense to limit the number of SDKs you bundle into your app in order to optimize for performance. But you still need to make sure you’re tracking and collecting key lifecycle events. Segment built an easy, lightweight solution called the Native Mobile Spec. The Native Mobile Spec automatically collects events that allow you measure top mobile metrics without any tracking code.
Now that we conclusively know how to destroy an app, we also know how to improve one. Here are a few key steps that you can take to make sure your app is ready for holiday season:
Use a mobile app size calculator to find out the exact size of your app.
Review our guide for getting your app ready for launch, for a complete set of tips and advice on how to make your app great.
Read about how Segment can help mobile teams.
A version of this blog post originally appeared on Recode.
Andy Jiang on July 28th 2016
At Segment, we’re working hard to make our mobile SDKs the best possible collection options for your analytics data. An SDK can make your data more durable, minimize data transfer, and optimize your app’s battery usage. In this article, we’ll take you through what happens under the hood as a piece of data flows through our iOS and Android SDKs from a button handler in your app all the way through to our API.
When you initialize Segment’s iOS and Android library, it will automatically start tracking a few app lifecycle events that are part of our Native Mobile Spec.
Application Installed — User installs your application for the first time.
Application Opened — User opens your application.
Application Updated — User updates to a newer version of your app.
The automatic lifecycle tracking allows you to focus on building your app and thinking less about your analytics tracking plan.
Now that we’ve initialized the library, we’re ready for user interaction.
Let’s take a look at a fictitious mobile app, Bluth’s Banana Stand, and see what happens with our SDKs when the user completes an order.
Listening on the purchase button handler, you’ll fire off a
It’s safe to invoke this call directly from your button handler, as our SDKs automatically dispatch it to a background queue thread.
These queues represent various threads on the mobile device. The .track() event is moved to the background queue so the UI queue can continue to operate for the user.
Once dispatched to the background thread, the event is immediately written to the device’s disk to maximize the chances of deliverability. In a world where power can be lost at any time, this method maximizes the durability of your data. (You can read about why we chose QueueFile for reliable request batching on Android).
On the way out of the queue and onto Segment’s tracking API, Segment’s SDKs use two different strategies to optimize your app’s bandwidth and power usage:
Batching — Combining multiple events into each outbound request allows us to minimize the amount of times we power up the wireless hardware
Compression — Compressing each request with gzip which allows us to drastically reduce the amount of bandwidth used
Before the request is sent outbound, the event is now batched together with other events before being sent to our servers. Our SDKs minimize battery use with auto-batching, since powering up the radio on every call can waste battery life. Auto-batching allows our SDKs to send events in batches, without powering up the radio on every call.
Visual representation of batched events, with the green circle representing a
We batch our requests to minimize the number of times we access the underlying hardware. This allows us to send fewer requests, spaced further apart, and therefore allow wireless hardware to remain offline for majority of the time.
Next, we’ll compress the outbound request body with gzip, reducing the total size drastically.
In our testing, we sent one thousand
.track() calls without the SDK (directly to our API) and with the SDK. The amount of data on the wire without the SDK was 1.1mb and with the SDK is 63kb — roughly 17x saving in bandwidth. We’re able to accomplish this via intelligent batching and aggressive data “middle-out compression” technologies.
These are the bandwidth impacts realized with our SDKs’ batching and compression.
The battery story is even better—we’re able to reduce the wasted energy by almost 3x from 56% overhead to 20% overhead. The net result of this is low average energy impact on the app and much more efficient battery usage—and happier users!
These are the battery enhancements our SDKs enable. (Xcode).
Finally, the request leaves the mobile device and makes its way to our servers.
By tracing the SDKs functionality from tap to API, we’re able to see how a few simple data transfer strategies can significantly strengthen your mobile analytics stack:
Increase durability — Immediately persist every message to a disk-backed queue, preventing you from losing data in the event of battery or network loss.
Auto-Retries — If the network is spotty or not available, our SDKs will retry transferring the batch until the request is successful. This drastically improves data deliverability.
Reduce Network Usage — Each batch is gzip compressed, decreasing the amount of bytes on the wire by 10x-20x.
Save Battery — Because of data batching and compression, Segment’s SDKs reduce energy overhead by 2-3x which means longer battery life for your app’s users.
These features would not have been possible without the help of our incredibly supportive open source community. With 80+ contributors, 550+ pull requests, 240+ releases, our iOS and Android libraries just kept improving. Thanks to their work, more than 3,000 apps now collect customer data using Segment’s mobile SDKs every day.
We can’t wait to keep improving the mobile analytics ecosystem alongside you. If you have any ideas, we’re all ears! Send a pull request or email us at firstname.lastname@example.org. 😃
Want to learn about and grow your mobile users? Sign up today.
Interested in building infrastructural components for mobile apps and SDKs? We’re hiring.
Calvin French-Owen on June 23rd 2016
As part of our push to open up what’s going on internally at Segment – we’d like to share how we run our CI builds. Most of our approaches follow standard practices, but we wanted to share a few tips and tricks we use to speed up our build pipeline.
Powering all of our builds are CircleCI, Github, and Docker Hub. Whenever there’s a push to Github, the repository triggers a build on CircleCI. If that build is a tagged release, and passes the tests, we build an image for that container.
The image is then pushed to Docker Hub, and is ready to be deployed to our production infrastructure.
Before going any further, I’d like to talk about the elephant in the room: Travis CI. Pretty much every discussion of CI tools has some debate around using Travis CI vs CircleCI. Both are sleek, hosted, and responsive. And both are incredibly easy to use.
Honestly, we love both tools. We use Travis CI for a lot of our open source libraries, while CircleCI powers many of our private repos. Both products work extremely well and do a fantastic job meeting our needs.
However, there’s one feature we really love about CircleCI: SSH access.
Most of the time, we don’t have any problems configuring our test environments. But when we do, the ability to SSH into the container running your code is invaluable.
To put this in perspective, Segment runs our entire infrastructure with hundreds of different microservices. Each one is pulled from a different repo, and runs with a few different dependencies via docker-compose (more on that later).
Most of our CI is relatively standard, but occasionally setting up a service with a fresh environment requires some custom work. It’s in a new repo, and will require it’s own set of dependencies and build steps. And that’s where being able to run commands from within the environment you’re testing against is so handy – you can tweak configuration right on the box.
No more hundreds of “fixing CI” commits!
To work with all of these different repos, we wanted to make it trivially easy to setup a repo so that it has CI enabled. We have three different circle commands we use regularly, which are shared amongst our common dotfiles. First, there’s circle() which sets up all the right environment variables and automatically enabled our slack notifications.
Additionally, we have a circle.open() command which automatically opens the test results right from your CLI in your browser
And finally there’s the circle.badge() command to automatically add to a badge to a repo
Now given the fact that we have hundreds of repos, we have the task of keeping all the testing scripts and repositories in-sync when we make changes to our circle.yml files.
Maintaining the same behavior across a few hundred repos is annoying, but we’ve decided we’d rather trade abstraction problems (hard to solve) for investing more heavily in tooling (generally easier).
For that, we use a common set of scripts inside a shared git repo. The scripts are pulled down every time the test runs, and handle the shared packaging and deployments. Each service’s circle.yml file looks something like this:
It means that if we change our deploy scheme, we only have to update the code in one place rather than updating each individual repo’s circle.yml. We can then reference different scripts depending on what sorts of builds we need within the individual service repo.
Finally, the entire build process wouldn’t be possible without Docker containers. Containers have greatly simplified how we push code to production, test against our internal services, and develop locally.
When testing our services, we make use of docker-compose.yml files to run our tests. That way, a given service can actually test against the exact same images in CI as are running in production. It reduces the need for mocks or stubs.
What’s more, when the images are built by CI–we can pull those same images down and run them locally as well.
To actually build that code and push it to production, CircleCI will first run the tests, then check whether the build is a tagged release. For any tagged releases, we have CircleCI build the container via a Dockerfile, then tag it and push the deploy to Docker Hub.
Instead of using latest everywhere, we explicitly deploy the tagged image to Docker Hub, along with the major version (1.x) and minor version (1.2.x).
This way, we’re able to specify rollbacks to a specific version when we need it – or deploy the latest build of a certain release branch if we don’t need a specific version (useful for local development and in docker-compose files).
The code to do this is relatively straightforward, first we detect the versions:
And then we build, tag, and push our docker images:
Once our images are pushed to Docker Hub, we’re guaranteed to have the right version of the code built so that we can deploy it to production and run it inside ECS.
Thanks to containers, our CI pipeline gives us much better confidence when deploying our microservices to production.
So there you have it: our CI build pipeline, heavily powered by Github, CircleCI, and Docker.
While we’re constantly trying to find ways to make the entire pipeline a bit more seamless, we’ve been happy with the low maintenance, parallelization and isolation provided by using third-party tools.
On that note, if you’re managing a large number of repos, we’d love to hear about your own techniques for managing your build pipeline. Drop us a note by email (friends@segment) or on Twitter!
Calvin French-Owen on June 15th 2016
AWS is the default for running production infrastructure. It’s cheap, scalable, and flexible to whatever configuration you’d like to run on top of it. But that flexibility comes with a cost: it makes AWS endlessly configurable.
You can build whatever you want on top of AWS, but that means it’s difficult to know whether you’re taking the right approach. Pretty much every startup we talk with has the same question: “What’s the right way to setup our infrastructure?”
To help solve that problem, we’re excited to open source the Segment AWS Stack. It’s our first pass at building a collection of Terraform modules for creating production-ready architecture on AWS. It’s largely based on the service architecture we use internally to process billions of messages every month, but built solely on AWS.
The steps are incredibly simple. Add 5 lines of Terraform, run terraform apply, and you’ll have your base infrastructure up and running in just three minutes.
It’s like a mini-Heroku that you host yourself. No magic, just AWS.
Our major goals with Stack are:
to provide a good set of defaults for production infrastructure
make the AWS setup process incredibly simple
allow users to easily customize the core abstractions and run their own infrastructure
To achieve those goals, Stack is built with Hashicorp’s Terraform.
Terraform provides a means of configuring infrastructure as code. You write code that represents things like EC2 instances, S3 buckets, and more–and then use Terraform to create them.
Terraform manages the state of your infrastructure internally by building a dependency graph of which resources depend on one another:
and then applies only the “diff” of changes to your production environment. Terraform makes changing your infrastructure incredibly seamless because it already knows which resources have to be re-created and which can remain untouched.
Terraform provides easy-to-use, high level abstractions for provisioning cloud infrastructure, but also exposes the low-level AWS resources for custom configuration. This low-level access provides a marvelous “escape hatch” for truly custom needs.
To give you a flavor of what the setup process looks like, run
terraform applyagainst this basic configuration:
It will automatically create all of the following:
Networking: Stack includes a new VPC, with public and private subnets. All routing tables, Internet Gateways, NAT Gateways, and basic security groups are automatically provisioned.
Auto-scaling default cluster: Stack ships with an auto-scaling group and basic lifecycle rules to automatically add new instances to the default cluster as they are needed.
ECS configuration: in Stack, all services run atop ECS. Simply create a new service, and the auto-scaling default cluster will automatically pick it up. Each instance ships with Docker and the latest ecs-agent.
CloudWatch logging & metrics: Stack sends all container logs to CloudWatch. Because all requests between services go through ELBs, metrics around latency and status codes are automatically collected as well.
Bastion: Stack also includes a bastion host for manual SSH access to your cluster. Besides the public services, it’s the only instance exposed to the outside world and acts as the “jump point” for manual access.
This basic setup uses the
stack module as a unit, but Terraform can also reference the components of Stack individually.
That means that you can reference any of the internal modules that the stack uses, while continuing to use your own custom networking and instance configuration.
Want to only create Stack services, but bring your own VPC? Just source the service module and pass in your existing VPC ID. Don’t need a bastion and want custom security groups? Source only the vpc and cluster modules to set up only the default networking.
You’re free to take the pieces you want and leave the rest.
If you’d like to dig into more about how this works in-depth, and each of the modules individually, check out the Architecture section of the Readme.
Now, let’s walkthrough how to provision a new app and add our internal services.
Note: this walkthrough assumes you have an AWS account and Terraform installed. If not, first get the pre-requisites from the requirements section.
For this tutorial, we’ll reference the pieces of the demo app we’ve built: Pingdummy, a web-based uptime monitoring system.
All of the Docker images we use in this example are public, so you can try them yourself!
The Pingdummy infrastructure runs a few different services to demonstrate how services can be deployed and integrated using Stack.
the pingdummy-frontend is the main webpage users hit to register and create healthchecks. It uses the web-service module to run as a service that is publicly accessible to the internet.
the pingdummy-beacon is an internal service which makes requests to other third-party services, and responds with their status. It uses the servicemodule, and is not internet facing. (though here it’s used for example purposes, this service could eventually be run in many regions for HA requests)
the pingdummy-worker is a worker which periodically sends requests to the pingdummy-beacon service. It uses the worker module as it only needs a service definition, not a load balancer.
an RDS instance used for persistence
First, you’ll want to add a Terraform file to define all of the pieces of your infrastructure on AWS. Start by creating a
terraform.tf file in your project directory.
Then, copy the basic stack setup to it:
And then use the Terraform CLI to actually apply the infrastructure:
This will create all the basic pieces of infrastructure we described in the first section.
Note: for managing Terraform’s remote state with more than a single user, we recommend configuring the remote state to use Terraform Enterprise or S3. You can use our pingdummy repo’s Makefile as an example.
Now we’ll add RDS as our persistence layer. We can pull the rds module from Stack, and then reference the outputs of the base networking and security groups we’ve already created. Terraform will automatically interpolate these and set up a dependency graph to re-create the resources if they change.
Again, we’ll need to run plan and apply again to create the new resources:
And presto! Our VPC now has an RDS cluster to use for persistence, managed by Terraform.
Now that we have our persistence and base layers setup, it’s time to add the services that run the Pingdummy app.
We can start with the internal beacon service for our health-checks. This service will listens on port 3001 and makes outbound HTTP requests to third-parties to check if a given URL is responding properly.
We’ll need to use the service module which creates an internal service that sits behind an ELB. That ELB will be automatically addressable at beacon.stack.local,and ECS will automatically add the service containers to the ELB once they pass the health check.
Next, we’ll add the pingdummy-worker service. It is responsible for making requests to our internal beacon service.
As you can see, we’ve used the worker module since this program doesn’t need a load balancer or DNS name. We can also pass custom configuration to it via environment variables or command line flags. In this case, it’s passed the address of the beacon service.
Finally, we can add our pingdummy-frontend web app which will be Internet-accessible. This will use the web-service module so that the ELB can serve requests from the public subnet.
In order to make the frontend work, we need a few extra pieces of configuration beyond just what the base web-service module provides.
We’ll first need to add an SSL certificate that’s been uploaded to AWS. Sadly, there’s no terraform configuration for doing this (it requires a manual step), but you can find instructions in the AWS docs.
From there, we can tell our module that we’d like it to be accessible on the public subnets and security groups and be externally facing. The stack module creates these all individually, so we can merely pass them in and we’ll be off to the races.
Finally, run the plan and apply commands one more time:
And we’re done! Just like that, we have a multi-AZ microservice architecture running on vanilla AWS.
Looking in the AWS console, you should see logs streaming into CloudWatch from our brand new services. And whenever a request is made to the service, you should see HTTP metrics on each of the service ELBs.
To deploy new versions of these services, simply change the versions in the Terraform configuration, then re-apply. New task definitions will be created and the appropriate containers will be cycled with zero downtime.
There’s a few other pieces you’ll need to add, which you can see examples for in the main Pingdummy terraform file. Keep in mind that the example is a dummy app, and is not how we’d recommend doing things like security groups or configuration in production. We’ll have more on that in terraform to come :).
Additionally, we’re excited to open source a few other pieces that were involved in releasing the Stack:
Amir Abu Shareb created terraform-docs, a command-line tool to automatically generate documentation for Terraform modules. You can think of it as the godoc of the Terraform world, automatically extracting inputs, outputs, and module usage in an easily consumable format.
We use terraform-docs to build all of the module reference documentation for Stack.
Achille Roussel created ecs-logs, an agent for sending logs from journald to CloudWatch. It provides all the built-in logging for Stack, and makes sure to create a log group for each service and a single log stream per container.
It’s our hope that this post gave you a brief look at the raw power of what can be achieved with the AWS APIs these days. The ease of Terraform paired with the flexibility and scale of AWS is an extremely powerful combination.
Stack is a “first pass” of what combining these technologies can achieve. It’s by no means finished, and only provides the foundation for many of the ideas that we’ve put into production. Additionally, we’re trying some new experiments around log drivers and instances (reflected by the 0.1 tag) which we think will pay off in the future.
Nonetheless, we’ve open sourced Stack today as the first step to gather as much community wisdom around running infrastructure atop AWS.
In that vein, we’ll happily accept pull requests for new modules that fall within the spirit of the project. It’s our goal to provide the community with a good set of Terraform modules that provide sane defaults and simpler abstractions on top of the raw AWS infrastructure.
So go ahead and try out the Stack today, and please let us know what you think!
Prateek Srivastava on June 14th 2016
Segment’s mobile SDKs are designed to track behavioral data from your app and translate and route that data to hundreds of downstream integrations. One of the SDK’s core tasks is to upload behavioral data to our servers. Since every network request requires your app to power up the device’s radio, uploading this data in real-time can quickly drain a battery.
To minimize the impact of the SDK on battery life, we queue these behavioral events and upload them in batches periodically. This results in 3x less battery drain over an implementation without batching.
Our Android queuing system is built on QueueFile, which was developed by Square. In this article we’re going to run though our queueing requirements, some of traditional solutions, and explain in detail why we chose to build on Queue File.
Queues are a deceptively simple concept, but there are a two main considerations that make it complicated in practice: durability and atomicity.
Durability - An element that has been added to the queue will survive permanently. An in-memory queue is easy to implement, but it lacks durability — events are queued until the process dies, and lost thereafter.
Atomicity - An element is either added to the queue or isn’t; the queue won’t ever be in an partial/invalid state. For our purposes, we needed to save events to disk as quickly and reliably as possible, and then worry about uploading them later.
We were looking for a solution that would guarantee these contracts would be be honored even in the face of process deaths or system crashes. Such scenarios are inevitable on mobile, e.g. the user could lose battery power in the middle of an operation, or the operating system could kill your application to reclaim memory.
File: The most obvious way to persist data to disk is to use a plain old Java file. However, writing to a file is not atomic (in most cases), and it’s easy to run into cases of a corrupted file. A trivial implementation would also keep the entire queue in memory, which is not ideal for devices that might go offline for a long time.
AtomicFile: AtomicFile (also available from the support library) is a simple helper that can guarantee atomic writes on a regular file. It guarantees atomicity by creating a backup of the original file before writing any changes, and waiting until the write is completely written to disk before deleting the backup. As long as the backup file exists, the original file is considered to be invalid. However, AtomicFile requires the caller to maintain track of the corrupted state and restore itself from one.
SharedPreferences: The simplest way to persist data onto disk is by using SharedPreferences from the Android framework. Although designed to store small key-value pairs, it can certainly be used to store any data in a pinch. The SharedPreferences class is a high level wrapper around its own implementation of an atomic file and an in-memory cache. Writes are committed to a map in memory first, and then saved to disk. However, writing to SharedPreferences is not durable. SharedPreferences silently swallows any disk write error, leaving the memory cache and disk out of sync, with callers left guessing about the result of an operation.
SQLite: The typical solution to designing a reliable disk queue on Android is to layer it on top a SQLite database. SQLite works great for large datasets that need multithreaded access, but being a fully featured database, it is designed with optimizations for advanced querying and not for the purposes of a queue. SQLite is complex with many moving parts, and this made us wary of relying on it to back such a critical section of our code.
Although all of the above solutions could have been coerced to work for us, none of them were designed specifically to be used as queues. Luckily, Square had also run into this problem for accepting payments, and they built QueueFile to accomplish this. QueueFile guarantees that all operations are atomic, writes are durable, and is designed to survive process and even system-level crashes.
It was a solution tailor-made for our use case. Adding and removing elements from QueueFile is ordered first in first out (FIFO) and takes constant time, both of which are a nice bonus. QueueFile’s tiny size also makes it perfect for us to embed in our SDK.
There are a few guarantees that the filesystem provides:
renaming a file is an atomic operation
fsync is durable
segment writes are atomic
QueueFile is particularly clever about the way it stores and updates data — Bob Lee has an excellent talk on the subject. QueueFile consists of a 16 byte file header, and a series of items called elements, as a circular buffer. The grey area (not to scale) represents empty space in the file.
The file header consists of four 4-byte integers that represent the length of the file, the number of elements in the file, and a pointer to the location of the first and last elements. Since the length of the file header is only 16 bytes (smaller than the size of a segment), changes to the file header are atomic. QueueFile relies on this by making modifications to the file visible only when the header is committed as well.
The components of a file header.
Each element itself is comprised of a 4-byte element header that stores the length of the element, and the element data (variable length) itself.
The components of an element.
Consider adding an element to the QueueFile below.
QueueFile first writes (and fsyncs) the element and its length. Notice that the file header remains unchanged. If writing the second element fails, the QueueFile is still left in a valid state since the header hasn’t been updated (even though the new data may be on disk). Upon restart, the QueueFile would still report that queue contains only a single element.
When the write operation completes successfully, the header is committed (and fsynced) as well. If updating the header fails, then the QueueFile remains the same as above, and the change is aborted. Otherwise, we’ve successfully added our data!
Calling fsync after every write prevents the filesystem from reordering our writes, and makes the transaction durable. This ensures that the QueueFile is never left in an invalid or corrupted state.
Removing and clearing do the reverse. Let’s start from the result of our previous operation.
QueueFile writes (and fsyncs) the header first. If committing the header fails, then the change is simply aborted and the QueueFile remains the same as above. When the header is written successfully, the removal is committed, and now has a size of 1 (even though the data is on disk).
QueueFile goes one step further and zeroes out the removed data, which leaves us just with the element we added previously.
QueueFile has been a key part of the queueing system we’ve built which powers data collection for over 1,400 Android apps running on more than 200 million devices. While the incredibly small size of the library helps prevent SDK bloat, its simplicity does also create a few limitations. For instance, QueueFile is limited to a size of 1GB, and it cannot be used by multiple processes concurrently. Neither were deal breakers for us — we don’t want your queued analytics data using too much disk space and we can create separate queues for different processes — but are good to be aware of.
We’re working on porting this approach over to iOS and have some ideas to expand QueueFile to work on a broader range of filesystems like iOS and desktop apps. If you’re interested in helping us build infrastructural components like this for mobile apps and SDKs, we’re hiring.
Stephen Mathieson, Calvin French-Owen on May 26th 2016
For the past year, we’ve been heavy users of Amazon’s EC2 Container Service (ECS). It’s given us an easy way to run and deploy thousands of containers across our infrastructure.
ECS gives us free Cloudwatch metrics, automatic healthchecks, and scheduling across hundreds of nodes. It’s radically simplified how we deploy code to production. In short, it’s pretty awesome.
But ECS had one major shortcoming: navigating the AWS Console is a massive pain. It’s hard to find which containers are running and what version of an image is deployed.
That’s why we built Specs
Specs gives you a high-level window into your ECS clusters and services.
It provides a handy search box to find particular services in production, and then helps diagnose what the service’s exact state. Specs lets us know if a container isn’t able to be placed due to insufficient capacity or if all containers in a service are dead.
It gives the entire development team a better picture of how our production environment is configured.
We run Specs behind an internal Google oAuth server, so any engineer can view it even if they don’t have an IAM account. It’s allowed a lot more of the team to debug issues without bothering the core infra team.
Getting started with Specs couldn’t be easier. Assuming you’re running Docker and have your AWS credentials exported, just run the following command:
You’ll get convenient dashboards with zero configuration.
If you’d like to help contribute, please check out the github repo. We’re excited to add features like log tailing and a real-time events feed to make working with ECS a truly seamless experience.
Lipei Wang on May 23rd 2016
Python, one of the most popular scripting languages, is also one of the most preferred tools for data analysis and visualization. In addition to the broader Python developer community, there is also a significant group that uses Python to analyze data, draw actionable insights, and make decisions.
With its extensive collection of helper libraries and platforms, Python is a great tool for quick, iterative data exploration. Python’s set of libraries includes everything from visualization to statistical analysis, making it convenient for its users to jump into the data and begin identifying patterns.
Together with the ability to iterate quickly in data and statistical analysis, there are great open source tools on managing data pipelines and workflows. A growing community of analysts are finding new ways of using Python to crunch numbers and understand their data.
We had a chance to catch up with the Chief Analyst at Mode, Benn Stancil, and ask him about the importance of Python, how to use it in day-to-day analysis, and to tell us about some key features of Mode’s new product, Mode Python Notebooks. Mode Python Notebooks is a hosted solution that allows analysts to use Python for exploratory analysis.
I’m interested in Python for the same reasons I like SQL: It gives me the power and flexibility to answer any question. The community is great and adoption is on the rise.
There are many easy to use Python libraries to make data exploration convenient and immediate. This allows for iterative data analysis. With Python, you can really chase your curiosities down the rabbit hole.
Lastly, Python’s utility and flexibility allows it to be used for a variety of tasks within the data science stack. For example, Luigi and Airflow both allow for managing data pipelines and workflows in Python. By completing exploratory analysis in Python, there can be times where the work carries over into production.
What are the most popular use cases for SQL as opposed to Python?
SQL is designed to query and extract data from a database. It’s a necessary first step to get the data into a usable format. For instance, SQL allows you to easily join several data sets to create a table that you can explore further.
SQL isn’t really designed for manipulating or transforming data in certain ways. Higher level data manipulation that is common with data science, such as statistical analysis, regressions, trend lines, and working with time series data, isn’t easy in SQL.
Despite these limitations, because SQL is necessary for extracting data, it’s still commonly used for complex operations. The query below, which calculates quantiles for different series in a data, is something I’ve used versions of many times.
When would Python come in?
Python has a ton of libraries (e.g. Pandas, StatsModel, and SciPy) that are designed for statistical and mathematical analysis. The libraries also do a great job of abstracting away the details so that you don’t need to calculate all the underlying math by hand. Moreover, you can get your results immediately, so you can use Python iteratively to explore your data.
Rather than saying “I want to do a regression analysis” and sitting down for half an hour figuring out where to begin in SQL, the Python libraries make it so that you can just run the analysis, see the results, and continue exploring the path your curiosity takes you down. With Python, there is not much lag between inspiration and action. With SQL, on the other hand, I often think twice before going down a path that may or may not be fruitful.
For example, I’d only write the query above if I really knew I wanted to present the quantiles of that dataset. Because the entire thing can be accomplished with the one line of Python below, I’d do it much earlier in my analytical process–and may discover something I wasn’t looking for as a result.
Another way to think about the difference between Python and SQL is that Python allows you to start with one large table, from which you branch off different analyses in different directions. One avenue of inspiration can bring you to another avenue and to another avenue. The speed and flexibility of analysis makes it easy to go down many exploratory paths.
Those kinds of analyses sound very different. Why combine SQL and Python in one place?
Because SQL and Python each have individual strengths and weaknesses. Tying the languages together gives analysts the best of both worlds.
First, SQL is needed to build the data set into a final table that has all of the necessary attributes. Then, from this large data set, you can use Python to spin off deeper analysis.
What would it take for a SQL analyst to learn Python?
Like many skills, the best way to learn how to use Python for analysis is by diving in to work on a problem you’re interested in, care about, and with which you’re somewhat familiar.
When you work on something that you’re interested in, you tend to go deeper. You uncover something in the core problem that piques your interest, and you want to learn more by analyzing the data set in a different way. You begin asking more and more questions. This curiosity can push you further than you would go otherwise, and a is a source of a lot of real learning.
You also should work on data that you’re somewhat familiar with, so you know when you do something wrong. You’ll have better instincts about what’s going on and what to expect. Compare this to when you’re working on data about which you know nothing—like flower petal sizes (an unusually popular data set found in many Python examples). If your analysis concludes that “all of these flowers have two centimeter-long petals” and you have no idea whether that is reasonable, you may just assume it’s right and move on.
Mode is also releasing a new Python tutorial that aims to help SQL users learn how and where to integrate Python into their workflow. In addition, the Python tutorials provide problems that are familiar to those in business settings, instead of academic problems.
How would you expect learning Python to help benefit analysts in their job and their careers?
Learning Python definitely can augment an analyst’s skill set.
An analyst needs to communicate the business value through data. One part of the job is to find the insights from the data, but the more effective job is also to include the right context and narrative around the insights that can compel your teammates towards action. And since using data and analytics to make decisions is becoming more important in the workplace, the role of the analyst to deliver comprehensive analysis is more important than ever.
Is it easy to get started with Python? What kind of setup and tooling do you need?
Until recently, getting started with Python for analysis requires installing a few things—Python, several main statistics and data analysis libraries, and Notebooksoftware that’ll run the analysis locally on your computer. Then, you’d have to run the Notebook, starting a local server to execute your Python commands.
The results generated from your commands on Notebooks will exist on your desktop. In order to share it, you basically download a Python Notebook HTML file (which has a mix of code and results) and send that around. In order to easily parse the results from the HTML, your colleagues would have to open that file in a browser. And at that point, unless they have Python set up too, the code isn’t re-executable.
When you run Python locally, you’re limited to the power of your computer, you have to leave your computer open to run it, and running scripts can slow other things down. By running it remotely, you can run it from any machine and you can run something, close your computer and walk away, and still have your results waiting when you get back.
With Mode Python Notebooks, all of that setup and hosting is taken care of for you. If you’re using Mode to query an existing database, there’s a tab for a Python Notebook. You can open it up and the query results table will be automatically populated. And after generating plots, time series analysis, summary statistics, etc., in the Notebook, it’s easy to curate the results into an instantly shareable report anyone can re-run. Moreover, you don’t have to run it on your local machine, which saves your computer’s processing power and memory. Finally, setting it all up is as easy as setting up a database connection.
Thanks for your time, Benn!
If you’re interested in learning how to run data analysis in Python, check out Mode’s new Python tutorial. And, if you’re a Segment customer looking to run Python scripts on your event and cloud app data check out Mode Python Notebooks.
Calvin French-Owen on May 16th 2016
Since Segment’s first launch in 2012, we’ve used queues everywhere. Our API queues messages immediately. Our workers communicate by consuming from one queue and then publishing to another. It’s given us a ton of leeway when it comes to dealing with sudden batches of events or ensuring fault tolerance between services.
We first started out with RabbitMQ, and a single Rabbit instance handled all of our pubsub. Rabbit had lot of nice tooling around message delivery, but it turned out to be really tough to cluster and scale as we grew. It didn’t help that our client library was a bit of a mess, and frequently dropped messages (anytime you have to do a seven-way handshake for a protocol that’s not TLS… maybe re-think the tech you’re using).
So in January 2014, we started the search for a new queue. We evaluated a few different systems: Darner, Redis, Kestrel, and Kafka (more on that later). Each queue had different delivery guarantees, but none of them seemed both scalable and operationally simple. And that’s when NSQ entered the picture… and it’s worked like a charm.
As of today, we’ve pushed 750 billion messages through NSQ, and we’re adding around 150,000 more every second.
Before discussing how NSQ works in practice, it’s worth understanding how the queue is architected. The design is so simple, it can be understood with only a few core concepts:
topics - a topic is the logical key where a program publishes messages. Topics are created when programs first publish to them.
channels - channels group related consumers and load balance between them–channels are the “queues” in a sense. Every time a publisher sends a message to a topic, that message is copied into each channel that consumes from it. Consumers will read messages from a particular channel and actually create the channel on the first subscription. Channels will queue messages (first in memory, and spill over to disk) if no consumers are reading from them.
messages - messages form the backbone of our data flow. Consumers can choose to finish messages, indicating they were processed normally, or requeue them to be delivered later. Each message contains a count for the number of delivery attempts. Clients should discard messages which pass a certain threshold of deliveries or handle them out of band.
NSQ also runs two programs during operation:
nsqd - the nsqd daemon is the core part of NSQ. It’s a standalone binary that listens for incoming messages on a single port. Each nsqd node operates independently and doesn’t share any state. When a node boots up, it registers with a set of nsqlookupd nodes and broadcasts which topics and channels are stored on the node.
Clients can publish or read from the nsqd daemon. Typically publishers will publish to a single, local nsqd. Consumers read remotely from the connected set of nsqd nodes with that topic. If you don’t care about adding more nodes dynamically, you can run nsqds standalone.
nsqlookupd – the nsqlookupd servers work like consul or etcd, only without coordination or strong consistency (by design). Each one acts as an ephemeral datastore that individual nsqd nodes register to. Consumers connect to these nodes to determine which nsqd nodes to read from.
Let’s walk through a more concrete example of how this works in practice.
NSQ recommends co-locating publishers with their corresponding nsqd instances. That means even in the face of a network partition, messages are stored locally until they are read by a consumer. What’s more, publishers don’t need to discover other nsqd nodes–they can always publish to the local instance.
First, a publisher sends a message to its local nsqd. To do this, it first opens up a connection, and then sends a PUB command with the topic and message body. In this case, we publish our messages to the events topic to be fanned out to our different workers.
The events topic will copy the message and queue it in each of the channels linked to the topic. In our case, there are three channels, one of them being the archives channel. Consumers will take these messages and upload them to S3.
Messages in each of the channels will queue until a worker consumes them. If the queue goes over the in-memory limit, messages will be written to disk.
The nsqd nodes will first broadcast their location to the nsqlookupds. Once they are registered, workers will discover all the nsqd nodes with the events topic from the lookup servers.
Then each worker subscribes to each nsqd host, indicating that it’s ready to receive messages. We don’t need a fully connected graph here, but we do need to ensure that individual nsqd instances have enough consumers to drain their messages (or else the channels will queue up).
Separate from the client library, here’s an example of what our message handling code might look like:
If the third party fails for any reason, we can handle the failure. In this snippet, we have three behaviors:
discard the message if it’s passed a certain number of delivery attempts
finish the message if it’s been processed successfully
requeue the message to be delivered later if an error occurs
As you can see, the queue behavior is both simple and explicit.
In our example, it’s easy to reason that we’ll tolerate
MAX_DELIVERY_ATTEMPTS * BACKOFF_TIME minutes of failure from our integration before discarding messages.
At Segment, we keep statsd counters for message attempts, discards, requeues, and finishes to ensure that we’re achieving a good QoS. We’ll alert for services any time the number of discards exceeds the thresholds we’ve set.
In production, we run nsqd daemons on pretty much all of our instances, co-located with the publishers that write to them. There are a few reasons NSQ works so well in practice:
Simple protocol - this isn’t a huge issue if you already have a good client library for your queue. But, it can make a massive difference if your existing client libraries are buggy or outdated.
NSQ has a fast binary protocol that is easy to implement with just a few days of work. We built our own pure-JS node driver, (at the time, only a coffeescript driver existed) which has been stable and reliable.
Easy to run - NSQ doesn’t have complicated watermark settings or JVM-level configuration. Instead, you can configure the number of in-memory messages and the max message size. If a queue fills up past this point, the messages will spill over to disk.
Distributed - because NSQ doesn’t share information between individual daemons, it’s built for distributed operation from the beginning. Individual machines can go up and down without affecting the rest of the system. Publishers can publish locally, even in the face of network partitions.
This ‘distributed-first’ design means that NSQ can essentially scale forever. Need more throughput? Add more nsqds!
The only shared state is kept in the lookup nodes, and even those don’t require a “global view of the universe.” It’s trivial to set up configurations where certain nsqds register with certain lookups. The only key is that the consumers can query to get the complete set.
Clear failure cases - NSQ sets up a clear set of trade-offs around components which might fail, and what that means for both deliverability and recovery.
I’m a firm believer in the principle of least surprise, particularly when it comes to distributed systems. Systems fail, we get it. But systems that fail in unexpected ways are impossible to build upon. You end up ignoring most failure cases because you can’t even begin to account for them.
While they might not be as strict of guarantees as a system like Kafka can provide, the simplicity of operating NSQ makes the failure conditions exceedingly apparent.
UNIX-y tooling - NSQ is a good general purpose tool. So it’s not surprising that the included utilities are multi-purpose and composable.
In addition to the TCP protocol, NSQ provides an HTTP interface for simple cURL-able maintenance operations. It ships with binaries for piping from the CLI, tailing a queue, piping from one queue to another, and HTTP pubsub.
There’s even an admin dashboard for monitoring and pausing queues (including the sweet animated counter above!) that wraps the HTTP API.
As I mentioned, that simplicity doesn’t come without trade-offs:
No replication - unlike other queues, NSQ doesn’t provide any sort of replication or clustering. This is part of what makes running it so simple, but it does force some hard guarantees on reliability for published messages.
We partially get around this by lowering the file sync time (configurable via a flag) and backing our queues with EBS. But there’s still the possibility that a queue could indicate it’s published a message and then die immediately, effectively losing the write.
Basic message routing - with NSQ, topics and channels are all you get. There’s no concept of routing or affinity based upon on key. It’s something that we’d love to support for various use cases, whether it’s to filter individual messages, or route certain ones conditionally. Instead we end up building routing workers, which sit in-between queues and act as a smarter pass-through filter.
No strict ordering - though Kafka is structured as an ordered log, NSQ is not. Messages could come through at any time, in any order. For our use case, this is generally okay as all the data is timestamped, but doesn’t fit cases which require strict ordering.
No de-duplication - Aphyr has talked extensively in his posts about the dangers of timeout-based systems. NSQ also falls into this trap by using a heartbeat mechanism to detect whether consumers are alive or dead. We’ve previously written about various reasons that would cause our workers to fail heartbeat checks, so there has to be a separate step in the workers to ensure idempotency.
As you can see, the underlying motif behind all of these benefits is simplicity. NSQ is a simple queue, which means that it’s easy to reason about and easy to spot bugs. Consumers can handle their own failure cases with confidence about how the rest of the system will behave.
In fact, simplicity was one of the primary reasons we decided to adopt NSQ in the first place (along with many of our other software choices). The payoff has been a queueing layer that has performed flawlessly even as we’ve increased throughput by several orders of magnitude.
Today, we face a more complex future. More and more of our workers require a stricter set of reliability and ordering guarantees than NSQ can easily provide.
We plan to start swapping out NSQ for Kafka in those pieces of the infrastructure and get much better at running the JVM in production. There’s a definite trade-off with Kafka; we’ll have to shoulder a lot more operational complexity ourselves. On the other hand, having a replicated, ordered log would provide much better behavior for many of our services.
But for any workers which fit NSQ’s trade-offs, it’s served us amazingly well. We’re looking forward to continuing to build on its rock-solid foundation.
TJ Holowaychuk for creating nsq.js and helping drive its adoption at Segment.
Jehiah Czebotar and Matt Reiferson for building NSQ and reading drafts of this article.
Julian Gruber, Garrett Johnson, and Amir Abu Shareb for maintaining nsq.js.
Tejas Manohar, Vince Prignano, Steven Miller, Tido Carriero, Andy Jiang, Achille Roussel, Peter Reinhardt, Stephen Mathieson, Brent Summers, Nathan Houle, Garrett Johnson, and Amir Abu Shareb for giving feedback on this article.