Go back to Blog


Solon Aguiar, Brian Lai on January 20th 2022

With the launch of Twilio Engage, the team needed to update the data delivery system to support the latest platform features. This blog covers the process, including questions and lessons learned, to get to the end result.

All Engineering articles

Jeroen Ransijn on October 16th 2018

Design systems are emerging as a vital tool for product design at scale. These systems are collections of components, styles, and processes to help teams design and build consistent user experiences. It seems like everyone is building one, but there is no playbook on how to take it from the first button to a production-ready system adopted across an organization. Much of the advice and examples out there are for teams that seem to have already figured it out.

Today I want to share my experience in bootstrapping a design system and driving adoption within our organization, Segment. I will share how we got started by creating something small and useful  first. Then I will share how I hijacked a project to build out that small thing into our full blown design system known as Evergreen. Finally, I will share how we continue to drive and track adoption of our design system.

What is a Design System?

A design system is a collection of components, styles and processes to help teams design and build consistent user experiences — faster and better. Design systems often contain components such as buttons, popovers and checkboxes, and foundational styles such as typography and colors. Teams that use the design system can focus on what’s unique to their product instead of reinventing common UI components.

What’s in our Design System

Before I share my experience bootstrapping our design system called Evergreen. I want to set some context first, and explain what is in our design system.

  • Design Resources

    • Sketch UI Kit

    • Design Guidelines

  • Code Resources

    • React UI Framework

    • Developer Documentation

  • Operational Resources

    • Roadmap documents

    • On-boarding process

Our design system didn’t start out with all of those resources. In fact, I built something small and useful first. In the next sections I will share the lessons I learned in bootstrapping a design system and driving adoption within our organization.

How We Got Started

About 2 years ago, I joined Segment as a product designer. I worked as a front-end developer in the past and I wanted to use my skillset to create interactive prototypes. To give you a bit of context, the Segment application allows our customers to collect data from your website or app, synthesize that data and integrate with over 200 integrations for marketing and analytics.

The prototypes I wanted to develop would live outside of our Segment application and would have no access to the application codebase. This means that I didn’t have access to the components already in the application — I had to create everything from scratch.

Most advice online talks about starting with a UI audit or trying to get executive buy-in. Those are all part of the long journey of creating a successful design system, but there are many ways to get started. If you set out to solve all of the problems in your product, you might be taking on too much at once. Instead, build something small and useful, provide value quickly, and iterate on what works.

Build something small and useful

One of the first challenges you run into when creating a component library is how to deal with styling and CSS. There are a few different ways to deal with this:

  • Traditional CSS: Verbose to write, hard to maintain at scale. Often relies on conventions.

  • CSS Preprocessor such as Sass or Less: Easier way to write CSS, chance of naming collisions. Often relies on conventions.

  • CSS-in-JS solutions: Write CSS in JavaScript. Powerful ways to abstract into components.

I wanted a solution that didn’t require any extra build steps or extra imports when using the component library. CSS-in-JS made this very easy. You can import a component in your code and it works out of the box.

I wanted to avoid having to create a ton of utility class names to override simple CSS properties on components such as dimensions, position and spacing. It turns out there is a way to achieve this in an elegant way — enter the React UI primitive.

Choosing React

There are many choices of frameworks for your component library. When I started building a component library, we were already using React, so it was the obvious choice.

React UI Primitive

After doing research, I found the concept of UI primitives. Instead of dealing with CSS directly, you deal with the properties on a React component. I bounced ideas off my coworkers and got excited about what this would mean. In the end we built UI-BOX.


UI-BOX exports a single Box component that allows you to use React props for CSS properties. Instead of creating a class name, you pass the property to the Box component directly:

Why is this Box component useful?

The Box component is useful because it helps with 3 common use cases

  • Create layouts without helper classes.

  • Define components without worrying about CSS.

  • Override single properties when using components.

Create layouts without helper classes.

Define components without worrying about CSS

Override single properties when using components

Flexibility and composability

The Box component makes it easy to start writing new components that allows setting margin properties directly to the component. For example, quickly space out two buttons by adding marginRight={10} to the left button. Also, you can override CSS properties without adding new distinct properties to the component. For example, this is useful when full-width button is needed, or want to remove the border-radius on one side of a button. Furthermore, layouts can be created instantly by using the Box component directly.

Still a place for CSS

It is important to note that UI-BOX only solves some of the problems. A class is still needed to control the appearance of a component. For example, a button can add dimensions and spacing with UI-BOX, but a class defines the appearance: background color, box shadows, color as well as the hover, active and focus states. In our design system called Evergreen a CSS-in-JS library called Glamor is used to create appearance classes.

Why it drove adoption of Evergreen

A design system can start with something small and useful. In our case it was using a UI primitive that abstracted away dealing with CSS directly. Roland, one of our lead engineers said the following about UI-BOX.

UI-BOX really drove adoption of Evergreen…

…there is no need to consider every configuration when defining a new component. And no need to wrap components in divs for spacing.

— Roland Warmerdam, Lead Software Engineer, Segment

The lesson learned here is that it’s possible to start with something small and slowly grow that out to a full fledged design system. Don’t think you have time for that? Read the next section for some ideas.

How we started driving adoption

Up until now, I had built a tool for myself in my spare time, but it was still very much a side project. Smaller startups often can’t prioritize a design system as it doesn’t always directly align with business value. I will share how I hijacked a project, scaled out the system, and finally focused on accessibility to drive adoption across teams at Segment — and how you can do the same.

Hijack a project

About a year ago I switched teams within Segment. I joined a small team called Personas, which was almost like a small startup within Segment. With Personas we were building user profiles and audience capabilities on top of the Segment platform. It turned out to be a perfect opportunity to build out more of the design system.

Deadline in sight — our first user conference

The company wanted to announce the Personas product at our first ever user conference, with only 3 months of lead time to prepare. The idea was that our CEO and Head of Product would demo it on stage. However, there was no way we could finish a fully-baked consumer-facing product in time. We were pivoting too often based on customer feedback.

The company wanted to announce this product at our first ever user conference, with only 3 months of lead time to prepare.

Seize the opportunity

It seemed like an impossible deadline. Then it hit me: We could build a standalone prototype to power the on-stage demo. This prototype would be powered by fake data and only support just the functionality that was part of the demo.

This prototype would live outside of the confines of our application. This would allow us to build things quickly, but the downside is that there is no access to the code and components that live in our application codebase. Every component we want to use in the prototype needs to be built — a perfect opportunity to build out more of the design system. We decided it would be the lowest risk, highest reward option for us to pursue.

While we worked on the demo script for the on-stage demo, I was crunching away on the prototype and Evergreen. Having the prototype available and easily shareable made it easier for the team to practice and fine-tune the script. It was a great time at Segment; I could see our team and company growing closer while readying for launch.

Huge Success

The interactive prototype was a huge success. It helped us show the vision of what our product and Personas could be. It drove considerable interest to our newest product, Personas. I was happy, because not only did we have a interactive prototype, we also have the first parts of our design system.

Focus on the developer experience

So far, we built something small and useful and hijacked a project that allowed us to build out a big chunk of our design system, Evergreen. The prototype also proved to be a great way to drive adoption of Evergreen in our application. Our developers simply took code from the prototype and ported it over in our application. 

At that point, Evergreen components were adopted in over 200 source code files. Our team was happy about the components, but there were some pain points with the way Evergreen was structured. When we started building Evergreen, we copied some of the architecture decisions of bigger design systems. That turned out to be a mistake. It slowed us down.

Too early for a mono-repo

When I started building Evergreen I took a lot of inspiration from Atlassian’s AtlasKit. It is one of the most mature and comprehensive enterprise design systems out there. We used the same mono-repo architecture for Evergreen, but it turns out there is quite a lot of overhead to when using a mono-repo.

Our developers were not happy with the large number of different imports in each file. There were over 20 different package dependencies. Maintaining these dependencies was painful. Besides unhappy developers, it was time-consuming to add new components.

A single dependency is better for us (for now)

I wanted to remove as much friction for our developers using Evergreen as possible, which is why I wanted to migrate away from the mono-repo. Instead, a single package would export all of our components as a single dependency.

Migrate our codebase in a single command

When we decided to migrate to a single package, it required updating the imports in all the places Evergreen was used in our application. At this point Evergreen was used in over 200 source code files in the Segment application. It seemed like a pretty daunting challenge, not something anyone got excited about doing manually. We started exploring our options and ways to automate the process, and to our surprise it was easier than we thought.

Babel parser to the rescue

We created a command line tool for our application that could migrate the hundreds of files of source code using Evergreen with one command. The syntax was transformed using a tool called b. Now it’s a much better experience for our developers in the application. In the end, our developers were happy.

Lesson Learned, Face the challenge

A big change like this can feel intimidating, and give you second doubts. Although I wish I started Evergreen with the architecture it has right now — sometimes the right choice isn’t clear up front. The most important thing is to learn and move forward. 

Driving adoption of a design system is very challenging. It is hard to understand progress. We came up with a quite nifty way to visualize the adoption in our application — and in turn make data-driven decisions about the future of Evergreen.

How to get to 100% adoption

Within our company, product teams operate on key metrics to get resources and show they are being successful to the rest of the company. One of the key metrics for Evergreen is 100% adoption in our application. What does 100% even mean? And how can we report on this progress?

What does 100% adoption even mean?

100% adoption at Segment means building any new products with Evergreen and deprecating our legacy UI components in favor of Evergreen components. The first part is the easiest as most teams are already using Evergreen to build new products. The second part is harder. How do we migrate all of our legacy UI components to Evergreen components?

What legacy UI components are in our app?

Active code bases will accrue a large number of components over time. In our case this comes in the form of legacy component libraries that live in the application codebase.

In our case it comes in the following two legacy libraries:

  • React UI Library, precursor to Evergreen.

  • Legacy UI folder, literally a folder called ui in our codebase that holds some very old components.

Evergreen versions

In addition to the legacy libraries, the application is able to leverage multiple versions of Evergreen. This allows gradual data migration from one version to another.

  • Evergreen v4, the latest and greatest version of Evergreen. We want 100% of this.

  • Evergreen v3, previous version of Evergreen. We are actively working on migrating this over to v4

How can we report on the progress of adoption?

The solution we came up with to report on the adoption of Evergreen is an adoption dashboard. At any single point in time the dashboard shows the following metrics:

  • Global Adoption, the current global state of adoption

  • Adoption Week Over Week, the usage of Evergreen (and other libraries) week over week

  • Component Usage, a treemap of each component sorted by framework. Each square is sized by how many times the component is imported in our codebase.

The Component Usage Treemap

Besides the aggregates, we know exactly which files import a component. To visualize this, a Treemap chart on the dashboard shows each component with the size of the square representing how many times it imported in our application.

Understand exactly where you are using a component

Clicking on one of the squares in the treemap shows a side sheet with a list of all the files which import that component. This information allows us to confidently deprecate components.

Filter down a list of low hanging fruit to deprecate

The adoption dashboard also helps to prioritize the adoption roadmap. For example, legacy components that are only imported once or twice are easy to deprecate.

How it works

Earlier I shared how we used babel-parser to migrate to the new import structure.  Being true to our roots, we realized the same technique could be used to collect analytics for our design system! To get to the final adoption dashboard there are a few steps involved.

Step 1. Create a report by analyzing the codebase

We wrote a command line utility that returns a report by analyzing the import statements at the top of each file in our codebase. An index is built that maps these files to their dependencies. Then the index can be queried by package and optionally the export.  Here is an example:



We open-sourced this tool if you are interested in learning more or want to build out your own adoption dashboard see https://github.com/segmentio/dependency-report

Step 2. Create and save a report on every app deploy

  • Every time we deploy our application, the codebase is analyzed and a JSON report is generated using the dependency-report tool.

  • Once the report is generated, it is persisted to object storage (S3).

  • After persisting the report, a webhook triggers the rebuild of our dashboard via the Gatsby static site generator.

Step 3. Build the dashboard and load the data

To reduce the number of reports on the dashboard, the generator only retrieves the most current report as well as a sample report from each previous week. The latest report is used to show the current state. The reports of the previous weeks are used to calculate an aggregate for the week over week adoption chart.

How the adoption dashboard is pushing Evergreen forward

The adoption dashboard was the final piece in making Evergreen a success as it helped us migrate over old parts of our app systematically and with full confidence. It was easy to identify usage of legacy components in the codebase and know when it was safe to deprecate them. Our developers were also excited to see a visual representation of the progress. These days it helps us make data-driven decisions about the future of Evergreen and prioritize our roadmap. And honestly, it is pretty cool.


To those of you who are considering setting out on this journey, I’ll leave you with a few closing thoughts:

  • Start small. It’s important to show the value of a potential design system by solving a small problem first.

  • Find a real place to start. A design system doesn’t have value by itself. It only works when applied to a real problem.

  • Drive adoption and measure your progress. The real work starts once the adoption begins. Don’t forget that the real value is in adoption. Design systems are only valuable once they are fully integrated into the team’s workflow.

This is only the start of our journey. There are still many challenges ahead. Remember, building a design system is not about reaching a single point in time. It’s an ongoing process of learning, building, evangelizing and driving adoption in your organization.

Alan Braithwaite on September 27th 2018

“How should we test this?”

“Let’s just run it in production and monitor it closely.”

— You and your coworker, probably.

While often mocked, testing in production is the most definitive way to ensure that your system is operating as expected.  Segment has been on a journey for the last 18 months to include end-to-end testing in production as part of our broader testing strategy, so we wanted to share some of the work we’ve been doing in this area.

For those unfamiliar, Segment is a Customer Data Infrastructure which helps our customers route data about their users from various collection points (web, mobile, server-side) to hundreds of Destinations (partners which receive data from segment) and data warehouses.  

The numerous components which compose Segment’s backend creates a challenging environment for testing in general, but especially in production. To manage this complexity, we’ve decided to focus on two areas.

First, we’ve been building towards a staging environment that faithfully represents our production environment. Second, since we cannot cost-effectively operate a staging environment at the same scale as our production environment, we’ve been developing end-to-end tests for production.

Much has been written about other types of testing, so we’re going to focus on end-to-end testing in this post.

End-to-end tests are tests which run against the entire infrastructure. They are distinct from integration tests because they’re run on real infrastructure whereas integration tests are not. These tests are also distinct from unit tests, which only test a very small amount of code or even just one method.  End-to-end tests should also exercise the exact same code paths used by a customer sending data to Segment’s API.

So what does an end-to-end test look like for Segment?

  1. Send an event to the Segment Tracking API

  2. Process that event through our many streaming services (e.g. validation, deduplication, etc)

  3. Send the event into Centrifuge, which handles reliable delivery of events to Destinations in the presence of network timeouts or other failures outside of our control

  4. Verify that the event is received by a Webhook destination

  5. Emit latency and delivery metrics to alert on using segmentio/stats.

To implement this kind of test, we required an end-to-end testing framework that would make it easy for developers to build new tests.

When we started looking at solutions, we played around with some other end-to-end frameworks with varying degrees of success. They often incorporated ideas about contracts and assertions which were tightly coupled to the framework. This not only made it difficult to add new types of tests, but it also made them difficult to debug.

Before we had end-to-end tests, our staging environment wasn’t effective at preventing bugs from getting to production.  Software is updated more frequently in staging, often being a week or so ahead of the production version.  Additionally, configuration of the staging environment was haphazard and occasionally broke due to changes in the software. These breaks were often silent because we had not been monitoring them in staging.

Today we’re open sourcing Orbital, a framework which meets the requirements presented above and helped us reach our testing goals.

Orbital provides the means to define, register and run tests as part of a perpetually-running end-to-end test service. Additionally, it provides metrics (using segmentio/stats) around test latency and failure rates which we can monitor and alert on.


Orbital is a lightweight test framework for running systems tests defined in Go.  Orbital is inspired by Go’s own testing library, specifically the testing.T abstraction.  testing.T is a struct that gets injected into each test which defines a set of methods to determine whether or not that test passed.  We like Go’s testing package for two reasons.

First, the package takes a users first approach in it’s design.  The API couldn’t be more simple!  Doing this greatly reduces friction when writing tests, increasing the likelihood that they’ll get written and maintained properly.

Second, modeling orbital.O  after testing.T gives us the flexibility we need to define our arbitrarily complex tests.  After trying to enumerate all the different things we’d like to support we found that there are just too many behavioral edge cases that needed actual code to describe properly.  For example, say you want to check events were received by a webhook and also that some counters were updated.  This was difficult to articulate with an assertion based framework like the one we were using before. With Orbital, we’re now only limited by what the Go language supports which is an improvement over the “mutation→assertion” style tests which we encountered before.

The following example exercises the above illustrated case: sending an event to our Tracking API produces a webhook to a configured endpoint. In this case, we’ve configured the webhook to be our own end-to-end service’s API for test verification.

As you can see, the code is very straightforward.  Each test runs in its own goroutine and blocks until it’s completed or the context is cancelled.  Modeling tests in this way allows us to check arbitrary side effects and allows for any kind of behavioral testing your imagination can come up with.

Orbital provides a Service struct which registers the tests and manages the process lifecycle. This struct allows you to set global timeouts for all tests as well as configure logging and metrics. During test registration, you set the period (how often the test is run), name, function and optional timeout override.

One key factor in the design of this framework is the embedded webhook package. This special webhook operates like a normal HTTP server which logs requests to an interface.  One implementation of this interface (RouteLogger) is configured such that after sending an event, you can block your goroutine waiting until that event is received by the webhook or a timeout occurs.

With this primitive, we can send requests to the API, then wait for them to be sent back after being processed in our pipeline. In the above example, we’re doing this on the line h.Waiter.Wait(ctx, evt.ID). To see a full example of both a tester and tested service, check out the examples directory on GitHub.

How do we use it?

Our Orbital tests are deployed as a service that runs inside of our staging and production infrastructure.  It sends events to the Segment Tracking API using our various library implementations.  We even fork out to headless Chrome to execute tests in the browser with analytics.js!  This framework generates metrics used for dashboards and alerting. Here you can see a comparison of our staging vs production environments.



From the screenshots above you can see that something was broken in staging by looking at the top center graph.

This library was important to creating confidence that our stage environment behaves the same way as our production environment.  We’re now at the point where we can block a release if any of the tests fail in stage.  We know for certain that something did actually break and needs to be investigated. This is the testing strategy you need to have in your infrastructure to reach the ever elusive 5-9s of reliability.

What remains?

Orbital has already proven instrumental in reducing the number of bugs making it to production.  We’ve written numerous tests across multiple teams which exercise various known customer configurations.  However, the framework is not yet bulletproof.

Although you can scale on a single instance to tens of thousands of requests per second, eventually you’ll hit a bottleneck somewhere.  Unfortunately, this framework doesn’t elegantly scale out right now.  Currently, the “RouteLogger/Waiter” records messages sent and received in memory; not to a shared resource or DB.  So if you have multiple tasks running which are load balanced, those requests are unlikely to be sent to the right task and the tests will fail/timeout.  This is a non-trivial but ultimately a solvable problem.

If this is interesting to you, reach out to us!  We’d love to hear from you.  You can find us on twitter @Segment.  Check out our Open Source initiatives here.  We’ve also got many positions in Engineering which involve solving problems similar to this which you can see here.

If you’re interested in reading more about this topic, check out these other great resources on testing in production:





Noah Zoschke on August 7th 2018

Segment is a hub for a tremendous amount of data. It processes peaks of 230,000 events per second inbound, and 280,000 events per second outbound between more than 200 integration partners. You may think of Segment as black box for delivering all this data. You send data once to its tracking API, and it coordinates translating data and delivering it to many destinations.

When everything works perfectly, you don’t need to open the black box. Unfortunately, the world of data delivery at scale is far from perfect. Think of all the software, networks, databases and engineers behind Segment and our partners. You can imagine at any given time a database is failing, a network is unreachable, etc.

Segment engineering has taken lengths to operate reliably in this environment. Our latest efforts have been around visibility into the HTTP response codes from destinations. We spent the last few months adding hooks to measure everything from the volume of events, how quickly they are sent to destinations, and what HTTP status code and error response body, if any, occurred for every request.

This instrumentation is ultimately for Segment users to see into the black box to answer one question: how do delivery challenges affect my data? To this end, we built an event delivery dashboard around the data.

It turns out the data in aggregate is also tremendously useful to cloud service engineers at Segment and our destination partners alike. Looking at HTTP status codes alone has unveiled lots of insights on how data flows between services and how we can maximize delivery rates.

I’d like to share some of the things we have found in a day of HTTP responses that we see at Segment.


First the good news. 92.6% of events — 24.4 billion on the sample day — are delivered on the first attempt. In this happy path, Segment makes an HTTP request to a destination and receives a HTTP 2XX success status code response.

Terminal problems

Next, the bad news. 5.5% of events — 1.4 billion on the sample day — never make it to their destination. In this path, Segment makes an HTTP request to a destination and generally receives an HTTP 4XX client error status code response. These codes indicate the client — either Segment or the user it represents — made an error that the server can’t reconcile.

What’s the password?

The most common client errors Segment sees are HTTP 401 Unauthorized and HTTP 403 Forbidden on 3.8% of requests. In this case, the server doesn’t recognize the given username, password or token, and can’t accept any data. Neither Segment nor the destination server can resolve this automatically for a given request.

This is either due to wrong credentials configured in Segment in the first place or credentials that expired on a destination. Segment always attempts to send the latest events just in case the problem was resolved on either side.

No comprende

The next most common client error is HTTP 400 Bad Request on 0.51% of requests. In this case, the received the request payload but couldn’t make sense of it. These are generally validation errors. Again, Segment and the destination can’t do anything about it automatically, except show instructive error messages to the user.

Next steps…

These errors are considered fatal, but the qualitative data can inform ways to improve delivery over time. The first big step here was building the event delivery dashboard to surface these issue to users.

For authentication errors, a logical next step would be to send notifications when delivery begins to encounter 401 errors. We can also imagine a mechanism to disable event delivery after a threshold to spare partners the request overhead.

For validation errors, visibility into requests per-customer and per-destination can inform improvements to the Segment integration code. Segment can review partner API requirements and not attempt to deliver data it can determine is bad ahead of time, or automatically massage data to conform to the destination API.

Ephemeral problems

Now the interesting challenge… a large class of HTTP problems on the internet are not fatal. In fact, most of the HTTP 5XX server error status codes reflect an unexpected error and imply that the system may accept data at a later time, as does one critical 4XX status code.


The largest class of temporary problems seen by Segment are of the HTTP 429 -- Too Many Requests class. It’s not hard to imagine why… 

Segment itself has very high rate limits with the aim of accepting all of the data a customer throws at it. Not every downstream destination has the same capabilities, particularly those that are systems of record. Intercom, Zendesk, and Mailchimp, for example, all have well-designed and lower API rate limits.

Segment has to mediate between the customer data volume and the destination rate limits. A combination of internal metering, request batching, and retry with backoff get most of the data through.

But about 7.3% of requests — 2.1 billion a day — encounter a 429 response along the way. Retries help a lot, but if a customer is simply over their limits consistently over a long enough time frame, Segment has no choice but to drop some messages. At least we can quantify how much this is happening with the delivery data and report this to a customer.

Out of service

The next largest class of error — 1.3% of requests — is from destination servers. Segment often sees servers respond with an error like:

  • HTTP 502 Bad Gateway

  • HTTP 504 Gateway Timeout

  • HTTP 500 Internal Server Error

  • HTTP 503 Service Unavailable

Perhaps it’s a temporary glitch for a single request, or perhaps the destination service is experiencing an outage. But every day Segment encounters 371 million of these error responses.

Unreliable channel

Finally, 1.1% of requests error out because of the network layer. At scale, Segment sees a noticeable number of network errors, such as:

  • ENOTFOUND — hostname not found

  • ECONNREFUSED — connection refused

  • ECONNRESET — connection reset

  • ECONNABORTED — connection aborted

  • EHOSTUNREACH — host unreachable

  • EAI_AGAIN — DNS lookup timeout

Maybe it’s due to bad host, flaky network, or DNS error.

If at first, you don’t succeed…

As seen above, a significant number of HTTP or network status codes indicate transient problems. When Segment encounters these, it retries delivery over a 4-hour window with exponential backoff. We can see that this retry strategy is successful. We go from 92.6% success on the first attempt to 93.9% success after ten attempts, an extra 163 million events delivered, all thanks to the destination server sending proper HTTP status codes.

WTF webhooks

Finally, we see some bizarre errors. A very popular destination is webhooks — arbitrary HTTP addresses to POST events. The error codes we see from these destinations imply webhooks might not always follow best practices.

We see every number from 1  to 101 as HTTP status code, which is far outside the HTTP status code specification. Perhaps this is someone testing Segment delivery rates themselves?

We see HTTP 418 I'm a teapot which is in the HTTP spec as an April fools joke.


Unfortunately, all of these strange responses are considered terminal errors by Segment. Sorry, webhooks!


It’s literally impossible to achieve 100% delivery on the first attempt over the internet. Transient network errors, unexpected server errors, and rate limiting all present challenges that add up to significant problems at scale. On top of that, encryption, authentication and data validation add another layer of challenges for perfect machine-to-machine delivery.

Retries are the primary strategy to improve delivery, and a retry strategy can only be as good as the destination service response codes.

As a service provider, returning status codes like 400, 403 or 501 is a powerful signal that Segment has no choice but to drop data. Inversely, returning status codes like 500, 502, and 504 is strong hint that Segment should try again. And 429 — rate limit exceeded — is an explicit sign that Segment needs to retry later.

If you’re running cloud service APIs or writing webhooks, think carefully about HTTP status codes. User data depends on it!

For more information about cloud service APIs, visit Segment’s Destinations catalog at https://segment.com/catalog#integrations/all

Tamar Ben-Shachar on August 1st 2018

Segment loads billions of rows of arbitrary events into our customers’ data warehouses every single day. How do we test a change that can corrupt only one field in millions, across thousands of warehouses? How can we verify the output when we don’t even control the input?

We recently migrated all our customers to a new data pipeline without any major customer impact. Through this process we had to solidify our testing story, to the point where we were able to compare billions of entries for row-per-row parity. In this post, I’d like to share a few of the techniques we used to make that process both fast and efficient.

Warehouses Overview 

Before going into our testing strategy, it’s worth understanding a little about the data we process. Segment has a single API for sending data that will be loaded into a customer’s data warehouse. Each API call generates what we call an ‘event’. We transform these events into rows in a database. 

Below, you can see an example of how an event from a website source gets transformed into a database row.

Note that the information in the event depends on the customer. We accept any number of properties and any type of value: number, string, timestamp, etc.


Our first iteration of the warehouse pipeline (v1) was just a scheduler and a task to process the data. This worked well until our customer base increased in both number and volume. At that point the simplicity caught up with us and we had to make a change.

We came up with a new architecture (v2) that gave us greater visibility into our pipeline and let us scale at more granular levels. Though this was the right way forward, it was a completely new pipeline, and we needed a migration plan. 

Our goal was to switch customers to the new pipeline without them noticing. All data should be written to the database exactly the same way in the v1 pipeline as in the v2 pipeline. If we found a bug in how we processed data in v1, we kept that bug in v2. Customers expect to receive our data in a certain way and have built tooling around it. We can’t just change course and expect all our customers to change their tooling accordingly.  

We also want minimize any potential for new bugs. When we have a bug with how we load data, deploying a fix will only fix future data. Previous data has already been loaded into a customer’s database that we don’t control. To fix the bad rows, we have to re-run the pipeline over the old events that were incorrectly loaded.

Testing Strategy 

In other parts of the Segment infrastructure, our normal testing strategies consist of some combination of code reviews and tests.

Code reviews are very useful but don’t guarantee that something won’t slip through the cracks. Unit and integration tests are great for testing basic functionality and edge cases. However, customers send us too much variance in data and we can’t exhaustively test, or know about, every case.  

For out first pass at testing the new pipeline, we resorted to manually sending events and looking for any data anomalies. But this quickly fell short. As we started processing more and more data through our pipeline, we realized we needed a new solution.

Large Scale Testing

We needed to come up with a testing solution that gave us confidence.

To solve this problem, our approach was straightforward: build a system to do what we were trying to do manually. In essence, we wanted to run the same dataset through both the v1 and v2 warehouse pipelines and make sure the result from both pipelines is exactly the same. We call this service Warehouse QA.

When a pull request becomes ready for testing, a request is made on that pull request to trigger a webhook, which begins the QA run.

Let’s walk through a concrete example of how this works. 

Step 1: Send a Request from Github

We trigger a QA request by adding a comment on a pull request in a format the service understands. We can request a specific dataset to run through the pipelines to verify if a bug is fixed or test the proposed changes under a certain type of load. If a dataset is not specified, we use one of our defaults. 

Step 2: Process the Request

Once the service receives a new request, it starts runs of both the v1 and v2 pipelines over the chosen dataset. Each pipeline writes data under a unique schema in our test warehouse so we can easily query the results. 

Step 3: Audit Results

The most important step is the validation or audit. We check every table, column, and value that was loaded from the v2 pipeline and make sure it matches what was loaded from the v1 pipeline. 

Here is the struct that represents the results for a given run. We first check that the exact same set of tables were created by both pipelines. 

For each table, we then dive deeper to populate the fields in the table struct below. This struct compares a given table from the v1 run to the table created by the v2 run. 

Note that we are checking more than just the counts of the two runs. Counts can give false positives if there are both extra rows and missing rows. Counts also don’t find differences in the field values themselves. This piece is critical to check since we have to be sure we aren’t processing data differently in the new pipeline.

Step 4: Reporting

Now that we’ve compared what we loaded from the v1 run with the v2 run we need to be able to report the data succinctly. When the run request is complete, we post the following overview on the pull request:

If we open the results file under detailed results above, this is what we see for each table that was identical under v1 and v2:

If the pipelines outputted different results, it looks like this:

Step 5: Debug Differing Results

At this point, we have all the information we need to figure out why the two pipelines wrote different values.

Below is a comparison of data that differed between the v1 and v2 pipelines, taken from the results of a QA run. Each row corresponds to a row in our test warehouse. The fields are the id of the row in the table, the column that differed in the two pipelines, and the differing values, respectively. The red value is the value from the v1 pipeline and the green from v2. 

Why are timestamps getting dropped in v1 here? We know that timestamps aren’t all getting dropped, and confirmed this with tests. It turns out that dates before the epoch (January 1st, 1970) were getting dropped because we converted timestamps to integers using Golang’s .Unix() method and dropped values <= 0.

Now that we found the root cause we can alter v2 accordingly. We then run QA again with the fix, and see that it passes. 


We found this tool so valuable that we still use it today, even though the migration to the v2 pipeline is complete. The QA system now compares the code running in production to the code on a pull request. We’ve even integrated with Github status checks. A QA run fo each warehouse type we support is automatically run on every pull request. 


Warehouse QA was crucial to our successful data migration. It allowed us to test the unknown at scale and find any and all differences without any manual intervention. It continues to allow us to deploy quickly and confidently with a small team by thoroughly testing hundreds of thousands of events.

Alexandra Noonan on July 10th 2018

Unless you’ve been living under a rock, you probably already know that microservices is the architecture du jour. Coming of age alongside this trend, Segment adopted this as a best practice early-on, which served us well in some cases, and, as you’ll soon learn, not so well in others.

Briefly, microservices is a service-oriented software architecture in which server-side applications are constructed by combining many single-purpose, low-footprint network services. The touted benefits are improved modularity, reduced testing burden, better functional composition, environmental isolation, and development team autonomy. The opposite is a Monolithic architecture, where a large amount of functionality lives in a single service which is tested, deployed, and scaled as a single unit.

In early 2017 we reached a tipping point with a core piece of Segment’s product. It seemed as if we were falling from the microservices tree, hitting every branch on the way down. Instead of enabling us to move faster, the small team found themselves mired in exploding complexity. Essential benefits of this architecture became burdens. As our velocity plummeted, our defect rate exploded.

Eventually, the team found themselves unable to make headway, with 3 full-time engineers spending most of their time just keeping the system alive. Something had to change. This post is the story of how we took a step back and embraced an approach that aligned well with our product requirements and needs of the team.

Why Microservices worked

Segment’s customer data infrastructure ingests hundreds of thousands of events per second and forwards them to partner APIs, what we refer to as server-side destinations. There are over one hundred types of these destinations, such as Google Analytics, Optimizely, or a custom webhook. 

Years back, when the product initially launched, the architecture was simple. There was an API that ingested events and forwarded them to a distributed message queue. An event, in this case, is a JSON object generated by a web or mobile app containing information about users and their actions. A sample payload looks like the following:

As events were consumed from the queue, customer-managed settings were checked to decide which destinations should receive the event. The event was then sent to each destination’s API, one after another, which was useful because developers only need to send their event to a single endpoint, Segment’s API, instead of building potentially dozens of integrations. Segment handles making the request to every destination endpoint.

If one of the requests to a destination fails, sometimes we’ll try sending that event again at a later time. Some failures are safe to retry while others are not. Retry-able errors are those that could potentially be accepted by the destination with no changes. For example, HTTP 500s, rate limits, and timeouts. Non-retry-able errors are requests that we can be sure will never be accepted by the destination. For example, requests which have invalid credentials or are missing required fields.

At this point, a single queue contained both the newest events as well as those which may have had several retry attempts, across all destinations, which resulted in head-of-line blocking. Meaning in this particular case, if one destination slowed or went down, retries would flood the queue, resulting in delays across all our destinations.

Imagine destination X is experiencing a temporary issue and every request errors with a timeout. Now, not only does this create a large backlog of requests which have yet to reach destination X, but also every failed event is put back to retry in the queue. While our systems would automatically scale in response to increased load, the sudden increase in queue depth would outpace our ability to scale up, resulting in delays for the newest events. Delivery times for all destinations would increase because destination X had a momentary outage. Customers rely on the timeliness of this delivery, so we can’t afford increases in wait times anywhere in our pipeline.

To solve the head-of-line blocking problem, the team created a separate service and queue for each destination. This new architecture consisted of an additional router process that receives the inbound events and distributes a copy of the event to each selected destination. Now if one destination experienced problems, only it’s queue would back up and no other destinations would be impacted. This microservice-style architecture isolated the destinations from one another, which was crucial when one destination experienced issues as they often do.

The Case for Individual Repos

Each destination API uses a different request format, requiring custom code to translate the event to match this format. A basic example is destination X requires sending birthday as traits.dob in the payload whereas our API accepts it as traits.birthday. The transformation code in destination X would look something like this:

Many modern destination endpoints have adopted Segment’s request format making some transforms relatively simple. However, these transforms can be very complex depending on the structure of the destination’s API. For example, for some of the older and most sprawling destinations, we find ourselves shoving values into hand-crafted XML payloads.

Initially, when the destinations were divided into separate services, all of the code lived in one repo. A huge point of frustration was that a single broken test caused tests to fail across all destinations. When we wanted to deploy a change, we had to spend time fixing the broken test even if the changes had nothing to do with the initial change. In response to this problem, it was decided to break out the code for each destination into their own repos. All the destinations were already broken out into their own service, so the transition was natural.

The split to separate repos allowed us to isolate the destination test suites easily. This isolation allowed the development team to move quickly when maintaining destinations.

Scaling Microservices and Repos

As time went on, we added over 50 new destinations, and that meant 50 new repos. To ease the burden of developing and maintaining these codebases, we created shared libraries to make common transforms and functionality, such as HTTP request handling, across our destinations easier and more uniform.

For example, if we want the name of a user from an event, event.name() can be called in any destination’s code. The shared library checks the event for the property key name and Name. If those don’t exist, it checks for a first name, checking the properties firstName, first_name, and FirstName. It does the same for the last name, checking the cases and combining the two to form the full name.

The shared libraries made building new destinations quick. The familiarity brought by a uniform set of shared functionality made maintenance less of a headache.

However, a new problem began to arise. Testing and deploying changes to these shared libraries impacted all of our destinations. It began to require considerable time and effort to maintain. Making changes to improve our libraries, knowing we’d have to test and deploy dozens of services, was a risky proposition. When pressed for time, engineers would only include the updated versions of these libraries on a single destination’s codebase.

Over time, the versions of these shared libraries began to diverge across the different destination codebases. The great benefit we once had of reduced customization between each destination codebase started to reverse. Eventually, all of them were using different versions of these shared libraries. We could’ve built tools to automate rolling out changes, but at this point, not only was developer productivity suffering but we began to encounter other issues with the microservice architecture.

The additional problem is that each service had a distinct load pattern. Some services would handle a handful of events per day while others handled thousands of events per second. For destinations that handled a small number of events, an operator would have to manually scale the service up to meet demand whenever there was an unexpected spike in load.

While we did have auto-scaling implemented, each service had a distinct blend of required CPU and memory resources, which made tuning the auto-scaling configuration more art than science.

The number of destinations continued to grow rapidly, with the team adding three destinations per month on average, which meant more repos, more queues, and more services. With our microservice architecture, our operational overhead increased linearly with each added destination. Therefore, we decided to take a step back and rethink the entire pipeline.

Ditching Microservices and Queues

The first item on the list was to consolidate the now over 140 services into a single service. The overhead from managing all of these services was a huge tax on our team. We were literally losing sleep over it since it was common for the on-call engineer to get paged to deal with load spikes.

However, the architecture at the time would have made moving to a single service challenging. With a separate queue per destination, each worker would have to check every queue for work, which would have added a layer of complexity to the destination service with which we weren’t comfortable. This was the main inspiration for Centrifuge. Centrifuge would replace all our individual queues and be responsible for sending events to the single monolithic service.

Moving to a Monorepo

Given that there would only be one service, it made sense to move all the destination code into one repo, which meant merging all the different dependencies and tests into a single repo. We knew this was going to be messy.

For each of the 120 unique dependencies, we committed to having one version for all our destinations. As we moved destinations over, we’d check the dependencies it was using and update them to the latest versions. We fixed anything in the destinations that broke with the newer versions.

With this transition, we no longer needed to keep track of the differences between dependency versions. All our destinations were using the same version, which significantly reduced the complexity across the codebase. Maintaining destinations now became less time consuming and less risky.

We also wanted a test suite that allowed us to quickly and easily run all our destination tests. Running all the tests was one of the main blockers when making updates to the shared libraries we discussed earlier.

Fortunately, the destination tests all had a similar structure. They had basic unit tests to verify our custom transform logic was correct and would execute HTTP requests to the partner’s endpoint to verify that events showed up in the destination as expected.

Recall that the original motivation for separating each destination codebase into its own repo was to isolate test failures. However, it turned out this was a false advantage. Tests that made HTTP requests were still failing with some frequency. With destinations separated into their own repos, there was little motivation to clean up failing tests. This poor hygiene led to a constant source of frustrating technical debt. Often a small change that should have only taken an hour or two would end up requiring a couple of days to a week to complete.

Building a Resilient Test Suite

The outbound HTTP requests to destination endpoints during the test run was the primary cause of failing tests. Unrelated issues like expired credentials shouldn’t fail tests. We also knew from experience that some destination endpoints were much slower than others. Some destinations took up to 5 minutes to run their tests. With over 140 destinations, our test suite could take up to an hour to run.

To solve for both of these, we created Traffic Recorder. Traffic Recorder is built on top of yakbak, and is responsible for recording and saving destinations’ test traffic. Whenever a test runs for the first time, any requests and their corresponding responses are recorded to a file. On subsequent test runs, the request and response in the file is played back instead requesting the destination’s endpoint. These files are checked into the repo so that the tests are consistent across every change. Now that the test suite is no longer dependent on these HTTP requests over the internet, our tests became significantly more resilient, a must-have for the migration to a single repo.

I remember running the tests for every destination for the first time, after we integrated Traffic Recorder. It took milliseconds to complete running the tests for all 140+ of our destinations. In the past, just one destination could have taken a couple of minutes to complete. It felt like magic.

Why a Monolith works

Once the code for all destinations lived in a single repo, they could be merged into a single service. With every destination living in one service, our developer productivity substantially improved. We no longer had to deploy 140+ services for a change to one of the shared libraries. One engineer can deploy the service in a matter of minutes.

The proof was in the improved velocity. In 2016, when our microservice architecture was still in place, we made 32 improvements to our shared libraries. Just this year we’ve made 46 improvements. We’ve made more improvements to our libraries in the past 6 months than in all of 2016.

The change also benefited our operational story. With every destination living in one service, we had a good mix of CPU and memory-intense destinations, which made scaling the service to meet demand significantly easier. The large worker pool can absorb spikes in load, so we no longer get paged for destinations that process small amounts of load.

Trade Offs

Moving from our microservice architecture to a monolith overall was huge improvement, however, there are trade-offs:

  1. Fault isolation is difficult. With everything running in a monolith, if a bug is introduced in one destination that causes the service to crash, the service will crash for all destinations. We have comprehensive automated testing in place, but tests can only get you so far. We are currently working on a much more robust way to prevent one destination from taking down the entire service while still keeping all the destinations in a monolith.

  2. In-memory caching is less effective. Previously, with one service per destination, our low traffic destinations only had a handful of processes, which meant their in-memory caches of control plane data would stay hot. Now that cache is spread thinly across 3000+ processes so it’s much less likely to be hit. We could use something like Redis to solve for this, but then that’s another point of scaling for which we’d have to account. In the end, we accepted this loss of efficiency given the substantial operational benefits.

  3. Updating the version of a dependency may break multiple destinations. While moving everything to one repo solved the previous dependency mess we were in, it means that if we want to use the newest version of a library, we’ll potentially have to update other destinations to work with the newer version. In our opinion though, the simplicity of this approach is worth the trade-off. And with our comprehensive automated test suite, we can quickly see what breaks with a newer dependency version.


Our initial microservice architecture worked for a time, solving the immediate performance issues in our pipeline by isolating the destinations from each other. However, we weren’t set up to scale. We lacked the proper tooling for testing and deploying the microservices when bulk updates were needed. As a result, our developer productivity quickly declined.

Moving to a monolith allowed us to rid our pipeline of operational issues while significantly increasing developer productivity. We didn’t make this transition lightly though and knew there were things we had to consider if it was going to work.

  1. We needed a rock solid testing suite to put everything into one repo. Without this, we would have been in the same situation as when we originally decided to break them apart. Constant failing tests hurt our productivity in the past, and we didn’t want that happening again.

  2. We accepted the trade-offs inherent in a monolithic architecture and made sure we had a good story around each. We had to be comfortable with some of the sacrifices that came with this change.

When deciding between microservices or a monolith, there are different factors to consider with each. In some parts of our infrastructure, microservices work well but our server-side destinations were a perfect example of how this popular trend can actually hurt productivity and performance. It turns out, the solution for us was a monolith.

The transition to a monolith was made possible by Stephen Mathieson, Rick Branson, Achille Roussel, Tom Holmes, and many more.

Special thanks to Rick Branson for helping review and edit this post at every stage.

See what Segment is all about...

Sign up for a free workspace or catch a demo here 👉

Calvin French-Owen on May 23rd 2018

Today, we’re excited to share the architecture for Centrifuge–Segment’s system for reliably sending billions of messages per day to hundreds of public APIs. This post explores the problems Centrifuge solves, as well as the data model we use to run it in production.

The Centrifuge problem

At Segment, our core product collects, processes, and delivers hundreds of thousands of analytics events per second. These events consist of user actions like viewing a page, buying an item from Amazon, or liking a friend’s playlist. No matter what the event is, it’s almost always the result of some person on the internet doing something.

We take these incoming events and forward them to hundreds of downstream endpoints like Google Analytics, Salesforce, and per-customer Webhooks.

At any point in time, dozens of these endpoints will be in a state of failure. We’ll see 10x increases in response latency, spikes in 5xx status codes, and aggressive rate limiting for single large customers. 

To give you a flavor, here are the sorts of latencies and uptimes I pulled from our internal monitoring earlier today. 

In the best case, these API failures cause delays. In the worst case, data loss.

As it turns out, ‘scheduling’ that many requests in a faulty environment is a complex problem. You have to think hard about fairness (what data should you prioritize?), buffering semantics (how should you enqueue data?), and retry behavior (does retrying now add unwanted load to the system?).

Across all of the literature, we couldn't find a lot of good 'prior art' for delivering messages reliably in high-failure environments. The closest thing is network scheduling and routing, but that discipline has very different strategies concerning buffer allocation (very small) and backpressure strategies (adaptive, and usually routing to a single place).

So we decided to build our own general-purpose, fully distributed job scheduler to schedule and execute HTTP requests reliably. We've called it Centrifuge.

You can think of Centrifuge as the layer that sits between our infrastructure and the outside world–it's the system responsible for sending data to all of our customers destinations. When third-party APIs fail, Centrifuge is there to absorb the traffic. 

Under normal operation, Centrifuge has three responsibilities: it delivers messages to third-party endpoints, retries messages upon failure, and archives any undelivered messages.

We've written this first post as a guide to understand the problems Centrifuge solves, its data model, and the building blocks we’ve used to operate it in production. In subsequent posts, we’ll share how we’ve verified the system’s correctness and made it blindingly fast.

Let’s dive in. 

When queues stop working

Before discussing Centrifuge itself, you might be thinking "why not just use a queue here? Building a fully distributed job scheduler seems a bit overdone”.

We've asked ourselves the same question. We already use Kafka extensively at Segment (we're passing nearly 1M messages per second through it), and it’s been the core building block of all of our streaming pipelines. 

The problem with using any sort of queue is that you are fundamentally limited in terms of how you access data. After all, a queue only supports two operations (push and pop).

To see where queues break down, let's walk through a series of queueing topologies that we’ve implemented at Segment.

Architecture 1: a single queue

To start, let’s first consider a naive approach. We can run a group of workers that read jobs from a single queue. 

Workers will read a single message off the queue, send to whatever third-party APIs are required and then acknowledge the message. It seems like it should protect us from failure, right?

This works okay for awhile, but what happens when we start seeing a single endpoint get slow? Unfortunately, it creates backpressure on the entire message flow.

Clearly, this isn’t ideal. If a single endpoint can bring down the entire pipeline, and each endpoint has an hour-long downtime each year (99.9% available), then with 200+ endpoints, we’ll be seeing hour-long outages once per day.

Architecture 2: queues per destination

After seeing repeated slowdowns on our ingestion pipeline, we decided to re-architect. We updated our queueing topology to route events into separate queues based upon the downstream endpoints they would hit.

To do this, we added a router in front of each queue. The router would only publish messages to a queue destined for a specific API endpoint.

Suppose you had three destinations enabled: Google Analytics, Mixpanel, and Salesforce. The router would publish three messages, one to each dedicated queue for Google Analytics, Mixpanel, and Salesforce, respectively.

The benefit of this approach is that a single failing API will only affect messages bound for a single endpoint (which is what we want!).

Unfortunately, this approach has problems in practice. If we look at the distribution of messages which should be delivered to a single endpoint, things become a little more nuanced.

Segment is a large, multi-tenant system, so some sources of data will generate substantially more load than others. As you might imagine, among our customer base, this follows a fairly consistent power law:

When that translates to messages within our queues, the breakdown looks more like this:

In this case, we have data for customers A, B, and C, all trying to send to the same downstream endpoint. Customer A dominates the load, but B and C have a handful of calls mixed in.

Let’s suppose that the API endpoint we are sending to is rated to 1,000 calls per second, per customer. When the endpoint receives more than 1,000 calls in a second for a given customer API key, it will respond with a 429 HTTP Header (rate limit exceeded). 

Now let’s assume that customer A is trying to send 50,000 messages to the API. Those messages are all ordered contiguously in our queue.

At this point we have a few options:

  • we can try and send a hard cap of 1,000 messages per second, but this delays traffic for B and C by 50 seconds.

  • we can try and send more messages to the API for customer A, but we will see 429 (rate limit exceeded) errors. We’ll want to retry those failed messages, possibly causing more slowdowns for B and C.

  • we can detect that we are nearing a rate-limit after sending 1,000 messages for customer A in the first second, so we can then copy the next 49,000 messages for customer A into a dead-letter queue, and allow the traffic for B and C to proceed.

None of these options are ideal. We’ll either end up blocking the queue for all customers in the case where a single customer sends a large batch of data, or we’ll end up copying terabytes of data between dead-letter queues.

Ideal state: queues per <source, destination>

Instead, we want an architecture that looks more like the following diagram, where we have separate queues per combination of customer and endpoint. This architecture gives us much better isolation, as well as the ability to dynamically adjust throughput on a per-customer basis.

However, in a large, multi-tenant system, like Segment, this number of queues becomes difficult to manage.

We have hundreds of thousands of these source-destination pairs. Today, we have 42,000 active sources of data sending to an average of 2.1 downstream endpoints. That’s 88,000 total queues that we’d like to support (and we’re growing quickly). 

To implement per source-destination queues with full isolation, we’d need hundreds of thousands of different queues. Across Kafka, RabbitMQ, NSQ, or Kinesis–we haven’t seen any queues which support that level of cardinality with simple scaling primitives. SQS is the only queue we’ve found which manages to do this, but is totally cost prohibitive. We need a new primitive to solve this problem of high-cardinality isolation.

Getting to “virtual” queues

We now have our ideal end-state: tens of thousands of little queues. Amongst those queues, we can easily decide to consume messages at different rates from Customers A, B, and C.

But when we start to think about implementation, how do we actually manage that many queues? 

We started with a few core requirements for our virtual queue system:

1) Provide per-customer isolation

First and foremost, we need to provide per-customer isolation. One customer sending a significant amount of failing traffic shouldn’t slow down any other data delivery. Our system must absorb failures without slowing the global delivery rate. 

2) Allow us to re-order messages without copying terabytes of data

Our second constraint is that our system must be able to quickly shuffle its delivery order without copying terabytes of data over the network. 

In our experience working with large datasets, having the ability to immediately re-order messages for delivery is essential. We’ve frequently run into cases which create large backlogs in data processing, where our consumers are spinning on a set of consistently failing messages.

Traditionally there are two ways to handle a large set of bad messages. The first is to stop your consumers and retry the same set of messages after a backoff period. This is clearly unacceptable in a multi-tenant architecture, where valid messages should still be delivered.

The second technique is to publish failed messages to a dead-letter queue and re-consume them later. Unfortunately, re-publishing messages to dead-letter queues or ‘tiers’ of topics with copies of the same event incurs significant storage and network overhead. 

In either case, if your data is sitting in Kafka–the delivery order for your messages is effectively ‘set’ by the producer to the topic:

We want the ability to quickly recover from errors without having to shuffle terabytes of data around the network. So neither of these approaches works efficiently for us. 

3) Evenly distribute the workload between many different workers

Finally, we need a system which cleanly scales as we increase our event volume. We don’t want to be continually adding partitions or doing additional sharding as we add customers. Our system should scale out horizontally based upon the throughput of traffic that we need. 

Data in Centrifuge

By this point, we have a good idea of the problems that Centrifuge solves (reliable message delivery), the issues of various queueing topologies, and our core requirements. So let’s look at the Centrifuge data layer to understand how we’ve solved for the constraints we just listed above.

The core delivery unit of Centrifuge is what we call a job

Jobs require both a payload of data to send, as well an endpoint indicating where to send the data. You can optionally supply headers to govern things like retry logic, message encoding, and timeout behavior. 

In our case, a job is a single event which should be delivered to a partner API. To give you an idea of what jobs look like in practice, here’s a sample Segment job:

Looking back at our requirements, we want a way of quickly altering the delivery order for our jobs, without having to create many copies of the jobs themselves. 

A queue won’t solve this problem in-place for us. Our consumer would have to read and then re-write all of the data in our new ordering. But a database, on the other hand, does

By storing the execution order inside a relational database, we can immediately change the quality of service by running a single SQL statement. 

Similarly, whenever we want to change the delivery semantics for our messages, we don’t have to re-shuffle terabytes of data or double-publish to a new datastore. Instead, we can just deploy a new version of our service, and it can start using the new queries right away. 

Using a database gives us the flexibility in execution that queues are critically lacking.

For that reason, we decided to store Centrifuge data inside Amazon’s RDS instances running on MySQL. RDS gives us managed datastores, and MySQL provides us with the ability to re-order our jobs. 

The database-as-a-queue

The Centrifuge database model has a few core properties that allow it to perform well:

  • immutable rows: we don’t want to be frequently updating rows, and instead be appending new rows whenever new states are entered. We’ve modeled all job execution plans as completely immutable, so we never run updates in the database itself.

  • no database JOINs: rather than needing a lot of coordination, with locks across databases or tables, Centrifuge need only query data on a per job basis. This allows us to massively parallelize our databases since we never need to join data across separate jobs.

  • predominantly write-heavy, with a small working set: because Centrifuge is mostly accepting and delivering new data, we don’t end up reading from the database. Instead, we can cache most new items in memory, and then age entries out of cache as they are delivered.

To give you a sense of how these three properties interact, let’s take a closer look at how jobs are actually stored in our Centrifuge databases.

The jobs table

First, we have the jobs table. This table is responsible for storing all jobs and payloads, including the metadata governing how jobs should be delivered.

While the endpoint, payload, and headers fields govern message transmission, the expire_at field is used to indicate when a given job should be archived.

By splitting expire_at into a separate field, our operations team can easily adjust if we ever need to flush a large number of failing messages to S3, so that we can process them out-of-band. 

Looking at the indexes for the jobs table, we’ve been careful to minimize the overhead of building and maintaining indexes on each field. We keep only a single index on the primary key.

The jobs table primary key is a KSUID, which means that our IDs are both are k-sortable by timestamp as well as globally unique. This allows us to effectively kill two birds with one stone–we can query by a single job ID, as well as sort by the time that the job was created with a single index. 

Since the median size of the payload and settings for a single job is about 5kb (and can be as big as 750kb), we’ve done our best to limit reads from and updates to the jobs table. 

Under normal operation, the jobs table is immutable and append-only. The golang process responsible for inserting jobs (which we call a Director) keeps a cached version of the payloads and settings in-memory. Most of the time, jobs can be immediately expired from memory after they are delivered, keeping our overall memory footprint low.

In production, we set our jobs to expire after 4 hours, with an exponential backoff strategy.

Of course, we also want to keep track of what state each job is in, whether it is waiting to be delivered, in the process of executing, or awaiting retry. For that, we use a separate table, the job_state_transitions table. 

The job state transitions table

The job_state_transitions table is responsible for logging all of the state transitions a single job may pass through.

Within the database, the job state machine looks like this:

A job first enters with the awaiting_scheduling state. It has yet to be executed and delivered to the downstream endpoint.

From there, a job will begin executing, and the result will transition to one of three separate states. 

If the job succeeds (and receives a 200 HTTP response from the endpoint), Centrifuge will mark the job as succeeded. There’s nothing more to be done here, and we can expire it from our in-memory cache.  

Similarly, if the job fails (in the case of a 400 HTTP response), then Centrifuge will mark the job as discarded. Even if we try to re-send the same job multiple times, the server will reject it. So we’ve reached another terminal state.

However, it’s possible that we may hit an ephemeral failure like a timeout, network disconnect, or a 500 response code. In this case, retrying can actually bring up our delivery rate for the data we collect (we see this happen across roughly 1.5% of the data for our entire userbase), so we will retry delivery.

Finally, any jobs which exceed their expiration time transition from awaiting_retry to archiving. Once they are successfully stored on S3, the jobs are finally transitioned to a terminal archived state.

If we look deeper into the transitions, we can see the fields governing this execution:

Like the jobs table, rows in the job_state_transitions are also immutable and append-only. Every time a new attempt is made, the attempt number is increased. Upon job execution failure, the retry is scheduled with a retry_at time by the retry behavior specified in the job itself. 

In terms of indexing strategy, we keep a composite index on two fields: a monotonically incrementing ID, as well the ID of the job that is being executed.

You can see here in one of our production databases that the first index in the sequence is always on the job_id, which is guaranteed to be globally unique. From there, the incrementing ID ensures that each entry in the transitions table for a single job’s execution is sequential. 

To give you a flavor of what this looks like in action, here’s a sample execution trace for a single job pulled from production. 

Notice that the job first starts in the awaiting-scheduling state before quickly transitioning to its first delivery attempt. From there, the job consistently fails, so it oscillates between executing and awaiting-retry

While this trace is certainly useful for internal debugging, the main benefit it provides is the ability to actually surface the execution path for a given event to the end customer. (Stay tuned for this feature, coming soon!)

Interacting with the database: the Director

Up until this point, we’ve focused exclusively on the data model for our jobs. We’ve shown how they are stored in our RDS instance, and how the jobs table and jobs_state_transitions table are both populated. 

But we still need to understand the service writing data to the database and actually executing our HTTP requests. We call this service the Centrifuge Director

Traditionally, web-services have many readers and writers interacting with a single, centralized database. There is a stateless application tier, which is backed by any number of sharded databases.

Remember though, that Segment’s workload looks very different than a traditional web-service. 

Our workload is extremely write-heavy, has no reads, and requires no JOINs or query coordination across separate jobs. Instead, our goal is to minimize the contention between separate writers to keep the writes as fast as possible. 

To do that, we’ve adopted an architecture where a single Director interacts with a given database. The Director manages all of its caching, locks, and querying in-process. Because the Director is the sole writer, it can manage all of its cache invalidations with zero-coordination. 

The only thing a Director needs to globally coordinate is to which particular database it is writing. We call the attached database a JobDB, and what follows is a view into the architecture for how Directors coordinate to acquire and send messages to a JobDB.

When a Director first boots up, it follows the following lifecycle:

Acquire a spare JobDB via Consul – to begin operating; a Director first does a lookup and acquires a consul session on the key for a given JobDB. If another Director already holds the lock, the current Director will retry until it finds an available spare JobDB. 

Consul sessions ensure that a given database is never concurrently written to by multiple Directors. They are mutually exclusive and held by a single writer. Sessions also allow us to lock an entire keyspace so that a director can freely update the status for the JobDB in Consul while it continues to hold the lock. 

Connect to the JobDB, and create new tables – once a Director has connected to a spare JobDB, it needs to create the necessary tables within the connected DB. 

Rather than use an ORM layer, we’ve used the standard database/sql golang interface, backed by the go-sql-driver/mysql implementation. Many of these queries and prepared statements are generated via go:generate, but a handful are handwritten.

Begin listening for new jobs and register itself in Consul – after the Director has finished creating the necessary tables, it registers itself in Consul so that clients may start sending the Director traffic. 

Start executing jobs – once the Director is fully running, it begins accepting jobs. Those jobs are first logged to the paired JobDB; then the Director begins delivering each job to its specified endpoint.

Now that we understand the relationship between Directors and JobDBs, we can look back at the properties of the system (immutable, extremely write-heavy with a small working set, no database JOINs), and understand how Centrifuge is able to quickly absorb traffic.

Under normal operation, the Director rarely has to read from the attached JobDB. Because all jobs are immutable and the Director is the sole writer, it can cache all jobs in-memory and expire them immediately once they are delivered. The only time it needs to read from the database is when recovering from a failure. 

Looking at the pprof for our memory profile, we can see that a significant proportion of heap objects do indeed fall into the category of cached jobs:

And thanks to the cache, our writes dominate our reads. Here’s the example Cloudwatch metrics that we pulled from a single active database.

Since all jobs are extremely short-lived (typically only a few hundred milliseconds while it is being executed), we can quickly expire delivered jobs from our cache. 

All together now

Taking a step back, we can now combine the concepts of the Centrifuge data model with the Director and JobDB.

First, the Director is responsible for accepting new jobs via RPC. When it receives the RPC request, it will go ahead and log those jobs to the attached JobDB, and respond with a transaction ID once the jobs have been successfully persisted.

From there, the Director makes requests to all of the specified endpoints, retrying jobs where necessary, and logging all state transitions to the JobDB.

If the Director fails to deliver any jobs after their expiration time (4 hours in our case), they are archived on S3 to be re-processed later.

Scaling with load

Of course, a single Director wouldn’t be able to handle all of the load on our system.

In production, we run many of individual Directors, each one which can handle a small slice of our traffic. Over the past month, we’ve been running anywhere from 80 to 300 Directors at peak load. 

Like all of our other services at Segment, the Directors scale themselves up and down based upon CPU usage. If our system starts running under load, ECS auto-scaling rules will add Directors. If we are over capacity, ECS removes them.

However, Centrifuge created an interesting new motion for us. We needed to appropriately scale our storage layer (individual JobDBs) up and down to match the scaling in our compute layer (instances of Director containers).

To do that, we created a separate binary called the JobDB Manager. The Manager’s job is to constantly adjust the number of databases to match the number of Directors. It keeps a pool of ‘spare’ databases around in case we need to scale up suddenly. And it will retire old databases during off-peak hours. 

To keep the ‘small working set’ even smaller, we cycle these JobDBs roughly every 30 minutes. The manager cycles JobDBs when their target of filled percentage data is about to exceed available RAM. 

This cycling of databases ensures that no single database is slowing down because it has to keep growing its memory outside of RAM.

Instead of issuing a large number of random deletes, we end up batching the deletes into a single drop table for better performance. And if a Director exits and has to restart, it must only read a small amount of data from the JobDB into memory.

By the time 30 minutes have passed, 99.9% of all events have either failed or been delivered, and a small subset are currently in the process of being retried. The manager is then responsible for pairing a small drainer process with each JobDB, which will migrate currently retrying jobs into another database before fully dropping the existing tables. 

In production

Today, we are using Centrifuge to fully deliver all events through Segment. By the numbers this means:

  • 800 commits from 5 engineers

  • 50,000 lines of Go code

  • 9 months of build, correctness testing, and deployment to production

  • 400,000 outbound HTTP requests per second

  • 2 million load-tested HTTP requests per second

  • 340 billion jobs executed in the last month

On average, we find about 1.5% of all global data succeeds on a retry, where it did not succeed on the first delivery attempt. 

Depending on your perspective, 1.5% may or may not sound like a big number. For an early-stage startup, 1.5% accuracy is almost meaningless. For a large retailer making billions of dollars in revenue, 1.5% accuracy is incredibly significant. 

On the graph below, you can see all successful retries split by ‘attempt number’. We typically deliver the majority of messages on their second try (the large yellow bar), but about 50% of retries succeed only on the third through the tenth attempts. 

Of course, seeing the system operate at ‘steady-state’ isn’t really the most interesting part of Centrifuge. It’s designed to absorb traffic in high-load failure scenarios.

We had tested many of these scenarios in a staging account, but had yet to really see a third-party outage happen in production. One month after the full rollout, we finally got to observe the system operating in a high-failure state.

At 4:45 pm on March 17th, one of our more popular integrations started reporting high latencies and elevated 500s. Under normal operation, this API receives 16,000 requests per second, which is a fairly significant portion of our outbound traffic load. 

From 4:45pm until 6:30pm, our monitoring saw a sharp decline and steeply degraded performance. The percentage of successful calls dropped to about 15% of normal traffic load.

Here you can see the graph of successful calls in dark red, plotted against the data from one week before as the dashed thin line.  

During this time, Centrifuge began to rapidly retry the failed requests. Our exponential backoff strategy started to kick in, and we started attempting to re-send any requests which had failed.

Here you can see the request volume to the third-party’s endpoint. Admittedly this strategy still needs some tuning–at peak, we were sending around 100,000 requests per second to the partner’s API. 

You can see the requests rapidly start retrying over the first few minutes, but then smooth out as they hit their exponential backoff period. 

This outage was the first time we’d really demonstrated the true power of Centrifuge. Over a 90-minute period, we managed to absorb about 85 million analytics events in Segment’s infrastructure. In the subsequent 30 minutes after the outage, the system successfully delivered all of the queued traffic. 

Watching the event was incredibly validating. The system worked as anticipated: it scaled up, absorbed the load, and then flushed it once the API had recovered. Even better, our mutual customers barely noticed. A handful saw delays in delivering their data to the third-party tool, but none saw data loss. 

Best of all, this single outage didn’t affect data delivery for any other integrations we support! 

All told, there’s a lot more we could say about Centrifuge. Which is why we’re saving a number of the implementation details around it for further posts.

In our next posts in the series, we plan to share: 

  • how we’ve verified correctness and exactly-once delivery while moving jobs into Centrifuge

  • how we’ve optimized the system to achieve high performance, and low-cost writes

  • how we’ve built upon the Centrifuge primitives to launch an upcoming visibility project

  • which choices and properties we plan on re-thinking for future versions

Until then, you can expect that Centrifuge will continue evolving under the hood. And we’ll continue our quest for no data left behind. 

Interested in joining us on that quest? We’re hiring.

See what Segment is all about

Since you've made it this far, perhaps you want to check out Segment? Sign up for a free workspace here or a get a demo here 👉

Centrifuge is the result of a 9-month development and roll-out period. 

Rick Branson designed and architected the system (as well as christened it with the name). Achille Roussel built out most of the core functionality, the QA process, and performance optimizations. Maxence Charriere was responsible for building the initial JobDB Manager as well as numerous integration tests and checks. Alexandra Noonan built the functionality for pairing drainers to JobDBs and helping optimize the system to meet our cost efficiency. And Tom Holmes wrote most of the archiving code, the drainer processes, and tracked down countless edge cases and bugs. Julien Fabre helped architect and build our load testing environment. Micheal Lopez designed the amazing logo.

Special thanks to James Cowling for advising on the technical design and helping us think through our two-phase-commit semantics.

To close, we wanted to share a few of the moments in development and rollout:

June 23rd, 2017: Max, Rick, and Achille begin testing Centrifuge on a subset of production traffic for the first time. They are stoked.

Sept 22, 2017: Achille gets some exciting ideas for cycling databases. Feverish whiteboarding ensues.

January 12, 2018: we hit a major milestone of 70% traffic flowing through the system. Tom bows for the camera.

Mar 14, 2018: We hit a new load test record of 2M messages per second in our “Black Hole” account.

May 22, 2018: Tom, Calvin, Alexandra, and Max take a group picture, since we forgot to earlier. Rick and Achille are traveling

Evan Johnson on March 6th 2018

Segment receives billions of events from our customers daily and has grown in to dozens of AWS accounts. Expanding in to many more accounts was necessary in order to best align with our GDPR and security initiatives, but it comes at a large complexity cost. In order to continue scaling gracefully we are investing in building tooling for employees to use with many accounts, and centrally managing employee access to AWS with terraform and our identity provider.

Segment began in a single AWS account and last year finished our move to a dev, stage, prod, and “ops” accounts. For the past few months we’ve been growing at about one new AWS account every week or two, and plan to continue this expansion in to per-team and per-system accounts. Having many “micro-accounts” provides superior security isolation between systems, and reliability benefits by limiting the blast radius of AWS rate-limits.

When Segment had only a few accounts, employees would log in to the AWS “ops” account using their email, password, and 2FA token. Employees would then connect to the ops-admin role in the dev, stage, and prod accounts using the AssumeRole api.

Segment now has a few dozen AWS accounts and plans to continue adding more! In order to organize this expansion we needed a mechanism to control our accounts, which accounts employees have access to, and each employee’s permissions in each account. 

We also hate using AWS API keys when we don’t absolutely have to so we moved to a system where no employees have any AWS keys. Instead, employees only access AWS through our identity provider. Today we have zero employees with AWS keys and there is no future need for employees to have a personal AWS key. This is a massive security win!

Designing a scalable IAM architecture

Segment uses Okta as an identity provider, and consulted their integration guide for managing multiple AWS accounts, but improved it with a minor change for better employee experience. The integration guide recommends connecting the identity provider to each AWS account but this breaks AWS’ built in support for account switching and was more complicated to audit which teams had access to which roles.

Instead, employees use our identity provider to connect to our “ops” account and then use the simple token service assume-role API to connect to each account they have access to. Using our identity provider, each team is assigned to a different role in our hub account, and each team role has access to different roles in each account. This is the classic “hub-and-spoke” architecture.

In order to make maintaining our hub-and-spoke architecture simple, we built a terraform module for creating a role in our spoke accounts, and a separate terraform module for creating a role in our hub account. Both modules simply create a role and attach a policy ARN to it, which is part of the module’s input.  

The only difference between the modules are their trust relationships. The hub role module allows access from our identity provider while the spoke module only allow access from the hub account. Below is module we use for allowing access to a hub role from our Identity provider. 

In order to provide each team with granular access to only the resources the teams need we create a role for each team in the hub account using our hub role terraform module. These roles mostly contain IAM policies for sts:AssumeRole in to other accounts but it is also possible to give granular access in our hub role too.

One concrete and simple example of a granular policy is our Financial Planning and Analysis team’s role, who keeps close watch on our AWS spend. Our FP&A team only has access to billing information and information about our reserved capacity.

The FP&A team does not have access to our spoke accounts, though. One team that needs full access to much of our infrastructure and all of our accounts is our Foundation and Reliability team, who participate in our on-call rotation. We provide both a ReadOnly role, and an Administrator role to our foundation team in all of our accounts.

After per-team roles are created for each team in the hub account, employees are assigned to groups that represent their teams in Okta, and each team can then be assigned to their associated role in the hub account.

Okta allows each group to be assigned different IAM roles in the hub account, and using their UI we can assign the FP&A team to our “Amazon Web Services” app, and restrict their access to the fpa role that we created for them in the hub account. 

After building this, we needed the tooling to provide our employees with an amazing engineering experience. Even though this system is far more secure, we wanted it to be just as usable and efficient as our setup with only a handful of AWS accounts.

Maintaining usability with aws-okta

One great thing about our old IAM setup was each employee with AWS access could use AWS APIs from their local computer using aws-vault. Each employee had their IAM user credentials securely stored in their laptop’s keychain. However, accessing AWS entirely through Okta is a massive breaking change for our old workflows.

Our tooling team took up the challenge and created a (near) drop in replacement for aws-vault which our engineering team used extensively, called aws-okta. aws-okta is now open-source and available on github

The quality of aws-okta is the principal reason that Segment engineers were able to smoothly have their AWS credentials revoked. Employees are able to execute commands using the permissions and roles they are granted, exactly like they did when using aws-vault.

There is a lot of new complexity handled with aws-okta that is is not able to be handled in aws-vault. While aws-vault uses IAM user credentials to run commands, aws-okta uses your Okta password (stored in your keychain) to authenticate with Okta, waits for a response to a push notification for 2FA, and finally provides AWS with a SAML assertion to retrieve temporary credentials.

In order to authenticate with Okta, aws-okta needs to know your Okta “application id”. We took the liberty of extending the ~/.aws/config ini file to add in the necessary id.

When Segment had only a few AWS accounts and the ops-admin role, Segment engineers all shared the same ~/.aws/config. Once each team had access to different accounts and systems, we needed a better system to manage each team’s ~/.aws/config. Our system also needed a way to update the access that employees had quickly, when new accounts and roles are created.

We decided to integrate this solution closely with prior art that Segment had built. Each team’s config is stored in a git repo that has our company dotfiles in it. Each team can initialize their aws config by using our internal tool called robo, which is a tool to share helpful commands between employees.

This was only possible to add because all Segment engineers already had an environment variable called called SEGMENT_TEAM, which denotes the team the engineer is a part of. Running robo aws.config will clone the dotfiles repo, save the old ~/.aws/config, and initialize the most recent config for their team.

AWS bookmarks were the primary way that engineers navigated our environment when we utilized fewer accounts. When we got rid of the ops-admin role, the engineers sign-in bookmarks stopped working. Additionally, AWS bookmarks only support up to five different AssumeRole targets and we now have many more than five accounts.

In order to support having many more accounts, we mostly abandoned bookmarks and instead ensured that aws-okta supports engineers who needed to switch AWS accounts often. Our previous use of aws-vault meant many of us were familiar with the aws-vault login command. We found that adding a login command to aws-okta helped engineers who switched accounts often. 

After responding to the Duo push notification aws-okta will open a browser and log in to the specified role in only a couple of seconds. This feature is supported by the AWS Custom Federated Login feature, but feels more like magic when using it. It makes logging in a breeze.

Beyond 100 accounts

We expect to be near 50 AWS accounts by the end of this year. The security of having an account be completely closed by default, and the reliability benefits of having isolated per-account rate-limits are compelling.

This system we have built is plenty robust and usable enough to scale our AWS usage to many hundreds of AWS accounts and many more engineering teams.

Deleting all employee AWS keys was extremely satisfying from a security perspective, and this alone is a compelling enough reason to integrate your identity provider with your AWS hub account.

Our world-class engineering organization is hiring. If you are excited about building amazing tools to empower our engineers, our tooling team is hiring.

Alan Braithwaite on December 21st 2017

Go has a robust built-in testing library.  If you write Go, you already know this.  In this post we will discuss a handful of strategies to level up your Go testing.   We have learned from experience on our large Go codebase that these strategies work to save time and effort maintaining the code.

1. Use test suites

If you take away just one thing from this post, it should be: use test suites.  For those unfamiliar with the pattern, suite testing is the process of developing a test against a common interface which can be used against multiple implementations of that interface.  Below, you can see how we can pass in multiple different Thinger implementations and have them run against the same tests.

Fortunate readers may have worked with codebases which use this technique.  Frequently used in plugin-based systems, tests which are written against the interface are usable by all implementations of that interface to determine if the behavior requirements are met.

Using this strategy will save potentially hours, days, or perhaps even enough time to solve the P versus NP problem.  Also when swapping two underlying systems you don’t have to write (many) additional tests and it provides confidence that doing so won’t break your application.  This implicitly requires that you create an interface defining the surface area of that which you’re testing.  Using dependency injection, you set up the suite from your package passing in the implementation for the package.

A complete example is provided here.  While this example is fairly contrived, you can imagine one implementation being a remote database whilst the other being an in-memory database.

Another fantastic example of this in the standard library is the golang.org/x/net/nettest package.  It provides the means to verify a net.Conn satisfies its interface.  

2. Avoid interface pollution

We can’t talk about testing in Go without talking about interfaces.  

Interfaces are important in the context of testing because they are the most powerful tool in our test arsenal, so it’s important to use them correctly.  Packages frequently export an interface for consumers to use which in turn leads to either: a. consumers implementing their own mock of the package implementation or b. the package exporting their own mock. 

The bigger the interface, the weaker the abstraction.

— Rob Pike, Go Proverbs

Interfaces should be carefully considered before exporting them.  Developers are often tempted to export interfaces as a way for consumers to mock out their behavior.  Instead, document what interfaces your structs satisfy such that you don’t create a hard dependency between the consumer package and your own.  A great example of this is the errors package.

When we have an interface in our program which we don’t want to export can use an internal/ package subtree to keep it scoped to the package.  By doing this, we remove the concern that other consumers might depend on it and therefore can be flexible in the evolution of the interface as new requirements present themselves.  We usually create interfaces around external dependencies and use dependency injection so we can run tests locally.

This enables the consumer to implement small interfaces of their own, only wrapping the consumed surface of the library for their own testing.  For more details about this concept, see rakyll’s post on interface pollution.

3. Don’t export concurrency primitives

Go offers easy-to-use concurrency primitives which can also sometimes lead to their overuse.  We’re primarily concerned about channels and the sync package.  It is sometimes tempting to export a channel from your package for consumers to use.  Additionally, it’s a common mistake to embed sync.Mutex without making it private.  As with anything, this isn’t always bad but it does pose challenges when testing your program.

If you’re exporting channels, you’re exposing the consumer of the package to additional complexity they shouldn’t care about.  As soon as a channel is exported from a package, you’re opening up challenges in testing for the one consuming that channel.  To test well, the consumer will need to know:

  • When data is finished being sent on the channel

  • Whether or not there were any errors receiving the data

  • How does the package clean up the channel after completion, if at all?

  • How can I wrap an interface around the package API so I don’t have to call it directly?

Consider an example of reading a queue.  Here’s an example library which reads from the queue and exposes a channel for the consumer to read from.

Now a user of your library wants to implement a test for their consumer:

The user might then decide that dependency injection is a good idea here and write their own messages along the channel:

But wait, what about errors?

Now, how do we generate events to actually write into this mock which sufficiently replicate the behavior of the actual library we’re using?  If the library simply wrote a synchronous API, then we could add all this concurrency in our client code and it becomes much simpler to test.

When in doubt, remember that it’s always easy to add concurrency in a consuming package and hard/impossible to remove once exported from a library.  Finally, don’t forget to mention in package documentation whether or not a struct/package is safe for concurrent access by multiple goroutines!

Sometimes, it’s still desirable or necessary to export a channel.  To alleviate some of the problems above, you can expose channels through accessors instead of directly and force them to be read-only (←chan) or write-only (chan←) channels in their declaration.

4. Use net/http/httptest

 httptest allows you to exercise your http.Handler code without actually spinning up a server or binding to a port.  This speeds up tests and allows them to run in parallel with less effort.

 Here’s an example of the same test implemented using both methods.  It doesn’t look like much, but it saves you a considerable amount of code and resources.

Perhaps the biggest thing using httptest buys you is the ability to compartmentalize your test to just the function you want to test.  No routers, middleware or any other side-effects that come from setting up servers, services, handler factories, handler factory factories or whatever abominations are thrown upon you by ideas your former self thought were good.

To see more of this pattern in action, see this post by Mark Berger.

5. Use a separate _test package

Most tests in the ecosystem are created in files pkg_test.go but still live in the same package: package pkg.  A separate test package is a package you create in a new file, foo_test.go, inside the directory of the package you wish to test, foo/, with the declaration package foo_test .  From there, you can import github.com/example/foo and other dependencies.  This feature enables a number of things. It is the recommended workaround for cyclic dependencies in tests, it prevents brittle tests, and it allows the developer to feel what it’s like to consume their own package.    If your package is hard to use, it will likely also be hard to test using this method.

This strategy prevents brittle tests by restricting access to private variables.  In particular, if your tests break and you’re using a separate test packages it’s almost guaranteed that a client using the feature which broke in tests will also break when called.

Finally, this aids in avoiding import cycles in tests.  Most packages likely depend on other packages you wrote aside from those being tested, so you’ll eventually run into a situation where an import cycle happens as a natural consequence.  An external package sits above both packages in the package hierarchy.   To take an example from The Go Programming Language (Chp. 11 Sec 2.4), net/url implements a URL parser which net/http imports for use.  However, net/url would like to test using a real use case by importing net/http.  Thus net/url_test was born.

Now that you’re using a separate test package you might require access to unexported entities in your package where they were previously accessible.  Most people hit this first when testing something time based (e.g. time.Now gets mocked via a function).  In this situation, we can use an additional file to expose them exclusively during testing since _test.go files are excluded from regular builds.

Something to remember

It’s important to remember that none of the methods suggested above are silver bullets.  The best solution is to always apply critical analysis to the situation and decide upon the best solution that fits the problem.

Want to learn even more Go testing techniques?

Check out these posts:

Or these videos:

Fouad Matin on October 12th 2017

It’s not hyperbole to say that Segment would not exist, if not for open source. We’re heavy users of Kafka, Redis, Terraform, Docker, Golang, and Node.js, just to name a few of the tools we use. And we literally got our start as an open source library launched on Hacker News

Today, our engineering team actively contributes to hundreds of open source repos on Github. And while it’s a good start, we wanted to do more. So six months ago, we embarked on a bit of an experiment.

Following in the footsteps of Stripe and Google, we opened a request for Open Source Fellows”

The pitch was simple: we give you $24,000 and three months to develop open source software, no strings attached. At the end, you present your results.

Today, we’re excited to announce the results of the program and the progress our fellows have made.

  • Tobias Koppers’ work on Webpack

  • Ben Weinstein’s work on DeepMeerkat

  • Justin Keyes’ work on Neovim

  • Julia Evans upcoming work on a Ruby Debugger and Profiler

For more information on the progress each participant made, read on!

Tobias Koppers

Tobias is one of the core maintainers and project lead for Webpack–a module bundler for web applications. Tens of thousands of companies (including us here at Segment!) use Webpack to bundle together their image assets, javascript files, and CSS into single ‘bundled files’.

Instead of writing various pieces of inlined javascript, Webpack is the “universal tool” to get all of your raw source code into a single set of optimized, dependency-managed files. 

Tobias started Webpack back in 2012 while working on his masters thesis. As part of his thesis, he was trying to create a simple webapp, and wanted a way to bundle his javascript together. But he couldn’t find a lot of good prior art out there. 

At that time, the frontend Javascript ecosystem looked very different. You didn’t really see require statements or dependency management. Most libraries were referenced through global variables stored on the window object (think jQuery’s window.$ object).

In the best case, you’d have an asset pipeline like Rails that injected all of your core libraries into a single HTML template.

Worst case (let’s be real, average case), you’d manage them by hand like this:

And while the odd project might be using a tool like require.js or browserify–at the time they weren’t really mainstream.

So Tobias started by looking at Medikoo’s modules-webmake to package together the dependencies in his webapp. He liked the idea of not having to worry about the dependencies themselves and instead focusing just on the core development. 

Tobias had previously used GWT’s code splitting feature, so he pull requested it as an addition to modules-webmake. But it was such a drastic change, he ended up forking the project into a new repository dubbed modules-webpack, which eventually became webpack/webpack

As part of his work over the past three months, Tobias has focused on three separate additions to Webpack: implementing properly ordered scope hoisting for modules, speeding up the core parser, and in-lining ‘pure’ modules.

The features themselves are really cool–so it’s worth digging a little bit into what’s going under the hood. 

Scope Hoisting

For a bundler like Webpack, it’s a top priority to respect the ECMAScript specification when it comes to re-writing and generating spec-compliant code. And previously, Webpack had a few issues here. 

As a quick example, suppose that we have two files, an index.js containing most of our code, and a corresponding b.js file which contains additional code. It should be possible to structure these modules like this:



While this is a contrived example, it matches the hoisting rules for ES6 functions and modules (all defined functions should be ‘hoisted’ to the scope of the module). 

You can see the fix for yourself here.

Parser Speedup

Second, Tobias turned his focus towards speeding up the core webpack parser. Since Webpack might re-build its source javascript hundreds of times locally in a given development cycle, one of the core requirements is that it has to be fast.

In particular, he implemented what he terms a StackedSetMap. The general idea is that it combines the stack-based rules of variable and functional scoping with the quick access of the maps and sets. You can think of it as a “stack” of “maps”, with each map representing the variables in-scope at a certain point. 

Instead of repeatedly copying large amounts of memory, the StackedSetMap just tracks parent references (typically another stack frame) and then searches the map of locally defined variables.

In this way, the StackedSetMap keeps both variable access and writing fast:

  • creating child scope: O(1)

  • reading a value the first time: O(n) with n is the number of parent scopes

  • reading a value the second time: O(1)

  • writing a value: O(1)

If the parser ever wants to know that a given variable is in scope, it can then query the map like so.

If you’d like to see the change in action, you can find it here.

Side-effect-free pure modules

Finally, Tobias added support for side-effect-free ‘pure’ modules. As a toy example, suppose we have the following folder structure.

Here we have an a.js, a b.js and an index.js file which exports them both. Suppose our index file looks something like this:

In most cases, the compiler or bundler will have to re-package the entire module (index.js, a.js, and b.js) to meet the spec. After all, ES7 specifies that the entire module be loaded, just in case something from a.js modifies something from b.js

However, Webpack module authors can specifically mark their fields as being "pure-module": true in their package.json

When this happens, Webpack knows that each individual file won’t have any sort of side-effects when imported directly. So it can introduce an optimization and effectively omit including the index.js file, and instead require my-module/a and my-module/b directly. 

This creates smaller builds which skip the extra set of lookups. It’s a win for everyone! And here’s the PR.

You can find Tobias’ full write up of all his improvements (and those of the full Webpack team) up on his the Webpack medium blog

Ben Weinstein

Ben differs from the majority of the applicants to the fellowship. His focus isn’t on developer tooling or infrastructure. And while he has some software engineering background, he self-identifies primarily as a field research biologist. 

Ben is a postdoc at the Oregon State University where he studies ecology. He spends his time applying statistical methods to Ecology, and is one of the world experts on Taxonomic, phylogenetic, and trait beta diversity in South American hummingbirds.

Ben tells us that the most common means for ecologists to collect data is by observing animals directly in their natural habitat.

Here’s the typical setup. There’s a stationary camera somewhere, filming a congregation spot for hummingbirds, or butterflies, even sharks. The camera films the location for fifteen days in a row (!), and then the ecologist comes and retrieves the footage. 

In his case, his camera takes a still frame several times per second, gathering 46,000 images over the course of the day.

The ecologist then combs through the video, hoping to catch a fleeting glimpse of the animal to identify it. Once they catch the handful of ‘good frames’, they’ll loop those frames again and again to properly tag the specimen. 

It’s a needle in a haystack search to find 10-15mb of “good frames” amongst nearly 25gb of noise. 

Obviously, this process is time-consuming, error-prone, and tedious. So Ben thought it was a ripe candidate for a little bit of automation.

The result is a project he calls “DeepMeerkat”. It’s designed to run on your laptop, you just feed it an animal video, and it will automatically skip empty frames, and highlight the useful/interesting ones. 

Under the hood, DeepMeerkat uses Tensorflow to build a model to help identify the animals in videos. Each video is loaded into memory, and first analyzed using OpenCV

For each frame in the video, OpenCV will filter for background vs foreground noise by using a Gaussian image filter and then draw ‘bounding boxes’ around any distinct shapes. 

If the difference of the foreground is significant, the frame will be then passed into Tensorflow to train as part of the model there. 

Ben’s been testing his model against a variety of animals and landscapes:

And he’s uploaded a number of sample videos to his youtube channel to help train his dataset against.

Over the course of the fellowship, Ben built out most of the software, and just added the ability to substitute your own training models in Tensorflow. His idea is that anyone can train their own Tensorflow model for their particular dataset–though right now he’s trained his net against ImageNet.

Additionally, Ben has started work to do the full model training in Tensorflow, rather than passing it to OpenCV. While pushing frames into Tensorflow currently is insanely slow (2-3fps), he’d like to be able to do a first “filter” with OpenCV and then actually identify the animal within Tensorflow.

Currently, everything runs locally on his laptop, but he’s been experimenting with feeding the data into Google’s Cloud Dataflow to run it in the cloud. It’s his dream that someday he can just upload a video and have it automatically tagged. 

He’s hoping in the next six months he can spread it more widely. Over the past year, he’s seen about 800 downloads of DeepMeerkat, and he’d like to start onboarding more contributors. You can find the source on his github.

Justin Keyes

Justin Keyes has been working on Neovim, a modernized fork of popular text editor: vim. If you’ve SSH’d into a unix-based server at some point in the past 10 years, you’ve probably used vim to edit your files. 

Vim itself has been around since 1991, when Bram Moolenaar ported the stevie editor from the Atari ST (which in turn, was based upon Bill Joy’s vi for Unix) to the Amiga. 

And while vim has shipped on practically every linux, unix, and mac distribution since 1991, it means that there are a lot of parts of the codebase that have been added to handle 20 years worth of quirks and inconsistencies. 

Enter Neovim–an effort to create a fork of vim that is extensible, pluggable, and hackable.

Instead of starting from scratch and then re-building all of the battle-tested functionality that vim has acquired over its 20-year history, the Neovim team started with vim itself as a foundation. But with the explicit goal of encouraging contributions, hacking, and pluggable functionality.

While Vim still just has a single named contributor (Bram Moolenaar), Neovim takes commits from hundreds of different engineers and enthusiasts.

Over the three months of the fellowship, Justin has been hard at work closing issues, merging contributions, and generally focusing on cleaning up various parts of the codebase. 

In addition, he’s been working on a significant new feature: multicursor support. The goal is to allow users to queue up multiple actions at once, and then apply them in one go. 

But when he started to build out the new feature, Justin realized that creating multi-cursor support in Neovim wasn’t actually feasible with the current codebase. 

Most projects that re-implement vim, model sets of operations as "pipelines" that produce nice, composable “commands”. These commands are then “executed”. But Neovim wasn't implemented like that. 

Instead, in Neovim, many normal-mode operations are implemented by pushing keys onto the internal input queue, as if the user had typed them. It’s clever, and preserves parity, but hard to reason about. And even harder to leverage programmatically.

So Justin started work to fix the root problem by introducing two new primitives: atoms and contexts


Atoms are simple, they provide a combination of user-actions, rather than individual keystrokes. And as the name implies, they can be grouped and applied atomically. As an example, instead of using 3j to jump ahead three characters and treating 3  and j as different single keystrokes–this update groups them as an ‘atom’ that can be repeated re-used.

Previously, Neovim would track macros as a string buffer of chacters. A given macro might look like the raw input stream defined below:

With the multicursor work, Neovim now tracks the individual atoms:

By grouping individual commands into actions, user scripts, plugins, and remote clients can query the atoms queue at any time via the Neovim API.

For example, here's the last 3 items in the atoms queue from a sample editing session:

It's easy to see that this is useful for applications beyond multicursor, such as introspection of user editing patterns, or re-invoking the last "thing" or the Nth last "thing". Using the current implementation, the following code maps "space" to repeat the last atom (unlike dot-repeat, this repeats motions, navigation, etc.):

With that mapping, a motion like 3j or CTRL-D can be repeated by pressing "space".


The context contains the current Neovim state. It allows you to serialize a given Neovim session and actions, and retrieve it via Neovim’s API. It should enable multiple frontends to read from the same Neovim state to enable functionality like remote vim UIs.

In the current implementation, you can call the nvim_get_context() API call, and it will return you the full context–no complicated parsing required:

Justin expects the work to land just after the upcoming release in 0.2.1. And it helps ensure that everything in Neovim is API first.

If you’d like to check out Neovim’s source, you can find it on Github

Julia Evans

Last, we have Julia Evans who deferred her participation until early next year. She hasn’t yet started the program, but we’re excited to help fund her next project: an new profiling tool for Ruby.

If you haven’t yet read her blog, Julia runs a wonderful set of descriptions and zines on linux performance and tracing tools. She’s written invaluable posts on everything from strace to pprof to eBPF

When Julia applied, she said she was excited to work on better profiling tools for ruby. And naturally, she’d already outlined some initial work in a post. :D

The inspiration for her work comes from using interactive tools like gdb or perf, and a desire to provide that same kind of experience for Ruby. 

Today, if you want to get a CPU profile for an arbitrary Ruby program, you… can’t. You can use tools like stackprof (which is great), but to use stackprof you need to instrument your program in advance. The initial goal of this project is to be able to get a CPU profile for any Ruby program. You can do this for C/C++ programs, Rust programs, Go programs, and Java programs… so why should Ruby have worse tooling? 

As she talks about in her post, you can interactively connect to a Ruby process with gdb. And you can even get the currently executing spot within that program using your gdb session. This excerpt is taken from her blog post:

If you only know the process number, you can still understand exactly what your program is doing at this very moment.

Yet this still isn’t great. gdb uses the ptrace system call, which causes the program to stop in its tracks and then intensely query it for its internals. It’s not really a ‘passive’ profiler at all, and won’t work in cases where your program is actually under load.  

In just a few days, Julia built a prototype in Rust which could interactively introspect the system calls, function calls, and even spy on the memory contents using the process_vm_readv call (a syscall which can directly read memory from a user-space program). 

Here’s an example of a flamegraph generated using her prototype:

It’s still has a little ways to go, and she plans to make it more portable and work quickly and reliably across any sort of Linux distribution. 

She’ll be starting early next year to work full-time on the fellowship. And during that time, she’ll be on sabbatical from her work at Stripe.

If you’d like you can check out the early code on her Github, you can find it here and her full blog post outlining the project here.

Looking Ahead

We’ve been quite pleased with the breadth and depth of open source work that has come out of the fellowship. And we’ve been impressed with what a single focused individual or small team can accomplish in as little as three months.

A number of the fellows agreed that having the funding and 3-month period allowed them to really focus on tackling ‘bigger’ projects that they wouldn’t have had time for otherwise. 

We’re hopeful that we can continue the program again next year, and encourage another batch of fellows to help spread the open source love. If you’re interested in applying next year, leave your email as part of our Open Fellowships email list and we’ll send you a reminder once applications are open.

Become a data expert.

Get the latest articles on all things data, product, and growth delivered straight to your inbox.