

Pablo Vidal Bouza on July 15th 2021

How Segment moved from traditional SSH bastion hosts to use AWS Systems Manager SSM to manage access to infrastructure.


Maxime Santerre on February 28th 2019

Counting things is hard. And it gets even harder when you have millions of independent items. At Segment we see this every single day, counting everything from the number of events to unique users and other high-cardinality metrics.

Calculating total counts proves to be easier, since we can distribute that task over many machines and then sum up those counts to get the total. Unfortunately, we can’t do this for cardinality, since the sum of the number of unique items in multiple sets isn’t equal to the number of unique items in the union of those sets.

Let’s go through the really basic way of calculating cardinality:
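A minimal sketch of that basic approach in Python: remember every ID seen in a set and take its size. The `shard_a` and `shard_b` lists and their IDs are hypothetical stand-ins for the data held by two machines; the example also shows why per-machine unique counts can’t simply be summed.

```python
def cardinality(items):
    """Exact cardinality: remember every distinct item, then count them."""
    seen = set()
    for item in items:
        seen.add(item)
    return len(seen)


# Hypothetical IDs seen by two different machines.
shard_a = ["u1", "u2", "u3", "u2"]
shard_b = ["u3", "u4", "u1"]

summed = cardinality(shard_a) + cardinality(shard_b)  # 3 + 3 = 6
union = cardinality(shard_a + shard_b)                # 4 distinct IDs overall
print(summed, union)
```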

This works great for small amounts of data, but when you start getting large sets, it becomes very hard to keep all of this in memory. For example, Segment’s biggest workspaces will get over 5,600,000,000 unique anonymous_id in a month.

Let’s look at an example of a payload:

Quick maths:
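A back-of-the-envelope sketch of that calculation, assuming each stored ID costs about 24 bytes and ignoring any per-entry overhead of the set itself. The 24-byte figure is an assumption, chosen because it matches the 134.4GB total quoted later in this post.

```python
unique_ids = 5_600_000_000   # unique anonymous_id values per month (from the text)
bytes_per_id = 24            # assumption, see the note above

total_bytes = unique_ids * bytes_per_id
print(f"{total_bytes / 1e9:.1f} GB")
```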


Another big issue is that you need to know which time range you want to calculate cardinality over. The number of unique items from February 14 to February 20 can be drastically different from the number from February 15 to February 21. We would essentially need to keep the sets at the lowest granularity of time we need, then merge them to get the unique counts. Great if you need to be precise and you have a ton of time, not great if you want to do quick ad hoc queries for analytics.

Using HyperLogLog

Thankfully, we’re not the first ones to have this issue. In 2007, Philippe Flajolet and his co-authors published an algorithm called HyperLogLog, which can approximate the cardinality of a set using only a small, constant amount of memory. HyperLogLog also lets you trade memory for accuracy: use more memory for a lower error rate, or less memory at the cost of accuracy if you have memory limitations.
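To make the idea concrete, here is a minimal HyperLogLog sketch in Python. It is an illustration under stated assumptions, not the implementation used by Redis, Algebird, or any other library mentioned in this post: it assumes a 64-bit BLAKE2b hash, 14 precision bits, one byte per register, and only the small-range (linear counting) correction. The class and variable names are hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, precision_bits=14):
        self.p = precision_bits
        self.m = 1 << precision_bits        # number of registers
        self.registers = bytearray(self.m)  # one byte per register

    def add(self, item):
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")   # 64-bit hash of the item
        index = h >> (64 - self.p)          # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank: 1-based position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union of two sketches: element-wise maximum of the registers."""
        if self.p != other.p:
            raise ValueError("precision must match")
        merged = HyperLogLog(self.p)
        merged.registers = bytearray(map(max, self.registers, other.registers))
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


# Two overlapping sets of hypothetical user IDs.
week_1, week_2 = HyperLogLog(), HyperLogLog()
for i in range(100_000):
    week_1.add(f"user-{i}")
for i in range(50_000, 150_000):
    week_2.add(f"user-{i}")

union = week_1.merge(week_2)
print(week_1.count(), week_2.count(), union.count())
```

The three estimates should land close to 100,000, 100,000, and 150,000. Because a merge is just an element-wise maximum, sketches stored per day can be combined over any time range after the fact.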

16 KB for a 0.81% error rate? Not too bad. Certainly cuts down on that 134.4GB we had to deal with earlier.
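The 0.81% figure follows from the usual HyperLogLog standard-error approximation, 1.04 / sqrt(m), for m registers. A quick check, assuming 2^14 registers stored at one byte each (Redis packs the same number of registers into 6 bits each, about 12 KB); `hll_stats` is a hypothetical helper:

```python
import math


def hll_stats(precision_bits, bytes_per_register=1):
    """Return (memory in bytes, approximate standard error) for an HLL."""
    m = 1 << precision_bits
    return m * bytes_per_register, 1.04 / math.sqrt(m)


for p in (10, 12, 14, 16):
    size, err = hll_stats(p)
    print(f"precision={p:2d}  memory={size // 1024:3d} KB  error={err:.2%}")
```

Each extra pair of precision bits quadruples the memory and halves the error.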

Our first approach

We initially decided to use Redis to solve this problem. The first attempt was a very simple process: we would take every message out of a queue and put it directly into Redis, and then services would call PFMERGE and PFCOUNT to get the cardinality counts we needed.

This worked very well for our basic needs then, but it was hard for our analytics team to do any custom reporting. The second issue is that this architecture doesn’t scale very well. When we initially got to that point, our quick fix was this:

Not ideal, but you would be surprised how much we got out of this. This scaled upwards of 300,000 messages/s due to Redis HLL operations being amazingly fast.

Every so often, our message rate would exceed this and our queue depth would grow, but our system would keep chugging along once the message rate went back down. At this point we had the biggest Redis machine available on AWS, so we couldn’t throw more money at it.

The maximum queue depth is steadily increasing every week and soon we will always be behind. Any kind of reporting that depends on this pipeline will become hours late, and soon maybe even days late, which is unacceptable. We really need to find a solution quickly.

Introducing Spark

We already used Spark for our billing calculations, so we thought we could use what was there to feed into another kind of reporting.

The problem with the billing counts is that they’re static. We only calculate them for the workspace’s billing period. It’s a very expensive operation that takes every message for a workspace and calculates a cardinality over the user_id and anonymous_id for that exact slice. For some bigger customers, that’s several terabytes of data, so we need big machines, and it takes a while.

We previously used HLLs to solve this problem, so we’d love to use them again.

The problem is: How do we use HLLs outside of Redis?

Our requirements are:

  • We need a way to calculate cardinality across several metrics.

  • We need to be able to store this in a store with quick reads.

  • We can’t pre-calculate cardinality, we need to be able to run ad-hoc queries across any range.

  • It needs to scale almost infinitely. We don’t want to fix this problem again in a year when our message rate doubles.

  • It needs to take very few engineering resources.

  • (Optional) Doesn’t require urgent attention if it breaks.

First solution: Fix the scale issue

We want something reliable and easy to query, so we think of PostgreSQL and MySQL. At that moment, we find they are equally fine, but we choose MySQL because we already have one to store reporting data and the clock is ticking.

We use Scala for our Spark jobs, so we settle on Algebird to compute HLLs. It’s an amazing library and using it with Spark is trivial, so this step was very smooth. 

The map-reduce job goes like this:

  1. On every message, create an HLL out of every item you want cardinality metrics on and return that as the result of your map step, keyed by the metric it belongs to.

  2. On the reduce step, add all of the HLLs resulting from the map.

  3. Convert the resulting HLLs to the bytes representation and write them to MySQL.
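The three steps above can be sketched in plain Python. This is not the Scala/Algebird job described in this post; it is a simplified stand-in that assumes 14 precision bits, one byte per register, a BLAKE2b hash, and a hypothetical message shape and (workspace, day) key.

```python
import hashlib
from collections import defaultdict
from functools import reduce

P = 14       # precision bits (assumption)
M = 1 << P   # number of registers


def to_hll(item):
    """Map step: build a register array that contains a single item."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    h = int.from_bytes(digest, "big")
    rest = h & ((1 << (64 - P)) - 1)
    registers = bytearray(M)
    registers[h >> (64 - P)] = (64 - P) - rest.bit_length() + 1
    return registers


def add_hlls(a, b):
    """Reduce step: adding two HLLs is an element-wise maximum."""
    return bytearray(map(max, a, b))


# Hypothetical messages; real ones carry many more fields.
messages = [
    {"workspace": "w1", "day": "2018-11-01", "user_id": "alice"},
    {"workspace": "w1", "day": "2018-11-01", "user_id": "bob"},
    {"workspace": "w1", "day": "2018-11-01", "user_id": "alice"},
    {"workspace": "w2", "day": "2018-11-01", "user_id": "carol"},
]

mapped = defaultdict(list)
for msg in messages:
    mapped[(msg["workspace"], msg["day"])].append(to_hll(msg["user_id"]))

# One serialized HLL per key, ready to be written to a binary column.
rows = {key: bytes(reduce(add_hlls, hlls)) for key, hlls in mapped.items()}
print({key: len(blob) for key, blob in rows.items()})
```

Allocating a full register array per message is wasteful and is done here only to mirror the steps as described; a real job would typically aggregate within each partition first.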

Now that we have all the HLLs stored as bytes in MySQL, we need a way to use them from our Go service. A bit of Googling around led me to this one. There was no way of directly injecting the registers, but I was able to add it pretty easily and created a PR which will hopefully get merged one day.

The next step is to get the registers and precision bits from Algebird, which was very simple. Next, we just take those bytes and make GoHLL objects that we can use!

Sweet! Now we can calculate cardinality of almost anything very easily. If it breaks, we fix the bug and turn it back on without any data loss. The worst that can happen is data delays. We can start using this for any use case: unique events, unique traits, unique mobile device ids, unique anything.

It doesn’t take long for us to realize there are two small problems with this:

  • Querying this is done through RPC calls on internal services, so it’s not easy for the analytics team to access.

  • At 16 KB per HLL, queries ranging over several days on several metrics create a big response from MySQL.

The latter is quite important. If we’re doing analysis on all our workspaces multiple times a day, we need something that iterates quickly. Let’s say someone wants to figure out the cardinality of 10 metrics, over 5 sources and 180 days.

We’re pulling down 140 MB of data, creating 9,000 HLLs, merging them, and calculating cardinality on each metric. This takes approximately 3–5s to run. That’s not a lot on its own, but if we want to do this for hundreds of thousands of workspaces, it takes way too long.
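The size of that response is simple arithmetic, sketched here with the 16 KB per HLL quoted earlier:

```python
metrics, sources, days = 10, 5, 180
hll_bytes = 16 * 1024                 # 16 KB per serialized HLL

hll_count = metrics * sources * days  # one HLL per metric, source, and day
total_mib = hll_count * hll_bytes / 2**20
print(hll_count, f"{total_mib:.1f} MiB")
```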

Making this lightning fast

So we have a solution, but it’s not ideal.

I sat down at a coffee shop one weekend and started looking around for better solutions. Ideally, we’d want to do this on the database level so we don’t have to do all of this data transfer.

MySQL has UDFs (user defined functions) that we could use for this, but we use MySQL on AWS, and from my research, there doesn’t seem to be a way to use UDFs on Aurora, or RDS.

PostgreSQL, on the other hand, has an extension called postgresql-hll, which is available on PostgreSQL RDS.

The storage format is quite different unfortunately, but it’s not as much of a problem since they have a library called java-hll that I can use in my Spark job instead of Algebird. No need to play with the headers this time.

Now we’re only pulling down 64 bit integers for each metric, we can query cardinality metrics with SQL directly, and this pipeline can scale almost infinitely. The best part? If it breaks for some reason at 4am, I don’t need to wake up and take care of it. We’re not losing any data because we’re using our S3 data archives and it can be retried at any time without any queues filling up disk.

The Final Step: Migrating the MySQL data to Postgres

Now that all new data is processed this way going forward, we need to backfill everything back to the beginning of the year. I could take the lazy way and just run the Spark jobs on every day from January 1st 2018 to now, but that’s quite expensive.

I said earlier we didn’t need to play with the headers, but we have an opportunity to save a bit of money (and have some fun) here by taking the HLLs saved in MySQL and transforming them into the postgresql-hll format and migrating them to our new PostgreSQL database.

You can take a look at the whole storage specification here. We’ll just be looking at the dense storage format for now.

Let’s look at an example of HLL bytes in hexadecimal: 146e0000000…

The first byte, 14, is the version byte. The top half is the version and the bottom half is the type. Version is easy: there’s only one, which is 1. The bottom half is 4, which represents a FULL HLL, or what we’ve been describing as dense.

The next byte (the next two hex digits, 6e) encodes the log2m parameter and the registerWidth parameter. log2m is the same as the precision bits that we’ve seen above. The top 3 bits represent registerWidth - 1 and the bottom 5 represent log2m. Unfortunately, we can’t read this directly from the hex, so let’s expand it out to bits:
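The expansion can be sketched in a few lines of Python. It decodes only the two header bytes 14 and 6e discussed here, following the bit layout described above; the variable names are hypothetical.

```python
header = bytes.fromhex("146e")

version = header[0] >> 4               # top half of the first byte
hll_type = header[0] & 0x0F            # bottom half; 4 means FULL (dense)

param_bits = format(header[1], "08b")  # 0x6e written out as 8 bits
register_width = (header[1] >> 5) + 1  # top 3 bits store registerWidth - 1
log2m = header[1] & 0x1F               # bottom 5 bits store log2m

print(param_bits[:3], param_bits[3:])
print(version, hll_type, register_width, log2m)
```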

The rest are the HLL register bytes.

Now that we have this header, we can easily reconstruct our MySQL HLLs by taking their registers and prefixing those with our 146e header. The only thing we need to do is take all of our existing serialized HLLs in MySQL and dump them in this new format in PostgreSQL. Money saved, and definitely much fun had.

Tyson Mote on February 7th 2019

The premise behind autoscaling in AWS is simple: you can maximize your ability to handle load spikes and minimize costs if you automatically scale your application out based on metrics like CPU or memory utilization. If you need 100 Docker containers to support your load during the day but only 10 when load is lower at night, running 100 containers at all times means that you’re using 900% more capacity than you need every night. With a constant container count, you’re either spending more money than you need to most of the time or your service will likely fall over during a load spike.

At Segment, we reliably deliver hundreds of thousands of events per second to cloud-based destinations but we also routinely handle traffic spikes of up to 300% with no warning and while keeping our infrastructure costs reasonable. There are many possible causes for traffic spikes. A new Segment customer may instrument their high-volume website or app with Segment and turn it on at 3 AM. A partner API may have a partial outage causing the time to process each event to skyrocket. Alternatively, a customer may experience an extreme traffic spike themselves, thereby passing on that traffic to Segment. Regardless of the cause, the results are similar: a fast increase in message volume higher than what the current running process count can handle.


To handle this variation in load, we use target-tracking AWS Application Autoscaling to automatically scale out (and in) the number of Docker containers and EC2 servers running in an Elastic Container Service (ECS) cluster. Application Autoscaling is not a magic wand, however. In our experience, people new to target tracking autoscaling on AWS encounter three common surprises leading to slow scaling and giant AWS billing statements.

Surprise 1: Scaling Speed Is Limited

Target tracking autoscaling scales out your service in proportion to the amount that a metric is exceeding a target value. For example, if your CPU target utilization is 80%, but your actual utilization is 90%, AWS scales out by just the right number of tasks to bring the CPU utilization from 90% to your target of 80% using the following formula:

Continuing the above example, AWS would scale out a task count of 40 to 45 to bring the CPU utilization from 90% to 80% because the ratio of actual metric value to target metric value is 113%:
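The proportional rule described above can be sketched as follows. This is an approximation inferred from the worked example (40 tasks at 90% utilization against an 80% target), not AWS’s published algorithm, and `desired_task_count` is a hypothetical helper.

```python
import math


def desired_task_count(current_tasks, actual_metric, target_metric):
    """Scale the task count by the ratio of actual to target metric value."""
    return math.ceil(current_tasks * actual_metric / target_metric)


print(desired_task_count(40, actual_metric=90, target_metric=80))  # 45
```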

However, because target tracking scaling adjusts the service task count in proportion to the percentage that the actual metric value is above the target, a low ratio of maximum possible metric value to target metric value significantly limits the maximum “magnitude” of a scale out event. For example, the maximum value for CPU utilization that you can have regardless of load is 100%. Unlike a basketball player, EC2 servers cannot give it 110%. So, if you’re targeting 90% CPU utilization in a web service, the maximum amount that the service scales out after each cooldown period is about 11%: 100 / 90 ≈ 1.1

In the above example, the problem is that if your traffic went up by 200%, you’d probably need to wait for seven separate scaling events to reach just under double the task count to handle the load:

If your scale out cooldown is one minute, seven scaling events will take seven minutes during which time your service is under-scaled.
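A small sketch of how slowly those capped scale out events compound, assuming the rounded 100 / 90 ≈ 1.1 ratio from the example above and one event per one-minute cooldown:

```python
max_metric, target = 100, 90
ratio = round(max_metric / target, 1)  # 1.1, the rounded per-event cap
cooldown_minutes = 1

capacity = 1.0                         # task count relative to the start
for event in range(1, 8):
    capacity *= ratio
    print(f"after {event * cooldown_minutes} min: {capacity:.2f}x")
```

After the seventh event the service is at roughly 1.95 times its starting task count, which is the “just under double” figure above.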

If you need to be able to scale up faster, you have a few options:

  • Reduce your target value to allow for a larger scale out ratio, at the risk of being over-scaled all the time ($$$).

  • Add target tracking on a custom CloudWatch metric with no logical maximum value like inflight request count (for web services) or queue depth (for queue workers).

  • Use a short scale out cooldown period to allow for more frequent scale out events. But, short cooldowns introduce their own unpleasant side effects. Read on for more on that surprise!

Surprise 2: Short Cooldowns Can Cause Over-Scaling / Under-Scaling

AWS Application Autoscaling uses two CloudWatch alarms for each target tracking metric:

  • A “high” alarm that fires when the metric value has been above the target value for the past 3 minutes. When the high alarm fires, ECS scales your service out in proportion to the amount that the actual value is above the target value. If more than one of the high alarms for your service fire, ECS takes the highest calculated task count and scales out to that value.

  • A “low” alarm that fires when the metric value has been more than 10% below the target value for the past 15 minutes. Only when all of your low alarms for a service fire does ECS slowly scale your service task count in by an undefined and undocumented amount.

In addition to the target metric value, AWS Application Autoscaling allows you to configure a “cooldown” period that defines the minimum amount of time that must pass between subsequent scaling events in the same direction. For example, if the scale out cooldown is five minutes, the service scales out, at most, every five minutes. However, a scale out event can immediately follow a scale in event to ensure that your service can quickly respond to load spikes even if it recently scaled in.

The catch is that these values cannot be arbitrarily short without causing over-scaling and under-scaling. Cooldown durations should instead be at least the amount of time it takes the target metric to reach its new “normal” after a scaling event. If it takes three minutes for your CPU utilization to drop by about 50% after scaling up 2x, a cooldown less than three minutes causes AWS to scale out again before the previous scale out has had time to take effect on your metrics, causing it to scale out more than necessary.

Additionally, CloudWatch usually stores your target metric in one- or five-minute intervals. The cooldown associated with those metrics cannot be shorter than that interval. Otherwise, after a scaling event, CloudWatch re-evaluates the alarms before the metrics have been updated, causing another, potentially incorrect, scaling event.

Surprise 3: Custom CloudWatch Metric Support is Undocumented

Update: AWS has significantly improved documentation around custom CloudWatch metric support! See:

Target tracking scaling on ECS comes “batteries included” with CPU and memory utilization targets, and they can be configured directly via the ECS dashboard. For other metrics, target tracking autoscaling also supports tracking against your custom CloudWatch metrics, but that information is almost entirely undocumented. The only reference I was able to find was a brief mention of a CustomizedMetricSpecification in the API documentation.

Additionally, the ECS dashboard does not yet support displaying target tracking policies with custom CloudWatch metrics. You can’t create or edit these policies in the dashboard; you can only create them manually using the PutScalingPolicy API. Moreover, once you create them, they’ll cause your Auto Scaling tab to fail to load:

Thankfully, Terraform makes creating and updating target tracking autoscaling policies relatively easy, though it too is rather light on documentation. Here’s an example target tracking autoscaling policy using a CloudWatch metric with multiple dimensions (“Environment” and “Service”) for a service named “myservice”:

The above autoscaling policy tries to keep the number of inflight requests at 100 for our “myservice” ECS service in production. It scales out at most every 1 minute and scales in at most every 5 minutes.
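As a hedged sketch of the same settings outside Terraform, the snippet below builds the request parameters for the Application Auto Scaling PutScalingPolicy API as a plain Python dict (the shape accepted by boto3’s `put_scaling_policy`), without calling AWS. The policy name, cluster name (`mycluster`), metric name (`InflightRequests`), and namespace (`MyCompany/Services`) are hypothetical; the target value and cooldowns come from the description above.

```python
import json

policy = {
    "PolicyName": "myservice-inflight-target-tracking",  # hypothetical name
    "ServiceNamespace": "ecs",
    "ResourceId": "service/mycluster/myservice",         # hypothetical cluster
    "ScalableDimension": "ecs:service:DesiredCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 100.0,       # keep inflight requests at 100
        "ScaleOutCooldown": 60,     # scale out at most every 1 minute
        "ScaleInCooldown": 300,     # scale in at most every 5 minutes
        "CustomizedMetricSpecification": {
            "MetricName": "InflightRequests",   # hypothetical metric
            "Namespace": "MyCompany/Services",  # hypothetical namespace
            "Statistic": "Average",
            "Dimensions": [
                {"Name": "Environment", "Value": "production"},
                {"Name": "Service", "Value": "myservice"},
            ],
        },
    },
}

print(json.dumps(policy, indent=2))
```

With boto3 installed and credentials configured, a dict like this could be passed as keyword arguments to the `application-autoscaling` client’s `put_scaling_policy` call.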

Even More Surprises

Target tracking scaling can be tremendously useful in many situations by allowing you to quickly scale out ECS services by large magnitudes to handle unpredictable load patterns. However, like all AWS conveniences, target tracking autoscaling also brings with it a hidden layer of additional complexity that you should consider carefully before choosing it over the simple step scaling strategy, which scales in and out by a fixed number of tasks.

We’ve found that target tracking autoscaling works best in situations where your ECS service and CloudWatch metrics meet the following criteria:

  • Your service should have at least one metric that is directly affected by the running task count. For example, a web service likely uses twice the amount of CPU time when handling twice the volume of requests, so CPU utilization is a good target metric for target tracking scaling.

  • The metrics that you target should be bounded, or your service should have a maximum task count that is high enough to allow for headroom in scaling out but is low enough to prevent you from spending all your money. Target tracking autoscaling scales out proportionally so if the actual metric value exceeds the target by orders of magnitude, AWS scales your application (and your bill) out by corresponding orders of magnitude. In other words, if you are target tracking scaling on queue depth and your target depth is 100 but your actual queue depth is 100,000, AWS scales out to 1,000x more tasks for your service. You can protect yourself against this by setting a maximum task count for your queue worker service.

  • Your target metrics should be relatively stable and predictable given a stable amount of load. If your target metric oscillates wildly given a stable traffic volume or load, your service may not be a good fit for target tracking scaling because AWS is not able to correctly scale your task count to nudge the target metrics in the right direction. 
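The queue depth example in the second criterion shows why a maximum task count matters. A minimal sketch, assuming the proportional rule is simply clamped to the service’s configured minimum and maximum (`bounded_task_count` and its limits are hypothetical):

```python
import math


def bounded_task_count(current, actual, target, min_tasks, max_tasks):
    """Proportional scale-out, clamped to the service's configured limits."""
    proportional = math.ceil(current * actual / target)
    return max(min_tasks, min(max_tasks, proportional))


# Target queue depth of 100, actual depth of 100,000, 10 running tasks.
unbounded = math.ceil(10 * 100_000 / 100)
bounded = bounded_task_count(10, 100_000, 100, min_tasks=2, max_tasks=200)
print(unbounded, bounded)
```

Without the cap, the rule asks for 1,000x the current task count; with it, the service stops at the configured maximum.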

One of the services that we use target tracking autoscaling for at Segment is the service that handles sending 200,000 outbound requests per second to the hundreds of external destination APIs that we support. Each API has unpredictable characteristics like latency (during which time the service is idle waiting for network I/O) or error rates. This unpredictability makes scaling on CPU utilization a poor scaling target, so we also scale on the count of open requests or “inflight count.” Each worker has a target inflight count of 50, a configured maximum of 200, a scale out cooldown of 30 seconds, and a scale in cooldown of 45 seconds. For this particular service, that config is a sweet spot that allows us to scale out quickly (but not so quick as to burn money needlessly) while also scaling in less aggressively.

In the end, the best way to find the right autoscaling strategy is to test it in your specific environment and against your specific load patterns. We hope that by knowing the above surprises ahead of time you can avoid a few more 3AM pager alerts and a few more shocking AWS bills.


Chris Sperandio on December 12th 2018

Today, we announced the availability of our Config API for developers to programmatically provision, audit, and maintain the Sources, Destinations, and Tracking Plans in their Segment workspaces. This is one step forward in Segment’s greater strategy to transition from API-driven to API-first development and become infinitely interoperable with companies’ internal infrastructure.

Our shift reflects a greater market shift over the past 30 years in how technology has impacted where and how companies create value. In the 80s, most industries were horizontally integrated, and few companies could afford to interact directly with customers. They created competitive advantage through operations and logistics and relied on additional layers of the value chain to reach customers. Software has made it easier to deliver services and goods more efficiently all the way to end consumers. As a result, today’s companies crave APIs that are extensible and responsive to their modular infrastructure and enable them to differentiate on customer experience for the first time. 

In this post, we’re excited to share our motivations for becoming an API-first company and the historical context for how to think about why APIs are eating software that is eating the world.  

Identifying where companies create value

So why go API-first? Because in every industry, the value chain is transforming, and APIs are the only way to keep up.

The idea of a value chain isn’t new. Businesses have been using this tool, first coined by Michael Porter of HBS, since 1985. He decomposed businesses into their various functions and arranged those functions as a pipeline, separating the “primary” activities of a firm from “supporting” ones. Primary activities are how you create and deliver value to the market, and supporting activities are those that, well, support these endeavors. 

The value chain of businesses in 1985

Thinking of a firm or business unit as a value chain is helpful for understanding where a firm has or can create a meaningful competitive advantage. In other words, it’s a system for determining where to double down on building unique, differentiated value and where to outsource to create cost advantages. 

Businesses themselves are only one link in a broader market or industry value system: the outputs of a firm’s value pipeline will subsequently pass through additional links in that chain, such as distributors and/or retailers, before they’re purchased by “end user” customers. 

The value system from supplier to end user

Before the internet — and still today in heavily industrialized or regulated industries — vertically integrating your business to own the end customer experience incurred high marginal costs and was prohibitively expensive. For consumer goods or healthcare, conventional wisdom holds that this is largely still true, though companies like Dollar Shave Club or Spruce Health might beg to differ!

The skills and competencies to differentiate in retail are different from distribution, which are different from manufacturing, etc. In focusing along these logistical steps, companies become horizontally focused, and become distant and removed from their true end customers. All too often, our everyday customer experiences still reflect this!

The critical path in a pipeline business

Such businesses, links in a linear value system from raw material to a real-world product in the hands of customers, might best be described as “pipeline businesses.” For these pipeline businesses, the links in their chain where they could best differentiate — where their moats were widest — were inbound logistics (sourcing inputs), operations, and outbound logistics (delivering outputs). Together these comprised the “critical path,” or chain of primary activities, that created value for a pipeline business.

Porter was careful to put customer-facing functions, including sales, marketing, and customer support, inside what he called primary activities. However, only very few large companies, and generally only the luxury brands affordable to the few — think Nordstrom, Mercedes, Four Seasons, or American Express — actually differentiated on these dimensions. For most large companies, these customer-facing activities were better described as secondary activities, and they expanded their profit pools by viewing them as cost centers and outsourcing or deferring to further specialized firms. (Hello, Dunder Mifflin). 

But when the internet happened, the critical path was reshaped forever.

Digital reformation: software enters the value chain

When software first emerged as a viable business tool, most enterprises considered the technology an opportunity to do what they already did more efficiently. Hence the inclusion of “technology development” as a supporting activity in the original value chain composition. 

As vendors popped up to offer software products to help support these value chain reformations, pipeline companies were most open to buying applications that could streamline their secondary activities like sales, marketing, and support. These were less risky, and most of the direct investment in building technology was thought to be better allocated in further differentiating the existing primary activities in the critical path. Because the software buyers were less invested in results — these were secondary activities, after all — they had low expectations for app usability.

The B2B vendors got away with long, onerous implementations and forced their customers to adapt the way they work to the vendor’s way of doing things. They charged extra for services that were needed to extract any value from their software. Because APIs made it easier to work with and integrate their software, these vendors saw APIs as a “nice-to-have.” Or, they charged extra for the use of these APIs to capture more from the IT budget.

Platforms over pipelines: software eats the value chain

But today, software is no longer viewed only as a tool to optimize existing things; it’s combinatorially interconnected, and it permeates everything. In this networked world, customer experience is the only true competitive advantage.

As the marginal cost of customer interactions trends to zero, companies can now afford to reach large audiences at scale and integrate their value proposition around customer experience. And in order to provide excellent customer experiences, what we used to think of as secondary activities are better framed as belonging right in the critical path through integration.

The predominant model of how businesses are organized shifts from “Pipeline” to “Platform,” and the mental model of a request/response lifecycle becomes more useful than that of a value chain.

In consumer-facing businesses, the embodiment of the request/response model is an omnipresent “mobile, on demand” company like Uber or Instacart.

In B2B, it’s an API-first one like AWS, Stripe, Plaid, or Twilio.  

These companies have digitized and vertically integrated every link of their value chain. They have slick websites and apps — on every platform — on the inbound side, and free, two-day shipping with no-worries returns on the outbound side.  

“Apps are increasingly becoming thin wrappers around use cases, not weighty shells around brands.” Chris Maddern, Co-Founder of Button

Because inbound and outbound logistics are ever “thinning” experiences, increasingly mediated via HTTP requests from mobile phones, tablets, laptops or servers, operations become everything behind those applications, and APIs make those experiences effective, relevant, worthwhile, and endearing. Request/response becomes the new pipeline.

The new critical path

Customer experience is the new logistics, and rapid learning, iteration, and integration are the new operations.

Regression models in Excel supporting inventory planning? Support activity.

Data Science permeating every facet of the customer experience? Primary activity.

Traditional, reactive “Business Intelligence”? Support activity.

ML-powered supply and demand forecasting to drive real-time marketplace optimization? Primary Activity.

For consumer companies to differentiate on customer experience, they have to integrate their sales, marketing, and customer support functions — links that were once thought of as secondary. These customer-facing departments and customer-facing digital experiences should converge on a shared, ever-updating understanding of who their customer is to tailor their experiences accordingly. Moreover, companies must operationalize the learnings and insights from these interactions to contextualize and tailor subsequent experiences. 

For firms that do this right, everything from content, to product recommendations, to promotions should be based on a real-time, integrated understanding of the factors that drive great customer experiences. This process of self-tuning requires indexing massive amounts of data and then the infrastructure to iterate, optimize, and personalize on the basis of it.

Our humble revision of the Value Chain model for 2018 — the company as a request/response lifecycle

While this model of a request/response firm may not look surprising to platforms, aggregators, digital native retailers, or API-driven middleware in B2B, the stalwart companies who drive the economy are catching on. And as the modern enterprise looks more like these request/response firms every day, the nature of enterprise software is changing with them to fit the model.

Streamlining the critical path: the emergence of API-first for a request/response world

As software became networked, and those networks hit a critical density in the 2000s, technology shifted the value chain composition again. After adopting new technology in secondary business units, consumer companies realized that software could improve processes and margins by outsourcing in their primary focus areas, as well. At this point, they started to introduce technology to their critical paths.

This is where the first B2B API-first companies emerged. They turned the “pipeline” model on its head by removing the heft, ceremony, and friction associated with their own critical path. They optimized this experience with software, then productized the software itself.  As a result, they helped B2C companies outsource micro-components of their value chain and enabled these companies to enter into new primary focus areas. 

The API-first companies’ entire end-to-end value proposition is integrated between the lifecycle of an HTTP request and response. Need to process a payment? Just make a request to Stripe, and by the time they respond—a few hundred milliseconds later—they’ve handled a ton of complexity under the hood to issue the charge. Send a text to your customer? 

Companies like Stripe and Twilio set themselves apart not only by the sheer amount of operative complexity they’re able to put behind an API, but because of how elegant, simple, and downright pleasant their APIs are to use for developers. In doing so, they give developers literal superpowers. 

As these companies became the de facto mechanism for accomplishing these operative tasks, they’ve aggregated happy customers along the way. What started as humble request/response companies have morphed into juggernaut platforms, expanding the scope of their missions and offerings. Before we knew it, “payment processing” became “empowering global commerce,” and “send an SMS” became “infrastructure for better communication.”

Reducing the cost of integrating these functions via APIs propelled the creation of countless startups with lower barriers to entry.

Building B2B software in a request/response world

For B2B companies selling into enterprises that are increasingly embodying the request/response model, modularity and recognizing that you’re only a part of a much greater whole is key. 

IT is an increasingly embedded function driving interconnection and integration. Companies and their partners — be they base infrastructure providers like AWS and GCP, advertising platforms like Facebook and Google Ads, or the smartest players in the SaaS space — are embracing interoperability through common infrastructure, APIs, and technical co-investment.

Rather than view the software they buy as end-to-end solutions that they’re going to train their teams up on, these enterprises are shifting to a “build and buy” model of private and public networked applications, where security and privacy are necessarily viewed as a shared responsibility. 

The components of their infrastructure that they do choose to buy are part of a broader, sprawling network composed of on-prem deployments, as well as private, public, and third-party cloud services. As a result, they emphasize the need for data portability and the ability to bring a new tool “into the fold” of their existing governance and change control policies and procedures. In fact, it’s generally preferred that the tool acquiesce to those existing procedures rather than force the team to adapt their procedures to the tool.

The worst thing you can tell your customer is that they should conform to your opinions about how to do something.

Sure, you built a beautiful user experience atop the data in your SaaS tool. But there’s an edge case you didn’t think of. And without an API, your customers have no recourse. With one, they can channel their needs into an opportunity for them to further invest in your ecosystem. More importantly, they can take “enterprise readiness” into their own hands and enact it on their own terms. In fact, I’ve been personally involved in several of our enterprise-facing initiatives, such as SSO integration with SAML IDPs and fine-grained permissions. While developing requirements for these features, far and away the most common refrain I’ve heard is, “just give me an API.”

Why is that? Amongst software developers, operations practitioners, and IT administrators alike, the concept of Infrastructure as Code (IaC) has taken hold.  This means writing code to manage configurations and automate provisioning of the underlying infrastructure (servers, databases, etc.) in addition to application deployments. The reason we were so excited to adopt this practice ourselves at Segment is that IaC inserts proven software development practices, like version control, continuous testing, and small deployments, into the management lifecycle of the base infrastructure that applications run on.

In the past, “base infrastructure” had a relatively static and monolithic connotation. Today companies are deploying their application not just to “servers” or VMs in their VPCs, but to a dynamic network of cloud-agnostic container runtimes and managed “serverless” deployment targets. At the same time, they rely on a growing network of third-party, API-driven services that provide key functions such as payments, communications, shipping, identity verification, background checking, monitoring, alerting, and analytics. 

At Segment, our own engineers refuse to waste our time and increase our risk profile by clicking around in the AWS console, instead opting to use terraform for provisioning. They go so far as to home-roll applications, like specs for “peering into” our ECS clusters and station agent for querying them. None of these workflows or custom applications would be possible without the ECS control plane APIs.

And it goes beyond AWS. We want to make it functionally impossible to deploy a service that doesn’t have metrics and monitoring. To do this, we threw together a terraform provider against the Datadog API and codified our baseline alerting thresholds right into our declarative service definitions. 

Now, we’re offering that same proposition to our customers through our Config API for provisioning integrations, workspaces, and configuring tracking plans. We’re excited to see a terraform provider pop up. (And, we have it on good authority the community is already working on it.) Using the Config API and terraform, customers can codify and automate their pre-configured integration settings and credentials when provisioning new domains or updating tracking plans. 

…and that’s where we get back to Segment

Because I know what you’re thinking. Wasn’t Segment already API-first?

Well, partially. Segment, historically, has been API-driven. Which is to say that we’ve been API-first, but only in a few key areas, and hopefully the models and context we explored above can help to explain why!

When we first launched analytics.js, we introduced an elegant and focused API for recording events about how your customers interact with your business. So you made requests to Segment — but did you wait on a response? No! You just let us handle sending the events to your chosen integrations.

That’s because, then, it was a better inbound link to a secondary value chain activity— “analytics.” Companies didn’t want to wait any milliseconds to hear back from Segment because we weren’t in the critical path of their value delivery. (Side note, we went to great lengths to avoid any waiting at all — all our collection libraries are entirely asynchronous and non-blocking.)
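To make "asynchronous and non-blocking" concrete, here is a minimal sketch of the fire-and-forget pattern in Python. It is not Segment's library code: the `AsyncCollector` class, its `send` callback, and the event shape are all assumptions made for illustration. The point is only that `track` enqueues and returns immediately, while a background thread handles delivery.

```python
import queue
import threading


class AsyncCollector:
    """Hypothetical fire-and-forget collector (illustrative, not a real SDK)."""

    def __init__(self, send):
        self._send = send              # callable that delivers a single event
        self._events = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def track(self, user_id, event, properties=None):
        # Enqueue and return right away; the caller never waits on delivery.
        self._events.put(
            {"userId": user_id, "event": event, "properties": properties or {}}
        )

    def _drain(self):
        # Runs on the background thread and delivers events in FIFO order.
        while True:
            item = self._events.get()
            try:
                self._send(item)
            except Exception:
                pass                   # a real client would retry or log here
            finally:
                self._events.task_done()

    def flush(self):
        # Block until every event queued so far has been handed to `send`.
        self._events.join()


delivered = []
collector = AsyncCollector(delivered.append)
collector.track("u1", "Signed Up", {"plan": "free"})
collector.track("u1", "Page Viewed")
collector.flush()  # only needed here so the example can inspect `delivered`
```

In application code you would normally never call `flush` on the request path; it exists so that a process can drain the queue before exiting.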

And while engineers loved the simplicity of our Data Collection API, the real reason they love Segment is that integrating with that API is the last analytics, marketing, sales, or support integration they ever have to do. That value proposition is what lies between our “API-driven” inbound and outbound value chain links.  The operative link in Segment’s Connections Product is the act of multiplexing, translating, and routing the data customers send us to wherever those customers want.

What exploded underneath our feet when we released analytics.js was the realization that the larger the organization, the more likely it is that the person who needs to access and analyze data is different from the person who can instrument their applications to collect that data. By adopting Segment, companies decoupled customer data instrumentation from analysis and automation, disentangling “what data do we need?” from “how are we going to use it?”

In effect, Segment became the “backbone network router” in charge of packet-switching customer data inside a company’s data network.
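As a rough illustration of "multiplexing, translating, and routing," the sketch below fans a single event out to several destinations, reshaping it for each one. The destination names (`analytics_tool`, `email_tool`) and their payload shapes are invented for the example and do not correspond to real integrations.

```python
def to_analytics_tool(event):
    # Hypothetical destination that wants one flat record.
    return {"distinct_id": event["userId"], "name": event["event"], **event["properties"]}


def to_email_tool(event):
    # Hypothetical destination that only cares who did what.
    return {"recipient": event["userId"], "trigger": event["event"]}


# Registry of translators, keyed by destination name (all names are assumptions).
DESTINATIONS = {
    "analytics_tool": to_analytics_tool,
    "email_tool": to_email_tool,
}


def route(event, enabled):
    """Translate one inbound event into a payload per enabled destination."""
    return {name: DESTINATIONS[name](event) for name in enabled}


event = {"userId": "u1", "event": "Signed Up", "properties": {"plan": "free"}}
payloads = route(event, ["analytics_tool", "email_tool"])
```

The instrumenting engineer only ever produces `event`; who consumes it, and in what shape, is decided by the `enabled` list rather than by more tracking code.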

Becoming Customer Data Infrastructure

We got this far without thinking API-first when it came to our control plane. Even with all our high-minded prognostications about the end of traditional value chains! So why make the shift now?  

The reason to make such a change, as ever, is strong customer pull.

Since introducing our data router, Segment has evolved substantially. Today, the original Segment Data Collection API you know and love is the inbound link in the customer data infrastructure request/response lifecycle. 

With each big new product release this year, be it our GDPR functionality, Protocols, or Personas, we’ve heard emphatically from customers that they want to “drive” these features programmatically, and we’ve shipped key APIs with each to deliver on those needs.

All the while, we’ve also noticed more than a few customers — and even partners looking to develop deeper, workflow-based integrations with Segment — poking around under the hood of the private control plane APIs that drive these products.

What’s clear is that while our original, “entry-level” job to be done (analytics instrumentation) may have been a “send-it-and-forget-it” API interaction, companies have come to rely on their customer data in the critical path of delivering value through their applications, products, and experiences. Now, data collection has moved from fueling “secondary” links to a first-order priority.

In fact, this thesis (and the accompanying customer pull) has driven Segment’s product portfolio expansion to help companies put clean, consented, synthesized customer data in the critical path of their customer experiences.

And this is where we bring it all together. Because it’s not just consuming the data that fits the mold for an API-first model. As our customers build and adopt applications that fit into a broader network, and they bring once-“supporting” value chain links into their critical path, they want to program the infrastructure that enables that as well.  

With the APIs, our customers have built Segment change management into their SDLC workflow. They run GDPR audits of data flow through their workspace with a button click. They’re keeping their privacy policies and consent management tools up-to-date in real-time with the latest tools they are using.

It’s incredibly humbling to have customers who push the boundaries of your product and are sufficiently invested to want to integrate it more deeply and more safely into their workflows. We’re proud to be enabling that by opening up our Config API, which we welcome you to explore here.

David Scrobonia on December 4th 2018

Security tools are not user friendly. This is a problem in a world where the security community is trying to “push security left” and integrate into development culture. Developers don’t want to use tools that have a poor user experience and add friction to their daily workflows. By ignoring UX, security tools are preventing teams from making their organizations more secure.

The team behind the Zed Attack Proxy (ZAP), a popular OSS attack proxy for testing web apps, worked on addressing this problem with our application. As a result we came up with three takeaways for improving the UX of security tools.

So What’s the Problem? 

Let’s walk through using ZAP to scan a web app for vulnerabilities. Our goal is to add the target application to our scope of attack, spider the site to discover all of the pages, and run the active scanner. So we would:

  1. Look in the Site Tree pane of ZAP to find our target (http://localhost:8000 in this case)

  2. Right click on the target and select “Include in Context”

  3. This opens a new menu asking you to select an existing context to add our app to scope (or create a new one)

  4. This opens another configuration menu that prompts you to define any regex rules before adding the application to scope 

  5. Now that our app is in scope, we click the target icon to hide other applications from the Site Tree

  6. To start the spider, right click on our application and hover over the “Attack” option in this list to expose a sub-context menu

  7. Click the newly exposed “Spider” option to open a configuration menu and press “Start Scan”.

  8. To start an active scan, we again right click on our app and hover over “Attack” to expose a menu

  9. Click “Active Scan” to open a configuration menu and finally press “Start Scan” to begin

Scanning an app with ZAP

Whew! This is not a simple workflow. It requires us to hunt through hidden context menus and click through several different menus, and it assumes that the user understands all of the configuration options. And this is for the most commonly used feature!

While this is not a great experience, ZAP is far from the least usable security tool available. Ask anybody on your security team about their experience with enterprise static analysis tools and they’ll be sure to give you an earful about a product still stuck in a pre-Y2K user interface.  In contrast to these commercial tools, ZAP is free, open source, and maintained by a handful of people working part time. This presents a huge gap in time and money that left us wondering where we should start in order to improve our user experience. To keep our efforts efficient we focused on three things.

3 Ways to Improve the UX of Security Tools

1. Make it Native

When assessing the security of a web app, you're frequently switching between ZAP and your browser. This quickly becomes distracting. Let's look at how we intercept requests with ZAP, another common feature, to see why.

  1. In the browser, navigate to and use the feature we want to test

  2. Go back to ZAP to observe the requests that were sent and turn on the “Break” feature to start intercepting messages

  3. Go back to the browser and reuse the feature we wanted to test

  4. Go back to ZAP to see the intercepted message, modify the request, and then press “Continue”

  5. Go back to the app to see if the modified request changed the behaviour in the app

  6. Rinse and repeat for as many requests as needed to test the feature.

Intercepting HTTP messages with ZAP

Every time we want to intercept a request we have to go back and forth between the browser and ZAP five times! This may not seem like a lot at first, but considering that you may intercept hundreds of messages while testing for vulnerabilities, this becomes a headache.

The problem is that we are constantly changing contexts between ZAP and the native context for testing, the browser. I like to compare this to how a fighter pilot operates a jet. Their native context for flying the plane is looking through the windshield, so all of the important information they need to make decisions is presented on a heads up display (HUD). Imagine trying to survive a dog fight if all of the feedback about altitude, acceleration, and weapons status was only available in a dashboard of knobs and gauges. It would be impossible if you were continually having to monitor a separate context. 

A fighter jet’s HUD efficiently displays information

To provide a natively integrated experience in ZAP, we took inspiration from a fighter jet’s heads up display.

The ZAP Heads Up Display is a new UI overlaid on the target page you are testing, providing the functionality of ZAP in the browser.

Intercepting HTTP messages with the HUD

Making this change wasn’t a simple “lift and shift”, however. It required us to get creative with how we approached our design. We knew we didn’t want to implement the HUD as a browser plugin, which would require us to support multiple code bases that were bound to the restrictions of their plugin APIs. Even then we wouldn’t be able to support all browsers. This was a non-starter for a small development team with a global user base that uses a variety of browsers.

Instead of a browser plugin, we leveraged ZAP’s all powerful position as a proxy to inject the HUD into the target application. When ZAP intercepts an HTTP response from a server, it modifies the HTML to include an extra script tag. The source of this script is a javascript file that is served from ZAP. When this script is executed in the web app it adds several iframes to the DOM which make up the components of the ZAP HUD.

ZAP modifies the HTTP responses of the target application
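The injection step can be sketched in a few lines of Python. This is not ZAP's implementation: the function name `inject_hud`, the placeholder script URL, and the choice to insert just before the closing body tag are all assumptions for illustration. The post only says that an extra script tag, served from ZAP, is added to the HTML.

```python
import re

# Placeholder URL; the real address ZAP serves the HUD script from is not given here.
HUD_SCRIPT_URL = "https://zap.example/hud/inject.js"


def inject_hud(html: str, script_url: str = HUD_SCRIPT_URL) -> str:
    """Return the HTML with one extra script tag pointing at the proxy-served file."""
    tag = f'<script src="{script_url}"></script>'
    # Look for the first closing body tag, ignoring case and stray whitespace.
    match = re.search(r"</body\s*>", html, flags=re.IGNORECASE)
    if match is None:
        # No closing body tag found: fall back to appending the script tag.
        return html + tag
    return html[: match.start()] + tag + html[match.start():]


page = "<html><body><h1>Juice Shop</h1></body></html>"
modified = inject_hud(page)
```

Because the rewrite happens in the proxy, the target application and the browser need no changes, which is what makes the approach browser-agnostic.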

This approach allows us to add the HUD to any target application running in any modern browser. By making the HUD native we’ve created a frictionless experience, which is essential if you want users to adopt your work. 

2. Sacrifice Power for Accessibility

When looking at ZAP there are a lot of powerful features, but where to find and how to access them isn’t immediately intuitive. To access a common feature like “Active Scan” we had to tediously crawl into a context menu, parse a large list, and click through a sub-menu. This inconvenience is multiplied when you consider there are multiple places in the UI you can find new features. To start a scan you can also navigate to the bottom pane, open a new tab, choose the “Active Scan” feature, and proceed through the configuration menus that way. 

Accessing features via context menus

Accessing features via the bottom drawer

Presenting features in multiple places in the UI is disorienting for a new user trying to figure out how to navigate the application. Will the next feature be found by opening a new tab, or will it be found in a sublayer of a context menu?

To address our complex UI we made a trade off - limit features by simplifying their interfaces. We tossed away the ability to configure features upfront, eliminated multiple entry points into features, and forced scattered ZAP features into consistent UX elements. The HUD now presents features as tools, the discrete buttons on either side of the Heads Up Display.

This consistency creates the same experience when using different tools. Now, when a user wants to find more features, they know exactly where to find them - in another tool!

Remember how complicated the “Include in Context” flow was? Now with HUD tools, all of these features have been built into the “Scope” tool. Simply click the tool, select “Add to Scope”, and we’re done! You’re now ready to attack the application. And how would we start spidering the site? You guessed it - click the Spider tool, select “Start Spider”, and it will start to run!

Scanning an app with the HUD

The simplified tools interface does not provide the same level of feature depth that ZAP traditionally provides. We can’t define a scope regex or define the maximum depth of a spider crawl right out of the box. This is the trade off we chose, though: to forfeit feature power for accessibility.

There is a lot happening behind the scenes to enable all of the functionality of ZAP within our simpler “tool” interface. To keep the HUD responsive we aren’t loading all of the functionality into the iframes. Instead, the HUD leverages a service worker, a background JavaScript process similar to a web worker, to handle most of the heavy lifting so the iframes can be lightweight and responsive. Service workers are more privileged than web workers and can persist across multiple page loads, meaning we only have to load our JavaScript once and keep it tucked away in the background.* The service worker hosts all of the tool logic and exposes it to the different UI iframes via the postMessage API, a browser API for inter-window (and cross origin) communication.

Not all of the functionality of the tools is stored in the service worker, though. Because we’re already running ZAP we make heavy use of its existing features via the ZAP API. ZAP’s API is very thorough and enables developers to run almost the entire application via a simple REST API. The service worker communicates with this API along with a websocket API that streams events captured in ZAP so that the HUD can have live, up to date notifications. 

The Heads Up Display uses several different technologies

A great way to see all of these pieces in motion is with the new “Break” tool. When the Break tool is clicked in the HUD, a postMessage is sent from the UI frame to the service worker where the tool logic is running, which sends an API request to ZAP to start intercepting messages. When the user tries to open a new page ZAP will intercept the message, use the websocket API to notify the service worker that it just intercepted a new message, and then the service worker will notify an iframe to display the intercepted message. 
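The message flow for the Break tool can be modeled as a small in-memory simulation. This is Python standing in for browser and proxy components, so none of it is HUD code: `FakeZap`, `ServiceWorker`, the message shapes, and the `display` frame name are all invented to show the order of the hops (UI frame to service worker, service worker to ZAP, ZAP back to the service worker, service worker to an iframe).

```python
class FakeZap:
    """Stand-in for ZAP: holds the break flag and pushes events to a listener."""

    def __init__(self):
        self.break_enabled = False
        self.on_event = None                 # models the websocket event stream

    def api_set_break(self, enabled):
        self.break_enabled = enabled         # models an API request to ZAP

    def proxy_request(self, url):
        # When breaking, hold the request and notify whoever is listening.
        if self.break_enabled and self.on_event is not None:
            self.on_event({"type": "intercepted", "url": url})
            return "held"
        return "forwarded"


class ServiceWorker:
    """Stand-in for the service worker that owns the tool logic."""

    def __init__(self, zap, frames):
        self.zap = zap
        self.frames = frames
        zap.on_event = self.on_zap_event     # subscribe to ZAP's event stream

    def on_message(self, message):
        # Models a postMessage arriving from a UI frame.
        if message == {"tool": "break", "action": "start"}:
            self.zap.api_set_break(True)

    def on_zap_event(self, event):
        # Tell the display frame to show the intercepted message.
        if event["type"] == "intercepted":
            self.frames["display"].append(event["url"])


zap = FakeZap()
frames = {"display": []}
worker = ServiceWorker(zap, frames)

before = zap.proxy_request("http://localhost:8000/")        # break is off
worker.on_message({"tool": "break", "action": "start"})     # user clicks Break
after = zap.proxy_request("http://localhost:8000/login")    # now intercepted
```

The UI frame never talks to ZAP directly in this model; everything goes through the worker, which is what lets the frames stay lightweight.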

All of these technologies help to keep the HUD accessible. If a user can’t figure out how to use your software, it doesn’t matter how good it is at solving a problem: it won’t be used.

3. Keep it Flexible

Even with an improved interface for accessing ZAP, we didn’t want to assume how users would interact with the HUD. To prevent locking our users into rigid workflows we made the UI as configurable as possible and made it easy for users to add more functionality to the HUD.

Applications come in all shapes and sizes and we don’t want the HUD’s display to get in the way. To prevent usability issues users can arrange the tools however they like, or remove tools that get in the way. The entire HUD can even be temporarily hidden from view. The ultimate goal is to have a fully customizable drag and drop interface where users can manage the HUD like it’s the home screen of their smartphone: changing fonts, adding widgets, and changing the layout.

Users can also quickly add features to the HUD, so they aren’t stuck using only the default tools. Developers often have custom testing or build scripts and connecting them to the HUD would make testing that much easier. That's something we can do in just a few minutes using only ZAP.

ZAP has a “Scripts” plugin that allows users to hook custom scripts into various points of the application. Scripts can be used to modify requests or responses, add active scanning rules, or change any other ZAP behaviour. The plugin also provides access to the HUD code, allowing users to quickly copy, paste, and modify the code for an existing tool. By tweaking just a few lines we can create a tool that uses the ZAP API to start running any user defined script: a developer’s custom testing scripts, a QA’s web automation script, or a hacker’s favorite tool.

Users can add custom functionality to the HUD in a few minutes

In the example above we have a script called “Hack it!” that replaces the text “Juice Shop” with “HACKED”. After quickly modifying an existing tool, we change a ZAP API call to enable our defined script, and when we restart the HUD you can see that the new “Hack it!” tool is now available to be added to our display.

Although this is a simple example, the scripting feature can be used to add any functionality to the HUD and supports several different scripting languages.


By focusing on three things we were able to make a powerful security tool much more accessible to a wider audience. By making it native we empower users in the environment they’re most comfortable in. By sacrificing power for accessibility we enable users of all levels to quickly start security testing. By keeping it flexible our tool adapts to a user’s specific needs.

While these design concepts aren’t revolutionary, or even that original, they highlight a fundamental gap between how the security community talks about security and how we practice security. If we honestly want to “push security left” we must leverage these principles to provide frictionless security for our users. 


The HUD is now in Alpha release! If you would like to test fly the HUD visit https://github.com/zaproxy/zap-hud to get up and running in a few minutes and see how it works. This is still a very early release so it may be buggy, but please share your feedback on usability, reliability, and feature requests. If you’re interested in helping out with the project please reach out to us via the GitHub project or on Twitter at @david_scrobonia or @zaproxy.

* Service worker savvy readers will know that service workers are event driven, and that the lifecycle of a service worker expects them to be constantly terminated and activated, requiring all dependencies to be imported each time this happens. To prevent this we have hacked around this spec by sending the service worker a heart beat to keep it alive while the HUD is active.

Gurdas Nijor on November 16th 2018

At this point it's well-accepted that analytics data is the beating heart of a great customer experience. It's what allows us to understand our customer's journey through the product and pinpoint opportunities for improvement.

This all sounds great, but reality is a messy place. As a team becomes an organization, the structure used for recording this data can easily diverge. One part of the product might use userId and another user_id. There might be both a CartCheckout and a CheckoutCart event that mean the same thing. Given thousands of call sites and hundreds of kinds of records, this sort of thing is an inevitability.

Without an enforceable shared vocabulary for describing this data, an organization's ability to use this data in any meaningful way becomes crippled.

Downstream tools for analyzing data begin to lose value as redundant or unexpected data trickles in as a result of implementation errors at the source. Fixing these issues after they’ve made it downstream turns out to be a very expensive proposition, with estimates as high as 60% of a data scientist’s time being spent cleaning and organizing data.

At Segment, we’ve put a considerable amount of engineering effort into scaling our data pipeline to ensure low-latency, high throughput and reliable delivery of event data to power the customer data infrastructure for over 15 thousand companies.

We also recently launched Protocols to help our customers ensure high quality data at scale.

In this post, I want to explore some approaches we’re taking to tackle that dimension of scalability from a developer perspective, allowing organizations to scale the domain of their customer data with a shared, consistent representation of it.

Tracking Plans

To ensure a successful implementation of Segment, we’ll typically recommend that customers maintain something known as a “Tracking Plan.”

An example of a Tracking Plan that we would use internally for our own product

This spreadsheet gives structure and meaning to the events and fields that are present in every customer data payload.

A tracking plan (also known as an implementation spec) helps clarify what events to track, where those events need to go in the code base, and why those events are necessary from a business perspective.


An example of where a Tracking Plan becomes a critical tool would be in any scenario involving multiple engineers working across products and platforms. If there are no standards around how one should represent a “Signed Up” event or what metadata would be important to capture, you’d eventually find every permutation of it when it comes time to make use of that business critical data, rendering it “worse than useless.”

This Tracking Plan serves as a living document for Central Analytics, Product, Engineering and other teams to agree on what’s important to measure, how those measures are represented and how to name them (the three hard problems in analytics).

Where it breaks down, and how to fix it

As a Tracking Plan evolves, the code that implements it often will not change accordingly. Tickets may get assigned, but oftentimes feature work will get prioritized over maintaining the tracking code, leading to a divergence between the Tracking Plan and implementation. In addition to this, natural human error is still a very real factor that can lead to an incorrect implementation.

This error is a pretty natural result of not having a system to provide validation to an implementor (both at implementation-time, and on an ongoing basis).

Validation that an implementation is correct relative to some idealized target sounds exactly like something that machines can help us with, and indeed they have: from compilers that enforce certain invariants of programs to test frameworks that allow us to author scenarios in which to run our code and assert expected behaviors.

As alluded to above, an ideal system will provide feedback and validation at three critical places in the product development lifecycle:

  • At development time: “What should I implement?”

  • At build time: “Is it right?”

  • At CI time: “Has it stayed right?”

As a developer-focused company, these elements of aligning a great developer experience with the process improvements of a centralized tracking plan became a compelling problem to solve.

That’s why we built, and are now open sourcing, Typewriter: a tool that lets developers “bring a tracking plan into their editor” by generating a strongly typed client library across a variety of languages from a centrally defined spec.

The developer experience of using a Typewriter generated library in Typescript

Typewriter delivers a higher degree of developer ergonomics over our more general purpose analytics libraries by providing a strongly typed API that speaks to a customer’s data domain. The events, their properties, types, and all associated documentation are present to inform product engineers that need to implement them perfectly to spec, all without leaving the comfort of their development environment.

Compile-time (and runtime) validation ensures that tracking events are dispatched with the correct fields and types, giving real-time feedback that an implementation is correct.

This answers the questions of “What should I implement” and “Is it right” mentioned earlier. The remaining question of “Has it stayed right?” can be answered by integrating Typewriter as a task in your CI system.
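To give a feel for what a strongly typed tracking API buys you, here is a hand-written Python sketch in the spirit of a generated library. It is not Typewriter output: the `SignedUp` event, its fields, and the `TypedAnalytics` wrapper are hypothetical. A static checker such as mypy could flag a misspelled or missing field before the code runs, and the runtime check catches wrong types that slip through.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SignedUp:
    """Hypothetical event: fired when a user creates an account."""

    user_id: str
    plan: str

    def __post_init__(self):
        # Runtime validation, complementing what a static type checker reports.
        for name, expected in (("user_id", str), ("plan", str)):
            value = getattr(self, name)
            if not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )


class TypedAnalytics:
    """Thin wrapper exposing one method per event in the (hypothetical) plan."""

    def __init__(self, track):
        self._track = track          # general-purpose track(name, properties) callable

    def signed_up(self, props: SignedUp) -> None:
        # The event name lives here once, so call sites cannot misspell it.
        self._track("Signed Up", asdict(props))


sent = []
analytics = TypedAnalytics(lambda name, props: sent.append((name, props)))
analytics.signed_up(SignedUp(user_id="u_123", plan="free"))
```

Compare this with a free-form `track("Signed Up", {...})` call, where nothing stops one call site from sending `userId` and another `user_id`.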

How it works

Typewriter uses what amounts to a machine-readable Tracking Plan: a rich language, built on JSON Schema, for defining and validating events, their properties, and associated types. That spec can be compiled into a standalone library (making use of the excellent quicktype library to generate types for the languages we target that have static type systems).

This spec can be managed within your codebase, ensuring that any changes to it will result in a regenerated client library, and always up to date tracking code.
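As a sketch of what validating against a machine-readable spec looks like, the Python below checks a payload against a tiny, hand-rolled subset of JSON Schema (`required`, `properties` with a `type`, and `additionalProperties`). The schema contents and the `violations` helper are assumptions for illustration; this is neither Typewriter's spec format nor a full JSON Schema validator.

```python
# Hypothetical schema for a "Signed Up" event, using a small subset of JSON Schema.
SIGNED_UP_SCHEMA = {
    "type": "object",
    "required": ["userId", "plan"],
    "properties": {
        "userId": {"type": "string"},
        "plan": {"type": "string"},
        "seats": {"type": "integer"},
    },
    "additionalProperties": False,
}

_PYTHON_TYPES = {"string": str, "integer": int, "boolean": bool, "object": dict}


def violations(payload, schema):
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(payload, dict):
        return ["payload must be an object"]
    errors = []
    for key in schema.get("required", []):
        if key not in payload:
            errors.append(f"missing required property: {key}")
    properties = schema.get("properties", {})
    for key, value in payload.items():
        if key not in properties:
            if schema.get("additionalProperties", True) is False:
                errors.append(f"unexpected property: {key}")
            continue
        type_name = properties[key]["type"]
        expected = _PYTHON_TYPES[type_name]
        # bool is a subclass of int in Python, so reject it for "integer".
        wrong_bool = expected is int and isinstance(value, bool)
        if not isinstance(value, expected) or wrong_bool:
            errors.append(f"wrong type for {key}: expected {type_name}")
    return errors


good = violations({"userId": "u1", "plan": "free"}, SIGNED_UP_SCHEMA)
bad = violations({"user_id": "u1", "plan": "free"}, SIGNED_UP_SCHEMA)
```

Running a check like this in CI, and failing the build when the list is non-empty, is one way to keep answering “Has it stayed right?”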

What comes next

Being avid Segment users ourselves, we’ve been migrating our mountains of hand written tracking code to Typewriter generated libraries and have been excited to realize the productivity gains of offloading that work to the tooling.

Typewriter will continue to evolve to support the needs of all Segment customers too — we’re continuing to expand it and are open to community PRs!

We’d like you to give Typewriter a shot in your own projects and feel free to open issues, submit PRs, or reach us on twitter @segment.

A special thanks to Colin King for all of his work in making Typewriter a reality, and the team at quicktype for producing an amazing library for us to build on top of.

Michael Fischer on October 30th 2018

“How should I secure my production environment, but still give developers access to my hosts?” It’s a simple question, but one which requires a fairly nuanced answer. You need a system which is secure, resilient to failure, and scalable to thousands of engineers–as well as being fully auditable.

There are a number of solutions out there, from simply managing the set of authorized_keys to running your own LDAP or Kerberos setup. But none of them quite worked for us (for reasons I’ll share down below). We wanted a system that required little additional infrastructure, but also gave us security and resilience.

This is the story of how we got rid of our shared accounts and brought secure, highly-available, per-person logins to our fleet of Linux EC2 instances without adding a bunch of extra infrastructure to our stack.

Limitations of shared logins

By default, AWS EC2 instances only have one login user available.  When launching an instance, you must select an SSH key pair from any of the ones that have been previously uploaded.  The EC2 control plane, typically in conjunction with cloud-init, will install the public key onto the instance and associate it with that user (ec2-user, ubuntu, centos, etc.).  That user typically has  sudo access so that it can perform privileged operations.

This design works well for individuals and very small organizations that have just a few people who can log into a host.  But when your organization begins to grow, the limitations of the single-key, single-user approach quickly become apparent.  Consider the following questions:

Who should have a copy of the private key?  Usually the person or people directly responsible for managing the EC2 instances should have it.  But what if they need to go on vacation?  Who should get a copy of the key in their absence?

What should you do when you no longer want a user to be able to log into an instance?  Suppose someone in possession of a shared private key leaves the company.  That user can still log into your instances until the public key is removed from them.   Do you continue to trust the user with this key?  If not, how do you generate and distribute a new key pair?  This poses a technical and logistical challenge.  Automation can help resolve that, but it doesn’t solve other issues.

What will you do if the private key is compromised? This question is similar to the one above, but requires more urgent attention.  It might be reasonable to trust a departing user for a while, but if you know your key is compromised, there’s little doubt you’ll want to replace it immediately.  If the automation to manage it doesn’t yet exist, you may find yourself in a very stressful situation; and stress and urgency often lead to automation errors that can make bad problems worse.

One solution that’s become increasingly popular in response to these issues has been to set up a Certificate Authority that signs temporary SSH credentials. Instead of trusting a private key, the server trusts the Certificate Authority.  Netflix’s BLESS is an open-source implementation of such a system.  

The short validity lifetime of the issued certificates does mitigate the above risks.  But it still doesn’t quite solve the following problems:

How do you provide for different privilege levels?  Suppose you want to allow some users to perform privileged operations, but want others to be able to log in in “read-only” mode.  With a shared login, that’s simply impossible: everyone who holds the key gets privileged access.

How do you audit activity on systems that have a shared login?  At Segment, we believe that in order to have best-in-class security and reliability, we must know the “Five Ws” of every material operation that is performed on our instances:

  • What happened?

  • Where did it take place?

  • When did it occur?

  • Why did it happen?

  • Who was involved?

Only with proper auditing can we know with certainty the answers to these questions.  As we’ve grown, our customer base has increasingly demanded that we have the ability to know, too.  And if you ever find yourself seeking certification under compliance standards such as ISO 27001, PCI-DSS, or SOC 2, you will be required to show you have an audit trail at hand.

We needed better information than this:


Our goals were the following:

  1. Be able to thoroughly audit activity on our servers;

  2. Have a single source of truth for user information;

  3. Work harmoniously with our single sign-on (SSO) providers; and

  4. Use two-factor authentication (2FA) to have top-notch security.

Here’s how we accomplished them.

Segment’s solution: LDAP with a twist

LDAP is an acronym for “Lightweight Directory Access Protocol.”  Put simply, it’s a service that provides information about users (people) and things (objects).  It has a rich query language, configurable schemas, and replication support.  If you’ve ever logged into a Windows domain, you probably used it (it’s at the heart of Active Directory) and never even knew it.  Linux supports LDAP as well, and there’s an Open Source server implementation called OpenLDAP.

You may ask: Isn’t LDAP notoriously complicated?  Yes, yes it is.  Running your own LDAP server isn’t for the faint of heart, and making it highly available is extremely challenging.

You may ask: Aren’t all the management interfaces for LDAP pretty poor? We weren’t sold on any of them we’d seen yet.  Active Directory is arguably the gold standard here — but we’re not a Windows shop, and we have no foreseeable plans to be one.  Besides, we didn’t want to be in the business of managing Yet Another User Database in the first place.

You may ask: Isn’t depending on a directory server to gain access to production resources risky?  It certainly can be.  Being locked out of our servers because of an LDAP server failure is an unacceptable risk.  But this risk can be significantly mitigated by decoupling the information — the user attributes we want to propagate — from the service itself.  We’ll discuss how we do that shortly.

Choosing an LDAP service

As we entered the planning process, we made a few early decisions that helped guide our choices.

First, we’re laser-focused on making Segment better every day, and we didn’t want to be distracted by having to maintain a dial-tone service that’s orthogonal to our product. We wanted a solution that was as “maintenance free” as possible. This quickly ruled out OpenLDAP, which is challenging to operate, particularly in a distributed fashion.

We also knew that we didn’t want to spend time maintaining the directory.  We already have a source of truth about our employees: Our HR system, BambooHR, is populated immediately upon hiring.  We didn’t want to have to re-enter data into another directory if we could avoid it.  Was such a thing possible?

Yes, it was! 

We turned to Foxpass to help us solve the problem.  Foxpass is a SaaS infrastructure provider that offers LDAP and RADIUS service, using an existing authentication provider as a source of truth for user information.  They support several authentication providers, including Okta, OneLogin, G Suite, and Office 365.

We use Okta to provide Single Sign-On for all our users, so this seemed perfect for us.  (Check out aws-okta if you haven’t already.) And better still, our Okta account is synchronized from BambooHR — so all we had to do was synchronize Foxpass with Okta.

The last bit of data Foxpass needs is our users’ SSH keys.  Fortunately, it’s simple for a new hire to upload their key on their first day: Just log into the web interface (which, of course, is protected by Okta SSO via G Suite) and add it.  SSH keys can also be rotated via the same interface.

Service Architecture

In addition to their SaaS offering, Foxpass also offers an on-premise solution in the form of a Docker image.  This appealed to us because we wanted to reduce our exposure to network-related issues, and we are already comfortable running containers using AWS ECS (Elastic Container Service).  So we decided to host it ourselves.  To do this, we:

  • Created a dedicated VPC for the cluster, with all the necessary subnets, security groups, and Internet Gateway

  • Created an RDS (Aurora MySQL) cluster used for data storage

  • Created a three-instance EC2 Auto Scaling Group of instances having the ECS Agent and Docker Engine installed - if an instance goes down, it’ll be automatically replaced

  • Created an ECS cluster for Foxpass

  • Created an ECS service pair for Foxpass to manage its containers on our EC2 instances (one service for its HTTP/LDAP services; one service for its maintenance worker)

  • Stored database passwords and TLS certificates in EC2 Parameter Store

We also modified the Foxpass Docker image with a custom ENTRYPOINT script that fetches the sensitive data from Parameter Store (via Chamber) before launching the Foxpass service:

Client instance configuration

On Linux, you need to configure two separate subsystems when you adopt LDAP authentication:

Authentication:  This is the responsibility of PAM (Pluggable Authentication Modules) and sshd (the ssh service).  These subsystems check the credentials of anyone who either logs in, or wants to switch user contexts (e.g. sudo).

User ID mappings: This is the realm of NSS, the Name Service Switch.  Even if you have authentication properly configured, Linux won’t be able to map user and group names to UIDs and GIDs (e.g. mifi is UID 1234) without it.
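To make the NSS half concrete, here is a minimal sketch in Python. It assumes a Unix host: the standard pwd module wraps the C library's account lookups, which on glibc systems consult whatever sources /etc/nsswitch.conf lists. The helper name lookup_uid is ours, purely for illustration.

```python
import pwd


def lookup_uid(uid):
    """Resolve a numeric UID to account details via the system's name services."""
    try:
        entry = pwd.getpwuid(uid)
    except KeyError:
        # No configured source (local files, a cache, LDAP, ...) knows this UID.
        return None
    return {"name": entry.pw_name, "uid": entry.pw_uid, "shell": entry.pw_shell}
```

If the lookup returns None for a user who exists in the directory, the name service configuration is the first place to look, even when authentication itself is set up correctly.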

There are many options for setting these subsystems up.  Red Hat, for example, recommends using SSSD on RHEL.   You can also use pam_ldap and nss_ldap to configure Linux to authenticate directly against your LDAP servers.  But we chose neither of those options:  We didn’t want to leave ourselves unable to log in if the LDAP server was unavailable, and both of those solutions have cases where a denial of service is possible.  (SSSD does provide a read-through cache, but it’s only populated when a user logs in.  SSSD is also somewhat complex to set up and debug.)


Ultimately we settled on nsscache.  nsscache (along with its companion NSS library, libnss-cache) is a tool that queries an entire LDAP directory and saves the matching results to a local cache file.    nsscache is run when an instance is first started, and about every 15 minutes thereafter via a systemd timer.   

This gives us a very strong guarantee: if nsscache runs successfully once at startup, every user who was in the directory at instance startup will be able to log in.  If some catastrophe occurs later, only new EC2 instances will be affected; and for existing instances, only modifications made after the failure will be deferred.  
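The property we rely on can be sketched in a few lines of Python. This is not nsscache's actual implementation, only an illustration of the idea: a failed refresh leaves the last good cache in place. The fetch_users callable and the simplified passwd-style line format are assumptions made for the example.

```python
import os
import tempfile


def refresh_cache(fetch_users, cache_path):
    """Rewrite the local cache; keep the old copy if the directory is unreachable.

    fetch_users is a hypothetical callable returning (name, uid, shell) tuples.
    Returns True when the cache was refreshed, False when the old copy was kept.
    """
    try:
        users = list(fetch_users())
    except OSError:
        # Directory unreachable: leave the last good cache untouched.
        return False

    lines = [
        f"{name}:x:{uid}:{uid}::/home/{name}:{shell}\n"
        for name, uid, shell in users
    ]
    # Write to a temporary file first, then swap it in atomically so readers
    # never observe a half-written cache.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.writelines(lines)
    os.replace(tmp_path, cache_path)
    return True


def cached_usernames(cache_path):
    """Return the usernames currently present in the local cache."""
    with open(cache_path) as cache:
        return [line.split(":", 1)[0] for line in cache if line.strip()]
```

Because logins only ever read the local file, an outage of the directory service delays updates but does not lock anyone out.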

To make it work, we changed the following lines in /etc/nsswitch.conf.  Note the cache keyword before compat:

So that users’ home directories are automatically created at login, we added the following to /etc/pam.d/common-session:

nsscache also ships with a program called nsscache-ssh-authorized-keys which takes a single username argument and returns the ssh key associated with the user.  The sshd configuration (/etc/ssh/sshd_config) is straightforward:

Emergency login

We haven’t had any reliability issues with nsscache or Foxpass since we rolled it out in late 2017.  But that doesn’t mean we’re not paranoid about losing access, especially during an incident! So just in case, we have a group of emergency users whose SSH keys live in a secret S3 bucket.   At instance startup, and regularly thereafter, a systemd unit reads the keys from the S3 bucket and appends them to the /etc/ssh/emergency_authorized_keys file.  
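Since the sync runs repeatedly, the append step should be idempotent. Here is a rough Python sketch of that merge logic, assuming the keys have already been fetched from the bucket; the function name and the in-memory interface are ours for illustration, not the actual systemd unit.

```python
def merge_authorized_keys(existing_text, fetched_keys):
    """Append fetched public keys that are not already present.

    existing_text: current contents of the authorized_keys-style file.
    fetched_keys: iterable of public key lines (hypothetically read from S3).
    Returns the new file contents.
    """
    lines = [line.strip() for line in existing_text.splitlines() if line.strip()]
    present = set(lines)
    for key in fetched_keys:
        key = key.strip()
        # Skip blanks and keys that an earlier run already appended.
        if key and key not in present:
            lines.append(key)
            present.add(key)
    return "".join(line + "\n" for line in lines)
```

Running the merge twice with the same input leaves the file unchanged, so a timer can trigger it as often as needed.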

As with ordinary users, the emergency user requires two-factor authentication to log in.  For extra security, we’re also alerted whenever a new key is added to the S3 bucket via Event Notifications.

We also had to modify /etc/ssh/sshd_config to make it work:


Bastion servers

Security is of utmost importance at Segment. Consistent with best practices, we protect our EC2 instances by forcing all access to them through a set of dedicated bastion servers.  

Our bastion servers are a bit different than some in that we don’t actually permit shell access to them: their sole purpose is to perform Two-Factor Authentication (2FA) and forward inbound connections via secure tunneling to our EC2 instances.

To enforce these restrictions, we made a few modifications to our configuration files.  

First, we published a patch to nsscache that optionally overrides the users’ shell when creating the local login database from LDAP.   On the bastion servers, the shell for each user is a program that prints a message to stdout explaining why the bastion cannot be logged into, and exits nonzero.
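A replacement shell of this kind can be tiny. The following Python sketch shows the general shape; the wording of the notice and the function name are made up for illustration and are not the program we actually deploy.

```python
import sys

NOTICE = (
    "This bastion only forwards connections; interactive logins are disabled.\n"
    "Connect to your target instance through the bastion instead."
)


def refuse_login(stream=None):
    """Print the notice and return the nonzero exit status to use."""
    if stream is None:
        stream = sys.stdout
    print(NOTICE, file=stream)
    return 1
```

Installed as a login shell, such a script would finish by calling sys.exit(refuse_login()), so the session ends immediately with a nonzero status.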

Second, we enabled 2FA via Duo Security. Duo is an authentication provider that sends push notifications to our team members’ phones and requires confirmation before a login can proceed.  Setting it up involved installing their client package and making a few configuration file changes.  

First, we had to update PAM to use their authentication module:

Then, we updated our /etc/ssh/sshd_config file to allow keyboard-interactive authentication (so that users could respond to the 2FA prompts):

On the client side, access to protected instances is managed through a custom SSH configuration file distributed through Git.  An example stanza that configures proxying through a bastion cluster looks like this:

Mutual TLS

To avoid an impersonation attack, we needed to ensure our servers connected to and received information only from the LDAP servers we established ourselves.  Otherwise, an attacker could provide their own authentication credentials and use them to gain access to our systems.

Our LDAP servers are centrally located and trusted in all regions.  Since Segment is 100% cloud-based, and cloud IP addresses are subject to change, we didn’t feel comfortable solely using network ACLs to protect us.  This is also known as the zero-trust network problem: How do you ensure application security in a potentially hostile network environment?

The most popular answer is to set up mutual TLS authentication, or mTLS.  mTLS validates the client from the server’s point of view, and the server from the client’s point of view.  If either validity check fails, the connection fails to establish.

We created a root certificate, then used it to sign the client and server certificates.  Both kinds of certificates are securely stored in EC2 Parameter Store and encrypted both at rest and in transit, and are installed at instance-start time on our clients and servers. In the future, we may use AWS Certificate Manager Private Certificate Authority to generate certificates for newly-launched instances.
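For readers who haven't set up mTLS before, here is a minimal sketch of the two-sided check using Python's standard ssl module. It is only an illustration of the concept, not how nsscache or Foxpass are configured; the function name and all file paths are hypothetical.

```python
import ssl


def make_mtls_context(server_side, ca_file=None, cert_file=None, key_file=None):
    """Build a TLS context that requires a valid certificate from the peer.

    ca_file: the root certificate that signed both client and server certs.
    cert_file / key_file: this side's own certificate and private key.
    All paths are hypothetical; pass real ones to get a usable context.
    """
    protocol = ssl.PROTOCOL_TLS_SERVER if server_side else ssl.PROTOCOL_TLS_CLIENT
    ctx = ssl.SSLContext(protocol)
    # Require a certificate from the other side, in both directions.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        # Trust only certificates signed by our own root.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

With both sides built this way, a handshake fails unless each peer presents a certificate signed by the shared root, which is the property that protects against impersonation.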


That's it! Let's recap:

  • By leaning on Foxpass, we were able to integrate with our existing SSO provider, Okta, and avoided adding complicated new infrastructure to our stack.

  • We leveraged nsscache to prevent any upstream issues from locking us out of our hosts.  We use mutual TLS between nsscache and the Foxpass LDAP service to communicate securely in a zero-trust network. 

  • Just in case, we sync a small number of emergency SSH keys from S3 to each host at boot.

  • We don't allow shell access to our bastion hosts. They exist purely for establishing secure tunnels into our AWS environments.

In sharing this, we're hoping others find an easier path towards realizing secure, personalized logins for their compute instance fleets. If you have any thoughts or questions, feel free to tweet @Segment or email me at michael@dynamine.net.

Jeroen Ransijn on October 16th 2018

Design systems are emerging as a vital tool for product design at scale. These systems are collections of components, styles, and processes to help teams design and build consistent user experiences. It seems like everyone is building one, but there is no playbook on how to take it from the first button to a production-ready system adopted across an organization. Much of the advice and examples out there are for teams that seem to have already figured it out.

Today I want to share my experience in bootstrapping a design system and driving adoption within our organization, Segment. I will share how we got started by creating something small and useful  first. Then I will share how I hijacked a project to build out that small thing into our full blown design system known as Evergreen. Finally, I will share how we continue to drive and track adoption of our design system.

What is a Design System?

A design system is a collection of components, styles and processes to help teams design and build consistent user experiences — faster and better. Design systems often contain components such as buttons, popovers and checkboxes, and foundational styles such as typography and colors. Teams that use the design system can focus on what’s unique to their product instead of reinventing common UI components.

What’s in our Design System

Before I share my experience bootstrapping our design system, Evergreen, I want to set some context and explain what is in it.

  • Design Resources

    • Sketch UI Kit

    • Design Guidelines

  • Code Resources

    • React UI Framework

    • Developer Documentation

  • Operational Resources

    • Roadmap documents

    • On-boarding process

Our design system didn’t start out with all of those resources. In fact, I built something small and useful first. In the next sections I will share the lessons I learned in bootstrapping a design system and driving adoption within our organization.

How We Got Started

About 2 years ago, I joined Segment as a product designer. I worked as a front-end developer in the past and I wanted to use my skillset to create interactive prototypes. To give you a bit of context, the Segment application allows our customers to collect data from their website or app, synthesize that data, and send it to over 200 integrations for marketing and analytics.

The prototypes I wanted to develop would live outside of our Segment application and would have no access to the application codebase. This means that I didn’t have access to the components already in the application — I had to create everything from scratch.

Most advice online talks about starting with a UI audit or trying to get executive buy-in. Those are all part of the long journey of creating a successful design system, but there are many ways to get started. If you set out to solve all of the problems in your product, you might be taking on too much at once. Instead, build something small and useful, provide value quickly, and iterate on what works.

Build something small and useful

One of the first challenges you run into when creating a component library is how to deal with styling and CSS. There are a few different ways to deal with this:

  • Traditional CSS: Verbose to write, hard to maintain at scale. Often relies on conventions.

  • CSS Preprocessor such as Sass or Less: Easier way to write CSS, chance of naming collisions. Often relies on conventions.

  • CSS-in-JS solutions: Write CSS in JavaScript. Powerful ways to abstract into components.

I wanted a solution that didn’t require any extra build steps or extra imports when using the component library. CSS-in-JS made this very easy. You can import a component in your code and it works out of the box.

I wanted to avoid having to create a ton of utility class names to override simple CSS properties on components such as dimensions, position and spacing. It turns out there is a way to achieve this in an elegant way — enter the React UI primitive.

Choosing React

There are many choices of frameworks for your component library. When I started building a component library, we were already using React, so it was the obvious choice.

React UI Primitive

After doing research, I found the concept of UI primitives. Instead of dealing with CSS directly, you deal with the properties on a React component. I bounced ideas off my coworkers and got excited about what this would mean. In the end we built UI-BOX.


UI-BOX exports a single Box component that allows you to use React props for CSS properties. Instead of creating a class name, you pass the property to the Box component directly:

Why is this Box component useful?

The Box component is useful because it helps with 3 common use cases:

  • Create layouts without helper classes.

  • Define components without worrying about CSS.

  • Override single properties when using components.

Create layouts without helper classes.

Define components without worrying about CSS

Override single properties when using components

Flexibility and composability

The Box component makes it easy to start writing new components that allow setting margin properties directly on the component. For example, you can quickly space out two buttons by adding marginRight={10} to the left button. You can also override CSS properties without adding new distinct properties to the component. This is useful when a full-width button is needed, or when you want to remove the border-radius on one side of a button. Furthermore, layouts can be created instantly by using the Box component directly.

Still a place for CSS

It is important to note that UI-BOX only solves some of the problems. A class is still needed to control the appearance of a component. For example, a button can add dimensions and spacing with UI-BOX, but a class defines the appearance: background color, box shadows, and color, as well as the hover, active and focus states. In Evergreen, a CSS-in-JS library called Glamor is used to create appearance classes.

Why it drove adoption of Evergreen

A design system can start with something small and useful. In our case it was a UI primitive that abstracted away dealing with CSS directly. Roland, one of our lead engineers, said the following about UI-BOX:

UI-BOX really drove adoption of Evergreen…

…there is no need to consider every configuration when defining a new component. And no need to wrap components in divs for spacing.

— Roland Warmerdam, Lead Software Engineer, Segment

The lesson learned here is that it’s possible to start with something small and slowly grow it into a full-fledged design system. Don’t think you have time for that? Read the next section for some ideas.

How we started driving adoption

Up until now, I had built a tool for myself in my spare time, but it was still very much a side project. Smaller startups often can’t prioritize a design system as it doesn’t always directly align with business value. I will share how I hijacked a project, scaled out the system, and finally focused on accessibility to drive adoption across teams at Segment — and how you can do the same.

Hijack a project

About a year ago I switched teams within Segment. I joined a small team called Personas, which was almost like a small startup within Segment. With Personas we were building user profiles and audience capabilities on top of the Segment platform. It turned out to be a perfect opportunity to build out more of the design system.

Deadline in sight — our first user conference

The company wanted to announce the Personas product at our first ever user conference, with only 3 months of lead time to prepare. The idea was that our CEO and Head of Product would demo it on stage. However, there was no way we could finish a fully-baked consumer-facing product in time. We were pivoting too often based on customer feedback.

Seize the opportunity

It seemed like an impossible deadline. Then it hit me: We could build a standalone prototype to power the on-stage demo. This prototype would be powered by fake data and support only the functionality that was part of the demo.

This prototype would live outside of the confines of our application. This would allow us to build things quickly, but the downside is that there is no access to the code and components that live in our application codebase. Every component we want to use in the prototype needs to be built — a perfect opportunity to build out more of the design system. We decided it would be the lowest risk, highest reward option for us to pursue.

While we worked on the demo script for the on-stage demo, I was crunching away on the prototype and Evergreen. Having the prototype available and easily shareable made it easier for the team to practice and fine-tune the script. It was a great time at Segment; I could see our team and company growing closer while readying for launch.

Huge Success

The interactive prototype was a huge success. It helped us show the vision of what our product and Personas could be. It drove considerable interest in our newest product, Personas. I was happy, because not only did we have an interactive prototype, we also had the first parts of our design system.

Focus on the developer experience

So far, we built something small and useful and hijacked a project that allowed us to build out a big chunk of our design system, Evergreen. The prototype also proved to be a great way to drive adoption of Evergreen in our application. Our developers simply took code from the prototype and ported it over to our application. 

At that point, Evergreen components were adopted in over 200 source code files. Our team was happy about the components, but there were some pain points with the way Evergreen was structured. When we started building Evergreen, we copied some of the architecture decisions of bigger design systems. That turned out to be a mistake. It slowed us down.

Too early for a mono-repo

When I started building Evergreen I took a lot of inspiration from Atlassian’s AtlasKit. It is one of the most mature and comprehensive enterprise design systems out there. We used the same mono-repo architecture for Evergreen, but it turns out there is quite a lot of overhead when using a mono-repo.

Our developers were not happy with the large number of different imports in each file. There were over 20 different package dependencies. Maintaining these dependencies was painful. Besides unhappy developers, it was time-consuming to add new components.

A single dependency is better for us (for now)

I wanted to remove as much friction for our developers using Evergreen as possible, which is why I wanted to migrate away from the mono-repo. Instead, a single package would export all of our components as a single dependency.

Migrate our codebase in a single command

When we decided to migrate to a single package, it required updating the imports in all the places Evergreen was used in our application. At this point Evergreen was used in over 200 source code files in the Segment application. It seemed like a pretty daunting challenge, not something anyone got excited about doing manually. We started exploring our options and ways to automate the process, and to our surprise it was easier than we thought.

Babel parser to the rescue

We created a command line tool for our application that could migrate the hundreds of source files using Evergreen with one command. The syntax was transformed using a tool called babel-parser. The result is a much better experience for our developers in the application. In the end, our developers were happy.

Lesson Learned, Face the challenge

A big change like this can feel intimidating and give you second thoughts. I wish I had started Evergreen with the architecture it has right now, but sometimes the right choice isn’t clear up front. The most important thing is to learn and move forward. 

Driving adoption of a design system is very challenging. It is hard to understand progress. We came up with a quite nifty way to visualize the adoption in our application — and in turn make data-driven decisions about the future of Evergreen.

How to get to 100% adoption

Within our company, product teams operate on key metrics to get resources and show they are being successful to the rest of the company. One of the key metrics for Evergreen is 100% adoption in our application. What does 100% even mean? And how can we report on this progress?

What does 100% adoption even mean?

100% adoption at Segment means building any new products with Evergreen and deprecating our legacy UI components in favor of Evergreen components. The first part is the easiest as most teams are already using Evergreen to build new products. The second part is harder. How do we migrate all of our legacy UI components to Evergreen components?

What legacy UI components are in our app?

Active codebases accrue a large number of components over time. In our case, these take the form of two legacy component libraries that live in the application codebase:

  • React UI Library, precursor to Evergreen.

  • Legacy UI folder, literally a folder called ui in our codebase that holds some very old components.

Evergreen versions

In addition to the legacy libraries, the application is able to leverage multiple versions of Evergreen. This allows gradual migration from one version to another.

  • Evergreen v4, the latest and greatest version of Evergreen. We want 100% of this.

  • Evergreen v3, previous version of Evergreen. We are actively working on migrating this over to v4

How can we report on the progress of adoption?

The solution we came up with to report on the adoption of Evergreen is an adoption dashboard. At any single point in time the dashboard shows the following metrics:

  • Global Adoption, the current global state of adoption

  • Adoption Week Over Week, the usage of Evergreen (and other libraries) week over week

  • Component Usage, a treemap of each component sorted by framework. Each square is sized by how many times the component is imported in our codebase.

The Component Usage Treemap

Besides the aggregates, we know exactly which files import a component. To visualize this, a treemap chart on the dashboard shows each component, with the size of the square representing how many times it is imported in our application.

Understand exactly where you are using a component

Clicking on one of the squares in the treemap shows a side sheet with a list of all the files which import that component. This information allows us to confidently deprecate components.

Filter down a list of low hanging fruit to deprecate

The adoption dashboard also helps to prioritize the adoption roadmap. For example, legacy components that are only imported once or twice are easy to deprecate.

How it works

Earlier I shared how we used babel-parser to migrate to the new import structure.  Being true to our roots, we realized the same technique could be used to collect analytics for our design system! To get to the final adoption dashboard there are a few steps involved.

Step 1. Create a report by analyzing the codebase

We wrote a command line utility that returns a report by analyzing the import statements at the top of each file in our codebase. An index is built that maps these files to their dependencies. Then the index can be queried by package and optionally the export.  Here is an example:



We open-sourced this tool. If you are interested in learning more or want to build your own adoption dashboard, see https://github.com/segmentio/dependency-report
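To give a feel for the kind of index described in this step, here is a simplified Python sketch. It is not the dependency-report tool itself: it uses a regular expression where the real tool uses a proper parser, and the function names and the shape of the index are assumptions made for the example.

```python
import re
from collections import defaultdict

# Matches statements such as: import Default, { A, B as C } from 'package'
IMPORT_RE = re.compile(
    r"\bimport\s+(?P<names>[^;]*?)\s+from\s+['\"](?P<package>[^'\"]+)['\"]"
)


def parse_imports(source):
    """Yield (package, export_name) pairs for each import in a source string."""
    for match in IMPORT_RE.finditer(source):
        package = match.group("package")
        names = match.group("names")
        braces = re.search(r"\{([^}]*)\}", names)
        if braces:
            for item in braces.group(1).split(","):
                item = item.strip()
                if item:
                    # `Button as Btn` imports the export named `Button`.
                    yield package, item.split(" as ")[0].strip()
        default = re.sub(r"\{[^}]*\}", "", names).strip(" ,")
        if default and not default.startswith("*"):
            yield package, "default"


def build_index(files):
    """Map package -> export -> set of files. `files` is {path: source_text}."""
    index = defaultdict(lambda: defaultdict(set))
    for path, source in files.items():
        for package, export in parse_imports(source):
            index[package][export].add(path)
    return index


def usage(index, package, export=None):
    """Return files importing `package`, optionally narrowed to one export."""
    exports = index.get(package, {})
    if export is not None:
        return sorted(exports.get(export, set()))
    return sorted(set().union(*exports.values()))
```

Querying by package answers "how widely is this library used?", while narrowing to a single export answers "which files would break if we removed this component?".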

Step 2. Create and save a report on every app deploy

  • Every time we deploy our application, the codebase is analyzed and a JSON report is generated using the dependency-report tool.

  • Once the report is generated, it is persisted to object storage (S3).

  • After persisting the report, a webhook triggers the rebuild of our dashboard via the Gatsby static site generator.

Step 3. Build the dashboard and load the data

To reduce the number of reports on the dashboard, the generator only retrieves the most current report as well as a sample report from each previous week. The latest report is used to show the current state. The reports of the previous weeks are used to calculate an aggregate for the week over week adoption chart.
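The selection logic can be sketched in Python as follows. The text above doesn't say which report is sampled from each week, so this example assumes the last report of each ISO week; the function name and the use of plain dates to stand in for reports are also assumptions.

```python
from datetime import date


def select_reports(report_dates):
    """Pick the latest report plus one sample per earlier ISO week.

    Returns (latest, weekly_samples), with weekly_samples ordered oldest first.
    """
    if not report_dates:
        return None, []
    ordered = sorted(report_dates)
    latest = ordered[-1]
    current_week = latest.isocalendar()[:2]  # (ISO year, ISO week number)
    samples = {}
    for day in ordered[:-1]:
        week = day.isocalendar()[:2]
        if week != current_week:
            # Later dates overwrite earlier ones, so each week keeps its last report.
            samples[week] = day
    return latest, [samples[week] for week in sorted(samples)]
```

For example, given reports on October 1, 3, 9, 15 and 16 of 2018, this keeps October 16 as the current state and October 3 and 9 as the samples for the two previous weeks.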

How the adoption dashboard is pushing Evergreen forward

The adoption dashboard was the final piece in making Evergreen a success as it helped us migrate over old parts of our app systematically and with full confidence. It was easy to identify usage of legacy components in the codebase and know when it was safe to deprecate them. Our developers were also excited to see a visual representation of the progress. These days it helps us make data-driven decisions about the future of Evergreen and prioritize our roadmap. And honestly, it is pretty cool.


To those of you who are considering setting out on this journey, I’ll leave you with a few closing thoughts:

  • Start small. It’s important to show the value of a potential design system by solving a small problem first.

  • Find a real place to start. A design system doesn’t have value by itself. It only works when applied to a real problem.

  • Drive adoption and measure your progress. The real work starts once the adoption begins. Don’t forget that the real value is in adoption. Design systems are only valuable once they are fully integrated into the team’s workflow.

This is only the start of our journey. There are still many challenges ahead. Remember, building a design system is not about reaching a single point in time. It’s an ongoing process of learning, building, evangelizing and driving adoption in your organization.

Alan Braithwaite on September 27th 2018

“How should we test this?”

“Let’s just run it in production and monitor it closely.”

— You and your coworker, probably.

While often mocked, testing in production is the most definitive way to ensure that your system is operating as expected.  Whether you’re A/B testing with end-users or testing in-house, testing in production gives you real-time feedback as you roll out code changes that enable you to quickly fix bugs and improve functionality & the user experience.

Segment has been on a journey for the last 18 months to include end-to-end testing in production as part of our broader testing strategy, so we wanted to share our methodology and some of the work we’ve been doing in this area. But first, let's establish what exactly we mean by "testing in production."

What is testing in production? 

Testing in production (TIP) is a process in software development in which changes to the code base of a product or website are tested live as they’re completed, whether internally by developers or with real users. Popular with software development methodologies like Agile and Scrum that involve continuous improvement, testing in production creates a tight feedback loop between iteration of new features and improvement.

How Segment tests in production

For those unfamiliar, Segment is a Customer Data Infrastructure which helps our customers route data about their users from various collection points (web, mobile, server-side) to hundreds of Destinations (partners which receive data from Segment) and data warehouses.

The numerous components which compose Segment’s backend create a challenging environment for testing in general, but especially in production. To manage this complexity, we’ve decided to focus on two areas.

First, we’ve been building towards a staging environment that faithfully represents our production environment. Second, since we cannot cost-effectively operate a staging environment at the same scale as our production environment, we’ve been developing end-to-end tests for production.

Much has been written about other types of testing, so we’re going to focus on end-to-end testing in this post.

End-to-end tests are tests which run against the entire infrastructure. They are distinct from integration tests because they’re run on real infrastructure whereas integration tests are not. These tests are also distinct from unit tests, which only test a very small amount of code or even just one method.  End-to-end tests should also exercise the exact same code paths used by a customer sending data to Segment’s API.

So what does an end-to-end test look like for Segment?

  1. Send an event to the Segment Tracking API

  2. Process that event through our many streaming services (e.g. validation and deduplication)

  3. Send the event into Centrifuge, which handles reliable delivery of events to Destinations in the presence of network timeouts or other failures outside of our control

  4. Verify that the event is received by a Webhook destination

  5. Emit latency and delivery metrics to alert on using segmentio/stats.

To implement this kind of test, we required an end-to-end testing framework that would make it easy for developers to build new tests.

When we started looking at solutions, we played around with some other end-to-end frameworks with varying degrees of success. They often incorporated ideas about contracts and assertions which were tightly coupled to the framework. This not only made it difficult to add new types of tests, but it also made them difficult to debug.

Before we had end-to-end tests, our staging environment wasn’t effective at preventing bugs from getting to production.  Software is updated more frequently in staging, often being a week or so ahead of the production version.  Additionally, configuration of the staging environment was haphazard and occasionally broke due to changes in the software. These breaks were often silent because we had not been monitoring them in staging.

Today we’re open sourcing Orbital, a framework which meets the requirements presented above and helped us reach our testing goals.

Orbital provides the means to define, register and run tests as part of a perpetually-running end-to-end test service. Additionally, it provides metrics (using segmentio/stats) around test latency and failure rates which we can monitor and alert on.

Designing in-production tests

Orbital is a lightweight test framework for running systems tests defined in Go.  Orbital is inspired by Go’s own testing library, specifically the testing.T abstraction.  testing.T is a struct that gets injected into each test which defines a set of methods to determine whether or not that test passed.  We like Go’s testing package for two reasons.

First, the package takes a user-first approach in its design.  The API couldn’t be simpler!  This greatly reduces friction when writing tests, increasing the likelihood that they’ll get written and maintained properly.

Second, modeling orbital.O after testing.T gives us the flexibility we need to define our arbitrarily complex tests.  After trying to enumerate all the different things we’d like to support, we found that there are just too many behavioral edge cases that need actual code to describe properly.  For example, say you want to check that events were received by a webhook and also that some counters were updated.  This was difficult to articulate with an assertion-based framework like the one we were using before. With Orbital, we’re now only limited by what the Go language supports, which is an improvement over the “mutation→assertion” style tests we encountered before.

The following example exercises the above illustrated case: sending an event to our Tracking API produces a webhook to a configured endpoint. In this case, we’ve configured the webhook to be our own end-to-end service’s API for test verification.

As you can see, the code is very straightforward.  Each test runs in its own goroutine and blocks until it’s completed or the context is cancelled.  Modeling tests in this way allows us to check arbitrary side effects and allows for any kind of behavioral testing your imagination can come up with.

Orbital provides a Service struct which registers the tests and manages the process lifecycle. This struct allows you to set global timeouts for all tests as well as configure logging and metrics. During test registration, you set the period (how often the test is run), name, function and optional timeout override.
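To make the registration model concrete, here is a minimal sketch of a perpetually-running test service written against the Go standard library only. It is not Orbital’s API: the names Service, TestCase, TestFunc, Register, Run and timeoutFor are hypothetical and chosen purely for illustration. It assumes each test has a name, a period, a function and an optional timeout override, as described above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// TestFunc is a hypothetical test signature: return nil to pass.
type TestFunc func(ctx context.Context) error

// TestCase holds the registration fields described in the text.
type TestCase struct {
	Name    string
	Period  time.Duration // how often the test runs
	Timeout time.Duration // optional override; zero means use the default
	Func    TestFunc
}

// Service is an illustrative stand-in for a long-running test service.
type Service struct {
	DefaultTimeout time.Duration
	tests          []TestCase
}

// Register adds a test to the service.
func (s *Service) Register(tc TestCase) { s.tests = append(s.tests, tc) }

// timeoutFor returns the per-test override if set, else the global default.
func (s *Service) timeoutFor(tc TestCase) time.Duration {
	if tc.Timeout > 0 {
		return tc.Timeout
	}
	return s.DefaultTimeout
}

// runOnce executes one test run, bounded by the effective timeout.
func (s *Service) runOnce(ctx context.Context, tc TestCase) error {
	ctx, cancel := context.WithTimeout(ctx, s.timeoutFor(tc))
	defer cancel()
	done := make(chan error, 1) // buffered so a late test never blocks
	go func() { done <- tc.Func(ctx) }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Run schedules every registered test on its own ticker until ctx is
// cancelled. report is where latency and failure metrics would be emitted.
func (s *Service) Run(ctx context.Context, report func(name string, err error, took time.Duration)) {
	var wg sync.WaitGroup
	for _, tc := range s.tests {
		wg.Add(1)
		go func(tc TestCase) {
			defer wg.Done()
			ticker := time.NewTicker(tc.Period)
			defer ticker.Stop()
			for {
				select {
				case <-ctx.Done():
					return
				case <-ticker.C:
					start := time.Now()
					err := s.runOnce(ctx, tc)
					report(tc.Name, err, time.Since(start))
				}
			}
		}(tc)
	}
	wg.Wait()
}

func main() {
	svc := &Service{DefaultTimeout: time.Second}
	svc.Register(TestCase{Name: "always-passes", Period: 20 * time.Millisecond,
		Func: func(ctx context.Context) error { return nil }})
	svc.Register(TestCase{Name: "always-fails", Period: 30 * time.Millisecond,
		Func: func(ctx context.Context) error { return errors.New("webhook never arrived") }})

	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	svc.Run(ctx, func(name string, err error, took time.Duration) {
		fmt.Printf("test=%s err=%v latency=%s\n", name, err, took)
	})
}
```

Running each test on its own ticker keeps a slow test from delaying the others, and funnelling every result through a single report callback gives one obvious place to emit metrics.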

One key factor in the design of this framework is the embedded webhook package. This special webhook operates like a normal HTTP server which logs requests to an interface.  One implementation of this interface (RouteLogger) is configured such that after sending an event, you can block your goroutine waiting until that event is received by the webhook or a timeout occurs.

With this primitive, we can send requests to the API, then wait for them to be sent back after being processed in our pipeline. In the above example, we’re doing this on the line h.Waiter.Wait(ctx, evt.ID). To see a full example of both a tester and tested service, check out the examples directory on GitHub.
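The blocking primitive described above can be sketched in a few lines of Go. This is an illustration of the idea only, not Orbital’s RouteLogger: the Waiter type and its Received, Wait and WaitTimeout methods are hypothetical names, and the sketch assumes events are matched purely by ID. A real webhook handler would call Received for each request it logs.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Waiter is an illustrative in-memory rendezvous between a test goroutine
// and a webhook handler. It is not Orbital's implementation.
type Waiter struct {
	mu   sync.Mutex
	seen map[string]chan struct{} // closed once the event ID has been received
}

func NewWaiter() *Waiter { return &Waiter{seen: make(map[string]chan struct{})} }

// entry returns the channel for id, creating it on first use.
// The caller must hold w.mu.
func (w *Waiter) entry(id string) chan struct{} {
	ch, ok := w.seen[id]
	if !ok {
		ch = make(chan struct{})
		w.seen[id] = ch
	}
	return ch
}

// Received is what the webhook handler would call for each incoming event.
func (w *Waiter) Received(id string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := w.entry(id)
	select {
	case <-ch: // already closed: duplicate delivery, nothing to do
	default:
		close(ch)
	}
}

// Wait blocks until id is received or ctx is done, whichever comes first.
func (w *Waiter) Wait(ctx context.Context, id string) error {
	w.mu.Lock()
	ch := w.entry(id)
	w.mu.Unlock()
	select {
	case <-ch:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// WaitTimeout is a convenience wrapper around Wait with a deadline.
func (w *Waiter) WaitTimeout(id string, d time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	return w.Wait(ctx, id)
}

func main() {
	w := NewWaiter()
	go func() {
		time.Sleep(10 * time.Millisecond) // stands in for pipeline latency
		w.Received("evt-1")
	}()
	fmt.Println("evt-1:", w.WaitTimeout("evt-1", time.Second))
	fmt.Println("evt-2:", w.WaitTimeout("evt-2", 50*time.Millisecond))
}
```

Because the channel is created on first use by either side, the sketch works whether the webhook fires before or after the test starts waiting. Entries are never evicted here, and the state lives in one process’s memory, so a load-balanced deployment would need a shared store instead.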

How do we use testing in production?

Our Orbital tests are deployed as a service that runs inside of our staging and production infrastructure.  It sends events to the Segment Tracking API using our various library implementations.  We even fork out to headless Chrome to execute tests in the browser with analytics.js!  This framework generates metrics used for dashboards and alerting. Here you can see a comparison of our staging vs production environments.



From the screenshots above you can see that something was broken in staging by looking at the top center graph.

This library was important in building confidence that our staging environment behaves the same way as our production environment.  We’re now at the point where we can block a release if any of the tests fail in staging.  When a test fails, we know that something actually broke and needs to be investigated. This is the kind of testing strategy you need in your infrastructure to reach the ever-elusive five nines of reliability.

What remains?

Orbital has already proven instrumental in reducing the number of bugs making it to production.  We’ve written numerous tests across multiple teams which exercise various known customer configurations.  However, the framework is not yet bulletproof.

Although you can scale on a single instance to tens of thousands of requests per second, eventually you’ll hit a bottleneck somewhere.  Unfortunately, this framework doesn’t elegantly scale out right now.  Currently, the “RouteLogger/Waiter” records messages sent and received in memory, not in a shared resource or database.  So if you have multiple load-balanced tasks running, the webhook requests are unlikely to reach the task that sent the original event, and the tests will fail or time out.  This is a non-trivial but ultimately solvable problem.

If this discussion of testing in production is interesting to you, reach out to us!  We’d love to hear from you.  You can find us on Twitter @Segment.  Check out our Open Source initiatives here.  We’ve also got many positions in Engineering which involve solving problems similar to this which you can see here.

Noah Zoschke on August 7th 2018

Segment is a hub for a tremendous amount of data. It processes peaks of 230,000 events per second inbound, and 280,000 events per second outbound between more than 200 integration partners. You may think of Segment as a black box for delivering all this data. You send data once to its tracking API, and it coordinates translating data and delivering it to many destinations.

When everything works perfectly, you don’t need to open the black box. Unfortunately, the world of data delivery at scale is far from perfect. Think of all the software, networks, databases and engineers behind Segment and our partners. You can imagine at any given time a database is failing, a network is unreachable, etc.

Segment engineering has gone to great lengths to operate reliably in this environment. Our latest efforts have been around visibility into the HTTP response codes from destinations. We spent the last few months adding hooks to measure everything from the volume of events and how quickly they are sent to destinations, to the HTTP status code and error response body, if any, that occurred for every request.

This instrumentation is ultimately for Segment users to see into the black box to answer one question: how do delivery challenges affect my data? To this end, we built an event delivery dashboard around the data.

It turns out the data in aggregate is also tremendously useful to cloud service engineers at Segment and our destination partners alike. Looking at HTTP status codes alone has unveiled lots of insights on how data flows between services and how we can maximize delivery rates.

I’d like to share some of the things we have found in a day of HTTP responses that we see at Segment.


First, the good news: 92.6% of events (24.4 billion on the sample day) are delivered on the first attempt. In this happy path, Segment makes an HTTP request to a destination and receives an HTTP 2XX success status code response.

Terminal problems

Next, the bad news. 5.5% of events — 1.4 billion on the sample day — never make it to their destination. In this path, Segment makes an HTTP request to a destination and generally receives an HTTP 4XX client error status code response. These codes indicate the client — either Segment or the user it represents — made an error that the server can’t reconcile.

What’s the password?

The most common client errors Segment sees are HTTP 401 Unauthorized and HTTP 403 Forbidden on 3.8% of requests. In this case, the server doesn’t recognize the given username, password or token, and can’t accept any data. Neither Segment nor the destination server can resolve this automatically for a given request.

This is either due to wrong credentials configured in Segment in the first place or credentials that expired on a destination. Segment always attempts to send the latest events just in case the problem was resolved on either side.

No comprende

The next most common client error is HTTP 400 Bad Request on 0.51% of requests. In this case, the server received the request payload but couldn’t make sense of it. These are generally validation errors. Again, Segment and the destination can’t do anything about it automatically, except show instructive error messages to the user.

Next steps…

These errors are considered fatal, but the qualitative data can inform ways to improve delivery over time. The first big step here was building the event delivery dashboard to surface these issues to users.

For authentication errors, a logical next step would be to send notifications when delivery begins to encounter 401 errors. We can also imagine a mechanism to disable event delivery after a threshold to spare partners the request overhead.

For validation errors, visibility into requests per-customer and per-destination can inform improvements to the Segment integration code. Segment can review partner API requirements and not attempt to deliver data it can determine is bad ahead of time, or automatically massage data to conform to the destination API.

Ephemeral problems

Now the interesting challenge… a large class of HTTP problems on the internet are not fatal. In fact, most of the HTTP 5XX server error status codes reflect an unexpected error and imply that the system may accept data at a later time, as does one critical 4XX status code.


The largest class of temporary problems seen by Segment is HTTP 429 Too Many Requests. It’s not hard to imagine why…

Segment itself has very high rate limits with the aim of accepting all of the data a customer throws at it. Not every downstream destination has the same capabilities, particularly those that are systems of record. Intercom, Zendesk, and Mailchimp, for example, all have well-designed but lower API rate limits.

Segment has to mediate between the customer data volume and the destination rate limits. A combination of internal metering, request batching, and retry with backoff get most of the data through.

But about 7.3% of requests — 2.1 billion a day — encounter a 429 response along the way. Retries help a lot, but if a customer is simply over their limits consistently over a long enough time frame, Segment has no choice but to drop some messages. At least we can quantify how much this is happening with the delivery data and report this to a customer.

Out of service

The next largest class of error — 1.3% of requests — is from destination servers. Segment often sees servers respond with an error like:

  • HTTP 502 Bad Gateway

  • HTTP 504 Gateway Timeout

  • HTTP 500 Internal Server Error

  • HTTP 503 Service Unavailable

Perhaps it’s a temporary glitch for a single request, or perhaps the destination service is experiencing an outage. But every day Segment encounters 371 million of these error responses.

Unreliable channel

Finally, 1.1% of requests error out because of the network layer. At scale, Segment sees a noticeable number of network errors, such as:

  • ENOTFOUND — hostname not found

  • ECONNREFUSED — connection refused

  • ECONNRESET — connection reset

  • ECONNABORTED — connection aborted

  • EHOSTUNREACH — host unreachable

  • EAI_AGAIN — DNS lookup timeout

Maybe it’s due to a bad host, a flaky network, or a DNS error.

If at first, you don’t succeed…

As seen above, a significant number of HTTP or network status codes indicate transient problems. When Segment encounters these, it retries delivery over a 4-hour window with exponential backoff. We can see that this retry strategy is successful. We go from 92.6% success on the first attempt to 93.9% success after ten attempts, an extra 163 million events delivered, all thanks to the destination server sending proper HTTP status codes.
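As a rough illustration of retrying over a bounded window with exponential backoff, the sketch below computes a retry schedule. The 4-hour window and the ten attempts come from the text; the 15-second base delay, the doubling factor and the Schedule function itself are assumptions made for this example, not Segment’s actual configuration.

```go
package main

import (
	"fmt"
	"time"
)

// Schedule returns the wait before each retry. Delays start at base and
// double each time; the schedule ends when maxRetries is reached or when
// the next delay would push the cumulative wait past window.
func Schedule(base, window time.Duration, maxRetries int) []time.Duration {
	var delays []time.Duration
	var total time.Duration
	delay := base
	for i := 0; i < maxRetries; i++ {
		if total+delay > window {
			break
		}
		delays = append(delays, delay)
		total += delay
		delay *= 2
	}
	return delays
}

func main() {
	// Assumed values: a 15s base delay and 9 retries (10 attempts in total)
	// inside the 4-hour window mentioned in the text.
	for i, d := range Schedule(15*time.Second, 4*time.Hour, 9) {
		fmt.Printf("retry %d after %s\n", i+1, d)
	}
}
```

Doubling the delay spaces the early retries closely, which suits brief glitches, while the later ones give a struggling destination time to recover. Production systems usually add random jitter to each delay so that many failed deliveries do not retry at the same instant.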

WTF webhooks

Finally, we see some bizarre errors. A very popular destination is webhooks — arbitrary HTTP addresses to POST events. The error codes we see from these destinations imply webhooks might not always follow best practices.

We see every number from 1 to 101 used as an HTTP status code, nearly all of which fall outside the HTTP status code specification. Perhaps this is someone testing Segment delivery rates themselves?

We see HTTP 418 I'm a teapot, which originated in RFC 2324 as an April Fools’ joke.


Unfortunately, all of these strange responses are considered terminal errors by Segment. Sorry, webhooks!


It’s literally impossible to achieve 100% delivery on the first attempt over the internet. Transient network errors, unexpected server errors, and rate limiting all present challenges that add up to significant problems at scale. On top of that, encryption, authentication and data validation add another layer of challenges for perfect machine-to-machine delivery.

Retries are the primary strategy to improve delivery, and a retry strategy can only be as good as the destination service response codes.

As a service provider, returning status codes like 400, 403 or 501 is a powerful signal that Segment has no choice but to drop data. Conversely, returning status codes like 500, 502, and 504 is a strong hint that Segment should try again. And 429 (rate limit exceeded) is an explicit sign that Segment needs to retry later.
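The rules in this post can be condensed into a small decision function. The sketch below is one reading of the article, not Segment’s delivery code: the Action type and the Classify function are hypothetical names. It also assumes that 3XX responses and any code not mentioned in the text are treated as terminal, in line with the handling of the strange webhook responses.

```go
package main

import (
	"fmt"
	"net/http"
)

// Action is what a sender does after seeing a response status.
type Action int

const (
	Delivered Action = iota
	Retry
	Drop
)

func (a Action) String() string {
	return [...]string{"delivered", "retry", "drop"}[a]
}

// Classify maps an HTTP status code to an action, following the article:
// 2XX is success, 429 and most 5XX are transient, everything else is terminal.
func Classify(status int) Action {
	switch {
	case status >= 200 && status <= 299:
		return Delivered
	case status == http.StatusTooManyRequests: // 429: rate limited, retry later
		return Retry
	case status == http.StatusNotImplemented: // 501: listed in the text as terminal
		return Drop
	case status >= 500 && status <= 599: // e.g. 500, 502, 503, 504
		return Retry
	default: // 4XX, out-of-spec codes, and (by assumption) 3XX
		return Drop
	}
}

func main() {
	for _, code := range []int{200, 400, 403, 418, 429, 500, 501, 503, 42} {
		fmt.Printf("%d -> %s\n", code, Classify(code))
	}
}
```

Special cases such as 429 and 501 are listed before the broad ranges, so the switch reads in the same order as the reasoning: explicit signals first, then the general rule.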

If you’re running cloud service APIs or writing webhooks, think carefully about HTTP status codes. User data depends on it!

For more information about cloud service APIs, visit Segment’s Destinations catalog at https://segment.com/catalog#integrations/all
