
Leif Dreizler on June 25th 2019

Segment receives billions of events from thousands of customers that authenticate weekly, mostly using usernames and passwords. It’s common knowledge in the security community that users frequently pick weak passwords and reuse them across sites. Six months ago, we set out to change that by helping customers select stronger passwords and allowing them to protect their Segment account with Multi-Factor Authentication (MFA).

But First, A Backstory

At Segment, we have various sub-teams working diligently to improve our security story. These efforts have provided significant security improvements which are seldom seen by customers. A perfect example of this is our previous blog post, Secure access to 100 AWS accounts, which describes some incredibly impactful work that most customers will never know about.

In an online business, the parts of your security program most customers interact with are your product security features—even though this is just the tip of the security iceberg. 

Some customers will view your compliance certifications, but the foundational areas of your security program will mostly go unnoticed outside of your customers’ vendor security teams. 

To provide the security your customers deserve, you need to have a well-rounded security program. If you’re interested in seeing what that journey looks like from our Chief Information Security Officer, Coleen Coolidge, check out her recent presentation at an OWASP Bay Area meetup on how to build security capabilities at a startup.

Two tenets of our security team are to be part of the overall business’ success, and to partner closely with the rest of engineering. The importance of being part of the overall business’ success is obvious—without Segment there is no Segment Security Team. This translates to making practical security decisions for Segment employees, helping sales and marketing, and always keeping the customer in mind.

Working closely with engineering is important because the cumulative number of design choices, lines of code written, and bugs avoided by engineering is much higher than that of security. Working collaboratively with engineering allows us to learn from each other and help everyone be a better security champion. 

We believe that making good software requires making good security choices, and our security team is here to help the larger Segment team achieve that goal.

Luckily for the security industry, security is becoming an increasingly important part of the software-evaluation process. Security should be something customers have positive experiences with throughout their evaluation of your product. Two areas of opportunity that Segment Security and Engineering recently partnered on are our sign up and authentication workflows. 

Password Strength Guidance

A few years ago, NIST released updated guidelines for passwords. Historically, many applications made users select an 8-character password with letters, numbers, and symbols, which resulted in an overwhelming number of people picking Password1!. Applications also required regular rotation of passwords, which resulted in Password2!. Some of the new guidelines include disallowing commonly used passwords, disallowing passwords from previous breaches, and encouraging users to employ a variety of strong password strategies, as illustrated in this well-known XKCD comic. To accomplish this, we turned to Dropbox’s zxcvbn module and Troy Hunt’s Have I Been Pwned (HIBP) API.

zxcvbn allows us to meet most of NIST’s password guidelines. Commonly used passwords or those that are easy to guess are scored low, and good passwords created by a variety of strategies are scored high. However, zxcvbn does not identify when passwords have been part of a known breach, which is why we rely on Have I Been Pwned (don’t worry, we aren’t sending your password outside of Segment; if you’re interested in how this process works, it is explained on Troy Hunt’s website, linked above).
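
For the breach check, the k-anonymity model is what lets us do this without sending your password, or even its full hash, anywhere. A minimal sketch of that lookup against HIBP’s public range endpoint might look like the following Go function (an illustration, not our production code):

```go
package security

import (
	"bufio"
	"crypto/sha1"
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// pwnedCount returns how many times a password appears in known breaches,
// using the Have I Been Pwned range API. Only the first five characters of
// the password's SHA-1 hash ever leave the process (the k-anonymity model).
func pwnedCount(password string) (int, error) {
	sum := sha1.Sum([]byte(password))
	hash := strings.ToUpper(fmt.Sprintf("%x", sum))
	prefix, suffix := hash[:5], hash[5:]

	resp, err := http.Get("https://api.pwnedpasswords.com/range/" + prefix)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	// Each response line is "SUFFIX:COUNT" for a breached hash that shares
	// our five-character prefix.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		parts := strings.SplitN(strings.TrimSpace(scanner.Text()), ":", 2)
		if len(parts) == 2 && parts[0] == suffix {
			return strconv.Atoi(parts[1])
		}
	}
	return 0, scanner.Err()
}
```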

For a SaaS company like Segment, the sign-up process is extremely important to the user. It sets the stage for a smooth experience. To make sure our design wouldn’t negatively impact signups and would positively impact password strength, we took a few steps to get things right. If you want your security engineering team to be treated like an engineering team, you need to follow the same principles and processes as software engineering teams.

For this feature, that meant getting feedback from our activation engineering, product, and design teams early in the process. 

Through a close partnership with these teams, we were able to take my original mockups and turn them into a customer-facing feature.

In the final design, we block users from selecting any password with a zxcvbn score of 0 or 1, and warn them if they select a password that has previously shown up in a breach. We chose to warn instead of block on breached passwords partially to reduce our reliance on the Have I Been Pwned API, and partially to limit the times a user is blocked during sign up.

Once we had a design and workflow we were all happy with, we released it to all customers completing the change and reset password workflows. These flows are seldom taken compared to sign up, and they’re less risky to change because existing customers are more forgiving of a potentially negative experience.

To verify our changes were having the intended effect, we used our analytics.js module to track changes in password strength. As customers and frequent readers of our blog will know, Segment helps our customers track events and make informed decisions for their customers. While monitoring stats about recently changed passwords, we saw a 30% decrease in breached passwords and an increase in average zxcvbn score from 3 to 3.5, which meant our new UI was having the intended effect.

To make sure our design was pleasing to customers, we added the new password UI to our signup flow and displayed it to half of our users as part of a two-week A/B test. If things went well, we would make it live for everyone, and if they didn’t we’d try a different design. During the first week the signup percentages were the same, and during the second week we actually saw a slightly higher conversion rate on the new password interface. As a result, every customer that signs up now goes through the new user interface and receives guidance on choosing a strong password.

Multi-Factor Authentication (MFA)

Unfortunately, despite our best efforts in helping users pick strong and unique passwords, we know that many will not and those that do would like the added layer of security of a second factor. Several weeks ago we quietly released MFA to all workspaces that do not authenticate via Single Sign-On (MFA is usually handled at the Identity Provider level for SSO users). Everyone now has the option to use Time-based One Time Passwords (TOTP) via an app that can read QR Codes (e.g. Google Authenticator or 1Password), and our U.S. and Canadian customers can use SMS-based codes sent by Authy.

Similar to the development of our new password experience, Security partnered closely with Engineering and Design, and used Segment tracking events to monitor adoption. Once the feature was released, we let customers know it was available using a simple notification.

We also wanted to closely monitor MFA failures. If someone completes the username and password portion of authentication and then repeatedly fails the MFA code portion, it may mean that the account’s password has been compromised by an attacker. If we see this behavior, we want to let our customer know that they should change their password to help prevent the attacker from gaining access to their account. To do this we used Segment’s Personas product. We created a custom audience of users that have failed MFA a certain number of times in a given time period. When a user enters that audience, we send them an email.
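
As a sketch of the event that feeds that audience, the server side is just a Track call with the analytics-go library; the event name and property below are illustrative, not our production schema.

```go
package auth

import (
	analytics "gopkg.in/segmentio/analytics-go.v3"
)

// trackMFAFailure records a failed MFA attempt. A Personas audience built on
// this event ("N failures within X minutes") then triggers the warning email.
func trackMFAFailure(client analytics.Client, userID, workspaceID string) error {
	return client.Enqueue(analytics.Track{
		UserId: userID,
		Event:  "MFA Code Failed", // illustrative event name
		Properties: analytics.NewProperties().
			Set("workspace_id", workspaceID), // illustrative property
	})
}
```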

Using Segment’s own product to deliver a better Segment experience to our customers is a perfect embodiment of the two tenets we talked about earlier. It helps us become more familiar with the product we’re helping defend which makes us more relevant during design reviews and other security engineering efforts. It also identifies new ways that Segment could be used and marketed to our customers which makes the business successful—all while improving our security posture, helping us safeguard the data our customers have entrusted with us. Our product security story may never be complete, but we’re thrilled to have our customers supporting us on this journey of incremental improvement.


A special thanks goes out to Cat and everyone else that helped make these features possible. There are too many to list but you know who you are 🙂

Theodore Chao, Bilal Mahmood on April 1st 2019

This post is a guest submission by our partner, ClearBrain. It shares their experience building a Segment Destination. Thanks to ClearBrain for sharing their story!

—Segment

At ClearBrain, we’re building the first self-serve predictive analytics platform. Growth marketers have one consistent objective—forecasting and testing with their best users for incremental ROI, as fast as possible. At ClearBrain, our proprietary AI enables customers to personalize and test every user journey by their predicted intent, for any conversion event, in minutes.

Delivering on this promise of predicting user intent requires two distinct components working together: a data layer and an intelligence layer. On the data side, you need a standardized collection of user attributes and events, aggregated across disparate digital channels—effectively requiring dozens of API integrations. The intelligence layer, in turn, normalizes that data to run machine learning models and predict any conversion event, automating the statistical analyses needed to recommend which audiences are most likely to perform an event.

The challenges and infrastructures required to build these two components of a predictive analytics platform couldn’t be more different. It’s hard enough to build one product, let alone two at the same time. Thankfully due to Segment opening up their platform to partners, this tradeoff was not an issue for ClearBrain.

Why we built on top of Segment

Segment’s Customer Data Infrastructure enabled us to focus on the intelligence components that truly differentiate ClearBrain. Rather than spending years building API integrations into every customer data input, we instead invested that time into automating the statistical inference necessary to power a predictive analytics platform.

Segment was a natural partner to power the data layer for our platform. Breaking it down, there are 3 critical features of a data layer necessary for predictive analytics: omni-channel data, standardized data, and historical data.

Predictive analytics is built on the foundation of predicting user intent via lookalike modeling. You predict your user’s intent to sign up by analyzing the users who signed up in the past (vs those who didn’t). Thus to build an accurate model of user intent you need a complete digital picture of their user journey. The problem, of course, is data heterogeneity. Apps may be built on Go or Java, running on Android or iOS, or integrated with email clients like Braze or Iterable. Further, companies in different verticals organize their user journeys in completely different ways, from user-based checkout to account-based subscription funnels.

Segment resolves a lot of this data heterogeneity. By building an integration to Segment via their platform, ClearBrain was able to build for one API and automatically collect data from any client, server-side, platform, or email integration. Rather than spending years building an integration for every code library and email API, we got instant access to the dozens of data sources available in Segment’s sources catalog. And all of that data is cleanly organized. Regardless of whether it is a server-side attribute or an email event, all data is received via a universal spec of four main event collections: identifies, tracks, pages, and screens. Further, there are vertical-specific specs for eCommerce and B2B SaaS that map out the user journey via standardized sets of event names specific to each vertical.  Regardless of data source, we can always be guaranteed that the data is received in a predictable format.

Clean data is just as vital as homogeneous data when powering a predictive analytics platform. There’s the classic statement, “garbage in, garbage out”. If the data you’re receiving is anomalous, stale, or redundant, the predictive models powering your insights will be too. Thankfully, a benefit of building on top of Segment is that they provide tools for data governance and quality. Their Protocols product guarantees that data received into ClearBrain will be standardized, live, and accurate. I can’t tell you the number of times we’ve seen data come in from other data sources where there are 4 different events for sign up (e.g. sign_up, signup, sign-up, and Sign Up).

Lastly, a critical component for any analytics product is time to value. Any data visualization requires multiple weeks of data to appreciate engagement or retention trends (remember, it takes a minimum of two points to make a line!). This problem is only compounded if your analytics relies on predictive modeling. Predictive modeling is based on analyzing past behavior to infer future behavior, so it follows that the more historical user data you have, the better you can project w/w changes, seasonality, and other critical trends. Segment’s historical Replay plays a critical role for ClearBrain here—rather than waiting weeks to collect enough historical data to power their predictive insights, they can replay years of user events in a matter of days.

These three facets (omni-channel, standardized, and historical data) made integrating with Segment a no-brainer. Rather than spending years on the pipes, we could focus on the statistical inference that makes ClearBrain truly unique. You effectively get all the benefits of customer data infrastructure, with none of the work!

How we built our Segment integration

Building an integration on the Segment platform was really straightforward, thanks to their new Developer Center, clean documentation, and a developer-friendly support team. There are a few main steps to integrate with Segment as a Destination:

Step 1a: Set up a static HTTPS endpoint to receive customer data

The way that Segment is able to send customer data to your destination is by passing a batch of JSON messages as part of a POST request payload to an HTTPS endpoint that you provide. Their specs detail the requirements for your endpoint, but in a nutshell, your endpoint must satisfy the following requirements:

  • Accept POST requests

  • Accept data in JSON format

  • Use HTTPS
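
A minimal version of such an endpoint in Go might look like the sketch below; the route and port are ours, and in our setup the ALB terminates TLS and forwards plain HTTP to the container.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/segment", func(w http.ResponseWriter, r *http.Request) {
		// Requirement 1: accept POST requests only.
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}

		// Requirement 2: the body is JSON. Decode it loosely here and hand
		// it off to the ingestion logic described in Step 1c.
		var payload json.RawMessage
		if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
			http.Error(w, "invalid JSON", http.StatusBadRequest)
			return
		}

		w.WriteHeader(http.StatusOK)
	})

	// Requirement 3 (HTTPS) is satisfied upstream by the load balancer.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```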

Our particular endpoint is set up using a CNAME in our DNS provider that points to the endpoint for an Application Load Balancer (ALB) in AWS. In the following section, we will talk about the reference architecture provided by AWS that handles setting up an ALB.

Step 1b: Set up your API server (using AWS!)

The bulk of the work centers around building infrastructure that can support the amount of data your customers will be sending you at any point in time (keep in mind that historical Replay is an option that customers may be able to leverage, which can cause a one-time rate of requests higher than your average rate, but we’ll get to that later). Building a scalable API server is not the simplest of tasks, but there are solid templates made publicly available that you can reference. At ClearBrain, we decided to build our API server in AWS, which was made easier using a boilerplate provided by AWS.  

This particular reference architecture provided by AWS uses a set of Cloudformation templates to define a set of related resources in AWS (e.g. VPCs, ECS Clusters, Subnets, ECS Services) that will represent the entire API server. We won’t go into specifics of how we adapted the templates for developing the ClearBrain destination, but here are a few changes we made on top of the reference architecture to productionize our API server:

  • Replace product-service and website-service with our own API service, which was loosely based on the product-service .yaml file

  • Define an auto-scaling policy for the ECS cluster

    • The AWS template provides auto-scaling at the ECS service level, but does NOT provide auto-scaling at the cluster level, which means that the cluster would not auto-scale in response to the auto-scaling of the contained services

  • Define a Makefile

    • We wanted to be able to run commands that would handle various tasks, such as:

      • running the API server locally for development

      • sending batches of messages at various rates for stress testing

      • validating cloudformation templates prior to deploying

      • deploying cloudformation templates

To find a sweet spot for minimal cluster size, as well as verify that our cluster would be able to scale up to handle larger loads (especially during Replay), we performed a series of load tests. This involved sending synthetic loads to a test cluster at varying levels of traffic, both with autoscaling off and on, carefully observing resulting performance (latency, errors), and verifying autoscaling mechanics and speed.

For the load test, we used an open source benchmarking tool, wrk. This tool allows specification of parallelism level and even the exact query to send (which allows sending realistic queries). It then measures performance while sending requests as fast as possible–and as wrk is written in C++ it’s able to do so with very minimal CPU usage.  However, the nodes sending load are still limited by their network throughput, so we made sure to use high network throughput instances. On AWS, we used c5n class instances to send the load, but even then, we had to run multiple instances in parallel to ensure that enough load was sent to the cluster so that it was pushed to its limit.

Finally, in evaluating the cluster size, consider that it will take some time for a high CPU condition to be measured, plus some additional time to provision instances, install the image, start the server, and add them to the load balancer.  Leave a comfortable margin of time to be safe. In our case, we verified (with auto-scaling off) that the cluster could comfortably handle well over 90% CPU utilization without affecting correctness or latency. Then, when setting up auto-scaling we set a target tracking policy with CPU utilization target of 50%.  We also set a minimum cluster size so that even in low-traffic times if we received a surge of traffic, the cluster could handle it comfortably without needing to wait to scale up.

Step 1c: Build your ingestion logic

Once your API server is set up to receive requests, the rest of the work is (mostly) business logic. Ultimately, you just need to ensure that incoming requests get handled appropriately.  Possible tasks include:

  • validating requests are coming from expected customers’ Segment sources

  • stripping / augmenting requests

  • writing data to persistent storage + partitioning

It is strongly encouraged that you build validation into your API service, to ensure that you are only processing data you expect. The simplest way to perform validation is to check the Authorization header that is sent with each request from Segment. As mentioned in their documentation, each request will have an Authorization header that contains a Base64 encoded string containing a secure token that you will be passing to Segment as part of the OAuth flow (see Step 2). By decoding the header, you can verify whether you have already issued this token to a customer, and additionally map the token to a customer, if you so choose.

The next recommendation is to write specific logic to handle any of the various types of calls that Segment supports (page, track, identify, group). You can do this by checking the type property on each JSON message and routing your logic accordingly.  

The last recommendation is to respond to the request with an appropriate status code.  See documentation for details on what each unique status code means to Segment.
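
Putting those three recommendations together, the core handler logic looks roughly like the sketch below, assuming the token arrives as the username portion of HTTP Basic credentials; the token check is a hypothetical stand-in for your own lookup, and the per-type handling is elided.

```go
package destination

import (
	"encoding/json"
	"io"
	"net/http"
)

// handleMessage validates the caller, routes on the Segment message type,
// and replies with an appropriate status code.
func handleMessage(w http.ResponseWriter, r *http.Request) {
	// The Authorization header carries the security token issued during the
	// OAuth flow (Step 2).
	token, _, ok := r.BasicAuth()
	if !ok || !isKnownToken(token) {
		w.WriteHeader(http.StatusUnauthorized)
		return
	}

	body, err := io.ReadAll(r.Body)
	if err != nil {
		w.WriteHeader(http.StatusBadRequest)
		return
	}

	var msg struct {
		Type string `json:"type"` // "identify", "track", "page", "group", ...
	}
	if err := json.Unmarshal(body, &msg); err != nil {
		w.WriteHeader(http.StatusBadRequest)
		return
	}

	switch msg.Type {
	case "identify", "track", "page", "group":
		// Strip or augment the message and write it to persistent storage.
	default:
		// Acknowledge and skip message types we don't handle.
	}
	w.WriteHeader(http.StatusOK)
}

// isKnownToken maps a token back to a customer; in practice this would be a
// database lookup. Hypothetical helper for illustration.
func isKnownToken(token string) bool { return token != "" }
```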

Step 2: Build the OAuth flow to allow users to set up the destination

Once your API endpoint has been tested, submitted for review, and approved, the last step is to build an OAuth flow that makes it easy for your customers to set up your integration as a destination in their Segment accounts.

Segment provides a button, which you can embed on your site or application, that handles redirecting your users to Segment and allowing them to select a source to set up for a Destination. Due to some technical complications with how ClearBrain’s app works, we ended up inspecting their source code and boiled it down to simply redirecting the browser to https://app.segment.com/enable with the following query parameters:

  • integration: the unique name of your integration in Segment

  • settings: Base64 encoded string of {"apiKey":"<securityToken>"}

  • redirect_url: Base64 encoded string of the URL for Segment to redirect back to

A sample url would look like:

https://app.segment.com/enable?integration=yourcompany&settings=abcdef12345&redirect_url=foobar

Notice the securityToken that is passed in as part of the base64 encoded settings query parameter. This will be a unique and secure token that you will generate (and save!) on your end and pass to Segment. Segment will then send this security token back with every request to your API endpoint, which you can use to validate the request (as mentioned in Step 1c).
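
Concretely, generating that link is just a bit of Base64 and URL encoding. A sketch in Go (the integration name is a placeholder, and real code should build the settings object with a JSON encoder):

```go
package destination

import (
	"encoding/base64"
	"fmt"
	"net/url"
)

// enableURL builds the link that sends a customer to Segment to connect one
// of their sources to our destination.
func enableURL(securityToken, returnTo string) string {
	settings := base64.StdEncoding.EncodeToString(
		[]byte(fmt.Sprintf(`{"apiKey":"%s"}`, securityToken)),
	)
	redirect := base64.StdEncoding.EncodeToString([]byte(returnTo))

	q := url.Values{}
	q.Set("integration", "yourcompany") // your integration's unique name
	q.Set("settings", settings)
	q.Set("redirect_url", redirect)

	return "https://app.segment.com/enable?" + q.Encode()
}
```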

Final takeaways

Building our integration into Segment (from OAuth to API server to data ingestion) took only a couple of days to implement. That’s days of work compared to the months, if not years, it would take otherwise to build and maintain a whole data infrastructure layer.

In turn, we’ve been able to focus on the statistics and machine learning components necessary for a best-in-class predictive analytics platform–one that from day one can deliver on the promise of predicting user intent on top of omni-channel, standardized, and historical data powered by Segment.

Learn more about how ClearBrain and Segment work together here.

Maxime Santerre on February 28th 2019

Counting things is hard. And it gets even harder when you have millions of independent items. At Segment we see this every single day, counting everything from the number of events to unique users and other high-cardinality metrics.

Calculating total counts proves to be easier since we can distribute that task over many machines, then sum up those counts to get the total. Unfortunately, we can’t do this for cardinality, since the sum of the number of unique items in multiple sets isn’t equal to the number of unique items in the union of those sets.

Let’s go through the really basic way of calculating cardinality:
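
In Go, that might look something like this:

```go
package counting

// naiveCardinality counts unique IDs exactly by keeping every ID it has ever
// seen in memory.
func naiveCardinality(ids []string) int {
	seen := make(map[string]struct{})
	for _, id := range ids {
		seen[id] = struct{}{}
	}
	return len(seen)
}
```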

This works great for small amounts of data, but when you start getting large sets, it becomes very hard to keep all of this in memory. For example, Segment’s biggest workspaces will get over 5,600,000,000 unique anonymous_ids in a month.

Let’s look at an example of a payload:
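
A simplified track call, with illustrative values:

```json
{
  "type": "track",
  "event": "Order Completed",
  "anonymousId": "507f191e-810c-41f9-8dbe-9d2c1f7b8a3e",
  "userId": "user_123",
  "timestamp": "2019-02-14T04:51:06.000Z",
  "properties": {
    "plan": "business"
  }
}
```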

Quick maths:
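
Storing every unique anonymous_id in a set means keeping roughly 24 bytes per id in memory: 5,600,000,000 ids × 24 bytes ≈ 134.4 GB, for a single field, for a single workspace, for a single month.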

Oof.

Another big issue is that you need to know which time range you want to calculate cardinality over. The number of unique items from February 14 to February 20 can be drastically different than February 15 to February 21. We would essentially need to keep the sets at the lowest granularity of time we need, then merge them to get the unique counts. Great if you need to be precise and you have a ton of time; not great if you want to do quick ad hoc queries for analytics.

Using HyperLogLog

Thankfully, we’re not the first ones to have this issue. In 2007, Philippe Flajolet came up with an algorithm called HyperLogLog, which can approximate the cardinality of a set using only a small, constant amount of memory. HyperLogLog also lets you trade memory for accuracy: use more memory for a lower error rate, or less memory at the cost of accuracy if you have memory limitations.
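
The standard error of HyperLogLog is roughly 1.04/√m, where m is the number of registers, so with 2^14 = 16,384 one-byte registers (about 16 KB) the error works out to 1.04/128 ≈ 0.81%.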

16kb for 0.81% error rate? Not too bad. Certainly cuts down on that 134.4GB we had to deal with earlier. 

Our first approach

We initially decided to use Redis to solve this problem. The first attempt was a very simple process where we would take every message out of a queue and put it directly into Redis and then services would call PFMERGE and PFCOUNT to get the cardinality counts we needed.
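
The write and read paths were essentially just PFADD, PFMERGE, and PFCOUNT. With a client like go-redis, that looks roughly like this (the key naming is illustrative):

```go
package counting

import (
	"context"

	"github.com/go-redis/redis/v8"
)

// addToHLL records an anonymous ID against a per-day HyperLogLog key.
func addToHLL(ctx context.Context, rdb *redis.Client, day, anonymousID string) error {
	return rdb.PFAdd(ctx, "anon_ids:"+day, anonymousID).Err()
}

// uniqueOverDays merges several daily HLLs and returns the unique count.
func uniqueOverDays(ctx context.Context, rdb *redis.Client, dayKeys []string) (int64, error) {
	if err := rdb.PFMerge(ctx, "anon_ids:merged", dayKeys...).Err(); err != nil {
		return 0, err
	}
	return rdb.PFCount(ctx, "anon_ids:merged").Result()
}
```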

This worked very well for our basic needs then, but it was hard for our analytics team to do any custom reporting. The second issue is that this architecture doesn’t scale very well. When we initially got to that point, our quick fix was this:

Not ideal, but you would be surprised how much we got out of this. This scaled upwards of 300,000 messages/s due to Redis HLL operations being amazingly fast.

Every so often, our message rate would exceed this and our queue depth would grow, but our system would keep chugging along once the message rate went back down. At this point we had the biggest Redis machine available on AWS, so we couldn’t throw more money at it.

The maximum queue depth is steadily increasing every week, and soon we will always be behind. Any kind of reporting that depends on this pipeline will become hours late, and soon maybe even days late, which is unacceptable. We really need to find a solution quickly.

Introducing Spark

We already used Spark for our billing calculations, so we thought we could use what was there to feed into another kind of reporting.

The problem with the billing counts is that they’re static. We only calculate for the workspace’s billing period. It’s a very expensive operation that takes every message for a workspace and calculates a cardinality over the user_id and anonymous_id for that exact slice. For some bigger customers, that’s several terabytes of data, so we need big machines, and it takes a while.

We previously used HLLs to solve this problem, so we’d love to use them again.

The problem is: How do we use HLLs outside of Redis?

Our requirements are:

  • We need a way to calculate cardinality across several metrics.

  • We need to be able to store this in a store with quick reads.

  • We can’t pre-calculate cardinality; we need to be able to run ad-hoc queries across any range.

  • It needs to scale almost infinitely. We don’t want to fix this problem again in a year when our message rate doubles.

  • It needs very few engineering resources.

  • (Optional) Doesn’t require urgent attention if it breaks.

First solution: Fix the scale issue

We want something reliable and easy to query, so we think of PostgreSQL and MySQL. At that moment, either seems equally fine, but we choose MySQL because we already have one storing reporting data, and the clock is ticking.

We use Scala for our Spark jobs, so we settle on Algebird to compute HLLs. It’s an amazing library and using it with Spark is trivial, so this step was very smooth. 

The map-reduce job goes like this:

  1. On every message, create an HLL out of every item you want cardinality metrics on and return that as the result of your key/map.

  2. On the reduce step, add all of the HLLs resulting from the map.

  3. Convert the resulting HLLs to the bytes representation and write them to MySQL.

Now that we have all the HLLs as bytes in MySQL, we need a way to use them from our Go service. A bit of Googling around led me to this one. There was no way of directly injecting the registers, but I was able to add it pretty easily and created a PR, which will hopefully get merged one day.

The next step is to get the registers and precision bits from Algebird, which was very simple. Next, we just take those bytes and make GoHLL objects that we can use!

Sweet! Now we can calculate cardinality of almost anything very easily. If it breaks, we fix the bug and turn it back on without any data loss. The worst that can happen is data delays. We can start using this for any use case: unique events, unique traits, unique mobile device ids, unique anything.

It doesn’t take long for us to realize there are two small problems with this:

  • Querying this is done through RPC calls on internal services, so it’s not easy for the analytics team to access.

  • At 16kb per HLL, queries ranging over several days on several metrics create a big response from MySQL.

The latter being quite important. If we’re doing analysis on all our workspaces multiple times a day, we need something that iterates quickly. Let’s say someone wants to figure out the cardinality of 10 metrics, over 5 sources and 180 days.

We’re pulling down about 140 MB of data (9,000 HLLs at roughly 16 KB each), merging them, and calculating cardinality on each metric. This takes approximately 3–5s to run. Not huge on its own, but if we want to do this for hundreds of thousands of workspaces, it takes way too long.

Making this lightning fast

So we have a solution, but it’s not ideal.

I sat down at a coffee shop one weekend and started looking around for better solutions. Ideally, we’d want to do this on the database level so we don’t have to do all of this data transfer.

MySQL has UDFs (user defined functions) that we could use for this, but we use MySQL on AWS, and from my research, there doesn’t seem to be a way to use UDFs on Aurora or RDS.

PostgreSQL, on the other hand, has an extension called postgresql-hll, which is available on PostgreSQL RDS.

The storage format is quite different unfortunately, but it’s not as much of a problem since they have a library called java-hll that I can use in my Spark job instead of Algebird. No need to play with the headers this time.

Now that we’re only pulling down 64-bit integers for each metric, we can query cardinality metrics with SQL directly, and this pipeline can scale almost infinitely. The best part? If it breaks for some reason at 4am, I don’t need to wake up and take care of it. We’re not losing any data because we’re using our S3 data archives, and it can be retried at any time without any queues filling up disk.

The Final Step: Migrating the MySQL data to Postgres

So now that we have all of this processing going forward, we need to backfill everything back to the beginning of the year. I could take the lazy way and just run the Spark jobs on every day from January 1st 2018 to now, but that’s quite expensive.

I said earlier we didn’t need to play with the headers, but we have an opportunity to save a bit of money (and have some fun) here by taking the HLLs saved in MySQL and transforming them into the postgresql-hll format and migrating them to our new PostgreSQL database.

You can take a look at the whole storage specification here. We’ll just be looking at the dense storage format for now.

Let’s look at an example of HLL bytes in hexadecimal: 146e0000000…

The first byte, 14, is defined as the version byte. The top half is the version and the bottom half is the type. Version is easy: there’s only one, which is 1. The bottom half is 4, which represents a FULL HLL, or what we’ve been describing as dense.

The next byte, 6e, encodes the log2m parameter and the registerWidth parameter. log2m is the same as the precision bits that we’ve seen above. The top 3 bits represent registerWidth - 1 and the bottom 5 represent log2m. Unfortunately, we can’t read these directly off the hex, so let’s expand it out to bits:
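
0x6e is 0110 1110 in binary. The top three bits are 011 = 3, so registerWidth = 3 + 1 = 4; the bottom five bits are 01110 = 14, so log2m = 14, i.e. 2^14 = 16,384 registers.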

The rest are the HLL register bytes.

Now that we have this header, we can easily reconstruct our MySQL HLLs by taking their registers and prefixing those with our 146e header. The only thing we need to do is take all of our existing serialized HLLs in MySQL and dump them in this new format in PostgreSQL. Money saved, and definitely much fun had.

Tyson Mote on February 7th 2019

The premise behind autoscaling in AWS is simple: you can maximize your ability to handle load spikes and minimize costs if you automatically scale your application out based on metrics like CPU or memory utilization. If you need 100 Docker containers to support your load during the day but only 10 when load is lower at night, running 100 containers at all times means that you’re using 900% more capacity than you need every night. With a constant container count, you’re either spending more money than you need to most of the time or your service will likely fall over during a load spike.

At Segment, we reliably deliver hundreds of thousands of events per second to cloud-based destinations but we also routinely handle traffic spikes of up to 300% with no warning and while keeping our infrastructure costs reasonable. There are many possible causes for traffic spikes. A new Segment customer may instrument their high-volume website or app with Segment and turn it on at 3 AM. A partner API may have a partial outage causing the time to process each event to skyrocket. Alternatively, a customer may experience an extreme traffic spike themselves, thereby passing on that traffic to Segment. Regardless of the cause, the results are similar: a fast increase in message volume higher than what the current running process count can handle.


To handle this variation in load, we use target-tracking AWS Application Autoscaling to automatically scale out (and in) the number of Docker containers and EC2 servers running in an Elastic Container Service (ECS) cluster. Application Autoscaling is not a magic wand, however. In our experience, people new to target tracking autoscaling on AWS encounter three common surprises leading to slow scaling and giant AWS billing statements.

Surprise 1: Scaling Speed Is Limited

Target tracking autoscaling scales out your service in proportion to the amount that a metric is exceeding a target value. For example, if your CPU target utilization is 80%, but your actual utilization is 90%, AWS scales out by just the right number of tasks to bring the CPU utilization from 90% to your target of 80% using the following formula:
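
Roughly: new desired task count = current task count × (actual metric value / target metric value), rounded up.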

Continuing the above example, AWS would scale out a task count of 40 to 45 to bring the CPU utilization from 90% to 80% because the ratio of actual metric value to target metric value is 113%:
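
90% / 80% = 1.125, or roughly 113%, and 40 tasks × 1.125 = 45 tasks.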

However, because target tracking scaling adjusts the service task count in proportion to the percentage that the actual metric value is above the target, a low ratio of maximum possible metric value to target metric value significantly limits the maximum “magnitude” of a scale out event. For example, the maximum value for CPU utilization that you can have regardless of load is 100%. Unlike a basketball player, EC2 servers cannot give it 110%. So, if you’re targeting 90% CPU utilization in a web service, the maximum amount that the service scales out after each cooldown period is about 11%: 100 / 90 ≈ 1.11

In the above example, the problem is that if your traffic doubled, you’d probably need to wait for seven separate scaling events before the task count roughly doubles to handle the load:
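
Each scale out event adds at most about 11%, and 1.11^7 ≈ 2.08, so going from 40 tasks to roughly 83 takes seven events.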

If your scale out cooldown is one minute, seven scaling events will take seven minutes during which time your service is under-scaled.

If you need to be able to scale up faster, you have a few options:

  • Reduce your target value to allow for a larger scale out ratio, at the risk of being over-scaled all the time ($$$).

  • Add target tracking on a custom CloudWatch metric with no logical maximum value like inflight request count (for web services) or queue depth (for queue workers).

  • Use a short scale out cooldown period to allow for more frequent scale out events. But, short cooldowns introduce their own unpleasant side effects. Read on for more on that surprise!

Surprise 2: Short Cooldowns Can Cause Over-Scaling / Under-Scaling

AWS Application Autoscaling uses two CloudWatch alarms for each target tracking metric:

  • A “high” alarm that fires when the metric value has been above the target value for the past 3 minutes. When the high alarm fires, ECS scales your service out in proportion to the amount that the actual value is above the target value. If more than one of the high alarms for your service fire, ECS takes the highest calculated task count and scales out to that value.

  • A “low” alarm that fires when the metric value has been more than 10% below the target value for the past 15 minutes. Only when all of your low alarms for a service fire does ECS slowly scale your service task count in by an undefined and undocumented amount.

In addition to the target metric value, AWS Application Autoscaling allows you to configure a “cooldown” period that defines the minimum amount of time that must pass between subsequent scaling events in the same direction. For example, if the scale out cooldown is five minutes, the service scales out, at most, every five minutes. However, a scale out event can immediately follow a scale in event to ensure that your service can quickly respond to load spikes even if it recently scaled in.

The catch is that these values cannot be arbitrarily short without causing over-scaling and under-scaling. Cooldown durations should instead be at least the amount of time it takes the target metric to reach its new “normal” after a scaling event. If it takes three minutes for your CPU utilization to drop by about 50% after scaling up 2x, a cooldown less than three minutes causes AWS to scale out again before the previous scale out has had time to take effect on your metrics, causing it to scale out more than necessary.

Additionally, CloudWatch usually stores your target metric in one- or five-minute intervals. The cooldown associated with those metrics cannot be shorter than that interval. Otherwise, after a scaling event, CloudWatch re-evaluates the alarms before the metrics have been updated, causing another, potentially incorrect, scaling event.

Surprise 3: Custom CloudWatch Metric Support is Undocumented

Update: AWS has significantly improved documentation around custom CloudWatch metric support! See:

Target tracking scaling on ECS comes “batteries included” with CPU and memory utilization targets, and they can be configured directly via the ECS dashboard. For other metrics, target tracking autoscaling also supports tracking against your custom CloudWatch metrics, but that information is almost entirely undocumented. The only reference I was able to find was a brief mention of a CustomizedMetricSpecification in the API documentation.

Additionally, the ECS dashboard does not yet support displaying target tracking policies with custom CloudWatch metrics. You can’t create or edit them in the console; you can only create them manually using the PutScalingPolicy API. Moreover, once you create them, they’ll cause your Auto Scaling tab to fail to load:

Thankfully, Terraform makes creating and updating target tracking autoscaling policies relatively easy, though it too is rather light on documentation. Here’s an example target tracking autoscaling policy using a CloudWatch metric with multiple dimensions (“Environment” and “Service”) for a service named “myservice”:
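
Something along these lines (the resource names, capacities, and CloudWatch namespace are illustrative):

```hcl
resource "aws_appautoscaling_target" "myservice" {
  service_namespace  = "ecs"
  resource_id        = "service/mycluster/myservice"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 200
}

resource "aws_appautoscaling_policy" "myservice_inflight" {
  name               = "myservice-inflight-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.myservice.service_namespace
  resource_id        = aws_appautoscaling_target.myservice.resource_id
  scalable_dimension = aws_appautoscaling_target.myservice.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value       = 100
    scale_out_cooldown = 60
    scale_in_cooldown  = 300

    customized_metric_specification {
      metric_name = "InflightRequests"
      namespace   = "myservice"
      statistic   = "Average"

      dimensions {
        name  = "Environment"
        value = "production"
      }

      dimensions {
        name  = "Service"
        value = "myservice"
      }
    }
  }
}
```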

The above autoscaling policy tries to keep the number of inflight requests at 100 for our “myservice” ECS service in production. It scales out at most every 1 minute and scales in at most every 5 minutes.

Even More Surprises

Target tracking scaling can be tremendously useful in many situations by allowing you to quickly scale out ECS services by large magnitudes to handle unpredictable load patterns. However, like all AWS conveniences, target tracking autoscaling also brings with it a hidden layer of additional complexity that you should consider carefully before choosing it over the simple step scaling strategy, which scales in and out by a fixed number of tasks.

We’ve found that target tracking autoscaling works best in situations where your ECS service and CloudWatch metrics meet the following criteria:

  • Your service should have at least one metric that is directly affected by the running task count. For example, a web service likely uses twice the amount of CPU time when handling twice the volume of requests, so CPU utilization is a good target metric for target tracking scaling.

  • The metrics that you target should be bounded, or your service should have a maximum task count that is high enough to allow for headroom in scaling out but is low enough to prevent you from spending all your money. Target tracking autoscaling scales out proportionally so if the actual metric value exceeds the target by orders of magnitude, AWS scales your application (and your bill) out by corresponding orders of magnitude. In other words, if you are target tracking scaling on queue depth and your target depth is 100 but your actual queue depth is 100,000, AWS scales out to 1,000x more tasks for your service. You can protect yourself against this by setting a maximum task count for your queue worker service.

  • Your target metrics should be relatively stable and predictable given a stable amount of load. If your target metric oscillates wildly given a stable traffic volume or load, your service may not be a good fit for target tracking scaling because AWS is not able to correctly scale your task count to nudge the target metrics in the right direction. 

One of the services that we use target tracking autoscaling for at Segment is the service that handles sending 200,000 outbound requests per second to the hundreds of external destination APIs that we support. Each API has unpredictable characteristics like latency (during which time the service is idle waiting for network I/O) or error rates. This unpredictability makes CPU utilization a poor scaling target, so we also scale on the count of open requests, or “inflight count.” Each worker has a target inflight count of 50, a configured maximum of 200, a scale out cooldown of 30 seconds, and a scale in cooldown of 45 seconds. For this particular service, that config is a sweet spot that allows us to scale out quickly (but not so quickly as to burn money needlessly) while also scaling in less aggressively.

In the end, the best way to find the right autoscaling strategy is to test it in your specific environment and against your specific load patterns. We hope that by knowing the above surprises ahead of time you can avoid a few more 3AM pager alerts and a few more shocking AWS bills.




Chris Sperandio on December 12th 2018

Today, we announced the availability of our Config API for developers to programmatically provision, audit, and maintain the Sources, Destinations, and Tracking Plans in their Segment workspaces. This is one step forward in Segment’s greater strategy to transition from API-driven to API-first development and become infinitely interoperable with companies’ internal infrastructure.

Our shift reflects a greater market shift over the past 30 years in how technology has impacted where and how companies create value. In the 80s, most industries were horizontally integrated, and few companies could afford to interact directly with customers. They created competitive advantage through operations and logistics and relied on additional layers of the value chain to reach customers. Software has made it easier to deliver services and goods more efficiently all the way to end consumers. As a result, today’s companies crave APIs that are extensible and responsive to their modular infrastructure and enable them to differentiate on customer experience for the first time. 

In this post, we’re excited to share our motivations for becoming an API-first company and the historical context for how to think about why APIs are eating software that is eating the world.  

Identifying where companies create value

So why go API-first? Because in every industry, the value chain is transforming, and APIs are the only way to keep up.

The idea of a value chain isn’t new. Businesses have been using this tool, first coined by Michael Porter of HBS, since 1985. He decomposed businesses into their various functions and arranged those functions as a pipeline, separating the “primary” activities of a firm from “supporting” ones. Primary activities are how you create and deliver value to the market, and supporting activities are those that, well, support these endeavors. 

The value chain of businesses in 1985

Thinking of a firm or business unit as a value chain is helpful for understanding where a firm has or can create a meaningful competitive advantage. In other words, it’s a system for determining where to double down on building unique, differentiated value and where to outsource to create cost advantages. 

Businesses themselves are only one link in a broader market or industry value system: the outputs of a firm’s value pipeline will subsequently pass through additional links in that chain, such as distributors and/or retailers, before they’re purchased by “end user” customers. 

The value system from supplier to end user

Before the internet — and still today in heavily industrialized or regulated industries — vertically integrating your business to own the end customer experience incurred high marginal costs and was prohibitively expensive. For consumer goods or healthcare, conventional wisdom holds that this is largely still true, though companies like Dollar Shave Club or Spruce Health might beg to differ!

The skills and competencies to differentiate in retail are different from distribution, which are different from manufacturing, etc. In focusing along these logistical steps, companies become horizontally focused, and become distant and removed from their true end customers. All too often, our everyday customer experiences still reflect this!

The critical path in a pipeline business

Such businesses, links in a linear value system from raw material to a real-world product in the hands of customers, might best be described as “pipeline businesses.” For these pipeline businesses, the links in their chain where they could best differentiate — where their moats were widest — were inbound logistics (sourcing inputs), operations, and outbound logistics (delivering outputs). Together these comprised the “critical path,” or chain of primary activities, that created value for a pipeline business.

Porter was careful to put customer-facing functions, including sales, marketing, and customer support, inside what he called primary activities. However, only very few large companies, and generally only the luxury brands affordable to the few — think Nordstrom, Mercedes, Four Seasons, or American Express — actually differentiated on these dimensions. For most large companies, these customer-facing activities were better described as secondary activities, and they expanded their profit pools by viewing them as cost centers and outsourcing or deferring to further specialized firms. (Hello, Dunder Mifflin). 

But when the internet happened, the critical path was reshaped forever.

Digital reformation: software enters the value chain

When software first emerged as a viable business tool, most enterprises considered the technology an opportunity to do what they already did more efficiently. Hence the inclusion of “technology development” as a supporting activity in the original value chain composition. 

As vendors popped up to offer software products to help support these value chain reformations, pipeline companies were most open to buying applications that could streamline their secondary activities like sales, marketing, and support. These were less risky, and most of the direct investment in building technology was thought to be better allocated in further differentiating the existing primary activities in the critical path. Because the software buyers were less invested in results — these were secondary activities, after all — they had low expectations for app usability.

The B2B vendors got away with long, onerous implementations and forced their customers to adapt the way they work to the vendor’s way of doing things. They charged extra for services that were needed to extract any value from their software. Because APIs made it easier to work with and integrate their software, these vendors saw APIs as a “nice-to-have.” Or, they charged extra for the use of these APIs to capture more from the IT budget.

Platforms over pipelines: software eats the value chain

But today, software is no longer viewed only as a tool to optimize existing things; it’s combinatorially interconnected, and it permeates everything. In this networked world, customer experience is the only true competitive advantage.

As the marginal cost of customer interactions trends to zero, companies can now afford to reach large audiences at scale and integrate their value proposition around customer experience. And in order to provide excellent customer experiences, what we used to think of as secondary activities are better framed as belonging right in the critical path through integration.

The predominant model of how businesses are organized shifts from “Pipeline” to “Platform,” and the mental model of a request/response lifecycle becomes more useful than that of a value chain.

In consumer-facing businesses, the embodiment of the request/response model is an omnipresent “mobile, on demand” company like Uber or Instacart.

In B2B, it’s an API-first one like AWS, Stripe, Plaid, or Twilio.  

These companies have digitized and vertically integrated every link of their value chain. They have slick websites and apps — on every platform — on the inbound side, and free, two-day shipping with no-worries returns on the outbound side.  

"Apps are increasingly becoming thin wrappers around use cases, not weighty shells around brands." — Chris Maddern, Co-Founder of Button

Because inbound and outbound logistics are ever “thinning” experiences, increasingly mediated via HTTP requests from mobile phones, tablets, laptops or servers, operations become everything behind those applications, and APIs make those experiences effective, relevant, worthwhile, and endearing. Request/response becomes the new pipeline.

The new critical path: Customer experience is the new logistics, and rapid learning, iteration, and integration are the new operations.

Regression models in excel supporting inventory planning? Support activity.

Data Science permeating every facet of the customer experience? Primary activity.

Traditional, reactive “Business Intelligence”? Support activity.

ML-powered supply and demand forecasting to drive real-time marketplace optimization? Primary Activity.

For consumer companies to differentiate on customer experience, they have to integrate their sales, marketing, and customer support functions — links that were once thought of as secondary. These customer-facing departments and customer-facing digital experiences should converge on a shared, ever-updating understanding of who their customer is to tailor their experiences accordingly. Moreover, companies must operationalize the learnings and insights from these interactions to contextualize and tailor subsequent experiences. 

For firms that do this right, everything from content, to product recommendations, to promotions should be based on a real-time, integrated understanding of the factors that drive great customer experiences. This process of self-tuning requires indexing massive amounts of data and then the infrastructure to iterate, optimize, and personalize on the basis of it.

Our humble revision of the Value Chain model for 2018 — the company as a request/response lifecycle

While this model of a request/response firm may not look surprising to platforms, aggregators, digital native retailers, or API-driven middleware in B2B, the stalwart companies who drive the economy are catching on. And as the modern enterprise looks more like these request/response firms every day, the nature of enterprise software is changing with them to fit the model.

Streamlining the critical path: the emergence of API-first for a request/response world

As software became networked, and those networks hit a critical density in the 2000s, technology shifted the value chain composition again. After adopting new technology in secondary business units, consumer companies realized that software could improve processes and margins by outsourcing in their primary focus areas, as well. At this point, they started to introduce technology to their critical paths.

This is where the first B2B API-first companies emerged. They turned the “pipeline” model on its head by removing the heft, ceremony, and friction associated with their own critical path. They optimized this experience with software, then productized the software itself.  As a result, they helped B2C companies outsource micro-components of their value chain and enabled these companies to enter into new primary focus areas. 

The API-first companies’ entire end-to-end value proposition is integrated between the lifecycle of an HTTP request and response. Need to process a payment? Just make a request to Stripe, and by the time they respond—a few hundred milliseconds later—they’ve handled a ton of complexity under the hood to issue the charge. Send a text to your customer? One request to Twilio does the job.

Companies like Stripe and Twilio set themselves apart not only by the sheer amount of operative complexity they’re able to put behind an API, but because of how elegant, simple, and downright pleasant their APIs are to use for developers. In doing so, they give developers literal superpowers. 

As these companies became the de facto mechanism for accomplishing these operative tasks, they’ve aggregated happy customers along the way. What started as humble request/response companies, have morphed into juggernaut platforms, expanding the scope of their missions and offerings. Before we knew it, “payment processing” became “empowering global commerce,” and “send an SMS” became “infrastructure for better communication.” 

Reducing the cost of integrating these functions via APIs propelled the creation of countless startups with lower barriers to entry.

Building B2B software in a request/response world

For B2B companies selling into enterprises that are increasingly embodying the request/response model, modularity and recognizing that you’re only a part of a much greater whole is key. 

IT is an increasingly embedded function driving interconnection and integration. Companies and their partners — be they base infrastructure providers like AWS and GCP, advertising platforms like Facebook and Google Ads, or the smartest players in the SaaS space — are embracing interoperability through common infrastructure, APIs, and technical co-investment.

Rather than view the software they buy as end-to-end solutions that they’re going to train their teams up on, these enterprises are shifting to a “build and buy” model of private and public networked applications, where security and privacy are necessarily viewed as a shared responsibility. 

The components of their infrastructure that they do choose to buy are part of a broader, sprawling network composed of on-prem deployments, as well as private, public, and third-party cloud services. As a result, they emphasize the need for data portability and the ability to bring a new tool “into the fold” of their existing governance and change control policies and procedures. In fact, it’s generally preferred that the tool acquiesces to those existing procedures than to force the team to adapt their procedures to the tool.

The worst thing you can tell your customer is that they should conform to your opinions about how to do something.

Sure, you built a beautiful user experience atop the data in your SaaS tool. But there’s an edge case you didn’t think of. And without an API, your customers have no recourse. With one, they can channel their needs into an opportunity for them to further invest in your ecosystem. More importantly, they can take “enterprise readiness” into their own hands and enact it on their own terms. In fact, I’ve been personally involved in several of our enterprise-facing initiatives, such as SSO integration with SAML IDPs and fine-grained permissions. While developing requirements for these features, far and away the most common refrain I’ve heard is, “just give me an API.”

Why is that? Amongst software developers, operations practitioners, and IT administrators alike, the concept of Infrastructure as Code (IaC) has taken hold.  This means writing code to manage configurations and automate provisioning of the underlying infrastructure (servers, databases, etc.) in addition to application deployments. The reason we were so excited to adopt this practice ourselves at Segment is that IaC inserts proven software development practices, like version control, continuous testing, and small deployments, into the management lifecycle of the base infrastructure that applications run on.

In the past, “base infrastructure” had a relatively static and monolithic connotation. Today companies are deploying their application not just to “servers” or VMs in their VPCs, but to a dynamic network of cloud-agnostic container runtimes and managed “serverless” deployment targets. At the same time, they rely on a growing network of third-party, API-driven services that provide key functions such as payments, communications, shipping, identity verification, background checking, monitoring, alerting, and analytics. 

At Segment, our own engineers refuse to waste time and increase our risk profile by clicking around in the AWS console, instead opting to use terraform for provisioning. They go so far as to home-roll applications, like specs for peering into our ECS clusters and station agent for querying them. None of these workflows or custom applications would be possible without the ECS control plane APIs. 

And it goes beyond AWS. We want to make it functionally impossible to deploy a service that doesn’t have metrics and monitoring. To do this, we threw together a terraform provider against the Datadog API and codified our baseline alerting thresholds right into our declarative service definitions. 

Now, we’re offering that same proposition to our customers through our Config API for provisioning integrations, workspaces, and configuring tracking plans. We’re excited to see a terraform provider pop up. (And, we have it on good authority the community is already working on it.) Using the Config API and terraform, customers can codify and automate their pre-configured integration settings and credentials when provisioning new domains or updating tracking plans. 

…and that’s where we get back to Segment

Because I know what you’re thinking. Wasn’t Segment already API-first?

Well, partially. Segment, historically, has been API-driven, which is to say that we’ve been API-first, but only in a few key areas. Hopefully the models and context we explored above help to explain why!

When we first launched analytics.js, we introduced an elegant and focused API for recording events about how your customers interact with your business. So you made requests to Segment — but did you wait on a response? No! You just let us handle sending the events to your chosen integrations.

That’s because, back then, it was a better inbound link to a secondary value chain activity: “analytics.” Companies didn’t want to wait even a few milliseconds to hear back from Segment because we weren’t in the critical path of their value delivery. (Side note: we went to great lengths to avoid any waiting at all — all our collection libraries are entirely asynchronous and non-blocking.)
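To make that concrete, here is a minimal sketch of the kind of fire-and-forget call analytics.js enables; the event name and properties are illustrative rather than a prescribed schema.

```ts
// A track call returns immediately; analytics.js queues the event and sends
// it in the background, so the app never blocks on Segment.
declare const analytics: {
  track: (event: string, properties?: Record<string, unknown>) => void;
};

// The event name and properties below are illustrative.
analytics.track("Order Completed", {
  orderId: "50314b8e9bcf000000000000",
  total: 27.5,
  currency: "USD",
});
```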

And while engineers loved the simplicity of our Data Collection API, the real reason they love Segment is that integrating with that API is the last analytics, marketing, sales, or support integration they ever have to do. That value proposition is what lies between our “API-driven” inbound and outbound value chain links.  The operative link in Segment’s Connections Product is the act of multiplexing, translating, and routing the data customers send us to wherever those customers want.

What exploded underneath our feet when we released analytics.js was the realization that the larger the organization, the more likely it is that the person who needs to access and analyze data is different from the person who can instrument their applications to collect that data. By adopting Segment, companies decoupled customer data instrumentation from analysis and automation, disentangling “what data do we need?” from “how are we going to use it?”

In effect, Segment became the “backbone network router” in charge of packet-switching customer data inside a company’s data network.

Becoming Customer Data Infrastructure

We got this far without thinking API-first when it came to our control plane. Even with all our high-minded prognostications about the end of traditional value chains! So why make the shift now?  

The reason to make such a change, as ever, is strong customer pull.

Since introducing our data router, Segment has evolved substantially. Today, the original Segment Data Collection API you know and love is the inbound link in the customer data infrastructure request/response lifecycle. 

With each big new product release this year, be it our GDPR functionality, Protocols, or Personas, we’ve heard emphatically from customers that they want to “drive” these features programmatically, and we’ve shipped key APIs with each to deliver on those needs.

All the while, we’ve also noticed more than a few customers — and even partners looking to develop deeper, workflow-based integrations with Segment — poking around under the hood of the private control plane APIs that drive these products.

What’s clear is that while our original, “entry-level” job to be done — analytics instrumentation — may have been a “send-it-and-forget-it” API interaction, companies have come to rely on their customer data in the critical path of delivering value through their applications, products, and experiences. Now, data collection has moved from fueling “secondary” links to being a first-order priority. 

In fact, this thesis (and the accompanying customer pull) has driven Segment’s product portfolio expansion to help companies put clean, consented, synthesized customer data in the critical path of their customer experiences.

And this is where we bring it all together. Because it’s not just consuming the data that fits the mold for an API-first model. As our customers build and adopt applications that fit into a broader network, and they bring once-“supporting” value chain links into their critical path, they want to program the infrastructure that enables that as well.  

With the APIs, our customers have built Segment change management into their SDLC workflow. They run GDPR audits of data flow through their workspace with a button click. They’re keeping their privacy policies and consent management tools up-to-date in real-time with the latest tools they are using.

It’s incredibly humbling to have customers who push the boundaries of your product and are sufficiently invested to want to integrate it more deeply and more safely into their workflows. We’re proud to be enabling that by opening up our Config API, which we welcome you to explore here.

David Scrobonia on December 4th 2018

Security tools are not user friendly. This is a problem in a world where the security community is trying to “push security left” and integrate into development culture. Developers don’t want to use tools that have a poor user experience and add friction to their daily workflows. By ignoring UX, security tools are preventing teams from making their organizations more secure.

The team behind the Zed Attack Proxy (ZAP), a popular OSS attack proxy for testing web apps, worked on addressing this problem with our application. As a result we came up with three takeaways for improving the UX of security tools.

So What’s the Problem? 

Let’s walk through using ZAP to scan a web app for vulnerabilities. Our goal is to add the target application to our scope of attack, spider the site to discover all of the pages, and run the active scanner. So we would:

  1. Look in the Site Tree pane of ZAP to find our target (http://localhost:8000 in this case)

  2. Right click on the target and select the “Include in Context” option

  3. This opens a new menu asking you to select an existing context to add our app to scope (or create a new one)

  4. This opens another configuration menu that prompts you to define any regex rules before adding the application to scope 

  5. Now that our app is in scope, we click the target icon to hide other applications from the Site Tree

  6. To start the spider right click on our application and hover over the “Attack” option in this list to expose a sub-context menu

  7. Click the newly exposed “Spider” option to open a configuration menu and press “Start Scan”.

  8. To start an active scan, we again right click on our app and hover over “Attack” to expose a menu

  9. Click “Active Scan” to open a configuration menu and finally press “Start Scan” to begin

Scanning an app with ZAP

Whew! This is not a simple workflow. It requires us to hunt through hidden context-menus, click through several different menus, and assumes that the user understands all of the configuration options. And this is for the most commonly used feature!

While this is not a great experience, ZAP is far from the least usable security tool available. Ask anybody on your security team about their experience with enterprise static analysis tools and they’ll be sure to give you an earful about a product still stuck in a pre-Y2K user interface.  In contrast to these commercial tools, ZAP is free, open source, and maintained by a handful of people working part time. This presents a huge gap in time and money that left us wondering where we should start in order to improve our user experience. To keep our efforts efficient we focused on three things.

3 Ways to Improve the UX of Security Tools

1. Make it Native

When assessing the security of a web app, you're frequently switching between ZAP and your browser. This quickly becomes distracting. Let's look at how we intercept requests with ZAP, another common feature, to see why.

  1. In the browser, navigate to and use the feature we want to test

  2. Go back to ZAP to observe the requests that were sent and turn on the “Break” feature to start intercepting messages

  3. Go back to the browser and reuse the feature we want to test

  4. Go back to ZAP to see the intercepted message, modify the request, and then press “Continue”

  5. Go back to the app to see if the modified request changed the behaviour in the app

  6. Rinse and repeat for as many requests as needed to test the feature.

Intercepting HTTP messages with ZAP

Every time we just want to intercept a request we have to go back and forth between the browser and ZAP five times! This may not seem like a lot at first, but considering that you may intercept hundreds of messages while testing for vulnerabilities, this becomes a headache.

The problem is that we are constantly changing contexts between ZAP and the native context for testing, the browser. I like to compare this to how a fighter pilot operates a jet. Their native context for flying the plane is looking through the windshield, so all of the important information they need to make decisions is presented on a heads up display (HUD). Imagine trying to survive a dog fight if all of the feedback about altitude, acceleration, and weapons status was only available in a dashboard of knobs and gauges. It would be impossible if you were continually having to monitor a separate context. 

A fighter jet’s HUD efficiently displays information

To provide a natively integrated experience in ZAP, we took inspiration from a fighter jet’s heads up display.

The ZAP Heads Up Display is a new UI overlaid on the target page you are testing, providing the functionality of ZAP in the browser.

Intercepting HTTP messages with the HUD

Making this change wasn’t a simple “lift and shift”, however. It required us to get creative with how we approached our design. We knew we didn’t want to implement the HUD as a browser plugin, which would require us to support multiple code bases that were bound to the restrictions of their plugin APIs. Even then we wouldn’t be able to support all browsers. This was a non-starter for a small development team with a global user base that uses a variety of browsers.

Instead of a browser plugin, we leveraged ZAP’s all-powerful position as a proxy to inject the HUD into the target application. When ZAP intercepts an HTTP response from a server, it modifies the HTML to include an extra script tag. The source of this script is a javascript file that is served from ZAP. When this script is executed in the web app it adds several iframes to the DOM which make up the components of the ZAP HUD.

ZAP modifies the HTTP responses of the target application
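As a rough sketch of the idea (not ZAP's actual source, and with hypothetical URLs), the injected bootstrap script mounts the HUD panels roughly like this:

```ts
// Sketch only: what a proxy-injected bootstrap script could do once the
// extra <script> tag added by ZAP has loaded it into the target page.
function mountHudPanel(side: "left" | "right", src: string): void {
  const frame = document.createElement("iframe");
  frame.src = src; // content served by ZAP, not by the target application
  frame.style.position = "fixed";
  frame.style.top = "0";
  frame.style.height = "100%";
  frame.style[side] = "0";
  document.body.appendChild(frame);
}

mountHudPanel("left", "https://zap/hud/panel.html?orientation=left");
mountHudPanel("right", "https://zap/hud/panel.html?orientation=right");
```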

This approach allows us to add the HUD to any target application running in any modern browser. By making the HUD native we’ve created a frictionless experience, which is essential if you want users to adopt your work. 

2. Sacrifice Power for Accessibility

When looking at ZAP there are a lot of powerful features, but where to find and how to access them isn’t immediately intuitive. To access a common feature like “Active Scan” we had to tediously crawl into a context menu, parse a large list, and click through a sub-menu. This inconvenience is multiplied when you consider there are multiple places in the UI you can find new features. To start a scan you can also navigate to the bottom pane, open a new tab, choose the “Active Scan” feature, and proceed through the configuration menus that way. 

Accessing features via context menus

Accessing features via the bottom drawer

Presenting features in multiple places in the UI is disorienting for a new user trying to figure out how to navigate the application. Will the next feature be found by opening a new tab, or will it be found in a sublayer of a context menu?

To address our complex UI we made a trade-off: limit features by simplifying their interfaces. We tossed away the ability to configure features upfront, eliminated multiple entry points into features, and forced scattered ZAP features into consistent UX elements. The HUD now presents features as tools, the discrete buttons on either side of the Heads Up Display.

This consistency creates the same experience when using different tools. Now, when a user wants to find more features, they know exactly where to find them - in another tool! 

Remember how complicated the “Include in Context” flow was? Now with HUD tools, all of these features have been built into the “Scope” tool. Simply click the tool, select “Add to Scope”, and we’re done! You’re now ready to attack the application. And how would we start spidering the site? You guessed it - click the Spider tool, select “Start Spider”, and it will start to run!

Scanning an app with the HUD

The simplified tools interface does not provide the same level of feature depth that ZAP traditionally provides. We can’t define a scope regex or define the maximum depth of a spider crawl right out of the box. This is the trade-off we chose, though: to forfeit feature power for accessibility.

There is a lot happening behind the scenes to enable all of the functionality of ZAP within our simpler “tool” interface. To keep the HUD responsive we aren’t loading all of the functionality into the iframes. Instead, the HUD leverages a service worker, a background javascript process similar to a web worker, to handle most of the heavy lifting so the iframes can be lightweight and responsive. Service workers are more privileged than web workers and can persist across multiple page loads, meaning we only have to load our javascript once and keep it tucked away in the background.* The service worker hosts all of the tool logic and exposes it to the different UI iframes via the postMessage API, a browser API for inter-window (and cross origin) communication. 

Not all of the functionality of the tools is stored in the service worker, though. Because we’re already running ZAP we make heavy use of its existing features via the ZAP API. ZAP’s API is very thorough and enables developers to run almost the entire application via a simple REST API. The service worker communicates with this API along with a websocket API that streams events captured in ZAP so that the HUD can have live, up-to-date notifications. 

The Heads Up Display uses several different technologies

A great way to see all of these pieces in motion is with the new “Break” tool. When the Break tool is clicked in the HUD, a postMessage is sent from the UI frame to the service worker where the tool logic is running, which sends an API request to ZAP to start intercepting messages. When the user tries to open a new page ZAP will intercept the message, use the websocket API to notify the service worker that it just intercepted a new message, and then the service worker will notify an iframe to display the intercepted message. 
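In code, that round trip looks roughly like the sketch below; the message shape and the API path are assumptions, not ZAP's documented interface.

```ts
// In a HUD UI iframe: tell the service worker that "Break" was clicked.
navigator.serviceWorker.controller?.postMessage({ tool: "break", action: "start" });

// In the service worker: run the tool logic and call back into ZAP.
self.addEventListener("message", (event: MessageEvent) => {
  const { tool, action } = (event.data ?? {}) as { tool?: string; action?: string };
  if (tool === "break" && action === "start") {
    // Illustrative endpoint: ask ZAP to start intercepting messages.
    fetch("https://zap/JSON/break/action/break/?type=http-all&state=true");
  }
});
```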

All of these technologies help to keep the HUD accessible. If a user can’t figure out how to use your software, it doesn’t matter how good it is at solving a problem; it won’t be used. 

3. Keep it Flexible

Even with an improved interface for accessing ZAP, we didn’t want to assume how users would interact with the HUD. To prevent locking our users into rigid workflows we made the UI as configurable as possible and made it easy for users to add more functionality to the HUD.

Applications come in all shapes and sizes and we don’t want the HUD’s display to get in the way. To prevent usability issues users can arrange the tools however they like, or remove tools that get in the way. The entire HUD can even be temporarily hidden from view. The ultimate goal is to have a fully customizable drag and drop interface where users can manage the HUD like it’s the home screen of their smartphone: changing fonts, adding widgets, and changing the layout.

Users can also quickly add features to the HUD, so they aren’t stuck using only the default tools. Developers often have custom testing or build scripts and connecting them to the HUD would make testing that much easier. That's something we can do in just a few minutes using only ZAP.

ZAP has a “Scripts” plugin that allows users to hook custom scripts into various points of the application. Scripts can be used to modify requests or responses, add active scanning rules, or change any other ZAP behaviour. The plugin also provides access to the HUD code, allowing users to quickly copy, paste, and modify the code for an existing tool. By tweaking just a few lines we can create a tool that uses the ZAP API to start running any user defined script: a developer’s custom testing scripts, a QA’s web automation script, or a hacker’s favorite tool.

Users can add custom functionality to the HUD in a few minutes

In the example above we have a script called “Hack it!” that replaces the text “Juice Shop” with “HACKED”. After quickly modifying an existing tool, we change a ZAP API call to enable our defined script, and when we restart the HUD you can see that the new “Hack it!” tool is now available to be added to our display.

Although this is a simple example, the scripting feature can be used to add any functionality to the HUD and supports several different scripting languages.

Conclusion

By focusing on three things we were able to make a powerful security tool much more accessible to a wider audience. By making it native we empower users in the environment they’re most comfortable in. By sacrificing power for accessibility we enable users of all levels to quickly start security testing. By keeping it flexible our tool adapts to a user’s specific needs.

While these design concepts aren’t revolutionary, or even that original, they highlight a fundamental gap between how the security community talks about security and how we practice security. If we honestly want to “push security left” we must leverage these principles to provide frictionless security for our users. 

Epilogue

The HUD is now in Alpha release! If you would like to test fly the HUD, visit https://github.com/zaproxy/zap-hud to get up and running in a few minutes and see how it works. This is still a very early release so it may be buggy, but please share your feedback on usability, reliability, and feature requests. If you’re interested in helping out with the project please reach out to us via the Github project or on Twitter at @david_scrobonia or @zaproxy.

* Service worker savvy readers will know that service workers are event driven, and that the lifecycle of a service worker expects them to be constantly terminated and activated, requiring all dependencies to be imported each time this happens. To prevent this we have hacked around this spec by sending the service worker a heartbeat to keep it alive while the HUD is active.

Gurdas Nijor on November 16th 2018

At this point it's well-accepted that analytics data is the beating heart of a great customer experience. It's what allows us to understand our customer's journey through the product and pinpoint opportunities for improvement.

This all sounds great, but reality is a messy place. As a team becomes an organization, the structure used for recording this data can easily diverge. One part of the product might use userId and another user_id. There might be both a CartCheckout and a CheckoutCart event that mean the same thing. Given thousands of call sites and hundreds of kinds of records, this sort of thing is an inevitability.

Without an enforceable shared vocabulary for describing this data, an organization's ability to use this data in any meaningful way becomes crippled.

Downstream tools for analyzing data begin to lose value as redundant or unexpected data trickles in as a result of implementation errors at the source. Fixing these issues after they’ve made it downstream turns out to be a very expensive proposition, with estimates as high as 60% of a data scientist’s time being spent cleaning and organizing data.

At Segment, we’ve put a considerable amount of engineering effort into scaling our data pipeline to ensure low-latency, high throughput and reliable delivery of event data to power the customer data infrastructure for over 15 thousand companies.

We also recently launched Protocols to help our customers ensure high quality data at scale.

In this post, I want to explore some approaches we’re taking to tackle that dimension of scalability from a developer perspective, allowing organizations to scale the domain of their customer data with a shared, consistent representation of it.

Tracking Plans

To ensure a successful implementation of Segment, we’ll typically recommend that customers maintain something known as a “Tracking Plan.”

An example of a Tracking Plan that we would use internally for our own product

This spreadsheet gives structure and meaning to the events and fields that are present in every customer data payload.

A tracking plan (also known as an implementation spec) helps clarify what events to track, where those events need to go in the code base, and why those events are necessary from a business perspective.

https://segment.com/academy/intro/how-to-create-a-tracking-plan/

An example of where a Tracking Plan becomes a critical tool would be in any scenario involving multiple engineers working across products and platforms. If there are no standards around how one should represent a “Signed Up” event or what metadata would be important to capture, you’d eventually find every permutation of it when it comes time to make use of that business critical data, rendering it “worse than useless.”

This Tracking Plan serves as a living document for Central Analytics, Product, Engineering and other teams to agree on what’s important to measure, how those measures are represented and how to name them (the three hard problems in analytics).

Where it breaks down, and how to fix it

As a Tracking Plan evolves, the code that implements it often does not change accordingly. Tickets may get assigned, but oftentimes feature work will get prioritized over maintaining the tracking code, leading to a divergence between the Tracking Plan and implementation. In addition to this, natural human error is still a very real factor that can lead to an incorrect implementation.

This error is a pretty natural result of not having a system to provide validation to an implementor (both at implementation-time, and on an ongoing basis).

Validation that an implementation is correct relative to some idealized target sounds exactly like something that machines can help us with, and indeed they have: from compilers that enforce certain invariants of programs to test frameworks that allow us to author scenarios in which to run our code and assert expected behaviors.

As alluded to above, an ideal system will provide feedback and validation at three critical places in the product development lifecycle:

  • At development time “What should I implement?”

  • At build-time “Is it right?”

  • At CI time “Has it stayed right?”

As a developer-focused company, we saw aligning a great developer experience with the process improvements of a centralized tracking plan as a compelling problem to solve.

That’s why we built, and are now open sourcing, Typewriter - a tool that lets developers “bring a tracking plan into their editor” by generating a strongly typed client library across a variety of languages from a centrally defined spec.

The developer experience of using a Typewriter generated library in Typescript

Typewriter delivers a higher degree of developer ergonomics over our more general purpose analytics libraries by providing a strongly typed API that speaks to a customer’s data domain. The events, their properties, types, and all associated documentation are present to inform product engineers that need to implement them perfectly to spec, all without leaving the comfort of their development environment.

Compile time (and runtime) validation is performed to ensure that tracking events are dispatched with the correct fields and types to give realtime validation that an implementation is correct.
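As a purely hypothetical sketch (the module path, function name, and properties are invented for illustration and won't match any particular generated client), using a generated library feels something like this:

```ts
// Hypothetical: a Typewriter-generated function for an "Order Completed"
// event defined in the tracking plan. The import path is illustrative.
import * as typewriter from "./analytics/generated";

typewriter.orderCompleted({
  orderId: "50314b8e9bcf000000000000",
  total: 27.5, // typed as a number, so passing a string fails to compile
  currency: "USD",
});
```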

This answers the questions of “What should I implement” and “Is it right” mentioned earlier. The remaining question of “Has it stayed right?” can be answered by integrating Typewriter as a task in your CI system.

How it works

Typewriter uses what amounts to a machine-readable Tracking Plan: a rich language built on JSON Schema for defining and validating events, their properties, and associated types. That spec can be compiled into a standalone library (making use of the excellent quicktype library to generate types for languages we target with static type systems).

This spec can be managed within your codebase, ensuring that any changes to it will result in a regenerated client library, and always up to date tracking code.

What comes next

Being avid Segment users ourselves, we’ve been migrating our mountains of hand written tracking code to Typewriter generated libraries and have been excited to realize the productivity gains of offloading that work to the tooling.

Typewriter will continue to evolve to support the needs of all Segment customers too — we’re continuing to expand it and are open to community PRs!

We’d love for you to give Typewriter a shot in your own projects. Feel free to open issues, submit PRs, or reach us on Twitter @segment.


A special thanks to Colin King for all of his work in making Typewriter a reality, and the team at quicktype for producing an amazing library for us to build on top of.

Michael Fischer on October 30th 2018

“How should I secure my production environment, but still give developers access to my hosts?” It’s a simple question, but one which requires a fairly nuanced answer. You need a system which is secure, resilient to failure, and scalable to thousands of engineers–as well as being fully auditable.

There’s a number of solutions out there: from simply managing the set of authorized_keys to running your own LDAP or Kerberos setup. But none of them quite worked for us (for reasons I’ll share down below). We wanted a system that required little additional infrastructure, but also gave us security and resilience. 

This is the story of how we got rid of our shared accounts and brought secure, highly-available, per-person logins to our fleet of Linux EC2 instances without adding a bunch of extra infrastructure to our stack.

Limitations of shared logins

By default, AWS EC2 instances only have one login user available.  When launching an instance, you must select an SSH key pair from any of the ones that have been previously uploaded.  The EC2 control plane, typically in conjunction with cloud-init, will install the public key onto the instance and associate it with that user (ec2-user, ubuntu, centos, etc.).  That user typically has  sudo access so that it can perform privileged operations.

This design works well for individuals and very small organizations that have just a few people who can log into a host.  But when your organization begins to grow, the limitations of the single-key, single-user approach quickly becomes apparent.  Consider the following questions:

Who should have a copy of the private key?  Usually the person or people directly responsible for managing the EC2 instances should have it.  But what if they need to go on vacation?  Who should get a copy of the key in their absence?

What should you do when you no longer want a user to be able to log into an instance?  Suppose someone in possession of a shared private key leaves the company.  That user can still log into your instances until the public key is removed from them.   Do you continue to trust the user with this key?  If not, how do you generate and distribute a new key pair?  This poses a technical and logistical challenge.  Automation can help resolve that, but it doesn’t solve other issues.

What will you do if the private key is compromised? This is a similar question as the one above, but requires more urgent attention.  It might be reasonable to trust a departing user for awhile — but if you know your key is compromised, there’s little doubt you’ll want to replace it immediately.  If the automation to manage it doesn’t yet exist, you may find yourself in a very stressful situation; and stress and urgency often lead to automation errors that can make bad problems worse.

One solution that’s become increasingly popular in response to these issues has been to set up a Certificate Authority that signs temporary SSH credentials. Instead of trusting a private key, the server trusts the Certificate Authority.  Netflix’s BLESS is an open-source implementation of such a system.  

The short validity lifetime of the issued certificates does mitigate the above risks.  But it still doesn’t quite solve the following problems:

How do you provide for different privilege levels?  Suppose you want to allow some users to perform privileged operations, but want others to be able to log in in “read-only” mode.  With a shared login, that’s simply impossible: everyone who holds the key gets privileged access.

How do you audit activity on systems that have a shared login?  At Segment, we believe that in order to have best-in-class security and reliability, we must know the “Five Ws” of every material operation that is performed on our instances:

  • What happened?

  • Where did it take place?

  • When did it occur?

  • Why did it happen?

  • Who was involved?

Only with proper auditing can we know with certainty the answers to these questions.  As we’ve grown, our customer base has increasingly demanded that we have the ability to know, too.  And if you ever find yourself seeking certification against compliance standards such as ISO 27001, PCI-DSS, or SOC 2, you will be required to show you have an audit trail at hand.

We needed better information than this:

Goals

Our goals were the following:

  1. Be able to thoroughly audit activity on our servers;

  2. Have a single source of truth for user information;

  3. Work harmoniously with our single sign-on (SSO) providers; and

  4. Use two-factor authentication (2FA) to have top-notch security.

Here’s how we accomplished them.

Segment’s solution: LDAP with a twist

LDAP is an acronym for “Lightweight Directory Access Protocol.”  Put simply, it’s a service that provides information about users (people) and things (objects).  It has a rich query language, configurable schemas, and replication support.  If you’ve ever logged into a Windows domain, you probably used it (it’s at the heart of Active Directory) and never even knew it.  Linux supports LDAP as well, and there’s an Open Source server implementation called OpenLDAP.

You may ask: Isn’t LDAP notoriously complicated?  Yes, yes it is.  Running your own LDAP server isn’t for the faint of heart, and making it highly available is extremely challenging.

You may ask: Aren’t all the management interfaces for LDAP pretty poor? We weren’t sold on any of them we’d seen yet.  Active Directory is arguably the gold standard here — but we’re not a Windows shop, and we have no foreseeable plans to be one.  Besides, we didn’t want to be in the business of managing Yet Another User Database in the first place.

You may ask: Isn’t depending on a directory server to gain access to production resources risky?  It certainly can be.  Being locked out of our servers because of an LDAP server failure is an unacceptable risk.  But this risk can be significantly mitigated by decoupling the information — the user attributes we want to propagate — from the service itself.  We’ll discuss how we do that shortly.

Choosing an LDAP service

As we entered the planning process, we made a few early decisions that helped guide our choices.

First, we’re laser-focused on making Segment better every day, and we didn’t want to be distracted by having to maintain a dial-tone service that’s orthogonal to our product. We wanted a solution that was as “maintenance free” as possible. This quickly ruled out OpenLDAP, which is challenging to operate, particularly in a distributed fashion.

We also knew that we didn’t want to spend time maintaining the directory.  We already have a source of truth about our employees: Our HR system, BambooHR, is populated immediately upon hiring.  We didn’t want to have to re-enter data into another directory if we could avoid it.  Was such a thing possible?

Yes, it was! 

We turned to Foxpass to help us solve the problem.  Foxpass is a SaaS infrastructure provider that offers LDAP and RADIUS service, using an existing authentication provider as a source of truth for user information.  They support several authentication providers, including Okta, OneLogin, G Suite, and Office 365.

We use Okta to provide Single Sign-On for all our users, so this seemed perfect for us.  (Check out aws-okta if you haven’t already.) And better still, our Okta account is synchronized from BambooHR — so all we had to do was synchronize Foxpass with Okta.

The last bit of data Foxpass needs is our users’ SSH keys.  Fortunately, it’s simple for a new hire to upload their key on their first day: Just log into the web interface — which, of course, is protected by Okta SSO via G Suite — and add it.  SSH keys can also be rotated via the same interface.

Service Architecture

In addition to their SaaS offering, Foxpass also offers an on-premise solution in the form of a Docker image.  This appealed to us because we wanted to reduce our exposure to network-related issues, and we are already comfortable running containers using AWS ECS (Elastic Container Service).  So we decided to host it ourselves.  To do this, we:

  • Created a dedicated VPC for the cluster, with all the necessary subnets, security groups, and Internet Gateway

  • Created an RDS (Aurora MySQL) cluster used for data storage

  • Created a three-instance EC2 Auto Scaling Group of instances having the ECS Agent and Docker Engine installed - if an instance goes down, it’ll be automatically replaced

  • Created an ECS cluster for Foxpass

  • Created an ECS service pair for Foxpass to manage its containers on our EC2 instances (one service for its HTTP/LDAP services; one service for its maintenance worker)

  • Stored database passwords and TLS certificates in EC2 Parameter Store

We also modified the Foxpass Docker image with a custom ENTRYPOINT script that fetches the sensitive data from Parameter Store (via Chamber) before launching the Foxpass service:
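In sketch form it looks something like this; the chamber service name and the wrapped launch command are assumptions, not the actual image contents:

```sh
#!/bin/sh
# Wrap the original Foxpass entrypoint with chamber so that secrets stored in
# SSM Parameter Store arrive as environment variables.
set -e
exec chamber exec foxpass -- /opt/foxpass/run.sh "$@"
```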

Client instance configuration

On Linux, you need to configure two separate subsystems when you adopt LDAP authentication:

Authentication:  This is the responsibility of PAM (Pluggable Authentication Modules) and sshd (the ssh service).  These subsystems check the credentials of anyone who either logs in, or wants to switch user contexts (e.g. sudo).

User ID mappings: This is the realm of NSS, the Name Service Switch.  Even if you have authentication properly configured, Linux won’t be able to map user and group names to UIDs and GIDs (e.g. mifi is UID 1234) without it.

There are many options for setting these subsystems up.  Red Hat, for example, recommends using SSSD on RHEL.   You can also use pam_ldap and nss_ldap to configure Linux to authenticate directly against your LDAP servers.  But we chose neither of those options:  We didn’t want to leave ourselves unable to log in if the LDAP server was unavailable, and both of those solutions have cases where a denial of service is possible.  (SSSD does provide a read-through cache, but it’s only populated when a user logs in.  SSSD is also somewhat complex to set up and debug.)

nsscache

Ultimately we settled on nsscache.  nsscache (along with its companion NSS library, libnss-cache) is a tool that queries an entire LDAP directory and saves the matching results to a local cache file.    nsscache is run when an instance is first started, and about every 15 minutes thereafter via a systemd timer.   

This gives us a very strong guarantee: if nsscache runs successfully once at startup, every user who was in the directory at instance startup will be able to log in.  If some catastrophe occurs later, only new EC2 instances will be affected; and for existing instances, only modifications made after the failure will be deferred.  

To make it work, we changed the following lines in /etc/nsswitch.conf.  Note the cache keyword before compat:
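The relevant lines look roughly like this (only the databases we cache are shown; your list may differ):

```
# /etc/nsswitch.conf: "cache" ahead of "compat" makes NSS consult the
# nsscache-generated local files before the traditional sources.
passwd:         cache compat
group:          cache compat
shadow:         cache compat
```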

So that users’ home directories are automatically created at login, we added the following to /etc/pam.d/common-session:
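That line is the standard pam_mkhomedir module; the options shown here are typical defaults rather than our exact configuration:

```
# /etc/pam.d/common-session: create a home directory on first login.
session required pam_mkhomedir.so skel=/etc/skel umask=0022
```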

nsscache also ships with a program called nsscache-ssh-authorized-keys which takes a single username argument and returns the ssh key associated with the user.  The sshd configuration (/etc/ssh/sshd_config) is straightforward:
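In sketch form (the binary path and the lookup user are assumptions):

```
# /etc/ssh/sshd_config: ask nsscache for a user's public keys instead of
# reading ~/.ssh/authorized_keys from the home directory.
AuthorizedKeysCommand /usr/bin/nsscache-ssh-authorized-keys
AuthorizedKeysCommandUser nobody
```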

Emergency login

We haven’t had any reliability issues with nsscache or Foxpass since we rolled it out in late 2017.  But that doesn’t mean we’re not paranoid about losing access, especially during an incident! So just in case, we have a group of emergency users whose SSH keys live in a secret S3 bucket.   At instance startup, and regularly thereafter, a systemd unit reads the keys from the S3 bucket and appends them to the /etc/ssh/emergency_authorized_keys file.  

As with ordinary users, the emergency user requires two-factor authentication to log in.  For extra security, we’re also alerted whenever a new key is added to the S3 bucket via Event Notifications.

We also had to modify /etc/ssh/sshd_config to make it work:
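A sketch of that change (the account name and the Match scoping are assumptions):

```
# /etc/ssh/sshd_config: trust the synced emergency key file, scoped to the
# emergency account only.
Match User emergency
    AuthorizedKeysFile /etc/ssh/emergency_authorized_keys
```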

Security

Bastion servers

Security is of utmost importance at Segment. Consistent with best practices, we protect our EC2 instances by forcing all access to them through a set of dedicated bastion servers.  

Our bastion servers are a bit different than some in that we don’t actually permit shell access to them: their sole purpose is to perform Two-Factor Authentication (2FA) and forward inbound connections via secure tunneling to our EC2 instances.

To enforce these restrictions, we made a few modifications to our configuration files.  

First, we published a patch to nsscache that optionally overrides the users’ shell when creating the local login database from LDAP.   On the bastion servers, the shell for each user is a program that prints a message to stdout explaining why the bastion cannot be logged into, and exits nonzero.

Second, we enabled 2FA via Duo Security. Duo is an authentication provider that sends push notifications to our team members’ phones and requires confirmation before logging in.  Setting it up involved installing their client package and making a few configuration file changes.  

First, we had to update PAM to use their authentication module:
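In its simplest form that means adding Duo's PAM module to the SSH stack; the exact file, control flags, and ordering depend on your distribution and Duo's pam_duo documentation:

```
# /etc/pam.d/sshd: require a Duo push after primary authentication.
auth required pam_duo.so
```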

Then, we updated our /etc/ssh/sshd_config file to allow keyboard-interactive authentication (so that users could respond to the 2FA prompts):
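Roughly like this; the AuthenticationMethods policy shown is an assumption about how the factors are combined:

```
# /etc/ssh/sshd_config: let PAM drive the Duo prompt, and require it on top
# of the user's public key.
UsePAM yes
ChallengeResponseAuthentication yes
AuthenticationMethods publickey,keyboard-interactive
```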

On the client side, access to protected instances is managed through a custom SSH configuration file distributed through Git.  An example stanza that configures proxying through a bastion cluster looks like this:
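The stanza below is a reconstruction with placeholder host names, not our actual config:

```
# ~/.ssh/config: send connections to internal hosts through the bastion,
# which performs 2FA and forwards the tunnel.
Host *.internal.example.com
    User mifi
    ProxyCommand ssh -W %h:%p bastion.example.com
```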

Mutual TLS

To avoid an impersonation attack, we needed to ensure our servers connected to and received information only from the LDAP servers we established ourselves.  Otherwise, an attacker could provide their own authentication credentials and use them to gain access to our systems.

Our LDAP servers are centrally located and trusted in all regions.  Since Segment is 100% cloud-based, and cloud IP addresses are subject to change, we didn’t feel comfortable solely using network ACLs to protect us.  This is also known as the zero-trust network problem: How do you ensure application security in a potentially hostile network environment?

The most popular answer is to set up mutual TLS authentication, or mTLS.  mTLS validates the client from the server’s point of view, and the server from the client’s point of view.  If either validity check fails, the connection fails to establish.

We created a root certificate, then used it to sign the client and server certificates.  Both kinds of certificates are securely stored in EC2 Parameter Store and encrypted both at rest and in transit, and are installed at instance-start time on our clients and servers. In the future, we may use AWS Certificate Manager Private Certificate Authority to generate certificates for newly-launched instances.

Conclusion

That's it! Let's recap:

  • By leaning on Foxpass, we were able to integrate with our existing SSO provider, Okta, and avoided adding complicated new infrastructure to our stack.

  • We leveraged nsscache to prevent any upstream issues from locking us out of our hosts.  We use mutual TLS between nsscache and the Foxpass LDAP service to communicate securely in a zero-trust network. 

  • Just in case, we sync a small number of emergency SSH keys from S3 to each host at boot.

  • We don't allow shell access to our bastion hosts. They exist purely for establishing secure tunnels into our AWS environments.

In sharing this, we're hoping others find an easier path towards realizing secure, personalized logins for their compute instance fleets. If you have any thoughts or questions, feel free to tweet @Segment or reach me at michael@dynamine.net.

Jeroen Ransijn on October 16th 2018

Design systems are emerging as a vital tool for product design at scale. These systems are collections of components, styles, and processes to help teams design and build consistent user experiences. It seems like everyone is building one, but there is no playbook on how to take it from the first button to a production-ready system adopted across an organization. Much of the advice and examples out there are for teams that seem to have already figured it out.

Today I want to share my experience in bootstrapping a design system and driving adoption within our organization, Segment. I will share how we got started by creating something small and useful  first. Then I will share how I hijacked a project to build out that small thing into our full blown design system known as Evergreen. Finally, I will share how we continue to drive and track adoption of our design system.

What is a Design System?

A design system is a collection of components, styles and processes to help teams design and build consistent user experiences — faster and better. Design systems often contain components such as buttons, popovers and checkboxes, and foundational styles such as typography and colors. Teams that use the design system can focus on what’s unique to their product instead of reinventing common UI components.

What’s in our Design System

Before I share my experience bootstrapping our design system, Evergreen, I want to set some context first and explain what is in our design system.

  • Design Resources

    • Sketch UI Kit

    • Design Guidelines

  • Code Resources

    • React UI Framework

    • Developer Documentation

  • Operational Resources

    • Roadmap documents

    • On-boarding process

Our design system didn’t start out with all of those resources. In fact, I built something small and useful first. In the next sections I will share the lessons I learned in bootstrapping a design system and driving adoption within our organization.

How We Got Started

About 2 years ago, I joined Segment as a product designer. I worked as a front-end developer in the past and I wanted to use my skillset to create interactive prototypes. To give you a bit of context, the Segment application allows our customers to collect data from their websites or apps, synthesize that data and integrate with over 200 integrations for marketing and analytics.

The prototypes I wanted to develop would live outside of our Segment application and would have no access to the application codebase. This means that I didn’t have access to the components already in the application — I had to create everything from scratch.

Most advice online talks about starting with a UI audit or trying to get executive buy-in. Those are all part of the long journey of a design system, but there are many ways to get started. If you set out to solve all of the problems in your product, you might be taking on too much at once. Instead, build something small and useful, provide value quickly, and iterate on what works.

Build something small and useful

One of the first challenges you run into when creating a component library is how to deal with styling and CSS. There are a few different ways to deal with this:

  • Traditional CSS: Verbose to write, hard to maintain at scale. Often relies on conventions.

  • CSS Preprocessor such as Sass or Less: Easier way to write CSS, chance of naming collisions. Often relies on conventions.

  • CSS-in-JS solutions: Write CSS in JavaScript. Powerful ways to abstract into components.

I wanted a solution that didn’t require any extra build steps or extra imports when using the component library. CSS-in-JS made this very easy. You can import a component in your code and it works out of the box.

I wanted to avoid having to create a ton of utility class names to override simple CSS properties on components such as dimensions, position and spacing. It turns out there is a way to achieve this in an elegant way — enter the React UI primitive.

Choosing React

There are many choices of frameworks for your component library. When I started building a component library, we were already using React, so it was the obvious choice.

React UI Primitive

After doing research, I found the concept of UI primitives. Instead of dealing with CSS directly, you deal with the properties on a React component. I bounced ideas off my coworkers and got excited about what this would mean. In the end we built UI-BOX.

UI-BOX

UI-BOX exports a single Box component that allows you to use React props for CSS properties. Instead of creating a class name, you pass the property to the Box component directly:
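Here's a minimal sketch of that usage; the component and values are illustrative, but the pattern of passing CSS values as props is the whole point of ui-box:

```tsx
import React from "react";
import Box from "ui-box";

// Dimensions and spacing are passed as props; ui-box turns them into
// generated class names under the hood.
export function Card() {
  return (
    <Box is="section" padding={16} marginBottom={24} maxWidth={480}>
      Card content
    </Box>
  );
}
```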

Why is this Box component useful?

The Box component is useful because it helps with three common use cases:

  • Create layouts without helper classes.

  • Define components without worrying about CSS.

  • Override single properties when using components.

Create layouts without helper classes.

Define components without worrying about CSS

Override single properties when using components

Flexibility and composability

The Box component makes it easy to start writing new components that allow setting margin properties directly on the component. For example, quickly space out two buttons by adding marginRight={10} to the left button. Also, you can override CSS properties without adding new distinct properties to the component. For example, this is useful when a full-width button is needed, or when you want to remove the border-radius on one side of a button. Furthermore, layouts can be created instantly by using the Box component directly.

Still a place for CSS

It is important to note that UI-BOX only solves some of the problems. A class is still needed to control the appearance of a component. For example, a button can get its dimensions and spacing from UI-BOX, but a class defines the appearance: background color, box shadows, and color, as well as the hover, active and focus states. In our design system, Evergreen, we use a CSS-in-JS library called Glamor to create these appearance classes.

Why it drove adoption of Evergreen

A design system can start with something small and useful. In our case it was using a UI primitive that abstracted away dealing with CSS directly. Roland, one of our lead engineers, said the following about UI-BOX:

UI-BOX really drove adoption of Evergreen…

…there is no need to consider every configuration when defining a new component. And no need to wrap components in divs for spacing.

— Roland Warmerdam, Lead Software Engineer, Segment

The lesson learned here is that it’s possible to start with something small and slowly grow that out to a full fledged design system. Don’t think you have time for that? Read the next section for some ideas.

How we started driving adoption

Up until now, I had built a tool for myself in my spare time, but it was still very much a side project. Smaller startups often can’t prioritize a design system as it doesn’t always directly align with business value. I will share how I hijacked a project, scaled out the system, and finally drove adoption across teams at Segment — and how you can do the same.

Hijack a project

About a year ago I switched teams within Segment. I joined a small team called Personas, which was almost like a small startup within Segment. With Personas we were building user profiles and audience capabilities on top of the Segment platform. It turned out to be a perfect opportunity to build out more of the design system.

Deadline in sight — our first user conference

The company wanted to announce the Personas product at our first ever user conference, with only 3 months of lead time to prepare. The idea was that our CEO and Head of Product would demo it on stage. However, there was no way we could finish a fully-baked consumer-facing product in time. We were pivoting too often based on customer feedback.


The company wanted to announce this product at our first ever user conference, with only 3 months of lead time to prepare.


Seize the opportunity

It seemed like an impossible deadline. Then it hit me: we could build a standalone prototype to power the on-stage demo. This prototype would be powered by fake data and support only the functionality that was part of the demo.

This prototype would live outside of the confines of our application. This would allow us to build things quickly, but the downside is that there is no access to the code and components that live in our application codebase. Every component we want to use in the prototype needs to be built — a perfect opportunity to build out more of the design system. We decided it would be the lowest risk, highest reward option for us to pursue.

While we worked on the demo script for the on-stage demo, I was crunching away on the prototype and Evergreen. Having the prototype available and easily shareable made it easier for the team to practice and fine-tune the script. It was a great time at Segment; I could see the team and company growing closer while readying for launch.

Huge Success

The interactive prototype was a huge success. It helped us show the vision of what our product and Personas could be. It drove considerable interest to our newest product, Personas. I was happy, because not only did we have an interactive prototype, we also had the first parts of our design system.

Focus on the developer experience

So far, we built something small and useful and hijacked a project that allowed us to build out a big chunk of our design system, Evergreen. The prototype also proved to be a great way to drive adoption of Evergreen in our application. Our developers simply took code from the prototype and ported it over into our application. 

At that point, Evergreen components were adopted in over 200 source code files. Our team was happy about the components, but there were some pain points with the way Evergreen was structured. When we started building Evergreen, we copied some of the architecture decisions of bigger design systems. That turned out to be a mistake. It slowed us down.

Too early for a mono-repo

When I started building Evergreen I took a lot of inspiration from Atlassian’s AtlasKit. It is one of the most mature and comprehensive enterprise design systems out there. We used the same mono-repo architecture for Evergreen, but it turns out there is quite a lot of overhead when using a mono-repo.

Our developers were not happy with the large number of different imports in each file. There were over 20 different package dependencies. Maintaining these dependencies was painful. Besides unhappy developers, it was time-consuming to add new components.

A single dependency is better for us (for now)

I wanted to remove as much friction for our developers using Evergreen as possible, which is why I wanted to migrate away from the mono-repo. Instead, a single package would export all of our components as a single dependency.

Migrate our codebase in a single command

When we decided to migrate to a single package, it required updating the imports in all the places Evergreen was used in our application. At this point Evergreen was used in over 200 source code files in the Segment application. It seemed like a pretty daunting challenge, not something anyone got excited about doing manually. We started exploring our options and ways to automate the process, and to our surprise it was easier than we thought.

Babel parser to the rescue

We created a command line tool for our application that could migrate the hundreds of files of source code using Evergreen with one command. The syntax was transformed using babel-parser. Now it’s a much better experience for our developers in the application. In the end, our developers were happy.

Lesson learned: face the challenge

A big change like this can feel intimidating and give you second thoughts. I wish I had started Evergreen with the architecture it has right now, but sometimes the right choice isn’t clear up front. The most important thing is to learn and move forward. 

Driving adoption of a design system is very challenging. It is hard to understand progress. We came up with a quite nifty way to visualize the adoption in our application — and in turn make data-driven decisions about the future of Evergreen.

How to get to 100% adoption

Within our company, teams operate on key metrics to get resources and show they are being successful to the rest of the company. One of the key metrics for Evergreen is 100% adoption in our application. What does 100% even mean? And how can we report on this progress?

What does 100% adoption even mean?

100% adoption at Segment means building any new products with Evergreen and deprecating our legacy UI components in favor of Evergreen components. The first part is the easiest as most teams are already using Evergreen to build new products. The second part is harder. How do we migrate all of our legacy UI components to Evergreen components?

What legacy UI components are in our app?

Active code bases will accrue a large number of components over time. In our case this comes in the form of legacy component libraries that live in the application codebase.

In our case it comes in the following two legacy libraries:

  • React UI Library, precursor to Evergreen.

  • Legacy UI folder, literally a folder called ui in our codebase that holds some very old components.

Evergreen versions

In addition to the legacy libraries, the application is able to leverage multiple versions of Evergreen. This allows gradual migration from one version to another.

  • Evergreen v4, the latest and greatest version of Evergreen. We want 100% of this.

  • Evergreen v3, the previous version of Evergreen. We are actively working on migrating this over to v4.
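One way to have two major versions installed side by side is with package aliases in package.json. This is a sketch of the general approach, not necessarily how our application is wired up:

    // package.json (sketch): install v3 under an alias so v4 can own the
    // real package name while old screens migrate gradually.
    {
      "dependencies": {
        "evergreen-ui": "^4.0.0",
        "evergreen-ui-v3": "npm:evergreen-ui@^3.0.0"
      }
    }

    // New code imports from "evergreen-ui"; not-yet-migrated code imports
    // from "evergreen-ui-v3" until it is moved over.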

How can we report on the progress of adoption?

The solution we came up with to report on the adoption of Evergreen is an adoption dashboard. At any single point in time the dashboard shows the following metrics:

  • Global Adoption, the current global state of adoption

  • Adoption Week Over Week, the usage of Evergreen (and other libraries) week over week

  • Component Usage, a treemap of each component sorted by framework. Each square is sized by how many times the component is imported in our codebase.

The Component Usage Treemap

Besides the aggregates, we know exactly which files import a component. To visualize this, a treemap chart on the dashboard shows each component, with the size of the square representing how many times it is imported in our application.

Understand exactly where you are using a component

Clicking on one of the squares in the treemap shows a side sheet with a list of all the files which import that component. This information allows us to confidently deprecate components.
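Under the hood this is a small data transformation. Here is a rough sketch, assuming a per-library report that maps each component name to the list of files importing it (the real dependency-report output may be shaped differently):

    // Sketch: turn an import report into treemap nodes. Assumes a report
    // like { Button: ['src/a.js', 'src/b.js'], Table: ['src/c.js'] }.
    function toTreemapNodes(reportForLibrary) {
      return Object.entries(reportForLibrary)
        .map(([component, files]) => ({
          name: component,     // label on the square
          value: files.length, // square size = number of importing files
          files,               // backs the side sheet when a square is clicked
        }))
        .sort((a, b) => b.value - a.value);
    }

    // toTreemapNodes({ Button: ['src/a.js', 'src/b.js'], Table: ['src/c.js'] })
    // => [ { name: 'Button', value: 2, ... }, { name: 'Table', value: 1, ... } ]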

Filter down a list of low-hanging fruit to deprecate

The adoption dashboard also helps to prioritize the adoption roadmap. For example, legacy components that are only imported once or twice are easy to deprecate.

How it works

Earlier I shared how we used babel-parser to migrate to the new import structure. Being true to our roots, we realized the same technique could be used to collect analytics for our design system! To get to the final adoption dashboard, there are a few steps involved.

Step 1. Create a report by analyzing the codebase

We wrote a command line utility that returns a report by analyzing the import statements at the top of each file in our codebase. An index is built that maps these files to their dependencies. Then the index can be queried by package and optionally by export. Here is an example:

Command and output: a dependency-report invocation and the JSON report it prints.

We open-sourced this tool. If you are interested in learning more or want to build your own adoption dashboard, see https://github.com/segmentio/dependency-report.
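The sketch below illustrates the idea behind that index (it is not the actual dependency-report implementation, and the glob dependency is just an assumed helper for matching files): parse each file, record which packages and exports it imports, and keep the file paths so the index can answer both “how many” and “where”.

    // index-sketch.js: build a package -> export -> files index from imports.
    const fs = require('fs');
    const glob = require('glob'); // assumed helper for matching source files
    const parser = require('@babel/parser');

    function buildIndex(pattern) {
      const index = {}; // { [packageName]: { [exportName]: Set<filePath> } }
      for (const file of glob.sync(pattern)) {
        const ast = parser.parse(fs.readFileSync(file, 'utf8'), {
          sourceType: 'module',
          plugins: ['jsx'],
        });
        // Import declarations only ever appear at the top level of a module.
        for (const node of ast.program.body) {
          if (node.type !== 'ImportDeclaration') continue;
          const pkg = node.source.value;
          index[pkg] = index[pkg] || {};
          for (const spec of node.specifiers) {
            const name =
              spec.type === 'ImportSpecifier' ? spec.imported.name :
              spec.type === 'ImportNamespaceSpecifier' ? '*' : 'default';
            (index[pkg][name] = index[pkg][name] || new Set()).add(file);
          }
        }
      }
      return index;
    }

    // Query by package, and optionally by export:
    // const index = buildIndex('src/**/*.js');
    // console.log([...(index['evergreen-ui']?.Button ?? [])]); // files importing Button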

Step 2. Create and save a report on every app deploy

  • Every time we deploy our application, the codebase is analyzed and a JSON report is generated using the dependency-report tool.

  • Once the report is generated, it is persisted to object storage (S3).

  • After persisting the report, a webhook triggers a rebuild of our dashboard via the Gatsby static site generator (roughly sketched below).
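A rough sketch of that glue code, with a placeholder bucket name and build-webhook URL standing in for our real infrastructure:

    // deploy-report-sketch.js: persist the JSON report, then kick off the dashboard build.
    const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
    const https = require('https');

    async function publishReport(report) {
      const s3 = new S3Client({ region: 'us-west-2' });

      // 1. Persist the report, keyed by timestamp so history accumulates.
      await s3.send(new PutObjectCommand({
        Bucket: 'adoption-reports-example',              // placeholder bucket
        Key: `reports/${new Date().toISOString()}.json`,
        Body: JSON.stringify(report),
        ContentType: 'application/json',
      }));

      // 2. Trigger a rebuild of the Gatsby dashboard via its build webhook.
      https.request('https://example.com/build-hook', { method: 'POST' }).end();
    }

A function like this would be called at the very end of the deploy, once the dependency-report output is available.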

Step 3. Build the dashboard and load the data

To reduce the number of reports on the dashboard, the generator only retrieves the most current report as well as a sample report from each previous week. The latest report is used to show the current state. The reports of the previous weeks are used to calculate an aggregate for the week over week adoption chart.
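As a sketch of that sampling and aggregation (the report shape, a timestamp plus per-library import counts, is an assumption for illustration):

    // Sketch: keep one report per week, then compute an adoption percentage
    // for each sampled week. Assumes reports like
    // { date: '2019-07-01T12:00:00Z', counts: { 'evergreen-ui': 420, 'legacy-ui': 80 } }.
    const MS_PER_WEEK = 7 * 24 * 60 * 60 * 1000;

    function sampleOnePerWeek(reports) {
      const byWeek = new Map();
      for (const report of reports) {
        // Coarse week bucket counted from the Unix epoch; the last report in a week wins.
        byWeek.set(Math.floor(new Date(report.date).getTime() / MS_PER_WEEK), report);
      }
      return [...byWeek.values()];
    }

    function adoptionPercent(counts) {
      const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
      return total === 0 ? 0 : (100 * (counts['evergreen-ui'] || 0)) / total;
    }

    // Week-over-week series for the chart:
    // sampleOnePerWeek(reports).map((r) => adoptionPercent(r.counts));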

How the adoption dashboard is pushing Evergreen forward

The adoption dashboard was the final piece in making Evergreen a success, as it helped us migrate old parts of our app systematically and with confidence. It was easy to identify usage of legacy components in the codebase and know when it was safe to deprecate them. Our developers were also excited to see a visual representation of the progress. These days it helps us make data-driven decisions about the future of Evergreen and prioritize our roadmap. And honestly, it is pretty cool.

Conclusion

To those of you who are considering setting out on this journey, I’ll leave you with a few closing thoughts:

  • Start small. It’s important to show the value of a potential design system by solving a small problem first.

  • Find a real place to start. A design system doesn’t have value by itself. It only works when applied to a real problem.

  • Drive adoption and measure your progress. The real work starts once adoption begins; a design system is only valuable once it is fully integrated into the team’s workflow.

This is only the start of our journey. There are still many challenges ahead. Remember, building a design system is not about reaching a single point in time. It’s an ongoing process of learning, building, evangelizing and driving adoption in your organization.
