
Calvin French-Owen on June 5th 2020

This blog should not be construed as legal advice. Please discuss with your counsel what you need to do to comply with the GDPR, CCPA, and other similar laws.

Under the GDPR and CCPA, any company that serves users in the EU or California must allow its users to request that their data be either deleted or suppressed.

  • Deletion – all identifying information related to the user must be properly deleted.

  • Suppression – the user should be able to specify where their data is used and sent (e.g. for a marketing, advertising, or product use case).

When you get a deletion request, it doesn’t just mean deleting a few rows of data in your database. It’s your responsibility to purge data about your users from all of your tools – email, advertising, and push notifications.

Typically, this process is incredibly time-consuming. We have seen companies create custom JIRA workflows, in-depth checklists, and other manual work to comply with the law. 

In this article, we’ll show you how to automate this process and respect user privacy by:

  • Managing consent with our open source consent manager.

  • Issuing DSAR (Data Subject Access Requests) on behalf of your users.

  • Federating those requests to downstream tools.

Let's dive in.

Step 1: Set up a JavaScript source and Identify calls

If you haven’t already, you’ll want to make sure you have a data source set up on your website and are collecting your user data through Segment.

The easiest way to do this is via our JavaScript library and analytics.identify calls.

// when a user first logs in, identify them with name and email
analytics.identify('my-user-id', {
  email: 'jkim@email.com',
  firstName: 'Jane',
  lastName: 'Kim'
})

Generally, we recommend you first:

  • Generate user IDs in your database – a user ID should never change! It’s best to generate these in your database so they stay constant even if a user changes their email address. We’ll handle anonymous IDs automatically.

  • Collect the traits you have – you don’t have to worry about collecting every trait with every call. We’ll automatically merge them for you, so just collect what you have (see the sketch after this list).

  • Start with messaging – if you’re trying to come up with a list of traits to collect, start with email personalization. Most customers begin by collecting email, first and last name, age, phone, role, and company info so they can send personalized emails or push notifications.
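As a minimal sketch of that merging behavior (the user ID and trait values here are placeholders), successive identify calls only need to carry whatever traits you have on hand:

// at signup, all we know is the email
analytics.identify('user-1042', { email: 'jkim@email.com' })

// later, once the user completes their profile, send the rest;
// Segment merges these traits with the earlier ones for the same user ID
analytics.identify('user-1042', {
  firstName: 'Jane',
  lastName: 'Kim',
  company: 'Example Inc.'
})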

Once you’ve collected data, you’re ready to start your compliance efforts.

Step 2: Enable the open-source consent manager

Giving users the ability to control what personal data is collected is a huge part of any privacy compliance regime. 

We’ve built an open source drop-in consent manager that automatically works with Analytics.js.

Adding it in is straightforward.

Updating the snippet

First, you’ll want to remove the following two lines from your analytics.js snippet.

analytics.load("<Your Write Key") // <-- delete meanalytics.page() // <-- delete me

These will automatically be called by the consent manager.

Add in your config

We’ve included some boilerplate configuration, which dictates when the consent manager is shown and what the text looks like. You’ll want to add this somewhere and customize it to your liking.
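As a rough sketch of what that configuration can look like in the standalone build (the copy and option values below are placeholders to adapt; check the consent manager README for the full list of options):

window.consentManagerConfig = function (exports) {
  return {
    container: '#target-container', // where the consent banner is rendered
    writeKey: '<your-segment-write-key>',
    bannerContent: 'We collect data to improve your browsing experience.',
    preferencesDialogTitle: 'Website Data Collection Preferences',
    preferencesDialogContent: 'Choose which categories of data collection you consent to.',
    cancelDialogTitle: 'Are you sure you want to cancel?',
    cancelDialogContent: 'Your preferences have not been saved.'
  }
}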

You’ll also want to add a target container for the manager to load into:

<div id="target-container"></div>

You can and should also customize this to your liking.

Load the consent manager

Finally, we’re ready to load the consent manager.

<script src="https://unpkg.com/@segment/consent-manager@5.0.0/standalone/consent-manager.js" defer></script>

Once you’re done, it should look like this.

Great, now we can let users manage their preferences! They can opt in to all data collection, or just the portions they want.

Step 3: Collecting deletion requests

Now it’s time to allow users to delete their data. The simplest way to do this is to start an Airtable sheet to keep track of user requests, and then create a form from it.

At a minimum, you’ll want to have columns for:

  • The user identifier – either an email or user ID.

  • A confirmation that the user was authenticated, if your form is public.

  • A checkbox indicating that the deletion was submitted.

From there, we can automatically turn it into an Airtable form to collect this data.

To automate this you can use our GDPR Deletion APIs. You can automatically script these so that you don’t need to worry about public form submissions. We’ve done this internally at Segment. 
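For illustration, a scripted deletion request might look roughly like the sketch below; the endpoint, payload shape, and auth header are assumptions for the example rather than the exact API contract, so check the deletion API docs before using it.

// Hypothetical sketch only: the URL and body are illustrative placeholders.
async function requestDeletion(userId) {
  const res = await fetch('https://api.segment.example/regulations', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.SEGMENT_API_TOKEN}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      regulationType: 'SUPPRESS_WITH_DELETE', // delete and suppress this user
      subjectType: 'USER_ID',
      subjectIds: [userId]
    })
  })
  if (!res.ok) throw new Error(`Deletion request failed: ${res.status}`)
  return res.json()
}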

Tip: Make sure deletions are guarded by some sort of confirmation step, or only accessible when the user is logged in.

Step 4: Issuing deletions and receipts

Now we’re ready to put it all together. We can issue deletion requests within Segment for individual users.

This will remove user records from:

  • Segment archives

  • Your warehouses and data lakes

  • Downstream destinations that support deletion

To do so, simply go to the deletion manager under Workspace Settings > End User Privacy.

This will allow you to make a new request by ID.

Simply select “New Request”, and enter the user ID from your database.

This will automatically kick off deletions in any end tools which support them. You’ll see receipts in Segment indicating that these deletions went through.

As your different destinations begin processing this data, they will send you notifications as well.

And just like that, we’ve built deletion and suppression into our pipeline, all with minimal work!

Wrapping up

Here’s what we’ve accomplished in this article. We’ve:

  • Collected our user data thoughtfully and responsibly by asking for consent with the Segment open source consent manager.

  • Accepted deletion requests via Airtable or the Segment deletion API.

  • Automated that deletion in downstream tools with the deletion requests.

Try this recipe for yourself...

Get help implementing this use case by talking with a Segment Team member or by signing up for a free Segment workspace here.

All Engineering articles

Chris Sperandio on December 12th 2018

Today, we announced the availability of our Config API for developers to programmatically provision, audit, and maintain the Sources, Destinations, and Tracking Plans in their Segment workspaces. This is one step forward in Segment’s greater strategy to transition from API-driven to API-first development and become infinitely interoperable with companies’ internal infrastructure.

Our shift reflects a greater market shift over the past 30 years in how technology has impacted where and how companies create value. In the 80s, most industries were horizontally integrated, and few companies could afford to interact directly with customers. They created competitive advantage through operations and logistics and relied on additional layers of the value chain to reach customers. Software has made it easier to deliver services and goods more efficiently all the way to end consumers. As a result, today’s companies crave APIs that are extensible and responsive to their modular infrastructure and enable them to differentiate on customer experience for the first time. 

In this post, we’re excited to share our motivations for becoming an API-first company and the historical context for how to think about why APIs are eating software that is eating the world.  

Identifying where companies create value

So why go API-first? Because in every industry, the value chain is transforming, and APIs are the only way to keep up.

The idea of a value chain isn’t new. Businesses have been using this tool, first coined by Michael Porter of HBS, since 1985. He decomposed businesses into their various functions and arranged those functions as a pipeline, separating the “primary” activities of a firm from “supporting” ones. Primary activities are how you create and deliver value to the market, and supporting activities are those that, well, support these endeavors. 

The value chain of businesses in 1985

Thinking of a firm or business unit as a value chain is helpful for understanding where a firm has or can create a meaningful competitive advantage. In other words, it’s a system for determining where to double down on building unique, differentiated value and where to outsource to create cost advantages. 

Businesses themselves are only one link in a broader market or industry value system: the outputs of a firm’s value pipeline will subsequently pass through additional links in that chain, such as distributors and/or retailers, before they’re purchased by “end user” customers. 

The value system from supplier to end user

Before the internet — and still today in heavily industrialized or regulated industries — vertically integrating your business to own the end customer experience incurred high marginal costs and was prohibitively expensive. For consumer goods or healthcare, conventional wisdom holds that this is largely still true, though companies like Dollar Shave Club or Spruce Health might beg to differ!

The skills and competencies to differentiate in retail are different from distribution, which are different from manufacturing, etc. In focusing along these logistical steps, companies become horizontally focused, and become distant and removed from their true end customers. All too often, our everyday customer experiences still reflect this!

The critical path in a pipeline business

Such businesses, links in a linear value system from raw material to a real-world product in the hands of customers, might best be described as “pipeline businesses.” For these pipeline businesses, the links in their chain where they could best differentiate — where their moats were widest — were inbound logistics (sourcing inputs), operations, and outbound logistics (delivering outputs). Together these comprised the “critical path,” or chain of primary activities, that created value for a pipeline business.

Porter was careful to put customer-facing functions, including sales, marketing, and customer support, inside what he called primary activities. However, only very few large companies, and generally only the luxury brands affordable to the few — think Nordstrom, Mercedes, Four Seasons, or American Express — actually differentiated on these dimensions. For most large companies, these customer-facing activities were better described as secondary activities, and they expanded their profit pools by viewing them as cost centers and outsourcing or deferring to further specialized firms. (Hello, Dunder Mifflin). 

But when the internet happened, the critical path was reshaped forever.

Digital reformation: software enters the value chain

When software first emerged as a viable business tool, most enterprises considered the technology an opportunity to do what they already did more efficiently. Hence the inclusion of “technology development” as a supporting activity in the original value chain composition. 

As vendors popped up to offer software products to help support these value chain reformations, pipeline companies were most open to buying applications that could streamline their secondary activities like sales, marketing, and support. These were less risky, and most of the direct investment in building technology was thought to be better allocated in further differentiating the existing primary activities in the critical path. Because the software buyers were less invested in results — these were secondary activities, after all — they had low expectations for app usability.

The B2B vendors got away with long, onerous implementations and forced their customers to adapt the way they work to the vendor’s way of doing things. They charged extra for services that were needed to extract any value from their software. Because APIs made it easier to work with and integrate their software, these vendors saw APIs as a “nice-to-have.” Or, they charged extra for the use of these APIs to capture more from the IT budget.

Platforms over pipelines: software eats the value chain

But today, software is no longer viewed only as a tool to optimize existing things; it’s combinatorially interconnected, and it permeates everything. In this networked world, customer experience is the only true competitive advantage.

As the marginal cost of customer interactions trends to zero, companies can now afford to reach large audiences at scale and integrate their value proposition around customer experience. And in order to provide excellent customer experiences, what we used to think of as secondary activities are better framed as belonging right in the critical path through integration.

The predominant model of how businesses are organized shifts from “Pipeline” to “Platform,” and the mental model of a request/response lifecycle becomes more useful than that of a value chain.

In consumer-facing businesses, the embodiment of the request/response model is an omnipresent “mobile, on demand” company like Uber or Instacart.

In B2B, it’s an API-first one like AWS, Stripe, Plaid, or Twilio.  

These companies have digitized and vertically integrated every link of their value chain. They have slick websites and apps — on every platform — on the inbound side, and free, two-day shipping with no-worries returns on the outbound side.  

"Apps are increasingly becoming thin wrappers around use cases, not weighty shells around brands." — Chris Maddern, Co-Founder of Button

Because inbound and outbound logistics are ever “thinning” experiences, increasingly mediated via HTTP requests from mobile phones, tablets, laptops or servers, operations become everything behind those applications, and APIs make those experiences effective, relevant, worthwhile, and endearing. Request/response becomes the new pipeline.

The new critical path: Customer experience is the new logistics, and rapid learning, iteration, and integration are the new operations.

Regression models in excel supporting inventory planning? Support activity.

Data Science permeating every facet of the customer experience? Primary activity.

Traditional, reactive “Business Intelligence”? Support activity.

ML-powered supply and demand forecasting to drive real-time marketplace optimization? Primary Activity.

For consumer companies to differentiate on customer experience, they have to integrate their sales, marketing, and customer support functions — links that were once thought of as secondary. These customer-facing departments and customer-facing digital experiences should converge on a shared, ever-updating understanding of who their customer is to tailor their experiences accordingly. Moreover, companies must operationalize the learnings and insights from these interactions to contextualize and tailor subsequent experiences. 

For firms that do this right, everything from content, to product recommendations, to promotions should be based on a real-time, integrated understanding of the factors that drive great customer experiences. This process of self-tuning requires indexing massive amounts of data and then the infrastructure to iterate, optimize, and personalize on the basis of it.

Our humble revision of the Value Chain model for 2018 — the company as a request/response lifecycle

While this model of a request/response firm may not look surprising to platforms, aggregators, digital native retailers, or API-driven middleware in B2B, the stalwart companies who drive the economy are catching on. And as the modern enterprise looks more like these request/response firms every day, the nature of enterprise software is changing with them to fit the model.

Streamlining the critical path: the emergence of API-first for a request/response world

As software became networked, and those networks hit a critical density in the 2000s, technology shifted the value chain composition again. After adopting new technology in secondary business units, consumer companies realized that software could improve processes and margins by outsourcing in their primary focus areas, as well. At this point, they started to introduce technology to their critical paths.

This is where the first B2B API-first companies emerged. They turned the “pipeline” model on its head by removing the heft, ceremony, and friction associated with their own critical path. They optimized this experience with software, then productized the software itself.  As a result, they helped B2C companies outsource micro-components of their value chain and enabled these companies to enter into new primary focus areas. 

The API-first companies’ entire end-to-end value proposition is integrated between the lifecycle of an HTTP request and response. Need to process a payment? Just make a request to Stripe, and by the time they respond—a few hundred milliseconds later—they’ve handled a ton of complexity under the hood to issue the charge. Send a text to your customer? 

Companies like Stripe and Twilio set themselves apart not only by the sheer amount of operative complexity they’re able to put behind an API, but because of how elegant, simple, and downright pleasant their APIs are to use for developers. In doing so, they give developers literal superpowers. 

As these companies became the de facto mechanism for accomplishing these operative tasks, they’ve aggregated happy customers along the way. What started as humble request/response companies has morphed into juggernaut platforms, expanding the scope of their missions and offerings. Before we knew it, “payment processing” became “empowering global commerce,” and “send an SMS” became “infrastructure for better communication.” 

Reducing the cost of integrating these functions via APIs propelled the creation of countless startups with lower barriers to entry.

Building B2B software in a request/response world

For B2B companies selling into enterprises that are increasingly embodying the request/response model, modularity and recognizing that you’re only a part of a much greater whole is key. 

IT is an increasingly embedded function driving interconnection and integration. Companies and their partners — be they base infrastructure providers like AWS and GCP, advertising platforms like Facebook and Google Ads, or the smartest players in the SaaS space — are embracing interoperability through common infrastructure, APIs, and technical co-investment.

Rather than view the software they buy as end-to-end solutions that they’re going to train their teams up on, these enterprises are shifting to a “build and buy” model of private and public networked applications, where security and privacy are necessarily viewed as a shared responsibility. 

The components of their infrastructure that they do choose to buy are part of a broader, sprawling network composed of on-prem deployments, as well as private, public, and third-party cloud services. As a result, they emphasize the need for data portability and the ability to bring a new tool “into the fold” of their existing governance and change control policies and procedures. In fact, it’s generally preferred that the tool acquiesce to those existing procedures rather than force the team to adapt its procedures to the tool.

The worst thing you can tell your customer is that they should conform to your opinions about how to do something.

Sure, you built a beautiful user experience atop the data in your SaaS tool. But there’s an edge case you didn’t think of. And without an API, your customers have no recourse. With one, they can channel their needs into an opportunity for them to further invest in your ecosystem. More importantly, they can take “enterprise readiness” into their own hands and enact it on their own terms. In fact, I’ve been personally involved in several of our enterprise-facing initiatives, such as SSO integration with SAML IDPs and fine-grained permissions. While developing requirements for these features, far and away the most common refrain I’ve heard is, “just give me an API.”

Why is that? Amongst software developers, operations practitioners, and IT administrators alike, the concept of Infrastructure as Code (IaC) has taken hold.  This means writing code to manage configurations and automate provisioning of the underlying infrastructure (servers, databases, etc.) in addition to application deployments. The reason we were so excited to adopt this practice ourselves at Segment is that IaC inserts proven software development practices, like version control, continuous testing, and small deployments, into the management lifecycle of the base infrastructure that applications run on.

In the past, “base infrastructure” had a relatively static and monolithic connotation. Today companies are deploying their application not just to “servers” or VMs in their VPCs, but to a dynamic network of cloud-agnostic container runtimes and managed “serverless” deployment targets. At the same time, they rely on a growing network of third-party, API-driven services that provide key functions such as payments, communications, shipping, identity verification, background checking, monitoring, alerting, and analytics. 

At Segment, our own engineers refuse to waste time and increase our risk profile by clicking around in the AWS console, instead opting to use terraform for provisioning. They go so far as to home-roll applications, like specs for “peering into” our ECS clusters and station agent for querying them. None of these workflows or custom applications would be possible without the ECS control plane APIs. 

And it goes beyond AWS. We want to make it functionally impossible to deploy a service that doesn’t have metrics and monitoring. To do this, we threw together a terraform provider against the Datadog API and codified our baseline alerting thresholds right into our declarative service definitions. 

Now, we’re offering that same proposition to our customers through our Config API for provisioning integrations, workspaces, and configuring tracking plans. We’re excited to see a terraform provider pop up. (And, we have it on good authority the community is already working on it.) Using the Config API and terraform, customers can codify and automate their pre-configured integration settings and credentials when provisioning new domains or updating tracking plans. 
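As a rough sketch of the kind of automation this enables (the endpoint and field names below are illustrative placeholders, not the exact Config API contract):

// Hypothetical sketch: provision a source programmatically as part of a
// provisioning pipeline. The URL and payload are placeholders for illustration.
async function createSource(workspace, slug, catalogName) {
  const res = await fetch(
    `https://config.segment.example/workspaces/${workspace}/sources`,
    {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${process.env.SEGMENT_CONFIG_TOKEN}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ slug, catalog_name: catalogName })
    }
  )
  return res.json()
}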

…and that’s where we get back to Segment

Because I know what you’re thinking. Wasn’t Segment already API-first?

Well, partially. Segment, historically, has been API-driven. Which is to say that we’ve been API-first, but only in a few key areas, and hopefully the models and context we explored above can help to explain why!

When we first launched analytics.js, we introduced an elegant and focused API for recording events about how your customers interact with your business. So you made requests to Segment — but did you wait on a response? No! You just let us handle sending the events to your chosen integrations.

That’s because, then, it was a better inbound link to a secondary value chain activity — “analytics.” Companies didn’t want to wait any milliseconds to hear back from Segment because we weren’t in the critical path of their value delivery. (Side note, we went to great lengths to avoid any waiting at all — all our collection libraries are entirely asynchronous and non-blocking.)

And while engineers loved the simplicity of our Data Collection API, the real reason they love Segment is that integrating with that API is the last analytics, marketing, sales, or support integration they ever have to do. That value proposition is what lies between our “API-driven” inbound and outbound value chain links.  The operative link in Segment’s Connections Product is the act of multiplexing, translating, and routing the data customers send us to wherever those customers want.

What exploded underneath our feet when we released analytics.js was the realization that the larger the organization, the more likely it is that the person who needs to access and analyze data is different from the person who can instrument their applications to collect that data. By adopting Segment, companies decoupled customer data instrumentation from analysis and automation, disentangling “what data do we need?” from “how are we going to use it?”

In effect, Segment became the “backbone network router” in charge of packet-switching customer data inside a company’s data network.

Becoming Customer Data Infrastructure

We got this far without thinking API-first when it came to our control plane. Even with all our high-minded prognostications about the end of traditional value chains! So why make the shift now?  

The reason to make such a change, as ever, is strong customer pull.

Since introducing our data router, Segment has evolved substantially. Today, the original Segment Data Collection API you know and love is the inbound link in the customer data infrastructure request/response lifecycle. 

With each big new product release this year, be it our GDPR functionality, Protocols, or Personas, we’ve heard emphatically from Customers that they want to “drive” these features programmatically, and we’ve shipped key APIs with each to deliver on those needs.

All the while, we’ve also noticed more than a few customers — and even partners looking to develop deeper, workflow-based integrations with Segment — poking around under the hood of the private control plane APIs that drive these products.

What’s clear is that our original, “entry-level” job to be done — analytics instrumentation — may have been a “send-it-and-forget-it” API interaction. Companies, however, have come to rely on their customer data in the critical path of delivering value through their applications, products, and experiences. Data collection has moved from fueling “secondary” links to a first-order priority. 

In fact, this thesis (and the accompanying customer pull) has driven Segment’s product portfolio expansion to help companies put clean, consented, synthesized customer data in the critical path of their customer experiences.

And this is where we bring it all together. Because it’s not just consuming the data that fits the mold for an API-first model. As our customers build and adopt applications that fit into a broader network, and they bring once-“supporting” value chain links into their critical path, they want to program the infrastructure that enables that as well.  

With these APIs, our customers have built Segment change management into their SDLC workflows. They run GDPR audits of data flow through their workspace with a button click. They’re keeping their privacy policies and consent management tools up to date in real time with the latest tools they are using.

It’s incredibly humbling to have customers who push the boundaries of your product and are sufficiently invested to want to integrate it more deeply and more safely into their workflows. We’re proud to be enabling that by opening up our Config API, which we welcome you to explore here.

David Scrobonia on December 4th 2018

Security tools are not user friendly. This is a problem in a world where the security community is trying to “push security left” and integrate into development culture. Developers don’t want to use tools that have a poor user experience and add friction to their daily workflows. By ignoring UX, security tools are preventing teams from making their organizations more secure.

The team behind the Zed Attack Proxy (ZAP), a popular OSS attack proxy for testing web apps, worked on addressing this problem in our application. As a result, we came up with three takeaways for improving the UX of security tools.

So What’s the Problem? 

Let’s walk through using ZAP to scan a web app for vulnerabilities. Our goal is to add the target application to our scope of attack, spider the site to discover all of the pages, and run the active scanner. So we would:

  1. Look in the Site Tree pane of ZAP to find our target (http://localhost:8000 in this case)

  2. Right click on the target and select “Include in Context” 

  3. This opens a new menu asking you to select an existing context to add our app to scope (or create a new one)

  4. This opens another configuration menu that prompts you to define any regex rules before adding the application to scope 

  5. Now that our app is in scope, we click the target icon to hide other applications from the Site Tree

  6. To start the spider right click on our application and hover over the “Attack” option in this list to expose a sub-context menu

  7. Click the newly exposed “Spider” option to open a configuration menu and press “Start Scan”.

  8. To start an active scan, we again right click on our app and hover over “Attack” to expose a menu

  9. Click “Active Scan” to open a configuration menu and finally press “Start Scan” to begin

Scanning an app with ZAP

Whew! This is not a simple workflow. It requires us to hunt through hidden context-menus, click through several different menus, and assumes that the user understands all of the configuration options. And this is for the most commonly used feature!

While this is not a great experience, ZAP is far from the least usable security tool available. Ask anybody on your security team about their experience with enterprise static analysis tools and they’ll be sure to give you an earful about a product still stuck in a pre-Y2K user interface.  In contrast to these commercial tools, ZAP is free, open source, and maintained by a handful of people working part time. This presents a huge gap in time and money that left us wondering where we should start in order to improve our user experience. To keep our efforts efficient we focused on three things.

3 Ways to Improve the UX of Security Tools

1. Make it Native

When assessing the security of a web app, you're frequently switching between ZAP and your browser. This quickly becomes distracting. Let's look at how we intercept requests with ZAP, another common feature, to see why.

  1. In the browser, navigate to and use the feature you want to test

  2. Go back to ZAP to observe the requests that were sent and turn on the “Break” feature to start intercepting messages

  3. Go back to the browser and reuse the feature you wanted to test

  4. Go back to ZAP to see the intercepted message, modify the request, and then press “Continue”

  5. Go back to the app to see if the modified request changed the app’s behaviour

  6. Rinse and repeat for as many requests as needed to test the feature.

Intercepting HTTP messages with ZAP

Every time we just want to intercept a request, we have to go back and forth between the browser and ZAP five times! This may not seem like a lot at first, but considering that you may intercept hundreds of messages while testing for vulnerabilities, this becomes a headache.

The problem is that we are constantly changing contexts between ZAP and the native context for testing, the browser. I like to compare this to how a fighter pilot operates a jet. Their native context for flying the plane is looking through the windshield, so all of the important information they need to make decisions is presented on a heads up display (HUD). Imagine trying to survive a dog fight if all of the feedback about altitude, acceleration, and weapons status was only available in a dashboard of knobs and gauges. It would be impossible if you were continually having to monitor a separate context. 

A fighter jet’s HUD efficiently displays information

To provide a natively integrated experience in ZAP, we took inspiration from a fighter jet’s heads up display.

The ZAP Heads Up Display is a new UI overlaid on the target page you are testing, providing the functionality of ZAP in the browser.

Intercepting HTTP messages with the HUD

Making this change wasn’t a simple “lift and shift”, however. It required us to get creative with how we approached our design. We knew we didn’t want to implement the HUD as a browser plugin, which would require us to support multiple code bases that were bound to the restrictions of their plugin APIs. Even then we wouldn’t be able to support all browsers. This was a non-starter for a small development team with a global user base that uses a variety of browsers.

Instead of a browser plugin, we leveraged ZAP’s all powerful position as a proxy to inject the HUD into the target application. When ZAP intercepts an HTTP response from a server, it modifies the HTML to include an extra script tag. The source of this script is a javascript file that is served from ZAP. When this script is executed in the web app it adds several iframes to the DOM which make up the components of the ZAP HUD.
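Conceptually, the injection looks something like the sketch below (a simplified illustration with made-up URLs, not the actual HUD source):

// Served by ZAP and injected into every proxied page via an extra <script> tag.
// Each iframe loads one component of the HUD from ZAP itself.
function injectHudFrames() {
  ['left-panel', 'right-panel', 'bottom-drawer'].forEach(function (name) {
    var frame = document.createElement('iframe')
    frame.src = 'https://zap-hud.example/' + name + '.html' // placeholder URL
    frame.className = 'zap-hud-' + name
    document.body.appendChild(frame)
  })
}
injectHudFrames()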

ZAP modifies the HTTP responses of the target application

This approach allows us to add the HUD to any target application running in any modern browser. By making the HUD native we’ve created a frictionless experience, which is essential if you want users to adopt your work. 

2. Sacrifice Power for Accessibility

When looking at ZAP there are a lot of powerful features, but where to find and how to access them isn’t immediately intuitive. To access a common feature like “Active Scan” we had to tediously crawl into a context menu, parse a large list, and click through a sub-menu. This inconvenience is multiplied when you consider there are multiple places in the UI you can find new features. To start a scan you can also navigate to the bottom pane, open a new tab, choose the “Active Scan” feature, and proceed through the configuration menus that way. 

Accessing features via context menus

Accessing features via the bottom drawer

Presenting features in multiple places in the UI is disorienting for a new user trying to figure out how to navigate the application. Will the next feature be found by opening a new tab, or will it be found in a sublayer of a context menu?

To address our complex UI we made a trade-off: limit features by simplifying their interfaces. We tossed away the ability to configure features upfront, eliminated multiple entry points into features, and forced scattered ZAP features into consistent UX elements. The HUD now presents features as tools, the discrete buttons on either side of the Heads Up Display.

This consistency creates the same experience when using different tools. Now, when users want to find more features, they know exactly where to look – in another tool! 

Remember how complicated the “Include in Context” flow was? Now with HUD tools, all of these features have been built into the “Scope” tool. Simply click the tool, select “Add to Scope”, and we’re done! You’re now ready to attack the application. And how would we start spidering the site? You guessed it – click the Spider tool, select “Start Spider”, and it will start to run!

Scanning an app with the HUD

The simplified tools interface does not provide the same level of feature depth that ZAP traditionally provides. We can’t define a scope regex or set the maximum depth of a spider crawl right out of the box. This is the trade-off we chose, though: to forfeit feature power for accessibility.

There is a lot happening behind the scenes to enable all of the functionality of ZAP within our simpler “tool” interface. To keep the HUD responsive we aren’t loading all of the functionality into the iframes. Instead, the HUD leverages a service worker, a background javascript process similar to a web worker, to handle most of the heavy lifting so the iframes can be lightweight and responsive. Service workers are more privileged than web workers and can persist across multiple page loads, meaning we only have to load our javascript once and keep it tucked away in the background.* The service worker hosts all of tool logic and exposes it to the different UI iframes via the postMessage API, a browser API for inter-window (and cross origin) communication. 

Not all of the functionality of the tools is stored in the service worker, though. Because we’re already running ZAP we make heavy use of its existing features via the ZAP API. ZAP’s API is very thorough and enables developers to run almost the entire application via a simple REST API. The service worker communicates with this API along with a websocket API that streams events captured in ZAP so that the HUD can have live, up to date notifications. 

The Heads Up Display uses several different technologies

A great way to see all of these pieces in motion is with the new “Break” tool. When the Break tool is clicked in the HUD, a postMessage is sent from the UI frame to the service worker where the tool logic is running, which sends an API request to ZAP to start intercepting messages. When the user tries to open a new page ZAP will intercept the message, use the websocket API to notify the service worker that it just intercepted a new message, and then the service worker will notify an iframe to display the intercepted message. 
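In code, that round trip looks roughly like the sketch below (the message fields and the ZAP API call are illustrative, not the HUD’s actual implementation):

// In a UI iframe: tell the service worker that the Break tool was clicked.
navigator.serviceWorker.controller.postMessage({ tool: 'break', action: 'start' })

// In the service worker: run the tool logic and call the ZAP API.
self.addEventListener('message', function (event) {
  if (event.data.tool === 'break' && event.data.action === 'start') {
    // Placeholder ZAP API call to begin intercepting HTTP messages.
    fetch('https://zap.example/JSON/break/action/break/?type=http-all&state=true')
  }
})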

All of these technologies help to keep the HUD accessible. If a user can’t figure out how to use your software, it doesn’t matter how good it is at solving a problem – it won’t be used. 

3. Keep it Flexible

Even with an improved interface for accessing ZAP, we didn’t want to assume how users would interact with the HUD. To prevent locking our users into rigid workflows we made the UI as configurable as possible and made it easy for users to add more functionality to the HUD.

Applications come in all shapes and sizes and we don’t want the HUD’s display to get in the way. To prevent usability issues, users can arrange the tools however they like, or remove tools that get in the way. The entire HUD can even be temporarily hidden from view. The ultimate goal is to have a fully customizable drag and drop interface where users can manage the HUD like it’s the home screen of their smartphone: changing fonts, adding widgets, and changing the layout.

Users can also quickly add features to the HUD, so they aren’t stuck using only the default tools. Developers often have custom testing or build scripts and connecting them to the HUD would make testing that much easier. That's something we can do in just a few minutes using only ZAP.

ZAP has a “Scripts” plugin that allows users to hook custom scripts into various points of the application. Scripts can be used to modify requests or responses, add active scanning rules, or change any other ZAP behaviour. The plugin also provides access to the HUD code, allowing users to quickly copy, paste, and modify the code for an existing tool. By tweaking just a few lines we can create a tool that uses the ZAP API to start running any user defined script: a developer’s custom testing scripts, a QA’s web automation script, or a hacker’s favorite tool.

Users can add custom functionality to the HUD in a few minutes

In the example above we have a script called “Hack it!” that replaces the text “Juice Shop” with “HACKED”. After quickly modifying an existing tool, we change a ZAP API call to enable our defined script, and when we restart the HUD you can see that the new “Hack it!” tool is now available to be added to our display.

Although this is a simple example, the scripting feature can be used to add any functionality to the HUD and supports several different scripting languages.

Conclusion

By focusing on three things we were able to make a powerful security tool much more accessible to a wider audience. By making it native we empower users in the environment they’re most comfortable in. By sacrificing power for accessibility we enable users of all levels to quickly start security testing. By keeping it flexible our tool adapts to a user’s specific needs.

While these design concepts aren’t revolutionary, or even that original, they highlight a fundamental gap between how the security community talks about security and how we practice security. If we honestly want to “push security left” we must leverage these principles to provide frictionless security for our users. 

Epilogue

The HUD is now in Alpha release! If you would like to test fly the HUD, visit https://github.com/zaproxy/zap-hud to get up and running in a few minutes and see how it works. This is still a very early release, so it may be buggy, but please share your feedback on usability, reliability, and feature requests. If you’re interested in helping out with the project, please reach out to us via the Github project or on Twitter at @david_scrobonia or @zaproxy.

* Service worker savvy readers will know that service workers are event driven, and that the lifecycle of a service worker expects them to be constantly terminated and activated, requiring all dependencies to be imported each time this happens. To prevent this we have hacked around this spec by sending the service worker a heartbeat to keep it alive while the HUD is active.

Gurdas Nijor on November 16th 2018

At this point it's well-accepted that analytics data is the beating heart of a great customer experience. It's what allows us to understand our customer's journey through the product and pinpoint opportunities for improvement.

This all sounds great, but reality is a messy place. As a team becomes an organization, the structure used for recording this data can easily diverge. One part of the product might use userId and another user_id. There might be both a CartCheckout and a CheckoutCart event that mean the same thing. Given thousands of call sites and hundreds of kinds of records, this sort of thing is an inevitability.

Without an enforceable shared vocabulary for describing this data, an organization's ability to use this data in any meaningful way becomes crippled.

Downstream tools for analyzing data begin to lose value as redundant or unexpected data trickles in as a result of implementation errors at the source. Fixing these issues after they’ve made it downstream turns out to be a very expensive proposition, with estimates as high as 60% of a data scientist’s time being spent cleaning and organizing data.

At Segment, we’ve put a considerable amount of engineering effort into scaling our data pipeline to ensure low-latency, high-throughput, and reliable delivery of event data to power the customer data infrastructure for over 15,000 companies.

We also recently launched Protocols to help our customers ensure high quality data at scale.

In this post, I want to explore some approaches we’re taking to tackle that dimension of scalability from a developer perspective, allowing organizations to scale the domain of their customer data with a shared, consistent representation of it.

Tracking Plans

To ensure a successful implementation of Segment, we’ll typically recommend that customers maintain something known as a “Tracking Plan.”

An example of a Tracking Plan that we would use internally for our own product

This spreadsheet gives structure and meaning to the events and fields that are present in every customer data payload.

A tracking plan (also known as an implementation spec) helps clarify what events to track, where those events need to go in the code base, and why those events are necessary from a business perspective.

https://segment.com/academy/intro/how-to-create-a-tracking-plan/

An example of where a Tracking Plan becomes a critical tool would be in any scenario involving multiple engineers working across products and platforms. If there are no standards around how one should represent a “Signed Up” event or what metadata would be important to capture, you’d eventually find every permutation of it when it comes time to make use of that business critical data, rendering it “worse than useless.”

This Tracking Plan serves as a living document for Central Analytics, Product, Engineering and other teams to agree on what’s important to measure, how those measures are represented and how to name them (the three hard problems in analytics).

Where it breaks down, and how to fix it

As a Tracking Plan evolves, the code that implements it often does not change accordingly. Tickets may get assigned, but oftentimes feature work will get prioritized over maintaining the tracking code, leading to a divergence between the Tracking Plan and the implementation. In addition to this, natural human error is still a very real factor that can lead to an incorrect implementation.

This error is a pretty natural result of not having a system to provide validation to an implementor (both at implementation-time, and on an ongoing basis).

Validation that an implementation is correct relative to some idealized target sounds exactly like something that machines can help us with, and indeed they have — from compilers that enforce certain invariants of programs, to test frameworks that allow us to author scenarios in which to run our code and assert expected behaviors.

As alluded to above, an ideal system will provide feedback and validation at three critical places in the product development lifecycle:

  • At development time – “What should I implement?”

  • At build-time – “Is it right?”

  • At CI time – “Has it stayed right?”

As a developer-focused company, we saw aligning a great developer experience with the process improvements of a centralized tracking plan as a compelling problem to solve.

That’s why we built, and are now open sourcing, Typewriter – a tool that lets developers “bring a tracking plan into their editor” by generating a strongly typed client library in a variety of languages from a centrally defined spec.

The developer experience of using a Typewriter generated library in Typescript

Typewriter delivers a higher degree of developer ergonomics over our more general purpose analytics libraries by providing a strongly typed API that speaks to a customer’s data domain. The events, their properties, types, and all associated documentation are present to inform product engineers that need to implement them perfectly to spec, all without leaving the comfort of their development environment.

Compile time (and runtime) validation is performed to ensure that tracking events are dispatched with the correct fields and types to give realtime validation that an implementation is correct.
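For example, a call against a generated client might look like the sketch below; the event name, properties, and import path come from a hypothetical tracking plan rather than from Typewriter itself:

// './analytics' stands in for the client Typewriter generates from your plan.
import * as typewriter from './analytics'

typewriter.orderCompleted({
  orderId: 'ord_123',
  total: 39.99,
  currency: 'USD'
})

// A misspelled event name or a missing required property is caught at
// compile time (with the generated TypeScript typings) or flagged by
// runtime validation.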

This answers the questions of “What should I implement” and “Is it right” mentioned earlier. The remaining question of “Has it stayed right?” can be answered by integrating Typewriter as a task in your CI system.

How it works

Typewriter uses what amounts to a machine-readable Tracking Plan with a rich language built on JSON Schema for defining and validating events, their properties and associated types that can be compiled into a standalone library (making use of the excellent quicktype library to generate types for languages we target with static type systems).

This spec can be managed within your codebase, ensuring that any changes to it will result in a regenerated client library, and always up to date tracking code.
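To give a feel for the idea, an event definition in such a spec might look roughly like this (the shape below is a JSON Schema-flavored illustration, not Typewriter’s exact format):

// Illustrative only: one event described with JSON Schema-style properties.
const orderCompleted = {
  name: 'Order Completed',
  description: 'Fired when a customer completes checkout.',
  properties: {
    orderId: { type: 'string' },
    total: { type: 'number' },
    currency: { type: 'string', enum: ['USD', 'EUR'] }
  },
  required: ['orderId', 'total']
}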

What comes next

Being avid Segment users ourselves, we’ve been migrating our mountains of hand written tracking code to Typewriter generated libraries and have been excited to realize the productivity gains of offloading that work to the tooling.

Typewriter will continue to evolve to support the needs of all Segment customers too — we’re continuing to expand it and are open to community PRs!

We’d love for you to give Typewriter a shot in your own projects – feel free to open issues, submit PRs, or reach us on Twitter @segment.


A special thanks to Colin King for all of his work in making Typewriter a reality, and the team at quicktype for producing an amazing library for us to build on top of.

Michael Fischer on October 30th 2018

“How should I secure my production environment, but still give developers access to my hosts?” It’s a simple question, but one which requires a fairly nuanced answer. You need a system which is secure, resilient to failure, and scalable to thousands of engineers–as well as being fully auditable.

There are a number of solutions out there: from simply managing the set of authorized_keys to running your own LDAP or Kerberos setup. But none of them quite worked for us (for reasons I’ll share below). We wanted a system that required little additional infrastructure, but also gave us security and resilience. 

This is the story of how we got rid of our shared accounts and brought secure, highly-available, per-person logins to our fleet of Linux EC2 instances without adding a bunch of extra infrastructure to our stack.

Limitations of shared logins

By default, AWS EC2 instances only have one login user available.  When launching an instance, you must select an SSH key pair from any of the ones that have been previously uploaded.  The EC2 control plane, typically in conjunction with cloud-init, will install the public key onto the instance and associate it with that user (ec2-user, ubuntu, centos, etc.).  That user typically has  sudo access so that it can perform privileged operations.

This design works well for individuals and very small organizations that have just a few people who can log into a host.  But when your organization begins to grow, the limitations of the single-key, single-user approach quickly become apparent.  Consider the following questions:

Who should have a copy of the private key?  Usually the person or people directly responsible for managing the EC2 instances should have it.  But what if they need to go on vacation?  Who should get a copy of the key in their absence?

What should you do when you no longer want a user to be able to log into an instance?  Suppose someone in possession of a shared private key leaves the company.  That user can still log into your instances until the public key is removed from them.   Do you continue to trust the user with this key?  If not, how do you generate and distribute a new key pair?  This poses a technical and logistical challenge.  Automation can help resolve that, but it doesn’t solve other issues.

What will you do if the private key is compromised? This is a similar question as the one above, but requires more urgent attention.  It might be reasonable to trust a departing user for awhile — but if you know your key is compromised, there’s little doubt you’ll want to replace it immediately.  If the automation to manage it doesn’t yet exist, you may find yourself in a very stressful situation; and stress and urgency often lead to automation errors that can make bad problems worse.

One solution that’s become increasingly popular in response to these issues has been to set up a Certificate Authority that signs temporary SSH credentials. Instead of trusting a private key, the server trusts the Certificate Authority.  Netflix’s BLESS is an open-source implementation of such a system.  

The short validity lifetime of the issued certificates does mitigate the above risks.  But it still doesn’t quite solve the following problems:

How do you provide for different privilege levels?  Suppose you want to allow some users to perform privileged operations, but want others to be able to log in in “read-only” mode.  With a shared login, that’s simply impossible: everyone who holds the key gets privileged access.

How do you audit activity on systems that have a shared login?  At Segment, we believe that in order to have best-in-class security and reliability, we must know the “Five Ws” of every material operation that is performed on our instances:

  • What happened?

  • Where did it take place?

  • When did it occur?

  • Why did it happen?

  • Who was involved?

Only with proper auditing can we know with certainty the answers to these questions.  As we’ve grown, our customer base has increasingly demanded that we have the ability to know, too.  And if you ever find yourself coveting approval from compliance frameworks such as ISO 27001, PCI-DSS, or SOC 2, you will be required to show you have an audit trail at hand.

We needed better information than this:

Goals

Our goals were the following:

  1. Be able to thoroughly audit activity on our servers;

  2. Have a single source of truth for user information;

  3. Work harmoniously with our single sign-on (SSO) providers; and

  4. Use two-factor authentication (2FA) to have top-notch security.

Here’s how we accomplished them.

Segment’s solution: LDAP with a twist

LDAP is an acronym for “Lightweight Directory Access Protocol.”  Put simply, it’s a service that provides information about users (people) and things (objects).  It has a rich query language, configurable schemas, and replication support.  If you’ve ever logged into a Windows domain, you probably used it (it’s at the heart of Active Directory) and never even knew it.  Linux supports LDAP as well, and there’s an Open Source server implementation called OpenLDAP.

You may ask: Isn’t LDAP notoriously complicated?  Yes, yes it is.  Running your own LDAP server isn’t for the faint of heart, and making it highly available is extremely challenging.

You may ask: Aren’t all the management interfaces for LDAP pretty poor? We weren’t sold on any of them we’d seen yet.  Active Directory is arguably the gold standard here — but we’re not a Windows shop, and we have no foreseeable plans to be one.  Besides, we didn’t want to be in the business of managing Yet Another User Database in the first place.

You may ask: Isn’t depending on a directory server to gain access to production resources risky?  It certainly can be.  Being locked out of our servers because of an LDAP server failure is an unacceptable risk.  But this risk can be significantly mitigated by decoupling the information — the user attributes we want to propagate — from the service itself.  We’ll discuss how we do that shortly.

Choosing an LDAP service

As we entered the planning process, we made a few early decisions that helped guide our choices.

First, we’re laser-focused on making Segment better every day, and we didn’t want to be distracted by having to maintain a dial-tone service that’s orthogonal to our product. We wanted a solution that was as “maintenance free” as possible. This quickly ruled out OpenLDAP, which is challenging to operate, particularly in a distributed fashion.

We also knew that we didn’t want to spend time maintaining the directory.  We already have a source of truth about our employees: Our HR system, BambooHR, is populated immediately upon hiring.  We didn’t want to have to re-enter data into another directory if we could avoid it.  Was such a thing possible?

Yes, it was! 

We turned to Foxpass to help us solve the problem.  Foxpass is a SaaS infrastructure provider that offers LDAP and RADIUS service, using an existing authentication provider as a source of truth for user information.  They support several authentication providers, including Okta, OneLogin, G Suite, and Office 365.

We use Okta to provide Single Sign-On for all our users, so this seemed perfect for us.  (Check out aws-okta if you haven’t already.) And better still, our Okta account is synchronized from BambooHR — so all we had to do was synchronize Foxpass with Okta.

The last bit of data Foxpass needs is our users’ SSH keys.  Fortunately, it’s simple for a new hire to upload their key on their first day: just log into the web interface — which, of course, is protected by Okta SSO via G Suite — and add it.  SSH keys can also be rotated via the same interface.

Service Architecture

In addition to their SaaS offering, Foxpass also offers an on-premise solution in the form of a Docker image.  This appealed to us because we wanted to reduce our exposure to network-related issues, and we are already comfortable running containers using AWS ECS (Elastic Container Service).  So we decided to host it ourselves.  To do this, we:

  • Created a dedicated VPC for the cluster, with all the necessary subnets, security groups, and Internet Gateway

  • Created an RDS (Aurora MySQL) cluster used for data storage

  • Created a three-instance EC2 Auto Scaling Group with the ECS Agent and Docker Engine installed – if an instance goes down, it’ll be automatically replaced

  • Created an ECS cluster for Foxpass

  • Created an ECS service pair for Foxpass to manage its containers on our EC2 instances (one service for its HTTP/LDAP services; one service for its maintenance worker)

  • Stored database passwords and TLS certificates in EC2 Parameter Store

We also modified the Foxpass Docker image with a custom ENTRYPOINT script that fetches the sensitive data from Parameter Store (via Chamber) before launching the Foxpass service:

Client instance configuration

On Linux, you need to configure two separate subsystems when you adopt LDAP authentication:

Authentication:  This is the responsibility of PAM (Pluggable Authentication Modules) and sshd (the ssh service).  These subsystems check the credentials of anyone who either logs in, or wants to switch user contexts (e.g. sudo).

User ID mappings: This is the realm of NSS, the Name Service Switch.  Even if you have authentication properly configured, Linux won’t be able to map user and group names to UIDs and GIDs (e.g. mifi is UID 1234) without it.

There are many options for setting these subsystems up.  Red Hat, for example, recommends using SSSD on RHEL.   You can also use pam_ldap and nss_ldap to configure Linux to authenticate directly against your LDAP servers.  But we chose neither of those options:  We didn’t want to leave ourselves unable to log in if the LDAP server was unavailable, and both of those solutions have cases where a denial of service is possible.  (SSSD does provide a read-through cache, but it’s only populated when a user logs in.  SSSD is also somewhat complex to set up and debug.)

nsscache

Ultimately we settled on nsscache.  nsscache (along with its companion NSS library, libnss-cache) is a tool that queries an entire LDAP directory and saves the matching results to a local cache file.    nsscache is run when an instance is first started, and about every 15 minutes thereafter via a systemd timer.   

This gives us a very strong guarantee: if nsscache runs successfully once at startup, every user who was in the directory at instance startup will be able to log in.  If some catastrophe occurs later, only new EC2 instances will be affected; and for existing instances, only modifications made after the failure will be deferred.  

To make it work, we changed the following lines in /etc/nsswitch.conf.  Note the cache keyword before compat:
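
A sketch of the relevant entries (the exact set of databases you route through the cache may differ from ours):

    # /etc/nsswitch.conf (excerpt)
    passwd:   cache compat
    group:    cache compat
    shadow:   cache compat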

So that users’ home directories are automatically created at login, we added the following to /etc/pam.d/common-session:
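
The standard way to do that is pam_mkhomedir; the line looks roughly like this (the skeleton directory and umask shown are the common defaults, not necessarily our exact values):

    # /etc/pam.d/common-session (excerpt)
    session required pam_mkhomedir.so skel=/etc/skel umask=0022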

nsscache also ships with a program called nsscache-ssh-authorized-keys which takes a single username argument and returns the ssh key associated with the user.  The sshd configuration (/etc/ssh/sshd_config) is straightforward:
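
A sketch of the relevant directives, assuming the helper is installed at /usr/bin/nsscache-ssh-authorized-keys and run as an unprivileged user (both are illustrative choices):

    # /etc/ssh/sshd_config (excerpt)
    AuthorizedKeysCommand /usr/bin/nsscache-ssh-authorized-keys %u
    AuthorizedKeysCommandUser nobody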

Emergency login

We haven’t had any reliability issues with nsscache or Foxpass since we rolled it out in late 2017.  But that doesn’t mean we’re not paranoid about losing access, especially during an incident! So just in case, we have a group of emergency users whose SSH keys live in a secret S3 bucket.   At instance startup, and regularly thereafter, a systemd unit reads the keys from the S3 bucket and appends them to the /etc/ssh/emergency_authorized_keys file.  

As with ordinary users, the emergency user requires two-factor authentication to log in.  For extra security, we’re also alerted via S3 Event Notifications whenever a new key is added to the bucket.

We also had to modify /etc/ssh/sshd_config to make it work:
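
One way to wire this up is to have sshd consult the emergency key file in addition to each user’s own keys; a sketch, not necessarily our exact configuration:

    # /etc/ssh/sshd_config (excerpt)
    AuthorizedKeysFile .ssh/authorized_keys /etc/ssh/emergency_authorized_keys

In practice you would likely scope this to a dedicated emergency account with a Match block rather than apply it globally.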

Security

Bastion servers

Security is of utmost importance at Segment. Consistent with best practices, we protect our EC2 instances by forcing all access to them through a set of dedicated bastion servers.  

Our bastion servers are a bit different than some in that we don’t actually permit shell access to them: their sole purpose is to perform Two-Factor Authentication (2FA) and forward inbound connections via secure tunneling to our EC2 instances.

To enforce these restrictions, we made a few modifications to our configuration files.  

First, we published a patch to nsscache that optionally overrides the users’ shell when creating the local login database from LDAP.   On the bastion servers, the shell for each user is a program that prints a message to stdout explaining why the bastion cannot be logged into, and exits nonzero.

Second, we enabled 2FA via Duo Security. Duo is an authentication provider that sends push notifications to our team’s phones and requires confirmation before a login completes.  Setting it up involved installing their client package and making a few configuration file changes.

First, we had to update PAM to use their authentication module:
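
The change amounts to adding Duo’s PAM module to the SSH auth stack; a sketch (the file layout and control flags vary by distribution and by how you combine it with public key auth):

    # /etc/pam.d/sshd (excerpt)
    auth  required  pam_duo.so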

Then, we updated our /etc/ssh/sshd_config file to allow keyboard-interactive authentication (so that users could respond to the 2FA prompts):
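
A sketch of the relevant sshd settings, assuming public key authentication followed by a Duo prompt (exact directives vary by OpenSSH version):

    # /etc/ssh/sshd_config (excerpt)
    UsePAM yes
    ChallengeResponseAuthentication yes
    AuthenticationMethods publickey,keyboard-interactive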

On the client side, access to protected instances is managed through a custom SSH configuration file distributed through Git.  An example stanza that configures proxying through a bastion cluster looks like this:
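
The pattern looks roughly like this, with placeholder hostnames (only the username mifi comes from earlier in this post; everything else is illustrative):

    # ~/.ssh/config (illustrative hostnames)
    Host bastion
        HostName bastion.example.com
        User mifi

    Host *.segment.internal
        User mifi
        # Tunnel through the bastion; 2FA happens on the bastion hop.
        ProxyCommand ssh -W %h:%p bastion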

Mutual TLS

To avoid an impersonation attack, we needed to ensure our servers connected to and received information only from the LDAP servers we established ourselves.  Otherwise, an attacker impersonating our LDAP service could feed us their own credentials and use them to gain access to our systems.

Our LDAP servers are centrally located and trusted in all regions.  Since Segment is 100% cloud-based, and cloud IP addresses are subject to change, we didn’t feel comfortable solely using network ACLs to protect us.  This is also known as the zero-trust network problem: How do you ensure application security in a potentially hostile network environment?

The most popular answer is to set up mutual TLS authentication, or mTLS.  mTLS validates the client from the server’s point of view, and the server from the client’s point of view.  If either validity check fails, the connection fails to establish.

We created a root certificate, then used it to sign the client and server certificates.  Both kinds of certificates are securely stored in EC2 Parameter Store and encrypted both at rest and in transit, and are installed at instance-start time on our clients and servers. In the future, we may use AWS Certificate Manager Private Certificate Authority to generate certificates for newly-launched instances.

Conclusion

That's it! Let's recap:

  • By leaning on Foxpass, we were able to integrate with our existing SSO provider, Okta, and avoided adding complicated new infrastructure to our stack.

  • We leveraged nsscache to prevent any upstream issues from locking us out of our hosts.  We use mutual TLS between nsscache and the Foxpass LDAP service to communicate securely in a zero-trust network. 

  • Just in case, we sync a small number of emergency SSH keys from S3 to each host at boot.

  • We don't allow shell access to our bastion hosts. They exist purely for establishing secure tunnels into our AWS environments.

In sharing this, we’re hoping others find an easier path towards realizing secure, personalized logins for their compute instance fleets. If you have any thoughts or questions, let us know: tweet @Segment, or email me at michael@dynamine.net.

Jeroen Ransijn on October 16th 2018

Design systems are emerging as a vital tool for product design at scale. These systems are collections of components, styles, and processes to help teams design and build consistent user experiences. It seems like everyone is building one, but there is no playbook on how to take it from the first button to a production-ready system adopted across an organization. Much of the advice and examples out there are for teams that seem to have already figured it out.

Today I want to share my experience in bootstrapping a design system and driving adoption within our organization, Segment. I will share how we got started by creating something small and useful  first. Then I will share how I hijacked a project to build out that small thing into our full blown design system known as Evergreen. Finally, I will share how we continue to drive and track adoption of our design system.

What is a Design System?

A design system is a collection of components, styles and processes to help teams design and build consistent user experiences — faster and better. Design systems often contain components such as buttons, popovers and checkboxes, and foundational styles such as typography and colors. Teams that use the design system can focus on what’s unique to their product instead of reinventing common UI components.

What’s in our Design System

Before I share my experience bootstrapping our design system, Evergreen, I want to set some context and explain what’s in it.

  • Design Resources

    • Sketch UI Kit

    • Design Guidelines

  • Code Resources

    • React UI Framework

    • Developer Documentation

  • Operational Resources

    • Roadmap documents

    • On-boarding process

Our design system didn’t start out with all of those resources. In fact, I built something small and useful first. In the next sections I will share the lessons I learned in bootstrapping a design system and driving adoption within our organization.

How We Got Started

About 2 years ago, I joined Segment as a product designer. I worked as a front-end developer in the past and I wanted to use my skillset to create interactive prototypes. To give you a bit of context, the Segment application allows our customers to collect data from their websites or apps, synthesize that data, and send it to over 200 integrations for marketing and analytics.

The prototypes I wanted to develop would live outside of our Segment application and would have no access to the application codebase. This means that I didn’t have access to the components already in the application — I had to create everything from scratch.

Most advice online talks about starting with a UI audit or trying to get executive buy-in. Those are all part of the long journey of a design system, but there are many ways to get started. If you set out to solve all of the problems in your product, you might be taking on too much at once. Instead, build something small and useful, provide value quickly, and iterate on what works.

Build something small and useful

One of the first challenges you run into when creating a component library is how to deal with styling and CSS. There are a few different ways to deal with this:

  • Traditional CSS: Verbose to write, hard to maintain at scale. Often relies on conventions.

  • CSS Preprocessor such as Sass or Less: Easier way to write CSS, chance of naming collisions. Often relies on conventions.

  • CSS-in-JS solutions: Write CSS in JavaScript. Powerful ways to abstract into components.

I wanted a solution that didn’t require any extra build steps or extra imports when using the component library. CSS-in-JS made this very easy. You can import a component in your code and it works out of the box.

I wanted to avoid having to create a ton of utility class names to override simple CSS properties on components such as dimensions, position and spacing. It turns out there is an elegant way to achieve this: enter the React UI primitive.

Choosing React

There are many choices of frameworks for your component library. When I started building a component library, we were already using React, so it was the obvious choice.

React UI Primitive

After doing research, I found the concept of UI primitives. Instead of dealing with CSS directly, you deal with the properties on a React component. I bounced ideas off my coworkers and got excited about what this would mean. In the end we built UI-BOX.

UI-BOX

UI-BOX exports a single Box component that allows you to use React props for CSS properties. Instead of creating a class name, you pass the property to the Box component directly:

Why is this Box component useful?

The Box component is useful because it helps with 3 common use cases:

  • Create layouts without helper classes.

  • Define components without worrying about CSS.

  • Override single properties when using components.


Flexibility and composability

The Box component makes it easy to start writing new components that allow setting margin properties directly on the component. For example, you can quickly space out two buttons by adding marginRight={10} to the left button. You can also override CSS properties without adding new distinct properties to the component. For example, this is useful when a full-width button is needed, or when you want to remove the border-radius on one side of a button. Furthermore, layouts can be created instantly by using the Box component directly.

Still a place for CSS

It is important to note that UI-BOX only solves some of the problems. A class is still needed to control the appearance of a component. For example, a button can get its dimensions and spacing from UI-BOX, but a class defines the appearance: background color, box shadows, and color, as well as the hover, active and focus states. In our design system, Evergreen, a CSS-in-JS library called Glamor is used to create these appearance classes.

Why it drove adoption of Evergreen

A design system can start with something small and useful. In our case it was using a UI primitive that abstracted away dealing with CSS directly. Roland, one of our lead engineers, said the following about UI-BOX:

UI-BOX really drove adoption of Evergreen…

…there is no need to consider every configuration when defining a new component. And no need to wrap components in divs for spacing.

— Roland Warmerdam, Lead Software Engineer, Segment

The lesson learned here is that it’s possible to start with something small and slowly grow it into a full-fledged design system. Don’t think you have time for that? Read the next section for some ideas.

How we started driving adoption

Up until now, I had built a tool for myself in my spare time, but it was still very much a side project. Smaller startups often can’t prioritize a design system as it doesn’t always directly align with business value. I will share how I hijacked a project, scaled out the system, and finally drove adoption across teams at Segment — and how you can do the same.

Hijack a project

About a year ago I switched teams within Segment. I joined a small team called Personas, which was almost like a small startup within Segment. With Personas we were building user profiles and audience capabilities on top of the Segment platform. It turned out to be a perfect opportunity to build out more of the design system.

Deadline in sight — our first user conference

The company wanted to announce the Personas product at our first ever user conference, with only 3 months of lead time to prepare. The idea was that our CEO and Head of Product would demo it on stage. However, there was no way we could finish a fully-baked consumer-facing product in time. We were pivoting too often based on customer feedback.



Seize the opportunity

It seemed like an impossible deadline. Then it hit me: We could build a standalone prototype to power the on-stage demo. This prototype would be powered by fake data and support only the functionality that was part of the demo.

This prototype would live outside of the confines of our application. This would allow us to build things quickly, but the downside was that we had no access to the code and components that live in our application codebase. Every component we wanted to use in the prototype had to be built, which made it a perfect opportunity to build out more of the design system. We decided it would be the lowest risk, highest reward option for us to pursue.

While we worked on the demo script for the on-stage demo, I was crunching away on the prototype and Evergreen. Having the prototype available and easily shareable made it easier for the team to practice and fine-tune the script. It was a great time at Segment; I could see the team and company growing closer while readying for launch.

Huge Success

The interactive prototype was a huge success. It helped us show the vision of what our product and Personas could be. It drove considerable interest in our newest product, Personas. I was happy because not only did we have an interactive prototype, we also had the first parts of our design system.

Focus on the developer experience

So far, we built something small and useful and hijacked a project that allowed us to build out a big chunk of our design system, Evergreen. The prototype also proved to be a great way to drive adoption of Evergreen in our application. Our developers simply took code from the prototype and ported it over to our application.

At that point, Evergreen components were adopted in over 200 source code files. Our team was happy about the components, but there were some pain points with the way Evergreen was structured. When we started building Evergreen, we copied some of the architecture decisions of bigger design systems. That turned out to be a mistake. It slowed us down.

Too early for a mono-repo

When I started building Evergreen I took a lot of inspiration from Atlassian’s AtlasKit. It is one of the most mature and comprehensive enterprise design systems out there. We used the same mono-repo architecture for Evergreen, but it turns out there is quite a lot of overhead when using a mono-repo.

Our developers were not happy with the large number of different imports in each file. There were over 20 different package dependencies. Maintaining these dependencies was painful. Besides unhappy developers, it was time-consuming to add new components.

A single dependency is better for us (for now)

I wanted to remove as much friction for our developers using Evergreen as possible, which is why I wanted to migrate away from the mono-repo. Instead, a single package would export all of our components as a single dependency.

Migrate our codebase in a single command

When we decided to migrate to a single package, it required updating the imports in all the places Evergreen was used in our application. At this point Evergreen was used in over 200 source code files in the Segment application. It seemed like a pretty daunting challenge, not something anyone got excited about doing manually. We started exploring our options and ways to automate the process, and to our surprise it was easier than we thought.

Babel parser to the rescue

We created a command line tool for our application that could migrate the hundreds of files of source code using Evergreen with one command. The syntax was transformed using babel-parser. Now it’s a much better experience for our developers in the application. In the end, our developers were happy.

Lesson Learned, Face the challenge

A big change like this can feel intimidating and give you second thoughts. I wish I had started Evergreen with the architecture it has right now, but sometimes the right choice isn’t clear up front. The most important thing is to learn and move forward.

Driving adoption of a design system is very challenging, and it is hard to gauge progress. We came up with quite a nifty way to visualize adoption in our application, and in turn make data-driven decisions about the future of Evergreen.

How to get to 100% adoption

Within our company, teams operate on key metrics to get resources and show they are being successful to the rest of the company. One of the key metrics for Evergreen is 100% adoption in our application. What does 100% even mean? And how can we report on this progress?

What does 100% adoption even mean?

100% adoption at Segment means building any new products with Evergreen and deprecating our legacy UI components in favor of Evergreen components. The first part is the easiest as most teams are already using Evergreen to build new products. The second part is harder. How do we migrate all of our legacy UI components to Evergreen components?

What legacy UI components are in our app?

Active code bases will accrue a large number of components over time. In our case this comes in the form of legacy component libraries that live in the application codebase.

In our case it comes in the following two legacy libraries:

  • React UI Library, precursor to Evergreen.

  • Legacy UI folder, literally a folder called ui in our codebase that holds some very old components.

Evergreen versions

In addition to the legacy libraries, the application is able to leverage multiple versions of Evergreen. This allows gradual migration from one version to another.

  • Evergreen v4, the latest and greatest version of Evergreen. We want 100% of this.

  • Evergreen v3, previous version of Evergreen. We are actively working on migrating this over to v4

How can we report on the progress of adoption?

The solution we came up with to report on the adoption of Evergreen is an adoption dashboard. At any single point in time the dashboard shows the following metrics:

  • Global Adoption, the current global state of adoption

  • Adoption Week Over Week, the usage of Evergreen (and other libraries) week over week

  • Component Usage, a treemap of each component sorted by framework. Each square is sized by how many times the component is imported in our codebase.

The Component Usage Treemap

Besides the aggregates, we know exactly which files import a component. To visualize this, a treemap chart on the dashboard shows each component, with the size of the square representing how many times it is imported in our application.

Understand exactly where you are using a component

Clicking on one of the squares in the treemap shows a side sheet with a list of all the files which import that component. This information allows us to confidently deprecate components.

Filter down a list of low hanging fruit to deprecate

The adoption dashboard also helps to prioritize the adoption roadmap. For example, legacy components that are only imported once or twice are easy to deprecate.

How it works

Earlier I shared how we used babel-parser to migrate to the new import structure.  Being true to our roots, we realized the same technique could be used to collect analytics for our design system! To get to the final adoption dashboard there are a few steps involved.

Step 1. Create a report by analyzing the codebase

We wrote a command line utility that returns a report by analyzing the import statements at the top of each file in our codebase. An index is built that maps these files to their dependencies. Then the index can be queried by package and optionally the export.  Here is an example:

Command

Output

We open-sourced this tool. If you are interested in learning more or want to build out your own adoption dashboard, see https://github.com/segmentio/dependency-report

Step 2. Create and save a report on every app deploy

  • Every time we deploy our application, the codebase is analyzed and a JSON report is generated using the dependency-report tool.

  • Once the report is generated, it is persisted to object storage (S3).

  • After persisting the report, a webhook triggers the rebuild of our dashboard via the Gatsby static site generator.

Step 3. Build the dashboard and load the data

To reduce the number of reports on the dashboard, the generator only retrieves the most current report as well as a sample report from each previous week. The latest report is used to show the current state. The reports of the previous weeks are used to calculate an aggregate for the week over week adoption chart.

How the adoption dashboard is pushing Evergreen forward

The adoption dashboard was the final piece in making Evergreen a success as it helped us migrate over old parts of our app systematically and with full confidence. It was easy to identify usage of legacy components in the codebase and know when it was safe to deprecate them. Our developers were also excited to see a visual representation of the progress. These days it helps us make data-driven decisions about the future of Evergreen and prioritize our roadmap. And honestly, it is pretty cool.

Conclusion

To those of you who are considering setting out on this journey, I’ll leave you with a few closing thoughts:

  • Start small. It’s important to show the value of a potential design system by solving a small problem first.

  • Find a real place to start. A design system doesn’t have value by itself. It only works when applied to a real problem.

  • Drive adoption and measure your progress. The real work starts once the adoption begins. Don’t forget that the real value is in adoption. Design systems are only valuable once they are fully integrated into the team’s workflow.

This is only the start of our journey. There are still many challenges ahead. Remember, building a design system is not about reaching a single point in time. It’s an ongoing process of learning, building, evangelizing and driving adoption in your organization.

Alan Braithwaite on September 27th 2018

“How should we test this?”

“Let’s just run it in production and monitor it closely.”

— You and your coworker, probably.

While often mocked, testing in production is the most definitive way to ensure that your system is operating as expected.  Segment has been on a journey for the last 18 months to include end-to-end testing in production as part of our broader testing strategy, so we wanted to share some of the work we’ve been doing in this area.

For those unfamiliar, Segment is a Customer Data Infrastructure which helps our customers route data about their users from various collection points (web, mobile, server-side) to hundreds of Destinations (partners which receive data from Segment) and data warehouses.

The numerous components which compose Segment’s backend create a challenging environment for testing in general, but especially in production. To manage this complexity, we’ve decided to focus on two areas.

First, we’ve been building towards a staging environment that faithfully represents our production environment. Second, since we cannot cost-effectively operate a staging environment at the same scale as our production environment, we’ve been developing end-to-end tests for production.

Much has been written about other types of testing, so we’re going to focus on end-to-end testing in this post.

End-to-end tests are tests which run against the entire infrastructure. They are distinct from integration tests because they’re run on real infrastructure whereas integration tests are not. These tests are also distinct from unit tests, which only test a very small amount of code or even just one method.  End-to-end tests should also exercise the exact same code paths used by a customer sending data to Segment’s API.

So what does an end-to-end test look like for Segment?

  1. Send an event to the Segment Tracking API

  2. Process that event through our many streaming services (e.g. validation, deduplication, etc)

  3. Send the event into Centrifuge, which handles reliable delivery of events to Destinations in the presence of network timeouts or other failures outside of our control

  4. Verify that the event is received by a Webhook destination

  5. Emit latency and delivery metrics to alert on using segmentio/stats.

To implement this kind of test, we required an end-to-end testing framework that would make it easy for developers to build new tests.

When we started looking at solutions, we played around with some other end-to-end frameworks with varying degrees of success. They often incorporated ideas about contracts and assertions which were tightly coupled to the framework. This not only made it difficult to add new types of tests, but it also made them difficult to debug.

Before we had end-to-end tests, our staging environment wasn’t effective at preventing bugs from getting to production.  Software is updated more frequently in staging, often being a week or so ahead of the production version.  Additionally, configuration of the staging environment was haphazard and occasionally broke due to changes in the software. These breaks were often silent because we had not been monitoring them in staging.

Today we’re open sourcing Orbital, a framework which meets the requirements presented above and helped us reach our testing goals.

Orbital provides the means to define, register and run tests as part of a perpetually-running end-to-end test service. Additionally, it provides metrics (using segmentio/stats) around test latency and failure rates which we can monitor and alert on.

Design

Orbital is a lightweight test framework for running systems tests defined in Go.  Orbital is inspired by Go’s own testing library, specifically the testing.T abstraction.  testing.T is a struct that gets injected into each test which defines a set of methods to determine whether or not that test passed.  We like Go’s testing package for two reasons.

First, the package takes a users-first approach in its design.  The API couldn’t be simpler!  Doing this greatly reduces friction when writing tests, increasing the likelihood that they’ll get written and maintained properly.

Second, modeling orbital.O after testing.T gives us the flexibility we need to define our arbitrarily complex tests.  After trying to enumerate all the different things we’d like to support, we found that there are just too many behavioral edge cases that need actual code to describe properly.  For example, say you want to check that events were received by a webhook and also that some counters were updated.  This was difficult to articulate with an assertion-based framework like the one we were using before. With Orbital, we’re now only limited by what the Go language supports, which is an improvement over the “mutation→assertion” style tests we encountered before.

The following example exercises the above illustrated case: sending an event to our Tracking API produces a webhook to a configured endpoint. In this case, we’ve configured the webhook to be our own end-to-end service’s API for test verification.
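
A rough sketch of the shape of such a test follows. The O and Waiter types here are stand-ins rather than Orbital’s actual API, and the helper names are hypothetical; only the Waiter.Wait(ctx, evt.ID) call mirrors the usage described below.

    package e2e

    import "context"

    // O is a stand-in for orbital.O, which (like testing.T) exposes methods
    // for recording failures; the real type's method set may differ.
    type O interface {
        Errorf(format string, args ...interface{})
    }

    // Waiter is a stand-in for the RouteLogger-backed waiter described below.
    type Waiter interface {
        Wait(ctx context.Context, id string) error
    }

    type Event struct {
        ID   string
        Name string
    }

    type Handler struct {
        Waiter Waiter
    }

    // sendToTrackingAPI stands in for an HTTP call to Segment's Tracking API.
    func sendToTrackingAPI(ctx context.Context, evt Event) error {
        // ...build and send the request...
        return nil
    }

    // TestTrackDeliveredToWebhook sends an event and then blocks until the
    // pipeline delivers it back to our webhook endpoint, or the test
    // context is cancelled.
    func (h *Handler) TestTrackDeliveredToWebhook(ctx context.Context, o O) {
        evt := Event{ID: "evt_123", Name: "Order Completed"}

        if err := sendToTrackingAPI(ctx, evt); err != nil {
            o.Errorf("sending event: %v", err)
            return
        }

        if err := h.Waiter.Wait(ctx, evt.ID); err != nil {
            o.Errorf("event %s never arrived at the webhook: %v", evt.ID, err)
        }
    }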

As you can see, the code is very straightforward.  Each test runs in its own goroutine and blocks until it’s completed or the context is cancelled.  Modeling tests in this way allows us to check arbitrary side effects and allows for any kind of behavioral testing your imagination can come up with.

Orbital provides a Service struct which registers the tests and manages the process lifecycle. This struct allows you to set global timeouts for all tests as well as configure logging and metrics. During test registration, you set the period (how often the test is run), name, function and optional timeout override.

One key factor in the design of this framework is the embedded webhook package. This special webhook operates like a normal HTTP server which logs requests to an interface.  One implementation of this interface (RouteLogger) is configured such that after sending an event, you can block your goroutine waiting until that event is received by the webhook or a timeout occurs.

With this primitive, we can send requests to the API, then wait for them to be sent back after being processed in our pipeline. In the above example, we’re doing this on the line h.Waiter.Wait(ctx, evt.ID). To see a full example of both a tester and tested service, check out the examples directory on GitHub.

How do we use it?

Our Orbital tests are deployed as a service that runs inside of our staging and production infrastructure.  It sends events to the Segment Tracking API using our various library implementations.  We even fork out to headless Chrome to execute tests in the browser with analytics.js!  This framework generates metrics used for dashboards and alerting. Here you can see a comparison of our staging vs production environments.

Production

Staging

From the screenshots above you can see that something was broken in staging by looking at the top center graph.

This library was instrumental in building confidence that our stage environment behaves the same way as our production environment.  We’re now at the point where we can block a release if any of the tests fail in stage.  We know for certain that something did actually break and needs to be investigated. This is the testing strategy you need in your infrastructure to reach the ever-elusive five 9s of reliability.

What remains?

Orbital has already proven instrumental in reducing the number of bugs making it to production.  We’ve written numerous tests across multiple teams which exercise various known customer configurations.  However, the framework is not yet bulletproof.

Although you can scale a single instance to tens of thousands of requests per second, eventually you’ll hit a bottleneck somewhere.  Unfortunately, this framework doesn’t elegantly scale out right now.  Currently, the “RouteLogger/Waiter” records messages sent and received in memory, not in a shared resource or database.  So if you have multiple load-balanced tasks running, requests are unlikely to be sent to the right task and the tests will fail or time out.  This is a non-trivial but ultimately solvable problem.

If this is interesting to you, reach out to us!  We’d love to hear from you.  You can find us on twitter @Segment.  Check out our Open Source initiatives here.  We’ve also got many positions in Engineering which involve solving problems similar to this which you can see here.

If you’re interested in reading more about this topic, check out these other great resources on testing in production:

https://medium.com/@copyconstruct/testing-in-production-the-safe-way-18ca102d0ef1

https://opensource.com/article/17/8/testing-production

https://saucelabs.com/blog/why-you-should-be-testing-in-production

https://launchdarkly.com/blog/testing-in-production-the-netflix-way/

Noah Zoschke on August 7th 2018

Segment is a hub for a tremendous amount of data. It processes peaks of 230,000 events per second inbound, and 280,000 events per second outbound between more than 200 integration partners. You may think of Segment as black box for delivering all this data. You send data once to its tracking API, and it coordinates translating data and delivering it to many destinations.

When everything works perfectly, you don’t need to open the black box. Unfortunately, the world of data delivery at scale is far from perfect. Think of all the software, networks, databases and engineers behind Segment and our partners. You can imagine at any given time a database is failing, a network is unreachable, etc.

Segment engineering has gone to great lengths to operate reliably in this environment. Our latest efforts have been around visibility into the HTTP response codes from destinations. We spent the last few months adding hooks to measure everything from the volume of events, to how quickly they are sent to destinations, to what HTTP status code and error response body, if any, occurred for every request.

This instrumentation is ultimately for Segment users to see into the black box to answer one question: how do delivery challenges affect my data? To this end, we built an event delivery dashboard around the data.

It turns out the data in aggregate is also tremendously useful to cloud service engineers at Segment and our destination partners alike. Looking at HTTP status codes alone has unveiled lots of insights on how data flows between services and how we can maximize delivery rates.

I’d like to share some of the things we found in a single day of HTTP responses at Segment.

Success!

First the good news. 92.6% of events — 24.4 billion on the sample day — are delivered on the first attempt. In this happy path, Segment makes an HTTP request to a destination and receives a HTTP 2XX success status code response.

Terminal problems

Next, the bad news. 5.5% of events — 1.4 billion on the sample day — never make it to their destination. In this path, Segment makes an HTTP request to a destination and generally receives an HTTP 4XX client error status code response. These codes indicate the client — either Segment or the user it represents — made an error that the server can’t reconcile.

What’s the password?

The most common client errors Segment sees are HTTP 401 Unauthorized and HTTP 403 Forbidden on 3.8% of requests. In this case, the server doesn’t recognize the given username, password or token, and can’t accept any data. Neither Segment nor the destination server can resolve this automatically for a given request.

This is either due to wrong credentials configured in Segment in the first place or credentials that expired on a destination. Segment always attempts to send the latest events just in case the problem was resolved on either side.

No comprende

The next most common client error is HTTP 400 Bad Request, on 0.51% of requests. In this case, the server received the request payload but couldn’t make sense of it. These are generally validation errors. Again, Segment and the destination can’t do anything about it automatically, except show instructive error messages to the user.

Next steps…

These errors are considered fatal, but the qualitative data can inform ways to improve delivery over time. The first big step here was building the event delivery dashboard to surface these issues to users.

For authentication errors, a logical next step would be to send notifications when delivery begins to encounter 401 errors. We can also imagine a mechanism to disable event delivery after a threshold to spare partners the request overhead.

For validation errors, visibility into requests per-customer and per-destination can inform improvements to the Segment integration code. Segment can review partner API requirements and not attempt to deliver data it can determine is bad ahead of time, or automatically massage data to conform to the destination API.

Ephemeral problems

Now the interesting challenge… a large class of HTTP problems on the internet are not fatal. In fact, most of the HTTP 5XX server error status codes reflect an unexpected error and imply that the system may accept data at a later time, as does one critical 4XX status code.

Volume

The largest class of temporary problems Segment sees is HTTP 429 Too Many Requests. It’s not hard to imagine why… 

Segment itself has very high rate limits with the aim of accepting all of the data a customer throws at it. Not every downstream destination has the same capabilities, particularly those that are systems of record. Intercom, Zendesk, and Mailchimp, for example, all have well-designed and lower API rate limits.

Segment has to mediate between the customer data volume and the destination rate limits. A combination of internal metering, request batching, and retry with backoff get most of the data through.

But about 7.3% of requests — 2.1 billion a day — encounter a 429 response along the way. Retries help a lot, but if a customer is simply over their limits consistently over a long enough time frame, Segment has no choice but to drop some messages. At least we can quantify how much this is happening with the delivery data and report this to a customer.

Out of service

The next largest class of error — 1.3% of requests — is from destination servers. Segment often sees servers respond with an error like:

  • HTTP 502 Bad Gateway

  • HTTP 504 Gateway Timeout

  • HTTP 500 Internal Server Error

  • HTTP 503 Service Unavailable

Perhaps it’s a temporary glitch for a single request, or perhaps the destination service is experiencing an outage. But every day Segment encounters 371 million of these error responses.

Unreliable channel

Finally, 1.1% of requests error out because of the network layer. At scale, Segment sees a noticeable number of network errors, such as:

  • ENOTFOUND — hostname not found

  • ECONNREFUSED — connection refused

  • ECONNRESET — connection reset

  • ECONNABORTED — connection aborted

  • EHOSTUNREACH — host unreachable

  • EAI_AGAIN — DNS lookup timeout

Maybe it’s due to a bad host, a flaky network, or a DNS error.

If at first you don’t succeed…

As seen above, a significant number of HTTP or network status codes indicate transient problems. When Segment encounters these, it retries delivery over a 4-hour window with exponential backoff. We can see that this retry strategy is successful. We go from 92.6% success on the first attempt to 93.9% success after ten attempts, an extra 163 million events delivered, all thanks to the destination server sending proper HTTP status codes.
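
As a rough illustration of how such a schedule fits in the window, here’s a sketch with made-up parameters (the base delay, cap, and doubling factor are not Segment’s actual configuration):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        const (
            attempts = 10
            base     = 20 * time.Second
            maxDelay = 1 * time.Hour
        )

        // The first delivery attempt happens immediately; the nine retries
        // that follow are spaced out with exponentially growing delays.
        var total time.Duration
        delay := base
        for retry := 1; retry < attempts; retry++ {
            total += delay
            fmt.Printf("retry %d: wait %v (cumulative %v)\n", retry, delay, total)

            delay *= 2
            if delay > maxDelay {
                delay = maxDelay
            }
        }
        // With these numbers the whole schedule finishes in roughly two and
        // a half hours, comfortably inside a four-hour window.
    }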

WTF webhooks

Finally, we see some bizarre errors. A very popular destination is webhooks — arbitrary HTTP addresses to POST events. The error codes we see from these destinations imply webhooks might not always follow best practices.

We see every number from 1 to 101 used as an HTTP status code, which is far outside the HTTP status code specification. Perhaps this is someone testing Segment delivery rates themselves?

We see HTTP 418 I'm a teapot, a status code that exists only as an April Fools’ joke.

We see normal SSL errors like CERT_HAS_EXPIRED and more esoteric ones like UNABLE_TO_VERIFY_LEAF_SIGNATURE and DEPTH_ZERO_SELF_SIGNED_CERT.

Unfortunately, all of these strange responses are considered terminal errors by Segment. Sorry, webhooks!

Conclusion

It’s literally impossible to achieve 100% delivery on the first attempt over the internet. Transient network errors, unexpected server errors, and rate limiting all present challenges that add up to significant problems at scale. On top of that, encryption, authentication and data validation add another layer of challenges for perfect machine-to-machine delivery.

Retries are the primary strategy to improve delivery, and a retry strategy can only be as good as the destination service response codes.

As a service provider, returning status codes like 400, 403 or 501 is a powerful signal that Segment has no choice but to drop data. Inversely, returning status codes like 500, 502, and 504 is a strong hint that Segment should try again. And 429 — rate limit exceeded — is an explicit sign that Segment needs to retry later.

If you’re running cloud service APIs or writing webhooks, think carefully about HTTP status codes. User data depends on it!

For more information about cloud service APIs, visit Segment’s Destinations catalog at https://segment.com/catalog#integrations/all

Tamar Ben-Shachar on August 1st 2018

Segment loads billions of rows of arbitrary events into our customers’ data warehouses every single day. How do we test a change that can corrupt only one field in millions, across thousands of warehouses? How can we verify the output when we don’t even control the input?

We recently migrated all our customers to a new data pipeline without any major customer impact. Through this process we had to solidify our testing story, to the point where we were able to compare billions of entries for row-per-row parity. In this post, I’d like to share a few of the techniques we used to make that process both fast and efficient.

Warehouses Overview 

Before going into our testing strategy, it’s worth understanding a little about the data we process. Segment has a single API for sending data that will be loaded into a customer’s data warehouse. Each API call generates what we call an ‘event’. We transform these events into rows in a database. 

Below, you can see an example of how an event from a website source gets transformed into a database row.

Note that the information in the event depends on the customer. We accept any number of properties and any type of value: number, string, timestamp, etc.

Migration

Our first iteration of the warehouse pipeline (v1) was just a scheduler and a task to process the data. This worked well until our customer base increased in both number and volume. At that point the simplicity caught up with us and we had to make a change.

We came up with a new architecture (v2) that gave us greater visibility into our pipeline and let us scale at more granular levels. Though this was the right way forward, it was a completely new pipeline, and we needed a migration plan. 

Our goal was to switch customers to the new pipeline without them noticing. All data should be written to the database exactly the same way in the v1 pipeline as in the v2 pipeline. If we found a bug in how we processed data in v1, we kept that bug in v2. Customers expect to receive our data in a certain way and have built tooling around it. We can’t just change course and expect all our customers to change their tooling accordingly.  

We also wanted to minimize any potential for new bugs. When we have a bug in how we load data, deploying a fix will only fix future data. Previous data has already been loaded into a customer’s database that we don’t control. To fix the bad rows, we have to re-run the pipeline over the old events that were incorrectly loaded.

Testing Strategy 

In other parts of the Segment infrastructure, our normal testing strategies consist of some combination of code reviews and tests.

Code reviews are very useful but don’t guarantee that something won’t slip through the cracks. Unit and integration tests are great for testing basic functionality and edge cases. However, customers send us too much variance in data and we can’t exhaustively test, or know about, every case.  

For our first pass at testing the new pipeline, we resorted to manually sending events and looking for any data anomalies. But this quickly fell short. As we started processing more and more data through our pipeline, we realized we needed a new solution.

Large Scale Testing

We needed to come up with a testing solution that gave us confidence.

To solve this problem, our approach was straightforward: build a system to do what we were trying to do manually. In essence, we wanted to run the same dataset through both the v1 and v2 warehouse pipelines and make sure the result from both pipelines is exactly the same. We call this service Warehouse QA.

When a pull request becomes ready for testing, a comment on that pull request triggers a webhook, which begins the QA run.

Let’s walk through a concrete example of how this works. 

Step 1: Send a Request from Github

We trigger a QA request by adding a comment on a pull request in a format the service understands. We can request a specific dataset to run through the pipelines to verify if a bug is fixed or test the proposed changes under a certain type of load. If a dataset is not specified, we use one of our defaults. 

Step 2: Process the Request

Once the service receives a new request, it starts runs of both the v1 and v2 pipelines over the chosen dataset. Each pipeline writes data under a unique schema in our test warehouse so we can easily query the results. 

Step 3: Audit Results

The most important step is the validation or audit. We check every table, column, and value that was loaded from the v2 pipeline and make sure it matches what was loaded from the v1 pipeline. 

Here is the struct that represents the results for a given run. We first check that the exact same set of tables were created by both pipelines. 
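
The struct below is a hypothetical sketch of its shape (field names are illustrative, not the real code):

    // RunResult captures the outcome of comparing a v1 run and a v2 run.
    type RunResult struct {
        // Tables created by each pipeline, so we can confirm the sets match.
        V1Tables []string
        V2Tables []string

        // Tables present in one run but not the other.
        MissingInV2 []string
        ExtraInV2   []string

        // Per-table comparison results, keyed by table name.
        Tables map[string]TableResult
    }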

For each table, we then dive deeper to populate the fields in the table struct below. This struct compares a given table from the v1 run to the table created by the v2 run. 
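
Again as a hypothetical sketch, grounded only in the checks described in the next paragraph (row counts, missing and extra rows, and per-value differences):

    // TableResult compares one table from the v1 run against the same
    // table from the v2 run. Field names are illustrative.
    type TableResult struct {
        Name string

        // Raw row counts from each pipeline.
        V1RowCount int64
        V2RowCount int64

        // Rows present in one pipeline's output but not the other.
        MissingRows int64
        ExtraRows   int64

        // Rows present in both outputs whose column values differ.
        DifferingValues int64
    }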

Note that we are checking more than just the counts of the two runs. Counts can give false positives if there are both extra rows and missing rows. Counts also don’t find differences in the field values themselves. This piece is critical to check since we have to be sure we aren’t processing data differently in the new pipeline.

Step 4: Reporting

Now that we’ve compared what we loaded from the v1 run with the v2 run we need to be able to report the data succinctly. When the run request is complete, we post the following overview on the pull request:

If we open the results file under detailed results above, this is what we see for each table that was identical under v1 and v2:

If the pipelines outputted different results, it looks like this:

Step 5: Debug Differing Results

At this point, we have all the information we need to figure out why the two pipelines wrote different values.

Below is a comparison of data that differed between the v1 and v2 pipelines, taken from the results of a QA run. Each row corresponds to a row in our test warehouse. The fields are the id of the row in the table, the column that differed in the two pipelines, and the differing values, respectively. The red value is the value from the v1 pipeline and the green from v2. 

Why are timestamps getting dropped in v1 here? We know that timestamps aren’t all getting dropped, and confirmed this with tests. It turns out that dates before the epoch (January 1st, 1970) were getting dropped because we converted timestamps to integers using Golang’s .Unix() method and dropped values <= 0.
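
To make the failure mode concrete, here’s a minimal sketch of the offending pattern (the guard shown is illustrative, not the actual v1 code):

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // A birthday before the Unix epoch...
        dob := time.Date(1965, time.March, 12, 0, 0, 0, 0, time.UTC)

        // ...converts to a negative integer with Unix().
        ts := dob.Unix()

        // A guard like this, intended to drop "empty" timestamps, silently
        // drops every pre-1970 date as well.
        if ts <= 0 {
            fmt.Println("dropped:", dob)
            return
        }
        fmt.Println("loaded:", ts)
    }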

Now that we found the root cause we can alter v2 accordingly. We then run QA again with the fix, and see that it passes. 

Extension

We found this tool so valuable that we still use it today, even though the migration to the v2 pipeline is complete. The QA system now compares the code running in production to the code on a pull request. We’ve even integrated with GitHub status checks. A QA run for each warehouse type we support is automatically triggered on every pull request.

Summary

Warehouse QA was crucial to our successful migration. It allowed us to test the unknown at scale and find any and all differences without any manual intervention. It continues to allow us to deploy quickly and confidently with a small team by thoroughly testing hundreds of thousands of events.

Alexandra Noonan on July 10th 2018

Unless you’ve been living under a rock, you probably already know that microservices is the architecture du jour. Coming of age alongside this trend, Segment adopted this as a best practice early-on, which served us well in some cases, and, as you’ll soon learn, not so well in others.

Briefly, microservices is a service-oriented software architecture in which server-side applications are constructed by combining many single-purpose, low-footprint network services. The touted benefits are improved modularity, reduced testing burden, better functional composition, environmental isolation, and development team autonomy. The opposite is a Monolithic architecture, where a large amount of functionality lives in a single service which is tested, deployed, and scaled as a single unit.

In early 2017 we reached a tipping point with a core piece of Segment’s product. It seemed as if we were falling from the microservices tree, hitting every branch on the way down. Instead of enabling us to move faster, the small team found themselves mired in exploding complexity. Essential benefits of this architecture became burdens. As our velocity plummeted, our defect rate exploded.

Eventually, the team found themselves unable to make headway, with 3 full-time engineers spending most of their time just keeping the system alive. Something had to change. This post is the story of how we took a step back and embraced an approach that aligned well with our product requirements and needs of the team.

Why Microservices Worked

Segment’s customer data infrastructure ingests hundreds of thousands of events per second and forwards them to partner APIs, what we refer to as server-side destinations. There are over one hundred types of these destinations, such as Google Analytics, Optimizely, or a custom webhook. 

Years back, when the product initially launched, the architecture was simple. There was an API that ingested events and forwarded them to a distributed message queue. An event, in this case, is a JSON object generated by a web or mobile app containing information about users and their actions. A sample payload looks like the following:
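
For illustration, a track call of this kind is a small JSON object along these lines (the values are made up):

    {
      "type": "track",
      "event": "Order Completed",
      "userId": "user_123",
      "properties": {
        "revenue": 49.99,
        "currency": "USD"
      },
      "timestamp": "2018-07-10T21:00:00Z"
    }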

As events were consumed from the queue, customer-managed settings were checked to decide which destinations should receive the event. The event was then sent to each destination’s API, one after another, which was useful because developers only need to send their event to a single endpoint, Segment’s API, instead of building potentially dozens of integrations. Segment handles making the request to every destination endpoint.

If one of the requests to a destination fails, sometimes we’ll try sending that event again at a later time. Some failures are safe to retry while others are not. Retry-able errors are those that could potentially be accepted by the destination with no changes. For example, HTTP 500s, rate limits, and timeouts. Non-retry-able errors are requests that we can be sure will never be accepted by the destination. For example, requests which have invalid credentials or are missing required fields.

At this point, a single queue contained both the newest events as well as those which may have had several retry attempts, across all destinations, which resulted in head-of-line blocking. Meaning in this particular case, if one destination slowed or went down, retries would flood the queue, resulting in delays across all our destinations.

Imagine destination X is experiencing a temporary issue and every request errors with a timeout. Now, not only does this create a large backlog of requests which have yet to reach destination X, but also every failed event is put back to retry in the queue. While our systems would automatically scale in response to increased load, the sudden increase in queue depth would outpace our ability to scale up, resulting in delays for the newest events. Delivery times for all destinations would increase because destination X had a momentary outage. Customers rely on the timeliness of this delivery, so we can’t afford increases in wait times anywhere in our pipeline.

To solve the head-of-line blocking problem, the team created a separate service and queue for each destination. This new architecture consisted of an additional router process that receives the inbound events and distributes a copy of the event to each selected destination. Now if one destination experienced problems, only its queue would back up and no other destinations would be impacted. This microservice-style architecture isolated the destinations from one another, which was crucial when one destination experienced issues, as they often do.

The Case for Individual Repos

Each destination API uses a different request format, requiring custom code to translate the event to match this format. A basic example is destination X requires sending birthday as traits.dob in the payload whereas our API accepts it as traits.birthday. The transformation code in destination X would look something like this:
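
As a sketch of the shape of such a transform (written in Go for illustration, with hypothetical type names rather than the real worker code):

    // Hypothetical types for illustration; the real destination workers
    // use their own representations.
    type SegmentEvent struct {
        Traits map[string]interface{}
    }

    type DestinationXPayload struct {
        Traits map[string]interface{}
    }

    // transformForDestinationX remaps fields into the shape destination X
    // expects, e.g. traits.birthday becomes traits.dob.
    func transformForDestinationX(event SegmentEvent) DestinationXPayload {
        payload := DestinationXPayload{Traits: map[string]interface{}{}}

        if birthday, ok := event.Traits["birthday"]; ok {
            payload.Traits["dob"] = birthday
        }

        // ...remaining fields mapped the same way...
        return payload
    }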

Many modern destination endpoints have adopted Segment’s request format making some transforms relatively simple. However, these transforms can be very complex depending on the structure of the destination’s API. For example, for some of the older and most sprawling destinations, we find ourselves shoving values into hand-crafted XML payloads.

Initially, when the destinations were divided into separate services, all of the code lived in one repo. A huge point of frustration was that a single broken test caused tests to fail across all destinations. When we wanted to deploy a change, we had to spend time fixing the broken test even if our change had nothing to do with it. In response to this problem, we decided to break out the code for each destination into its own repo. All the destinations were already broken out into their own services, so the transition was natural.

The split to separate repos allowed us to isolate the destination test suites easily. This isolation allowed the development team to move quickly when maintaining destinations.

Scaling Microservices and Repos

As time went on, we added over 50 new destinations, and that meant 50 new repos. To ease the burden of developing and maintaining these codebases, we created shared libraries to make common transforms and functionality, such as HTTP request handling, across our destinations easier and more uniform.

For example, if we want the name of a user from an event, event.name() can be called in any destination’s code. The shared library checks the event for the property key name and Name. If those don’t exist, it checks for a first name, checking the properties firstName, first_name, and FirstName. It does the same for the last name, checking the cases and combining the two to form the full name.
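
A sketch of that lookup logic (in Go for illustration; the real shared library is more thorough):

    // name resolves a user's full name from an event's properties by
    // checking the common spellings described above. Illustrative sketch.
    func name(props map[string]string) string {
        for _, key := range []string{"name", "Name"} {
            if v := props[key]; v != "" {
                return v
            }
        }

        first := firstNonEmpty(props, "firstName", "first_name", "FirstName")
        last := firstNonEmpty(props, "lastName", "last_name", "LastName")
        if first != "" && last != "" {
            return first + " " + last
        }
        return first + last // whichever one was found, possibly empty
    }

    func firstNonEmpty(props map[string]string, keys ...string) string {
        for _, k := range keys {
            if v := props[k]; v != "" {
                return v
            }
        }
        return ""
    }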

The shared libraries made building new destinations quick. The familiarity brought by a uniform set of shared functionality made maintenance less of a headache.

However, a new problem began to arise. Testing and deploying changes to these shared libraries impacted all of our destinations. It began to require considerable time and effort to maintain. Making changes to improve our libraries, knowing we’d have to test and deploy dozens of services, was a risky proposition. When pressed for time, engineers would only include the updated versions of these libraries on a single destination’s codebase.

Over time, the versions of these shared libraries began to diverge across the different destination codebases. The great benefit we once had of reduced customization between each destination codebase started to reverse. Eventually, all of them were using different versions of these shared libraries. We could’ve built tools to automate rolling out changes, but at this point, not only was developer productivity suffering but we began to encounter other issues with the microservice architecture.

An additional problem was that each service had a distinct load pattern. Some services handled a handful of events per day while others handled thousands of events per second. For destinations that handled a small number of events, an operator would have to manually scale the service up to meet demand whenever there was an unexpected spike in load.

While we did have auto-scaling implemented, each service had a distinct blend of required CPU and memory resources, which made tuning the auto-scaling configuration more art than science.

The number of destinations continued to grow rapidly, with the team adding three destinations per month on average, which meant more repos, more queues, and more services. With our microservice architecture, our operational overhead increased linearly with each added destination. Therefore, we decided to take a step back and rethink the entire pipeline.

Ditching Microservices and Queues

The first item on the list was to consolidate the now over 140 services into a single service. The overhead from managing all of these services was a huge tax on our team. We were literally losing sleep over it since it was common for the on-call engineer to get paged to deal with load spikes.

However, the architecture at the time would have made moving to a single service challenging. With a separate queue per destination, each worker would have to check every queue for work, which would have added a layer of complexity to the destination service with which we weren’t comfortable. This was the main inspiration for Centrifuge. Centrifuge would replace all our individual queues and be responsible for sending events to the single monolithic service.

Moving to a Monorepo

Given that there would only be one service, it made sense to move all the destination code into one repo, which meant merging all the different dependencies and tests into a single repo. We knew this was going to be messy.

For each of the 120 unique dependencies, we committed to having one version for all our destinations. As we moved each destination over, we'd check the dependencies it was using and update them to the latest versions. We fixed anything in the destinations that broke with the newer versions.

With this transition, we no longer needed to keep track of the differences between dependency versions. All our destinations were using the same version, which significantly reduced the complexity across the codebase. Maintaining destinations now became less time consuming and less risky.

We also wanted a test suite that allowed us to quickly and easily run all our destination tests. Running all the tests was one of the main blockers when making updates to the shared libraries we discussed earlier.

Fortunately, the destination tests all had a similar structure. They had basic unit tests to verify that our custom transform logic was correct, and they would execute HTTP requests to the partner's endpoint to verify that events showed up in the destination as expected.
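
As a rough illustration (not our actual test code), a unit test for the birthday-to-dob transform sketched earlier might look like this, using Node's built-in assert module and a hypothetical module path:

// Illustrative unit test for the earlier birthday -> dob transform.
const assert = require('assert');
const transform = require('./transform'); // hypothetical module path

const out = transform({ traits: { birthday: '1990-01-01' } });
assert.strictEqual(out.traits.dob, '1990-01-01');
console.log('transform test passed');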

Recall that the original motivation for separating each destination codebase into its own repo was to isolate test failures. However, it turned out this was a false advantage. Tests that made HTTP requests were still failing with some frequency. With destinations separated into their own repos, there was little motivation to clean up failing tests. This poor hygiene led to a constant source of frustrating technical debt. Often a small change that should have only taken an hour or two would end up requiring a couple of days to a week to complete.

Building a Resilient Test Suite

The outbound HTTP requests to destination endpoints during the test run were the primary cause of failing tests. Unrelated issues like expired credentials shouldn't cause tests to fail. We also knew from experience that some destination endpoints were much slower than others. Some destinations took up to 5 minutes to run their tests. With over 140 destinations, our test suite could take up to an hour to run.

To solve for both of these, we created Traffic Recorder. Traffic Recorder is built on top of yakbak and is responsible for recording and saving destinations' test traffic. Whenever a test runs for the first time, any requests and their corresponding responses are recorded to a file. On subsequent test runs, the requests and responses in the file are played back instead of hitting the destination's endpoint. These files are checked into the repo so that the tests are consistent across every change. Since the test suite no longer depended on HTTP requests over the internet, our tests became significantly more resilient, a must-have for the migration to a single repo.
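
For reference, yakbak works as a small record-and-replay HTTP proxy. A minimal sketch of how it can be wired up (the endpoint and tape directory here are placeholders, not our actual setup):

// Record-and-replay proxy built on yakbak: the first request is recorded to
// a tape file in ./tapes, and identical requests are later served from the
// tape instead of hitting the network.
const http = require('http');
const yakbak = require('yakbak');

http.createServer(yakbak('https://api.example-destination.com', {
  dirname: __dirname + '/tapes'
})).listen(4567);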

I remember running the tests for every destination for the first time after we integrated Traffic Recorder. It took milliseconds to run the tests for all 140+ of our destinations. In the past, a single destination could have taken a couple of minutes to complete. It felt like magic.

Why a Monolith Works

Once the code for all destinations lived in a single repo, they could be merged into a single service. With every destination living in one service, our developer productivity substantially improved. We no longer had to deploy 140+ services for a change to one of the shared libraries. One engineer can deploy the service in a matter of minutes.

The proof was in the improved velocity. In 2016, when our microservice architecture was still in place, we made 32 improvements to our shared libraries. Just this year we’ve made 46 improvements. We’ve made more improvements to our libraries in the past 6 months than in all of 2016.

The change also benefited our operational story. With every destination living in one service, we had a good mix of CPU- and memory-intensive destinations, which made scaling the service to meet demand significantly easier. The large worker pool can absorb spikes in load, so we no longer get paged for destinations that process small amounts of traffic.

Trade-offs

Moving from our microservice architecture to a monolith was a huge improvement overall; however, there are trade-offs:

  1. Fault isolation is difficult. With everything running in a monolith, if a bug is introduced in one destination that causes the service to crash, the service will crash for all destinations. We have comprehensive automated testing in place, but tests can only get you so far. We are currently working on a much more robust way to prevent one destination from taking down the entire service while still keeping all the destinations in a monolith.

  2. In-memory caching is less effective. Previously, with one service per destination, our low traffic destinations only had a handful of processes, which meant their in-memory caches of control plane data would stay hot. Now that cache is spread thinly across 3000+ processes so it’s much less likely to be hit. We could use something like Redis to solve for this, but then that’s another point of scaling for which we’d have to account. In the end, we accepted this loss of efficiency given the substantial operational benefits.

  3. Updating the version of a dependency may break multiple destinations. While moving everything to one repo solved the previous dependency mess we were in, it means that if we want to use the newest version of a library, we’ll potentially have to update other destinations to work with the newer version. In our opinion though, the simplicity of this approach is worth the trade-off. And with our comprehensive automated test suite, we can quickly see what breaks with a newer dependency version.

Conclusion

Our initial microservice architecture worked for a time, solving the immediate performance issues in our pipeline by isolating the destinations from each other. However, we weren’t set up to scale. We lacked the proper tooling for testing and deploying the microservices when bulk updates were needed. As a result, our developer productivity quickly declined.

Moving to a monolith allowed us to rid our pipeline of operational issues while significantly increasing developer productivity. We didn't make this transition lightly, though, and we knew there were things we had to consider if it was going to work.

  1. We needed a rock solid testing suite to put everything into one repo. Without this, we would have been in the same situation as when we originally decided to break them apart. Constant failing tests hurt our productivity in the past, and we didn’t want that happening again.

  2. We accepted the trade-offs inherent in a monolithic architecture and made sure we had a good story around each. We had to be comfortable with some of the sacrifices that came with this change.

When deciding between microservices and a monolith, there are different factors to consider with each. In some parts of our infrastructure microservices work well, but our server-side destinations were a perfect example of how this popular trend can actually hurt productivity and performance. It turns out the solution for us was a monolith.


The transition to a monolith was made possible by Stephen Mathieson, Rick Branson, Achille Roussel, Tom Holmes, and many more.

Special thanks to Rick Branson for helping review and edit this post at every stage.

