
Engineering

All Engineering articles

Benjamin Yolken, Julien Fabre on October 29th 2020

Network transfer costs are one of the biggest costs of running in the cloud. Here’s how we optimized ours to save $100k a month.

Rachel Landers, Solon Aguiar, Drew Thompson on October 19th 2020

PII is automatically masked for all workspace users unless they are explicitly granted access by the workspace owners. This allows your company to raise the bar on end-user data protection while harnessing Segment's full power to achieve your business priorities.

Benjamin Yolken on September 15th 2020

Today, we're releasing kubeapply, a lightweight tool for git-based management of Kubernetes configs. Here's why.

Benjamin Yolken on August 24th 2020

Today, we're releasing Topicctl, a tool for easy, declarative management of Kafka topics. Here's why.

Calvin French-Owen on June 5th 2020

This blog should not be construed as legal advice. Please discuss with your counsel what you need to do to comply with the GDPR, CCPA, and other similar laws.

Under the GDPR and CCPA, any company that serves users in the EU or California must allow those users to request that their data be deleted or suppressed.

  • Deletion: all identifying info related to the user must be properly deleted.

  • Suppression: the user should be able to specify where their data is used and sent (e.g., for a marketing, advertising, or product use case).

When you get a deletion request, it doesn’t just mean deleting a few rows of data in your database. It’s your responsibility to purge data about your users from all of your tools – email, advertising, and push notifications.

Typically, this process is incredibly time-consuming. We have seen companies create custom JIRA workflows, in-depth checklists, and other manual work to comply with the law. 

In this article, we’ll show you how to automate this process and respect user privacy by:

  • Managing consent with our open source consent manager.

  • Issuing DSARs (Data Subject Access Requests) on behalf of your users.

  • Federating those requests to downstream tools.

Let's dive in.

Step 1: Set up a JavaScript source and identify calls

If you haven’t already, you’ll want to make sure you have a data source set up on your website and are collecting your user data through Segment.

The easiest way to do this is via our JavaScript library and analytics.identify calls.

// when a user first logs in, identify them with name and email
analytics.identify('my-user-id', {
  email: 'jkim@email.com',
  firstName: 'Jane',
  lastName: 'Kim'
})

Generally, we recommend you first:

  • Generate user IDs in your database: a user ID should never change! It’s best to generate these in your database, so they can stay constant even if a user changes their email address. We’ll handle anonymous IDs automatically.

  • Collect the traits you have: you don’t have to worry about collecting every trait with every call. We’ll automatically merge them for you, so just collect what you have (see the short example after this list).

  • Start with messaging: if you’re trying to come up with a list of traits to collect, start with email personalization. Most customers start by collecting email, first and last name, age, phone, role, and company info so they can send personalized emails or push notifications.
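
For example, trait merging means you can identify the same user in pieces as information becomes available (the role trait below is just illustrative):

// at signup, all we know is the email
analytics.identify('my-user-id', { email: 'jkim@email.com' })

// later, after the user completes their profile, fill in the rest
analytics.identify('my-user-id', { firstName: 'Jane', lastName: 'Kim', role: 'Engineer' })

// Segment merges both calls into a single profile for 'my-user-id'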

Once you’ve collected data, you’re ready to start your compliance efforts.

Step 2: Enable the open-source consent manager

Giving users the ability to control what personal data is collected is a huge part of any privacy compliance regime. 

We’ve built an open source drop-in consent manager that automatically works with Analytics.js.

Adding it in is straightforward.

Updating the snippet

First, you’ll want to remove these two lines from your analytics.js snippet.

analytics.load("<Your Write Key>") // <-- delete me
analytics.page() // <-- delete me

These will automatically be called by the consent manager.

Add in your config

We’ve included some boilerplate configuration, which dictates when the consent manager is shown and what the text looks like. You’ll want to add this somewhere and customize it to your liking.

You’ll also want to add a target container for the manager to load into.

<div id="target-container"></div>

You can and should also customize this to your liking.

Load the consent manager

Finally, we’re ready to load the consent manager.

<script
  src="https://unpkg.com/@segment/consent-manager@5.0.0/standalone/consent-manager.js"
  defer
></script>

Once you’re done, it should look like this.
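
As a rough sketch, the assembled pieces might look something like the following. The configuration keys shown here (container, writeKey, bannerContent, and the dialog text fields) are assumptions based on the consent manager's documented options, so copy the exact boilerplate from the project's README rather than from this sketch.

<!-- target container the consent manager renders into -->
<div id="target-container"></div>

<script>
  // boilerplate configuration read by the standalone consent manager
  // (option names here are assumptions -- see the consent-manager README)
  window.consentManagerConfig = {
    container: '#target-container',
    writeKey: '<Your Write Key>',
    bannerContent: 'We use cookies and related technologies to improve your experience and analyze traffic.',
    preferencesDialogTitle: 'Data Collection Preferences',
    preferencesDialogContent: 'Choose which categories of data collection you are comfortable with.',
    cancelDialogTitle: 'Are you sure you want to cancel?',
    cancelDialogContent: 'Your preferences have not been saved.'
  }
</script>

<!-- the consent manager now calls analytics.load() and analytics.page() for you -->
<script
  src="https://unpkg.com/@segment/consent-manager@5.0.0/standalone/consent-manager.js"
  defer
></script>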

Great, now we can let users manage their preferences! They can opt-in to all data collection, or just the portion they want to. 

Step 3: Collecting deletion requests

Now it’s time to allow users to delete their data. The simplest way to do this is to start an Airtable sheet to keep track of user requests, and then create a form from it.

At a minimum, you’ll want to have columns for:

  • The user identifier – either an email or user ID.

  • A confirmation if your page is public (making sure the user was authenticated).

  • A checkbox indicating that the deletion was submitted.

From there, we can automatically turn it into an Airtable form to collect this data.

To automate this, you can use our GDPR Deletion APIs. You can script these so that you don’t need to worry about public form submissions; we’ve done this internally at Segment.
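
As a sketch, a small script driving those deletion APIs could look roughly like this. The endpoint, authentication header, and request body below are illustrative placeholders, not the documented API, so check the Segment deletion and suppression API docs for the real shapes.

// sketch: submit a deletion request for each user ID collected from your
// own "deletion requests" table (Airtable, a database, etc.)
const DELETION_API_URL = process.env.SEGMENT_DELETION_API_URL // real endpoint: see the deletion API docs
const API_TOKEN = process.env.SEGMENT_API_TOKEN

async function requestDeletion(userId) {
  // placeholder request shape -- consult the API docs for the actual payload
  const res = await fetch(DELETION_API_URL, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${API_TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ type: 'DELETE', userId }),
  })
  if (!res.ok) throw new Error(`deletion request failed: ${res.status}`)
}

// only call this for requests confirmed to come from the authenticated user
requestDeletion('my-user-id').catch(console.error)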

Tip: Make sure deletions are guarded by some sort of confirmation step, or only accessible when the user is logged in.

Step 4: Issuing deletions and receipts

Now we’re ready to put it all together. We can issue deletion requests within Segment for individual users.

This will remove user records from:

  • Segment archives

  • Your warehouses and data lakes

  • Downstream destinations that support deletion

To do so, simply go to the deletion manager under Workspace Settings > End User Privacy.

This will allow you to make a new request by ID.

Simply select “New Request”, and enter the user ID from your database.

This will automatically kick off deletions in any end tools which support them. You’ll see receipts in Segment indicating that these deletions went through.

As your different destinations begin processing this data, they will send you notifications as well.

And just like that, we’ve built deletion and suppression into our pipeline, all with minimal work!

Wrapping up

Here’s what we’ve accomplished in this article. We’ve:

  • Collected our user data thoughtfully and responsibly by asking for consent with the Segment open source consent manager.

  • Accepted deletion requests via Airtable or the Segment deletion API.

  • Automated that deletion in downstream tools by federating the requests.

Try this recipe for yourself...

Get help implementing this use case by talking with a Segment Team member or by signing up for a free Segment workspace here.

Stephen Grable on May 4th 2020

With Segment, you can gather in-depth analytics for your Shopify store — product information, checkout steps viewed, purchases and more.

Leif Dreizler on March 31st 2020

Segment receives billions of events every day from thousands of customers that trust Segment to keep their data safe. At Segment, we believe that good security is an essential part of building high-quality software, similar to reliability or scalability. In addition to the tools and processes developed by our Security Org to help software engineers make good security choices, we also rely on reviews from both traditional and crowdsourced security companies.

Bug bounties are often seen as a significant burden for the security teams that run them. You’ll hear horror stories about companies launching a bug bounty, their security team getting inundated with low-quality reports, duplicate submissions, and researchers going outside the scope of the program.

Shortly after, you'll hear about their engineering team being overwhelmed with newly discovered vulnerabilities to fix. From those that survive the opening salvo, you may hear complaints that, over time, they have stopped receiving impactful submissions.

A few years ago, when the space was less mature, critics questioned whether running a program was worth it. Now, it is expected that organizations of a certain size and maturity run a bug bounty program.

In late 2019, the Cybersecurity and Infrastructure Security Agency (CISA) of the United States published a draft directive recommending all agencies develop a mechanism to accept information related to security vulnerabilities from external parties.

In this blog post, we’ll break down how to start and manage a bug bounty program, consistently achieve good results, and maintain healthy relationships with the people that power the program.

If you’re short on time, check out the “Top Tips” section at the bottom of this post.

What’s a bug bounty?

A bug bounty program is like a Wanted Poster for security vulnerabilities. 

Companies running bug bounty programs pay independent security researchers from across the world for security vulnerabilities in their own products and infrastructure.


Bug bounty programs are run by hundreds of organizations, including Mozilla, Fiat Chrysler, the U.S. Department of Defense, and Segment.

Why bug bounty?

I assume most readers are already bought into the benefits of running a bug bounty.

If you’re on the fence, there are tons of independent articles, marketing docs from companies like HackerOne and Bugcrowd, and recorded conference talks, so I’m not going to cover this in detail. 

Most companies that have an internet presence or make an internet-connected device should consider running a bounty, or at least have a way for security researchers to report security issues.

It is also part of the Vendor Security Alliance questionnaire, so it may be something your customers ask you about if you are in the B2B space. 

If you don’t have a way for researchers to report issues, they will email people at your company (or any alias they can find on your website), connect with you on LinkedIn, or just tweet about the issues they think they’ve found.

It’s a much better experience for researchers, your employees, and your customers if you give the security community a clear avenue to report vulnerabilities.

Your security and engineering orgs will be regularly impressed by the creativity of the researcher community. These are people that, without internal knowledge, can find critical vulnerabilities in organizations across the world. 

Why use a bug bounty platform?

I strongly recommend using a bug bounty platform like HackerOne or Bugcrowd (we use Bugcrowd here at Segment) to help manage this process. These companies provide a platform and services to help run an efficient program.

Severity baselines

Severity baselines make it easier to tell how serious a vulnerability is, and how much time you should be spending on review and remediation.

When running a program on your own, you’ll frequently have researchers overhyping vulnerabilities. Platforms have a guided submission form, which helps researchers pick the appropriate category and rating.

Reputation system

The reputation system rewards researchers that accurately rank vulnerabilities and creates a competitive environment that benefits both researchers and program owners.

It also helps reinforce good behavior: researcher discipline issues have stricter consequences, and a researcher who misbehaves may be banned from the platform.

Disclosure agreements

To submit vulnerabilities via these platforms, researchers have to agree not to disclose the vulnerability without approval from your company.

Triage services

Both platforms also provide triage services, which I highly recommend paying for. These are the first line of defense for your internal security resources. These globally distributed teams will help clean up researcher reports, mark submissions as duplicates, and filter out low-quality reports.

Management services

These companies also serve as a knowledge base for you to learn about running a program and ask questions. You can bounce ideas off of someone that works at a company running hundreds of programs.

Integrations

Platforms have structured input with required fields and integrations with popular tools, like Jira. These make it much easier to turn a submission into a ticket for your engineering org.

Private programs

For most companies, it isn’t possible to run a private program without the help of a bug bounty platform.

We’ll talk about private programs in more depth later, but this is the recommended starting point for companies launching a bug bounty for the first time.

All of the above features free your team to focus on the security challenges unique to your business.

Laying the groundwork

Having a successful program starts with a good foundation, and it’s your job as a program owner to help set your organization up for success.

Vulnerability management

Think about your current process for handling a security vulnerability. What happens when someone internally or externally finds a security bug?

You will inevitably get more of these after starting a bug bounty program, so you must have a good way to handle these reports.

Your vulnerability process doesn’t have to be perfect, but you should have a way to prioritize and assign bug reports to the appropriate engineering team without high overhead.

Engineering buy-in

As someone starting a program, you’ll also need to get buy-in from your Engineering org. They are the ones that will have to fix the issues and will likely be the ones responding to alerts triggered by researchers.

Your alerting story doesn’t need to be perfect. But you also don’t want engineers to be woken up every time someone triggers an error because some input validation was working correctly and stopped a researcher from submitting a < symbol into an email address field.

Remember, your team doesn't have to fix every valid vulnerability immediately.

Vulnerabilities are just bugs and should be prioritized appropriately. If you’re having trouble getting your engineering org to fix critical vulnerabilities in a timely manner, you may want to direct your efforts to job-hunting instead of starting a bug bounty program 🙃

Getting started

Once your organization is bought-in, you can focus on getting things ready for the researchers.

Bounty Brief

Your bounty brief is what researchers will read to determine if they’re interested in working on your program. It’s part instructions, part rules, and part advertisement.

Keep in mind you’re competing for researchers’ time; they don’t have to work on your program when there are so many other programs.

Your bounty brief should be clear, concise, and should set expectations with the researchers. You can find the Segment program hosted on Bugcrowd.

Scope

Where do you want researchers to focus their testing? What’s in scope? What’s out of scope?

I recommend starting with assets that are easy for the researchers to access, ideally something free, that anyone can sign up for.

You should also try to pick a target that has at least medium size, complexity, and business impact. This will help you show value early, which will help you expand the program. 

Access

How do researchers get access to the scope? Are there docs they can read to help them get up to speed? We instruct researchers to sign up for our app using their @bugcrowdninja.com email address and to include -bugcrowdninja as part of their workspace slug. 

This makes it easier for us to determine if someone is part of our Bugcrowd program when we review logs and alerts. If we notice someone causing problems in our app, we can ask Bugcrowd to provide researcher coaching.

Severity

How are you going to rate submissions? Consistent severity is important because it impacts how much the researcher gets paid for a submission. HackerOne uses Mitre’s Common Weakness Enumeration (CWE) and Bugcrowd uses the Bugcrowd Vulnerability Rating Taxonomy (VRT).

Rewards

How much are you going to pay for vulnerabilities? Researchers need to know upfront how much you’ll pay for vulnerabilities so they can assess if it is worth their time to hunt on your program. 

Think about using different reward ranges for different assets. This can help control costs and also helps researchers understand which targets are more important. For example, we describe specific objects that will net a higher reward:

https://bugcrowd.com/segment

Swag

A handful of years ago, getting a T-shirt as a reward was pretty standard. I’d strongly encourage anyone thinking about running a swag-based bounty to reconsider.

T-shirts don’t pay rent and are more work for your team than sending a researcher money. What do you do when that T-shirt you sent to another continent is lost in the mail or doesn’t fit?

We reserve swag for our top performers. Sending a T-shirt requires the researcher to trust us enough to give us their address and requires me to go to the post office.

Safe harbor

Researchers want to know that they’ll be safe from prosecution and legal action, and exempt from your end-user license agreement (EULA), when operating within the rules of your bounty brief. Check out disclose.io for more info.

Take the time to explain what a bug bounty is, why it's important, and have a few examples from recognizable organizations ready to show them.

Understand the platform

Learn a little bit about the platform you’re using. Your actions on the platform impact the researcher. If you mistreat researchers, they will go elsewhere; without researchers, your program isn’t providing value to your organization. 

The same report status can have different meanings and impacts on different platforms.

For example, on HackerOne Not Applicable reduces the researcher’s site-wide score by 5 points, and should be used for reports that don’t contain a valid issue.

On Bugcrowd, Not Applicable does not impact the researcher’s score, and is commonly used for reports that should be neither accepted nor rejected. To achieve this result on HackerOne, you would use the Informative status.

If you have any questions about the platform you’re using, I strongly recommend reviewing documentation or reaching out to your account manager for help.

Start Small

Regardless of how big your company’s internet footprint is, you can start with a small scope open only to a handful of individuals as part of a private program.

In mid-2017, Segment was running a private program with 25 researchers and a single target: our app.

The early researchers invited will be some of the platform’s most trusted, and they will generally be more accepting of companies that are learning how to manage a program, as long as you pay them fairly and treat them with respect.

Starting small allows your organization to learn how to run a program in a safer environment. If your vulnerability management program has some gaps, you can fix them; if your bounty brief is unclear, you can rewrite it; if your alerts aren’t tuned properly, you can invest time into improving them. If you need to pause your program, you can relaunch later with a less negative impact.

Even while we had a private program, we would direct researchers that reached out via email to our Bugcrowd program. This allowed us to receive the benefits of the platform and triage services for all submissions before running a public program.

It’s much easier to explain to a researcher why you won’t be paying for a low-effort submission when you have a prioritization system established and enforced by a third party.

Growth over time

Like any multi-year project, your bug bounty will evolve and require ongoing effort to keep it healthy.

Additional scope and researchers

Researchers love testing new features and assets; in most bug bounty programs, only the first person to find a vulnerability receives a monetary reward.

If you started with a small scope, your program is steady, and you’re ready for more submissions, this is a great time to add more targets to your brief. 

As you add scope, keep in mind that not all assets are of equal value to an adversary. Consider specifying different reward ranges for different assets based on their security maturity and value.

Open scope

Over time, you should also consider an “open scope” bounty if it is appropriate for your organization. We have listed “Any host/web property verified to be owned by Segment (domains/IP space/etc.)” as a target, which serves as a catch-all for anything not explicitly listed in our “In Scope” or “Out of Scope” sections.

Having an open scope bounty is enticing to researchers. Not only does it show you take running a bug bounty program seriously; it also shows that, regardless of where they find a vulnerability, it will likely be rewarded (assuming it is valid, unique, and not out of scope).

Many researchers specialize in finding forgotten internet-facing assets as part of open-scope programs, and have developed their own infrastructure to identify assets and vulnerabilities to be able to efficiently earn rewards.

It’s also worth noting that there is no scope for an attacker trying to compromise your company’s security. Working towards an open scope means that it is more likely a bug bounty researcher will find and report a vulnerability before an attacker exploits it.

Early access

Over time, you’ll build trust and form relationships with particular researchers. These are great people to give early access to upcoming features. Many times, these features require manual provisioning, making them less suitable for wide-scale testing.

Early access is a mutually beneficial system in which you will receive security vulnerabilities prior to release, which makes them easier to fix. Researchers will be able to test features with less competition, which makes them more likely to earn a reward and continue testing on your program.

If the effort to set up these features is medium or higher, consider paying the researcher a grant before they start working.

Clearly communicate what you're looking for and what you expect from them. When offering a researcher grant, we want to see a short write-up of what didn't work in addition to any findings they submit. Rewardable findings should be paid in addition to the grant.

Public program

Once you’ve been running a program for a while and are confident in your company’s ability to receive vulnerabilities from the global researcher community, you should consider evolving it into a public program. 

If you don’t have a wide scope, this is a great time to revisit that decision.

Maximizing your scope (while private) will reduce the uptick in submissions when your program is launched publicly. You should also invite as many researchers as possible to your private program before going public for the same reason.

Because public programs are open to anyone, you will inevitably receive testing from a lot of newer folks that will pay less attention to your bounty brief, so having a wide scope helps in this regard as well.

Segment has run a public bug bounty program since late 2018, roughly 18 months after launching our private program.

Train your triagers

Hopefully over time, you will think of your outsourced triage team as an extension of your internal team. Spending the time to let them know how you want your program run will pay dividends in the future. Any submission they can validate without asking your team questions saves time for everyone involved.

Here are some examples of guidance we’ve given to the Bugcrowd triage team:

Identify duplicates for non-rewardable submissions

Many programs do not bother to mark informational, out of scope, or other non-rewardable submissions as duplicates. We do this for two reasons: 

The first is that if we decide to fix one of these issues later, we can go back and mark the original submission as resolved and pay the researcher. Any duplicates of this issue will still receive points. 

The second is that when a tool commonly used by bug bounty hunters produces a false positive, you will see that finding submitted to your program a lot.

If we repeatedly see an out-of-scope or not-reproducible submission, we can add a specific item to our bounty brief warning researchers that it will be marked as out of scope or not reproducible without a working proof of concept.

Don’t be afraid to deduct points for undesired behavior

While we are generally laid-back and understanding program owners, we aren’t afraid to deduct points from a researcher’s score when it’s warranted.

Many programs shy away from deducting points, but we want to ensure participants in our program thoroughly read our brief, and we think it helps the larger bug bounty community to slightly penalize those that disregard clearly documented rules.

Two of the common arguments against bug bounty programs are that the submissions are often low-value and that researchers don’t respect scope.

For example, we have a very small out-of-scope section, which includes: CORS or crossdomain.xml issues on api.segment.io without proof of concept.

This finding, identified by a tool commonly used by bug bounty participants, is one we have received dozens of times, but never with any impact.

We do this to save time for both researchers and our triage team. If a researcher submits this finding without a proof of concept, we encourage Bugcrowd to mark this as out-of-scope. If a researcher submits a finding that showed impact, we would be more than happy to reward, fix, and update our brief.

Explain deviations in ratings

If you need to deviate from the baseline rating established in your bounty brief, take the time to explain to the researcher why the rating and reward are higher or lower than they might expect. 

Researchers are generally understanding, as long as your rating, reward, and explanation are fair and make sense. If you find yourself commonly deviating from the ratings, it may be time to make changes to your bounty brief so that researchers know what to expect in advance. If you make severity or scope changes as the result of a submission, reward the researcher at whichever rate is more favorable to them.

In addition to explaining why something was rated lower than the baseline, take the time to explain why something was rated higher than the baseline. This is a great way to encourage further testing in these areas and is a great way to build trust with a researcher.

These explanations also help your triage team learn more about your program, and allow them to more accurately triage future submissions.

Build and maintain good relationships with researchers

Take time to build relationships and trust with researchers, especially those that repeatedly submit to your program. Treat researchers fairly, with respect, and consider paying for anything that brings value to your organization.

You’re competing for a researcher’s time, especially the ones that are the most talented. They can likely work on almost any bug bounty program available; think about ways you can encourage them to work on yours.

Keep in mind that all researchers, even those that are unskilled, are human beings. Assume that they want to help you secure your organization, learn more about security and technology, and get paid.

Pay for anything that brings value

If there is one sentence you remember from this blog, I hope it is “pay for anything that brings value.” 

Bug bounty hunters put in a lot of time and effort that doesn’t result in getting paid. This could be time spent developing tooling, hunting without finding any bugs, or having a valid bug marked as a duplicate.

Try to avoid thinking about individual bug costs. Instead, think about the overall value the program brings to your organization in terms of bugs found, time saved, and peace of mind. If you’re debating between two severities, pick the higher one and pay the researcher at that rate. You can always change the severity in your internal ticketing system later.

Once you’ve received a rewardable submission, try to triage and pay quickly. Sometimes determining the full impact takes time; if this is the case, add a comment letting the researcher know you appreciate their work but need some extra time to determine the appropriate reward.

Work collaboratively with the researcher

As an employee of your company, you should know more about the codebase and infrastructure than a security researcher outside your organization (although occasionally I question this based on the creative and impactful submissions we receive 😅). 

Sometimes when running a bug bounty program, you’ll get a submission that makes you think, “What’s the next level the researcher could get to?” If this is a researcher you trust, it may be appropriate to give them some hints to help further their testing. If you give them hints, you can also issue some cautionary advice to help them continue in a way that is safe for your organization and customers. 

Giving the researcher hints helps show them you value their testing and saves your team from spending time on something that may not be possible. If the hint is helpful, the researcher will be submitting a higher-severity finding, which positively impacts their researcher score and earns a higher monetary reward. It also allows you to get the vulnerability fixed faster due to the higher severity.

Sometimes, it isn’t appropriate to include the researcher in this phase of the process. If our team continues the investigation, and it leads to the discovery of a higher-impact weakness in our systems, we reward the researcher as if their report contained the full impact. We also explain why we paid them at this higher rate, but let them know we are unable to share the details. This is a great way to show the researcher you value their work and build trust.

Share progress with the researcher

If a researcher submits a vulnerability that leads to a systemic fix for a vulnerability class, share this with them! Researchers are generally excited to hear that their work led to meaningful change within your organization. It also is a cue for them to attempt to bypass the new protections. 

Pay for Dupes

At Segment, we commonly pay researchers for well-written duplicates, and frequently reach out to let them know that we appreciated their submission. We also let them know that we don’t always pay for duplicates to make sure that expectations are set appropriately. 

This has worked out incredibly well for us. All of our most critical submissions have come from researchers that were originally rewarded for a well-written duplicate. Segment is a complex product that takes time to set up and fully understand. Researchers that put in the effort to fully set up a Segment workspace have additional context and understanding that take time to acquire—these people are hard to replace, and you want to keep them happy.

Pay bonuses for well-written reports

We also pay extra for well-written reports. Valid submissions need to get turned into Jira tickets which are assigned to engineering teams. Reports that are concise, easy to follow, have clear impact, and are well-formatted take less time for us to turn into tickets. We want to encourage researchers to save us time so we make sure to reward appropriately and let them know that we appreciate their efforts.

Final thoughts

Running a successful bug bounty program requires consistent effort from your team, but can bring tremendous value to your company and customers. Any vulnerability reported and fixed is one fewer vulnerability an attacker could use to get a foothold in your organization. Bug bounty submissions can help illuminate vulnerability trends, which can help prioritize where you spend resources to fix systemic issues in your applications or infrastructure.

Bug bounty programs are people-powered. Spend the time to make those involved in your program feel valued, help them understand the motivations behind your decisions, and be excellent to each other!

Top tips

Thanks for taking the time to read my post! I hope you learned a few things to help your company run a successful program. Here are some of my top tips to reference later:

  • Pay for anything that brings value

  • Pay extra for well-written reports, even if they’re dupes

  • Avoid thinking about individual bug costs

  • Reward quickly

  • Partner with a bug bounty platform and pay for triage services

  • If you make changes to your bounty brief as the result of a submission, reward the researcher at the more favorable rate

  • Invest time into building and maintaining relationships with your researchers and triage team

  • Don’t be afraid to deduct points for bad behavior

  • Start small and partner early with Engineering

  • Write a clear and concise bounty brief to set expectations with the researchers

Thank you! 💚💙

A special thanks to the Segment Engineering organization for fixing vulnerabilities and responding to alerts.

To Edis from Bugcrowd for helping us triage three years of vulnerabilities and truly being an extension of our team.

To all the researchers that have helped keep our customers safe by looking for vulnerabilities as part of our program.

And finally, to researchers danieloizo and sheddow, you have both submitted numerous well-written and high impact findings and are an absolute pleasure to work with.

Calvin French-Owen on October 17th 2019

How we managed to reduce our infrastructure cost by 30%. And how you can too.

Rick Branson, Collin Van Dyck on June 25th 2019

This is the story of how we built ctlstore, a distributed multi-tenant data store that features effectively infinite read scalability, serves queries in 100µs, and can withstand the failure of any component.

Highly-reliable systems need highly-reliable data sources. Segment’s stream processing pipeline is no different. Pipeline components need not only the data that they process, but additional control data that specifies how the data is to be processed. End users configure settings in a UI or via our API, which in turn manipulates the behavior of the pipeline.

In the initial design of Segment, the stream processing pipeline was tightly coupled to the control plane. Stream processors would directly query a set of control plane services to pull in data that directs their work. While redundancy generally kept these systems online, it wasn’t the 5-9s system we were aiming for. A common failure mode was a stampede of traffic from cold caches or code that didn’t cache at all. It was easy for developers to do the wrong thing, and we wanted to make it easy to do the right thing.

To better separate our data and control planes, we built ctlstore (pronounced “control store” or “cuttle store” as some like to call it), a multi-tenant distributed control data store that specifically addresses this problem space.

Low Latency & Read Scalability

At the center of the read path is a SQLite database called the LDB, which stands for Local Database. The LDB has a full copy of all of the data in ctlstore. This database exists on every container instance in our fleet, the AWS EC2 instances where our containerized services run. It’s made available to running containers using a shared mount. SQLite handles cross-process reads well with WAL mode enabled so that readers are never blocked by writers. The kernel page cache keeps frequently read data in memory. By storing a copy of the data on every instance, reads are low latency, scale with the size of the fleet, and are always available.

A daemon called the Reflector, which runs on each container instance, continuously applies a ledger of sequential mutation statements to the LDB. This ledger is stored in a central MySQL database called the ctldb. These ledger entries are SQL DML and DDL statements like REPLACE and CREATE TABLE.  The LDB tracks its position in the ledger using a special table containing the last applied statement’s sequence number, which is updated transactionally as mutation statements are applied. This allows resuming the application of ledger statements in the event of a crash or a restart.
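
To make the mechanics concrete, here is a toy sketch of that resumable apply loop, written in JavaScript with better-sqlite3 purely for illustration; ctlstore itself doesn't work exactly this way, and the path, tracking table, and column names here are made up.

const Database = require('better-sqlite3')

const ldb = new Database('/var/ctlstore/ldb.db') // path is illustrative
ldb.pragma('journal_mode = WAL') // readers are never blocked by the writer

// hypothetical tracking table holding the last ledger sequence number applied
const getSeq = ldb.prepare('SELECT seq FROM last_applied')
const setSeq = ldb.prepare('UPDATE last_applied SET seq = ?')

// apply a batch of ledger statements and advance the sequence number in one
// transaction, so a crash or restart simply resumes from the last committed position
const applyBatch = ldb.transaction((statements) => {
  for (const stmt of statements) {
    ldb.exec(stmt.sql) // REPLACE / CREATE TABLE / etc. pulled from the ledger
  }
  setSeq.run(statements[statements.length - 1].seq)
})

// poll the central ledger (ctldb) for statements after our current position
async function pollOnce(fetchLedgerAfter) {
  const { seq } = getSeq.get()
  const statements = await fetchLedgerAfter(seq) // e.g. a SELECT against the ctldb replica
  if (statements.length > 0) applyBatch(statements)
}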

The implication of this decoupling is that the data at each instance is usually slightly out-of-date (by 1-2 seconds). This trade-off of consistency for availability on the read path is perfect for our use cases. Some readers do want to monitor this staleness. The reader API provides a way to fetch an approximate staleness measurement that is accurate to within ~5 seconds. It sources this information from the timestamp attached to ledger statements which indicates when the statement was inserted. A heartbeat service sends a mutation every few seconds to ensure there’s always a relatively fresh ledger statement for measurement purposes.

The ctldb is an AWS Aurora cluster. Reflectors connect to and poll one of the cluster’s read replicas for new ledger statements every second (with some jitter). A publish-subscribe model would be more efficient and lower latency, but polling was simple and ended up being quite suitable for our use case. Scaling up our current measurements, a single Aurora cluster should be able to support tens of thousands of Reflectors at once. Aurora clusters can be connected together using the MySQL replication protocol, which would support scaling beyond a single cluster’s limitations, implementing multi-region support efficiently, and even multi-cloud if that is ever in the cards.

Data Model

ctlstore exposes a relational model to readers. Control data is stored as rows in tables that have a defined schema, and these tables are grouped into families. A Reader library, which wraps access to the LDB, provides primary key oriented queries. This layer allows potentially switching the underlying implementation of the read path, and focuses queries on those that will be efficient for production use cases. Currently, only primary key queries are supported, but adding secondary key support is being considered.

Getting Data In

During design, we knew we wanted to pull in data from many systems of record, so a single monolithic source of truth was off the table. This means the master records for ctlstore data actually live outside of the system itself, in an origin. A loader ingests a change stream from the origin and applies these changes to ctlstore via the HTTP API exposed by the Executive service.

In practice, it’s a bit more complicated. Our production setup uses the open-source change data capture system Debezium. Debezium streams MySQL replication logs from the origin database and emits the changes in JSON format to a Kafka topic. A loader process consumes this topic, batches the changes, and applies them to ctlstore. The HTTP API provides a transactional offset tracking mechanism alongside the write path to ensure exactly-once delivery. In ctlstore, all mutations are either “upserts” or deletes, so replays are idempotent.
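
As a rough sketch of why replays are safe: every mutation is an upsert or delete keyed by primary key, and the consumer offset travels with the batch so it can be committed alongside the writes. The endpoint, field names, and payload shape below are assumptions for illustration, not the real Executive API.

// sketch of a loader applying one batch of Debezium-style changes to ctlstore
async function applyBatch(executiveUrl, batch) {
  const mutations = batch.changes.map((change) =>
    change.deleted
      ? { op: 'delete', table: change.table, key: change.key }
      : { op: 'upsert', table: change.table, key: change.key, values: change.values }
  )

  // the Kafka offset rides along with the writes; the write path records it
  // transactionally, so re-sending this batch after a crash is harmless
  const res = await fetch(executiveUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ offset: String(batch.lastOffset), mutations }),
  })
  if (!res.ok) throw new Error(`ctlstore write rejected: ${res.status}`)
}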

To ensure that data passes integrity checks, ledger statements are applied to the ctldb as they’re inserted into the ledger. This is done transactionally so that a failure will rollback the ledger insert atomically. For example, if a string failed to validate as UTF-8, it would be rejected, preventing bad ledger entries from halting ledger processing at the Reflector side. This safety mechanism caught an early bug: field names in the ledger statements weren’t being escaped properly, and a developer used the name “type” for a field, a reserved word. MySQL rejected this table creation statement as invalid before it poisoned the ledger.

ctlstore is a multi-tenant system, so it is necessary to limit resource usage to protect the health of the system overall. A large influx of mutations would not only crowd out other writers, but could also have damaging effects across our fleet. To avoid this, there are limits on the rate of mutations over time for each loader. The other resource to manage is disk usage. LDB space is precious because every instance must store a full copy of the data. The Executive service monitors disk usage for each table, alerts when a soft limit is reached, and enforces a hard limit once a table reaches a certain size. Both rate limits and table size limits can be adjusted on a per-resource basis.

Schema Management

ctlstore exposes a relational model, and as such, it requires setting up and managing a schema. We chose a structured, relational approach as opposed to a semi-structured, document approach to eliminate various edge cases that would lead to incorrect behavior and/or failures in production. Teams share the same tables so a schema helps developers understand the data they are handling.

Schema is managed using the HTTP API exposed by the Executive service. Endpoints are available for creating families and tables as well as adding fields to existing tables. Fields are specified with simple types (string, bytestring, integer, decimal, text, and binary) that map to compatible MySQL and SQLite internal types. Due to constraints of the underlying databases, only a subset of types are supported as primary key fields: string, bytestring, and integer.

Removing tables is currently not supported by the API, to prevent inadvertent disasters. We’re considering a safe way to implement this functionality. Removing fields will likely never be supported due to the implications downstream, such as breaking existing production deployments that depend on that field.

Snapshots & Bootstrapping

Our primary compute fleet is constantly churning. Instances are coming and going all the time. Typically, an instance lasts less than 72 hours. One of the requirements for ctlstore is that a freshly launched instance can be “caught up” within minutes. Replaying the entire ledger from the beginning wouldn’t cut it.

So instead, new instances bootstrap themselves by pulling a snapshot of the data from S3. The snapshot is just an LDB frozen in time. Using the same mechanism for crash recovery, a freshly booted Reflector can “resume” processing of the ledger from the snapshot. Once enough of the mutation statements have been applied, the container instance is marked as caught up. Services can specify that their tasks are only scheduled on container instances which are caught up.

Snapshots are constructed continuously by a dedicated service called the Supervisor. The Supervisor builds its own private LDB by running an internal Reflector instance. It pauses periodically to create a new snapshot. This process involves flushing the WAL to ensure all writes are captured, vacuuming the private LDB file to trim any extra unused space, compressing it to reduce its size, and uploading it to S3.

Consistency Model

Deciding which consistency model fits a system is complicated. In terms of the CAP theorem, ctlstore is a CP system because writes go offline if the ctldb fails or is partitioned. A copy of all of the data runs on every node, so reads stay available even in the face of the most severe partitions.

In terms of data consistency and isolation, it’s a bit hard to pin down. MySQL provides REPEATABLE READ isolation and SQLite provides SERIALIZABLE isolation, so which is it? Perhaps it’s neither, because ctlstore doesn’t provide similar read-write transactional semantics. Transactions are either a batch of writes or a single read operation.

What we do know is that ctlstore has the following high-level consistency attributes:

  • It has no real-time constraint, so readers can read stale data.

  • All ledger statements are applied in-order, so all readers will eventually observe all writes in the same order.

  • Batches of mutations are atomic and isolated.

  • All readers observe the latest committed data (there are no multi-read transactions).

  • Readers never encounter phantom reads.

ctlstore applies batched mutations transactionally, even down to the LDB. The ledger contains special marker statements that indicate transactional boundaries. In theory this provides strong isolation. In practice, Debezium streams changes outside of transactional context, so they’re applied incrementally to ctlstore. While they usually wind up within the boundaries of a batch, upstream transactions can and do straddle batches applied to ctlstore. So while ctlstore provides this isolation, in use we aren’t currently propagating transactional isolation from the origin to the reader.

Future Work

Here are some of the things we’re eyeing for the future:

  • We’re currently experimenting with a “sidecar” read path that uses RPC instead of accessing the LDB directly. This could make it simpler to interface with ctlstore on the read path.

  • Currently ledger statements are kept forever. Ledger pruning might be necessary in the future to keep the ledger compact. This is complicated to implement in the general case, but there are some classes of ledger statements that would be low-hanging fruit, such as heartbeat entries generated by our monitoring system.

  • No data or schema inspection is exposed via the HTTP API. Reads via the HTTP API would be consistent with the write path, making it possible to implement systems that use ctlstore as their source of truth. Schema inspection helps developers understand the system, and should be coming soon.

  • While we don’t anticipate this anytime soon, it might be necessary in the future to split up the LDB into groups of table families. Each cluster of container instances would be able to “subscribe” to a subset of families, limiting the amount of disk and memory for cache required.

  • As mentioned above, secondary indices might become very valuable for some use cases. MySQL and SQLite both support them, but in general we are very conservative on the read path, to protect the key performance characteristics of ctlstore.

  • While we’d prefer that the shared mount be read-only, it is currently mounted read-write: one of our early users experienced intermittent read errors due to an obscure bug in SQLite, and switching the mount to read-write was the workaround. In the future we’d love to find a way to switch this back to a read-only mount for data integrity purposes.

In Closing

ctlstore is a distributed, relational data store intended for small-but-critical data sets. While we’re still in the process of transitioning many data sets to ctlstore, it is now a hardened, production, business-critical system deployed for a number of use cases. We’ve layered systems on top of ctlstore, such as flagon, our feature flagging system. The architecture allows sourcing data from multiple systems of record, critical for adoption across teams. Developers no longer need to be concerned with read scalability or availability of their control data. It has been incredibly reliable in practice — we have yet to experience downtime on the read path.

We’d like to thank the trailblazing people that were involved in the early testing and deployment of ctlstore: Ray Jenkins, Daniel St. Jules, Archana Ramachandran, and Albert Strasheim.
