
Leif Dreizler on March 31st 2020

Segment receives billions of events every day from thousands of customers that trust Segment to keep their data safe. At Segment, we believe that good security is an essential part of building high-quality software, similar to reliability or scalability. In addition to the tools and processes developed by our Security Org to help software engineers make good security choices, we also rely on reviews from both traditional and crowdsourced security companies.

Bug bounties are often seen as a significant burden for the security teams that run them. You’ll hear horror stories about companies launching a bug bounty only for their security team to be inundated with low-quality reports, duplicate submissions, and researchers going outside the scope of the program.

Shortly after, you'll hear about their engineering team being overwhelmed with newly discovered vulnerabilities to fix. From those that survive the opening salvo, you may hear complaints that, over time, they have stopped receiving impactful submissions.

A few years ago, when the space was less mature, critics questioned whether running a program was worth it. Now, it is expected that organizations of a certain size and maturity run a bug bounty program.

In late 2019, the Cybersecurity and Infrastructure Security Agency (CISA) of the United States published a draft directive recommending all agencies develop a mechanism to accept information related to security vulnerabilities from external parties.

In this blog post, we’ll break down how to start and manage a bug bounty program, consistently achieve good results, and maintain healthy relationships with the people that power the program.

If you’re short on time, check out the “Top Tips” section at the bottom of this post.

What’s a bug bounty?

A bug bounty program is like a Wanted Poster for security vulnerabilities. 

Companies running bug bounty programs pay independent security researchers from across the world for security vulnerabilities in their own products and infrastructure.

Image source: https://scottschober.com/bug-bounties-created-equal/

Bug bounty programs are run by hundreds of organizations, including Mozilla, Fiat Chrysler, the U.S. Department of Defense, and Segment.

Why bug bounty?

I assume most readers are already bought into the benefits of running a bug bounty.

If you’re on the fence, there are tons of independent articles, marketing docs from companies like HackerOne and Bugcrowd, and recorded conference talks, so I’m not going to cover this in detail. 

Most companies that have an internet presence or make an internet-connected device should consider running a bounty, or at least have a way for security researchers to report security issues.

It is also part of the Vendor Security Alliance questionnaire, so it may be something your customers ask you about if you are in the B2B space. 

If you don’t have a way for researchers to report issues, they will email people at your company (or any alias they can find on your website), connect with you on LinkedIn, or just tweet about the issues they think they’ve found.

It’s a much better experience for researchers, your employees, and your customers if you give the security community a clear avenue to report vulnerabilities.

Your security and engineering orgs will be regularly impressed by the creativity of the researcher community. These are people that, without internal knowledge, can find critical vulnerabilities in organizations across the world. 

Why use a bug bounty platform?

I strongly recommend using a bug bounty platform like HackerOne or Bugcrowd (we use Bugcrowd here at Segment) to help manage this process. These companies provide a platform and services to help run an efficient program.

Severity baselines

Severity baselines make it easier to tell how serious a vulnerability is, and how much time you should be spending on review and remediation.

When running a program on your own, you’ll frequently have researchers overhyping vulnerabilities. Platforms have a guided submission form, which helps researchers pick the appropriate category and rating.

Reputation system

The reputation system rewards researchers that accurately rank vulnerabilities and creates a competitive environment that benefits both researchers and program owners.

It also helps reinforce good behavior: discipline issues carry stricter consequences than they would on a self-run program, and a researcher who misbehaves may be banned from the platform entirely.

Disclosure agreements

To submit vulnerabilities via these platforms, researchers have to agree not to disclose the vulnerability without approval from your company.

Triage services

Both platforms also provide triage services, which I highly recommend paying for. These are the first line of defense for your internal security resources. These globally distributed teams will help clean up researcher reports, mark submissions as duplicates, and filter out low-quality reports.

Management services

These companies also serve as a knowledge base for you to learn about running a program and ask questions. You can bounce ideas off of someone that works at a company running hundreds of programs.

Integrations

Platforms have structured input with required fields and integrations with popular tools, like Jira. These make it much easier to turn a submission into a ticket for your engineering org.

Private programs

For most companies, it isn’t possible to run a private program without the help of a bug bounty platform.

We’ll talk about private programs in more depth later, but this is the recommended starting point for companies launching a bug bounty for the first time.

All of the above features free your team to focus on the security challenges unique to your business.

Laying the groundwork

Having a successful program starts with a good foundation, and it’s your job as a program owner to help set your organization up for success.

Vulnerability management

Think about your current process for handling a security vulnerability. What happens when someone internally or externally finds a security bug?

You will inevitably get more of these after starting a bug bounty program, so you must have a good way to handle these reports.

Your vulnerability process doesn’t have to be perfect, but you should have a way to prioritize and assign bug reports to the appropriate engineering team without high overhead.

Engineering buy-in

As someone starting a program, you’ll also need to get buy-in from your Engineering org. They are the ones that will have to fix the issues and will likely be the ones responding to alerts triggered by researchers.

Your alerting story doesn’t need to be perfect. But you also don’t want engineers to be woken up every time someone triggers an error because some input validation was working correctly and stopped a researcher from submitting a < symbol into an email address field.

Remember, your team doesn't have to fix every valid vulnerability immediately.

Vulnerabilities are just bugs and should be prioritized appropriately. If you’re having trouble getting your engineering org to fix critical vulnerabilities in a timely manner, you may want to direct your efforts to job-hunting instead of starting a bug bounty program 🙃

Getting started

Once your organization is bought-in, you can focus on getting things ready for the researchers.

Bounty Brief

Your bounty brief is what researchers will read to determine if they’re interested in working on your program. It’s part instructions, part rules, and part advertisement.

Keep in mind you’re competing for researchers’ time; they don’t have to work on your program when there are so many other programs.

Your bounty brief should be clear, concise, and should set expectations with the researchers. You can find the Segment program hosted on Bugcrowd.

Scope

Where do you want researchers to focus their testing? What’s in scope? What’s out of scope?

I recommend starting with assets that are easy for the researchers to access, ideally something free, that anyone can sign up for.

You should also try to pick a target that has at least medium size, complexity, and business impact. This will help you show value early, which will help you expand the program. 

Access

How do researchers get access to the scope? Are there docs they can read to help them get up to speed? We instruct researchers to sign up for our app using their @bugcrowdninja.com email address and to include -bugcrowdninja as part of their workspace slug. 

This makes it easier for us to determine if someone is part of our Bugcrowd program when we review logs and alerts. If we notice someone causing problems in our app, we can ask Bugcrowd to provide researcher coaching.

Severity

How are you going to rate submissions? Consistent severity is important because it impacts how much the researcher gets paid for a submission. HackerOne uses Mitre’s Common Weakness Enumeration (CWE) and Bugcrowd uses the Bugcrowd Vulnerability Rating Taxonomy (VRT).

Rewards

How much are you going to pay for vulnerabilities? Researchers need to know upfront how much you’ll pay for vulnerabilities so they can assess if it is worth their time to hunt on your program. 

Think about using different reward ranges for different assets. This can help control costs and also helps researchers understand which targets are more important. For example, in our bounty brief (https://bugcrowd.com/segment) we describe specific objects that will net a higher reward.

Swag

A handful of years ago, getting a T-shirt as a reward was pretty standard. I’d strongly encourage anyone thinking about running a swag-based bounty to reconsider.

T-shirts don’t pay rent and are more work for your team than sending a researcher money. What do you do when that T-shirt you sent to another continent is lost in the mail or doesn’t fit?

We reserve swag for our top performers. Sending a T-shirt requires the researcher to trust us enough to give us their address and requires me to go to the post office.

Safe harbor

Researchers want to know that they’ll be safe from legal action and prosecution, and exempt from your end-user license agreement (EULA), when operating within the rules of your bounty brief. Check out disclose.io for more info.

Take the time to explain what a bug bounty is and why it’s important, and have a few examples from recognizable organizations ready to show.

Understand the platform

Learn a little bit about the platform you’re using. Your actions on the platform impact the researcher. If you mistreat researchers, they will go elsewhere; without researchers, your program isn’t providing value to your organization. 

The same report status can have different meanings and impact on different platforms.

For example, on HackerOne Not Applicable reduces the researcher’s site-wide score by 5 points, and should be used for reports that don’t contain a valid issue.

On Bugcrowd, Not Applicable does not impact the researcher’s score, and is commonly used for reports that should be neither accepted nor rejected. To achieve the same result on HackerOne, you would use the Informative status.

If you have any questions about the platform you’re using, I strongly recommend reviewing documentation or reaching out to your account manager for help.

Start Small

Regardless of how big your company’s internet footprint is, you can start with a small scope open only to a handful of individuals as part of a private program.

In mid-2017, Segment was running a private program with 25 researchers and a single target: our app.

The early researchers invited will be some of the platform’s most trusted, and they will generally be more accepting of companies that are learning how to manage a program, as long as you pay them fairly and treat them with respect.

Starting small allows your organization to learn how to run a program in a safer environment. If your vulnerability management program has some gaps, you can fix them; if your bounty brief is unclear, you can rewrite it; if your alerts aren’t tuned properly, you can invest time into improving them. If you need to pause your program, you can relaunch later with a less negative impact.

Even while we had a private program, we would direct researchers that reached out via email to our Bugcrowd program. This allowed us to receive the benefits of the platform and triage services for all submissions before running a public program.

It’s much easier to explain to a researcher why you won’t be paying for a low-effort submission when you have a prioritization system established and enforced by a third party.

Growth over time

Like any multi-year project, your bug bounty will evolve and require ongoing effort to keep it healthy.

Additional scope and researchers

Researchers love testing new features and assets; in most bug bounty programs, only the first person to find a vulnerability receives a monetary reward, so fresh scope means fewer duplicates and a better chance of getting paid.

If you started with a small scope, your program is steady, and you’re ready for more submissions, this is a great time to add more targets to your brief. 

As you add scope, keep in mind that not all assets are of equal value to an adversary. Consider specifying different reward ranges for different assets based on their security maturity and value.

Open scope

Over time, you should also consider an “open scope” bounty if it is appropriate for your organization. We list “Any host/web property verified to be owned by Segment (domains/IP space/etc.)” as a target, which serves as a catch-all for anything not explicitly listed in our “In Scope” or “Out of Scope” sections.

Having an open scope bounty is enticing to researchers. Not only does it show you take running a bug bounty program seriously, it also shows that regardless of where they find a vulnerability, it will likely be rewarded (assuming it is valid, unique, and not out of scope).

Many researchers specialize in finding forgotten internet-facing assets as part of open-scope programs, and have developed their own infrastructure to identify assets and vulnerabilities to be able to efficiently earn rewards.

It’s also worth noting that there is no scope for an attacker trying to compromise your company’s security. Working towards an open scope means that it is more likely a bug bounty researcher will find and report a vulnerability before an attacker exploits it.

Early access

Over time, you’ll build trust and form relationships with particular researchers. These are great people to give early access to upcoming features. Many times, these features require manual provisioning, making them less suitable for wide-scale testing.

Early access is a mutually beneficial system in which you will receive security vulnerabilities prior to release, which makes them easier to fix. Researchers will be able to test features with less competition, which makes them more likely to earn a reward and continue testing on your program.

If the effort to set up these features is medium or higher, consider paying the researcher a grant before they start working.

Clearly communicate what you're looking for and what you expect from them. When offering a researcher grant, we want to see a short write-up of what didn't work in addition to any findings they submit. Rewardable findings should be paid in addition to the grant.

Public program

Once you’ve been running a program for a while and are confident in your company’s ability to receive vulnerabilities from the global researcher community, you should consider evolving it into a public program. 

If you don’t have a wide scope, this is a great time to revisit that decision.

Maximizing your scope (while private) will reduce the uptick in submissions when your program is launched publicly. You should also invite as many researchers as possible to your private program before going public for the same reason.

Because public programs are open to anyone, you will inevitably receive testing from a lot of newer folks that will pay less attention to your bounty brief, so having a wide scope helps in this regard as well.

Segment has run a public bug bounty program since late 2018, roughly 18 months after launching our private program.

Train your triagers

Hopefully over time, you will think of your outsourced triage team as an extension of your internal team. Spending the time to let them know how you want your program run will pay dividends in the future. Any submission they can validate without asking your team questions saves time for everyone involved.

Here are some examples of guidance we’ve given to the Bugcrowd triage team:

Identify duplicates for non-rewardable submissions

Many programs do not bother to mark informational, out of scope, or other non-rewardable submissions as duplicates. We do this for two reasons: 

The first is that if we decide to fix one of these issues later, we can go back and mark the original submission as resolved and pay the researcher. Any duplicates of this issue will still receive points. 

The second is that when a tool commonly used by bug bounty hunters produces a false positive, you will see that finding submitted to your program a lot.

If we repeatedly see an out-of-scope or not-reproducible submission, we can add a specific item to our bounty brief warning researchers that it will be marked as out of scope or not reproducible without a working proof of concept.

Don’t be afraid to deduct points for undesired behavior

While we are generally laid-back and understanding program owners, we aren’t afraid to deduct points from a researcher’s score when it’s warranted.

Many programs shy away from deducting points, but we want to ensure participants in our program thoroughly read our brief, and we think it helps the larger bug bounty community to slightly penalize those who disregard clearly documented rules.

Two of the common arguments against bug bounty programs are that the submissions are often low-value and that researchers don’t respect scope.

For example, we have a very small out-of-scope section, which includes: CORS or crossdomain.xml issues on api.segment.io without proof of concept.

This finding, which is flagged by a tool commonly used by bug bounty participants, is one we have received dozens of times, but never with any impact.

We do this to save time for both researchers and our triage team. If a researcher submits this finding without a proof of concept, we encourage Bugcrowd to mark this as out-of-scope. If a researcher submits a finding that showed impact, we would be more than happy to reward, fix, and update our brief.

Explain deviations in ratings

If you need to deviate from the baseline rating established in your bounty brief, take the time to explain to the researcher why the rating and reward are higher or lower than they might expect. 

Researchers are generally understanding, as long as your rating, reward, and explanation are fair and make sense. If you find yourself commonly deviating from the ratings, it may be time to make changes to your bounty brief so that researchers know what to expect in advance. If you make severity or scope changes as the result of a submission, reward the researcher at whichever rate is more favorable to them.

In addition to explaining why something was rated lower than the baseline, take the time to explain why something was rated higher than the baseline. This is a great way to encourage further testing in these areas and is a great way to build trust with a researcher.

These explanations also help your triage team learn more about your program, and allow them to more accurately triage future submissions.

Build and maintain good relationships with researchers

Take time to build relationships and trust with researchers, especially those that repeatedly submit to your program. Treat researchers fairly, with respect, and consider paying for anything that brings value to your organization.

You’re competing for a researcher’s time, especially the ones that are the most talented. They can likely work on almost any bug bounty program available; think about ways you can encourage them to work on yours.

Keep in mind that all researchers, even those that are unskilled, are human beings. Assume that they want to help you secure your organization, learn more about security and technology, and get paid.

Pay for anything that brings value

If there is one sentence you remember from this blog, I hope it is “pay for anything that brings value.” 

Bug bounty hunters put in a lot of time and effort that doesn’t result in getting paid. This could be time spent developing tooling, hunting without finding any bugs, or having a valid bug marked as a duplicate.

Try to avoid thinking about individual bug costs. Instead, think about the overall value the program brings to your organization in terms of bugs found, time saved, and peace of mind. If you’re debating between two severities, pick the higher one and pay the researcher at that rate. You can always change the severity in your internal ticketing system later.

Once you’ve received a rewardable submission, try to triage and pay quickly. Sometimes determining the full impact takes time; if this is the case, add a comment letting the researcher know you appreciate their work but need some extra time to determine the appropriate reward.

Work collaboratively with the researcher

As an employee of your company, you should know more about the codebase and infrastructure than a security researcher outside your organization (although occasionally I question this based on the creative and impactful submissions we receive 😅). 

Sometimes when running a bug bounty program, you’ll get a submission that makes you think, “What’s the next level the researcher could get to?” If this is a researcher you trust, it may be appropriate to give them some hints to help further their testing. If you give them hints, you can also issue some cautionary advice to help them continue in a way that is safe for your organization and customers. 

Giving the researcher hints helps show them you value their testing and saves your team from spending time on something that may not be possible. If the hint is helpful, the researcher will be submitting a higher-severity finding, which positively impacts their researcher score and earns a higher monetary reward. It also allows you to get the vulnerability fixed faster due to the higher severity.

Sometimes, it isn’t appropriate to include the researcher in this phase of the process. If our team continues the investigation, and it leads to the discovery of a higher-impact weakness in our systems, we reward the researcher as if their report contained the full impact. We also explain why we paid them at this higher rate, but let them know we are unable to share the details. This is a great way to show the researcher you value their work and build trust.

Share progress with the researcher

If a researcher submits a vulnerability that leads to a systemic fix for a vulnerability class, share this with them! Researchers are generally excited to hear that their work led to meaningful change within your organization. It also is a cue for them to attempt to bypass the new protections. 

Pay for Dupes

At Segment, we commonly pay researchers for well-written duplicates, and frequently reach out to let them know that we appreciated their submission. We also let them know that we don’t always pay for duplicates to make sure that expectations are set appropriately. 

This has worked out incredibly well for us. All of our most critical submissions have come from researchers that were originally rewarded for a well-written duplicate. Segment is a complex product that takes time to set up and fully understand. Researchers that put in the effort to fully set up a Segment workspace have additional context and understanding that take time to acquire—these people are hard to replace, and you want to keep them happy.

Pay bonuses for well-written reports

We also pay extra for well-written reports. Valid submissions need to get turned into Jira tickets which are assigned to engineering teams. Reports that are concise, easy to follow, have clear impact, and are well-formatted take less time for us to turn into tickets. We want to encourage researchers to save us time so we make sure to reward appropriately and let them know that we appreciate their efforts.

Final thoughts

Running a successful bug bounty program requires consistent effort from your team, but can bring tremendous value to your company and customers. Any vulnerability reported and fixed is one fewer vulnerability an attacker could use to get a foothold in your organization. Bug bounty submissions can help illuminate vulnerability trends, which can help prioritize where you spend resources to fix systemic issues in your applications or infrastructure.

Bug bounty programs are people-powered. Spend the time to make those involved in your program feel valued, help them understand the motivations behind your decisions, and be excellent to each other!

Top tips

Thanks for taking the time to read my post! I hope you learned a few things to help your company run a successful program. Here are some of my top tips to reference later:

  • Pay for anything that brings value

  • Pay extra for well-written reports, even if they’re dupes

  • Avoid thinking about individual bug costs

  • Reward quickly

  • Partner with a bug bounty platform and pay for triage services

  • If you make changes to your bounty brief as the result of a submission, reward the researcher at the more favorable rate

  • Invest time into building and maintaining relationships with your researchers and triage team

  • Don’t be afraid to deduct points for bad behavior

  • Start small and partner early with Engineering

  • Write a clear and concise bounty brief to set expectations with the researchers

Thank you! 💚💙

A special thanks to the Segment Engineering organization for fixing vulnerabilities and responding to alerts.

To Edis from Bugcrowd for helping us triage three years of vulnerabilities and truly being an extension of our team.

To all the researchers that have helped keep our customers safe by looking for vulnerabilities as part of our program.

And finally, to researchers danieloizo and sheddow, you have both submitted numerous well-written and high impact findings and are an absolute pleasure to work with.

Alexandra Noonan on July 10th 2018

Unless you’ve been living under a rock, you probably already know that microservices is the architecture du jour. Coming of age alongside this trend, Segment adopted this as a best practice early-on, which served us well in some cases, and, as you’ll soon learn, not so well in others.

Briefly, microservices is a service-oriented software architecture in which server-side applications are constructed by combining many single-purpose, low-footprint network services. The touted benefits are improved modularity, reduced testing burden, better functional composition, environmental isolation, and development team autonomy. The opposite is a Monolithic architecture, where a large amount of functionality lives in a single service which is tested, deployed, and scaled as a single unit.

In early 2017 we reached a tipping point with a core piece of Segment’s product. It seemed as if we were falling from the microservices tree, hitting every branch on the way down. Instead of enabling us to move faster, the small team found themselves mired in exploding complexity. Essential benefits of this architecture became burdens. As our velocity plummeted, our defect rate exploded.

Eventually, the team found themselves unable to make headway, with 3 full-time engineers spending most of their time just keeping the system alive. Something had to change. This post is the story of how we took a step back and embraced an approach that aligned well with our product requirements and needs of the team.

Why Microservices Worked

Segment’s customer data infrastructure ingests hundreds of thousands of events per second and forwards them to partner APIs, what we refer to as server-side destinations. There are over one hundred types of these destinations, such as Google Analytics, Optimizely, or a custom webhook. 

Years back, when the product initially launched, the architecture was simple. There was an API that ingested events and forwarded them to a distributed message queue. An event, in this case, is a JSON object generated by a web or mobile app containing information about users and their actions. A sample payload looks like the following:
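(The values below are illustrative; the shape follows Segment’s track spec.)

```json
{
  "type": "track",
  "event": "Order Completed",
  "userId": "user_12345",
  "timestamp": "2017-01-15T21:00:00.000Z",
  "properties": {
    "order_id": "50314b8e",
    "revenue": 99.99,
    "currency": "USD"
  },
  "context": {
    "library": { "name": "analytics.js", "version": "3.1.0" }
  }
}
```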

As events were consumed from the queue, customer-managed settings were checked to decide which destinations should receive the event. The event was then sent to each destination’s API, one after another, which was useful because developers only need to send their event to a single endpoint, Segment’s API, instead of building potentially dozens of integrations. Segment handles making the request to every destination endpoint.

If one of the requests to a destination fails, sometimes we’ll try sending that event again at a later time. Some failures are safe to retry while others are not. Retry-able errors are those that could potentially be accepted by the destination with no changes. For example, HTTP 500s, rate limits, and timeouts. Non-retry-able errors are requests that we can be sure will never be accepted by the destination. For example, requests which have invalid credentials or are missing required fields.
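As a rough sketch of that distinction, you can think of the retry decision as a function of the destination’s response (a hypothetical helper, not Segment’s actual code):

```js
// Hypothetical sketch: decide whether a failed delivery attempt should be retried.
// Retryable errors might eventually be accepted unchanged; non-retryable ones never will.
function isRetryable(response) {
  if (response.timedOut) return true;        // timeouts: the destination may recover
  if (response.status === 429) return true;  // rate limited: try again later
  if (response.status >= 500) return true;   // destination-side errors (HTTP 500s)
  return false;                              // e.g. invalid credentials, missing required fields
}
```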

At this point, a single queue contained both the newest events and events that may have already been through several retry attempts, across all destinations, which resulted in head-of-line blocking: if one destination slowed or went down, retries would flood the queue, causing delays across all our destinations.

Imagine destination X is experiencing a temporary issue and every request errors with a timeout. Now, not only does this create a large backlog of requests which have yet to reach destination X, but also every failed event is put back to retry in the queue. While our systems would automatically scale in response to increased load, the sudden increase in queue depth would outpace our ability to scale up, resulting in delays for the newest events. Delivery times for all destinations would increase because destination X had a momentary outage. Customers rely on the timeliness of this delivery, so we can’t afford increases in wait times anywhere in our pipeline.

To solve the head-of-line blocking problem, the team created a separate service and queue for each destination. This new architecture consisted of an additional router process that receives the inbound events and distributes a copy of the event to each selected destination. Now if one destination experienced problems, only its queue would back up and no other destinations would be impacted. This microservice-style architecture isolated the destinations from one another, which was crucial when one destination experienced issues, as they often do.

The Case for Individual Repos

Each destination API uses a different request format, requiring custom code to translate the event to match this format. A basic example is destination X requires sending birthday as traits.dob in the payload whereas our API accepts it as traits.birthday. The transformation code in destination X would look something like this:
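A minimal sketch of that kind of transform, with made-up structure rather than Segment’s actual destination code:

```js
// Hypothetical "destination X" transform: Segment accepts traits.birthday,
// but this destination expects traits.dob, so we rename the field.
function transform(event) {
  const payload = { ...event, traits: { ...event.traits } };
  if (payload.traits.birthday !== undefined) {
    payload.traits.dob = payload.traits.birthday;
    delete payload.traits.birthday;
  }
  return payload;
}
```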

Many modern destination endpoints have adopted Segment’s request format making some transforms relatively simple. However, these transforms can be very complex depending on the structure of the destination’s API. For example, for some of the older and most sprawling destinations, we find ourselves shoving values into hand-crafted XML payloads.

Initially, when the destinations were divided into separate services, all of the code lived in one repo. A huge point of frustration was that a single broken test caused tests to fail across all destinations. When we wanted to deploy a change, we had to spend time fixing the broken test even if the change had nothing to do with it. In response to this problem, we decided to break the code for each destination out into its own repo. All the destinations were already broken out into their own services, so the transition was natural.

The split to separate repos allowed us to isolate the destination test suites easily. This isolation allowed the development team to move quickly when maintaining destinations.

Scaling Microservices and Repos

As time went on, we added over 50 new destinations, and that meant 50 new repos. To ease the burden of developing and maintaining these codebases, we created shared libraries to make common transforms and functionality, such as HTTP request handling, across our destinations easier and more uniform.

For example, if we want the name of a user from an event, event.name() can be called in any destination’s code. The shared library checks the event for the property key name and Name. If those don’t exist, it checks for a first name, checking the properties firstName, first_name, and FirstName. It does the same for the last name, checking the cases and combining the two to form the full name.
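A sketch of that lookup logic (illustrative only; the real shared library handles more cases):

```js
// Illustrative version of the name lookup described above.
function name(props) {
  // Prefer an explicit full name.
  const full = props.name || props.Name;
  if (full) return full;

  // Otherwise assemble one from first/last name variants.
  const first = props.firstName || props.first_name || props.FirstName;
  const last = props.lastName || props.last_name || props.LastName;
  if (first && last) return `${first} ${last}`;
  return first || last;
}
```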

The shared libraries made building new destinations quick. The familiarity brought by a uniform set of shared functionality made maintenance less of a headache.

However, a new problem began to arise. Testing and deploying changes to these shared libraries impacted all of our destinations. It began to require considerable time and effort to maintain. Making changes to improve our libraries, knowing we’d have to test and deploy dozens of services, was a risky proposition. When pressed for time, engineers would only include the updated versions of these libraries on a single destination’s codebase.

Over time, the versions of these shared libraries began to diverge across the different destination codebases. The great benefit we once had of reduced customization between each destination codebase started to reverse. Eventually, all of them were using different versions of these shared libraries. We could’ve built tools to automate rolling out changes, but at this point, not only was developer productivity suffering but we began to encounter other issues with the microservice architecture.

Another problem was that each service had a distinct load pattern. Some services would handle a handful of events per day while others handled thousands of events per second. For destinations that handled a small number of events, an operator would have to manually scale the service up to meet demand whenever there was an unexpected spike in load.

While we did have auto-scaling implemented, each service had a distinct blend of required CPU and memory resources, which made tuning the auto-scaling configuration more art than science.

The number of destinations continued to grow rapidly, with the team adding three destinations per month on average, which meant more repos, more queues, and more services. With our microservice architecture, our operational overhead increased linearly with each added destination. Therefore, we decided to take a step back and rethink the entire pipeline.

Ditching Microservices and Queues

The first item on the list was to consolidate the now over 140 services into a single service. The overhead from managing all of these services was a huge tax on our team. We were literally losing sleep over it since it was common for the on-call engineer to get paged to deal with load spikes.

However, the architecture at the time would have made moving to a single service challenging. With a separate queue per destination, each worker would have to check every queue for work, which would have added a layer of complexity to the destination service with which we weren’t comfortable. This was the main inspiration for Centrifuge. Centrifuge would replace all our individual queues and be responsible for sending events to the single monolithic service.

Moving to a Monorepo

Given that there would only be one service, it made sense to move all the destination code into one repo, which meant merging all the different dependencies and tests into a single repo. We knew this was going to be messy.

For each of the 120 unique dependencies, we committed to having one version for all our destinations. As we moved destinations over, we’d check the dependencies it was using and update them to the latest versions. We fixed anything in the destinations that broke with the newer versions.

With this transition, we no longer needed to keep track of the differences between dependency versions. All our destinations were using the same version, which significantly reduced the complexity across the codebase. Maintaining destinations now became less time consuming and less risky.

We also wanted a test suite that allowed us to quickly and easily run all our destination tests. Running all the tests was one of the main blockers when making updates to the shared libraries we discussed earlier.

Fortunately, the destination tests all had a similar structure. They had basic unit tests to verify our custom transform logic was correct and would execute HTTP requests to the partner’s endpoint to verify that events showed up in the destination as expected.

Recall that the original motivation for separating each destination codebase into its own repo was to isolate test failures. However, it turned out this was a false advantage. Tests that made HTTP requests were still failing with some frequency. With destinations separated into their own repos, there was little motivation to clean up failing tests. This poor hygiene led to a constant source of frustrating technical debt. Often a small change that should have only taken an hour or two would end up requiring a couple of days to a week to complete.

Building a Resilient Test Suite

The outbound HTTP requests to destination endpoints during the test run were the primary cause of failing tests. Unrelated issues like expired credentials shouldn’t fail tests. We also knew from experience that some destination endpoints were much slower than others. Some destinations took up to 5 minutes to run their tests. With over 140 destinations, our test suite could take up to an hour to run.

To solve for both of these, we created Traffic Recorder. Traffic Recorder is built on top of yakbak, and is responsible for recording and saving destinations’ test traffic. Whenever a test runs for the first time, any requests and their corresponding responses are recorded to a file. On subsequent test runs, the request and response in the file are played back instead of requesting the destination’s endpoint. These files are checked into the repo so that the tests are consistent across every change. Now that the test suite is no longer dependent on these HTTP requests over the internet, our tests became significantly more resilient, a must-have for the migration to a single repo.
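The underlying yakbak pattern looks roughly like this: tests send their HTTP requests to a local proxy, which records a “tape” file for any request it hasn’t seen and replays that tape on later runs. A minimal sketch; the host, port, and tape directory are illustrative, not our actual setup:

```js
// Sketch of a yakbak-based recording proxy.
const http = require('http');
const yakbak = require('yakbak');

// Proxy requests to the destination endpoint, recording responses as "tapes"
// on the first run and replaying them from disk on every run after that.
http.createServer(yakbak('https://api.example-destination.com', {
  dirname: __dirname + '/tapes',
})).listen(3000);

// Tests then point at http://localhost:3000 instead of the real endpoint.
```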

I remember running the tests for every destination for the first time, after we integrated Traffic Recorder. It took milliseconds to complete running the tests for all 140+ of our destinations. In the past, just one destination could have taken a couple of minutes to complete. It felt like magic.

Why a Monolith works

Once the code for all destinations lived in a single repo, they could be merged into a single service. With every destination living in one service, our developer productivity substantially improved. We no longer had to deploy 140+ services for a change to one of the shared libraries. One engineer can deploy the service in a matter of minutes.

The proof was in the improved velocity. In 2016, when our microservice architecture was still in place, we made 32 improvements to our shared libraries. Just this year we’ve made 46 improvements. We’ve made more improvements to our libraries in the past 6 months than in all of 2016.

The change also benefited our operational story. With every destination living in one service, we had a good mix of CPU and memory-intense destinations, which made scaling the service to meet demand significantly easier. The large worker pool can absorb spikes in load, so we no longer get paged for destinations that process small amounts of load.

Trade Offs

Moving from our microservice architecture to a monolith was overall a huge improvement; however, there are trade-offs:

  1. Fault isolation is difficult. With everything running in a monolith, if a bug is introduced in one destination that causes the service to crash, the service will crash for all destinations. We have comprehensive automated testing in place, but tests can only get you so far. We are currently working on a much more robust way to prevent one destination from taking down the entire service while still keeping all the destinations in a monolith.

  2. In-memory caching is less effective. Previously, with one service per destination, our low traffic destinations only had a handful of processes, which meant their in-memory caches of control plane data would stay hot. Now that cache is spread thinly across 3000+ processes so it’s much less likely to be hit. We could use something like Redis to solve for this, but then that’s another point of scaling for which we’d have to account. In the end, we accepted this loss of efficiency given the substantial operational benefits.

  3. Updating the version of a dependency may break multiple destinations. While moving everything to one repo solved the previous dependency mess we were in, it means that if we want to use the newest version of a library, we’ll potentially have to update other destinations to work with the newer version. In our opinion though, the simplicity of this approach is worth the trade-off. And with our comprehensive automated test suite, we can quickly see what breaks with a newer dependency version.

Conclusion

Our initial microservice architecture worked for a time, solving the immediate performance issues in our pipeline by isolating the destinations from each other. However, we weren’t set up to scale. We lacked the proper tooling for testing and deploying the microservices when bulk updates were needed. As a result, our developer productivity quickly declined.

Moving to a monolith allowed us to rid our pipeline of operational issues while significantly increasing developer productivity. We didn’t make this transition lightly though and knew there were things we had to consider if it was going to work.

  1. We needed a rock solid testing suite to put everything into one repo. Without this, we would have been in the same situation as when we originally decided to break them apart. Constant failing tests hurt our productivity in the past, and we didn’t want that happening again.

  2. We accepted the trade-offs inherent in a monolithic architecture and made sure we had a good story around each. We had to be comfortable with some of the sacrifices that came with this change.

When deciding between microservices or a monolith, there are different factors to consider with each. In some parts of our infrastructure, microservices work well but our server-side destinations were a perfect example of how this popular trend can actually hurt productivity and performance. It turns out, the solution for us was a monolith.


The transition to a monolith was made possible by Stephen Mathieson, Rick Branson, Achille Roussel, Tom Holmes, and many more.

Special thanks to Rick Branson for helping review and edit this post at every stage.




Andy Yeo on July 16th 2020

More of the customer experience is online than ever before. While this offers your business new opportunities to tailor each user’s journey, it also creates a lot of complexity under the hood. Most companies are using 80+ vendor tools to run their business, and half of the enterprises we surveyed maintain 7+ disconnected islands of customer data.

With so many tools both generating and demanding customer data, your business needs a flexible way to unlock new use cases without reinventing the wheel at every turn. Last year at Synapse, our user conference, we announced the developer preview for Functions to allow early users to connect any tool to Segment using custom JavaScript. And since that introduction, more than 200 companies have built almost 2000 functions.

Today, we’re proud to announce that Functions is generally available to all customers, and you can start building now. If you want to learn more, here are a few of the most common ways customers are using Functions.

Build your own sources and destinations

While Segment already supports 300+ sources and destinations, you can use Functions to create your own sources and destinations directly within your workspace to bring new types of data into Segment and send data to new tools:
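For example, a Destination Function is just a JavaScript handler that receives each event plus your configured settings and forwards it wherever you like. A minimal sketch; the endpoint and setting name here are made up for illustration:

```js
// Sketch of a Destination Function: forward every track call to a custom API.
// The URL and settings.apiKey are illustrative placeholders.
async function onTrack(event, settings) {
  const res = await fetch('https://api.example-internal-tool.com/events', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${settings.apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      name: event.event,
      userId: event.userId,
      properties: event.properties,
    }),
  });

  if (!res.ok) {
    throw new Error(`Delivery failed with status ${res.status}`);
  }
}
```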

We’ve seen customers primarily use Functions to build a more complete customer view, with a specific lens toward connecting an internal service or an industry-specific tool to Segment. For example, Adversus is an outbound call center platform that used Functions to create a new data source and destination for their new financial automation tool, Fenerum.

By integrating Segment to Fenerum using Functions, the Adversus team is able to keep track of changes in user subscriptions and related invoices in real-time. These integrations have also empowered their Finance team to use their data warehouse as a source of truth for their own analysis. 

“At Adversus, we need the freedom to control our data and Functions lets us do just that. Integration projects are now done in a few days instead of weeks, and our development team can focus on the core product development instead of internal IT projects. We’re now looking to move even more of our data integrations into the Segment product.” - Mads Jepsen, Co-Founder, Adversus

Balance data quality and flexibility

When your business has multiple business units or products, it can be tough to balance keeping data clean and reliable across the company, while also making it work for every tool your teams use.

For Adevinta, a marketplace specialist with over 30 brands, this had led to serious data inconsistencies as individual teams tried to optimize their ad performance.

One of their real estate brands, Fotocasa, wanted to advertise to users via Google, Facebook, and Criteo with personalized ads based on whether users want to buy or rent. Their data team wanted to keep the data clean with a single event for House Lead, with properties for type: sell and type: rent to help them avoid reporting on redundant events. However, their ad tools required them to send two separate events for Sell House Lead and Rent House Lead for personalization.

In order to better balance data quality and ad performance, Adevinta created a Source Function to collect a single House Lead event and create separate events downstream so their ad platforms could properly use the data. This quick solution helped them keep their global analytics clean while unlocking better personalization for Fotocasa.
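A sketch of what a Source Function like that might look like (event and field names are illustrative, not Adevinta’s actual implementation):

```js
// Sketch of a Source Function that keeps one clean analytics event
// while emitting a type-specific event for downstream ad tools.
async function onRequest(request, settings) {
  const body = request.json();

  // The single, clean event for analytics and reporting.
  Segment.track({
    event: 'House Lead',
    userId: body.userId,
    properties: { type: body.type }, // 'sell' or 'rent'
  });

  // The type-specific event that ad platforms can act on directly.
  Segment.track({
    event: body.type === 'sell' ? 'Sell House Lead' : 'Rent House Lead',
    userId: body.userId,
    properties: body.properties || {},
  });
}
```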

Not only does this save them hours of engineering and reporting time, but it also helped them further optimize their ads to create a 4-5x improvement in campaign performance.

Automate away complex workflows

With Functions, you can go far beyond implementing new tools by creating functions that allow you to automate away complex workflows. This is especially true for industries like consumer packaged goods where offline and online data need to interplay to create great experiences.

A multinational CPG business is using Functions to provide delivery visibility for the store owners who distribute their products. They use Foxtrot to manage deliveries in Latin America, and every time a delivery is about to go out, Foxtrot triggers a route created event that contains the IDs of all the stores that are going to get hit on that route.

By plugging that same data into Segment using a Source Function, they can use Segment to match Foxtrot data, customer ID, and store ID. They can then pass that data from Segment to Braze, which triggers a message to the store owners informing them their items are out for delivery. 

Even more impressive is that they were able to complete the entire project within four days, giving them a huge personalization win in less than a week. 

Ready to take your data pipeline to the next level?

Functions is built on AWS Lambda, and we’ve continued to rapidly expand the feature set since our developer preview. We’ve partnered with early users to build in more ease-of-use and reliability to ensure the functions you build can truly become core to your data pipeline. 

Here’s what we’ve improved over the last few months:

  • Error tracking: You can find useful information about errors triggered within your function in both our debugger and in event delivery.

  • Advanced permissions: You now have more control over who can create, edit, or deploy functions in your workspace.

  • Data replay for Source Functions: We’ve built a tool to help establish more reliability for your event pipeline and to help you meet GDPR/CCPA compliance requirements.

  • Custom settings for Destination Functions: You can now include configurable fields, either required or encrypted, for things like API keys, secrets, and event mapping.

  • Autofill for Destination Functions: Test your function with real events from any of the active sources within your Segment workspace.

Functions is a paid product, but every customer is granted a generous set of compute hours to build, test, and use Functions for your business. We invite you to start building within your workspace today.

If you want to learn more first, check out our docs, request a personalized demo, or sign up for our upcoming webinar.

Geoffrey Keating, Andy Schumeister on April 29th 2020

Our mission at Segment is to help every company in the world use good data to power their growth. And over the past eight years, we’ve been lucky enough to help thousands of businesses do exactly that.

During that time, most of the people implementing these customer data strategies have come from specific, and largely technical teams, like engineering and analytics. This is not only because of the technical effort involved in data collection, but also user comprehension around what they should collect and why.

But imagine a world where a product manager didn’t have to ping an engineer to instrument a tracking plan for the new feature they just launched, and where a marketer didn’t have to submit a Jira ticket to track whether an A/B test was successful or not.

Today we’re introducing Visual Tagger — to help everyone in your company collect the website data they need in minutes, without relying on technical teams. 

With Visual Tagger, you can capture all the events that you want to capture without having to touch your code. - Neeraj Sinha, Chief Technology Officer, ScoutMyTrip


In less than 10 years, the amount of customer data being tracked by businesses has grown exponentially. According to Gartner, 82% of organizations have seen an increase in the types of customer data they are tracking over the past five years. 

Just as the overall volume of customer data has increased, so too has its influence across the business. Nowadays, product and marketing teams are as eager to collect and use customer data as engineers and analysts.

How customer data is changing in the organization. Image source: Gartner

The desire to collect customer data may be universal, but so too is the struggle to execute.

Whether you’re an early-stage startup or Fortune 500 company, engineering teams are facing increasingly limited bandwidth. When implementing a customer data tracking plan has to compete for attention with dozens of other priorities, it can be deprioritized, or worse, ignored. 

Driven by these constraints, an entire industry has started to flourish.  Known as the “no-code” or “low-code” movement, tools like Webflow, Airtable, and Zapier (and Segment customers like Buildfire and Hotjar) are democratizing access to tasks that previously required technical expertise.

We now live in a world where you no longer need to become a developer to build a website or build an app. At Segment, we believe you shouldn’t have to be an engineer to start collecting the data you need to understand your customers either.

How Visual Tagger works

Visual Tagger is available free to all Segment customers, so long as you have a Segment workspace set up and the analytics.js snippet installed on your website. While the functionality of Visual Tagger is powerful, the user experience itself is incredibly simple and intuitive. In fact, it’s as easy as 1, 2, 3.

1. Build your events

If you’ve ever used “WYSIWYG” software, Segment’s Visual Tagger will feel familiar. Simply point and click on the different parts of your website that you’d like to track. Out of the box, Visual Tagger lets you create three types of events:

  • Button or Link Clicked: Create an event every time a user clicks a button or link on your website.

  • Form Submitted: Create an event every time a user fills out a form on your website.

  • Any Clicked: Similar to button or link clicked, create an event for any HTML element you would like to track (e.g. a banner).

The Visual Tagger interface

From there, you can then add rich, contextual information in the form of properties to help you understand more about the specific action that the user took.

For example, if you wanted to track a button on your homepage, you could add the property Location: Homepage. If you wanted to track a demo request on your pricing page, you could add the properties Location: Pricing Page and Feature: Demo Request. 
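Under the hood, each tagged event ultimately shows up as an ordinary analytics.js track call with the name and properties you configured, roughly like this (the event name and property values are illustrative):

```js
// Roughly what Visual Tagger produces for a tagged pricing-page button.
analytics.track('Demo Requested', {
  location: 'Pricing Page',
  feature: 'Demo Request',
});
```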

If you aren’t sure which events or properties to track, we’ll provide recommended events to track based on your industry right in your workspace. For additional recommendations, you can also check out our industry-standard tracking plans for ecommerce, B2B SaaS, media, and more.

2. Test your events

Your data-driven decisions are only as good as the quality and accuracy of the data you collect. With Visual Tagger, you can preview the results before the event gets created and starts flowing in production.

 In test mode, you can:

  • Target URLs: Specify which URLs or group of URLs that an event is fired from.

  • Preview your events: Test the actions that would trigger the event and verify they are firing (e.g., if your event fires, a green checkmark will display).

  • Check all the details: Get granular feedback on where, how, and what event gets fired so that you can fix any issues before you deploy.

This means that you can have complete confidence before you deploy.

3. Publish

That’s it! Once you hit publish, the events you created will start flowing into Segment and will show up in your Debugger. You can use the Visual Tagger console to see what events are live, along with additional context for you and your team on who created the event, when it was created, and any recent updates.

Check out the docs to learn more. 

How customers are using Visual Tagger

Whether it’s helped them to experiment faster or save valuable engineering resources, Visual Tagger has become integral to the workflows of technical, and even non-technical, users alike.

Here are use cases from a few of our early customers and the positive impact Visual Tagger has had on their businesses.

Increase your agility

ScoutMyTrip, an AI-driven road trip planner and travel expert marketplace, relies heavily on quick, iterative experiments to help drive customer acquisition and retention.  However, with a small engineering team, they struggled to make a strong business case for launching and analyzing these experiments. Investing precious engineering resources in capturing data for an experiment (that might not even work) felt wasteful, given other demands from across the business.

With Visual Tagger, ScoutMyTrip can track all the experiments they need without having to touch a line of code. Whereas tracking events for an experiment usually took the team six days on average, they were able to track the same number of events in two hours.

“It took me about two hours to capture 30 events. If I didn't have Visual Tagger, I would have decided not to capture half of the events altogether because that would be too much of a hassle.” - Neeraj Sinha, Chief Technology Officer, ScoutMyTrip

Empower every team

Nationbuilder, a nonprofit and political campaign software platform that combines website, communication, and donation tools, wanted to understand and optimize their onboarding experience.

Unfortunately, getting the data they needed was a highly manual process. Tracking events had to be manually instrumented by engineering and, quite often, this work was ignored in favor of more pressing matters. Alex Stevens, Director of Growth at Nationbuilder, needed to start tracking with a minimal investment of resources so the team could understand product adoption and onboard customers as easily and quickly as possible.

With Visual Tagger, the team now feels comfortable implementing event tracking themselves. 

“It used to take weeks for an analytics request to get prioritized. With Visual Tagger, I can quickly collect the data I need without having to create tons of requests for our engineering team.” - Alex Stevens, Growth Director, Nationbuilder

Make any tool codeless

With Visual Tagger, customers can configure events in minutes and then activate the data in 300+ tools in the Segment catalog.

For example, Voopty, a business management software platform that connects educators, tutors, and clubs with families searching for their services, needed better visibility into engagement metrics.

Voopty’s Founder and Software Developer, Taja Kholodova, needed access to product analytics to provide reporting to the hundreds of partner businesses they work with. Unfortunately, collecting this data was manual and required time to learn every additional tool and platform.

Kholodova turned to Visual Tagger and was able to quickly set up event tracking to understand user engagement in a matter of minutes. From there, Kholodova instantly activated the data across their entire tech stack, including tools like Indicative, FullStory, and Google Tag Manager. 

“We were able to start sending events to destinations in 5 minutes.” - Taja Kholodova, Founder and Software Developer, Voopty


Visual Tagger is a win for both technical and non-technical users alike. Engineers are free from menial and repetitive tasks that take away time they could spend on building product. Product managers and marketers can unlock the myriad of use cases within the Segment product, without writing a single line of code.

But we think the true benefits will be felt by your end customer. By making analytics instrumentation unbelievably easy, data is accessible to everyone in your company. This empowers every individual to use that data to inform their day-to-day work.

And ultimately, that means better products and better customer experiences for all of us. 

Ready to get started with Visual Tagger? Log in to your workspace to set up new events in minutes. 


Want to learn more about Visual Tagger? Join us for a webinar on May 21st at 1pm PST.
