Leif Dreizler on March 31st 2020
Segment receives billions of events every day from thousands of customers that trust Segment to keep their data safe. At Segment, we believe that good security is an essential part of building high-quality software, similar to reliability or scalability. In addition to the tools and processes developed by our Security Org to help software engineers make good security choices, we also rely on reviews from both traditional and crowdsourced security companies.
Bug bounties are often seen as a significant burden for the security teams that run them. You’ll hear horror stories about companies launching a bug bounty, their security team getting inundated with low-quality reports, duplicate submissions, and researchers going outside the scope of the program.
Shortly after, you'll hear about their engineering team being overwhelmed with newly discovered vulnerabilities to fix. From those that survive the opening salvo, you may hear complaints that, over time, they have stopped receiving impactful submissions.
A few years ago, when the space was less mature, critics questioned whether running a program was worth it. Now, it is expected that organizations of a certain size and maturity run a bug bounty program.
In late 2019, the Cybersecurity and Infrastructure Security Agency (CISA) of the United States published a draft directive recommending all agencies develop a mechanism to accept information related to security vulnerabilities from external parties.
In this blog post, we’ll break down how to start and manage a bug bounty program, consistently achieve good results, and maintain healthy relationships with the people that power the program.
If you’re short on time, check out the “Top Tips” section at the bottom of this post.
A bug bounty program is like a Wanted Poster for security vulnerabilities.
Companies running bug bounty programs pay independent security researchers from across the world for security vulnerabilities in their own products and infrastructure.
I assume most readers are already bought into the benefits of running a bug bounty.
Most companies that have an internet presence or make an internet-connected device should consider running a bounty, or at least have a way for security researchers to report security issues.
It is also part of the Vendor Security Alliance questionnaire, so it may be something your customers ask you about if you are in the B2B space.
If you don’t have a way for researchers to report issues, they will email people at your company (or any alias they can find on your website), connect with you on LinkedIn, or just tweet about the issues they think they’ve found.
It’s a much better experience for researchers, your employees, and your customers if you give the security community a clear avenue to report vulnerabilities.
Your security and engineering orgs will be regularly impressed by the creativity of the researcher community. These are people that, without internal knowledge, can find critical vulnerabilities in organizations across the world.
I strongly recommend using a bug bounty platform like HackerOne or Bugcrowd (we use Bugcrowd here at Segment) to help manage this process. These companies provide a platform and services to help run an efficient program.
Severity baselines make it easier to tell how serious a vulnerability is, and how much time you should be spending on review and remediation.
When running a program on your own, you’ll frequently have researchers overhyping vulnerabilities. Platforms have a guided submission form, which helps researchers pick the appropriate category and rating.
The reputation system rewards researchers that accurately rank vulnerabilities and creates a competitive environment that benefits both researchers and program owners.
It also helps reinforce good behavior: discipline issues carry stricter consequences on a platform, and a researcher who misbehaves may be banned from it entirely.
To submit vulnerabilities via these platforms, researchers have to agree not to disclose the vulnerability without approval from your company.
Both platforms also provide triage services, which I highly recommend paying for. These are the first line of defense for your internal security resources. These globally distributed teams will help clean up researcher reports, mark submissions as duplicates, and filter out low-quality reports.
These companies also serve as a knowledge base for you to learn about running a program and ask questions. You can bounce ideas off of someone that works at a company running hundreds of programs.
Platforms have structured input with required fields and integrations with popular tools, like Jira. These make it much easier to turn a submission into a ticket for your engineering org.
For most companies, it isn’t possible to run a private program without the help of a bug bounty platform.
We’ll talk about private programs in more depth later, but this is the recommended starting point for companies launching a bug bounty for the first time.
All of the above features free your team to focus on the security challenges unique to your business.
Having a successful program starts with a good foundation, and it’s your job as a program owner to help set your organization up for success.
Think about your current process for handling a security vulnerability. What happens when someone internally or externally finds a security bug?
You will inevitably get more of these after starting a bug bounty program, so you must have a good way to handle these reports.
Your vulnerability process doesn’t have to be perfect, but you should have a way to prioritize and assign bug reports to the appropriate engineering team without high overhead.
As someone starting a program, you’ll also need to get buy-in from your Engineering org. They are the ones that will have to fix the issues and will likely be the ones responding to alerts triggered by researchers.
Your alerting story doesn’t need to be perfect. But you also don’t want engineers to be woken up every time someone triggers an error because some input validation was working correctly and stopped a researcher from submitting a < symbol into an email address field.
Remember, your team doesn't have to fix every valid vulnerability immediately.
Vulnerabilities are just bugs and should be prioritized appropriately. If you’re having trouble getting your engineering org to fix critical vulnerabilities in a timely manner, you may want to direct your efforts to job-hunting instead of starting a bug bounty program 🙃
Once your organization is bought-in, you can focus on getting things ready for the researchers.
Your bounty brief is what researchers will read to determine if they’re interested in working on your program. It’s part instructions, part rules, and part advertisement.
Keep in mind you’re competing for researchers’ time; they don’t have to work on your program when there are so many other programs.
Your bounty brief should be clear, concise, and should set expectations with the researchers. You can find the Segment program hosted on Bugcrowd.
Where do you want researchers to focus their testing? What’s in scope? What’s out of scope?
I recommend starting with assets that are easy for the researchers to access, ideally something free, that anyone can sign up for.
You should also try to pick a target that has at least medium size, complexity, and business impact. This will help you show value early, which will help you expand the program.
How do researchers get access to the scope? Are there docs they can read to help them get up to speed? We instruct researchers to sign up for our app using their @bugcrowdninja.com email address and to include -bugcrowdninja as part of their workspace slug.
This makes it easier for us to determine if someone is part of our Bugcrowd program when we review logs and alerts. If we notice someone causing problems in our app, we can ask Bugcrowd to provide researcher coaching.
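As a rough illustration, a check like the following (field names are hypothetical, not Segment's actual schema) could tag researcher traffic when reviewing logs or routing alerts:

```javascript
// Sketch: flag traffic from bug bounty researchers so on-call engineers
// can triage those alerts differently. Field names are illustrative.
function isBugBountyTraffic(event) {
  const email = (event.userEmail || "").toLowerCase();
  const slug = (event.workspaceSlug || "").toLowerCase();
  // Researchers sign up with @bugcrowdninja.com addresses and include
  // -bugcrowdninja in their workspace slug, per the bounty brief.
  return email.endsWith("@bugcrowdninja.com") || slug.includes("-bugcrowdninja");
}
```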
How are you going to rate submissions? Consistent severity is important because it impacts how much the researcher gets paid for a submission. HackerOne uses Mitre’s Common Weakness Enumeration (CWE) and Bugcrowd uses the Bugcrowd Vulnerability Rating Taxonomy (VRT).
How much are you going to pay for vulnerabilities? Researchers need to know upfront how much you’ll pay for vulnerabilities so they can assess if it is worth their time to hunt on your program.
Think about using different reward ranges for different assets. This can help control costs and also helps researchers understand which targets are more important. For example, our brief describes specific objects that will net a higher reward.
A handful of years ago, getting a T-shirt as a reward was pretty standard. I’d strongly encourage anyone thinking about running a swag-based bounty to reconsider.
T-shirts don’t pay rent and are more work for your team than sending a researcher money. What do you do when that T-shirt you sent to another continent is lost in the mail or doesn’t fit?
We reserve swag for our top performers. Sending a T-shirt requires the researcher to trust us enough to give us their address and requires me to go to the post office.
When seeking buy-in from your leadership, take the time to explain what a bug bounty is, why it's important, and have a few examples from recognizable organizations ready to show them.
Learn a little bit about the platform you’re using. Your actions on the platform impact the researcher. If you mistreat researchers, they will go elsewhere; without researchers, your program isn’t providing value to your organization.
The same report status can have different meanings and impact on different platforms. For example, on HackerOne, Not Applicable reduces the researcher’s site-wide score by 5 points and should be used for reports that don’t contain a valid issue. On Bugcrowd, Not Applicable does not impact the researcher’s score and is commonly used for reports that should neither be accepted nor rejected. To achieve this result on HackerOne, you would use a different status that doesn’t affect reputation.
If you have any questions about the platform you’re using, I strongly recommend reviewing documentation or reaching out to your account manager for help.
Regardless of how big your company’s internet footprint is, you can start with a small scope open only to a handful of individuals as part of a private program.
In mid-2017, Segment was running a private program with 25 researchers and a single target: our app.
The early researchers invited will be some of the platform’s most trusted, and they will generally be more accepting of companies that are learning how to manage a program, as long as you pay them fairly and treat them with respect.
Starting small allows your organization to learn how to run a program in a safer environment. If your vulnerability management program has some gaps, you can fix them; if your bounty brief is unclear, you can rewrite it; if your alerts aren’t tuned properly, you can invest time into improving them. If you need to pause your program, you can relaunch later with a less negative impact.
Even while we had a private program, we would direct researchers that reached out via email to our Bugcrowd program. This allowed us to receive the benefits of the platform and triage services for all submissions before running a public program.
It’s much easier to explain to a researcher why you won’t be paying for a low-effort submission when you have a prioritization system established and enforced by a third party.
Like any multi-year project, your bug bounty will evolve and require ongoing effort to keep it healthy.
Researchers love testing new features and assets; in most bug bounty programs, only the first person to find a vulnerability receives a monetary reward.
If you started with a small scope, your program is steady, and you’re ready for more submissions, this is a great time to add more targets to your brief.
As you add scope, keep in mind that not all assets are of equal value to an adversary. Consider specifying different reward ranges for different assets based on their security maturity and value.
Over time, you should also consider an “open scope” bounty if it is appropriate for your organization. We have listed “Any host/web property verified to be owned by Segment (domains/IP space/etc.)” as a target, which serves as a catch-all for anything not explicitly listed in our “In Scope” or “Out of Scope” sections.
Having an open scope bounty is enticing to researchers. Not only does it show you take running a bug bounty program seriously, it also shows that regardless of where they find a vulnerability, it will likely be rewarded (assuming it is valid, unique, and not out of scope).
Many researchers specialize in finding forgotten internet-facing assets as part of open-scope programs, and have developed their own infrastructure to identify assets and vulnerabilities to be able to efficiently earn rewards.
It’s also worth noting that there is no scope for an attacker trying to compromise your company’s security. Working towards an open scope means that it is more likely a bug bounty researcher will find and report a vulnerability before an attacker exploits it.
Over time, you’ll build trust and form relationships with particular researchers. These are great people to give early access to upcoming features. Many times, these features require manual provisioning, making them less suitable for wide-scale testing.
Early access is a mutually beneficial system in which you will receive security vulnerabilities prior to release, which makes them easier to fix. Researchers will be able to test features with less competition, which makes them more likely to earn a reward and continue testing on your program.
If the effort to set up these features is medium or higher, consider paying the researcher a grant before they start working.
Clearly communicate what you're looking for and what you expect from them. When offering a researcher grant, we want to see a short write-up of what didn't work in addition to any findings they submit. Rewardable findings should be paid in addition to the grant.
Once you’ve been running a program for a while and are confident in your company’s ability to receive vulnerabilities from the global researcher community, you should consider evolving it into a public program.
If you don’t have a wide scope, this is a great time to revisit that decision.
Maximizing your scope (while private) will reduce the uptick in submissions when your program is launched publicly. You should also invite as many researchers as possible to your private program before going public for the same reason.
Because public programs are open to anyone, you will inevitably receive testing from a lot of newer folks that will pay less attention to your bounty brief, so having a wide scope helps in this regard as well.
Segment has run a public bug bounty program since late 2018, roughly 18 months after launching our private program.
Hopefully over time, you will think of your outsourced triage team as an extension of your internal team. Spending the time to let them know how you want your program run will pay dividends in the future. Any submission they can validate without asking your team questions saves time for everyone involved.
Here are some examples of guidance we’ve given to the Bugcrowd triage team:
Identify duplicates for non-rewardable submissions
Many programs do not bother to mark informational, out of scope, or other non-rewardable submissions as duplicates. We do this for two reasons:
The first is that if we decide to fix one of these issues later, we can go back and mark the original submission as resolved and pay the researcher. Any duplicates of this issue will still receive points.
The second is that when there is a false positive identified by a tool commonly used by bug bounty hunters, you will get this submitted to your program a lot.
If we repeatedly see an out-of-scope or not-reproducible submission, we can add a specific item to our bounty brief warning researchers that it will be marked as out-of-scope or not reproducible without a working proof of concept.
Don’t be afraid to deduct points for undesired behavior
While we are generally laid-back and understanding program owners, we aren’t afraid to deduct points from a researcher’s score when it’s warranted.
Many programs shy away from deducting points, but we want to ensure participants in our program thoroughly read our brief and think that it helps the larger bug bounty community to slightly penalize those that disregard clearly documented rules.
Two of the common arguments against bug bounty programs are that the submissions are often low-value and that researchers don’t respect scope.
For example, we have a very small out-of-scope section, which includes:
CORS or crossdomain.xml issues on api.segment.io without proof of concept.
This finding is identified by a tool commonly used by bug bounty participants; we have received it dozens of times, but never with any impact.
We do this to save time for both researchers and our triage team. If a researcher submits this finding without a proof of concept, we encourage Bugcrowd to mark this as out-of-scope. If a researcher submits a finding that showed impact, we would be more than happy to reward, fix, and update our brief.
If you need to deviate from the baseline rating established in your bounty brief, take the time to explain to the researcher why the rating and reward are higher or lower than they might expect.
Researchers are generally understanding, as long as your rating, reward, and explanation are fair and make sense. If you find yourself commonly deviating from the ratings, it may be time to make changes to your bounty brief so that researchers know what to expect in advance. If you make severity or scope changes as the result of a submission, reward the researcher at whichever rate is more favorable to them.
In addition to explaining why something was rated lower than the baseline, take the time to explain why something was rated higher than the baseline. This is a great way to encourage further testing in these areas and is a great way to build trust with a researcher.
These explanations also help your triage team learn more about your program, and allow them to more accurately triage future submissions.
Take time to build relationships and trust with researchers, especially those that repeatedly submit to your program. Treat researchers fairly, with respect, and consider paying for anything that brings value to your organization.
You’re competing for a researcher’s time, especially the ones that are the most talented. They can likely work on almost any bug bounty program available; think about ways you can encourage them to work on yours.
Keep in mind that all researchers, even those that are unskilled, are human beings. Assume that they want to help you secure your organization, learn more about security and technology, and get paid.
If there is one sentence you remember from this blog, I hope it is “pay for anything that brings value.”
Bug bounty hunters put in a lot of time and effort that doesn’t result in getting paid. This could be time spent developing tooling, hunting without finding any bugs, or having a valid bug marked as a duplicate.
Try to avoid thinking about individual bug costs. Instead, think about the overall value the program brings to your organization in terms of bugs found, time saved, and peace of mind. If you’re debating between two severities, pick the higher one and pay the researcher at that rate. You can always change the severity in your internal ticketing system later.
Once you’ve received a rewardable submission, try to triage and pay quickly. Sometimes determining the full impact takes time; if this is the case, add a comment letting the researcher know you appreciate their work but need some extra time to determine the appropriate reward.
Work collaboratively with the researcher
As an employee of your company, you should know more about the codebase and infrastructure than a security researcher outside your organization (although occasionally I question this based on the creative and impactful submissions we receive 😅).
Sometimes when running a bug bounty program, you’ll get a submission that makes you think, “What’s the next level the researcher could get to?” If this is a researcher you trust, it may be appropriate to give them some hints to help further their testing. If you give them hints, you can also issue some cautionary advice to help them continue in a way that is safe for your organization and customers.
Giving the researcher hints helps show them you value their testing and saves your team from spending time on something that may not be possible. If the hint is helpful, the researcher will be submitting a higher-severity finding, which positively impacts their researcher score and earns a higher monetary reward. It also allows you to get the vulnerability fixed faster due to the higher severity.
Sometimes, it isn’t appropriate to include the researcher in this phase of the process. If our team continues the investigation, and it leads to the discovery of a higher-impact weakness in our systems, we reward the researcher as if their report contained the full impact. We also explain why we paid them at this higher rate, but let them know we are unable to share the details. This is a great way to show the researcher you value their work and build trust.
Share progress with the researcher
If a researcher submits a vulnerability that leads to a systemic fix for a vulnerability class, share this with them! Researchers are generally excited to hear that their work led to meaningful change within your organization. It also is a cue for them to attempt to bypass the new protections.
Pay for Dupes
At Segment, we commonly pay researchers for well-written duplicates, and frequently reach out to let them know that we appreciated their submission. We also let them know that we don’t always pay for duplicates to make sure that expectations are set appropriately.
This has worked out incredibly well for us. All of our most critical submissions have come from researchers that were originally rewarded for a well-written duplicate. Segment is a complex product that takes time to set up and fully understand. Researchers that put in the effort to fully set up a Segment workspace have additional context and understanding that take time to acquire—these people are hard to replace, and you want to keep them happy.
Pay bonuses for well-written reports
We also pay extra for well-written reports. Valid submissions need to get turned into Jira tickets which are assigned to engineering teams. Reports that are concise, easy to follow, have clear impact, and are well-formatted take less time for us to turn into tickets. We want to encourage researchers to save us time so we make sure to reward appropriately and let them know that we appreciate their efforts.
Running a successful bug bounty program requires consistent effort from your team, but can bring tremendous value to your company and customers. Any vulnerability reported and fixed is one fewer vulnerability an attacker could use to get a foothold in your organization. Bug bounty submissions can help illuminate vulnerability trends, which can help prioritize where you spend resources to fix systemic issues in your applications or infrastructure.
Bug bounty programs are people-powered. Spend the time to make those involved in your program feel valued, help them understand the motivations behind your decisions, and be excellent to each other!
Thanks for taking the time to read my post! I hope you learned a few things to help your company run a successful program. Here are some of my top tips to reference later:
Pay for anything that brings value
Pay extra for well-written reports, even if they’re dupes
Avoid thinking about individual bug costs
Partner with a bug bounty platform and pay for triage services
If you make changes to your bounty brief as the result of a submission, reward the researcher at the more favorable rate
Invest time into building and maintaining relationships with your researchers and triage team
Don’t be afraid to deduct points for bad behavior
Start small and partner early with Engineering
Write a clear and concise bounty brief to set expectations with the researchers
A special thanks to the Segment Engineering organization for fixing vulnerabilities and responding to alerts.
To Edis from Bugcrowd for helping us triage three years of vulnerabilities and truly being an extension of our team.
To all the researchers that have helped keep our customers safe by looking for vulnerabilities as part of our program.
And finally, to researchers danieloizo and sheddow, you have both submitted numerous well-written and high impact findings and are an absolute pleasure to work with.
Alexandra Noonan on July 10th 2018
Unless you’ve been living under a rock, you probably already know that microservices is the architecture du jour. Coming of age alongside this trend, Segment adopted this as a best practice early-on, which served us well in some cases, and, as you’ll soon learn, not so well in others.
Briefly, microservices is a service-oriented software architecture in which server-side applications are constructed by combining many single-purpose, low-footprint network services. The touted benefits are improved modularity, reduced testing burden, better functional composition, environmental isolation, and development team autonomy. The opposite is a Monolithic architecture, where a large amount of functionality lives in a single service which is tested, deployed, and scaled as a single unit.
In early 2017 we reached a tipping point with a core piece of Segment’s product. It seemed as if we were falling from the microservices tree, hitting every branch on the way down. Instead of enabling us to move faster, the small team found themselves mired in exploding complexity. Essential benefits of this architecture became burdens. As our velocity plummeted, our defect rate exploded.
Eventually, the team found themselves unable to make headway, with 3 full-time engineers spending most of their time just keeping the system alive. Something had to change. This post is the story of how we took a step back and embraced an approach that aligned well with our product requirements and needs of the team.
Segment’s customer data infrastructure ingests hundreds of thousands of events per second and forwards them to partner APIs, which we refer to as server-side destinations. There are over one hundred types of these destinations, such as Google Analytics, Optimizely, or a custom webhook.
Years back, when the product initially launched, the architecture was simple. There was an API that ingested events and forwarded them to a distributed message queue. An event, in this case, is a JSON object generated by a web or mobile app containing information about users and their actions. A sample payload looks like the following:
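The original sample payload did not survive formatting; the object below is an illustrative reconstruction with hypothetical values, using field names from Segment’s documented track-call format:

```json
{
  "type": "track",
  "event": "Signup Completed",
  "userId": "user_123",
  "timestamp": "2018-07-10T21:00:00Z",
  "properties": {
    "plan": "business"
  },
  "context": {
    "library": { "name": "analytics.js", "version": "3.2.5" }
  }
}
```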
As events were consumed from the queue, customer-managed settings were checked to decide which destinations should receive the event. The event was then sent to each destination’s API, one after another, which was useful because developers only need to send their event to a single endpoint, Segment’s API, instead of building potentially dozens of integrations. Segment handles making the request to every destination endpoint.
If one of the requests to a destination fails, sometimes we’ll try sending that event again at a later time. Some failures are safe to retry while others are not. Retry-able errors are those that could potentially be accepted by the destination with no changes. For example, HTTP 500s, rate limits, and timeouts. Non-retry-able errors are requests that we can be sure will never be accepted by the destination. For example, requests which have invalid credentials or are missing required fields.
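That classification could be sketched roughly as follows (the function name and error shape are hypothetical; the status codes follow the examples above):

```javascript
// Sketch: decide whether a failed destination request is worth retrying.
// Retryable failures might be accepted later with no changes to the event;
// non-retryable failures (bad credentials, missing fields) never will be.
function isRetryable(err) {
  if (err.timeout) return true;         // timeouts may succeed on a later attempt
  if (err.status === 429) return true;  // rate limited: back off and retry
  if (err.status >= 500) return true;   // e.g. HTTP 500s: transient server errors
  return false;                         // e.g. 401 invalid credentials, 400 missing fields
}
```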
At this point, a single queue contained both the newest events and those that may have already had several retry attempts, across all destinations, which resulted in head-of-line blocking: if one destination slowed or went down, retries would flood the queue, resulting in delays across all our destinations.
Imagine destination X is experiencing a temporary issue and every request errors with a timeout. Now, not only does this create a large backlog of requests which have yet to reach destination X, but also every failed event is put back to retry in the queue. While our systems would automatically scale in response to increased load, the sudden increase in queue depth would outpace our ability to scale up, resulting in delays for the newest events. Delivery times for all destinations would increase because destination X had a momentary outage. Customers rely on the timeliness of this delivery, so we can’t afford increases in wait times anywhere in our pipeline.
To solve the head-of-line blocking problem, the team created a separate service and queue for each destination. This new architecture consisted of an additional router process that receives the inbound events and distributes a copy of the event to each selected destination. Now if one destination experienced problems, only its queue would back up and no other destinations would be impacted. This microservice-style architecture isolated the destinations from one another, which was crucial when one destination experienced issues, as they often do.
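The router’s fan-out might be sketched like this (names and the queue interface are illustrative, not Segment’s actual implementation):

```javascript
// Sketch: copy each inbound event onto a dedicated queue per enabled
// destination, so a slow destination backs up only its own queue.
function route(event, settings, queues) {
  const delivered = [];
  for (const destination of settings.enabledDestinations) {
    queues[destination].push({ ...event }); // independent copy per destination queue
    delivered.push(destination);
  }
  return delivered;
}
```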
Each destination API uses a different request format, requiring custom code to translate the event to match this format. A basic example: destination X requires sending birthday as traits.dob in the payload, whereas our API accepts it as traits.birthday. The transformation code in destination X would look something like this:
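A minimal sketch of such a transform (the function name and surrounding event shape are illustrative, not the original code):

```javascript
// Sketch: rename traits.birthday (Segment's field) to traits.dob
// (destination X's expected field), leaving other traits untouched.
function transformToDestinationX(event) {
  const { birthday, ...otherTraits } = event.traits || {};
  return {
    ...event,
    traits: {
      ...otherTraits,
      // Only add dob when a birthday was actually present.
      ...(birthday !== undefined ? { dob: birthday } : {}),
    },
  };
}
```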
Many modern destination endpoints have adopted Segment’s request format making some transforms relatively simple. However, these transforms can be very complex depending on the structure of the destination’s API. For example, for some of the older and most sprawling destinations, we find ourselves shoving values into hand-crafted XML payloads.
Initially, when the destinations were divided into separate services, all of the code lived in one repo. A huge point of frustration was that a single broken test caused tests to fail across all destinations. When we wanted to deploy a change, we had to spend time fixing the broken test even if it had nothing to do with the change we were making. In response to this problem, we decided to break out the code for each destination into its own repo. All the destinations were already broken out into their own services, so the transition was natural.
The split to separate repos allowed us to isolate the destination test suites easily. This isolation allowed the development team to move quickly when maintaining destinations.
As time went on, we added over 50 new destinations, and that meant 50 new repos. To ease the burden of developing and maintaining these codebases, we created shared libraries to make common functionality, such as transforms and HTTP request handling, easier and more uniform across our destinations.
For example, if we want the name of a user from an event, event.name() can be called in any destination’s code. The shared library checks the event for the property keys name and Name. If those don’t exist, it checks for a first name under its common casings, such as firstName and FirstName. It does the same for the last name, checking the different casings and combining the two to form the full name.
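A sketch of such a helper, with hypothetical key fallbacks (the real library may check different casings):

```javascript
// Sketch of a shared-library helper like event.name(): fall back through
// common property-key casings, then combine first and last name when no
// full name is present.
function eventName(properties) {
  const fullName = properties.name ?? properties.Name;
  if (fullName) return fullName;
  const first = properties.firstName ?? properties.first_name ?? properties.FirstName;
  const last = properties.lastName ?? properties.last_name ?? properties.LastName;
  if (first && last) return `${first} ${last}`;
  return first ?? last ?? undefined;
}
```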
The shared libraries made building new destinations quick. The familiarity brought by a uniform set of shared functionality made maintenance less of a headache.
However, a new problem began to arise. Testing and deploying changes to these shared libraries impacted all of our destinations. It began to require considerable time and effort to maintain. Making changes to improve our libraries, knowing we’d have to test and deploy dozens of services, was a risky proposition. When pressed for time, engineers would only include the updated versions of these libraries on a single destination’s codebase.
Over time, the versions of these shared libraries began to diverge across the different destination codebases. The great benefit we once had of reduced customization between each destination codebase started to reverse. Eventually, all of them were using different versions of these shared libraries. We could’ve built tools to automate rolling out changes, but at this point, not only was developer productivity suffering but we began to encounter other issues with the microservice architecture.
Another problem was that each service had a distinct load pattern. Some services would handle a handful of events per day while others handled thousands of events per second. For destinations that handled a small number of events, an operator would have to manually scale the service up to meet demand whenever there was an unexpected spike in load.
While we did have auto-scaling implemented, each service had a distinct blend of required CPU and memory resources, which made tuning the auto-scaling configuration more art than science.
The number of destinations continued to grow rapidly, with the team adding three destinations per month on average, which meant more repos, more queues, and more services. With our microservice architecture, our operational overhead increased linearly with each added destination. Therefore, we decided to take a step back and rethink the entire pipeline.
The first item on the list was to consolidate the now over 140 services into a single service. The overhead from managing all of these services was a huge tax on our team. We were literally losing sleep over it since it was common for the on-call engineer to get paged to deal with load spikes.
However, the architecture at the time would have made moving to a single service challenging. With a separate queue per destination, each worker would have to check every queue for work, which would have added a layer of complexity to the destination service with which we weren’t comfortable. This was the main inspiration for Centrifuge. Centrifuge would replace all our individual queues and be responsible for sending events to the single monolithic service.
Given that there would only be one service, it made sense to move all the destination code into one repo, which meant merging all the different dependencies and tests into a single repo. We knew this was going to be messy.
For each of the 120 unique dependencies, we committed to having one version for all our destinations. As we moved destinations over, we’d check the dependencies it was using and update them to the latest versions. We fixed anything in the destinations that broke with the newer versions.
With this transition, we no longer needed to keep track of the differences between dependency versions. All our destinations were using the same version, which significantly reduced the complexity across the codebase. Maintaining destinations now became less time consuming and less risky.
We also wanted a test suite that allowed us to quickly and easily run all our destination tests. Running all the tests was one of the main blockers when making updates to the shared libraries we discussed earlier.
Fortunately, the destination tests all had a similar structure. They had basic unit tests to verify our custom transform logic was correct and would execute HTTP requests to the partner’s endpoint to verify that events showed up in the destination as expected.
Recall that the original motivation for separating each destination codebase into its own repo was to isolate test failures. However, it turned out this was a false advantage. Tests that made HTTP requests were still failing with some frequency. With destinations separated into their own repos, there was little motivation to clean up failing tests. This poor hygiene led to a constant source of frustrating technical debt. Often a small change that should have only taken an hour or two would end up requiring a couple of days to a week to complete.
The outbound HTTP requests to destination endpoints during the test run were the primary cause of failing tests. Unrelated issues like expired credentials shouldn’t fail tests. We also knew from experience that some destination endpoints were much slower than others. Some destinations took up to 5 minutes to run their tests. With over 140 destinations, our test suite could take up to an hour to run.
To solve for both of these, we created Traffic Recorder. Traffic Recorder is built on top of yakbak, and is responsible for recording and saving destinations’ test traffic. Whenever a test runs for the first time, any requests and their corresponding responses are recorded to a file. On subsequent test runs, the request and response in the file are played back instead of requesting the destination’s endpoint. These files are checked into the repo so that the tests are consistent across every change. Now that the test suite is no longer dependent on these HTTP requests over the internet, our tests became significantly more resilient, a must-have for the migration to a single repo.
I remember running the tests for every destination for the first time, after we integrated Traffic Recorder. It took milliseconds to complete running the tests for all 140+ of our destinations. In the past, just one destination could have taken a couple of minutes to complete. It felt like magic.
Once the code for all destinations lived in a single repo, they could be merged into a single service. With every destination living in one service, our developer productivity substantially improved. We no longer had to deploy 140+ services for a change to one of the shared libraries. One engineer can deploy the service in a matter of minutes.
The proof was in the improved velocity. In 2016, when our microservice architecture was still in place, we made 32 improvements to our shared libraries. Just this year we’ve made 46 improvements. We’ve made more improvements to our libraries in the past 6 months than in all of 2016.
The change also benefited our operational story. With every destination living in one service, we had a good mix of CPU and memory-intense destinations, which made scaling the service to meet demand significantly easier. The large worker pool can absorb spikes in load, so we no longer get paged for destinations that process small amounts of load.
Moving from our microservice architecture to a monolith was overall a huge improvement; however, there are trade-offs:
Fault isolation is difficult. With everything running in a monolith, if a bug is introduced in one destination that causes the service to crash, the service will crash for all destinations. We have comprehensive automated testing in place, but tests can only get you so far. We are currently working on a much more robust way to prevent one destination from taking down the entire service while still keeping all the destinations in a monolith.
In-memory caching is less effective. Previously, with one service per destination, our low traffic destinations only had a handful of processes, which meant their in-memory caches of control plane data would stay hot. Now that cache is spread thinly across 3000+ processes so it’s much less likely to be hit. We could use something like Redis to solve for this, but then that’s another point of scaling for which we’d have to account. In the end, we accepted this loss of efficiency given the substantial operational benefits.
Updating the version of a dependency may break multiple destinations. While moving everything to one repo solved the previous dependency mess we were in, it means that if we want to use the newest version of a library, we’ll potentially have to update other destinations to work with the newer version. In our opinion though, the simplicity of this approach is worth the trade-off. And with our comprehensive automated test suite, we can quickly see what breaks with a newer dependency version.
Our initial microservice architecture worked for a time, solving the immediate performance issues in our pipeline by isolating the destinations from each other. However, we weren’t set up to scale. We lacked the proper tooling for testing and deploying the microservices when bulk updates were needed. As a result, our developer productivity quickly declined.
Moving to a monolith allowed us to rid our pipeline of operational issues while significantly increasing developer productivity. We didn’t make this transition lightly though and knew there were things we had to consider if it was going to work.
We needed a rock solid testing suite to put everything into one repo. Without this, we would have been in the same situation as when we originally decided to break them apart. Constant failing tests hurt our productivity in the past, and we didn’t want that happening again.
We accepted the trade-offs inherent in a monolithic architecture and made sure we had a good story around each. We had to be comfortable with some of the sacrifices that came with this change.
When deciding between microservices or a monolith, there are different factors to consider with each. In some parts of our infrastructure, microservices work well but our server-side destinations were a perfect example of how this popular trend can actually hurt productivity and performance. It turns out, the solution for us was a monolith.
Special thanks to Rick Branson for helping review and edit this post at every stage.
Calvin French-Owen on May 23rd 2018
Today, we’re excited to share the architecture for Centrifuge–Segment’s system for reliably sending billions of messages per day to hundreds of public APIs. This post explores the problems Centrifuge solves, as well as the data model we use to run it in production.
At Segment, our core product collects, processes, and delivers hundreds of thousands of analytics events per second. These events consist of user actions like viewing a page, buying an item from Amazon, or liking a friend’s playlist. No matter what the event is, it’s almost always the result of some person on the internet doing something.
At any point in time, dozens of these endpoints will be in a state of failure. We’ll see 10x increases in response latency, spikes in 5xx status codes, and aggressive rate limiting for single large customers.
To give you a flavor, here are the sorts of latencies and uptimes I pulled from our internal monitoring earlier today.
In the best case, these API failures cause delays. In the worst case, data loss.
As it turns out, ‘scheduling’ that many requests in a faulty environment is a complex problem. You have to think hard about fairness (what data should you prioritize?), buffering semantics (how should you enqueue data?), and retry behavior (does retrying now add unwanted load to the system?).
Across all of the literature, we couldn't find a lot of good 'prior art' for delivering messages reliably in high-failure environments. The closest thing is network scheduling and routing, but that discipline has very different strategies concerning buffer allocation (very small) and backpressure strategies (adaptive, and usually routing to a single place).
So we decided to build our own general-purpose, fully distributed job scheduler to schedule and execute HTTP requests reliably. We've called it Centrifuge.
You can think of Centrifuge as the layer that sits between our infrastructure and the outside world–it's the system responsible for sending data to all of our customers' destinations. When third-party APIs fail, Centrifuge is there to absorb the traffic.
Under normal operation, Centrifuge has three responsibilities: it delivers messages to third-party endpoints, retries messages upon failure, and archives any undelivered messages.
We've written this first post as a guide to understand the problems Centrifuge solves, its data model, and the building blocks we’ve used to operate it in production. In subsequent posts, we’ll share how we’ve verified the system’s correctness and made it blindingly fast.
Let’s dive in.
Before discussing Centrifuge itself, you might be thinking "why not just use a queue here? Building a fully distributed job scheduler seems a bit overdone”.
We've asked ourselves the same question. We already use Kafka extensively at Segment (we're passing nearly 1M messages per second through it), and it’s been the core building block of all of our streaming pipelines.
The problem with using any sort of queue is that you are fundamentally limited in terms of how you access data. After all, a queue only supports two operations (push and pop).
To see where queues break down, let's walk through a series of queueing topologies that we’ve implemented at Segment.
To start, let’s first consider a naive approach. We can run a group of workers that read jobs from a single queue.
Workers will read a single message off the queue, send it to whatever third-party APIs are required, and then acknowledge the message. It seems like this should protect us from failure, right?
This works okay for a while, but what happens when a single endpoint starts to get slow? Unfortunately, it creates backpressure on the entire message flow.
Clearly, this isn’t ideal. If a single endpoint can bring down the entire pipeline, and each endpoint is down for an hour each year (roughly 99.99% available), then with 200+ endpoints we’d expect an hour-long outage every day or two.
After seeing repeated slowdowns on our ingestion pipeline, we decided to re-architect. We updated our queueing topology to route events into separate queues based upon the downstream endpoints they would hit.
To do this, we added a router in front of each queue. The router would only publish messages to a queue destined for a specific API endpoint.
Suppose you had three destinations enabled: Google Analytics, Mixpanel, and Salesforce. The router would publish three messages, one to each dedicated queue for Google Analytics, Mixpanel, and Salesforce, respectively.
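That fan-out can be sketched as follows (the event shape and queue names are illustrative, not Segment's actual types):

```go
package main

import "fmt"

// event is a hypothetical incoming message plus the destinations the
// source has enabled.
type event struct {
	payload      string
	destinations []string
}

// route publishes one copy of the event onto each destination's
// dedicated queue, so a slow endpoint only backs up its own queue.
func route(queues map[string][]string, e event) {
	for _, dest := range e.destinations {
		queues[dest] = append(queues[dest], e.payload)
	}
}

func main() {
	queues := map[string][]string{}
	route(queues, event{
		payload:      `{"type":"track","event":"Item Purchased"}`,
		destinations: []string{"google-analytics", "mixpanel", "salesforce"},
	})
	for dest, q := range queues {
		fmt.Println(dest, "queue depth:", len(q))
	}
}
```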
The benefit of this approach is that a single failing API will only affect messages bound for a single endpoint (which is what we want!).
Unfortunately, this approach has problems in practice. If we look at the distribution of messages which should be delivered to a single endpoint, things become a little more nuanced.
Segment is a large, multi-tenant system, so some sources of data will generate substantially more load than others. As you might imagine, among our customer base, this follows a fairly consistent power law:
When that translates to messages within our queues, the breakdown looks more like this:
In this case, we have data for customers A, B, and C, all trying to send to the same downstream endpoint. Customer A dominates the load, but B and C have a handful of calls mixed in.
Let’s suppose that the API endpoint we are sending to is rated to 1,000 calls per second, per customer. When the endpoint receives more than 1,000 calls in a second for a given customer API key, it will respond with an HTTP 429 status code (rate limit exceeded).
Now let’s assume that customer A is trying to send 50,000 messages to the API. Those messages are all ordered contiguously in our queue.
At this point we have a few options:
we can try and send a hard cap of 1,000 messages per second, but this delays traffic for B and C by 50 seconds.
we can try and send more messages to the API for customer A, but we will see 429 (rate limit exceeded) errors. We’ll want to retry those failed messages, possibly causing more slowdowns for B and C.
we can detect that we are nearing a rate-limit after sending 1,000 messages for customer A in the first second, so we can then copy the next 49,000 messages for customer A into a dead-letter queue, and allow the traffic for B and C to proceed.
None of these options are ideal. We’ll either end up blocking the queue for all customers in the case where a single customer sends a large batch of data, or we’ll end up copying terabytes of data between dead-letter queues.
Instead, we want an architecture that looks more like the following diagram, where we have separate queues per combination of customer and endpoint. This architecture gives us much better isolation, as well as the ability to dynamically adjust throughput on a per-customer basis.
However, in a large, multi-tenant system, like Segment, this number of queues becomes difficult to manage.
We have hundreds of thousands of these source-destination pairs. Today, we have 42,000 active sources of data sending to an average of 2.1 downstream endpoints. That’s 88,000 total queues that we’d like to support (and we’re growing quickly).
To implement per source-destination queues with full isolation, we’d need hundreds of thousands of different queues. Across Kafka, RabbitMQ, NSQ, or Kinesis–we haven’t seen any queues which support that level of cardinality with simple scaling primitives. SQS is the only queue we’ve found which manages to do this, but is totally cost prohibitive. We need a new primitive to solve this problem of high-cardinality isolation.
We now have our ideal end-state: tens of thousands of little queues. Amongst those queues, we can easily decide to consume messages at different rates from Customers A, B, and C.
But when we start to think about implementation, how do we actually manage that many queues?
We started with a few core requirements for our virtual queue system:
1) Provide per-customer isolation
First and foremost, we need to provide per-customer isolation. One customer sending a significant amount of failing traffic shouldn’t slow down any other data delivery. Our system must absorb failures without slowing the global delivery rate.
2) Allow us to re-order messages without copying terabytes of data
Our second constraint is that our system must be able to quickly shuffle its delivery order without copying terabytes of data over the network.
In our experience working with large datasets, having the ability to immediately re-order messages for delivery is essential. We’ve frequently run into cases which create large backlogs in data processing, where our consumers are spinning on a set of consistently failing messages.
Traditionally there are two ways to handle a large set of bad messages. The first is to stop your consumers and retry the same set of messages after a backoff period. This is clearly unacceptable in a multi-tenant architecture, where valid messages should still be delivered.
The second technique is to publish failed messages to a dead-letter queue and re-consume them later. Unfortunately, re-publishing messages to dead-letter queues or ‘tiers’ of topics with copies of the same event incurs significant storage and network overhead.
In either case, if your data is sitting in Kafka–the delivery order for your messages is effectively ‘set’ by the producer to the topic:
We want the ability to quickly recover from errors without having to shuffle terabytes of data around the network. So neither of these approaches works efficiently for us.
3) Evenly distribute the workload between many different workers
Finally, we need a system which cleanly scales as we increase our event volume. We don’t want to be continually adding partitions or doing additional sharding as we add customers. Our system should scale out horizontally based upon the throughput of traffic that we need.
By this point, we have a good idea of the problems that Centrifuge solves (reliable message delivery), the issues of various queueing topologies, and our core requirements. So let’s look at the Centrifuge data layer to understand how we’ve solved for the constraints we just listed above.
The core delivery unit of Centrifuge is what we call a job.
Jobs require both a payload of data to send, as well as an endpoint indicating where to send the data. You can optionally supply headers to govern things like retry logic, message encoding, and timeout behavior.
In our case, a job is a single event which should be delivered to a partner API. To give you an idea of what jobs look like in practice, here’s a sample Segment job:
Looking back at our requirements, we want a way of quickly altering the delivery order for our jobs, without having to create many copies of the jobs themselves.
A queue won’t solve this problem in-place for us. Our consumer would have to read and then re-write all of the data in our new ordering. But a database, on the other hand, does.
By storing the execution order inside a relational database, we can immediately change the quality of service by running a single SQL statement.
Similarly, whenever we want to change the delivery semantics for our messages, we don’t have to re-shuffle terabytes of data or double-publish to a new datastore. Instead, we can just deploy a new version of our service, and it can start using the new queries right away.
Using a database gives us the flexibility in execution that queues are critically lacking.
The Centrifuge database model has a few core properties that allow it to perform well:
immutable rows: we don’t want to be frequently updating rows, and instead be appending new rows whenever new states are entered. We’ve modeled all job execution plans as completely immutable, so we never run updates in the database itself.
no database JOINs: rather than needing a lot of coordination, with locks across databases or tables, Centrifuge only needs to query data on a per-job basis. This allows us to massively parallelize our databases since we never need to join data across separate jobs.
predominantly write-heavy, with a small working set: because Centrifuge is mostly accepting and delivering new data, we don’t end up reading from the database. Instead, we can cache most new items in memory, and then age entries out of cache as they are delivered.
To give you a sense of how these three properties interact, let’s take a closer look at how jobs are actually stored in our Centrifuge databases.
The jobs table
First, we have the jobs table. This table is responsible for storing all jobs and payloads, including the metadata governing how jobs should be delivered.
The headers fields govern message transmission, while the expire_at field indicates when a given job should be archived. By pulling expire_at into a separate field, our operations team can easily adjust it if we ever need to flush a large number of failing messages to S3, so that we can process them out-of-band.
Looking at the indexes for the jobs table, we’ve been careful to minimize the overhead of building and maintaining indexes on each field. We keep only a single index on the primary key.
The jobs table primary key is a KSUID, which means that our IDs are both k-sortable by timestamp and globally unique. This allows us to effectively kill two birds with one stone–we can query by a single job ID, as well as sort by the time that the job was created, with a single index.
Since the median size of the payload and settings for a single job is about 5kb (and can be as big as 750kb), we’ve done our best to limit reads from and updates to the jobs table.
Under normal operation, the jobs table is immutable and append-only. The golang process responsible for inserting jobs (which we call a Director) keeps a cached version of the payloads and settings in-memory. Most of the time, jobs can be immediately expired from memory after they are delivered, keeping our overall memory footprint low.
In production, we set our jobs to expire after 4 hours, with an exponential backoff strategy.
Of course, we also want to keep track of what state each job is in, whether it is waiting to be delivered, in the process of executing, or awaiting retry. For that, we use a separate table, the job_state_transitions table.
The job state transitions table
The job_state_transitions table is responsible for logging all of the state transitions a single job may pass through.
Within the database, the job state machine looks like this:
A job first enters in the awaiting_scheduling state. It has yet to be executed and delivered to the downstream endpoint.
From there, a job will begin executing, and the result will transition it to one of three separate states.
If the job succeeds (and receives a 200 HTTP response from the endpoint), Centrifuge will mark the job as succeeded. There’s nothing more to be done here, and we can expire it from our in-memory cache.
Similarly, if the job fails (in the case of a 400 HTTP response), then Centrifuge will mark the job as discarded. Even if we try to re-send the same job multiple times, the server will reject it. So we’ve reached another terminal state.
However, it’s possible that we may hit an ephemeral failure like a timeout, network disconnect, or a 500 response code. In this case, retrying can actually bring up our delivery rate for the data we collect (we see this happen across roughly 1.5% of the data for our entire userbase), so we will retry delivery.
Finally, any jobs which exceed their expiration time transition to the archiving state. Once they are successfully stored on S3, the jobs are finally transitioned to a terminal state.
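The transitions described above can be sketched as a table. The state names follow the post; the transition table itself is our reading of the prose, not Centrifuge's actual code:

```go
package main

import "fmt"

type state string

const (
	awaitingScheduling state = "awaiting_scheduling"
	executing          state = "executing"
	succeeded          state = "succeeded"
	discarded          state = "discarded"
	awaitingRetry      state = "awaiting_retry"
	archiving          state = "archiving"
)

// transitions maps each state to the states a job may move to next.
// succeeded, discarded, and the post-archive state are terminal.
var transitions = map[state][]state{
	awaitingScheduling: {executing},
	executing:          {succeeded, discarded, awaitingRetry},
	awaitingRetry:      {executing, archiving},
}

func canTransition(from, to state) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(executing, awaitingRetry)) // true: ephemeral failure
	fmt.Println(canTransition(succeeded, executing))     // false: terminal state
}
```

Since every transition is appended as a new immutable row, the job_state_transitions table is effectively a log of walks through this graph.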
If we look deeper into the transitions, we can see the fields governing this execution:
Like the jobs table, rows in the job_state_transitions table are also immutable and append-only. Every time a new attempt is made, the attempt number is increased. Upon job execution failure, the retry is scheduled with a retry_at time determined by the retry behavior specified in the job itself.
In terms of indexing strategy, we keep a composite index on two fields: a monotonically incrementing ID, as well the ID of the job that is being executed.
You can see here in one of our production databases that the first index in the sequence is always on the job_id, which is guaranteed to be globally unique. From there, the incrementing ID ensures that each entry in the transitions table for a single job’s execution is sequential.
To give you a flavor of what this looks like in action, here’s a sample execution trace for a single job pulled from production.
Notice that the job first starts in the awaiting_scheduling state before quickly transitioning to its first delivery attempt. From there, the job consistently fails, so it oscillates between the executing and awaiting_retry states.
While this trace is certainly useful for internal debugging, the main benefit it provides is the ability to actually surface the execution path for a given event to the end customer. (Stay tuned for this feature, coming soon!)
Up until this point, we’ve focused exclusively on the data model for our jobs. We’ve shown how they are stored in our RDS instance, and how the jobs table and job_state_transitions table are both populated.
But we still need to understand the service writing data to the database and actually executing our HTTP requests. We call this service the Centrifuge Director.
Traditionally, web-services have many readers and writers interacting with a single, centralized database. There is a stateless application tier, which is backed by any number of sharded databases.
Remember though, that Segment’s workload looks very different than a traditional web-service.
Our workload is extremely write-heavy, has no reads, and requires no JOINs or query coordination across separate jobs. Instead, our goal is to minimize the contention between separate writers to keep the writes as fast as possible.
To do that, we’ve adopted an architecture where a single Director interacts with a given database. The Director manages all of its caching, locks, and querying in-process. Because the Director is the sole writer, it can manage all of its cache invalidations with zero-coordination.
The only thing a Director needs to globally coordinate is to which particular database it is writing. We call the attached database a JobDB, and what follows is a view into the architecture for how Directors coordinate to acquire and send messages to a JobDB.
When a Director first boots up, it follows the following lifecycle:
Acquire a spare JobDB via Consul – to begin operating, a Director first does a lookup and acquires a Consul session on the key for a given JobDB. If another Director already holds the lock, the current Director will retry until it finds an available spare JobDB.
Consul sessions ensure that a given database is never concurrently written to by multiple Directors. They are mutually exclusive and held by a single writer. Sessions also allow us to lock an entire keyspace so that a director can freely update the status for the JobDB in Consul while it continues to hold the lock.
Connect to the JobDB, and create new tables – once a Director has connected to a spare JobDB, it needs to create the necessary tables within the connected DB.
Rather than use an ORM layer, we’ve used the standard database/sql golang interface, backed by the go-sql-driver/mysql implementation. Many of these queries and prepared statements are generated via go:generate, but a handful are handwritten.
Begin listening for new jobs and register itself in Consul – after the Director has finished creating the necessary tables, it registers itself in Consul so that clients may start sending the Director traffic.
Start executing jobs – once the Director is fully running, it begins accepting jobs. Those jobs are first logged to the paired JobDB; then the Director begins delivering each job to its specified endpoint.
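The lifecycle above can be condensed into a sketch. The lockStore interface stands in for Consul sessions and jobDB for a real MySQL JobDB; both are hypothetical abstractions for illustration, not Centrifuge's actual interfaces:

```go
package main

import (
	"errors"
	"fmt"
)

// lockStore stands in for Consul: exclusive locks on JobDBs plus
// service registration so clients can find the Director.
type lockStore interface {
	Acquire(db string) bool // true if we now hold the lock on db
	Register(director string)
}

// jobDB stands in for a MySQL JobDB.
type jobDB interface {
	CreateTables() error // jobs and job_state_transitions
}

// bootDirector walks the boot sequence: acquire a spare JobDB, create
// its tables, register for traffic, and return the paired DB's name.
func bootDirector(name string, locks lockStore, spares map[string]jobDB) (string, error) {
	for dbName, db := range spares {
		if !locks.Acquire(dbName) {
			continue // another Director holds this one; keep looking
		}
		if err := db.CreateTables(); err != nil {
			return "", err
		}
		locks.Register(name) // clients may now send us traffic
		return dbName, nil   // ready to accept and execute jobs
	}
	return "", errors.New("no spare JobDB available")
}

// In-memory stand-ins for the interfaces above.
type memLocks struct {
	held       map[string]bool
	registered []string
}

func (m *memLocks) Acquire(db string) bool {
	if m.held[db] {
		return false
	}
	m.held[db] = true
	return true
}

func (m *memLocks) Register(d string) { m.registered = append(m.registered, d) }

type memDB struct{ created bool }

func (m *memDB) CreateTables() error { m.created = true; return nil }

func main() {
	locks := &memLocks{held: map[string]bool{"jobdb-1": true}} // jobdb-1 already taken
	spares := map[string]jobDB{"jobdb-1": &memDB{}, "jobdb-2": &memDB{}}
	db, err := bootDirector("director-a", locks, spares)
	if err != nil {
		panic(err)
	}
	fmt.Println("acquired", db)
}
```

The key invariant the real system enforces via Consul sessions is the same one Acquire models here: at most one Director ever writes to a given JobDB.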
Now that we understand the relationship between Directors and JobDBs, we can look back at the properties of the system (immutable, extremely write-heavy with a small working set, no database JOINs), and understand how Centrifuge is able to quickly absorb traffic.
Under normal operation, the Director rarely has to read from the attached JobDB. Because all jobs are immutable and the Director is the sole writer, it can cache all jobs in-memory and expire them immediately once they are delivered. The only time it needs to read from the database is when recovering from a failure.
Looking at the pprof for our memory profile, we can see that a significant proportion of heap objects do indeed fall into the category of cached jobs:
And thanks to the cache, our writes dominate our reads. Here are example CloudWatch metrics pulled from a single active database.
Since all jobs are extremely short-lived (typically cached only for the few hundred milliseconds while they are being executed), we can quickly expire delivered jobs from our cache.
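The expire-on-delivery behavior might look like this in miniature (types and method names are illustrative, not the Director's actual code):

```go
package main

import "fmt"

// jobCache sketches the Director's in-memory working set. Because the
// Director is the sole writer to its JobDB, it needs no coordination to
// drop entries the moment they reach a terminal state.
type jobCache struct {
	inflight map[string][]byte // job ID -> payload
}

func (c *jobCache) Add(id string, payload []byte) {
	c.inflight[id] = payload
}

// Deliver attempts delivery via send; on success the job is expired
// from memory immediately, on failure it stays cached for the retry.
func (c *jobCache) Deliver(id string, send func([]byte) error) error {
	payload, ok := c.inflight[id]
	if !ok {
		return fmt.Errorf("job %s not cached", id)
	}
	if err := send(payload); err != nil {
		return err
	}
	delete(c.inflight, id)
	return nil
}

func main() {
	cache := &jobCache{inflight: map[string][]byte{}}
	cache.Add("job-1", []byte(`{"event":"Item Purchased"}`))
	_ = cache.Deliver("job-1", func(b []byte) error { return nil })
	fmt.Println(len(cache.inflight)) // 0: delivered jobs leave memory at once
}
```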
Taking a step back, we can now combine the concepts of the Centrifuge data model with the Director and JobDB.
First, the Director is responsible for accepting new jobs via RPC. When it receives the RPC request, it will go ahead and log those jobs to the attached JobDB, and respond with a transaction ID once the jobs have been successfully persisted.
From there, the Director makes requests to all of the specified endpoints, retrying jobs where necessary, and logging all state transitions to the JobDB.
If the Director fails to deliver any jobs after their expiration time (4 hours in our case), they are archived on S3 to be re-processed later.
Of course, a single Director wouldn’t be able to handle all of the load on our system.
In production, we run many individual Directors, each of which handles a small slice of our traffic. Over the past month, we’ve been running anywhere from 80 to 300 Directors at peak load.
Like all of our other services at Segment, the Directors scale themselves up and down based upon CPU usage. If our system starts running under load, ECS auto-scaling rules will add Directors. If we are over capacity, ECS removes them.
However, Centrifuge created an interesting new challenge for us. We needed to appropriately scale our storage layer (individual JobDBs) up and down to match the scaling in our compute layer (instances of Director containers).
To do that, we created a separate binary called the JobDB Manager. The Manager’s job is to constantly adjust the number of databases to match the number of Directors. It keeps a pool of ‘spare’ databases around in case we need to scale up suddenly. And it will retire old databases during off-peak hours.
To keep the ‘small working set’ even smaller, we cycle these JobDBs roughly every 30 minutes. The manager cycles each JobDB when the percentage of data it has filled is about to exceed what fits in available RAM.
This cycling of databases ensures that no single database is slowing down because it has to keep growing its memory outside of RAM.
Instead of issuing a large number of random deletes, we end up batching the deletes into a single drop table for better performance. And if a Director exits and has to restart, it must only read a small amount of data from the JobDB into memory.
By the time 30 minutes have passed, 99.9% of all events have either failed or been delivered, and a small subset are currently in the process of being retried. The manager is then responsible for pairing a small drainer process with each JobDB, which will migrate currently retrying jobs into another database before fully dropping the existing tables.
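The cycling step can be sketched with a toy model (all types here are illustrative): delivered jobs go away with the table in one batched drop, while the drainer migrates still-retrying jobs into the next database.

```go
package main

import "fmt"

// table is a toy model of a cycled JobDB's tables.
type table struct {
	delivered []string // finished jobs: dropped along with the table
	retrying  []string // in-flight jobs: migrated by the drainer
}

// cycle drains retrying jobs into the fresh table, then "drops" the
// old one in a single operation — think one DROP TABLE instead of
// millions of individual row deletes.
func cycle(old *table, fresh *table) {
	fresh.retrying = append(fresh.retrying, old.retrying...)
	*old = table{} // the single batched drop
}

func main() {
	old := &table{
		delivered: []string{"j1", "j2", "j3"},
		retrying:  []string{"j4"},
	}
	fresh := &table{}
	cycle(old, fresh)
	fmt.Println("old rows:", len(old.delivered), "drained:", len(fresh.retrying))
}
```

Because 99.9% of jobs have settled within the 30-minute window, the drainer only ever has to move that small retrying tail.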
Today, we are using Centrifuge to fully deliver all events through Segment. By the numbers this means:
800 commits from 5 engineers
50,000 lines of Go code
9 months of build, correctness testing, and deployment to production
400,000 outbound HTTP requests per second
2 million load-tested HTTP requests per second
340 billion jobs executed in the last month
On average, we find about 1.5% of all global data succeeds on a retry, where it did not succeed on the first delivery attempt.
Depending on your perspective, 1.5% may or may not sound like a big number. For an early-stage startup, 1.5% accuracy is almost meaningless. For a large retailer making billions of dollars in revenue, 1.5% accuracy is incredibly significant.
On the graph below, you can see all successful retries split by ‘attempt number’. We typically deliver the majority of messages on their second try (the large yellow bar), but about 50% of retries succeed only on the third through the tenth attempts.
Of course, seeing the system operate at ‘steady-state’ isn’t really the most interesting part of Centrifuge. It’s designed to absorb traffic in high-load failure scenarios.
We had tested many of these scenarios in a staging account, but had yet to really see a third-party outage happen in production. One month after the full rollout, we finally got to observe the system operating in a high-failure state.
At 4:45 pm on March 17th, one of our more popular integrations started reporting high latencies and elevated 500s. Under normal operation, this API receives 16,000 requests per second, which is a fairly significant portion of our outbound traffic load.
From 4:45pm until 6:30pm, our monitoring saw a sharp decline and steeply degraded performance. The percentage of successful calls dropped to about 15% of normal traffic load.
Here you can see the graph of successful calls in dark red, plotted against the data from one week before as the dashed thin line.
During this time, Centrifuge began to rapidly retry the failed requests. Our exponential backoff strategy started to kick in, and we started attempting to re-send any requests which had failed.
Here you can see the request volume to the third-party’s endpoint. Admittedly this strategy still needs some tuning–at peak, we were sending around 100,000 requests per second to the partner’s API.
You can see the requests rapidly start retrying over the first few minutes, but then smooth out as they hit their exponential backoff period.
This outage was the first time we’d really demonstrated the true power of Centrifuge. Over a 90-minute period, we managed to absorb about 85 million analytics events in Segment’s infrastructure. In the subsequent 30 minutes after the outage, the system successfully delivered all of the queued traffic.
Watching the event was incredibly validating. The system worked as anticipated: it scaled up, absorbed the load, and then flushed it once the API had recovered. Even better, our mutual customers barely noticed. A handful saw delays in delivering their data to the third-party tool, but none saw data loss.
Best of all, this single outage didn’t affect data delivery for any other integrations we support!
All told, there’s a lot more we could say about Centrifuge. Which is why we’re saving a number of the implementation details around it for further posts.
In our next posts in the series, we plan to share:
how we’ve verified correctness and exactly-once delivery while moving jobs into Centrifuge
how we’ve optimized the system to achieve high performance, and low-cost writes
how we’ve built upon the Centrifuge primitives to launch an upcoming visibility project
which choices and properties we plan on re-thinking for future versions
Until then, you can expect that Centrifuge will continue evolving under the hood. And we’ll continue our quest for no data left behind.
Interested in joining us on that quest? We’re hiring.
Centrifuge is the result of a 9-month development and roll-out period.
Rick Branson designed and architected the system (as well as christened it with the name). Achille Roussel built out most of the core functionality, the QA process, and performance optimizations. Maxence Charriere was responsible for building the initial JobDB Manager as well as numerous integration tests and checks. Alexandra Noonan built the functionality for pairing drainers to JobDBs and helped optimize the system to meet our cost-efficiency goals. And Tom Holmes wrote most of the archiving code, the drainer processes, and tracked down countless edge cases and bugs. Julien Fabre helped architect and build our load testing environment. Micheal Lopez designed the amazing logo.
Special thanks to James Cowling for advising on the technical design and helping us think through our two-phase-commit semantics.
To close, we wanted to share a few of the moments in development and rollout:
June 23rd, 2017: Max, Rick, and Achille begin testing Centrifuge on a subset of production traffic for the first time. They are stoked.
Sept 22, 2017: Achille gets some exciting ideas for cycling databases. Feverish whiteboarding ensues.
January 12, 2018: we hit a major milestone of 70% traffic flowing through the system. Tom bows for the camera.
Mar 14, 2018: We hit a new load test record of 2M messages per second in our “Black Hole” account.
May 22, 2018: Tom, Calvin, Alexandra, and Max take a group picture, since we forgot to earlier. Rick and Achille are traveling.
Doug Roberge, Lyuda Grigorieva on May 10th 2020
What’s it like to build and grow a business in the age of COVID-19? We sat down with three top e-commerce industry growth leaders to find out.
Harini Karthik, the Director of Data Science at Shipt, a delivery service owned by Target Corporation.
Bryan Mahoney, the Co-Founder & CTO at arfa, a consumer goods company that develops personal care brands.
Lex Roman, Founder and Growth Design Lead at lexroman.com, a design wunderkind who focuses on growth, design, and analytics.
Each member of the panel brought a unique perspective on the state of e-commerce during the COVID-19 pandemic from both their personal and professional experiences. Their discussion ranged from how the e-commerce industry is being affected as a whole, to how individual businesses can use data “the right way” in a time of such uncertainty.
Dive into the conversation below, or watch the full on-demand webinar here.
Over-index on being nimble. Abandon the idea of planning six months ahead. It’s impossible to know what that market landscape will look like. Instead, ask your teams to plan six weeks, or even one week ahead, and make resolute decisions knowing that you may have to pivot. Experiment aggressively to determine where there are potential pockets of growth.
Use data to determine new customer needs. Your customer needs have undoubtedly changed, and it’s vital you can adapt to them to retain their business. Making sure your data is organized and accessible is also more important than ever given the remote working environment. Everyone in your company should be able to see and understand the trends amongst your customer base in order to make data-driven decisions.
Focus on customer LTV, but from a different perspective. Because the LTV of a customer is dictated by how much you spend to acquire them, it’s worthwhile to think about how to make that first interaction profitable. Using data and marketing tools to help customers find the right products or services quickly can be a smart way to boost LTV and customer satisfaction.
Embrace authenticity over perfection when it comes to content. While video marketing is proliferating, there is a growing desire for a more “authentic,” less polished approach. Your customers have started to become accustomed to life at home, and that means changes to the type of content they consume, and how it’s consumed. Moreover, from a practicality standpoint, it’s much easier to produce “authentic” content over a perfectly polished ad given the remote work environment.
Be a good global citizen. While we all navigate these unprecedented times for our businesses, there will be setbacks, frustrations, and the development of a “new normal.” Despite this, there are also tremendous opportunities to re-envision company priorities, reflect on our contributions to the world, and make a lasting impact on our consumers. Caution: don’t be cute or tricky about it; customers will see right through it.
Can you share your take on the impact COVID-19 has had on your company and the e-commerce industry as a whole?
Lex: A lot of new products have come forward as a result of this crisis. Since so many businesses are being forced to go digital, I’m working with a lot of local organizations to figure out how they can sell online, how they can build an online presence, and how they can communicate with their base differently than they did before.
Bryan: Arfa launched right when COVID-19 began. We’d been working so hard toward an e-commerce goal, and we were so excited to share it with the world. We knew we had product/market fit, but we paused to decide: Is this the right time to try to bring a new product into the market?
Within 72 hours, we changed our launch strategy and changed our website. It’s a testament to arfa’s ability to be nimble. Like everyone else right now, we are trying to find new ways to work, to be collaborative, and over-index on being nimble. Despite all of our prior planning, so many things right now are a guess.
Harini: Yes, I agree with Lex. Right now, so many companies are being forced to have a digital transformation, be it working remotely or getting their products to have an online presence. Digital transformation has been a key for businesses, but I think that on a personal front, we are having internal transformations as well. We are reflecting on health and family and what is really important for us.
Shipt is an online grocery delivery service, and the demand has spiked tremendously. Our product teams are working on launching newer features that will increase member and shopper safety. Our other focus is on how to meet the higher demand.
What changes have you made in terms of product strategy?
Harini: We are creating small groups to brainstorm ideas around what they are seeing in the data. What are the gaps? How can we provide better service? What are members requesting? What are shoppers requesting? What is the data telling us about these things? We look at all of that data together and then determine: What product features should we be launching?
From there, we scope out those ideas to understand the effort versus the impact. It’s definitely accelerated development driven by the health and safety of our members and shoppers and the data that we are receiving.
Bryan: Yeah, to follow up on that, we are being really careful with the data that we are receiving right now because the economy has really slowed down. So we are giving ourselves license to do the same.
We don’t want to start chasing something assuming it’s the new normal, because we don’t really know what the new normal is. We are going to be getting data we’ve never seen before, so we’re asking ourselves: How do we identify real opportunities? How can we come up with a “generalizable approach” to take advantage of those opportunities so that it will also be useful when we return to “normal,” or a “new normal,” whatever that looks like?
I don’t think we can “data” our way into identifying what that new normal is, so we need to fight the instinct to chase these new opportunities that might not last. Cooler heads should prevail.
Lex: I think it’s interesting to see people reacting to the new data, which is emerging every day. I think there is an argument for stepping back and seeing where the chips fall. One of my recent clients was Gusto, the payroll and HR company. They took a look at their roadmap and asked, “Ok, what can we do to help our customers now?” Their customers are mostly small businesses.
So, they are now looking at things like webinars, education, giving people breaks on payments. This is Gusto’s way to support all of these restaurants, coffee shops and fitness studios who are now trying to make ends meet.
Gusto's Small Business Relief Finder
This presented a really cool opportunity for product and growth teams to shake up that road map and ask themselves, “What do our customers really need, and how can we deliver on that now?” I think it’s really cool to be able to ship value that is aligned with your customers.
Harini: One thing I’d like to add is that some of the things that we’re seeing now in the data likely will stick. Take working from home, for example. We were already trending in that direction, and now people are discussing if this will be part of the new normal.
So, I think companies need to keep in mind that although some of these trends may not have the exponential growth curves that we’re seeing during this digital transformation, it’s still important to understand and reflect on how your business will work in this setting.
Do you see a future with more or less commissioned-based marketing, specifically for direct-to-consumer brands?
Bryan: I think that’s an excellent question. I’ve spent the last 10 years in the direct-to-consumer space and part of that playbook is spending on ads to find customers for your products.
Now, at least at arfa, we’re trying to find products for our customers, and how we think about growth is through that lens. That may be a cheeky way to answer a question about ad spend, but what I mean is that we are not motivated to spend money to try to convince you to buy our product.
Instead, we want to use our tech resources and marketing know-how to have better conversations with our customers, to really find out what they need, and then we’ll find products for them, with them.
That’s how I want to grow products and brands. That’s how I want e-commerce to move forward, and I think that’s the promise of direct-to-consumer (DTC). I guess a more direct way to answer this question is: We are not spending more on ads during COVID-19.
The impact of COVID-19 on online advertising.
COVID-19 has changed your customer needs, how has that influenced your roadmap or prioritization of existing product development plans?
Lex: I’ve been really impressed by local restaurants and how they are changing their marketing strategies because they can no longer rely on foot traffic. They are relying heavily on email marketing and offering delivery and curbside pickup; some restaurants are even offering groceries.
Really paying attention to your customers is your best growth tactic, and I’ve always felt that way. And now is a really good time to be listening to them, interviewing them, understanding their mindset, their emotional state, their situations. Then you can respond with messaging and offerings that are relevant to them.
Bryan: I totally agree. Now is the time for entrepreneurship, and the restaurant example is a good one. We see people being entrepreneurial in the sense that they are doing what they need to do to survive.
Harini: On the topic of ad spend and growth, with so many people at home now, I think there will be shifts in paid ad spend budgets, skewing more towards digital channels rather than other channels.
I also think companies should focus more on creating a “growth loop” where the product has a viral loop by itself. I think these external channels can be good at triggering these loops, but ultimately, you have to have a good product/market fit and add value to customers. So, the focus should be more on making sure we are delivering good products and in parallel leveraging channels to help drive some of those loops. That will be important.
How has the use of video and video content changed since COVID-19?
Bryan: Video is incredibly important. What I’m noticing is that it's not important to be perfect anymore with content, especially video. Content creation is becoming far more democratized. We are also seeing more content that is not created by influencers who make it “perfect.” And the more we see ad spend shift there, it will become a movement that snowballs and that can be really powerful.
I think this ties back to the idea that we are going to see more entrepreneurial shifts during this time, and this feels really related to that. I’m fascinated to see how this all plays out.
Lex: I think authenticity is really important. For example, if you're creating from home, just be really authentic. It’s important to realize the moment we’re in and communicate about that. We can’t be tone-deaf and just blast things into the ether like we used to.
We’ve seen a lot of companies transition their businesses to provide much-needed services during COVID-19 like Brooks Brothers, AB InBev, and more. What advice would you give to a smaller company that’s looking to do something similar?
Bryan: I’d encourage companies to do good for the right reasons—don’t make it a marketing stunt. As a company, just make it clear what your intent is and do that. If you try to get cute or tricky, customers are just too smart and they’re going to see right through it.
Lex: Also, I think you can collaborate with other companies. For example, if you feel like you can't do enough on your own, try teaming up with companies adjacent to you.
The other suggestion I have is to reach out to your local legislators because they also have a list of needs and they might have ideas for you. You can reach out to your state or city legislators to ask what they need to solve.
How can e-commerce companies grow LTV? What are some specific tactics you’d suggest to retain customers and grow your relationship with them?
Bryan: I think it’s a good time to focus on profitability, which may not be the most popular answer right now. You think about the LTV of a customer, in part, by how much you have to spend to acquire that customer. Now is a good time to review that—maybe you’re doing it wrong. So, how do we think about being more profitable in that first transaction? How do we think about creating products so we don't need to convince people (by way of traditional marketing spend) that it’s the right thing for them?
I don’t know what the new normal is going to be. But I hope it’s a move towards brands that resonate a bit more with better product/market fit, more word-of-mouth, and less of that traditional marketing playbook.
Lex: My career has been focused on how the design of your product really drives that impact. The way you build your customer’s experience drives impact. When it comes to LTV, how are you reducing friction for those customers? How are you responding in a way that makes sense?
How are you measuring new customer behaviors, and how are you making sure that you're delivering what customers want and not just what you think they want?
Bryan: Yeah, we’re still figuring that out. The honest answer is we don’t know. We’re doing our best to continue to collect data and have a real, hard look at it and decide where we want to go with it.
Harini: At Shipt, we test all of the new features that we're launching. We run A/B or multivariate tests and monitor our KPIs to understand the impact. We make sure that we’re doing all of the correct event tracking, which is where Segment comes in. We make sure we run the tests properly, then we measure the results and build out the dashboards and communicate the test results.
Given the scale of Shipt, we are deeply involved in every product release, and data has been a critical part in trying to understand what features we need to build, in testing it out, and in seeing if it will help ease any of our metrics because there is such an increase in demand.
Lex: I would also say that from the UX side, especially for anyone who is working on a two-sided marketplace, and anyone who is dealing with manufacturing or any kind of supply chain, make sure you’re thinking of the human impact of the decisions you’re making. These KPIs are human beings on the other side. How do you make sure that you’re accounting for that as you’re measuring the success of your company?
Harini: Absolutely, I second that. We are a four-sided marketplace. We have members, shoppers, retailers, and CPG partners. I agree that so much can be done using the data, which is really the voice of all of your customers, B2B or B2C. That voice is definitely the most important thing during these times.
What challenges do you have collaborating as an organization around your customer data while also working remotely? How do you gain consensus on growth decisions and make sure you're making the right decisions?
Lex: I love this question. I think this is a great time to take stock of your analytics tools and processes and figure out what’s broken within them. For example, are there people on your team that can’t access data or can’t understand data? Is there something you can do to address that, like changing tools or offering training or some additional kind of support?
I would also look at the cadences in which you are reviewing data. Are people doing it on their own? Are they just kind of throwing it over the wall to each other? Is it something that you bring into your process? Are you looking at it together, perhaps in a standing meeting? I would strongly suggest you incorporate a regular cadence where you’re looking at data.
The other thing I’d mention is to make sure that you’re pairing qualitative research with your quantitative data—especially if you’re seeing changes in the behavior of your product. So, more usage, less usage, erratic behavior, anything different—reach out to those customers, interview them, try to understand the “why” behind the data.
Bryan: I absolutely agree that you need to balance the qual/quant data, and now is a great time to reach out and interview your customers. If you haven’t developed that muscle as an organization, I’d encourage you to do that now.
COVID-19 has created a lot of emotion among the customers that you might need to serve. What digital signals can you use to understand, quantify, and react to that?
Harini: Sentiment analysis is the best way to measure emotion. If companies are truly trying to understand customer sentiment, they could look at social data like Twitter. You could see what your company is tagged in and do sentiment analysis on that and measure real-time NPS.
Lex: Yeah, I’d look at all of your channels including social media channels and customer support channels. What is the feedback that you’re getting on your website? In your app? Customer calls? This way you have sort of a broad qualitative pulse on your customer base, and you can reach them in a more meaningful way.
If you want exclusive access to AMAs and other live events like this in the future, sign up to the Growth Masters community today.
Geoffrey Keating, Andy Schumeister on April 29th 2020
Our mission at Segment is to help every company in the world use good data to power their growth. And over the past eight years, we’ve been lucky enough to help thousands of businesses do exactly that.
During that time, most of the people implementing these customer data strategies have come from specific, and largely technical teams, like engineering and analytics. This is not only because of the technical effort involved in data collection, but also user comprehension around what they should collect and why.
But imagine a world where a product manager didn’t have to ping an engineer to instrument a tracking plan for the new feature they just launched. A marketer didn’t have to submit a Jira ticket to track whether an A/B test was successful or not.
Today we’re introducing Visual Tagger — to help everyone in your company collect the website data they need in minutes, without relying on technical teams.
With Visual Tagger, you can capture all the events that you want to capture without having to touch your code. - Neeraj Sinha, Chief Technology Officer, ScoutMyTrip
In less than 10 years, the amount of customer data being tracked by businesses has grown exponentially. According to Gartner, 82% of organizations have seen an increase in the types of customer data they are tracking over the past five years.
Just as the overall volume of customer data has increased, so too has its influence across the business. Nowadays, product and marketing teams are as eager to collect and use customer data as engineers and analysts.
How customer data is changing in the organization. Image source: Gartner
The desire to collect customer data may be universal, but so too is the struggle to execute.
Whether you’re an early-stage startup or Fortune 500 company, engineering teams are facing increasingly limited bandwidth. When implementing a customer data tracking plan has to compete for attention with dozens of other priorities, it can be deprioritized, or worse, ignored.
Driven by these constraints, an entire industry has started to flourish. As part of the “no-code” or “low-code” movement, tools like Webflow, Airtable, and Zapier (and Segment customers like Buildfire and Hotjar) are democratizing tasks that previously required technical expertise.
We now live in a world where you no longer need to become a developer to build a website or build an app. At Segment, we believe you shouldn’t have to be an engineer to start collecting the data you need to understand your customers either.
Visual Tagger is available free to all Segment customers, so long as you have a Segment workspace set up and the analytics.js snippet installed on your website. While the functionality of Visual Tagger is powerful, the user experience itself is incredibly simple and intuitive. In fact, it’s as easy as 1, 2, 3.
If you’ve ever used “WYSIWYG” software, Segment’s Visual Tagger will feel familiar. Simply point and click the different parts of your website that you’d like to track. Out of the box, Visual Tagger lets you create three types of events:
Button or Link Clicked: Create an event every time a user clicks a button or link on your website.
Form Submitted: Create an event every time a user fills out a form on your website.
Any Clicked: Similar to button or link clicked, create an event for any HTML element you would like to track (e.g. a banner).
The Visual Tagger interface
From there, you can then add rich, contextual information in the form of properties to help you understand more about the specific action that the user took.
For example, if you wanted to track a button on your homepage, you could add the property Location: Homepage. If you wanted to track a demo request on your pricing page, you could add the properties Location: Pricing Page and Feature: Demo Request.
If you aren’t sure which events or properties to track, we’ll provide recommended events to track based on your industry right in your workspace. For additional recommendations, you can also check out our industry-standard tracking plans for ecommerce, B2B SaaS, media, and more.
Your data-driven decisions are only as good as the quality and accuracy of the data you collect. With Visual Tagger, you can preview the results before the event gets created and starts flowing in production.
In test mode, you can:
Target URLs: Specify which URLs or group of URLs that an event is fired from.
Preview your events: Test the actions that would trigger the event and verify they are firing (e.g., when your event fires, a green checkmark displays).
Check all the details: Get granular feedback on where, how, and what event gets fired so that you can fix any issues before you deploy.
This means that you can have complete confidence before you deploy.
That’s it! Once you hit publish, the events you created will start flowing into Segment and will show up in your Debugger. You can use the Visual Tagger console to see what events are live, along with additional context for you and your team on who created the event, when it was created, and any recent updates.
Check out the docs to learn more.
Whether it’s helped them to experiment faster or save valuable engineering resources, Visual Tagger has become integral to the workflows of technical, and even non-technical, users alike.
Here are use cases from a few of our early customers and the positive impact Visual Tagger has had on their businesses.
ScoutMyTrip, an AI-driven road trip planner and travel expert marketplace, relies heavily on quick, iterative experiments to help drive customer acquisition and retention. However, with a small engineering team, they struggled to make a strong business case for launching and analyzing these experiments. Investing precious engineering resources in capturing data for an experiment (that might not even work) felt wasteful, given other demands from across the business.
With Visual Tagger, ScoutMyTrip can track all the experiments they need without having to touch a line of code. Whereas tracking events for an experiment usually took the team six days on average, they were able to track the same number of events in two hours.
“It took me about two hours to capture 30 events. If I didn't have Visual Tagger, I would have decided not to capture half of the events altogether because that would be too much of a hassle.” - Neeraj Sinha, Chief Technology Officer, ScoutMyTrip
Nationbuilder, a nonprofit and political campaign software platform that combines website, communication, and donation tools, wanted to understand and optimize their onboarding experience.
Unfortunately, getting the data they needed was a highly manual process. Tracking events had to be manually instrumented with engineering, and, quite often, was ignored in favor of more pressing matters. For Alex Stevens, Director of Growth at Nationbuilder, they needed to start tracking with minimal investment of resources, so they could understand product adoption and onboard customers as easily and quickly as possible.
With Visual Tagger, the team now feels comfortable implementing event tracking themselves.
“It used to take weeks for an analytics request to get prioritized. With Visual Tagger, I can quickly collect the data I need without having to create tons of requests for our engineering team.” - Alex Stevens, Growth Director, Nationbuilder
With Visual Tagger, customers can configure events in minutes and then activate the data in 300+ tools in the Segment catalog.
For example, Voopty, a business management software that connects educators, tutors, and clubs, with families searching for their services, needed better visibility into engagement metrics.
Voopty’s Founder and Software Developer, Taja Kholodova, needed access to product analytics to provide reporting to the hundreds of partner businesses they work with. Unfortunately, collecting this data was manual and required time to learn every additional tool and platform.
Kholodova turned to Visual Tagger and was able to quickly set up event tracking to understand user engagement in a matter of minutes. From there, Kholodova instantly activated the data across their entire tech stack, including tools like Indicative, FullStory, and Google Tag Manager.
“We were able to start sending events to destinations in 5 minutes.” - Taja Kholodova, Founder and Software Developer, Voopty
Visual Tagger is a win for technical and non-technical users alike. Engineers are freed from menial and repetitive tasks that take away time they could spend building product. Product managers and marketers can unlock the many use cases within the Segment product without writing a single line of code.
But we think the true benefits will be felt by your end customer. By making analytics instrumentation unbelievably easy, data is accessible to everyone in your company. This empowers every individual to use that data to inform their day-to-day work.
And ultimately, that means better products and better customer experiences for all of us.
Ready to get started with Visual Tagger? Log in to your workspace to set up new events in minutes.
Olivia Buono, Sasha Blumenfeld on March 31st 2020
Over the last few years and increasingly today, changes to our global economy and rising customer expectations have forced businesses to adapt quickly and often. The shift we’re seeing in the market today is forcing change at a pace and scope that we haven’t seen before.
If you’re already a digital business, you need to optimize for financial performance today. If you’re still on your transformation journey, you need to speed up your investments into the digital business models, products, and user experiences that will carry your business into tomorrow. Balancing these priorities can be a difficult challenge, especially when the stakes are high and the timeline is tight.
We’re seeing our customers rise to the challenge by building a deep understanding of their end-users and quickly mobilizing that knowledge across every team and technology. Here’s how your business can use the Segment customer data platform to accelerate marketing impact, make better product decisions, and build a resilient tech stack.
Effective marketing doesn’t necessarily require more people or budget, but it does require a better understanding of your customers. Segment helps you unify each interaction into trusted user profiles. We then help you turn those profiles into real-time audiences, like “High-Value Buyers,” which you can feed into a variety of tools to ensure all of your marketing channels are working in tandem. With reliable data at your fingertips, your team can launch marketing campaigns faster, rapidly iterate, and personalize with confidence.
Personalization can have a significant business impact. According to Forrester’s Total Economic Impact Report, Segment helps companies test marketing campaigns five times faster and deliver 33% better conversion rates. To navigate the current climate, marketers should be looking to:
Reduce wasted ad spend by immediately removing recent purchasers from campaigns to focus spend on those who haven’t converted.
Make the most of your high value users by using them to create look-alike audiences that power your new user acquisition.
Optimize budget allocation by connecting all online and offline data to run more complete marketing attribution and make informed budgeting decisions.
Here are a few examples of how Segment customers are accelerating marketing impact:
Bonobos improved return on ad spend by 2x in two months by using Segment to gather first-party user data across channels and send that data to ad tools.
Consulting.com improved ROAS by 10% using Segment's Identity Graph and event-based audiences to build seed audiences for Facebook lookalike campaigns.
Digital Ocean drove a 33% improvement in cost per conversion by rapidly optimizing digital ad campaigns using Segment to create granular audience targeting.
Product teams make hundreds of decisions each day. In times of uncertainty, you need their decisions to be as fast and well-informed as possible. Whether it’s the decision to build a new product or streamline onboarding, they should have an accurate picture of how people use their products and the impact their products make on the business.
The most agile product teams have access to high-quality data and make data-driven decisions. A complete customer view doesn’t just allow marketers to personalize different channels, it also allows product teams to quickly run A/B tests and drive product adoption. When all of their tools are connected to Segment, product teams are also able to quickly analyze data and generate insights without redundant engineering efforts. Right now, product teams should be looking to:
Understand which features drive revenue to create opinionated onboarding and growth experiments that lead users to take action.
Enforce a data dictionary so that the entire product team has consistent analytics across every product and customer touchpoint.
Build propensity models with trustworthy data that doesn’t require days of cleaning by your data science team.
Here are a few examples of how Segment customers are making better product decisions:
IBM increased product adoption by 30% and billable usage by 17% in three months using Segment to standardize customer data across divisions and provide visibility to drive efficient nurture programs.
Imperfect Foods increased user retention by 21% with a fast experimentation cycle—22 experiments in 6 months.
Norrøna saw an immediate 50% lift in conversion by automating product recommendations based on seamless data collection in Segment.
The technology stack you have today won’t be the same one you use to tackle the digital problems and opportunities of tomorrow. And as new tools spring up every day, it can become unclear whether to switch now or how to make the most of what you already have.
When considering a new tool, you’ll naturally want to test it with your own data. Historically, that meant you needed to convince your engineering team to build a data pipeline and create yet another data silo. The alternative—not testing—could lead to a lot of wasted engineering hours spent on a tool that doesn’t work as advertised. This resource-heavy process of bringing on a new tool has caused many businesses to simply accept the status quo.
Segment helps you de-risk this problem by making your customer data portable. Segment makes customer data fully accessible to the tools you already use to help you maximize each investment. When you decide to test new tools, Segment helps you test using your complete data set, rather than a sliver, and limits the engineering work required to hours, not months. Many engineering teams are turning to Segment because they can save millions of dollars by being able to integrate and experiment more quickly. Engineers can also use Segment to:
Streamline customer data collection that’s trusted and actionable across every team in your organization.
Connect your data to hundreds of tools with out-of-the-box integrations and the ability to build your own logic for custom workflows.
Give your team the ability to adapt your tool stack in a way that enables you to keep pace with customer expectations.
Here are a few examples of how Segment customers are building a resilient tech stack:
Clearscore launched in a new market at 1/3 the engineering cost of building an in-house solution, complying with regulation using their existing tech stack.
Heycar quadrupled digital conversions and tripled user retention by using Segment to implement a new analytics tool that gave their product team instant data insights.
Halp improved activated digital sign-ups by 4x in less than a week by connecting best-in-class tools they were already using to launch a new digital onboarding experience.
The move to digital business is accelerating. The seismic shifts over the last few weeks are just the beginning. If you’re already a digital business, you need to prepare as usage continues to evolve. If you’re making measured moves toward digital, your transformation timeline needs to speed up to meet our new realities. It’s critical that you quickly tackle the new challenges ahead, so you can build new digital experiences for customers while the opportunity is still there.
As you evaluate your strategy across marketing, product, analytics, and engineering, consider whether you’re prepared to empower each team with the quality data they need to drive business impact and create tailored digital experiences. Also consider if you can do it fast enough and at the scale your business needs.
If you want to discuss how customer data can accelerate your strategy and power your digital business—with or without Segment—reserve some time with a Segment Growth Expert today.
Calvin French-Owen on March 18th 2020
Each year, I like to reflect on what’s now different about Segment. Thinking back to a year ago, there’s an incredible amount that we’ve managed to accomplish.
In many ways, I see the Segment of 2020 as a new company, with fresh challenges and lots of new opportunities.
But as the famous saying goes—the days are long but the years are short. Without further ado, here are the major highlights from Segment over the past 12 months.
Segment helps businesses connect and use reliable customer data to fuel product experimentation, marketing, analytics, data science, and so much more. And in 2019 we doubled down to ensure that companies could truly add data into Segment and use the data wherever they needed.
Simple code editor to build a function
While we’ve made it possible to build any connection with Functions, we’ve also added more than one new integration per week in 2019 to our integration catalog.
That means our customers can more easily set up 350+ tools (37% growth from 2018) to connect new sources and destinations. About 1 in every 3 customers has already tried one of the newest tools in our catalog.
Since the early days of Segment, we’ve stood for privacy and responsible data handling. We’ve tried to avoid dealing with sketchy data brokers and third-party cookies, and instead make it easier for companies to comply with privacy-first legislation like the GDPR and CCPA.
At Synapse, our annual user conference, we launched our new Privacy Portal, which allows organizations to automatically detect and classify information flowing through Segment, and control where it is sent. It’s currently used by hundreds of customers to help enforce data standards at their companies.
Aliya Dossa and Tido Carriero announce our new Privacy Portal at Synapse
One of our core values is karma, and we want to make sure we treat our customers’ data in a way they would expect. Privacy isn’t just a big deal for our customers. It’s a big deal to end-consumers as well.
We went beyond being ready for CCPA when it became effective on January 1, 2020. We also stood up a Privacy Program and a Public Policy Team in 2019. We made multiple trips to Washington to help campaign for privacy rights. We’ve also been very active in California around the CCPA.
Destination Filters are being used by more than one out of every three business plan customers. They are key in helping customers control their API costs and instrument basic privacy controls. Protocols Transformations allow users to alter their events without changing their code. We’re seeing companies use transformations to help standardize event names between new and old systems, connect previously siloed data, and more.
Simple UI to build a Destination Filter in a few steps
In 2019, we grew from roughly 350 employees to 500+. In March 2019, we launched our Denver office as a new hub for our Customer Success org with a plucky landing crew of 3. Since then, we’ve grown the office to 17 strong, and made 24/7 customer support possible.
Segmenters around the world
We saw similar growth across our other offices too. In April, we expanded our NYC office to a high-rise right next to Times Square. In November, we expanded the SF office by an entire floor, and our EMEA team in Dublin made its 50th hire! There’s never been a better time to join!
In February 2019, we launched the Segment Startup Program, to give early stage startups access to $50,000 in free Segment, and more than a million dollars in free software from companies like Intercom, AWS, and HubSpot.
We’ve onboarded 3,000+ startups into the program, and continue to add hundreds each month. More importantly, we’ve helped the next generation of startups make data-driven decisions powered with good customer data.
This year, we expanded our customer conference, Synapse, in a big way. We hosted a record-breaking 1200 attendees. We had two full days of talks from experts and customers, as well as a partner summit.
Main stage audience at Synapse 2019
Synapse was an opportunity to learn from the best, and we continue to be impressed by the level of depth and expertise that customers shared.
For example, it’s humbling to hear that Segment played a meaningful role in helping Allergan increase revenue by $250 million. We hope to share many more stories like this over the next year.
Full-page Wall Street Journal advertisement
We have come together with over 100 partners to cut through the noise and make sure that businesses know that there is an entire alternative ecosystem outside of their CRM suites.
There are alternatives that meet the demands of today’s customers, are flexible, and work together seamlessly, so that businesses can build customer technology stacks that are unique to their needs, versus a one-size-fits-all approach.
We rebuilt our documentation site
Our documentation is one of the most-visited parts of our website, and while it had grown organically over time, it was starting to show the strain. In late November we released an entirely new docs site, with a fresh, readable design, greatly improved information architecture, and better navigation. We also released new content in the form of an all-new intro to Segment, and some introductory guides tailored to different user roles.
In August 2019, we launched one of the more iconic startup campaigns on the market: “What good is bad data?” To highlight the importance of clean, accurate data, we staged a series of mixups that found their way into billboards across Austin, New York, Los Angeles, and more. You can read more about it here.
Our billboard in San Francisco
The campaign created quite the storm, as we saw tweets and Reddit posts shared from celebrities, friends, and customers.
Segment has the privilege of working with many different types of clients, from Fortune 500 enterprises to industry-specific small businesses. In every case, we want to build trust and show every customer our commitment to security.
In late 2019, we completed our SOC 2 Type 1 attestation. I’m excited to share that this is one of many steps we are taking to continue to build trust with our customers.
Thanks to all who made the above possible. If you’re interested in helping us build the next wave of progress, we’re hiring!