Benjamin Yolken, Julien Fabre on October 29th 2020
Calvin French-Owen on June 5th 2020
This blog should not be construed as legal advice. Please discuss with your counsel what you need to do to comply with the GDPR, CCPA, and other similar laws.
Deletion – all identifying info related to the user must be properly deleted.
Suppression – the user should be able to specify where their data is used and sent (e.g. for a marketing, advertising, or product use case).
When you get a deletion request, it doesn’t just mean deleting a few rows of data in your database. It’s your responsibility to purge data about your users from all of your tools – email, advertising, and push notifications.
Typically, this process is incredibly time-consuming. We have seen companies create custom JIRA workflows, in-depth checklists, and other manual work to comply with the law.
In this article we’ll show you how to automate and easily respect user privacy by:
Managing consent with our open source consent manager.
Issuing DSAR (Data Subject Access Requests) on behalf of your users.
Federating those requests to downstream tools.
Let's dive in.
If you haven’t already, you’ll want to be sure you have a source data setup on your website, and collecting your user data through Segment.
// when a user first logs in, identify them with name and email
Generally, we recommend you first:
Generate user ID in your database – a user ID should never change! It’s best to generate these in your database, so they can stay constant even if a user changes their email address. We’ll handle anonymous IDs automatically.
Collect the traits you have – you don’t have to worry about collecting all traits with every call. We’ll automatically merge them for you, so just collect what you have.
Start with messaging – if you’re trying to come up with a list of traits to collect, start with email personalization. Most customers start by collecting email, first and last name, age, phone, role, and company info so they can send personalized emails or push notifications.
Once you’ve collected data, you’re ready to start your compliance efforts.
Giving users the ability to control what personal data is collected is a huge part of any privacy compliance regime.
We’ve built an open source drop-in consent manager that automatically works with Analytics.js.
Adding it in is straightforward.
First, you’ll want to remove the two lines from your analytics.js snippet.
analytics.load("<Your Write Key") // <-- delete meanalytics.page() // <-- delete me
These will automatically be called by the consent manager.
We’ve included some boilerplate configuration, which dictates when the consent manager is shown and what the text looks like. You’ll want to add this somewhere and customize it to your liking.
You’ll also want to add a target container for the manager to load.
You can and should also customize this to your liking.
Finally, we’re ready to load the consent manager.
<script src="https://unpkg.com/@firstname.lastname@example.org/standalone/consent-manager.js" defer></script>
Once you’re done, it should look like this.
Great, now we can let users manage their preferences! They can opt-in to all data collection, or just the portion they want to.
Now it’s time to allow users to delete their data. The simplest way to do this is to start an Airtable sheet to keep track of user requests, and then create a form from it.
At a minimum, you’ll want to have columns for:
The user identifier – either an email or user ID.
A confirmation if your page is public (making sure the user was authenticated).
A checkbox indicating that the deletion was submitted.
From there, we can automatically turn it into an Airtable form to collect this data.
To automate this you can use our GDPR Deletion APIs. You can automatically script these so that you don’t need to worry about public form submissions. We’ve done this internally at Segment.
Tip: Make sure deletions are guarded by some sort of confirmation step, or only accessible when the user is logged in.
Now we’re ready to put it all together. We can issue deletion requests within Segment for individual users.
This will remove user records from:
Your warehouses and data lakes
Downstream destinations that support deletion
To do so, simply go to the deletion manager under Workspace Settings > End User Privacy.
This will allow you to make a new request by ID.
Simply select “New Request”, and enter the user ID from your database.
This will automatically kick off deletions in any end tools which support them. You’ll see receipts in Segment indicating that these deletions went through.
As your different destinations begin processing this data, they will send you notifications as well.
And just like that, we’ve built deletion and suppression into our pipeline, all with minimal work!
Here’s what we’ve accomplished in this article. We’ve:
Collected our user data thoughtfully and responsibly by asking for consent with the Segment open source consent manager.
Accepted deletion requests via Airtable or the Segment deletion API.
Automated that deletion in downstream tools with the deletion requests.
Stephen Grable on May 4th 2020
Leif Dreizler on March 31st 2020
Segment receives billions of events every day from thousands of customers that trust Segment to keep their data safe. At Segment, we believe that good security is an essential part of building high-quality software, similar to reliability or scalability. In addition to the tools and processes developed by our Security Org to help software engineers make good security choices, we also rely on reviews from both traditional and crowdsourced security companies.
Bug bounties are often seen as a significant burden for the security teams that run them. You’ll hear horror stories about companies launching a bug bounty, their security team getting inundated with low-quality reports, duplicate submissions, and researchers going outside the scope of the program.
Shortly after, you'll hear about their engineering team being overwhelmed with newly discovered vulnerabilities to fix. From those that survive the opening salvo, you may hear complaints that, over time, they have stopped receiving impactful submissions.
A few years ago, when the space was less mature, critics questioned whether running a program was worth it. Now, it is expected that organizations of a certain size and maturity run a bug bounty program.
In late 2019, the Cybersecurity and Infrastructure Security Agency (CISA) of the United States published a draft directive recommending all agencies develop a mechanism to accept information related to security vulnerabilities from external parties.
In this blog post, we’ll break down how to start and manage a bug bounty program, consistently achieve good results, and maintain healthy relationships with the people that power the program.
If you’re short on time, check out the “Top Tips” section at the bottom of this post.
A bug bounty program is like a Wanted Poster for security vulnerabilities.
Companies running bug bounty programs pay independent security researchers from across the world for security vulnerabilities in their own products and infrastructure.
I assume most readers are already bought into the benefits of running a bug bounty.
Most companies that have an internet presence or make an internet-connected device should consider running a bounty, or at least have a way for security researchers to report security issues.
It is also part of the Vendor Security Alliance questionnaire, so it may be something your customers ask you about if you are in the B2B space.
If you don’t have a way for researchers to report issues, they will email people at your company (or any alias they can find on your website), connect with you on LinkedIn, or just tweet about the issues they think they’ve found.
It’s a much better experience for researchers, your employees, and your customers if you give the security community a clear avenue to report vulnerabilities.
Your security and engineering orgs will be regularly impressed by the creativity of the researcher community. These are people that, without internal knowledge, can find critical vulnerabilities in organizations across the world.
I strongly recommend using a bug bounty platform like HackerOne or Bugcrowd (we use Bugcrowd here at Segment) to help manage this process. These companies provide a platform and services to help run an efficient program.
Severity baselines make it easier to tell how serious a vulnerability is, and how much time you should be spending on review and remediation.
When running a program on your own, you’ll frequently have researchers overhyping vulnerabilities. Platforms have a guided submission form, which helps researchers pick the appropriate category and rating.
The reputation systems reward researchers that accurately rank vulnerabilities and creates a competitive environment that benefits both researchers and program owners.
It also helps reinforce good behavior. Any researcher discipline issues have stricter consequences. If a researcher misbehaves, they may be banned from the platform.
To submit vulnerabilities via these platforms, researchers have to agree not to disclose the vulnerability without approval from your company.
Both platforms also provide triage services, which I highly recommend paying for. These are the first line of defense for your internal security resources. These globally distributed teams will help clean up researcher reports, mark submissions as duplicates, and filter out low-quality reports.
These companies also serve as a knowledge base for you to learn about running a program and ask questions. You can bounce ideas off of someone that works at a company running hundreds of programs.
Platforms have structured input with required fields and integrations with popular tools, like Jira. These make it much easier to turn a submission into a ticket for your engineering org.
For most companies, it isn’t possible to run a private program without the help of a bug bounty platform.
We’ll talk about private programs in more depth later, but this is the recommended starting point for companies launching a bug bounty for the first time.
All of the above features free your team to focus on the security challenges unique to your business.
Having a successful program starts with a good foundation, and it’s your job as a program owner to help set your organization up for success.
Think about your current process for handling a security vulnerability. What happens when someone internally or externally finds a security bug?
You will inevitably get more of these after starting a bug bounty program, so you must have a good way to handle these reports.
Your vulnerability process doesn’t have to be perfect, but you should have a way to prioritize and assign bug reports to the appropriate engineering team without high overhead.
As someone starting a program, you’ll also need to get buy-in from your Engineering org. They are the ones that will have to fix the issues and will likely be the ones responding to alerts triggered by researchers.
Your alerting story doesn’t need to be perfect. But you also don’t want engineers to be woken up every time someone triggers an error because some input validation was working correctly and stopped a researcher from submitting a
< symbol into an email address field.
Remember, your team doesn't have to fix every valid vulnerability immediately.
Vulnerabilities are just bugs and should be prioritized appropriately. If you’re having trouble getting your engineering org to fix critical vulnerabilities in a timely manner, you may want to direct your efforts to job-hunting instead of starting a bug bounty program 🙃
Once your organization is bought-in, you can focus on getting things ready for the researchers.
Your bounty brief is what researchers will read to determine if they’re interested in working on your program. It’s part instructions, part rules, and part advertisement.
Keep in mind you’re competing for researchers’ time; they don’t have to work on your program when there are so many other programs.
Your bounty brief should be clear, concise, and should set expectations with the researchers. You can find the Segment program hosted on Bugcrowd.
Where do you want researchers to focus their testing? What’s in scope? What’s out of scope?
I recommend starting with assets that are easy for the researchers to access, ideally something free, that anyone can sign up for.
You should also try to pick a target that has at least medium size, complexity, and business impact. This will help you show value early, which will help you expand the program.
How do researchers get access to the scope? Are there docs they can read to help them get up to speed? We instruct researchers to sign up for our app using their @bugcrowdninja.com email address and to include
-bugcrowdninja as part of their workspace slug.
This makes it easier for us to determine if someone is part of our Bugcrowd program when we review logs and alerts. If we notice someone causing problems in our app, we can ask Bugcrowd to provide researcher coaching.
How are you going to rate submissions? Consistent severity is important because it impacts how much the researcher gets paid for a submission. HackerOne uses Mitre’s Common Weakness Enumeration (CWE) and Bugcrowd uses the Bugcrowd Vulnerability Rating Taxonomy (VRT).
How much are you going to pay for vulnerabilities? Researchers need to know upfront how much you’ll pay for vulnerabilities so they can assess if it is worth their time to hunt on your program.
Think about using different reward ranges for different assets. This can help control costs and also helps researchers understand which targets are more important. For example, we describe specific objects that will net a higher reward:
A handful of years ago, getting a T-shirt as a reward was pretty standard. I’d strongly encourage anyone thinking about running a swag-based bounty to reconsider.
T-shirts don’t pay rent and are more work for your team than sending a researcher money. What do you do when that T-shirt you sent to another continent is lost in the mail or doesn’t fit?
We reserve swag for our top performers. Sending a T-shirt requires the researcher to trust us enough to give us their address and requires me to go to the post office.
Take the time to explain what a bug bounty is, why it's important, and have a few examples from recognizable organizations ready to show them.
Learn a little bit about the platform you’re using. Your actions on the platform impact the researcher. If you mistreat researchers, they will go elsewhere; without researchers, your program isn’t providing value to your organization.
The same report status can have different meanings and impact on different platforms.
For example, on HackerOne
Not Applicable reduces the researcher’s site-wide score by 5 points, and should be used for reports that don’t contain a valid issue.
Not Applicable does not impact the researcher’s score, and is commonly used for reports that should neither be accepted or rejected. To achieve this result on HackerOne, you would use the
If you have any questions about the platform you’re using, I strongly recommend reviewing documentation or reaching out to your account manager for help.
Regardless of how big your company’s internet footprint is, you can start with a small scope open only to a handful of individuals as part of a private program.
In mid-2017, Segment was running a private program with 25 researchers and a single target: our app.
The early researchers invited will be some of the platform’s most trusted, and they will generally be more accepting of companies that are learning how to manage a program, as long as you pay them fairly and treat them with respect.
Starting small allows your organization to learn how to run a program in a safer environment. If your vulnerability management program has some gaps, you can fix them; if your bounty brief is unclear, you can rewrite it; if your alerts aren’t tuned properly, you can invest time into improving them. If you need to pause your program, you can relaunch later with a less negative impact.
Even while we had a private program, we would direct researchers that reached out via email to our Bugcrowd program. This allowed us to receive the benefits of the platform and triage services for all submissions before running a public program.
It’s much easier to explain to a researcher why you won’t be paying for a low-effort submission when you have a prioritization system established and enforced by a third party.
Like any multi-year project, your bug bounty will evolve and require ongoing effort to keep it healthy.
Researchers love testing new features and assets; in most bug bounty programs, only the first person to find a vulnerability receives a monetary reward.
If you started with a small scope, your program is steady, and you’re ready for more submissions, this is a great time to add more targets to your brief.
As you add scope, keep in mind that not all assets are of equal value to an adversary. It is encouraged to specify different reward ranges for different assets based on their security maturity and value.
Over time, you should also consider an “open scope” bounty if it is appropriate for your organization. We have listed as a target,
Any host/web property verified to be owned by Segment (domains/IP space/etc.), which serves as a catch-all for anything not explicitly listed in our “In Scope” or “Out of Scope” sections.
Having an open scope bounty is enticing to researchers. Not only does it show you take running a bug bounty program seriously. It also shows that regardless of where they find a vulnerability, it will likely be rewarded (assuming it is valid, unique, and not out of scope).
Many researchers specialize in finding forgotten internet-facing assets as part of open-scope programs, and have developed their own infrastructure to identify assets and vulnerabilities to be able to efficiently earn rewards.
It’s also worth noting that there is no scope for an attacker trying to compromise your company’s security. Working towards an open scope means that it is more likely a bug bounty researcher will find and report a vulnerability before an attacker exploits it.
Over time, you’ll build trust and form relationships with particular researchers. These are great people to give early access to upcoming features. Many times, these features require manual provisioning, making them less suitable for wide-scale testing.
Early access is a mutually beneficial system in which you will receive security vulnerabilities prior to release, which makes them easier to fix. Researchers will be able to test features with less competition, which makes them more likely to earn a reward and continue testing on your program.
If the effort to set up these features is medium or higher, consider paying the researcher a grant before they start working.
Clearly communicate what you're looking for and what you expect from them. When offering a researcher grant, we want to see a short write-up of what didn't work in addition to any findings they submit. Rewardable findings should be paid in addition to the grant.
Once you’ve been running a program for a while and are confident in your company’s ability to receive vulnerabilities from the global researcher community, you should consider evolving it into a public program.
If you don’t have a wide scope, this is a great time to revisit that decision.
Maximizing your scope (while private) will reduce the uptick in submissions when your program is launched publicly. You should also invite as many researchers as possible to your private program before going public for the same reason.
Because public programs are open to anyone, you will inevitably receive testing from a lot of newer folks that will pay less attention to your bounty brief, so having a wide scope helps in this regard as well.
Segment has run a public bug bounty program since late 2018, roughly 18 months after launching our private program.
Hopefully over time, you will think of your outsourced triage team as an extension of your internal team. Spending the time to let them know how you want your program run will pay dividends in the future. Any submission they can validate without asking your team questions saves time for everyone involved.
Here are some examples of guidance we’ve given to the Bugcrowd triage team:
Identify duplicates for non-rewardable submissions
Many programs do not bother to mark informational, out of scope, or other non-rewardable submissions as duplicates. We do this for two reasons:
The first is that if we decide to fix one of these issues later, we can go back and mark the original submission as resolved and pay the researcher. Any duplicates of this issue will still receive points.
The second is that when there is a false positive identified by a tool commonly used by bug bounty hunters, you will get this submitted to your program a lot.
If we repeatedly see an out-of-scope or not reproducible submission, we can add a specific item in our bounty brief to warn researchers; it will be marked as out-of-scope or not reproducible without a working proof of concept.
Don’t be afraid to deduct points for undesired behavior
While we are generally laid-back and understanding program owners, we aren’t afraid to deduct points from a researcher’s score for when it’s warranted.
Many programs shy away from deducting points, but we want to ensure participants in our program thoroughly read our brief and think that it helps the larger bug bounty community to slightly penalize those that disregard clearly documented rules.
Two of the common arguments against bug bounty programs is that the submissions are often low-value and that researchers don’t respect scope.
For example, we have a very small out-of-scope section, which includes:
CORS or crossdomain.xml issues on api.segment.io without proof of concept.
This is identified by a tool commonly used by bug bounty participants is a finding we have received dozens of times, but never with any impact.
We do this to save time for both researchers and our triage team. If a researcher submits this finding without a proof of concept, we encourage Bugcrowd to mark this as out-of-scope. If a researcher submits a finding that showed impact, we would be more than happy to reward, fix, and update our brief.
If you need to deviate from the baseline rating established in your bounty brief, take the time to explain to the researcher why the rating and reward are higher or lower than they might expect.
Researchers are generally understanding, as long as your rating, reward, and explanation are fair and make sense. If you find yourself commonly deviating from the ratings, it may be time to make changes to your bounty brief so that researchers know what to expect in advance. If you make severity or scope changes as the result of a submission, reward the researcher at whichever rate is more favorable to them.
In addition to explaining why something was rated lower than the baseline, take the time to explain why something was rated higher than the baseline. This is a great way to encourage further testing in these areas and is a great way to build trust with a researcher.
These explanations also help your triage team learn more about your program, and allow them to more accurately triage future submissions.
Take time to build relationships and trust with researchers, especially those that repeatedly submit to your program. Treat researchers fairly, with respect, and consider paying for anything that brings value to your organization.
You’re competing for a researcher’s time, especially the ones that are the most talented. They can likely work on almost any bug bounty program available; think about ways you can encourage them to work on yours.
Keep in mind that all researchers, even those that are unskilled, are human beings. Assume that they want to help you secure your organization, learn more about security and technology, and get paid.
If there is one sentence you remember from this blog, I hope it is “pay for anything that brings value.”
Bug bounty hunters put in a lot of time and effort that doesn’t result in getting paid. This could be time spent developing tooling, hunting without finding any bugs, or having a valid bug marked as a duplicate.
Try to avoid thinking about individual bug costs. Instead, think about the overall value the program brings to your organization in terms of bugs found, time saved, and peace of mind. If you’re debating between two severities, pick the higher one and pay the researcher at that rate. You can always change the severity in your internal ticketing system later.
Once you’ve received a rewardable submission, try to triage and pay quickly. Sometimes determining the full impact takes time; if this is the case, add a comment letting the researcher know you appreciate their work but need some extra time to determine the appropriate reward.
Work collaboratively with the researcher
As an employee of your company, you should know more about the codebase and infrastructure than a security researcher outside your organization (although occasionally I question this based on the creative and impactful submissions we receive 😅).
Sometimes when running a bug bounty program, you’ll get a submission that makes you think, “What’s the next level the researcher could get to?” If this is a researcher you trust, it may be appropriate to give them some hints to help further their testing. If you give them hints, you can also issue some cautionary advice to help them continue in a way that is safe for your organization and customers.
Giving the researcher hints helps show them you value their testing and saves your team from spending time on something that may not be possible. If the hint is helpful, the researcher will be submitting a higher-severity finding, which positively impacts their researcher score and earns a higher monetary reward. It also allows you to get the vulnerability fixed faster due to the higher severity.
Sometimes, it isn’t appropriate to include the researcher in this phase of the process. If our team continues the investigation, and it leads to the discovery of a higher-impact weakness in our systems, we reward the researcher as if their report contained the full impact. We also explain why we paid them at this higher rate, but let them know we are unable to share the details. This is a great way to show the researcher you value their work and build trust.
Share progress with the researcher
If a researcher submits a vulnerability that leads to a systemic fix for a vulnerability class, share this with them! Researchers are generally excited to hear that their work led to meaningful change within your organization. It also is a cue for them to attempt to bypass the new protections.
Pay for Dupes
At Segment, we commonly pay researchers for well-written duplicates, and frequently reach out to let them know that we appreciated their submission. We also let them know that we don’t always pay for duplicates to make sure that expectations are set appropriately.
This has worked out incredibly well for us. All of our most critical submissions have come from researchers that were originally rewarded for a well-written duplicate. Segment is a complex product that takes time to set up and fully understand. Researchers that put in the effort to fully set up a Segment workspace have additional context and understanding that take time to acquire—these people are hard to replace, and you want to keep them happy.
Pay bonuses for well-written reports
We also pay extra for well-written reports. Valid submissions need to get turned into Jira tickets which are assigned to engineering teams. Reports that are concise, easy to follow, have clear impact, and are well-formatted take less time for us to turn into tickets. We want to encourage researchers to save us time so we make sure to reward appropriately and let them know that we appreciate their efforts.
Running a successful bug bounty program requires consistent effort from your team, but can bring tremendous value to your company and customers. Any vulnerability reported and fixed is one fewer vulnerability an attacker could use to get a foothold in your organization. Bug bounty submissions can help illuminate vulnerability trends, which can help prioritize where you spend resources to fix systemic issues in your applications or infrastructure.
Bug bounty programs are people-powered. Spend the time to make those involved in your program feel valued, help them understand the motivations behind your decisions, and be excellent to each other!
Thanks for taking the time to read my post! I hope you learned a few things to help your company run a successful program. Here are some of my top tips to reference later:
Pay for anything that brings value
Pay extra for well-written reports, even if they’re dupes
Avoid thinking about individual bug costs
Partner with a bug bounty platform and pay for triage services
If you make changes to your bounty brief as the result of a submission, reward the researcher at the more favorable rate
Invest time into building and maintaining relationships with your researchers and triage team
Don’t be afraid to deduct points for bad behavior
Start small and partner early with Engineering
Write a clear and concise bounty brief to set expectations with the researchers
A special thanks to the Segment Engineering organization for fixing vulnerabilities and responding to alerts.
To Edis from Bugcrowd for helping us triage three years of vulnerabilities and truly being an extension of our team.
To all the researchers that have helped keep our customers safe by looking for vulnerabilities as part of our program.
And finally, to researchers danieloizo and sheddow, you have both submitted numerous well-written and high impact findings and are an absolute pleasure to work with.
Rick Branson, Collin Van Dyck on June 25th 2019
This is the story of how we built ctlstore, a distributed multi-tenant data store that features effectively infinite read scalability, serves queries in 100µs, and can withstand the failure of any component.
Highly-reliable systems need highly-reliable data sources. Segment’s stream processing pipeline is no different. Pipeline components need not only the data that they process, but additional control data that specifies how the data is to be processed. End users configure some settings in a UI or via our API which in turn this manipulates the behavior of the pipeline.
In the initial design of Segment, the stream processing pipeline was tightly coupled to the control plane. Stream processors would directly query a set of control plane services to pull in data that directs their work. While redundancy generally kept these systems online, it wasn’t the 5-9s system we are aiming for. A common failure mode was a stampede of traffic from cold caches or code that didn’t cache at all. It was easy for developers to do the wrong thing, and we wanted to make it easy to do the right thing.
To better separate our data and control planes, we built ctlstore (pronounced “control store” or “cuttle store” as some like to call it), a multi-tenant distributed control data store that specifically addresses this problem space.
At the center of the read path is a SQLite database called the LDB, which stands for Local Database. The LDB has a full copy of all of the data in ctlstore. This database exists on every container instance in our fleet, the AWS EC2 instances where our containerized services run. It’s made available to running containers using a shared mount. SQLite handles cross-process reads well with WAL mode enabled so that readers are never blocked by writers. The kernel page cache keeps frequently read data in memory. By storing a copy of the data on every instance, reads are low latency, scale with size of the fleet, and are always available.
A daemon called the Reflector, which runs on each container instance, continuously applies a ledger of sequential mutation statements to the LDB. This ledger is stored in a central MySQL database called the ctldb. These ledger entries are SQL DML and DDL statements like
CREATE TABLE. The LDB tracks its position in the ledger using a special table containing the last applied statement’s sequence number, which is updated transactionally as mutation statements are applied. This allows resuming the application of ledger statements in the event of a crash or a restart.
The implications of this decoupling is that the data at each instance is usually slightly out-of-date (by 1-2 seconds). This trade-off of consistency for availability on the read path is perfect for our use cases. Some readers do want to monitor this staleness. The reader API provides a way to fetch an approximate staleness measurement that is accurate to within ~5 seconds. It sources this information from the timestamp attached to ledger statements which indicates when the statement was inserted. A heartbeat service sends a mutation every few seconds to ensure there’s always a relatively fresh ledger statement for measurement purposes.
The ctldb is an AWS Aurora cluster. Reflectors connect to and poll one of the cluster’s read replicas for new ledger statements every second (with some jitter). A publish-subscribe model would be more efficient and lower latency, but polling was simple and ended up being quite suitable for our use case. Scaling up our current measurements, a single Aurora cluster should be able to support tens of thousands of Reflectors at once. Aurora clusters can be connected together using the MySQL replication protocol, which would support scaling beyond a single cluster’s limitations, implementing multi-region support efficiently, and even multi-cloud if that is ever in the cards.
ctlstore exposes a relational model to readers. Control data is stored as rows in tables that have a defined schema, and these tables are grouped into families. A Reader library, which wraps access to the LDB, provides primary key oriented queries. This layer allows potentially switching the underlying implementation of the read path, and focuses queries on those that will be efficient for production use cases. Currently, only primary key queries are supported, but adding secondary key support is being considered.
During design, we knew we wanted to pull in data from many systems of record, so a single monolithic source of truth was off the table. This means the master records for ctlstore data actually live outside of the system itself, in an origin. A loader ingests change stream from the origin, and applies these changes to ctlstore via the HTTP API exposed by the Executive service.
In practice, it’s a bit more complicated. Our production setup uses the open-source change data capture system Debezium. Debezium streams MySQL replication logs from the origin database and emits the changes in JSON format to a Kafka topic. A loader process consumes this topic, batches the changes, and applies them to ctlstore. The HTTP API provides a transactional offset tracking mechanism alongside the write path to ensure exactly-once delivery. In ctlstore, all mutations are either “upserts” or deletes, so replays are idempotent.
To ensure that data passes integrity checks, ledger statements are applied to the ctldb as they’re inserted into the ledger. This is done transactionally so that a failure will rollback the ledger insert atomically. For example, if a string failed to validate as UTF-8, it would be rejected, preventing bad ledger entries from halting ledger processing at the Reflector side. This safety mechanism caught an early bug: field names in the ledger statements weren’t being escaped properly, and a developer used the name “type” for a field, a reserved word. MySQL rejected this table creation statement as invalid before it poisoned the ledger.
ctlstore is a multi-tenant system, so it is necessary to limit resource usage to protect the health of the system overall. A large influx of mutations would not only crowd out other writers, but could also have damaging effects across our fleet. To avoid this, there are limits on the rate of mutations over time for each loader. The other resource to manage is disk usage. LDB space is precious because every instance must store a full copy of the data. The Executive service monitors disk usage for each table, alerts when a soft limit is reached, and enforces a hard limit once a table reaches a certain size. Both rate limits and table size limits can be adjusted on a per-resource basis.
ctlstore exposes a relational model, and as such, it requires setting up and managing a schema. We chose a structured, relational approach as opposed to a semi-structured, document approach to eliminate various edge cases that would lead to incorrect behavior and/or failures in production. Teams share the same tables so a schema helps developers understand the data they are handling.
Schema is managed using the HTTP API exposed by the Executive service. Endpoints are available for creating families and tables as well as adding fields to existing tables. Fields are specified with simple types (string, bytestring, integer, decimal, text, and binary) that map to compatible MySQL and SQLite internal types. Due to constraints of the underlying databases, only a subset of types are supported as primary key fields: string, bytestring, and integer.
Removing tables is currently not supported by the API, to prevent inadvertent disasters. We’re considering a safe way to implement this functionality. Removing fields will likely never be supported due to the implications downstream, such as breaking existing production deployments that depend on that field.
Our primary compute fleet is constantly churning. Instances are coming and going all the time. Typically, an instance lasts less than 72 hours. One of the requirements for ctlstore is that a freshly launched instance can be “caught up” within minutes. Replaying the entire ledger from the beginning wouldn’t cut it.
So instead, new instances bootstrap themselves by pulling a snapshot of the data from S3. The snapshot is just an LDB frozen in time. Using the same mechanism for crash recovery, a freshly booted Reflector can “resume” processing of the ledger from the snapshot. Once enough of the mutation statements have been applied, the container instance is marked as caught up. Services can specify that their tasks are only scheduled on container instances which are caught up.
Snapshots are constructed continuously by a dedicated service called the Supervisor. The Supervisor builds it’s own private LDB by running an internal Reflector instance. It pauses periodically to create a new snapshot. This process involves flushing the WAL to ensure all writes are captured, vacuuming the private LDB file to trim any extra unused space, compressing it to reduce its size, and uploading it to S3.
Deciding which consistency model fits a system is complicated. In terms of the CAP theorem, ctlstore is a CP system because writes go offline if the ctldb fails or is partitioned. A copy of all of the data runs on every node, so reads stay available even in the face of the most severe partitions.
In terms of data consistency and isolation, it’s a bit hard to pin down. MySQL provides
REPEATABLE READ isolation and SQLite provides
SERIALIZABLE isolation, so which is it? Perhaps it’s neither, because ctlstore doesn’t provide similar read-write transactional semantics. Transactions are either a batch of writes or a single read operation.
What we do know is that ctlstore has the following high-level consistency attributes:
It has no real-time constraint, so readers can read stale data.
All ledger statements are applied in-order, so all readers will eventually observe all writes in the same order.
Batches of mutations are atomic and isolated.
All readers observe the latest committed data (there are no multi-read transactions).
Readers never encounter phantom reads.
ctlstore applies batched mutations transactionally, even down to the LDB. The ledger contains special marker statements that indicate transactional boundaries. In theory this provides strong isolation. In practice, Debezium streams changes outside of transactional context, so they’re applied incrementally to ctlstore. While they usually wind up within the boundaries of a batch, upstream transactions can and do straddle batches applied to ctlstore. So while ctlstore provides this isolation, in use we aren’t currently propagating transactional isolation from the origin to the reader.
Here are some of the things we’re eyeing for the future:
We’re currently experimenting with a “sidecar” read path that uses RPC instead of accessing the LDB directly. This could make it simpler to interface with ctlstore on the read path.
Currently ledger statements are kept forever. Ledger pruning might be necessary in the future to keep the ledger compact. This is complicated to implement in the general case, but there are some classes of ledger statements that would be low-hanging fruit, such as heartbeat entries generated by our monitoring system.
No data or schema inspection is exposed via the HTTP API. Reads via the HTTP API would be consistent with the write path, making it possible to implement systems that use ctlstore as their source of truth. Schema inspection helps developers understand the system, and should be coming soon.
While we don’t anticipate this anytime soon, it might be necessary in the future to split up the LDB into groups of table families. Each cluster of container instances would be able to “subscribe” to a subset of families, limiting the amount of disk and memory for cache required.
As mentioned above, secondary indices might become very valuable for some use cases. MySQL and SQLite both support them, but in general we are very conservative on the read path, to protect the key performance characteristics of ctlstore.
While we’d prefer that the shared mount is read-only, it currently is mounted read-write to work because one of our early users experienced intermittent read errors due to an obscure bug in SQLite. Switching the mount to read-write was the workaround. In the future we’d love to find a way to switch this back to a read-only mount for data integrity purposes.
ctlstore is a distributed, relational data store intended for small-but-critical data sets. While we’re still in the process of transitioning many data sets to ctlstore, it is now a hardened, production, business critical system deployed for a number of use cases. We’ve layered systems on-top of ctlstore, such as flagon, our feature flagging system. The architecture allows sourcing data from multiple systems of record, critical for adoption across teams. Developers no longer need to be concerned with read scalability or availability of their control data. It has been incredibly reliable in practice — we have yet to experience downtime on the read path.
We’d like to thank the trailblazing people that were involved in the early testing and deployment of ctlstore: Ray Jenkins, Daniel St. Jules, Archana Ramachandran, and Albert Strasheim.
Leif Dreizler on June 25th 2019
Segment receives billions of events from thousands of customers that authenticate weekly, mostly using usernames and passwords. It’s common knowledge in the security community that users frequently pick weak passwords and reuse them across sites. Six months ago, we set out to change that by helping customers select stronger passwords and allowing them to protect their Segment account with Multi-Factor Authentication (MFA).
At Segment, we have various sub-teams working diligently to improve our security story. These efforts have provided significant security improvements which are seldom seen by customers. A perfect example of this is our previous blog post, Secure access to 100 AWS accounts, which describes some incredibly impactful work that most customers will never know about.
In an online business, the parts of your security program most customers interact with are your product security features—even though this is just the tip of the security iceberg.
Some customers will view your compliance certifications, but the foundational areas of your security program will mostly go unnoticed outside of your customers’ vendor security teams.
To provide the security your customers deserve, you need to have a well-rounded security program. If you’re interested in seeing what that journey looks like from our Chief Information Security Officer, Coleen Coolidge, check out her recent presentation at an OWASP Bay Area meetup on how to build security capabilities at a startup.
Two tenets of our security team are to be part of the overall business’ success, and to partner closely with the rest of engineering. The importance of being part of the overall business’ success is obvious—without Segment there is no Segment Security Team. This translates to making practical security decisions for Segment employees, helping sales and marketing, and always keeping the customer in mind.
Working closely with engineering is important because the cumulative number of design choices, lines of code written, and bugs avoided by engineering is much higher than that of security. Working collaboratively with engineering allows us to learn from each other and help everyone be a better security champion.
We believe that making good software requires making good security choices, and our security team is here to help the larger Segment team achieve that goal.
Luckily for the security industry, security is becoming an increasingly important part of the software-evaluation process. Security should be something customers have positive experiences with throughout their evaluation of your product. Two areas of opportunity that Segment Security and Engineering recently partnered on are our sign up and authentication workflows.
A few years ago NIST released updated guidelines for passwords. Historically, many applications made users select an 8 character password with letters, numbers, and symbols which resulted in an overwhelming number of people picking
Password1!. Applications also required regular rotation of passwords, which resulted in
Password2!. Some of the new guidelines for passwords include disallowing commonly used passwords, disallowing passwords from previous breaches, and encouraging users to employ a variety of strong password strategies as illustrated in this well-known XKCD comic. To accomplish this, we turned to Dropbox’s zxcvbn module and Troy Hunt’s Have I Been Pwned (HIBP) API.
zxcvbn allows us to meet most of NIST’s password guidelines. Commonly used passwords or those that are easy to guess are scored low, and good password created by a variety of strategies are scored high. However,
zxcvbn does not identify when passwords have been part of a known breach, which is why we rely on Have I Been Pwned (don’t worry, we aren’t sending your password outside of Segment and if you’re interested in how this process works it is explained on Troy Hunt’s website—linked above).
As a SaaS company, the sign up process at Segment is extremely important to the user. It sets the stage for a smooth experience. To make sure our design wouldn’t negatively impact signups and would positively impact password strength, we took a few steps to make sure we got things right. If you want your security engineering team to be treated like an engineering team, you need to follow the same principles and processes as software engineering teams.
For this feature, that meant getting feedback from our activation engineering, product, and design teams early in the process.
Through a close partnership with these teams, we were able to take my original mockups and turn them into customer-facing feature.
As you can see in the above images, we block users from selecting any passwords of zxcvbn score 0 or 1, and warn them if they are selecting a password that has previously shown up in a breach. We chose to warn instead of block on breached passwords partially to reduce our reliance on the Have I Been Pwned API, and partially to limit the times a user is blocked during sign up.
Once we had a design and workflow we were all happy with, we released it to all customers completing the change and reset password workflows. These flows are seldom taken compared to sign up, and less risky to change because existing customers are more forgiving to a potentially negative experience.
To verify our changes were having the intended effect, we used our
analytics.js module to track changes in password strength. As customers and frequent readers of our blog will know, Segment helps our customers track events and make informed decisions for their customers. While monitoring stats about recently changed passwords, we saw a 30% decrease of breached passwords and an increase in average
zxcvbn score from 3 to 3.5, which meant our new UI was having the intended effect.
To make sure our design was pleasing to customers we added the new password UI to our signup flow and displayed it to half of our users as part of a 2 week A/B test. If things went well, we would make it live for everyone, and if it didn’t we’d try a different design. During the first week the signup percentages were the same, and during the second week we actually saw a slightly higher conversion rate on the new password interface. As a result, every customer that signs up now goes through the new user interface and receives guidance on choosing a strong password.
Unfortunately, despite our best efforts in helping users pick strong and unique passwords, we know that many will not and those that do would like the added layer of security of a second factor. Several weeks ago we quietly released MFA to all workspaces that do not authenticate via Single Sign-On (MFA is usually handled at the Identity Provider level for SSO users). Everyone now has the option to use Time-based One Time Passwords (TOTP) via an app that can read QR Codes (e.g. Google Authenticator or 1Password), and our U.S. and Canadian customers can use SMS-based codes sent by Authy.
Similar to the development of our new password experience, Security partnered closely with Engineering and Design, and used Segment tracking events to monitor adoption. Once the feature was released, we let customers know it was available using a simple notification.
We also wanted to closely monitor MFA failures. If someone completes the username and password portion of authentication, and then repeatedly fails the MFA code portion it may mean that account password has been compromised by an attacker. If we see this behavior, we want to let our customer know that they should change their password to help prevent the attacker from gaining access to their account. To do this we used Segment’s Personas product. We created a custom audience of users that have failed MFA a certain number of times in a given time period. When a user enters that audience we send them an email.
Using Segment’s own product to deliver a better Segment experience to our customers is a perfect embodiment of the two tenets we talked about earlier. It helps us become more familiar with the product we’re helping defend which makes us more relevant during design reviews and other security engineering efforts. It also identifies new ways that Segment could be used and marketed to our customers which makes the business successful—all while improving our security posture, helping us safeguard the data our customers have entrusted with us. Our product security story may never be complete, but we’re thrilled to have our customers supporting us on this journey of incremental improvement.
A special thanks goes out to Cat and everyone else that helped make these features possible. There are too many to list but you know who you are 🙂
Theodore Chao, Bilal Mahmood on April 1st 2019
This post is a guest submission by our partner, ClearBrain. It shares their experience building a Segment Destination. Thanks to ClearBrain for sharing their story!
At ClearBrain, we’re building the first self-serve predictive analytics platform. Growth marketers have one consistent objective—forecasting and testing with their best users for incremental ROI, as fast as possible. At ClearBrain, our proprietary AI enables customers to personalize and test every user journey by their predicted intent, for any conversion event, in minutes.
Delivering on this promise of predicting user intent requires two distinct components working together: a data layer and an intelligence layer. On the data side, you need a standardized collection of user attributes and events, aggregated across disparate digital channels—effectively requiring dozens of API integrations. The intelligence layer, in turn, normalizes that data to run machine learning models and predict any conversion event, automating the statistical analyses needed to recommend which audiences are most likely to perform an event.
The challenges and infrastructures required to build these two components of a predictive analytics platform couldn’t be more different. It’s hard enough to build one product, let alone two at the same time. Thankfully due to Segment opening up their platform to partners, this tradeoff was not an issue for ClearBrain.
Segment’s Customer Data Infrastructure enabled us to focus on the intelligence components that truly differentiate ClearBrain. Rather than spending years building API integrations into every customer data input, we instead invested that time into automating the statistical inference necessary to power a predictive analytics platform.
Segment was a natural partner to power the data layer for our platform. Breaking it down, there are 3 critical features of a data layer necessary for predictive analytics: omni-channel data, standardized data, and historical data.
Predictive analytics is built on the foundation of predicting user intent via lookalike modeling. You predict your user’s intent to sign up by analyzing the users who signed up in the past (vs those who didn’t). Thus to build an accurate model of user intent you need a complete digital picture of their user journey. The problem, of course, is data heterogeneity. Apps may be built on Go or Java, running on Android or iOS, or integrated with email clients like Braze or Iterable. Further, companies in different verticals organize their user journeys in completely different ways, from user-based checkout to account-based subscription funnels.
Segment resolves a lot of this data heterogeneity. By building an integration to Segment via their platform, ClearBrain was able to build for one API and automatically collect data from any client, server-side, platform, or email integration. Rather than spending years building an integration for every code library and email API, we got instant access to the dozens of data sources available in Segment’s sources catalog. And all of that data is cleanly organized. Regardless of whether it is a server-side attribute or an email event, all data is received via a universal spec of four main event collections: identifies, tracks, pages, and screens. Further, there are vertical-specific specs for eCommerce and B2B SaaS that map out the user journey via standardized sets of event names specific to each vertical. Regardless of data source, we can always be guaranteed that the data is received in a predictable format.
Clean data is just as vital as homogeneous data when powering a predictive analytics platform. There’s the classic statement, “garbage in, garbage out”. If the data you’re receiving is anomalous, stale, or redundant, the predictive models powering your insights will be too. Thankfully, a benefit of building on top of Segment is that they provide tools for data governance and quality. Their Protocols product guarantees that data received into ClearBrain will be standardized, live, and accurate. I can’t tell you the number of times we’ve seen data come in from other data sources where there are 4 different events for sign up (e.g. sign_up, signup, sign-up, and Sign Up).
Lastly, a critical component for any analytics product is time to value. Any data visualization requires multiple weeks of data to appreciate engagement or retention trends (remember, it takes a minimum of two points to make a line!). This problem is only compounded if your analytics relies on predictive modeling. Predictive modeling is based on analyzing past behavior to infer future behavior, so it follows that the more historical user data you have, the better you can project w/w changes, seasonality, and other critical trends. Segment’s historical Replay plays a critical role for ClearBrain here—rather than waiting weeks to collect enough historical data to power their predictive insights, they can replay years of user events in a matter of days.
These three facets–omni-channel, standardized, and historical data, made integrating with Segment a no-brainer. Rather than spending years on the pipes, we could focus on the statistical inference that makes ClearBrain truly unique. You effectively get all the benefits of being customer data infrastructure, with none of the work!
Building an integration on the Segment platform was really straightforward, thanks to their new Developer Center, clean documentation, and a developer-friendly support team. There are a few main steps to integrate with Segment as a Destination:
Step 1a: Set up a static HTTPS endpoint to receive customer data
The way that Segment is able to send customer data to your destination is by passing a batch of JSON messages as part of a POST request payload to an HTTPS endpoint that you provide. Their specs detail the requirements for your endpoint, but in a nutshell, your endpoint must satisfy the following requirements:
Accept POST requests
Accept data in JSON format
Our particular endpoint is set up using a CNAME in our DNS provider that points to the endpoint for an Application Load Balancer (ALB) in AWS. In the following section, we will talk about the use of reference architecture provided by AWS, that handles setting up an ALB.
Step 1b: Set up your API server (using AWS!)
The bulk of the work centers around building infrastructure that can support the amount of data your customers will be sending you at any point in time (keep in mind that historical Replay is an option that customers may be able to leverage, which can cause a one-time rate of requests higher than your average rate, but we’ll get to that later). Building a scalable API server is not the simplest of tasks, but there are solid templates made publicly available that you can reference. At ClearBrain, we decided to build our API server in AWS, which was made easier using a boilerplate provided by AWS.
This particular reference architecture provided by AWS uses a set of Cloudformation templates to define a set of related resources in AWS (e.g. VPCs, ECS Clusters, Subnets, ECS Services) that will represent the entire API server. We won’t go into specifics of how we adapted the templates for developing the ClearBrain destination, but here are a few changes we made on top of the reference architecture to productionize our API server:
Replace product-service and website-service with our own API service, which was loosely based on the product-service .yaml file
The AWS template provides auto-scaling at the ecs service level, but does NOT provide auto-scaling at the cluster level, which means that the cluster would not auto-scale in response to the auto-scaling of the contained services
Define a Makefile
We wanted to be able to run commands that would handle various tasks, such as:
running the API server locally for development
sending batches of messages at various rates for stress testing
validating cloudformation templates prior to deploying
deploying cloudformation templates
To find a sweet spot for minimal cluster size, as well as verify that our cluster would be able to scale up to handle larger loads (especially during Replay), we performed a series of load tests. This involved sending synthetic loads to a test cluster at varying levels of traffic, both with autoscaling off and on, carefully observing resulting performance (latency, errors), and verifying autoscaling mechanics and speed.
For the load test, we used an open source benchmarking tool, wrk. This tool allows specification of parallelism level and even the exact query to send (which allows sending realistic queries). It then measures performance while sending requests as fast as possible–and as wrk is written in C++ it’s able to do so with very minimal CPU usage. However, the nodes sending load are still limited by their network throughput, so we made sure to use high network throughput instances. On AWS, we used c5n class instances to send the load, but even then, we had to run multiple instances in parallel to ensure that enough load was sent to the cluster so that it was pushed to its limit.
Finally, in evaluating the cluster size, consider that it will take some time for a high CPU condition to be measured, plus some additional time to provision instances, install the image, start the server, and add them to the load balancer. Leave a comfortable margin of time to be safe. In our case, we verified (with auto-scaling off) that the cluster could comfortably handle well over 90% CPU utilization without affecting correctness or latency. Then, when setting up auto-scaling we set a target tracking policy with CPU utilization target of 50%. We also set a minimum cluster size so that even in low-traffic times if we received a surge of traffic, the cluster could handle it comfortably without needing to wait to scale up.
Step 1c: Build your ingestion logic
Once your API server is set up to receive requests, the rest of the work is (mostly) business logic. Ultimately, you just need to ensure that incoming requests get handled appropriately. Possible tasks include:
validating requests are coming from expected customers’ Segment sources
stripping / augmenting requests
writing data to persistent storage + partitioning
It is strongly encouraged that you build validation into your API service, to ensure that you are processing data that is being expected. The simplest way to perform validation is to check the
Authorization Header that is sent with each request from Segment. As mentioned in their documentation, each request will have an
Authorization Header that contains a Base64 encoded string containing a secure token that you will be passing to Segment as part of the OAuth flow (see Step 2). By decoding the header, you can verify whether you have already issued this token to a customer, and additionally map the token to a customer, if you so choose.
The next recommendation is to write specific logic to handle any of the various types of calls that Segment supports (page, track, identify, group). You can do this by checking the type property on each JSON message and routing your logic accordingly.
The last recommendation is to respond to the request with an appropriate status code. See documentation for details on what each unique status code means to Segment.
Step 2: Build the OAuth flow to allow users to set up the destination
Once your API endpoint has been tested, submitted for review, and approved, the last step is to build an easy OAuth flow to make it easy for your customers to set up your integration as a destination in their Segment accounts.
Segment provides a button, which you can embed on your site/application, that handles redirecting your users to Segment and allowing them to select a source to set up for a Destination. Due to some technical complications with how ClearBrain’s app works, we ended up inspecting their source code and boiled it down to a simple redirecting of the browser to https://app.segment.com/enable, passed with the following properties as query parameters:
the unique name of your integration in Segment
Base64 encoded string of
Base64 encoded string of the url for Segment to redirect back to
A sample url would look like:
Notice the securityToken that is passed in as part of the base64 encoded settings query parameter. This will be a unique and secure token that you will generate (and save!) on your end and pass to Segment. Segment will then send this security token back with every request to your API endpoint, which you can use to validate the request (as mentioned in Step 1c).
Building our integration into Segment–from OAuth to API server to data ingestion, took only a couple days to implement. Thats days of work compared to the months, if not years, it would take otherwise to build and maintain a whole data infrastructure layer.
In turn, we’ve been able to focus on the statistics and machine learning components necessary for a best-in-class predictive analytics platform–one that from day one can deliver on the promise of predicting user intent on top of omni-channel, standardized, and historical data powered by Segment.
Learn more about how Clearbrain and Segment work together here.
Maxime Santerre on February 28th 2019
Counting things is hard. And it gets even harder when you have millions of independent items. At Segment we see this every single day, counting everything from number of events, to unique users and other high-cardinality metrics.
Calculating total counts proves to be easier since we can distribute that task over many machines, then sum up those counts to get the total sum. Unfortunately, we can’t do this for calculating cardinality since the sum of the number of unique items in multiple sets, isn’t equal to the number of unique items in the union of those sets.
Let’s go through the really basic way of calculating cardinality:
This works great for small amounts of data, but when you start getting large sets, it becomes very hard to keep all of this in memory. For example, Segment’s biggest workspaces will get over 5,600,000,000 unique
anonymous_id in a month.
Let’s look at an example of a payload:
Another big issue, is that you need to know which time range you want to calculate cardinality over. The number of unique items from February 14 to February 20 can be drastically different than February 15 to February 21. We would essentially need to keep the sets at the lowest granularity of time we need, then merge them to get the unique counts. Great if you need to be precise and you have a ton of time, not great if you want to do quick ad hoc queries for analytics.
Thankfully, we’re not the first ones to have this issue. In 2007, Philippe Flajolet came up with an algorithm called HyperLogLog, which could approximate the cardinality of a set using only a small, constant amount of memory. HyperLogLog also allows you to increase this amount of memory to increase accuracy by using more memory, or to decrease memory at the cost of accuracy if you have memory limitations.
16kb for 0.81% error rate? Not too bad. Certainly cuts down on that 134.4GB we had to deal with earlier.
We initially decided to use Redis to solve this problem. The first attempt was a very simple process where we would take every message out of a queue and put it directly into Redis and then services would call
PFCOUNT to get the cardinality counts we needed.
This worked very well for our basic needs then, but it was hard for our analytics team to do any custom reporting. The second issue is that this architecture doesn’t scale very well. When initially got to that point, our quick fix was this:
Not ideal, but you would be surprised how much we got out of this. This scaled upwards of 300,000 messages/s due to Redis HLL operations being amazingly fast.
Ever so often, our message rate would exceed this and our queue depth would grow, but our system would keep chugging along once the message rate went back down. At this point we had the biggest Redis machine available on AWS, so we couldn’t throw more money at it.
The maximum queue depth is steadily increasing every week and soon we will always be behind. Any kind of reporting that depends on this pipeline will become hours late, and soon maybe even days late, which is unacceptable. We really need to find a solution quickly
We already used Spark for our billing calculations, so we thought we could use what was there to feed into another kind of reporting.
The problem with the billing count is that they’re static. We only calculate for the workspace’s billing period. It’s a very expensive operation that takes every message for a workspace and calculates a cardinality over the
anonymous_id for that exact slice. For some bigger customers, that’s several terabytes of data, so we need big machines, and it takes a while.
We previously used HLLs to solve this problem, so we’d love to use it again.
The problem is: How do we use HLLs outside of Redis?
Our requirements are:
We need a way to calculate cardinality across several metrics.
We need to be able to store this in a store with quick reads.
We can’t pre-calculate cardinality, we need to be able to run ad-hoc queries across any range.
It needs to scale almost infinitely. We don’t want to fix this problem again in a year when our message rate doubles.
It needs very little engineering resources.
(Optional) Doesn’t require urgent attention if it breaks.
We want something reliable and easy to query, so we think of PostgreSQL and MySQL. At that moment, we find they are equally fine, but we choose MySQL because we already have one to store reporting data and the clock is ticking .
We use Scala for our Spark jobs, so we settle on Algebird to compute HLLs. It’s an amazing library and using it with Spark is trivial, so this step was very smooth.
The map-reduce job goes like this:
On every message, create a HLL out of every item you want cardinality metrics on and return that as the result of your key/map.
On the reduce step, add all of the HLLs resulting from the map.
Convert the resulting HLLs to the bytes representation and write them to MySQL.
Now we have all the HLLs as bytes into MySQL, we need a way to use them from our Go service. A bit of Googling around led me to this one. There were no ways of directly injecting the registers, but I was able to add it pretty easily and created a PR which will hopefully get merged one day.
The next step is to get the registers and precision bits from Algebird, which was very simple. Next, we just take those bytes and make GoHLL objects that we can use!
Sweet! Now we can calculate cardinality of almost anything very easily. If it breaks, we fix the bug and turn it back on without any data loss. The worse that can happen is data delays. We can start using this for any use case: Unique events, unique traits, unique mobile device ids, unique anything.
It doesn’t take long for us to realize there’s 2 small problems with this:
Querying this is done through RPC calls on internal services, so it’s not easy for the analytics team to access.
At 16kb per HLL, queries ranging over several days on several metrics creates a big response from MySQL.
The latter being quite important. If we’re doing analysis on all our workspaces multiple times a day, we need something that iterates quickly. Let’s say someone wants to figure out the cardinality of 10 metrics, over 5 sources and 180 days.
We’re pulling down 140 MB of data, creating 9,000 HLLs, merging them, and calculating cardinality on each metric. This takes approximately 3–5s to run. Not a huge lot, but if we want to do this for hundreds of thousands of workspaces, it takes way too long.
So we have a solution, but it’s not ideal.
I sat down at a coffee shop one weekend and started looking around for better solutions. Ideally, we’d want to do this on the database level so we don’t have to do all of this data transfer.
MySQL has UDFs (user defined functions) that we could use for this, but we use MySQL on AWS, and from my research, there doesn’t seem to be a way to use UDFs on Aurora, or RDS.
PostgreSQL on the other hand, has an extension called
postgresql-hll, which is available on PostgresSQL RDS.
The storage format is quite different unfortunately, but it’s not as much of a problem since they have a library called java-hll that I can use in my Spark job instead of Algebird. No need to play with the headers this time.
Now we’re only pulling down 64 bit integers for each metric, we can query cardinality metrics with SQL directly, and this pipeline can scale almost infinitely. The best part? If it breaks for some reason at 4am, I don’t need to wake up and take care of it. We’re not losing any data because we’re using our S3 data archives and it can be retried at any time without any queues filling up disk.
So now that we have all of this processing from now going forward, we need to back fill everything back to the beginning of the year. I could take the lazy way and just run the Spark jobs on every day from January 1st 2018 to now, but that’s quite expensive.
I said earlier we didn’t need to play with the headers, but we have an opportunity to save a bit of money (and have some fun) here by taking the HLLs saved in MySQL and transforming them into the postgresql-hll format and migrating them to our new PostgreSQL database.
You can take a look at the whole storage specification here. We’ll just be looking at the dense storage format for now.
Let’s look at an example of HLL bytes in hexadecimal:
The first byte:
14 is defined as the version byte. The top half is the version and the bottom is the type. Version is easy, there’s only one, which is
1. The bottom half is
4, which represents a
FULL HLL, or what we’ve been describing as dense.
The next 2 bytes represents the
log2m parameter and the
log2m is the same as precision bits that we’ve seen above. The top 3 bits represents
registerWidth - 1 and the bottom 5 represents
log2m. Unfortunately, we can’t do this directly from the hex, so let’s expand it out to bits:
The rest are the HLL registers bytes.
Now that we have this header, we can easily reconstruct our MySQL HLLs by taking their registers and prefixing those with our
146e header. The only thing we need to do is take all of our existing serialized HLLs in MySQL and dump them in this new format in PostgreSQL. Money saved, and definitely much fun had.