Leif Dreizler on March 31st 2020
Segment receives billions of events every day from thousands of customers that trust Segment to keep their data safe. At Segment, we believe that good security is an essential part of building high-quality software, similar to reliability or scalability. In addition to the tools and processes developed by our Security Org to help software engineers make good security choices, we also rely on reviews from both traditional and crowdsourced security companies.
Bug bounties are often seen as a significant burden for the security teams that run them. You’ll hear horror stories about companies launching a bug bounty, their security team getting inundated with low-quality reports, duplicate submissions, and researchers going outside the scope of the program.
Shortly after, you'll hear about their engineering team being overwhelmed with newly discovered vulnerabilities to fix. From those that survive the opening salvo, you may hear complaints that, over time, they have stopped receiving impactful submissions.
A few years ago, when the space was less mature, critics questioned whether running a program was worth it. Now, it is expected that organizations of a certain size and maturity run a bug bounty program.
In late 2019, the Cybersecurity and Infrastructure Security Agency (CISA) of the United States published a draft directive recommending all agencies develop a mechanism to accept information related to security vulnerabilities from external parties.
In this blog post, we’ll break down how to start and manage a bug bounty program, consistently achieve good results, and maintain healthy relationships with the people that power the program.
If you’re short on time, check out the “Top Tips” section at the bottom of this post.
A bug bounty program is like a Wanted Poster for security vulnerabilities.
Companies running bug bounty programs pay independent security researchers from across the world for security vulnerabilities in their own products and infrastructure.
I assume most readers are already bought into the benefits of running a bug bounty.
Most companies that have an internet presence or make an internet-connected device should consider running a bounty, or at least have a way for security researchers to report security issues.
It is also part of the Vendor Security Alliance questionnaire, so it may be something your customers ask you about if you are in the B2B space.
If you don’t have a way for researchers to report issues, they will email people at your company (or any alias they can find on your website), connect with you on LinkedIn, or just tweet about the issues they think they’ve found.
It’s a much better experience for researchers, your employees, and your customers if you give the security community a clear avenue to report vulnerabilities.
Your security and engineering orgs will be regularly impressed by the creativity of the researcher community. These are people that, without internal knowledge, can find critical vulnerabilities in organizations across the world.
I strongly recommend using a bug bounty platform like HackerOne or Bugcrowd (we use Bugcrowd here at Segment) to help manage this process. These companies provide a platform and services to help run an efficient program.
Severity baselines make it easier to tell how serious a vulnerability is, and how much time you should be spending on review and remediation.
When running a program on your own, you’ll frequently have researchers overhyping vulnerabilities. Platforms have a guided submission form, which helps researchers pick the appropriate category and rating.
The reputation systems reward researchers that accurately rank vulnerabilities and creates a competitive environment that benefits both researchers and program owners.
It also helps reinforce good behavior. Any researcher discipline issues have stricter consequences. If a researcher misbehaves, they may be banned from the platform.
To submit vulnerabilities via these platforms, researchers have to agree not to disclose the vulnerability without approval from your company.
Both platforms also provide triage services, which I highly recommend paying for. These are the first line of defense for your internal security resources. These globally distributed teams will help clean up researcher reports, mark submissions as duplicates, and filter out low-quality reports.
These companies also serve as a knowledge base for you to learn about running a program and ask questions. You can bounce ideas off of someone that works at a company running hundreds of programs.
Platforms have structured input with required fields and integrations with popular tools, like Jira. These make it much easier to turn a submission into a ticket for your engineering org.
For most companies, it isn’t possible to run a private program without the help of a bug bounty platform.
We’ll talk about private programs in more depth later, but this is the recommended starting point for companies launching a bug bounty for the first time.
All of the above features free your team to focus on the security challenges unique to your business.
Having a successful program starts with a good foundation, and it’s your job as a program owner to help set your organization up for success.
Think about your current process for handling a security vulnerability. What happens when someone internally or externally finds a security bug?
You will inevitably get more of these after starting a bug bounty program, so you must have a good way to handle these reports.
Your vulnerability process doesn’t have to be perfect, but you should have a way to prioritize and assign bug reports to the appropriate engineering team without high overhead.
As someone starting a program, you’ll also need to get buy-in from your Engineering org. They are the ones that will have to fix the issues and will likely be the ones responding to alerts triggered by researchers.
Your alerting story doesn’t need to be perfect. But you also don’t want engineers to be woken up every time someone triggers an error because some input validation was working correctly and stopped a researcher from submitting a
< symbol into an email address field.
Remember, your team doesn't have to fix every valid vulnerability immediately.
Vulnerabilities are just bugs and should be prioritized appropriately. If you’re having trouble getting your engineering org to fix critical vulnerabilities in a timely manner, you may want to direct your efforts to job-hunting instead of starting a bug bounty program 🙃
Once your organization is bought-in, you can focus on getting things ready for the researchers.
Your bounty brief is what researchers will read to determine if they’re interested in working on your program. It’s part instructions, part rules, and part advertisement.
Keep in mind you’re competing for researchers’ time; they don’t have to work on your program when there are so many other programs.
Your bounty brief should be clear, concise, and should set expectations with the researchers. You can find the Segment program hosted on Bugcrowd.
Where do you want researchers to focus their testing? What’s in scope? What’s out of scope?
I recommend starting with assets that are easy for the researchers to access, ideally something free, that anyone can sign up for.
You should also try to pick a target that has at least medium size, complexity, and business impact. This will help you show value early, which will help you expand the program.
How do researchers get access to the scope? Are there docs they can read to help them get up to speed? We instruct researchers to sign up for our app using their @bugcrowdninja.com email address and to include
-bugcrowdninja as part of their workspace slug.
This makes it easier for us to determine if someone is part of our Bugcrowd program when we review logs and alerts. If we notice someone causing problems in our app, we can ask Bugcrowd to provide researcher coaching.
How are you going to rate submissions? Consistent severity is important because it impacts how much the researcher gets paid for a submission. HackerOne uses Mitre’s Common Weakness Enumeration (CWE) and Bugcrowd uses the Bugcrowd Vulnerability Rating Taxonomy (VRT).
How much are you going to pay for vulnerabilities? Researchers need to know upfront how much you’ll pay for vulnerabilities so they can assess if it is worth their time to hunt on your program.
Think about using different reward ranges for different assets. This can help control costs and also helps researchers understand which targets are more important. For example, we describe specific objects that will net a higher reward:
A handful of years ago, getting a T-shirt as a reward was pretty standard. I’d strongly encourage anyone thinking about running a swag-based bounty to reconsider.
T-shirts don’t pay rent and are more work for your team than sending a researcher money. What do you do when that T-shirt you sent to another continent is lost in the mail or doesn’t fit?
We reserve swag for our top performers. Sending a T-shirt requires the researcher to trust us enough to give us their address and requires me to go to the post office.
Take the time to explain what a bug bounty is, why it's important, and have a few examples from recognizable organizations ready to show them.
Learn a little bit about the platform you’re using. Your actions on the platform impact the researcher. If you mistreat researchers, they will go elsewhere; without researchers, your program isn’t providing value to your organization.
The same report status can have different meanings and impact on different platforms.
For example, on HackerOne
Not Applicable reduces the researcher’s site-wide score by 5 points, and should be used for reports that don’t contain a valid issue.
Not Applicable does not impact the researcher’s score, and is commonly used for reports that should neither be accepted or rejected. To achieve this result on HackerOne, you would use the
If you have any questions about the platform you’re using, I strongly recommend reviewing documentation or reaching out to your account manager for help.
Regardless of how big your company’s internet footprint is, you can start with a small scope open only to a handful of individuals as part of a private program.
In mid-2017, Segment was running a private program with 25 researchers and a single target: our app.
The early researchers invited will be some of the platform’s most trusted, and they will generally be more accepting of companies that are learning how to manage a program, as long as you pay them fairly and treat them with respect.
Starting small allows your organization to learn how to run a program in a safer environment. If your vulnerability management program has some gaps, you can fix them; if your bounty brief is unclear, you can rewrite it; if your alerts aren’t tuned properly, you can invest time into improving them. If you need to pause your program, you can relaunch later with a less negative impact.
Even while we had a private program, we would direct researchers that reached out via email to our Bugcrowd program. This allowed us to receive the benefits of the platform and triage services for all submissions before running a public program.
It’s much easier to explain to a researcher why you won’t be paying for a low-effort submission when you have a prioritization system established and enforced by a third party.
Like any multi-year project, your bug bounty will evolve and require ongoing effort to keep it healthy.
Researchers love testing new features and assets; in most bug bounty programs, only the first person to find a vulnerability receives a monetary reward.
If you started with a small scope, your program is steady, and you’re ready for more submissions, this is a great time to add more targets to your brief.
As you add scope, keep in mind that not all assets are of equal value to an adversary. It is encouraged to specify different reward ranges for different assets based on their security maturity and value.
Over time, you should also consider an “open scope” bounty if it is appropriate for your organization. We have listed as a target,
Any host/web property verified to be owned by Segment (domains/IP space/etc.), which serves as a catch-all for anything not explicitly listed in our “In Scope” or “Out of Scope” sections.
Having an open scope bounty is enticing to researchers. Not only does it show you take running a bug bounty program seriously. It also shows that regardless of where they find a vulnerability, it will likely be rewarded (assuming it is valid, unique, and not out of scope).
Many researchers specialize in finding forgotten internet-facing assets as part of open-scope programs, and have developed their own infrastructure to identify assets and vulnerabilities to be able to efficiently earn rewards.
It’s also worth noting that there is no scope for an attacker trying to compromise your company’s security. Working towards an open scope means that it is more likely a bug bounty researcher will find and report a vulnerability before an attacker exploits it.
Over time, you’ll build trust and form relationships with particular researchers. These are great people to give early access to upcoming features. Many times, these features require manual provisioning, making them less suitable for wide-scale testing.
Early access is a mutually beneficial system in which you will receive security vulnerabilities prior to release, which makes them easier to fix. Researchers will be able to test features with less competition, which makes them more likely to earn a reward and continue testing on your program.
If the effort to set up these features is medium or higher, consider paying the researcher a grant before they start working.
Clearly communicate what you're looking for and what you expect from them. When offering a researcher grant, we want to see a short write-up of what didn't work in addition to any findings they submit. Rewardable findings should be paid in addition to the grant.
Once you’ve been running a program for a while and are confident in your company’s ability to receive vulnerabilities from the global researcher community, you should consider evolving it into a public program.
If you don’t have a wide scope, this is a great time to revisit that decision.
Maximizing your scope (while private) will reduce the uptick in submissions when your program is launched publicly. You should also invite as many researchers as possible to your private program before going public for the same reason.
Because public programs are open to anyone, you will inevitably receive testing from a lot of newer folks that will pay less attention to your bounty brief, so having a wide scope helps in this regard as well.
Segment has run a public bug bounty program since late 2018, roughly 18 months after launching our private program.
Hopefully over time, you will think of your outsourced triage team as an extension of your internal team. Spending the time to let them know how you want your program run will pay dividends in the future. Any submission they can validate without asking your team questions saves time for everyone involved.
Here are some examples of guidance we’ve given to the Bugcrowd triage team:
Identify duplicates for non-rewardable submissions
Many programs do not bother to mark informational, out of scope, or other non-rewardable submissions as duplicates. We do this for two reasons:
The first is that if we decide to fix one of these issues later, we can go back and mark the original submission as resolved and pay the researcher. Any duplicates of this issue will still receive points.
The second is that when there is a false positive identified by a tool commonly used by bug bounty hunters, you will get this submitted to your program a lot.
If we repeatedly see an out-of-scope or not reproducible submission, we can add a specific item in our bounty brief to warn researchers; it will be marked as out-of-scope or not reproducible without a working proof of concept.
Don’t be afraid to deduct points for undesired behavior
While we are generally laid-back and understanding program owners, we aren’t afraid to deduct points from a researcher’s score for when it’s warranted.
Many programs shy away from deducting points, but we want to ensure participants in our program thoroughly read our brief and think that it helps the larger bug bounty community to slightly penalize those that disregard clearly documented rules.
Two of the common arguments against bug bounty programs is that the submissions are often low-value and that researchers don’t respect scope.
For example, we have a very small out-of-scope section, which includes:
CORS or crossdomain.xml issues on api.segment.io without proof of concept.
This is identified by a tool commonly used by bug bounty participants is a finding we have received dozens of times, but never with any impact.
We do this to save time for both researchers and our triage team. If a researcher submits this finding without a proof of concept, we encourage Bugcrowd to mark this as out-of-scope. If a researcher submits a finding that showed impact, we would be more than happy to reward, fix, and update our brief.
If you need to deviate from the baseline rating established in your bounty brief, take the time to explain to the researcher why the rating and reward are higher or lower than they might expect.
Researchers are generally understanding, as long as your rating, reward, and explanation are fair and make sense. If you find yourself commonly deviating from the ratings, it may be time to make changes to your bounty brief so that researchers know what to expect in advance. If you make severity or scope changes as the result of a submission, reward the researcher at whichever rate is more favorable to them.
In addition to explaining why something was rated lower than the baseline, take the time to explain why something was rated higher than the baseline. This is a great way to encourage further testing in these areas and is a great way to build trust with a researcher.
These explanations also help your triage team learn more about your program, and allow them to more accurately triage future submissions.
Take time to build relationships and trust with researchers, especially those that repeatedly submit to your program. Treat researchers fairly, with respect, and consider paying for anything that brings value to your organization.
You’re competing for a researcher’s time, especially the ones that are the most talented. They can likely work on almost any bug bounty program available; think about ways you can encourage them to work on yours.
Keep in mind that all researchers, even those that are unskilled, are human beings. Assume that they want to help you secure your organization, learn more about security and technology, and get paid.
If there is one sentence you remember from this blog, I hope it is “pay for anything that brings value.”
Bug bounty hunters put in a lot of time and effort that doesn’t result in getting paid. This could be time spent developing tooling, hunting without finding any bugs, or having a valid bug marked as a duplicate.
Try to avoid thinking about individual bug costs. Instead, think about the overall value the program brings to your organization in terms of bugs found, time saved, and peace of mind. If you’re debating between two severities, pick the higher one and pay the researcher at that rate. You can always change the severity in your internal ticketing system later.
Once you’ve received a rewardable submission, try to triage and pay quickly. Sometimes determining the full impact takes time; if this is the case, add a comment letting the researcher know you appreciate their work but need some extra time to determine the appropriate reward.
Work collaboratively with the researcher
As an employee of your company, you should know more about the codebase and infrastructure than a security researcher outside your organization (although occasionally I question this based on the creative and impactful submissions we receive 😅).
Sometimes when running a bug bounty program, you’ll get a submission that makes you think, “What’s the next level the researcher could get to?” If this is a researcher you trust, it may be appropriate to give them some hints to help further their testing. If you give them hints, you can also issue some cautionary advice to help them continue in a way that is safe for your organization and customers.
Giving the researcher hints helps show them you value their testing and saves your team from spending time on something that may not be possible. If the hint is helpful, the researcher will be submitting a higher-severity finding, which positively impacts their researcher score and earns a higher monetary reward. It also allows you to get the vulnerability fixed faster due to the higher severity.
Sometimes, it isn’t appropriate to include the researcher in this phase of the process. If our team continues the investigation, and it leads to the discovery of a higher-impact weakness in our systems, we reward the researcher as if their report contained the full impact. We also explain why we paid them at this higher rate, but let them know we are unable to share the details. This is a great way to show the researcher you value their work and build trust.
Share progress with the researcher
If a researcher submits a vulnerability that leads to a systemic fix for a vulnerability class, share this with them! Researchers are generally excited to hear that their work led to meaningful change within your organization. It also is a cue for them to attempt to bypass the new protections.
Pay for Dupes
At Segment, we commonly pay researchers for well-written duplicates, and frequently reach out to let them know that we appreciated their submission. We also let them know that we don’t always pay for duplicates to make sure that expectations are set appropriately.
This has worked out incredibly well for us. All of our most critical submissions have come from researchers that were originally rewarded for a well-written duplicate. Segment is a complex product that takes time to set up and fully understand. Researchers that put in the effort to fully set up a Segment workspace have additional context and understanding that take time to acquire—these people are hard to replace, and you want to keep them happy.
Pay bonuses for well-written reports
We also pay extra for well-written reports. Valid submissions need to get turned into Jira tickets which are assigned to engineering teams. Reports that are concise, easy to follow, have clear impact, and are well-formatted take less time for us to turn into tickets. We want to encourage researchers to save us time so we make sure to reward appropriately and let them know that we appreciate their efforts.
Running a successful bug bounty program requires consistent effort from your team, but can bring tremendous value to your company and customers. Any vulnerability reported and fixed is one fewer vulnerability an attacker could use to get a foothold in your organization. Bug bounty submissions can help illuminate vulnerability trends, which can help prioritize where you spend resources to fix systemic issues in your applications or infrastructure.
Bug bounty programs are people-powered. Spend the time to make those involved in your program feel valued, help them understand the motivations behind your decisions, and be excellent to each other!
Thanks for taking the time to read my post! I hope you learned a few things to help your company run a successful program. Here are some of my top tips to reference later:
Pay for anything that brings value
Pay extra for well-written reports, even if they’re dupes
Avoid thinking about individual bug costs
Partner with a bug bounty platform and pay for triage services
If you make changes to your bounty brief as the result of a submission, reward the researcher at the more favorable rate
Invest time into building and maintaining relationships with your researchers and triage team
Don’t be afraid to deduct points for bad behavior
Start small and partner early with Engineering
Write a clear and concise bounty brief to set expectations with the researchers
A special thanks to the Segment Engineering organization for fixing vulnerabilities and responding to alerts.
To Edis from Bugcrowd for helping us triage three years of vulnerabilities and truly being an extension of our team.
To all the researchers that have helped keep our customers safe by looking for vulnerabilities as part of our program.
And finally, to researchers danieloizo and sheddow, you have both submitted numerous well-written and high impact findings and are an absolute pleasure to work with.
Calvin French-Owen on December 15th 2015
At Segment, we’ve fully embraced the idea of microservices; but not for the reasons you might think.
The microservices vs. monoliths debate has been pretty thoroughly discussed, so I won’t completely re-hash it here. Microservices proponents say that they provide better scalability and are the best way to split responsibilty across software engineering teams. While the pro-monolith group say that microservices are too operationally complex to begin with.
But a major benefit of running microservices is largely absent from today’s discussions: visibility.
When we’re getting paged at 3am on a Tuesday, it’s a million times easier to see that a given worker is backing up compared to adding tracing through every single function call of a monolithic app.
That’s not to say you can’t get good visiblity from more tightly coupled code, it’s just rarer to have all the right visibility from day one.
Where does that visibility come from? Consider for a moment the standard tools that are part of our ops arsenal:
None of them monitor individual program execution: hot codepaths, stack size, etc. The battle-tested tools we’ve built over the past 20 years are all built around the concepts of hosts, processes, or drives.
With a distributed system, we can add in requests and network throughput to our metrics, but most tools still tend to aggregate at a host or service level.
The process-centric nature of monitoring tools makes it really difficult to get a sense of where a program is actually spending time. With a monolithic app, our best options to debug are either to run the program against a profiler or to implement our own timing metrics.
Now that’s kind of crazy when you think about it. Most of the reason flamegraphs are so useful is that we don’t have that detailed amount of monitoring at the level of individual function calls.
So instead of trying to shoe-horn lots of functionality into monoliths, at Segment we’ve doubled down on microservices. We’re betting that container scheduling and orchestration will continue to get easier and more powerful, while most metrics and monitoring will continue to be dominated by the idea of ‘hosts’ and ‘services’.
The caveat here is that microservices only work so long as it’s actually easy to create new services. Otherwise we’ve just traded a visibility problem for a provisioning problem.
In other posts, we’ve talked a little bit about what our services look like, and how we build them with terraform. And now, we’ve started splitting each service into modules, so we can re-use the exact configuration between stage and prod.
Here’s an example of a simple auth service, using terraform as our configuration to set up all of our resources:
For the curious, you can check out an example of the full module definition.
As long as there’s a singificant benefit (free metrics) and low cost (10-line terraform script), we remove the temptation to tack on different functionality into an existing service.
And so far, that approach has been working quite well.
Segment is a bit unusual–instead of microservices which are coordinating together, we have a lot of what I’d call “microworkers.” Fundamentally, it’s the same concept, but the worker doesn’t serve requests to clients. Instead, the typical Segment worker reads some data from a queue, does some processing on it, and then acks the message.
These workers end up being a lot simpler than services because there are no dependencies. There’s no coupling or worrying that a given problem with one worker will compound and disrupt the rest of a system. If a service is acting up, there’s just a single queue that ends up backing up. And we can scale additional workers to handle the load.
There are a few forces at work which make tiny workers the right call for us. But the biggest comes from our team size and relative complexity of what we’re trying to build.
Microservices are usually touted when the team grows to a size where there are too many people working on the same codebase. At that point, it makes sense to shard ownership of the codebase by team. But we’ve seen it be equally helpful with a small team as well.
Most folks I talk with are surprised at how small our engineering team is. To give you a rough sense of our scale:
400 private repos
70 different services (workers)
We’re in the postion of having a large product scope and a small engineering team. So if I’m currently on-call and get paged, it could be for code that I wrote 6-months ago and haven’t touched since.
And that’s the place where tiny, well-defined, services shine.
Here’s the typical scenario: first there’s an alert which gets triggered because a particular queue depth is backing up.
We can verify this is really the case (and isn’t getting better) by checking the queue depth in our monitoring tools.
At that point we know exactly which worker is backing up (since each worker subscribes to a single queue), and which logs to look at. Each service logs with its own tags, so we don’t have to worry about unrelated logs interleaving within a single app for multiple requests.
We can look at Datadog for a single dashboard containing that worker’s CPU, Memory, and the responses and latency coming from it’s ELB. Once we’ve identified the problem, it’s a question of reading through 50-100 line file to isolate exactly where the problem is happening (let’s play spot the memory leak!).
With a monolith, we could add individual monitoring specifically for each endpoint. But why bother when we get it for free by running code as part of its own process?
Not to mention the fact that we also get isolated CPU, memory, and latency (if the service sits behind an ELB) out of the box. It’s infinitely easier to track down a memory leak in a hundred-line worker with a single codepath than it is in a monolithic app with hundreds of endpoints.
I understand this approach won’t work for everyone. And it requires a pretty significant investment in up-front tooling to make sure that creating a new service from scratch has everything it needs. Depending on your team, workload, and product scope, it might not make sense.
But for any product operating with a high level of operational complexity and load, I’d choose the microservice architecture every time. It’s made our infrastructure flexible, scalable, and far easier to monitor–without sacrificing developer productivity.
Calvin French-Owen on November 20th 2015
Every month, Segment collects, transforms and routes over 50 billion API calls to hundreds of different business-critical applications. We’ve come a long way from the early days, where my co-founders and I were running just a handful of instances.
Today, we have a much deeper understanding of the problems we’re solving, and we’ve learned a ton. To keep moving quickly and avoid past mistakes, our team has started developing a list of engineering best practices.
Now that a lot of these “pro tips” have been tested, deployed and are currently in production… we wanted to share them with you. It’s worth noting that we’re standing on the shoulders of giants here, to The Zen of Python, Hints for Computer System Design, and the Twelve-Factor App for the inspiration.
Editor’s Note: This post was based off an internal wiki page for Segment “Pro Tips”. There are more tips recorded there, but we chose a handful that seemed most broadly applicable. They’re written as fact, but internally we treat them as guidelines, always weighing other trade-offs within the organization. Each practice is followed by a few bullet-points underscoring the main takeaways.
When we first started out, we had one massive repo. Every module was filled with tightly coupled dependencies and was completely unversioned. Changing a single API required changing code globally. Developing with more than a handful of people would’ve been a nightmare.
So one of our first changes as the engineering team grew was splitting out the modules into separate repos (thanks TJ!). It was a massive task but it had huge payoff by making development with a larger team actually sane. Unfortunately, it was way harder than it should have been because we lumped everything together at the start.
It turns out this temptation to combine happens everywhere: in services, libraries, repos and tools. It’s so easy to add (just) one more feature to an existing codebase. But it has a long-term cost. Separation of concerns is the exact reason why UNIX-style systems are so successful; they give you the tools to compose many small building blocks into more complex programs.
structure code so that it’s easy to be split (or split from the beginning)
if a service or library doesn’t share concerns with existing ones, create a new one rather than shoe-horning it into an existing piece of code
testing and documenting libraries which perform a single function is much easier to understand
keep uptime, resource consumption and monitoring in mind when combining read/write concerns of a service
prefer libraries to frameworks, composing them together where possible
“Clever” code usually means “complicated” code. It’s hard to search for, and tough to track down where bugs are happening. We prefer simple code that’s explicit in it’s purpose rather than trying to create a magical API that relies on convention (go’s lack of “magic” is actually one of our favorite things about it).
As part of being explicit, always consider the “grep-ability” of your code. Imagine that you’re trying to find out where the implementation for the
post method lives, which is easier to find in a codebase?
Where possible, write code that is short, straightforward and easy to understand. Often that will come down to single functions that are easy to test and easy to document. Even libraries can perform just a single function and then be combined for more powerful functionality.
With comments, describe the “why” versus the typical “what” for a given process or routine. If a routine seems out of place but is necessary, it’s sometimes worth leaving a quick note as to why it exists at all.
avoid generating code dynamically or being overly ‘clever’ to shorten the line count
aim for functions that are <7 lines and <2 nested callbacks
Running code in production without metrics or alerting is flying blind. This has bitten our team more times than I’d care to admit, so we’ve increased our test coverage and monitoring extensively. Every time a user encounters a bug before we do, it damages their trust in us as a company. And that sucks.
Trust in our product is perhaps most valuable asset we have as a company. Losing that is almost completely irrecoverable; it’s the way we lose as a business. Our brand is built around data, and reliability is paramount to our success.
write test cases first to check for the broken behavior, then write the fix
all top-level apps should ship with metrics and monitoring
create ‘warning’ alerts for when an internal system is acting up, ‘critical’ ones when it starts affecting end customers
try to keep unrealistic failure scenarios in mind when designing the alerts
When building a product, there are three aspects you can optimize: Speed, quality, and scope. The catch… is that you can’t ever juggle all three simultaneously. Sacrificing quality by adding hacky fixes increases the amount of technical debt. It slows us down over the long-term, and we risk losing customer trust in the product. Not to mention, hacks are a giant pain to work on later.
At the same time, we can’t sacrifice speed either–that’s our main advantage as a startup. Long-running projects tend to drag on, use up a ton of resources and have no clearly defined “end.” By the time a monolithic project is finally ready to launch, releasing the finished product to customers becomes a daunting process.
When push comes to shove, it’s usually best to cut scope. It allows us to split shipments into smaller, more manageable chunks, and really focus on making each one great.
evaluate features for their benefit versus their effort
identify features that could be easily layered in later
cut features that create obvious technical debt
Separate code paths almost always become out of sync. One will get updated while another doesn’t, which makes for inconsistent behavior. At the architecture level, we want to try and optimize for a single code path.
Note that this is still consistent with splitting things apart, it just means that we need smaller pieces which are flexible enough to be combined together in different ways. If two pieces of code rely on the same functionality, they should use the same code path.
have a peer review your code; an objective opinion will almost always help
get someone else to sign-off on non-trivial pull-requests
if you ever find yourself copy-pasting code, consider pulling it into a library
if you need to frequently update a library, or keep state around, turn it into a service
Creating a loose mockup of a program is often the quickest way to understand the problem you’re solving. When you’re ready to write the real thing just
`rm -fr .git ` to start with a clean slate and better context.
Building something helps you learn more than you could ever hope to uncover through theorizing. Trust me, prototyping helps discover strange edge-cases and bottlenecks which may require you to rearchitect the solution. This process minimizes the impact of architectural changes.
don’t spend a lot of time with commit messages, keep them short but sensical
refactors typically come from a better understanding of the problem, the best way to get there is by building a version to “throw away”
Early on, it’s easy to write off automation as unimportant. But if you’ve done any time-consuming task more than 3 times you’ll probably want to automate it.
A key example of where we failed at this in the past was with Redshift’s cluster management. Investing in the tooling around provisioning clusters was a big endeavor, but it would have saved a ton of time if we’d started it sooner.
if you find yourself repeatedly spending more than a few minutes on a task, take a step back and consider tooling around it
ask yourself if you could be 20% more efficient, or if automation would help
share tools in dotfiles, vm, or task runner so the whole team can use them
Whenever you’re building out a new project or library, it’s worth considering which pieces can be pulled out and open sourced. At face value, it sounds like an extra constraint that doesn’t help ship product. But in practice, it actually creates much cleaner code. We’re guaranteed that the code’s API isn’t tightly coupled to anything we’re building internally, and that it’s more easily re-used across projects.
Open sourced code typically has a well-documented Readme, tests, CI, and more closely resembles the rest of the ecosystem. It’s a good sanity check that we’re not doing anything too weird internally, and the code is easier to forget about and re-visit 6-months later.
if you build a general purpose library without any dependencies, it’s a prime target for open sourcing
try and de-couple code so that it can be used standalone with a clear interface
never include custom configuration in a library, allow it to be passed in with sane defaults
Sometimes big problems arise in code and it may seem easier to write a work-around. Don’t do that. Hacking around the outskirts of a problem is only going to create a rat’s nest that will become an even bigger problem in the future. Tackle the root cause head-on.
A textbook example of this came from the first version of our integrations product. We proxied and transformed analytics calls through our servers to 30–40 different services, depending on what integrations the customer had enabled. On the backend, we had a single pool of integration workers that would read each incoming event from the queue, look up which settings were enabled, and then send copies of the event each enabled integration.
It worked great for the first year, but over time we started running into more and more problems. Because the workers were all shared, a single slow endpoint would grind the entire pool of workers to a halt. We kept adjusting and tweaking individual timeouts to no end, but the backlogs kept occurring. Since then, we’ve fixed the underlying issue by partitioning the data processing queues by endpoint so they operate completely independently. It was a large project, but one that had immediate pay-off, allowing us to scale our integrations platform.
Sometimes it’s worth taking a step back to solve the root cause or upstream problem rather than hacking around the periphery. Even if it requires a more significant restructuring, it can save you a lot of time and headache down the road, allowing you to achieve much greater scale.
whenever fixing a bug or infrastructure issue, ask yourself whether it’s a core fix or just a band-aid over one of the symptoms
keep tabs on where you’re spending the most time, if code is continually being tweaked, it probably needs a bigger overhaul
if there’s some bug or alert we didn’t catch, make sure the upstream cause is being monitored
When designing applications, coming up with a data model is one of the trickiest parts of implementation. The frontend, naturally, wants to match the user’s idea of how the data is formatted. Out of necessity, the backend has to match the actual data format. It must be stored in a way that is fast, performant and flexible.
So when starting with a new design, it’s best to first look at the user requirements and ask “which goals do we want to meet?” Then, look at the data we already have (or decide what new data you need) and figure out how it should be combined.
The frontend models should match the user’s idea of the data. We don’t want to have to change the data model every time we change the UI. It should remain the same, regardless of how the interface changes.
The service and backend models should allow for a flexible API from the programmer’s perspective, in a way that’s fast and efficient. It should be easy to combine individual services to build bigger pieces of functionality.
The controllers are the translation layer, tying together individual services into a format which makes sense to the frontend code. If there’s a piece of complicated logic which makes sense to be re-used, then it should be split into it’s own service.
the frontend models should match the user’s conception of the data
the services need to map to a data model that is performant and flexible
controllers can map between services and the frontend to assemble data
It’s easy to talk at length about best practices but actually following them requires discipline. Sometimes it’s tempting to cut corners or skip a step; but that doesn’t help long-term.
Now that we’ve codified these engineering best practices and the rationale behind each one, they have made their way into our default mode of operation. The act of explicitly writing them down has both clarified our thinking and helped us avoid making the same short-term mistakes over and over.
In practice, this means that we invest heavily in good tooling, modular libraries and microservices. In development, we keep a shared VM that auto-updates, with shared dotfiles for easily navigating our many small repositories. We put a focus on creating projects which increase functionality through composability rather than inheritance. And we’ve worked hard to streamline our process for running services in production.
All of this keeps our development team moving quickly and increases the quality of the product we ship. We’re able to accomplish a lot more with a lot less effort. And we’ll continue trying to improve and share that tooling with the community as it matures.
Andy Jiang, Vince Prignano on November 17th 2015
Growing a business is hard and growing the engineering team to support that is arguably harder, but doing both of those without a stable infrastructure is basically impossible. Particularly for high growth businesses, where every engineer must be empowered to write, test, and ship code with a high degree of autonomy.
Over the past year, we’ve added ~60 new integrations (to over 160), built a platform for partners to write their own integrations, released a Redshift integration, and have a few big product announcements on the way. And in that time, we’ve had many growing pains around managing multiple environments, deploying code, and general development workflows. Since our engineers are happiest and most productive when their time is spent shipping product, building tooling, and scaling services, it’s paramount that the development workflow and its supporting infrastructure are simple to use and flexible.
And that’s why we’ve automated many facets of our infrastructure. We’ll share our current setup in greater detail below, covering these main areas:
Let’s dive in!
As the code complexity and the engineering team grow, it can become harder to keep dev environments consistent across all engineers.
Before our current solution, one big problem our engineering team faced was keeping all dev environments in sync. We had a GitHub repo with a set of shell scripts that all new engineers executed to install the necessary tools and authentication tokens onto their local machines. These scripts would also setup Vagrant and a VM.
But this VM was built locally on each computer. If you modified the state of your VM, then in order to get it back to the same VM as the other engineers, you’d have to build everything again from scratch. And when one engineer updates the VM, you have to tell everyone on Slack to pull changes from our GitHub VM repo and rebuild. An awfully painful process, since Vagrant can be slow.
Not a great solution for a growing team that is trying to move fast.
When we first played with Docker, we liked the ability to run code in a reproducible and isolated environment. We wanted to reuse these Docker principles and experience in maintaining consistent dev environments across a growing engineering team.
We wrote a bunch of tools to set up the VM for new engineers to upgrade or to reset from the basic image state. When our engineers set up the VM for the first time, it asks for their GitHub credentials and AWS tokens, then pulls and builds from the latest image in Docker Hub.
On each run, we make sure that the VM is up-to-date by querying the Docker Hub API. This process updates packages, tools, etc. that our engineers use everyday. It takes around 5 seconds and is needed in order to make sure that everything is running correctly for the user.
Additionally, since our engineers use Macs, we switched from boot2dockervirtualbox machine to a Vagrant hosted boot2docker instance so that we could take advantage of NFS to share the volumes between the host and guest. Using NFS provides massive performance gains during local development. Lastly, NFS allows any changes our engineers make outside of the VM to be instantaneously reflected within the VM.
With this solution we have vastly reduced the number of dependencies needed to be installed on the host machine. The only things needed now are Docker, Docker Compose, Go, and a
The ideal situation is dev and prod environments running the same code, yet separated so code running on dev may never affect code running production.
Before we had the AWS state (generated by Terraform) stored alongside the Terraform files, but this wasn’t a perfect system. For example if two people asynchronously plan and apply different changes, the state will be modified and who pushes last is going to have hard times to figure out the merge collisions.
We achieved mirroring staging and production in the simplest way possible: copying files from one folder to another. Terraform enabled us to reduce the amount of hours taken to modify the infrastructure, deploy new services and making improvements.
We integrated Terraform with CircleCI writing a custom build process and ensuring that the right amount of security was taken in consideration before applying.
At the moment, we have one single repository hosted on GitHub named
infrastructure, which contains a collection of Terraform scripts that configure environmental variables and settings for each of our containers.
When we want to change something in our infrastructure, we make the necessary changes to the Terraform scripts and run them before opening a new pull request for someone else on the infra-team to review it. Once the pull request gets merged to master, CircleCI will start the deployment process: the state gets pulled, modified locally, and stored again on S3.
When developing locally, it’s important to populate local data stores with dummy data, so our app looks more realistic. As such, seeding databases is a common part of setting up the dev environment.
We rely on CircleCI, Docker, and volume containers to provide easy access to dummy data. Volume containers are portable images of static data. We decided to use volume containers because the data model and logic becomes less coupled and easier to maintain. Also just in case this data is useful in other places in our infrastructure (testing, etc., who knows).
Loading seed data into our local dev environment occurs automatically when we start the app server in development. For example, when the
app (our main application) container is started in a dev environment,
app‘s docker-compose.yml script will pull the latest
seed image from Docker Hub and mount the raw data in the VM.
seed image from Docker Hub is created from a GitHub repo
seed, that is just a collection of JSON files as the raw objects we import into our databases. To update the seed data, we have CircleCI setup on the repo so that any publishes to master will build (grabbing our mongodb and redis containers from Docker Hub) and publish a new
seed image to Docker Hub, which we can use in the app.
Due to the data-heavy nature of Segment, our app already relies on several microservices (db service, redis, nsq, etc). In order for our engineers to work on the app, we need to have an easy way to create these services locally.
Again, Docker makes this workflow extremely easy.
Similar to how we use
seed volume containers to mount data into the VM locally, we do the same with microservices. We use the docker compose file to grab images from Docker Hub to create locally, set addresses and aliases, and ultimately reduce the complexity to a single terminal command to get everything up and running.
If you write code, but never ship it to production, did it ever really happen? 😃
Deploying code to production is an integral part of the development workflow. At Segment, we prioritize easiness and flexibility around shipping code to production, since that encourages our engineers to move quickly and be productive. We’ve also created adequate tooling around safeguarding for errors, rolling back, and monitoring build statuses.
We use Docker, ECS, CircleCI, and Terraform to automate as much of the continuous deployment process as possible.
Whenever code is pushed or merged into its
master branch, the CircleCI script build the container and push it to Docker Hub.
With this setup, we can define the configuration once for any service, making it extremely easy for our engineers to create and deploy new microservices. As Calvin mentioned in a previous post, “Rebuilding Our Infrastructure with Docker, ECS, and Terraform”:
We no longer have a complex set of provisioning scripts or AMIs—we just hand the production cluster an image, and it runs. There’s no more stateful instances, and we’re guaranteed to run the same exact code on both staging and prod.
The automation and ease of use around deployment have positively impacted more than just our engineers. Our success and marketing teams can update markdown files in a handful of repos that, when merged to master, kick off an auto deploy process so that changes can be live in minutes.
Because we chose to invest effort into rethinking and automating our dev workflow and its supporting infrastructure, our engineering team move fasters and more confidently. We spend more time doing high leverage jobs that we love—shipping product and building internal tools—and less time yak shaving.
That said, this is by no means the final iteration of our infrastructure automation. We are constantly playing with new tools and testing new ideas, seeing what further efficiencies we can eek out.
This has been a tremendous learning process for us and we’d love to hear what others in the community have done with their dev workflows. If you end up implementing something like this (or have already), let us know! We’d love to hear what you’ve done, and what’s worked or hasn’t for others with similar problems.
Andy Jiang on October 20th 2015
A little while ago we open-sourced a static site generator called Metalsmith. We built Metalsmith to be flexible enough that it could build blogs (like the one you’re reading now), knowledge bases, and most importantly our technical documentation.
Using Metalsmith to build a simple blog is one thing, but building easy-to-maintain documentation isn’t as simple. There are different sections, libraries, various reusable code snippets and content that live in multiple places, and other stuff to consider. Metalsmith simplifies maintaining all of these moving parts and let’s us focus purely on creating helpful content. Here’s how we did it!
The first thing to know is that Metalsmith works by running a series of transformations on a directory of files. In the case of our docs, that directory is just a bunch of folders with Markdown files in them.
The directory structure mimics the URL structure:
And an individual Markdown file might look like this:
It’s structured this way because it makes the actual content easy to maintain. They are just regular folders with plain old Markdown files in them. That means that anyone on the team can easily edit the docs without any technical knowledge. You can even do it right form the GitHub editor:
So far so good. But how do just those simple Markdown files get transformed into our entire technical documentation? That’s where Metalsmith comes in. Like I mentioned earlier, Metalsmith is really just a series of transformations to run on the directory of Markdown files.
I’ll walk you through each of the transformations we use in order.
The first transformation we use is a custom plugin that takes all the files in a directory and exposes them as partials in Handlebars. That means we can keep content that we repeat a lot in a single place for easier maintenance.
The next transformation is a plugin that groups files together into “collections”. In our case, those collections are built into our sub-directories, so we have collections like: Libraries, Plugins, Tutorials, etc.
We also pass our own custom
sorter function that will return the order specified in the array and append the remaining files pseudo-alphabetized.
Having all of the collections grouped as simple arrays makes it easy for us to do things like automatically generate a top-level navigation to get to every collection:
Or to automatically generate a collection-level navigation for navigating between pages:
The plugin categorizes all files that fit the provided definition (in our case, providing file path patterns), adds a
collection array to each file that contains the name of the collection, and finally adds a
previous properties to files that points to the sibling file in the collection. This allows us to easily render collections later on with handlebars:
The key is that all of those pieces are automatically generated, so you never need to worry about remembering to link between pages.
Note that this plugin does not determine the final directory structure. By default, the directory structure is preserved from start to end, unless a plugin specifically modifies this.
The third transformation step is to template all of our Markdown files in placewith metalsmith-in-place. By that, I mean that we just run Handlebars over our Markdown files right where they are, so that we can take advantage of a bunch of helpers we’ve added.
For example, in any of our Markdown files we can use an
api-example helper like so:
Which, will render a language-agnostic code snippet that remembers the user’s language preference:
You can find the above code snippet here.
Then, we transform all of those Markdown files into HTML files with the metalsmith-markdown plugin. Pretty self-explanatory!
Now that we have all of our files as
.html instead of
.md, the next transformation is pretty simple using metalsmith-headings. It iterates over all of the files once more, extracting the text of all the
<h2> tags and adding that array as metadata of the file. So you might end up with a file object that looks like this:
Why would we want to do that? Because it means we can build the navigation in the sidebar automatically from the content of the file itself:
So you never need to worry about remembering to update the navigation yourself.
The next step is to use the permalinks plugin to transform files so that all of the content lives in
index.html files, so they can be served statically. For example, given a source directory like this:
The permalinks plugin would transform that into:
So that NGINX can serve those static files as:
The last step is to template all of our source files again (they’re not
.md anymore, they’re all
.html at this point) by rendering them into our top-level layout.
layout.html file is where all of the navigation rendering logic is contained, and we just dump the contents of each of the pages that started as Markdown into the global template, like so:
Once that’s done, we’re done! All of those files that started their life as simple Markdown have been run through a bunch of transformations. They now live as a bunch of static HTML files that each have automatically-generated navigations and sidebars (with active states too).
The last step is to deploy our documentation. This step isn’t to be forgotten, because our goal was to make our docs so simple to edit that everyone on the team can apply fixes as customers report problems.
To make our team as efficient as possible about shipping fixes and updates to our docs, we have our repo setup so that any branch merged to
master will kick off CircleCI to build and publish to production. Anyone can then make edits in a separate branch, submit a PR, then merge to
master, which will then automatically deploy the changes.
For the vast majority of text-only updates, this is perfect. Though, occasionally we may need more complex things.
For more information on the tech we use for our backend, check out Rebuilding Our Infrastructure with Docker, ECS, and Terraform.
Before we converted our docs to Metalsmith, they lived in a bunch of Jade files that were a pain in the butt to change. Because we had little incentive to edit them, we let typos run rampant and waited too long to fix misinformation. Obviously this was a bad situation for our customers.
Now that our docs are easy to edit in Markdown and quick to deploy, we fix problems much faster. The quickest way to fix docs issues is to make a permanent change, rather than repeat ourselves in ticket after ticket. With a simpler process, we’re able to serve our customers much better, and we hope you can too!
Calvin French-Owen on October 7th 2015
In Segment’s early days, our infrastructure was pretty hacked together. We provisioned instances through the AWS UI, had a graveyard of unused AMIs, and configuration was implemented three different ways.
As the business started taking off, we grew the size of the eng team and the complexity of our architecture. But working with production was still limited to a handful of folks who knew the arcane gotchas. We’d been improving the process incrementally, but we needed to give our infrastructure a deeper overhaul to keep moving quickly.
So a few months ago, we sat down and asked ourselves: “What would an infrastructure setup look like if we designed it today?”
Over the course of 10 weeks, we completely re-worked our infrastructure. We retired nearly every single instance and old config, moved our services to run in Docker containers, and switched over to use fresh AWS accounts.
We spent a lot of time thinking about how we could make a production setup that’s auditable, simple, and easy to use–while still allowing for the flexibility to scale and grow.
Here’s our solution.
Instead of using regions or tags to separate different staging and prod instances, we switched over totally separate AWS accounts. We need to ensure that our provisioning scripts wouldn’t affect our currently running services, and using fresh accounts meant that we had a blank slate to start with.
ops account serves as the jump point and centralized login. Everyone in the organization can have a IAM account for it.
The other environments have a set of IAM roles to switch between them. It means there’s only ever one login point for our admin accounts, and a single place to restrict access.
As an example, Alice might have access to all three environments, but Bob can only access dev (ever since he deleted the production load balancer). But they both enter through the
Instead of having complex IAM settings to restrict access, we can easily lock down users by environment and group them by role. Using each account from the interface is as simple as switching the currently active role.
Instead of worrying that a staging box might be unsecured or alter a production database, we get true isolation for free. No extra configuration required.
There’s the additional benefit of being able to share configuration code so that our staging environment actually mirrors prod. The only difference in configuration are the sizes of the instances and the number of containers.
Finally, we’ve also enabled consolidated billing across the accounts. We pay our monthly bill with the same invoicing and see a detailed breakdown of the costs split by environment.
As of today, we’re now running the majority of our services inside Docker containers, including our API and data pipeline. The containers receive thousands of requests per second and process 50 billion events every month.
The biggest single benefit of Docker is the extent that it’s empowered the team to build services from scratch. We no longer have a complex set of provisioning scripts or AMIs—we just hand the production cluster an image, and it runs. There’s no more stateful instances, and we’re guaranteed to run the same exact code on both staging and prod.
After configuring our services to run in containers, we chose ECS as the scheduler.
At a high level, ECS is responsible for actually running our containers in production. It takes care of scheduling services, placing them on separate hosts, and zero-downtime reloads when attached to an ELB. It can even schedule across AZs for better availability. If a container dies, ECS will make sure it’s re-scheduled on a new instance within that cluster.
The switch to ECS has vastly simplified running a service without needing to worry about upstart jobs or provisioning instances. It’s as easy as adding a Dockerfile, setting up the task definition, and associating it with a cluster.
In our setup, the Docker images are built by CI, and then pushed to Docker Hub. When a service boots up, it pulls the image from Docker Hub, and then ECS schedules it across machines.
We group our service clusters by their concern and load profile (e.g. different clusters for API, CDN, App, etc). Having separate clusters means that we get better visibility and can decide to use different instance types for each (since ECS has no concept of instance affinity).
Each service has a particular task definition indicating which version of the container to run, how many instances to run on, and which cluster to choose.
During operation, the service registers itself with an ELB and uses a healthcheck to confirm that the container is actually ready to go. We point a local Route53 entry at the ELB, so that services can talk to each other and simply reference via DNS.
The setup is nice because we don’t need any service discovery. The local DNS does all the bookkeeping for us.
ECS runs all the services and we get free cloudwatch metrics from the ELBs. It’s been a lot simpler than having to register services with a centralized authority at boot-time. And the best part is that we don’t have to deal with state conflicts ourselves.
Where Docker and ECS describe how to run each of our services, Terraform is the glue that holds them together. At a high level, it’s a set of provisioning scripts that create and update our infrastructure. You can think of it like a version of Cloudformation on steroids–but it doesn’t make you want to poke your eyes out.
Rather than running a set of servers for maintaining state, there’s just a set of scripts that describe the cluster. Configuration is run locally (and in the future, via CI) and committed to git, so we have a continuous record of what our production infrastructure actually looks like.
Here’s an sample of our Terraform module for setting up our bastion nodes. It creates all the security groups, instances, and AMIs, so that we’re able to easily set up new jump points for future environments.
We use the same module in both stage and prod to set up our individual bastions. The only thing we need to switch out are the IAM keys, and we’re ready to go.
Making changes is also painless. Instead of always tearing down the entire infrastructure, Terraform will make updates where it can.
When we wanted to change our ELB draining timeout to 60 seconds, it took a simple find/replace followed by a
terraform apply. And voilà, two minutes later we had a fully altered production setup for all of our ELBs.
It’s reproduceable, auditable, and self-documenting. No black boxes here.
We’ve put all the config in a central
infrastructure repo, so it’s easy to discover how a given service is setup.
We haven’t quite reached the holy grail yet though. We’d like to convert more of our Terraform config to take advantage of modules so that individual files can be combined and reduce the amount of shared boilerplate.
Along the way we found a few gotchas around the
.tfstate, since Terraform always first reads from the existing infrastructure and complains if the state gets out of sync. We ended up just committing our
.tfstate to the repo, and pushing it after making any changes, but we’re looking into Atlas or applying via CI to solve that problem.
By this point, we had our infrastructure, our provisioning, and our isolation. The last things left were metrics and monitoring to keep track of everything running in production.
In our new environment, we’ve switched all of our metrics and monitoring over to Datadog, and it’s been fantastic.
We’ve been incredibly happy with Datadog’s UI, API, and complete integration with AWS, but getting the most out of the tool comes from a few key pieces of setup.
The first thing we did was integrate with AWS and Cloudtrail. It gives a 10,000 foot view of what’s going on in each of our environments. Since we’re integrating with ECS, the Datadog feed updates everytime a task definition updates, so we end up getting notifications for deploys for free. Searching the feed is surprisingly snappy, and makes it easy to trace down the last time a service was deployed or rescheduled.
Next, we made sure to add the Datadog-agent as a container to our base AMI (datadog/docker-dd-agent). It not only gathers metrics from the host (CPU, Memory, etc) but also acts as a sink for our statsd metrics. Each of our services collects custom metrics on queries, latencies, and errors so that we can explore and alert on the in Datadog. Our go toolkit (soon to be open sourced) automatically collects the output of
pprof on a ticker and sends it as well, so we can monitor memory and goroutines.
What’s even cooler is that the agent can visualize instance utilization across hosts in the environment, so we can get a high level overview of instances or clusters which might be having issues:
Additionally, my teammate Vince created a Terraform provider for Datadog, so we can completely script our alerting against the actual production configuration. Our alerts will be recorded and stay in sync with what’s running in prod.
By convention, we specify two alert levels:
warning is there to let anyone currently online know that something looks suspicious and should be triggered well in advance of any potential problems. The
criticalalerts are reserved for ‘wake-you-up-in-the-middle-of-the-night’ problems where there’s a serious system failure.
What’s more, once we transition to Terraform modules and add the Datadog provider to our service description, then all services end up getting alerts for free. The data will be powered directly by our internal toolkit and Cloudwatch metrics.
Once we had all these pieces in place, the day had finally come to make the switch.
We first set up a VPC peering connection between our new production environment and our legacy one–allowing us to cluster databases and replicate across the two.
Next, we pre-warmed the ELBs in the new environment to make sure that they could handle the load. Amazon won’t provision automatically sized ELBs, so we had to ask them to ramp it ahead of time (or slowly scale it oursleves) to deal with the increased load.
From there, it was just a matter of steadily ramping up traffic from our old environment to our new one using weighted Route53 routes, and continuously monitoring that everything looked good.
Today, our API is humming along, handling thousands of requests per second and running entirely inside Docker containers.
But we’re not done yet. We’re still fine-tuning our service creation, and reducing the boilerplate so that anyone on the team can easily build services with proper monitoring and alerting. And we’d like to improve our tooling around working with containers, since services are no longer tied to instances.
We also plan to keep an eye on promising tech for this space. The Convox team is building awesome tooling around AWS infrastructure. Kubernetes, Mesosphere, Nomad, and Fleet seemed like incredibly cool schedulers, though we liked the simplicity and integration of ECS. It’s going to be exciting to see how they all shake out, and we’ll keep following them to see what we can adopt.
After all of these orchestration changes, we believe more strongly than ever in outsourcing our infrastructure to AWS. They’ve changed the game by productizing a lot of core services, while maintaining an incredibly competitive price point. It’s creating a new breed of startups that can build products efficiently and cheaply while spending less time on maintenance. And we’re bullish on the tools that will be built atop their ecosystem.
Calvin French-Owen, Chris Sperandio on August 4th 2015
It wasn’t long ago that building out an analytics pipeline took serious engineering chops. Buying racks and drives, scaling to thousands of requests a second, running ETL jobs, cleaning the data, etc. A team of engineers could easily spend months on it.
But these days, it’s getting easier and cheaper. We’re seeing the UNIX-ification of hosted services: each one designed to do one thing and do it well. And they’re all composable.
It made us wonder: just how quickly could a single person build their own pipeline without having to worry about maintaining it? An entirely managed data processing stream?
It sounded like analytics Zen. So we set out to find our inner joins peace armed with just a few tools: Terraform, Segment, DynamoDB and Lambda.
make (check it out on Github).
As a toy example, our data pipeline takes events uploaded to S3 and increments a time-series count for each event in Dynamo. It’s the simplest rollup we could possibly do to answer questions like: “How many purchases did I get in the past hour?” or “Are my signups increasing month over month?”
Here’s the general dataflow:
Event data enters your S3 bucket through Segment’s integration. The integration uploads all the analytics data sent to the Segment API on an hourly basis.
That’s where the composability of AWS comes in. S3 has a little-known feature called “Event Notifications”. We can configure the bucket to push a notification to a Lambda function on every file upload.
In theory, our Lambda function could do practically anything once it gets a file. In our example, we’ll extract the individual events, and then increment each
<event, hour> pair in Dynamo.
Once our function is up and running, we’ll have a very rudimentary timeseries database to query event counts over time.
From there, we’ll handle the incoming events, and update each item in Dynamo:
And finally, we can query for the events in our database using the CLI:
We could also build dashboards on it a la google analytics or geckoboard.
Even though we have our architecture and lambda function written, there’s still the task of having to describe and provision the pipeline on AWS.
Configuring these types of resources has been kind of a pain for us in the past. We’ve tried Cloudformation templates (verbose) and manually creating all the resources through the AWS UI (slooooow).
Neither of these options has been much fun for us, so we’ve started using Terraform as an alternative.
If you haven’t heard of Terraform, definitely give it a spin. It’s a badass tool designed to help provision and describe your infrastructure. It uses a much simpler configuration syntax than Cloudformation, and is far less error-prone than using the AWS Console UI.
As a taste, here’s what our
lambda.tf file looks like to provision our Lambda function:
The Terraform plan for this project creates an S3 bucket, the Lambda function, the necessary IAM roles and policies, and the Dynamo database we’ll use for storing the data. It runs in under 10 seconds and immediately sets up our infrastructure so that everything is working properly.
If we ever want to make a change, a simple
terraform apply, will cause the changes to update in our production environment. At Segment, we commit our configurations to source control so that they are easily audited and changelog’d.
We just walked through a basic example, but with Lambda there’s really no limit to what your functions might do. You could publish the events to Kinesis with additional Lambda handlers for further processing, or pull in extra fields from your database. The possibilities are pretty much endless thanks the APIs Amazon has created.
If you’d like to build your own pipeline, just clone or fork our example Github repo.
And that’s it! We’ll keep streaming. You keep dreaming.
Orta Therox on June 24th 2015
We’re excited to welcome Orta from CocoaPods to the blog to discuss the new Stats feature! We’re big fans of CocoaPods and are excited to help support the project.
CocoaPods is the Dependency Manager for iOS and Mac projects. It works similar to npm, rubygems, gradle or nuget. We’ve been running the open source project for 5 years and we’ve tried to keep the web infrastructure as minimal as possible.
Our users have been asking for years about getting feedback on how many downloads their libraries have received. We’ve been thinking about the problem for a while, and finally ended up asking Segment if they would sponsor a backend for the project.
It wasn’t just enough to offer just download counts. We spend a lot of time working around Xcode (Apple’s developer tool) project file intricacies, however in this context, it provides us with foundations for a really nice feature. CocoaPods Stats will be able to keep track of the unique number of installs within Apps / Watch Apps / Extensions / Unit Tests.
This means is that developers using continuous integration only register as 1 install, even if the server runs
pod install each time, separating total installations vs actual downloads.
Let’s go over how we check which pods get sent up for analytics, and how we do the unique installs. CocoaPods-Stats is a plugin that will be bundled with CocoaPods within a version or two. It registers as a post-install plugin and runs on every
pod install or
We’re very pessimistic about sending a Pod up to our stats server. We ensure that you have a CocoaPods/Specs repo set up as your master repo, then ensure that each pod to be sent is inside that repo before accepting it as a public domain pod.
First up, we don’t want to know anything about your app. So in order to know unique targets we use your project’s target UUID as an identifier. These are a hash of your MAC address, Xcode’s process id and the time of target creation (but we only know the UUID/hash, so your MAC address is unknown to us). These UUIDs never change in a project’s lifetime (contrary to, for example, the bundle identifier). We double hash it just to be super safe.
We then also send along the CocoaPods version that was used to generate this installation and about whether this installation is a
pod try [pod] rather than a real install.
My first attempt at a stats architecture was based on how npm does stats, roughly speaking they send all logs to S3 where they are map-reduced on a daily basis into individual package metrics. This is an elegant solution for a companywith people working full time on up-time and stability. As someone who wants to be building iOS apps, and not maintaining more infrastructure in my spare time, I wanted to avoid this.
We use Segment at Artsy, where I work, and our analytics team had really good things to say about Segment’s Redshift infrastructure. So I reached out about having Segment host the stats infrastructure for CocoaPods.
We were offered a lot of great advice around the data-modelling and were up and running really quickly. So you already know about the CocoaPods plugin, but from there it sends your anonymous Pod stats up to stats.cocoapods.org. This acts as a conduit sending analytics events to Segment. A daily task is triggered on the web site, this makes SQL requests against the Redshift instance which is then imported into metrics.cocoapods.org.
If you want to learn more about CocoaPods, check us out here.
Anthony Short on May 11th 2015
Over the past few months at Segment we’ve been rebuilding large parts of our app UI. A lot of it had become impossible to maintain because we were relying on models binding to the DOM via events.
Views that are data-bound to the DOM sound great but they are difficult to follow once they become complex and bi-directional. You’d often forget to bind some events and a portion of your UI would be out of sync, or you’d add a new feature and break 3 others.
So we decided to take on the challenge to build our own functional alternative to React.
We managed to get a prototype working in about a month. It could render DOM elements and the diffing wasn’t too bad. However, the only way to know if it was any good was to throw it into a real project. So that’s what we did. We built the Tracking Plan using the library. At this point it didn’t even have a real name.
It started simple, we found bugs and things we’d overlooked, then we started seeing patterns arising and ways to make the development experience better.
We were able to quickly try some ideas and trash them if they didn’t work. At first we started building it like a game engine. It had a rendering loop that would check to see if components were dirty and re-render on every frame, and a
scenethat managed all the components and inputs like a game world. This turned out to be annoying for debugging and made it overly complex.
Thanks to this process of iteration we were able to cut scope. We never needed
refs like React, so we didn’t add it. We started with a syntax that used prototypes and constructors but it was unnecessarily verbose. We haven’t had to worry about maintaining text selection because we haven’t run across it in real-world use. We also haven’t had any issues with element focus because we’re only supporting newer browsers.
We spent many late nights discussing the API on a white board and it’s something we care about a lot. We wanted it to be so simple that it would be almost invisible to the user. An API is just UI for developers so we treated it like any other design problem at Segment — build, test, iterate.
Performance is the most important feature of any UI library. We couldn’t be sure if the library was on the right path until we’d seen it running in a real app with real data and constraints. We managed to get decent performance on the first try and we’ve been fine-tuning performance as we add and remove new features.
We first ran into performance issues when we had to re-build the debugger. Some customers were sending hundreds of events per second and the animation wouldn’t work correctly if we were just trashing DOM elements every frame. We implemented a more optimized key diffing algorithm and now it renders hundreds of events per second at a smooth 60 fps with ease. Animations included.
Eventually everything started to settle down. We took the risk and implemented our own library and it now powers the a large portion of our app. We’ve stripped thousands of lines of code and now it’s incredibly easy to add new features and maintain thanks to this new library.
Finally, we think it’s ready to share with everyone else.
Deku is our library for building user interfaces. It supports many of the features you’re familiar with in React but aims to be small and functional. You define your UI as a tree of components and whenever a state change occurs it re-renders the entire tree to patch the DOM using a highly optimized diffing algorithm.
The whole library weighs in at less than 10kb and is easy to follow. It’s also using npm so some of those modules are probably being used elsewhere in your code anyway.
It uses the same concept of components as React. However, we don’t support older browsers, so the codebase is small and component API is almost non-existent. It even supports JSX thanks to Babel.
Here’s what a component looks like in Deku:
Then you can import that component and render your app:
You’ll notice there is no concept of classes or use of
this. We’re using plain objects and functions. The ES6 module syntax is used to define components and every lifecycle hook is passed the
component object which holds the
state you’ll use to render your template.
We never really needed classes. What’s the point when you never initialize them anyway? The beauty of using plain functions is that the user can use the ES6 module system to define them however they want! Best of all, there’s is no new syntax to learn.
Deku has many of the same lifecycle hooks but with two new ones -
afterRender. These are called every single render, including the first, unlike the update hooks. We’ve found these let us stop thinking about the lifecycle state so much.
Some of the lifecycle hooks are passed the
setState function so you can trigger side-effects to update state and re-render the app. DOM events are delegated to the root element and we don’t need to use any sort of synthesized event system because we’re not supporting IE9 and below. This means you never need to worry about handling or optimizing event binding.
To render the component to the DOM we need to create a
tree will manage loading data, communicating between components and allows us to use plugins on the entire application. For us it has eliminated the need for anything like Flux and there are no singletons in sight.
You can render the component tree anyway you’d like — you just need a renderer for it. So far we have a HTML renderer for the server and a DOM renderer for the client since those are the two we’ve needed. It would be possible to build a canvas or WebGL renderer.
The dbmonster performance mini-app written in Deku is also very fast and renders at roughly 15-16 fps compared to most other libraries which render at 11-12 fps. We’re always looking for more ways to optimize the diffing algorithm even further but it’s already we think it’s fast enough.
The first thing we usually get asked when we tell people about Deku is “Why didn’t you just use React?”. It could seem like a classic case of NIH syndrome.
We originally looked into this project because we use Duo as a front-end build tool. Duo is like npm, but just uses Github. It believes in small modules doing one thing well. React was a ‘big thing’ doing many things within a black box. We like knowing in detail how code works, so we feel comfortable with it and can debug it when something goes wrong. It’s very hard to do that with React or any big framework.
We ended up using React for a short time but the API forced us to use a class-like syntax that would lock us into the framework. We also found that we kept fighting with function context all the time which is waste of brain energy. React has some functional aspects to it but it still feels very object-oriented. You’re always concerning yourself with implicit environment state thanks to
this and the class system. If you don’t use classes you never need to worry about
this, you never need decorators and you force people to think about their logic in a functional way.
What started as a hack project to see if we could better understand the concept behind React has developed into a library that is replacing thousands of lines of code and has become the backbone of our entire UI. It’s also a lot of fun!
We’ve come a long way in the past few months. Next we’re going to look at a few ways we could add animation states to components to solve a problem that plagues every component system using virtual DOM.
In our next post on Deku, we’ll explain how we structure our components and how we deal with CSS. We’ll also show off our UIKit — the set of components we’ve constructed to rapidly built out our UI.
Steven Miller, Dominic Barne on April 9th 2015
Last week, we open sourced Sherlock, a pluggable tool for detecting third-party services on a given web page. You might use this to detect analytics trackers (eg: Google Analytics, Mixpanel, etc.), or social media widgets (eg: Facebook, Twitter, etc.) on your site.
We know that setting up your integrations has required some manual work. You’ve had to gather all your API keys and enter them into your Segment project one by one. We wanted to make this process easier for you, and thought that a “detective” to find your existing integrations would help!
Enter Sherlock. When you tell us your project’s url, Sherlock searches through your web page and finds the integrations you’re already using. Then, he automatically enters your integrations’ settings, which makes turning on new tools a bit easier.
Here’s a code sample of Sherlock in action:
Since there are no services baked into Sherlock itself, we’re adding a Twitter plugin here manually. Sherlock opens the
url and if
widgets.js is present on the page, then it will be added to
The above example is admittedly trivial. Here’s a more realistic use-case:
Here, we are adding sherlock-segment, a collection of plugins for about 20 of the integrations on our platform. Now,
results will look like this:
To make your own plugin, simply add the following details to your
package.json: (feel free to use sherlock-segment as a starting point)
name should include “sherlock-“ as a prefix
keywords should include “sherlock”
Your plugin should export an array of service configuration objects, each object can support the following keys:
name should be a human-readable string
script can be a string, regular expression, or a function that matches the src attribute of a script tag
settings is an optional function that is run on the page to extract configuration
Here is an example service configuration: