Leif Dreizler on March 31st 2020
Segment receives billions of events every day from thousands of customers that trust Segment to keep their data safe. At Segment, we believe that good security is an essential part of building high-quality software, similar to reliability or scalability. In addition to the tools and processes developed by our Security Org to help software engineers make good security choices, we also rely on reviews from both traditional and crowdsourced security companies.
Bug bounties are often seen as a significant burden for the security teams that run them. You’ll hear horror stories about companies launching a bug bounty, their security team getting inundated with low-quality reports, duplicate submissions, and researchers going outside the scope of the program.
Shortly after, you'll hear about their engineering team being overwhelmed with newly discovered vulnerabilities to fix. From those that survive the opening salvo, you may hear complaints that, over time, they have stopped receiving impactful submissions.
A few years ago, when the space was less mature, critics questioned whether running a program was worth it. Now, it is expected that organizations of a certain size and maturity run a bug bounty program.
In late 2019, the Cybersecurity and Infrastructure Security Agency (CISA) of the United States published a draft directive recommending all agencies develop a mechanism to accept information related to security vulnerabilities from external parties.
In this blog post, we’ll break down how to start and manage a bug bounty program, consistently achieve good results, and maintain healthy relationships with the people that power the program.
If you’re short on time, check out the “Top Tips” section at the bottom of this post.
A bug bounty program is like a Wanted Poster for security vulnerabilities.
Companies running bug bounty programs pay independent security researchers from across the world for security vulnerabilities in their own products and infrastructure.
I assume most readers are already bought into the benefits of running a bug bounty.
Most companies that have an internet presence or make an internet-connected device should consider running a bounty, or at least have a way for security researchers to report security issues.
It is also part of the Vendor Security Alliance questionnaire, so it may be something your customers ask you about if you are in the B2B space.
If you don’t have a way for researchers to report issues, they will email people at your company (or any alias they can find on your website), connect with you on LinkedIn, or just tweet about the issues they think they’ve found.
It’s a much better experience for researchers, your employees, and your customers if you give the security community a clear avenue to report vulnerabilities.
Your security and engineering orgs will be regularly impressed by the creativity of the researcher community. These are people that, without internal knowledge, can find critical vulnerabilities in organizations across the world.
I strongly recommend using a bug bounty platform like HackerOne or Bugcrowd (we use Bugcrowd here at Segment) to help manage this process. These companies provide a platform and services to help run an efficient program.
Severity baselines make it easier to tell how serious a vulnerability is, and how much time you should be spending on review and remediation.
When running a program on your own, you’ll frequently have researchers overhyping vulnerabilities. Platforms have a guided submission form, which helps researchers pick the appropriate category and rating.
The reputation systems reward researchers that accurately rank vulnerabilities and creates a competitive environment that benefits both researchers and program owners.
It also helps reinforce good behavior. Any researcher discipline issues have stricter consequences. If a researcher misbehaves, they may be banned from the platform.
To submit vulnerabilities via these platforms, researchers have to agree not to disclose the vulnerability without approval from your company.
Both platforms also provide triage services, which I highly recommend paying for. These are the first line of defense for your internal security resources. These globally distributed teams will help clean up researcher reports, mark submissions as duplicates, and filter out low-quality reports.
These companies also serve as a knowledge base for you to learn about running a program and ask questions. You can bounce ideas off of someone that works at a company running hundreds of programs.
Platforms have structured input with required fields and integrations with popular tools, like Jira. These make it much easier to turn a submission into a ticket for your engineering org.
For most companies, it isn’t possible to run a private program without the help of a bug bounty platform.
We’ll talk about private programs in more depth later, but this is the recommended starting point for companies launching a bug bounty for the first time.
All of the above features free your team to focus on the security challenges unique to your business.
Having a successful program starts with a good foundation, and it’s your job as a program owner to help set your organization up for success.
Think about your current process for handling a security vulnerability. What happens when someone internally or externally finds a security bug?
You will inevitably get more of these after starting a bug bounty program, so you must have a good way to handle these reports.
Your vulnerability process doesn’t have to be perfect, but you should have a way to prioritize and assign bug reports to the appropriate engineering team without high overhead.
As someone starting a program, you’ll also need to get buy-in from your Engineering org. They are the ones that will have to fix the issues and will likely be the ones responding to alerts triggered by researchers.
Your alerting story doesn’t need to be perfect. But you also don’t want engineers to be woken up every time someone triggers an error because some input validation was working correctly and stopped a researcher from submitting a
< symbol into an email address field.
Remember, your team doesn't have to fix every valid vulnerability immediately.
Vulnerabilities are just bugs and should be prioritized appropriately. If you’re having trouble getting your engineering org to fix critical vulnerabilities in a timely manner, you may want to direct your efforts to job-hunting instead of starting a bug bounty program 🙃
Once your organization is bought-in, you can focus on getting things ready for the researchers.
Your bounty brief is what researchers will read to determine if they’re interested in working on your program. It’s part instructions, part rules, and part advertisement.
Keep in mind you’re competing for researchers’ time; they don’t have to work on your program when there are so many other programs.
Your bounty brief should be clear, concise, and should set expectations with the researchers. You can find the Segment program hosted on Bugcrowd.
Where do you want researchers to focus their testing? What’s in scope? What’s out of scope?
I recommend starting with assets that are easy for the researchers to access, ideally something free, that anyone can sign up for.
You should also try to pick a target that has at least medium size, complexity, and business impact. This will help you show value early, which will help you expand the program.
How do researchers get access to the scope? Are there docs they can read to help them get up to speed? We instruct researchers to sign up for our app using their @bugcrowdninja.com email address and to include
-bugcrowdninja as part of their workspace slug.
This makes it easier for us to determine if someone is part of our Bugcrowd program when we review logs and alerts. If we notice someone causing problems in our app, we can ask Bugcrowd to provide researcher coaching.
How are you going to rate submissions? Consistent severity is important because it impacts how much the researcher gets paid for a submission. HackerOne uses Mitre’s Common Weakness Enumeration (CWE) and Bugcrowd uses the Bugcrowd Vulnerability Rating Taxonomy (VRT).
How much are you going to pay for vulnerabilities? Researchers need to know upfront how much you’ll pay for vulnerabilities so they can assess if it is worth their time to hunt on your program.
Think about using different reward ranges for different assets. This can help control costs and also helps researchers understand which targets are more important. For example, we describe specific objects that will net a higher reward:
A handful of years ago, getting a T-shirt as a reward was pretty standard. I’d strongly encourage anyone thinking about running a swag-based bounty to reconsider.
T-shirts don’t pay rent and are more work for your team than sending a researcher money. What do you do when that T-shirt you sent to another continent is lost in the mail or doesn’t fit?
We reserve swag for our top performers. Sending a T-shirt requires the researcher to trust us enough to give us their address and requires me to go to the post office.
Take the time to explain what a bug bounty is, why it's important, and have a few examples from recognizable organizations ready to show them.
Learn a little bit about the platform you’re using. Your actions on the platform impact the researcher. If you mistreat researchers, they will go elsewhere; without researchers, your program isn’t providing value to your organization.
The same report status can have different meanings and impact on different platforms.
For example, on HackerOne
Not Applicable reduces the researcher’s site-wide score by 5 points, and should be used for reports that don’t contain a valid issue.
Not Applicable does not impact the researcher’s score, and is commonly used for reports that should neither be accepted or rejected. To achieve this result on HackerOne, you would use the
If you have any questions about the platform you’re using, I strongly recommend reviewing documentation or reaching out to your account manager for help.
Regardless of how big your company’s internet footprint is, you can start with a small scope open only to a handful of individuals as part of a private program.
In mid-2017, Segment was running a private program with 25 researchers and a single target: our app.
The early researchers invited will be some of the platform’s most trusted, and they will generally be more accepting of companies that are learning how to manage a program, as long as you pay them fairly and treat them with respect.
Starting small allows your organization to learn how to run a program in a safer environment. If your vulnerability management program has some gaps, you can fix them; if your bounty brief is unclear, you can rewrite it; if your alerts aren’t tuned properly, you can invest time into improving them. If you need to pause your program, you can relaunch later with a less negative impact.
Even while we had a private program, we would direct researchers that reached out via email to our Bugcrowd program. This allowed us to receive the benefits of the platform and triage services for all submissions before running a public program.
It’s much easier to explain to a researcher why you won’t be paying for a low-effort submission when you have a prioritization system established and enforced by a third party.
Like any multi-year project, your bug bounty will evolve and require ongoing effort to keep it healthy.
Researchers love testing new features and assets; in most bug bounty programs, only the first person to find a vulnerability receives a monetary reward.
If you started with a small scope, your program is steady, and you’re ready for more submissions, this is a great time to add more targets to your brief.
As you add scope, keep in mind that not all assets are of equal value to an adversary. It is encouraged to specify different reward ranges for different assets based on their security maturity and value.
Over time, you should also consider an “open scope” bounty if it is appropriate for your organization. We have listed as a target,
Any host/web property verified to be owned by Segment (domains/IP space/etc.), which serves as a catch-all for anything not explicitly listed in our “In Scope” or “Out of Scope” sections.
Having an open scope bounty is enticing to researchers. Not only does it show you take running a bug bounty program seriously. It also shows that regardless of where they find a vulnerability, it will likely be rewarded (assuming it is valid, unique, and not out of scope).
Many researchers specialize in finding forgotten internet-facing assets as part of open-scope programs, and have developed their own infrastructure to identify assets and vulnerabilities to be able to efficiently earn rewards.
It’s also worth noting that there is no scope for an attacker trying to compromise your company’s security. Working towards an open scope means that it is more likely a bug bounty researcher will find and report a vulnerability before an attacker exploits it.
Over time, you’ll build trust and form relationships with particular researchers. These are great people to give early access to upcoming features. Many times, these features require manual provisioning, making them less suitable for wide-scale testing.
Early access is a mutually beneficial system in which you will receive security vulnerabilities prior to release, which makes them easier to fix. Researchers will be able to test features with less competition, which makes them more likely to earn a reward and continue testing on your program.
If the effort to set up these features is medium or higher, consider paying the researcher a grant before they start working.
Clearly communicate what you're looking for and what you expect from them. When offering a researcher grant, we want to see a short write-up of what didn't work in addition to any findings they submit. Rewardable findings should be paid in addition to the grant.
Once you’ve been running a program for a while and are confident in your company’s ability to receive vulnerabilities from the global researcher community, you should consider evolving it into a public program.
If you don’t have a wide scope, this is a great time to revisit that decision.
Maximizing your scope (while private) will reduce the uptick in submissions when your program is launched publicly. You should also invite as many researchers as possible to your private program before going public for the same reason.
Because public programs are open to anyone, you will inevitably receive testing from a lot of newer folks that will pay less attention to your bounty brief, so having a wide scope helps in this regard as well.
Segment has run a public bug bounty program since late 2018, roughly 18 months after launching our private program.
Hopefully over time, you will think of your outsourced triage team as an extension of your internal team. Spending the time to let them know how you want your program run will pay dividends in the future. Any submission they can validate without asking your team questions saves time for everyone involved.
Here are some examples of guidance we’ve given to the Bugcrowd triage team:
Identify duplicates for non-rewardable submissions
Many programs do not bother to mark informational, out of scope, or other non-rewardable submissions as duplicates. We do this for two reasons:
The first is that if we decide to fix one of these issues later, we can go back and mark the original submission as resolved and pay the researcher. Any duplicates of this issue will still receive points.
The second is that when there is a false positive identified by a tool commonly used by bug bounty hunters, you will get this submitted to your program a lot.
If we repeatedly see an out-of-scope or not reproducible submission, we can add a specific item in our bounty brief to warn researchers; it will be marked as out-of-scope or not reproducible without a working proof of concept.
Don’t be afraid to deduct points for undesired behavior
While we are generally laid-back and understanding program owners, we aren’t afraid to deduct points from a researcher’s score for when it’s warranted.
Many programs shy away from deducting points, but we want to ensure participants in our program thoroughly read our brief and think that it helps the larger bug bounty community to slightly penalize those that disregard clearly documented rules.
Two of the common arguments against bug bounty programs is that the submissions are often low-value and that researchers don’t respect scope.
For example, we have a very small out-of-scope section, which includes:
CORS or crossdomain.xml issues on api.segment.io without proof of concept.
This is identified by a tool commonly used by bug bounty participants is a finding we have received dozens of times, but never with any impact.
We do this to save time for both researchers and our triage team. If a researcher submits this finding without a proof of concept, we encourage Bugcrowd to mark this as out-of-scope. If a researcher submits a finding that showed impact, we would be more than happy to reward, fix, and update our brief.
If you need to deviate from the baseline rating established in your bounty brief, take the time to explain to the researcher why the rating and reward are higher or lower than they might expect.
Researchers are generally understanding, as long as your rating, reward, and explanation are fair and make sense. If you find yourself commonly deviating from the ratings, it may be time to make changes to your bounty brief so that researchers know what to expect in advance. If you make severity or scope changes as the result of a submission, reward the researcher at whichever rate is more favorable to them.
In addition to explaining why something was rated lower than the baseline, take the time to explain why something was rated higher than the baseline. This is a great way to encourage further testing in these areas and is a great way to build trust with a researcher.
These explanations also help your triage team learn more about your program, and allow them to more accurately triage future submissions.
Take time to build relationships and trust with researchers, especially those that repeatedly submit to your program. Treat researchers fairly, with respect, and consider paying for anything that brings value to your organization.
You’re competing for a researcher’s time, especially the ones that are the most talented. They can likely work on almost any bug bounty program available; think about ways you can encourage them to work on yours.
Keep in mind that all researchers, even those that are unskilled, are human beings. Assume that they want to help you secure your organization, learn more about security and technology, and get paid.
If there is one sentence you remember from this blog, I hope it is “pay for anything that brings value.”
Bug bounty hunters put in a lot of time and effort that doesn’t result in getting paid. This could be time spent developing tooling, hunting without finding any bugs, or having a valid bug marked as a duplicate.
Try to avoid thinking about individual bug costs. Instead, think about the overall value the program brings to your organization in terms of bugs found, time saved, and peace of mind. If you’re debating between two severities, pick the higher one and pay the researcher at that rate. You can always change the severity in your internal ticketing system later.
Once you’ve received a rewardable submission, try to triage and pay quickly. Sometimes determining the full impact takes time; if this is the case, add a comment letting the researcher know you appreciate their work but need some extra time to determine the appropriate reward.
Work collaboratively with the researcher
As an employee of your company, you should know more about the codebase and infrastructure than a security researcher outside your organization (although occasionally I question this based on the creative and impactful submissions we receive 😅).
Sometimes when running a bug bounty program, you’ll get a submission that makes you think, “What’s the next level the researcher could get to?” If this is a researcher you trust, it may be appropriate to give them some hints to help further their testing. If you give them hints, you can also issue some cautionary advice to help them continue in a way that is safe for your organization and customers.
Giving the researcher hints helps show them you value their testing and saves your team from spending time on something that may not be possible. If the hint is helpful, the researcher will be submitting a higher-severity finding, which positively impacts their researcher score and earns a higher monetary reward. It also allows you to get the vulnerability fixed faster due to the higher severity.
Sometimes, it isn’t appropriate to include the researcher in this phase of the process. If our team continues the investigation, and it leads to the discovery of a higher-impact weakness in our systems, we reward the researcher as if their report contained the full impact. We also explain why we paid them at this higher rate, but let them know we are unable to share the details. This is a great way to show the researcher you value their work and build trust.
Share progress with the researcher
If a researcher submits a vulnerability that leads to a systemic fix for a vulnerability class, share this with them! Researchers are generally excited to hear that their work led to meaningful change within your organization. It also is a cue for them to attempt to bypass the new protections.
Pay for Dupes
At Segment, we commonly pay researchers for well-written duplicates, and frequently reach out to let them know that we appreciated their submission. We also let them know that we don’t always pay for duplicates to make sure that expectations are set appropriately.
This has worked out incredibly well for us. All of our most critical submissions have come from researchers that were originally rewarded for a well-written duplicate. Segment is a complex product that takes time to set up and fully understand. Researchers that put in the effort to fully set up a Segment workspace have additional context and understanding that take time to acquire—these people are hard to replace, and you want to keep them happy.
Pay bonuses for well-written reports
We also pay extra for well-written reports. Valid submissions need to get turned into Jira tickets which are assigned to engineering teams. Reports that are concise, easy to follow, have clear impact, and are well-formatted take less time for us to turn into tickets. We want to encourage researchers to save us time so we make sure to reward appropriately and let them know that we appreciate their efforts.
Running a successful bug bounty program requires consistent effort from your team, but can bring tremendous value to your company and customers. Any vulnerability reported and fixed is one fewer vulnerability an attacker could use to get a foothold in your organization. Bug bounty submissions can help illuminate vulnerability trends, which can help prioritize where you spend resources to fix systemic issues in your applications or infrastructure.
Bug bounty programs are people-powered. Spend the time to make those involved in your program feel valued, help them understand the motivations behind your decisions, and be excellent to each other!
Thanks for taking the time to read my post! I hope you learned a few things to help your company run a successful program. Here are some of my top tips to reference later:
Pay for anything that brings value
Pay extra for well-written reports, even if they’re dupes
Avoid thinking about individual bug costs
Partner with a bug bounty platform and pay for triage services
If you make changes to your bounty brief as the result of a submission, reward the researcher at the more favorable rate
Invest time into building and maintaining relationships with your researchers and triage team
Don’t be afraid to deduct points for bad behavior
Start small and partner early with Engineering
Write a clear and concise bounty brief to set expectations with the researchers
A special thanks to the Segment Engineering organization for fixing vulnerabilities and responding to alerts.
To Edis from Bugcrowd for helping us triage three years of vulnerabilities and truly being an extension of our team.
To all the researchers that have helped keep our customers safe by looking for vulnerabilities as part of our program.
And finally, to researchers danieloizo and sheddow, you have both submitted numerous well-written and high impact findings and are an absolute pleasure to work with.
Dominic Barne on April 3rd 2015
Make is awesome! It’s simple, familiar, and compatible with everything. Unfortunately, editing a
Makefile can be challenging because it has a very terse and cryptic syntax. In this post, we will outline how we author them to get simple, yet powerful, build systems.
For the uninitiated, check out this gist by Isaac Schlueter. That gist takes the form of a heavily-commented Makefile, which makes it a great learning tool. In fact, I would recommend checking it out regardless of your skill level before reading the remainder of this post.
Here at Segment, we write a lot of code. One of our philosophies is that the code we write should be beautiful, especially since we’ll be spending literally hours a day looking at it.
By beautiful, we mean that code should not be convoluted and verbose, but instead it should be expressive and concise. This philosophy is even reflected in how we write a
We dedicate the top section of each
Makefile as a place to define variables (much like normal source code). These variables will be used to reduce the amount of code used in our recipes, making them far easier to read.
In node projects, we always rely on modules that are installed locally instead of globally. This allows us to give each project it’s own dependencies, giving us the room to upgrade freely without worrying about compatibility across our many other projects.
This decision requires more typing at first:
But it’s easily fixed by using
We use this same pattern frequently, as it helps to shorten the code written in a recipe, making the intention far more clear. This makes understanding the recipe much easier, which leads to faster development and maintenance.
Beyond just using variables for the command name, we also put shared flags behind their own variable as well.
This helps keep things dry, but also gives developers a hook to change the flags themselves if needed:
When writing code and interacting with developer tools, we seek to avoid _noise_as much as possible. There are enough things on a programmer’s mind, so it’s best to avoid adding to that cognitive load unnecessarilly.
One example is “echoing” in Make, which basically outputs each command of your recipe as it is being executed. You may notice that we used the
@ prefix on the recipes above, which actually suppresses that behavior. This is a small thing, but it is part of the larger goal.
We also run many commands in “quiet mode”, which basically suppresses all output except errors. This is one case where we definitely want to alert the developer, so they can take the necessary action to correct it.
make, now we only will see errors that happened with the corresponding build. If nothing is output, we can assume everything went according to plan!
There are some target names that are so commonly used, they practically become a convention. While we haven’t invented most of the targets I will mention here, the main principle here is that using names consistently throughout an organization is important to improve the experience for developers new to a specific project.
Since we have a lot of web projects, the
build/ directory is often reserved as the destination for any files we are bundling to serve to the client.
This target is used to delete any transient files from the project. This generally includes:
build/ directory (the generated client assets)
intermediary build files/caches
test coverage reports
Remote dependencies are not part of this process. (see clean-deps)
Depending on the size and complexity of a project, the downloaded dependencies can take a considerable amount of time to completely resolve and download. As a result, they are cleaned using a distinct target.
While Make will automatically assume the first target in a Makefile is the default one to run, we adopt the convention of putting a
default target in every
Makefile, just for consistency and flexibility.
For our projects, the
default target is usually synonymous with
build, as it is common practice to enter a project and use
make to kick off the initial build.
Runs static analysis (eg: JSHint, ESLint, etc) against the source code for this project.
This starts up the web server for the given project. (in the case of web projects)
This is exclusively for running the automated tests within a project. Depending on the complexity of the project, there could also be other related targets, such as
test-server. But regardless, the
test target will be the entry-point for a developer to run those tests.
All in all, Make is a powerful tool suitable for many projects regardless of size, tooling and environment. Other tools like Grunt and Gulp are great, but Make comes out on top for being even more powerful, expressive and portable. It has become a staple in practically all of our projects, and the conventions we follow have helped to create a more predictable workflow for everyone on the team.
Calvin French-Owen on April 1st 2015
We’ve been running Node in production for a little over two years now, scaling from a trickle of 30 requests per second up to thousands today. We’ve been hit with almost every kind of weird request pattern under the sun.
First there was the customer who liked to batch their data into a single dump every Friday night (getting called on a Friday night used to be a good thing). Then the user who sent us their visitor’s entire social graph with every request. And finally an early customer who hit us with a
while(true) send(data) loop and caused a minor emergency.
By now, our ops team has seen the good, the bad, and the ugly of Node. Here’s what we’ve learned.
One of the great things about Node is that you don’t have to worry about threading and locking. Since everything runs on a single thread, the state of the world is incredibly simple. At any given time there’s only a single running code block.
But here… there be dragons.
Our API ingests tons of small pieces of customer data. When we get data, we want to make sure we’re actually taking the JSON and representing any ISO Strings as dates. We traverse the JSON data we’d receive, converting any date strings into native
Date objects. As long as the total size is under
15kb, we’ll pass it through our system.
It seemed innocent enough… until we’d get a massively nested JSON blob and we’d start traversing. It’d take seconds, even minutes, before we chewed through all the queued up function calls. Here’s what the times and sizes would look like after an initial large batch would get rejected:
And then things would only get worse: the problems would start cascading. Our API servers would start failing healthchecks and disconnect from the ELB. The lack of heartbeat would cause the NSQ connection to disconnect so we weren’t actually publishing messages. Our customer’s clients would start retrying, and we’d be hit with a massive flood of requests. Not. Good.
Clearly something had to be done–we had to find out where the blockage was happening and then limit it.
Now we use node-blocked to get visibility into whether our production apps are blocking on the event loop, like this errant worker:
It’s a simple module which checks when the event loop is getting blocked and calls you when it happens. We hooked it up to our logging and statsd monitoring so we can get alerted when a serious blockage occurs.
We dropped in the module and immediately started seeing the following in our logs:
A customer was sending us really large batches of nested JSON. Applying a few stricter limits to our API (this was back before we had limits) and moving the processing to a background worker fixed the problem for good.
To further avoid event loop problems entirely, we’ve started switching more of our data processing services to Go and using goroutines, but that’s a topic for an upcoming post!
Error handling is tricky in every language–and node is no exception. Plenty of times, there will be an uncaught exception which–through no fault of your own–bubbles up and kills the whole process.
There are multiple ways around this using the
vm module or domains. We haven’t perfected error handling, but here’s our take.
Simple exceptions should be caught using a linter. There’s no reason to have bugs for
undefined vars when they could be caught with some basic automation.
To make that super easy, we started adding make-lint to all of our projects. It catches unhandled errors and undefined variables before they even get pushed to production. Then our makefiles run the linter as the first target of `make test`.
If you’re not already catching exceptions in development, add
make-lint today and save yourself a ton of hassle. We tried to make the defaults sane so that it shouldn’t hamper your coding style but still catch errors.
In prod, things get trickier. Connections across the internet fail way more often. The most important thing is that we know when and where uncaught exceptions are happening, which is often easier said than done.
Fortunately, Node has a global
uncaughtException handler, which we use to detect when the process is totally hosed.
We ship all logs off to a separate server for collection, so we want to make sure to have enough time to log the error before the process dies. Our cleanup could use a bit more sophistication, but typically we’ll attempt to disconnect and then exit after a timeout.
Actually serializing errors also requires some special attention (handled for us by YAL). You’ll want to make sure to include both the
stack explicitly, since they are non-enumerable properties and will be missed by simply calling
JSON.stringify on the error.
Finally, we’ve also written our own module called
oh-crapto automatically dump a heap snapshot for later examination.
It’s easily loaded into the chrome developer tools, and incredibly handy for those times we’re hunting the root cause of the crash. We just drop it in and we’ve instantly got full access to whatever state killed our beloved workers.
It’s easy to overload the system by setting our concurrency too high. When that happens, the CPU on the box starts pegging, and nothing is able to complete. Node doesn’t do a great job handling this case, so it’s important to know when we’re load testing just how much concurrency we can really deal with.
Our solution is to stick queues between every piece of processing. We have lots of little workers reading from NSQ and each of them sets a
maxInFlightparameter specifying just how many messages the worker should deal with concurrently.
If we see the CPU thrashing, we’ll adjust the concurrency and reload the worker. It’s way easier to think about the concurrency once at boot rather than constantly tweaking our application code and limiting it across different pipelines.
It also means we get free visibility into where data is queueing, not to mention the ability to pause entire data processing flows if a problem occurs. It gives us much better isolation between processes and makes them easier to reason about.
We moved away from using streams for most of our modules in favor of dedicated queues. But, there are a few places where they still make sense.
The biggest overall gotcha with streams is their error handling. By default, piping won’t cause streams to propagate their errors to whatever stream is next.
Take the example of a file processing pipeline which is reading some files, extracting some data and then running some transforms on it:
Looking at this code, it’s easy to miss that we haven’t actually setup our error handling properly. Sure, the resulting pipeline stream has handlers, but if any errors occur in the
Transform streams, they’ll go uncaught.
To get around this, we use Julian Gruber’s nifty
multipipe module, which provides a nice API over centralized error handling. That way we can attach a single error handler, and be off to the races.
If you’re also running Node in production and dealing with a highly variable data pipeline, you’ve probably run into a lot of similar issues. For all these gotchas, we’ve been able to scale our node processes pretty smoothly.
Now we’re starting to move our new plumbing and data infrastructure layers to Go. The language makes different trade-offs, but we’ve been quite impressed so far. We’ve got another post in the works on how it’s working out for us, along with a bunch of open source goodies! (and if you love distributed systems, we’re hiring!)
Have other tips for dealing with a large scale node system? Send them our way.
And as always: peace, love, ops, analytics.
Chris Sperandio on February 19th 2015
Before I joined Segment, I was something of a Github stalker. Which is how I found Segment.
(To be clear, I’m still a Github stalker, only now I work here.)
I snooped through Segment’s projects for I don’t know how long before starting to realize what drew me in so consistently across every project. And it wasn’t until I joined the team and learned the thought-process and ethos behind them that I gained a sincere appreciation for why we have over 1000 repos.
Our approach to software is radically modular, pluggable and composable.
Which makes sense, because in reality, that’s the whole point of Segment.
When you build your tools at the right level of abstraction, incidental complexity is hidden away and edge-cases take care of themselves. Not to mention, you can be a lot more productive. It’s what precipitated analytics.js in our early days: the intention to find the right level of abstraction for collecting data about who your users are and what they’re doing. It’s why people love express, koa, and reworktoo.
It’s also why we’re big proponents of the component and duo ecosystem. We even manage customer and partner logos with an extensible and modular systemand our entire front-end is built with components based on ripple and, more recently, deku.
It’s hard to communicate the power of this modular and composable approach, but it ends up being disarmingly obvious to developers and product strategists alike (see Rich Hickey’s presentation. Rather than attempt to explain it outright, I’ll give you a tour of our more popular open source repos to show you how we’ve tried to make them small, self-contained, and composable.
Let’s dive into some examples!
When building our documentation, academy, blog, job board, and help section, we wanted the speed and simplicity of static sites over the restrictiveness and complexity of a CMS. And though there’s a lot of logic that could be shared between them, each called for its own unique feature-set and build process.
But when we looked at existing static site generators, they all imposed a degree of structure on the content, and weren’t flexible enough for our wide array of use cases. Enter: Metalsmith.
Metalsmith does not impose any assumptions on your content model or build process. In fact, Metalsmith is just an abstraction for manipulating a directory of files.
Breaking static sites down to their core, the underlying abstraction includes content in files (blog entries, job listings, what have you), and these files’ associated metadata. Metalsmith allows you to read a directory of files, then run a series of plugins on that data to transform it exactly the way you need.
For example, you can run markdown files through handlebars templates, create navigation or a table of contents, compress images, concatenate scripts, or anything your heart desires before writing the result to the build directory.
For example, our blog articles are just files with two sections: a header with metadata about the author, date, title and url, and then markdown for the content of the article. Metalsmith transforms the markdown to HTML, wraps the posts in their layout, looks up and inserts the author’s avatar, renders any custom Handlebars helpers, etc. The beauty is that the build process is completely customizable and abstract for many use cases. It’s just a matter of which plugins you choose. And the word is out: the metalsmith plugin ecosystem is booming!
By building our static sites with Metalsmith and hosting the source on Github, our marketing, success, and business teams can create and edit posts right from the Github web interface, or work locally and “sync” their updates with Github for Mac. This workflow closely mirrors a traditional CMS, but gives us the speed and reliability of a pre-built, static site while also lending things like auto-generated code samples for every language and automatically compiled navigation, two places otherwise prone to falling out of date.
As you’ve heard, Segment is committed to a component-driven development model: breaking things into small pieces that can be developed in isolation, and then shared and reused. But when everything is a plugin, that means an awful lot of small repos. So we build tooling to make working with lots of small repos frictionless.
For example, when we or a partner want to add a new integration to Segment, the very first thing we need to do is create a new repo to house that project. In order to enforce common style across file structure and code, we built a project scaffolder that generates a “base” the developer can use to jump right into their project.
This is where Khaos comes in — our own project scaffolder that’s built on Metalsmith, the first example here of building building blocks (with building blocks :). In the source, you can see how even we tried to make Khaos itself composable and modular.
Khaos is really just a CLI wrapper for metalsmith with the following plugin pipeline:
Read template files from a directory.
Parse files for template placeholders.
Prompt user to fill in each placeholder.
Render files with a templating engine.
Write filled-in files to a new directory.
We have khaos templates for new logos, integrations, back-end services, nightmare plugins, etc. Not only does this make getting started easy, but it reinforces cultural values like defaulting yes to MIT licensing.
As you might guess, we use Segment as the backbone of our customer data pipeline to route our data into our third-party tools and to Amazon Redshift.
While we use our partners’ visualization tools to write and share ad-hoc queries against data in Segment SQL, we wanted to make the most important data points accessible in real time throughout the organization. So we built Metrics.
We query the underlying the data from Segment Warehouses and services like Stripe and Zendesk, and use Metrics to orchestrate these queries and store the aggregate metrics for each team. On any given team’s board you might see ARR, MRR, daily signups, the depth of our queue for new integration requests, number of active Zendesk tickets by department, number of deploys in the last week – the list goes on.
We’ll go into more details about the business motivations and outcomes around Metrics in a future blog post, but what excites me most about Metrics is how it’s designed under the hood. It’s another example of offloading feature scope to plugins.
You can use plugins to define what data gets collected and stored, the interval at which it’s updated, and where those metrics are pushed: to dashboards, spreadsheets, summary emails, or anywhere else your heart desires.
All metrics does is expose an API for orchestrating this dance via plugins.
Check it out on github here!
We have a bit of an obsession with automation and elimination of the mundane. And that’s what drove the development of Hermes.
Raph, our beloved head of sales and first businessperson at Segment, thought Hermes was Ian’s potty-mouthed “ami français” for a few months. Nope. Hermes is a chatbot whose sole feature is, you guessed it, a plugin interface.
When you’re building a new feature for Hermes, like looking up an account’s usage, or fetching a gif from the interwebs, all you need to do is tell Hermes what he’s listening for and what to say back. Everything in between, you define in your own plugin.
Whether we want to announce that lunch is here, check Loggly for errors related to a customer’s project, kickoff a Metalsmith build of the latest blog release, or create an SVG logo for a new integration, we get our boy Hermes Hubeau to do the dirty (repetitive) work.
We were thankful for the plugin approach when we switched from Hipchat to Slack. Instead of rewriting all of Hermes, we just hot-swapped the old plugin with the new!
“Wait a minute — your chat bot creates SVG logos?!”
Nope! Humans do. Hermes just knows how to ask politely. He creates a new logorepo with Khaos, then spins up a Nightmare instance based on Metalsmith plugins, navigates to 99designs Tasks, and posts a job. When the job is finished, he resizes the logos with our logo component creation CLI.
Automating these sorts of jobs, for which there was not yet a public API, required us to mimic and automate pointing and clicking in a web browser, and that’s what Nightmare does. While there were plenty of tools to do this, like PhantomJS, webdriver APIs imposed the burden of a convoluted interface and lots of mental overhead. So we wrote a library that puts all those headaches under the covers, and lets you automate browsers the way you browse the web:
Browser automation is nothing new, but we tried to give Nightmare a cleaner API and a plugin interface so people could more easily compose automations. The goal is to make it really simple to automate tasks on the web and create APIs where a public one doesn’t yet exist.
We try to break things into small, reusable pieces and hate to solve the same problem twice, pushing for simple solutions that build off of each other and are flexible enough for multiple use cases.
This level of commitment to “building building blocks” and sharing them with the community is what drew me in so hypnotically to Segment in the first place, and why I feel so immensely fortunate to be here now. As a success engineer, part of my job is to build and maintain internal tooling that enables us to better serve our customers. I’m empowered to apply the same principles and rigor used by our product team and core engineers to those projects, and my development and product direction skills have improved faster than I ever thought possible as a result.
If you think of any cool use cases for any of these tools at your company, we’d love to hear more about them. Tweet us @segment with your ideas or fork away on GitHub! We always appreciate new plugins and contributions. And if any of this particularly resonates with you, we’re hiring!
TJ Holowaychuk on February 21st 2014
One of the most popular logging libraries for node.js is Winston. Winston is a great library that lets you easily send your logs to any number of services directly from your application. For lots of cases Winston is all you need, however there are some problems with this technique when you’re dealing with mission critical nodes in a distributed system. To solve them we wrote a simple solution called Yet-Another-Logger.
The typical multicast logging setup looks something like this:
The biggest issue with this technique for us was that many of these plugins are only enabled in production, or cause problems that are only visible under heavy load. For example it’s not uncommon for such libraries to use CPU-intensive methods of retrieving stack traces, or cause memory leaks, or even worse uncaught exceptions!
Another major drawback is that if your application is network-bound like ours is, then sending millions of log requests out to multiple services can quickly take its toll on the network, slowing down everything else.
Finally the use of logging intermediaries allows you to add to or remove services at will, with without re-deploying most of your cluster or making code changes to the applications themselves.
Our solution was to build a simple client/server system of nodes to isolate any probelms just to a set of servers whose sole job is to fan out the logs. We call it Yet-Another-Logger, or YAL.
The Yet-Another-Logger client is pretty much you would expect from a standard logging client. It has some log-level methods and accepts
messagearguments—standard stuff. The only difference is that you instantiate the client with an array of YAL Server addresses, which it uses to round-robin:
YAL is backed by the Axon library, a zeromq-inspired messaging library. The great thing about this is that when a node goes down, messages will be routed to stable nodes, and then resume when the node comes back online.
The YAL server is also extremely simple. It accepts log events from the clients and distributes them to any number of configured services, taking the load off of mission-critical applications.
At the time of writing YAL Server is a library, and does not provide an executable, however in the near future an executable may be provided too. Until then a typical setup would include writing a little executable specific to your system.
Server plugins are simply functions that accept a
server instance, and listen on the
'message' event. That makes writing YAL plugins really simple. It’s also trivial to re-use an existing Winston setup by just plunking your Winston code right into YAL Server.
I’d recommend always running at least 3 YAL Servers in a cluster for redundancy, so you can be sure not to lose any data.
That’s all for now! The two pieces themselves are very simple, but combined they give your distributed system a nice layer of added protection against logging-related outages.
Coming up soon I’ll be blogging about some Elasticsearch tooling that we’ve built exlusively for digging through all of those logs we’re sending through YAL!
Anthony Short on February 19th 2014
Every month we’re going to do a round-up of all the projects we’ve open-sourced on Github. We have hundreds of projects available for anyone to use, ranging from CSS libraries and UI components to static-site generators and server tools. Not to mention that Segment all started from analytics.js.
Myth is a preprocess that lets you write pure CSS without having to worry about slow browser support, or even slow spec approval. It’s a like CSS polyfill.
Diff two versions of a node module.
Yet-Another-Logger that pushes logs to log servers with axon/tcp to delegate network overhead.
Adds some concurrency to a transform stream for that multiple items may be transformed at once.
A FIFO queue for co.
If you want to see more of the awesome code we’re releasing, follow us on Github or follow any of our team members. We’re all open-source fanatics.
Peter Reinhardt on November 27th 2013
When we analyze usage and customers and Segment, we constantly need to join queries across Mongo and Redis. Why? Because our account information is in Mongo and our API usage is in Redis. Today we’re open sourcing Hydros. It’s a quick cheat to let us run SQL queries for analysis, while using NoSQL in production.
What we’ve noticed is that every business question boils down to a simple join across account info and usage. Here are some examples:
Enterprise integrations: find the integrations used by projects (Mongo) that send over 100 million API calls per month (Redis).
Mobile projects: get the names of projects (Mongo) that use our iOS or Android SDKs (Redis).
Power users: get the emails of users (Mongo) who have 20 or more active projects (Redis).
Before Hydros, I’d cobble together a bunch of 50-line node scripts that would connect to both databases. All the join and relational logic was in code. It was horrible. Just a huge, messy folder of code that I never wanted to touch again. Check out cohort.js for a taste of what should have been a simple SQL query.
For an engineer turned business guy, this is pretty frustrating. I wanted something maintainable, that we could build on as the company grows.
This was such an annoying problem for us that we even went to so far as to sync our entire database to Google Spreadsheets so that we could sort, filter and join the databases there. Ilya made some magic happen there, but Google Docs is just really slow and clunky.
Finally! one night at Happy Data Hour, Josh from Mode yammered my ear off about how Yammer’s internal analytics system worked. I was a couple beers in, but what I understood, I liked :) I definitely walked away with a bastardized version of what they’ve accomplished over there, but…
It was akin to “data marts”… you sync your databases to SQL tables idempotently and transactionally, and then run the SQL queries there. Simple.
Here we were, getting all fancy with NoSQL, but the answer was right there all along. Good old SQL.
If we had a good syncing abstraction, all we’d need to do is:
Write idempotent transformations from production databases to SQL tables.
Run our queries against the SQL tables.
So that weekend I got really excited, and started building a similar system for ourselves. After a couple fresh starts and a rewrite, Hydros was born.
Hydros is a node module that lets you easily pull any data source into a MySQL table. You define the SQL table name, columns, and two functions:
list function is generates a list of all rows that should be in the Hydros table. For a user table this would be an array of all the user IDs. For a project table this would be an array of all the project IDs. Dropdead simple, that’s the point :)
get function is responsible for filling in a single row. The function is passed a single row id, and returns all the column values for that row. For a project table, this might mean looking up project metadata in Mongo, or looking up API usage in Redis.
Between those two functions, you get a full sync:
list all the rows, then
get the columns for each row.
Hydros handles table creation and manages the timing of
get for you automatically.
The goal is to have many simple tables in MySQL, and then have many simple Hydros instances syncing the data into them. We have a half-dozen tables already, and it’s growing quickly.
For example, a hydros implementation of an “Project API Usage” table might work like this:
list project IDs from Mongo
get each project’s API usage by pulling counters from Redis
The Hydros table gets a list of rows it should have by polling the
list function. Then, at a higher frequency, Hydros polls the
get function to fill out the columns for each row. You control the refresh time.
Here’s an incomplete implementation of that example:
At Segment we use Hydros to answer a ton of business questions. Combined with Chartio, even the nontechnical people on our team can run queries and dashboard the results.
We have seven tables so far:
Project API Usage (list: Mongo, get: Redis)
Project Integrations (list: Mongo, get: Mongo)
Project Channel Usage (list: Mongo, get: Redis)
Project Library Usage (list: Mongo, get: Redis)
Project Metadata (list: Mongo, get: Mongo)
User Metadata (list: Mongo, get: Mongo)
User Projects (list: Mongo, get: Mongo)
And from that we create 25 charts and tables. Here are some examples:
A table of client libraries, sorted by popularity.
A table of integrations sorted by popularity. Juicy competitor data!
A graph of monthly project cohorts, colored by payment tier.
A graph of monthly project cohorts, colored by client library.
The charts tell us which client libraries and integrations we should focus on, help us estimate future revenue, and even let us prioritize enterprise leads to contact.
Hydros keeps the underlying tables in sync with production Mongo and Redis, completing a sync every 8 hours. Chartio keeps the charts up to date within 30 minutes. This is plenty fast for product analytics!
With Hydros in place, we’re saving a ton of time. Instead of writing piles of janky query code for every business question, we just run SQL queries. Best of all, we get to use business intelligence tools like Chartio right out of the box. We’re actually able to build on our analysis instead of treading water.
If you want to take Hydros for a spin, we open-sourced it. Check it out on Github: https://github.com/segmentio/hydros.
Calvin French-Owen on May 9th 2013
Five months ago, we released a small library called Analytics.js by submitting it to Hacker News. A couple hours in it hit the #1 spot, and over the course of the day it grew from just 20 stars to over 1,000. Since then we’ve learned a ton about managing an open-source library, so I wanted to share some of those tips.
At the very beginning, we knew absolutely nothing about managing an open-source library. I don’t think any of us had even been on the merging side of a pull request before. So we had to learn fast.
Since Analytics.js has over 2,000 stars now, lots of people are making amazing contributions from the open-source community. Along the way, we’ve learned a lot about what we can do to keep pull requests top-quality, and how to streamline the development process for contributors.
New contributors will look at your existing codebase to learn how to add functionality to your library. And that’s exactly what they should be doing. Every developer wants to match the structure of a library they contribute to, but don’t own. Your job as a maintainer is to make that as easy as possible.
The trouble starts when your library leaves ambiguity in its source. If you do the same thing two different ways in two different places, how are contributors going to know which way is recommended? Answer: they won’t.
In the worst case they might even decide that because you aren’t consistent, they don’t have to be either!
Solving this takes a lot of discipline and consistency. As a rule, you shouldn’t experiment with different styles inside a single open source repository. If you want to change styles, do it quickly and globally. Otherwise, newcomers won’t be able to differentiate new conventions from ones you abandoned months ago.
We started off being very poorly equipped to handle this. All of our code lived in a single file, and the functions weren’t organized at all. (And if you check out the commits, that was after a cleanup!) We hadn’t taken the time to set a consistent style - the library was a jumble of different conventions.
As the pull requests started coming in, each one conflicted with the others. Everyone was modifying the same parts of the same files and adding their own utility functions wherever they felt like it.
Which leads into my next point…
The initial way we structured our code was leading to loads of problems with merging pull requests: namely we had no structure! One of the big changes we made to fix our structure was moving over to Component.
We love Component because it eliminates the magic from our code and reduces our library’s scope. It lets us use CommonJS, so we can just
require the modules we need, right from where we need them. Everything is explicit, which means our is code much easier for newcomers to follow. It’s a maintainer’s dream.
While making the switch, we wrote a bunch of our own components to replace all the utility functions we had been attaching to our global
analytics object. And now, since components are easy to include and use everywhere else in the library, pull requesters just use them by default!
As soon as we released the right way and made it clear, pull request quality went up dramatically.
As far as keeping a consistent style goes, you have to be militant when it comes to new code. You cannot be afraid of commenting on pull requests even if it seems like a minor style correction, or refusing requests which needlessly clutter your API.
And remember, that goes for your own code as well! If you get lazy while adding new features, why shouldn’t contributors? The more clean code in your repository, the more good examples you have for newcomers to learn from.
Speaking of not getting lazy…
Having great test coverage is easily the best way to speed up development. We push changes all the time, so we can’t afford to spend time worrying about breaking existing functionality. We write lots of tests, and get lots of benefits: much fastoer development, more confidence in our own code, more trust from outsiders, and…
It also leads to much higher-quality pull requests!
When developers copy the library coding style, that extends to tests as well. We don’t let contributors add their own code without adding corresponding tests.
The good part is we don’t have to enforce this too much. Thanks to Travis-CI, nearly all of our pull requests come complete with tests patterned from our existing tests. And since we’ve made sure our existing tests are high quality, the copied tests naturally start off at a higher bar.
Travis is so well integrated with GitHub, that it will make merging your pull requests significantly easier. Both you and your contributors get notifications when a pull request doesn’t pass all your tests, so they’ll be incentivized to fix a breaking change.
Having a passing Travis badge also inspires trust that your library actually does work correctly and is still maintained. Not to mention peer pressures you into keeping your tests in good condition (which we all know is the hardest part).
Versioning properly, just like testing, takes discipline and is completely worth it. Most developers when they are having issues will immediately check to see if they are running the later version of a library. If not, they’ll update and pray the bug is fixed. Without versioning, you’ll get issues being reported about bugs you’ve already fixed.
When we started, we had no idea how to manage versions at all; our repository was basically just a pile of commits. If you push frequent updates, this will needlessly hassle the developers using your library.
From the start, there are three things your repo needs for versioning:
Readme.md describing what your library does.
Version numbers both in the source and in git tags.
History.md containing versioned and dated descriptions of your changes.
Having a changelog is essential for letting developers track down issues. Any time a developer finds a potential bug, the changelog if the first place they’ll go to see if an upgrade will fix it.
Tagging our repository also turned out to be immensely useful. Each version has a well defined point where the code has stabilized. If you’re using a frontend registry like Component or Bower, you also get the advantage of automatic packaging.
Don’t forget to put the version somewhere in your source too, where the developer using the library can access it. Because when you’re helping someone remotely debug you’ll want to have a quick way to check what version they’re running so you don’t waste time needlessly.
Oh, and use semver. It’s the standard for open-source projects.
I’ll let you in on a little secret: we have close to no documentation for people wanting to contribute to the library. (That’s another problem we need to fix.) But I’m amazed at how many pull requests we receive even without any building or testing instructions whatsoever.
Just from looking at our repository, how do new developers know how to build the script?
Considering that most of them have never seen a Component-based repository, my best guess is that they learn through our Makefile. Running
make forms a natural entry point to see how a new project works. It’s become the de facto instruction manual for editing code.
We use long flags in our scripts, and give each command a descriptive name. By using Component to structure and a build our library, we can essentially defer to their documentation before writing any of our own. Once you understand how Component works, you know how any Component-based repo is built.
We’ve done a lot to clean up our Makefile through the different iterations of ourcode. Remember, your build and test processes are part of your code as well. They should be clean and readable as they are the starting point newcomers.
Notice how all the tips I’ve mentioned are about streamlining your process? That’s because maintaining a popular repository is all about staying above water. You’ll be making lots of little changes throughout the day as new issues are filed, and if you don’t optimize your development process all of your free time will evaporate.
Not only that, but the quality of your library will suffer. Without a good build system, automated testing, and a clean codebase, fixing small bugs becomes a chore, so issues start taking longer and longer to resolve. No one wants that.
We’ve learned a lot when it comes to managing a repo of our size, but we still have a long ways to go in terms of managing a really big project. Managing a large open source project takes a lot of work. As the codebase grows, it will be harder and harder to make major overhauls of the code. More and more people will start depending on it, and the number of pull requests will start growing.
We still have a few more TODOs that will hopefully make maintaining Analytics.js even easier:
We want to split up our tests even more to make them as manageable as possible. Right now the file sizes are getting pretty out of control, which means it’s hard for newcomers to keep everything in their head at once.
Add better contributor documentation. It’s kind of ridiculous that we haven’t done this yet. We’re surprised we’ve even gotten any pull requests at all without it, so this is very high on our list.
Start pull requesting every change we make to the repository ourselves as well. This way we can always peer review each other’s changes, and other contributors can get involved with discussions.
Lots of these tactics come from the Node.js source, which has great guidelines for new contributors.
Every Node.js commit is first pull requested, and reviewed by a core contributor before it is merged. New features are discussed first as issues or pull requests, so multiple opinions are considered. Node commit logs are clean, yet detailed. They have an extensive guide for new contributors, and a linter to serve as a rough style guide.
No project is perfect, but learning these lessons firsthand has helped us adopt better practices across the rest of our libraries as well. Hopefully, they’ll help you too!
Calvin French-Owen on February 4th 2013
It’s been said that “constraints drive creativity.” If that’s true, then PHP is a language which is ripe for creative solutions. I just spent the past week building our PHP library for Segment, and discovered a variety of approaches used to get good performance making server-side requests.
When designing client libraries to send data to our API, one of our top priorities is to make sure that none of our code affects the performance of your core application. That is tricky when you have a single-threaded, “shared-nothing” language like PHP.
To make matters more complicated, hosted PHP installations come in many flavors. If you’re lucky, your hosting provider will let you fork processes, write files, and install your own extensions. If you’re not, you’re sharing an installation with some noisy neighbors and can only upload
Ideally, we like to keep the setup process minimal and address a wide variety of use cases. As long as it runs with PHP (and possibly a common script or two), you should be ready to dive right in.
We ended up experimenting with three main approaches to make requests in PHP. Here’s what we learned.
The top search results for PHP async requests all use the same method: write to a socket and then close it before waiting for a response.
The idea here is that you open a connection to the server and then write to it as soon as it is ready. Since the socket write is fast and you don’t need the response at all, you close the connection immediately post-write. This saves you from waiting on a single round-trip time.
But as you can see from some of the comments on StackOverflow, there’s some debate about what’s actually going on here. It left me wondering: “How asynchronous is the socket approach?”
Here’s what our own code using sockets looks like:
The initial results weren’t promising. A single
fsockopen call was taking upwards of 300 milliseconds, and occasionally much longer.
As it turns out,
fsockopen is blocking - not very asynchronous at all! To see what’s really going on here, you have to dive into the internals of how
fsockopen actually works.
As a refresher, the basic protocol on which the internet run is called TCP. It ensures that messages between computers are transmitted reliably and get ordered properly. Since nearly all HTTP runs over TCP, we use it for our API to make writing custom clients simple.
Here’s the gist of how a TCP socket gets started:
The client sends a
syn message to the server.
The server responds with an
The client sends a final
ack message and starts sending data.
For those of you counting, that’s a full roundtrip before we can send data to the server, and before
fsockopen will even return. Once the connection is open, we can write our data to the socket. Typically this can take anywhere from
30-100msto establish a connection to our servers.
While TCP connections are relatively fast, the chief culprit here is the extra handshake required for SSL.
The SSL implementation works on top of TCP. After the TCP handshake happens, the client then begins a TLS handshake.
It ends up being 3 round trips to establish an SSL connection, not to mention the time required to set up the public key encryption.
SSL connections in the browser can avoid some of these round-trips by reusing a shared secret which has been agreed upon by client and server. Since normal sockets aren’t shared between PHP executions, we have to use a fresh socket each time and can’t re-use the secret!
It’s possible to use the
socket_set_nonblock to create a “non-blocking” socket. This won’t block on the open call but you’ll still have to wait before writing to it. Unless you’re able to schedule time-intensive work in between opening the socket and writing data, your page load will still be slowed by
A better approach is to open a persistent socket using
pfsockopen. This will re-use earlier socket connections made by the PHP process, which doesn’t require a TCP handshake each time. Though the initial latency is higher during the first time a request is made, I was able to send over 1000 events/sec from my development machine. Additionally we can decide to read from the responsebuffer when debugging, or choose to ignore it in production.
To sum it up:
Sockets can still be used when the daemon running PHP has limited privileges.
fsockopen is blocking and even non-blocking sockets must wait before writing.
Using SSL creates significant slowdown due to extra round-trips and crypto setup.
Opening a connection sets every page request back
pfsockopen will block the first time, but can re-use earlier connections without a handshake.
Sockets are great if you don’t have access to other parts of the machine, but an approach which will give you better performance is to log all of the events to a file. This log file can then be processed “out of band” by a worker process or a cron job.
The file-based approach has the advantage of minimizing outbound requests to the API. Instead of making a request whenever we call
identify from our PHP code, our worker process can make requests for
100 events at a time.
Another advantage is that a PHP process can log to a file relatively quickly, processing a write in only a few milliseconds. Once PHP has opened the file handle, appending to it with
fwrite is a simple task. The log file essentially acts as the “shared memory queue” which is difficult to achieve in pure PHP.
To read from the analytics log file, we wrote a simple python uploader which uses of our batching
analytics-python library. To ensure that the log files don’t get too large, the uploading script renames the file atomically. Activitely writing PHP files are still able to write to their existing file handles in memory, and new requests create a new log file where the old one used to be.
There’s not too much magic to this approach. It does require a more work on the side of the developer to set up the cron job and separately install our python library through PyPI. The key takeaways are:
Writing to a file is fast and takes few system resources.
Logging requires some drive space, and the daemon must have capabilities to write to the file.
You must run a worker process to process the logged messages out of band.
As a last alternative, your server can run
exec to make requests using a forked
curl process. The
curl request can complete as part of a separate process, allowing your PHP code to render without blocking on the socket connection.
In terms of performance, forking a process sits between the two of our earlier approaches. It is much quicker than opening a socket, but more resource intensive than opening a handle to a file.
To execute the forked curl process, our condensed code looks like this:
If we’re running in production mode, we want to make sure that we aren’t waiting on the forked process for output. That’s why we add the
"> /dev/null 2>&1 &" to our command, to ensure the process gets properly forked and doesn’t log anywhere.
The equivalent shell command looks like this:
It takes a little over
1ms to fork the process, which then uses around 4k of resident memory. While the
curl process takes the standard SSL
300ms to make the request, the
exec call can return to the PHP script right away! This lets us serve up the page to our clients much more quickly.
On my moderately sized machine, I can fork around
curl requests per second without them stacking up in memory. Without SSL, it can do significantly more:
Forking a process without waiting for the output is fast.
curl takes the same time to make a request as socket, but it is processed out of band.
Forking curl requires only normal unix primitives.
Forking sets a single request back only a few milliseconds, but many concurrent forks will start to slow your servers.
While not an approach to making async requests, we found that destructor functions help us batch API requests.
To reduce the number of requests we make to our API, we want to queue these requests in memory and then batch them to the API. Without using runtime extensions, this can only happen on a single script execution of PHP.
To do this we create a queue on initialization. When the script ends its execution, we send all the queued requests in batch:
We establish the queue when the object is created, and then flush the queue when the object is ready to be destroyed. This guarantees that our queue is only flushed once per request.
Additionally, we can create the socket itself in a non-blocking way in the constructor, then attempt to write to it in the destructor. This gives the connection more time to be established while the PHP interpreter is busy trying to render the page - but we will still have to wait before actually writing to the socket.
Our holy grail is a pure-PHP implementation which doesn’t interface with other processes, yet still is conservative when it comes to making requests. We’d like to make developer setup as easy as possible without requiring a dedicated queue on a separate host.
In practice, this is extremely hard to achieve. Each one of our methods have caveats and restrictions depending on how much traffic you’re dealing with and what your system allows you to do. Since no single approach can cover every use case, we built different adapters to support different users with different needs.
Originally we used the
curl forking approach as our default. Forking a process doesn’t cause a significant performance hit for page load, and is still able to scale out to many requests per second per host. However, this is limited to the configuration of the host, and can have scary consequences if your PHP program starts forking too many processes at once.
After switching to persistent sockets, we decided to make the socket approach our default. Without the TCP handshake per every request, the sockets can deal with thousands of request per second. This approach also has significantly better portability than the
For really high traffic clients who have bit more control over their own hardware, we still support the log file system. If the process which actually serves the PHP can’t re-use socket connections, then this is the best option from a performance perspective.
Ultimately, it comes to knowing a little bit about the limitations of your system and its load profile. It’s all about determining which trade-offs you’re comfortable making.
Edit 2/6/13: Originally I had stated that we used the
curl forking approach for our default and had ignored persistent sockets altoghether. After switching to persistent sockets, the performance of the socket approach increased enough to make it our default approach. It also has better portability across PHP installations.
PS. If you’re running WordPress, we also released a WordPress plugin that handles everything for you!