

Leif Dreizler on March 31st 2020

Segment receives billions of events every day from thousands of customers that trust Segment to keep their data safe. At Segment, we believe that good security is an essential part of building high-quality software, similar to reliability or scalability. In addition to the tools and processes developed by our Security Org to help software engineers make good security choices, we also rely on reviews from both traditional and crowdsourced security companies.

Bug bounties are often seen as a significant burden for the security teams that run them. You’ll hear horror stories about companies launching a bug bounty, their security team getting inundated with low-quality reports, duplicate submissions, and researchers going outside the scope of the program.

Shortly after, you'll hear about their engineering team being overwhelmed with newly discovered vulnerabilities to fix. From those that survive the opening salvo, you may hear complaints that, over time, they have stopped receiving impactful submissions.

A few years ago, when the space was less mature, critics questioned whether running a program was worth it. Now, it is expected that organizations of a certain size and maturity run a bug bounty program.

In late 2019, the Cybersecurity and Infrastructure Security Agency (CISA) of the United States published a draft directive recommending all agencies develop a mechanism to accept information related to security vulnerabilities from external parties.

In this blog post, we’ll break down how to start and manage a bug bounty program, consistently achieve good results, and maintain healthy relationships with the people that power the program.

If you’re short on time, check out the “Top Tips” section at the bottom of this post.

What’s a bug bounty?

A bug bounty program is like a Wanted Poster for security vulnerabilities. 

Companies running bug bounty programs pay independent security researchers from across the world for security vulnerabilities in their own products and infrastructure.


Bug bounty programs are run by hundreds of organizations, including Mozilla, Fiat Chrysler, the U.S. Department of Defense, and Segment.

Why bug bounty?

I assume most readers are already bought into the benefits of running a bug bounty.

If you’re on the fence, there are tons of independent articles, marketing docs from companies like HackerOne and Bugcrowd, and recorded conference talks, so I’m not going to cover this in detail. 

Most companies that have an internet presence or make an internet-connected device should consider running a bounty, or at least have a way for security researchers to report security issues.

It is also part of the Vendor Security Alliance questionnaire, so it may be something your customers ask you about if you are in the B2B space. 

If you don’t have a way for researchers to report issues, they will email people at your company (or any alias they can find on your website), connect with you on LinkedIn, or just tweet about the issues they think they’ve found.

It’s a much better experience for researchers, your employees, and your customers if you give the security community a clear avenue to report vulnerabilities.

Your security and engineering orgs will be regularly impressed by the creativity of the researcher community. These are people that, without internal knowledge, can find critical vulnerabilities in organizations across the world. 

Why use a bug bounty platform?

I strongly recommend using a bug bounty platform like HackerOne or Bugcrowd (we use Bugcrowd here at Segment) to help manage this process. These companies provide a platform and services to help run an efficient program.

Severity baselines

Severity baselines make it easier to tell how serious a vulnerability is, and how much time you should be spending on review and remediation.

When running a program on your own, you’ll frequently have researchers overhyping vulnerabilities. Platforms have a guided submission form, which helps researchers pick the appropriate category and rating.

Reputation system

The reputation system rewards researchers who accurately rank vulnerabilities and creates a competitive environment that benefits both researchers and program owners.

It also helps reinforce good behavior: discipline issues carry stricter consequences, and researchers who misbehave may be banned from the platform entirely.

Disclosure agreements

To submit vulnerabilities via these platforms, researchers have to agree not to disclose the vulnerability without approval from your company.

Triage services

Both platforms also provide triage services, which I highly recommend paying for. These are the first line of defense for your internal security resources. These globally distributed teams will help clean up researcher reports, mark submissions as duplicates, and filter out low-quality reports.

Management services

These companies also serve as a knowledge base for you to learn about running a program and ask questions. You can bounce ideas off of someone that works at a company running hundreds of programs.

Integrations

Platforms have structured input with required fields and integrations with popular tools, like Jira. These make it much easier to turn a submission into a ticket for your engineering org.

Private programs

For most companies, it isn’t possible to run a private program without the help of a bug bounty platform.

We’ll talk about private programs in more depth later, but this is the recommended starting point for companies launching a bug bounty for the first time.

All of the above features free your team to focus on the security challenges unique to your business.

Laying the groundwork

Having a successful program starts with a good foundation, and it’s your job as a program owner to help set your organization up for success.

Vulnerability management

Think about your current process for handling a security vulnerability. What happens when someone internally or externally finds a security bug?

You will inevitably get more of these after starting a bug bounty program, so you must have a good way to handle these reports.

Your vulnerability process doesn’t have to be perfect, but you should have a way to prioritize and assign bug reports to the appropriate engineering team without high overhead.

Engineering buy-in

As someone starting a program, you’ll also need to get buy-in from your Engineering org. They are the ones that will have to fix the issues and will likely be the ones responding to alerts triggered by researchers.

Your alerting story doesn’t need to be perfect. But you also don’t want engineers to be woken up every time someone triggers an error because some input validation was working correctly and stopped a researcher from submitting a < symbol into an email address field.

Remember, your team doesn't have to fix every valid vulnerability immediately.

Vulnerabilities are just bugs and should be prioritized appropriately. If you’re having trouble getting your engineering org to fix critical vulnerabilities in a timely manner, you may want to direct your efforts to job-hunting instead of starting a bug bounty program 🙃

Getting started

Once your organization is bought-in, you can focus on getting things ready for the researchers.

Bounty Brief

Your bounty brief is what researchers will read to determine if they’re interested in working on your program. It’s part instructions, part rules, and part advertisement.

Keep in mind you’re competing for researchers’ time; they don’t have to work on your program when there are so many other programs.

Your bounty brief should be clear, concise, and should set expectations with the researchers. You can find the Segment program hosted on Bugcrowd.

Scope

Where do you want researchers to focus their testing? What’s in scope? What’s out of scope?

I recommend starting with assets that are easy for the researchers to access, ideally something free, that anyone can sign up for.

You should also try to pick a target that has at least medium size, complexity, and business impact. This will help you show value early, which will help you expand the program. 

Access

How do researchers get access to the scope? Are there docs they can read to help them get up to speed? We instruct researchers to sign up for our app using their @bugcrowdninja.com email address and to include -bugcrowdninja as part of their workspace slug. 

This makes it easier for us to determine if someone is part of our Bugcrowd program when we review logs and alerts. If we notice someone causing problems in our app, we can ask Bugcrowd to provide researcher coaching.

Severity

How are you going to rate submissions? Consistent severity is important because it impacts how much the researcher gets paid for a submission. HackerOne uses Mitre’s Common Weakness Enumeration (CWE) and Bugcrowd uses the Bugcrowd Vulnerability Rating Taxonomy (VRT).

Rewards

How much are you going to pay for vulnerabilities? Researchers need to know upfront how much you’ll pay for vulnerabilities so they can assess if it is worth their time to hunt on your program. 

Think about using different reward ranges for different assets. This can help control costs and also helps researchers understand which targets are more important. For example, our brief at https://bugcrowd.com/segment calls out specific objects that will net a higher reward.

Swag

A handful of years ago, getting a T-shirt as a reward was pretty standard. I’d strongly encourage anyone thinking about running a swag-based bounty to reconsider.

T-shirts don’t pay rent and are more work for your team than sending a researcher money. What do you do when that T-shirt you sent to another continent is lost in the mail or doesn’t fit?

We reserve swag for our top performers. Sending a T-shirt requires the researcher to trust us enough to give us their address and requires me to go to the post office.

Safe harbor

Researchers want to know that they’ll be safe from legal action and exempt from your end-user license agreement (EULA) when operating within the rules of your bounty brief. Check out disclose.io for more info.

When seeking sign-off on this safe harbor language, take the time to explain what a bug bounty is and why it's important, and have a few examples from recognizable organizations ready to show.

Understand the platform

Learn a little bit about the platform you’re using. Your actions on the platform impact the researcher. If you mistreat researchers, they will go elsewhere; without researchers, your program isn’t providing value to your organization. 

The same report status can have different meanings and impact on different platforms.

For example, on HackerOne Not Applicable reduces the researcher’s site-wide score by 5 points, and should be used for reports that don’t contain a valid issue.

On Bugcrowd, Not Applicable does not impact the researcher’s score, and is commonly used for reports that should be neither accepted nor rejected. To achieve this result on HackerOne, you would use the Informative status.

If you have any questions about the platform you’re using, I strongly recommend reviewing documentation or reaching out to your account manager for help.

Start Small

Regardless of how big your company’s internet footprint is, you can start with a small scope open only to a handful of individuals as part of a private program.

In mid-2017, Segment was running a private program with 25 researchers and a single target: our app.

The early researchers invited will be some of the platform’s most trusted, and they will generally be more accepting of companies that are learning how to manage a program, as long as you pay them fairly and treat them with respect.

Starting small allows your organization to learn how to run a program in a safer environment. If your vulnerability management program has some gaps, you can fix them; if your bounty brief is unclear, you can rewrite it; if your alerts aren’t tuned properly, you can invest time into improving them. If you need to pause your program, you can relaunch later with a less negative impact.

Even while we had a private program, we would direct researchers that reached out via email to our Bugcrowd program. This allowed us to receive the benefits of the platform and triage services for all submissions before running a public program.

It’s much easier to explain to a researcher why you won’t be paying for a low-effort submission when you have a prioritization system established and enforced by a third party.

Growth over time

Like any multi-year project, your bug bounty will evolve and require ongoing effort to keep it healthy.

Additional scope and researchers

Researchers love testing new features and assets: in most bug bounty programs, only the first person to find a given vulnerability receives a monetary reward, so fresh scope means less competition and a better chance of being rewarded.

If you started with a small scope, your program is steady, and you’re ready for more submissions, this is a great time to add more targets to your brief. 

As you add scope, keep in mind that not all assets are of equal value to an adversary. Consider specifying different reward ranges for different assets based on their security maturity and value.

Open scope

Over time, you should also consider an “open scope” bounty if it is appropriate for your organization. We have listed as a target, Any host/web property verified to be owned by Segment (domains/IP space/etc.), which serves as a catch-all for anything not explicitly listed in our “In Scope” or “Out of Scope” sections.

Having an open scope bounty is enticing to researchers. Not only does it show that you take running a bug bounty program seriously, it also shows that regardless of where they find a vulnerability, it will likely be rewarded (assuming it is valid, unique, and not explicitly out of scope).

Many researchers specialize in finding forgotten internet-facing assets as part of open-scope programs, and have developed their own infrastructure to identify assets and vulnerabilities to be able to efficiently earn rewards.

It’s also worth noting that there is no scope for an attacker trying to compromise your company’s security. Working towards an open scope means that it is more likely a bug bounty researcher will find and report a vulnerability before an attacker exploits it.

Early access

Over time, you’ll build trust and form relationships with particular researchers. These are great people to give early access to upcoming features. Many times, these features require manual provisioning, making them less suitable for wide-scale testing.

Early access is a mutually beneficial system in which you will receive security vulnerabilities prior to release, which makes them easier to fix. Researchers will be able to test features with less competition, which makes them more likely to earn a reward and continue testing on your program.

If the effort to set up these features is medium or higher, consider paying the researcher a grant before they start working.

Clearly communicate what you're looking for and what you expect from them. When offering a researcher grant, we want to see a short write-up of what didn't work in addition to any findings they submit. Rewardable findings should be paid in addition to the grant.

Public program

Once you’ve been running a program for a while and are confident in your company’s ability to receive vulnerabilities from the global researcher community, you should consider evolving it into a public program. 

If you don’t have a wide scope, this is a great time to revisit that decision.

Maximizing your scope (while private) will reduce the uptick in submissions when your program is launched publicly. You should also invite as many researchers as possible to your private program before going public for the same reason.

Because public programs are open to anyone, you will inevitably receive testing from a lot of newer folks that will pay less attention to your bounty brief, so having a wide scope helps in this regard as well.

Segment has run a public bug bounty program since late 2018, roughly 18 months after launching our private program.

Train your triagers

Hopefully over time, you will think of your outsourced triage team as an extension of your internal team. Spending the time to let them know how you want your program run will pay dividends in the future. Any submission they can validate without asking your team questions saves time for everyone involved.

Here are some examples of guidance we’ve given to the Bugcrowd triage team:

Identify duplicates for non-rewardable submissions

Many programs do not bother to mark informational, out of scope, or other non-rewardable submissions as duplicates. We do this for two reasons: 

The first is that if we decide to fix one of these issues later, we can go back and mark the original submission as resolved and pay the researcher. Any duplicates of this issue will still receive points. 

The second is that when a tool commonly used by bug bounty hunters produces a false positive, you will see that same finding submitted to your program a lot.

If we repeatedly see an out-of-scope or not-reproducible submission, we can add a specific item to our bounty brief warning researchers that it will be marked as out of scope or not reproducible unless it comes with a working proof of concept.

Don’t be afraid to deduct points for undesired behavior

While we are generally laid-back and understanding program owners, we aren’t afraid to deduct points from a researcher’s score when it’s warranted.

Many programs shy away from deducting points, but we want to ensure participants in our program thoroughly read our brief and think that it helps the larger bug bounty community to slightly penalize those that disregard clearly documented rules. 

Two of the common arguments against bug bounty programs are that the submissions are often low-value and that researchers don’t respect scope.

For example, we have a very small out-of-scope section, which includes: CORS or crossdomain.xml issues on api.segment.io without proof of concept.

This finding, flagged by a tool commonly used by bug bounty participants, is one we have received dozens of times, but never with any demonstrated impact.

We do this to save time for both researchers and our triage team. If a researcher submits this finding without a proof of concept, we encourage Bugcrowd to mark it as out of scope. If a researcher submitted a finding that showed real impact, we would be more than happy to reward it, fix it, and update our brief.

Explain deviations in ratings

If you need to deviate from the baseline rating established in your bounty brief, take the time to explain to the researcher why the rating and reward are higher or lower than they might expect. 

Researchers are generally understanding, as long as your rating, reward, and explanation are fair and make sense. If you find yourself commonly deviating from the ratings, it may be time to make changes to your bounty brief so that researchers know what to expect in advance. If you make severity or scope changes as the result of a submission, reward the researcher at whichever rate is more favorable to them.

In addition to explaining why something was rated lower than the baseline, take the time to explain why something was rated higher than the baseline. This is a great way to encourage further testing in these areas and is a great way to build trust with a researcher.

These explanations also help your triage team learn more about your program, and allow them to more accurately triage future submissions.

Build and maintain good relationships with researchers

Take time to build relationships and trust with researchers, especially those that repeatedly submit to your program. Treat researchers fairly, with respect, and consider paying for anything that brings value to your organization.

You’re competing for a researcher’s time, especially the ones that are the most talented. They can likely work on almost any bug bounty program available; think about ways you can encourage them to work on yours.

Keep in mind that all researchers, even those that are unskilled, are human beings. Assume that they want to help you secure your organization, learn more about security and technology, and get paid.

Pay for anything that brings value

If there is one sentence you remember from this blog, I hope it is “pay for anything that brings value.” 

Bug bounty hunters put in a lot of time and effort that doesn’t result in getting paid. This could be time spent developing tooling, hunting without finding any bugs, or having a valid bug marked as a duplicate.

Try to avoid thinking about individual bug costs. Instead, think about the overall value the program brings to your organization in terms of bugs found, time saved, and peace of mind. If you’re debating between two severities, pick the higher one and pay the researcher at that rate. You can always change the severity in your internal ticketing system later.

Once you’ve received a rewardable submission, try to triage and pay quickly. Sometimes determining the full impact takes time; if this is the case, add a comment letting the researcher know you appreciate their work but need some extra time to determine the appropriate reward.

Work collaboratively with the researcher

As an employee of your company, you should know more about the codebase and infrastructure than a security researcher outside your organization (although occasionally I question this based on the creative and impactful submissions we receive 😅). 

Sometimes when running a bug bounty program, you’ll get a submission that makes you think, “What’s the next level the researcher could get to?” If this is a researcher you trust, it may be appropriate to give them some hints to help further their testing. If you give them hints, you can also issue some cautionary advice to help them continue in a way that is safe for your organization and customers. 

Giving the researcher hints helps show them you value their testing and saves your team from spending time on something that may not be possible. If the hint is helpful, the researcher will be submitting a higher-severity finding, which positively impacts their researcher score and earns a higher monetary reward. It also allows you to get the vulnerability fixed faster due to the higher severity.

Sometimes, it isn’t appropriate to include the researcher in this phase of the process. If our team continues the investigation, and it leads to the discovery of a higher-impact weakness in our systems, we reward the researcher as if their report contained the full impact. We also explain why we paid them at this higher rate, but let them know we are unable to share the details. This is a great way to show the researcher you value their work and build trust.

Share progress with the researcher

If a researcher submits a vulnerability that leads to a systemic fix for a vulnerability class, share this with them! Researchers are generally excited to hear that their work led to meaningful change within your organization. It also is a cue for them to attempt to bypass the new protections. 

Pay for Dupes

At Segment, we commonly pay researchers for well-written duplicates, and frequently reach out to let them know that we appreciated their submission. We also let them know that we don’t always pay for duplicates to make sure that expectations are set appropriately. 

This has worked out incredibly well for us. All of our most critical submissions have come from researchers that were originally rewarded for a well-written duplicate. Segment is a complex product that takes time to set up and fully understand. Researchers that put in the effort to fully set up a Segment workspace have additional context and understanding that take time to acquire—these people are hard to replace, and you want to keep them happy.

Pay bonuses for well-written reports

We also pay extra for well-written reports. Valid submissions need to get turned into Jira tickets which are assigned to engineering teams. Reports that are concise, easy to follow, have clear impact, and are well-formatted take less time for us to turn into tickets. We want to encourage researchers to save us time so we make sure to reward appropriately and let them know that we appreciate their efforts.

Final thoughts

Running a successful bug bounty program requires consistent effort from your team, but can bring tremendous value to your company and customers. Any vulnerability reported and fixed is one fewer vulnerability an attacker could use to get a foothold in your organization. Bug bounty submissions can help illuminate vulnerability trends, which can help prioritize where you spend resources to fix systemic issues in your applications or infrastructure.

Bug bounty programs are people-powered. Spend the time to make those involved in your program feel valued, help them understand the motivations behind your decisions, and be excellent to each other!

Top tips

Thanks for taking the time to read my post! I hope you learned a few things to help your company run a successful program. Here are some of my top tips to reference later:

  • Pay for anything that brings value

  • Pay extra for well-written reports, even if they’re dupes

  • Avoid thinking about individual bug costs

  • Reward quickly

  • Partner with a bug bounty platform and pay for triage services

  • If you make changes to your bounty brief as the result of a submission, reward the researcher at the more favorable rate

  • Invest time into building and maintaining relationships with your researchers and triage team

  • Don’t be afraid to deduct points for bad behavior

  • Start small and partner early with Engineering

  • Write a clear and concise bounty brief to set expectations with the researchers

Thank you! 💚💙

A special thanks to the Segment Engineering organization for fixing vulnerabilities and responding to alerts.

To Edis from Bugcrowd for helping us triage three years of vulnerabilities and truly being an extension of our team.

To all the researchers that have helped keep our customers safe by looking for vulnerabilities as part of our program.

And finally, to researchers danieloizo and sheddow, you have both submitted numerous well-written and high impact findings and are an absolute pleasure to work with.


Kevin Burke on May 17th 2017

Recently we shared the techniques we used to save more than a million dollars annually on our AWS bill. While we went into detail about the various problems and solutions, the most common question we heard was: "I know I’m spending a ton on AWS, but how do I actually break that into understandable pieces?" 

At face value, this sounds like a fairly straightforward problem. 

You can easily split your spend by AWS service per month and call it a day. Ten thousand dollars for EC2, one thousand for S3, five hundred for network traffic, etc. But what’s still missing is a synthesis of which products and engineering teams are dominating your costs. 

Then, add in the fact that you may have hundreds of instances and millions of containers that come and go. What started as a simple analysis problem quickly becomes unimaginably complex. 

In this follow-up post, we’d like to share details on the toolkit we used. Our hope is to offer up a few ideas to help you analyze your AWS spend, no matter whether you’re running only a handful of instances, or tens of thousands.

Grouping by ‘product areas’

If you’re operating AWS at scale–it’s likely that you’ve hit two major problems.

First, it’s difficult to notice if one part of the engineering team suddenly starts spending a lot more than it used to. 

Our AWS bill is six figures per month, and the charges for each AWS component change rapidly. In a given week, we might deploy five new services, optimize our DynamoDB throughput, and add hundreds of customers. In this environment it’s easy to overlook that a single team spent $20,000 more on EC2 this month than they did last month.

Second, it can be difficult to predict how much new customers will cost. 

As background, Segment offers a single API which can send analytics data to any number of third-party tools, data warehouses, S3, or internal data pipelines. 

While customers are good at predicting how much traffic they will have and the products they’d like to use, we’ve historically had trouble translating this usage information to a dollar figure. Ideally we’d like to be able to say "1 million new API calls will cost us $X so we should make sure we are charging at least $Y."

Our solution to these problems was to bucket our infrastructure into what we dubbed ‘product areas’. In our case, these product areas are loosely defined as:

  1. integrations (the code that sends data from Segment to various analytics providers)

  2. API (the service that receives the data customer libraries send to Segment)

  3. warehouses (the pipeline that loads Segment data into a customer's data warehouse)

  4. website and CDN

  5. internal (shared support logic for the four above)

In scoping the project, we realized it would be next to impossible to measure everything. So instead, we decided to target a percentage of the costs in the bill, say, 80%, and try to get that measurement working end-to-end. 

It's better to deliver business value analyzing 80% of the bill than to shoot for 100%, get bogged down in the collection step, and never deliver any results. Shooting for 80% completeness (being willing to say "it's good enough") ended up saving us again and again from rabbit-holing into analysis that didn’t meaningfully impact our spend.

Gather, then analyze

To break out costs by product area, we needed to gather three sets of billing data, which we then joined together:

  1. the AWS billing CSV - the CSV generated by AWS to provide the full billing line items

  2. tagged AWS resources – resources which could be tagged within the billing CSV

  3. untagged resources – services like EBS and ECS that required custom pipelines to tag usage with ‘product areas’

Once we calculated the product areas for each of these pieces of data, we could load them into Redshift for analysis.

1. The AWS Billing CSV

The place to start to understand your spend is the AWS Billing CSV. You can enable a setting in the billing portal and Amazon will write a CSV with detailed billing information to S3 every day.

By detailed, I mean VERY detailed. A typical billing row might be a charge for a whopping $0.00000001, or one one-millionth of a penny, for DynamoDB storage on a single table between 3AM and 4AM on February 7th. There are about six million rows in our billing CSV for a typical month. (Unfortunately, most cost more than a millionth of a penny.)

We use Heroku's awsdetailedbilling tool to copy the billing data from S3 to Redshift. This was a good first step, but we didn't have a great way to correlate a specific AWS cost with our own product areas (e.g. whether a given instance-hour is used for the integrations or warehouses product areas).

What’s more, about 60% of the bill is consumed by EC2. Despite being the lion’s share of the cost, understanding how a given EC2 instance mapped to a product area was impossible with the data provided by the billing CSV.

There’s a good reason why we couldn’t just use instance names to determine product areas. Instead of running a single process per host, we make heavy use of ECS (Elastic Container Service), to stack hundreds of containers on a host and achieve much higher utilization. 

Unfortunately, Amazon bills only for the EC2 instance costs, so we had zero visibility into the costs of the containers running on an instance: how many containers we were running at a typical time, how much of the pool we were using, and how many CPU and memory units we were using.

Even worse, information about container auto-scaling isn’t reflected anywhere in the billing CSV. To get this data for analysis, we had to write our own tooling to gather and then process it. I’ll cover how this pipeline works in the following sections.

Still, the AWS Billing CSV will provide very good granular usage data that will become the basis for our analysis. We just need to associate that data with our product areas.

Note: This problem isn’t going away either. Billing by the instance-hour is going to be a bigger and bigger problem from a "what am I spending money on?" perspective, since more companies are running fleets of containers across a set of instances, with tools like ECS, Kubernetes and Mesos. In a slight twist of irony, Amazon has had this same problem for years - each EC2 instance is a Xen hypervisor, being run on the same bare metal machine as other instances.

2. Cost data from tagged AWS resources

The most important and readily available data comes from ‘tagged’ AWS resources.

Out of the box, the AWS billing CSV doesn’t include any tags in its analysis. As such, it’s impossible to discern how one EC2 instance or bucket might be used vs another.

However, you can enable certain tags to appear alongside your line item costs using cost allocation tags.

These tags are officially supported by many AWS resources: S3 buckets, DynamoDB tables, and so on. You can toggle a setting in the AWS billing console to make a cost allocation tag show up in the CSV. After a day or so, your chosen tag (we chose product_area) will start showing up as a new column next to the associated resources in the detailed billing CSV. 

If you are doing nothing else, start by using cost allocation tags to tag your infrastructure. It’s essentially ‘free’ and requires zero infrastructure to run.

After we enabled cost allocation tags, we had two challenges: 1) tagging all of the existing infrastructure, and 2) ensuring that any new resources would automatically have tags.

Tagging your existing infrastructure

Tagging your existing infrastructure is pretty easy: for a given AWS product, query Redshift for the resources with the highest costs, bug people in Slack until they tell you how those resources should be tagged, and stop when you've tagged 90% or more of the resources by cost.
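As a sketch of that first step, a query along these lines surfaces the most expensive EC2 resources to chase down (the resource_id column is an assumption about how your import names the CSV’s ResourceId field):

    -- resource_id is an assumed mapping of the CSV's ResourceId field
    SELECT resource_id, sum(unblended_cost) AS cost
    FROM awsbilling.line_items
    WHERE statement_month = $1
      AND product_name = 'Amazon Elastic Compute Cloud'
      AND resource_id IS NOT NULL
    GROUP BY resource_id
    ORDER BY cost DESC
    LIMIT 20;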

However, enforcing that new resources stay tagged requires some automation and tooling. 

To do this, we use Terraform. In most cases, Terraform's configuration supports adding the same cost allocation tags that you can add via the AWS console. Here's an example Terraform configuration for an S3 bucket (a minimal sketch; the bucket name and tag value below are illustrative):
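    resource "aws_s3_bucket" "example" {
      # illustrative values only; use your own bucket name and product area
      bucket = "segment-example-data"

      tags = {
        product_area = "warehouses"
      }
    }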

Though Terraform provided the base configuration, we wanted to verify that every time someone wrote resource "aws_s3_bucket" into a Terraform file, they included a product_area tag. 

Fortunately Terraform configurations are written in HCL (Hashicorp Configuration Language), which ships with a comment preserving configuration parser. So we wrote a checker that walks every Terraform file looking for taggable resources lacking a product_area tag.

We set up continuous integration for the repo with Terraform configs, and then added these checks, so the tests will fail if anyone tries to check in a tag-able resource that's not tagged with a product area. 

This isn't perfect - the tests are finicky, and people can still technically create untagged resources directly in the AWS console, but it's good enough for now–the easiest way to provision new infrastructure is via Terraform.

Rolling up cost allocation tag data

Once you've tagged resources, accounting for them is fairly simple.

  1. Find the product_area tags for each resource, so you have a map of resource id => product area tags.

  2. Sum the unblended costs for each resource

  3. Sum those costs by product area, and write the result to a rollup table.

    SELECT sum(unblended_cost) FROM awsbilling.line_items WHERE statement_month = $1 AND product_name='Amazon DynamoDB';

You might also want to break out data by AWS product - we have two separate tables, one for Segment product areas, and one for AWS products.
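A rollup along those lines might look like the following sketch; the product_area column name is an assumption, since the exact name depends on how your import tool maps the user:product_area tag column from the CSV:

    -- product_area is a stand-in for the imported cost allocation tag column
    SELECT product_area, product_name, sum(unblended_cost) AS cost
    FROM awsbilling.line_items
    WHERE statement_month = $1
    GROUP BY product_area, product_name
    ORDER BY cost DESC;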

We were able to account for about 35% of the bill using traditional cost allocation tags.

Analyzing Reserved Instances

This approach works great for tagged, on-demand instances. But in some cases, you may have paid AWS up front for a ‘reservation’. Reservations guarantee a certain amount of capacity, in exchange for up-front payment at a lower fixed rate.

In our case, this means several large charges that show up in the December 2016 billing CSV need to be amortized across each month in the year. 

To properly account for these costs, we wanted to use the unblended cost that was incurred in the desired time period. The query looks like this:

Subscription costs take the form "$X0000 of DynamoDB," so they are impossible to attribute to a single resource or product area. 

Instead, we sum the per-resource costs by product area and then amortize the subscription costs according to the percentages. If the warehouses pipeline used 60% of our EC2 compute costs, we assume it used 60% of the reservation as well. 

This isn't perfect. If a large percentage of your bill is reserved up front, this amortization strategy will be distorted by small changes in the on-demand costs. In that case you'll want to amortize based on the usage for each resource, which is more difficult to sum than the costs.

3. Cost data from untagged AWS resources

While tagging instances and DynamoDB tables is great, other AWS resources don't support cost allocation tags. These resources required that we build a Rube Goldberg-ian-style workflow to successfully get the cost data into Redshift. 

The two biggest untagged resources groups we had to deal with were ECS and EBS.

ECS

ECS is constantly scaling our services up and down, depending on how many containers a given service needs. It’s also responsible for re-balancing and bin-packing containers across individual instances.

ECS starts containers on hosts based upon “CPU and memory reservation”. A given service indicates how many CPU shares it requires, and ECS will either put new containers on a host with capacity, or scale up the number of instances to add more capacity. 

None of these ECS actions are directly reflected within our AWS Billing CSV–but ECS is still responsible for triggering the auto-scaling for each of our instances. 

Put simply, we wanted to understand what ‘slice’ of each machine a given container was using, but the billing CSV only gives us ‘whole unit’ breakdown by instance.

To determine the cost of a given service, we built our own pipeline that makes use of the following pieces:

  1. Set up a Cloudwatch subscription any time an ECS task gets started or stopped.

  2. Push the relevant data (Service name, CPU/memory usage, starting or stopping, EC2 instance ID) from the event to Kinesis Firehose (to aggregate individual events).

  3. Push the data from Kinesis Firehose to Redshift.

Once all of the task start/stop/size data is in Redshift, we multiply the amount of time a given ECS task ran (say, 120 seconds) by the number of CPU units it used on that machine (up to 4096 - this info is available in the task definition), to get a number of CPU-seconds for each service that ran on the instance. 

The total bill for the instance is then divided across services according to the number of CPU-seconds each one used.
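Concretely, the allocation boils down to a query along these lines; the ecs_task_events table and its columns are hypothetical stand-ins for whatever schema your Firehose delivery stream writes to Redshift:

    -- ecs_task_events and its columns are hypothetical names
    SELECT service_name,
           sum(cpu_units * datediff(second, started_at, stopped_at)) AS cpu_seconds
    FROM ecs_task_events
    WHERE instance_id = $1
    GROUP BY service_name;

Dividing each service’s cpu_seconds by the instance-wide total, and multiplying by that instance’s cost from the billing CSV, gives the per-service share.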

It's not a perfect method. EC2 instances aren't running at 100% capacity all the time, and the excess currently gets divided across the services running on the instance, which may or may not be the right culprits for that overhead. But (and you may recognize this as a common theme in this post), it's good enough.

Additionally, we want to map the right product area for each ECS service. However we can't tag those services in AWS because ECS doesn't support cost allocation tags.

Instead we added a product_area key to the Terraform module for each ECS service. This key doesn't lead to any metadata being sent to AWS, but it is read by a script that collects the product_area keys for each service. 

That script then publishes the service name => b64encoded product area mappings to DynamoDB on every new push to the master branch. 

Finally, our tests then validate that each new service has been tagged with a product area.

EBS

Elastic Block Storage (EBS) also makes up a significant portion of our bill. EBS volumes are typically attached to an EC2 instance, and for accounting purposes it makes sense to count the EBS volume costs together with the EC2 instance. However, the AWS billing CSV doesn't show you which EBS volume was attached to which instance.

We again used Cloudwatch for this - we subscribe to any "volume attached" or "volume unattached" events, and then record the EBS => EC2 mappings in a DynamoDB table. 

We can then add EBS volume costs to the relevant EC2 instances before accounting for ECS costs.
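If that mapping is also copied from DynamoDB into Redshift, folding EBS costs into their instances becomes a join; the ebs_attachments table below is a hypothetical name for that copy, and filtering on resource IDs starting with vol- picks out the EBS line items:

    -- ebs_attachments is a hypothetical copy of the DynamoDB mapping table
    SELECT a.instance_id, sum(li.unblended_cost) AS ebs_cost
    FROM awsbilling.line_items li
    JOIN ebs_attachments a ON a.volume_id = li.resource_id
    WHERE li.statement_month = $1
      AND li.resource_id LIKE 'vol-%'
    GROUP BY a.instance_id;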

Combining data across accounts

So far we’ve talked about all of our costs within the context of a single AWS account. However, this doesn’t actually reflect our AWS setup, which is spread across different physical AWS accounts.

We use an ops account not only for consolidated, cross-account billing, but to help provide a single access point for engineers making changes to production. We separate staging from production to ensure that an API call which might, say, delete a DynamoDB table, can be run safely with the appropriate checks. 

Of these accounts, prod dominates the cost–but our staging costs are still a significant percentage of the overall AWS bill. 

Where this gets tricky is when we need to write the data about ECS services in the stage realm to the production Redshift cluster. 

To write across accounts, we needed to allow the Cloudwatch subscription handlers to assume a role in production that can write to Firehose (for ECS) or to DynamoDB (for EBS). These are tricky to set up because you have to add the correct permissions to the right role in the staging account (sts.AssumeRole) and in the production account, and any mistake will lead to a confusing permission error.

For us, this means that we don't have a staging realm for our accounting code, since the accounting code in stage is writing to the production database.

While it’s possible to add a second service in stage that subscribes to the same data but doesn't write it, we decided that we can swallow the occasional problems with the stage accounting code.

Rolling up the statistics

Finally we have all of the pieces we need to run proper analysis: 

  1. tagged resources in the AWS billing CSV

  2. data about when every ECS task started and stopped

  3. a mapping between ECS service names and the relevant product areas

  4. a mapping between EBS volumes and the instances they are attached to

To roll all of this up for the analytics team, I broke out the analysis by AWS product: for each AWS product, I totaled the costs for each Segment product area. 

The data gets rolled up into three different tables:

  1. Total costs for a given ECS service in a given month

  2. Total costs for a given product area in a given month

  3. Total costs for a given (AWS product, Segment product area) pair in a given month. For example, "The warehouses product area used $1000 worth of DynamoDB last month."

The total costs for a given product area look like this:

And the costs for an AWS product combined with Segment product area look like this:

For each of these tables, we have a finalized table that contains the finalized numbers for each month, and a rollup append-only table that writes new data for a month as it updates every day. A unique identifier in the rollup table identifies a given run, so you can sum the AWS bill by finding all of the rows in a given run.

Finalized data effectively becomes our golden ‘source of truth’ that we use for top-level metrics and board reporting. Rollup tables are used to monitor our spend over the course of the month.

Note: AWS does not "finalize" your bill until several days after the end of the month, so any sort of logic that marks the billing record as complete when the month flips over is incorrect. You can detect when the bill becomes "final" because the invoice_id field in the billing CSV will be an integer instead of the word "Estimated".
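One way to check this in Redshift (assuming the CSV’s InvoiceID field is imported as an invoice_id column) is to count the rows still marked "Estimated"; once this returns zero, the month can be treated as final:

    -- invoice_id is an assumed mapping of the CSV's InvoiceID field
    SELECT count(*) AS estimated_rows
    FROM awsbilling.line_items
    WHERE statement_month = $1
      AND invoice_id = 'Estimated';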

A few last gotchas

Before closing, we realized that there are a few places where a little bit of preparation and knowledge could have saved us a lot of time. In no particular order, they are:

  • Scripts that aggregate data or copy it from one place to another are infrequently touched and often under-monitored. As an example, we had a script that copied the Amazon billing CSV from one S3 bucket to another, but it failed on the 27th-28th of each month because the Lambda handler doing the copying ran out of memory as the CSV got large. It took a while to notice this, because the Redshift database had a lot of data and the right-ish numbers for each month. We’ve since added monitoring to the Lambda function to ensure that it runs without errors.

  • Be sure these scripts are well documented, especially with information about how they are deployed and what configuration they need. Link to the source code in other places where they are referenced - for example, any place you pull data out of an S3 bucket, link to the script that puts the data in the bucket. Also consider putting a README in the S3 bucket root.

  • Redshift queries can be really slow without optimization. Consult with the Redshift specialist at your company, and think about the queries you need, before creating new tables in Redshift. In our case we were missing the right sortkey on the billing CSV tables. You cannot add sortkeys after you create the table, so if you don't do it up front you have to create a second table with the right keys, send writes to that one and then copy all the data over.

  • Using the right sortkeys took the query portion of the rollup run from about 7 minutes to 10-30 seconds.

  • Initially we planned to run the rollup scripts on a schedule - Cloudwatch would trigger an AWS Lambda function a few times a day. However the run length was variable (especially when it involved writing data to Redshift) and exceeded the maximum Lambda timeout, so we moved it to an ECS service instead. 

  • We chose Javascript for the rollup code initially because it runs on Lambda and most of the other scripts at the company were in Javascript. If I had realized I was going to need to switch it to ECS, I would have chosen a language with better support for 64 bit integer addition, and parallelization and cancellation of work.

  • Any time you start writing new data to Redshift, the data in Redshift changes (say, new columns are added), or you fix integrity errors in the way the data is analyzed, add a note in the README with the date and information about what changed. This will be extremely helpful to your data analysis team.

  • The blended costs are not useful for this type of analysis - stick to the unblended costs, which show what AWS actually charged you for a given resource.

  • There are 8 or 9 rows in the billing CSV that don't have an Amazon product name attached. These represent the total invoice amount, but throw off any attempt to sum the unblended costs for a given month. Be sure to exclude these before trying to sum costs.

The bottom line

As you might imagine, getting visibility into your AWS bill takes a large amount of work–both in terms of custom tooling and identifying expensive resources within AWS.

The biggest win we’ve found comes from making it easy to continuously estimate your spend rather than running the occasional ‘one-time-analysis’.

To do that, we’ve automated all of the data collection, enforced tagging within Terraform and our CI, and educated the entire engineering team how to properly tag their infrastructure. 

Rather than sitting within a PDF,  all of our data is continuously updated within Redshift. If we want to answer new questions or generate new reports, we can instantly get results via a new SQL query. 

Additionally we’ve exported that data into an Excel model so we can estimate exactly how much a new customer will cost. And we can also see if a single service or a single product area is suddenly costing a lot more, before that causes too much of a hit to our bottom line.

While it may not exactly mirror your infrastructure, hopefully this case study will be useful for helping you get a better sense of your costs and manage them as you scale!

Joe Christiani on May 11th 2017

In a scrappy B2B startup, user feedback is super valuable, but guerrilla research won’t cut it when you need a more targeted group of users. The Segment Design team found the users we needed and developed an automated process for recruitment and coordinating interviews using our own product and a few integrated applications. 

Guerrilla research includes a range of fast and inexpensive techniques for designers and UX researchers (often the same person) to observe how users engage with products in the wild. Rather than recruiting participants, these ad hoc experiments are usually done on friends, peers, or strangers in coffee shops. But what if the users of your product are other businesses? Even worse, what if your product is (gasp) technically complex for a niche audience? The ‘check-out-my-mixtape’ methods that work well for consumer-facing products will get you a lot of blank stares.  

Sourcing users for research and feedback is particularly challenging for the Design Team at Segment because our product is deeply technical. While the customer feedback we gather through success tickets (Zendesk) and NPS surveys is valuable, we needed a way to explore our users’ behaviors and needs in more depth.

To do this, we designed an automated workflow to recruit participants for a UX research program and to coordinate testing and feedback sessions throughout the product development process. This article explains how we developed this workflow and describes the final system, which you can borrow from and improve.

Our First Attempt

Preliminary Goals:

  1. Find users willing to give us feedback.

  2. Interview users to better understand how they were using our product.

  3. Identify pain points in the current experience of using our app.

  4. Do all of the above at regular intervals without spamming our customers! 

Preliminary Solution:

Our initial process was inspired by Mesosphere, who wrote about their experience bootstrapping a UX research program. We used Customer.io to email new users, asking if they would be willing to join our User Experience program. 

Since we shared a Customer.io account with other teams at Segment, we could specify that only users who had not been recently contacted by other teams would receive the emails. (Sorry, thirsty UX researchers: empathy for your users includes not spamming them.) Recipients of the opt-in email would fill out a Google Form, which recorded their email in a spreadsheet.  We would then periodically email our pool of opted-in users with invitations to remote or in-person research.

The upside of this process was that it didn’t take very long to set up, and the Google Suite tools were free.  But as the pool of participants grew, it became clear that the time-intensive nature of manually managing the spreadsheet and sending emails wasn’t scalable. Recalling the proverb, ‘Physician, heal thyself,’ we took a Design Thinking approach and treated our user research program like any other user experience challenge.

Version 1.1: Automating the Workflow

We want to spend our time researching and designing, not sourcing users and coordinating sessions - Segment Design Team

For this iteration, we took a much more rigorous design thinking approach, treating the process holistically and considering both the researcher and the participant as users.

Jobs To Be Done:

  1. As a product designer, when I am exploring or validating an idea, I want to be able to interact with users so I can learn more about them and incorporate this understanding into the product.

  2. When I am doing user research and testing, I want to be able to find users easily, so I can spend my time learning more about them rather than coordinating the process.

  3. When I am engaging in user research as a participant, I want to be able to give feedback quickly and easily, so I can move on to my primary responsibilities.

Problems: 

The first iteration of our user research recruiting and coordinating process required too much manual input on both sides. The opt-in experience for our participants was not ideal, since we were sending an email which, somewhat paradoxically, requested that they enter their email in a Google form.  The pain points on our side centered on the way email addresses of participants and the records of when they had been contacted were trapped in spreadsheets and not accessible in other tools.

What worked well from the last iteration:

Allowing users to explicitly opt in to our program made sure that we weren’t spamming people who weren’t interested in participating.

Solution:

We began to explore ways to streamline the recruitment and coordination process. When we mentioned the project to our developer counterparts, they were aghast at the existing manual process. Apparently repetition is the Comic Sans of engineering.  

As we ideated on how to automate the workflow, a crazy idea emerged: What if we used our product, a platform for customer data, to collect and manage other kinds of data? 

Sidebar: We did use our own product as part of this solution. No, this is not a sales pitch. Yes, there’s a free plan that should let you achieve this workflow.

Roll your own UX Research recruitment system in 20 minutes

1. Fill the top of the funnel

We continued to use Customer.io to send a triggered email to new users who signed up for Segment and ask if they’d like to opt in to the UX Research program.

2. Tag users who opt in

We designed a landing page and launched it with WebFlow. When users reached the page, we used Segment to assign an attribute to their user IDs with a tracking API. In this case, we assigned the attribute ux_research_opt_in=true to the user ID. If users chose to opt out of the program, we simply changed this attribute to ux_research_opt_in=false to remove them from the UX Research program without unsubscribing them from all Segment emails.

Pro tip: If you’re a designer who isn’t also a developer, get some help from a friend with this step. If you’re a designer who doesn’t know any developers, you might have some trouble shipping products.

3. Invite participants to sessions

Using Segment meant that the attribute assigned to the participant’s user ID was available in hundreds of third party tools. Having the opt-in attribute associated with user IDs allowed us to have consistent cohorts across various channels, in this case Customer.io for emails and Intercom for in-app messaging.

We planned to contact participants no more than once a month and sent them Calendly links so they could self-schedule remote or in-person sessions at their convenience for the subsequent two weeks.

Pro tip: Create an email alias for your UX Research program. Otherwise replies and out-of-office messages will end up in your inbox.  This also allows you to create a research and testing calendar that the entire Design Team or broader organization can view.

In addition to the remote and in-person sessions, we also sent out surveys and exercises like card sorting and tree testing asynchronously via Verify and Optimal Workshop.  This was important because it allowed users who live in different time zones across the world to share their perspective.

4. Show some love

Finally, we sent thank you gifts to our participants using Printfection, which lets customers select from various Segment swag and handles all aspects of fulfillment.

The overall flow ended up looking like this:

From an end-user perspective, an invitation email allows people to easily opt in to the program by entering their email address.  They then receive monthly invitations to participate in specific feature tests or research modules with the option to schedule either remote or in-person sessions. Then, all they have to do is attend the session itself.  Everything from the anti-spam opt-in process and self-scheduling to the customizable gifts are designed to give our participants flexibility and control over how often and when they choose to give us feedback.

We’d love to hear from folks tackling similar challenges in user research and design for B2B products.  And if you’d like to participate in our User Experience Program, just sign up for a free Segment account and we’ll be in touch— automatically.

Calvin French-Owen on April 20th 2017

The barrier holding back most open source projects is surprisingly mundane. It’s not test coverage. It’s not performance. It’s not code quality.

No, adoption often comes down to a single question: “is the tool easy to install and use?” 

Do I have to re-compile the project from source? Do I have to wait 15 minutes while I fetch hundreds of dependencies? Do I have to download a new package manager? Do I really need a picture of Guy Fieri?

Most open source authors recognize that configuration and setup is a barrier. And yet, we continue to underestimate how large of a barrier it is. The most successful projects make it as easy as possible to install and deploy a tool out of the box.

And some projects nail it; the massive adoption of Bootstrap, Redis, and Rails is largely due to their ‘out-of-the-box’ nature. And we have package managers like yarn, brew, and docker to thank for such an easy setup.

But hundreds of projects don’t have that “zero-to-60-in-5-seconds” approach.

That got me wondering: if package managers have revolutionized how easy it is to install and run software locally, why is it so much harder to get full applications into production? Why is installing a node module trivial, but running full services with a webserver, database, and job queue so difficult?

Where is our “package manager for the cloud?”

Managed vs Packaged

Before imagining what a cloud package manager might look like, it’s worth defining some terms. In particular, differentiating packaged infrastructure from managed infrastructure.

Packaging gives the user a consistent means of installing and then running software. Docker images, Debian packages, VMs, and dependency managers all come to mind here. The promise of the package manager is that you install it once, and then run all other code through it. 

Managed infrastructure is different. While it is packaged in various ways–the core value proposition is that you don’t handle operations. Instead, hard drive, machine, and system failures are off-loaded to a cloud provider. 

As a client, you agree to respect the limits set by whomever is maintaining the infrastructure. And in return, you get a set of guarantees related to uptime, SLOs, and fault-tolerance.

Managed infrastructure is appealing because it reduces the expertise needed to run complex software. It certainly won’t excuse poor architecture decisions or bad query patterns. But it is incredible that with a single click, you can create a high-throughput database which automatically handles re-balancing, backups, and scaling.

By using managed infrastructure, you’ve effectively ‘leased’ the expertise of site reliability engineers from Google or Amazon, and are willing to pay a small margin on top of the hardware costs to outsource a large portion of that skillset.

What’s most interesting about cloud primitives like RDS or Cloud Spanner is that outsourcing this expertise puts maintenance in the hands of, well, experts. We’re not talking about your usual “cheap outsourced devshop that churns out PHP apps.” Instead, startups get to borrow from internal teams at Amazon and Google–arguably some of the most experienced engineers on the planet when it comes to running large, multi-tenant infrastructure. 

That’s why, for the vast majority of use cases, managed infrastructure just makes sense. It’s not only cheaper, but ultimately more reliable as well.

Of course, leveraging ‘expert knowledge’ is only half the benefit. There’s also the ability to leverage complex (and proprietary) distributed systems like Dynamo or Bigtable. Building your own auto-scaling version of either system requires millions of dollars in R&D. But with the cloud, you can rent hourly access to them for pennies on the dollar.

It’s a trend that is continuing to grow in popularity, and it doesn’t seem to show any signs of slowing.

Unsurprisingly, both packaged and managed software have been undergoing a bit of a renaissance when it comes to new tech. And in particular, local development has been completely re-invented in the last 5 years with the advent of Docker and high-level package managers.

The Rising Tide of Packaged Infrastructure

Ever since the rise of containers and Docker, packaging has come a long way.

It’s now relatively trivial to spin up an entire environment that runs only within containers. 

Running individual containers is straightforward. Simply grab the image you want and supply the right configuration:
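
For instance, booting a throwaway Redis might look like this (the image tag and port mapping are illustrative):

docker run -d --name redis -p 6379:6379 redis:3.2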

But what if you have an entire application run across multiple containers that you would like to coordinate? 

As a trivial example, suppose we want to run a Wordpress installation that connects to a database.

One way to do this would be using Docker-compose. You can see that we lay out our two services, one for our DB: 
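
A sketch of the database service (the image tag and credentials are illustrative):

version: '2'
services:
  db:
    image: mysql:5.7
    environment:
      MYSQL_ROOT_PASSWORD: wordpress
      MYSQL_DATABASE: wordpress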

And one for our Wordpress site:
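
And a sketch of the Wordpress service, which sits under the same services key and points at the db above (the published port is illustrative):

  wordpress:
    image: wordpress:latest
    depends_on:
      - db
    ports:
      - "8080:80"
    environment:
      WORDPRESS_DB_HOST: db
      WORDPRESS_DB_PASSWORD: wordpress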

If those are both defined in our docker-compose.yml file, we can then run both containers on any Docker infrastructure using:
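
Assuming both services live in that file, a single command brings them up together:

docker-compose up -d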

Or we could use Kubernetes Pods, which would even get us encrypted secrets automatically injected into the container. We can even grab the full YAML provided in the examples to boot the resources we need.

Because every cloud provider, container startup, and grad student is writing their own scheduler these days, there’s really no limit to coordinating containers. And while each implementation comes with a certain set of trade-offs and quirks, they all do a pretty good job of orchestrating many different services together.

But here’s the weird thing: we’ve had good packaging for a long time, and we’ve started to see really nice ways to coordinate infrastructure we manage ourselves. And yet, there still aren’t many good tools for configuring managed infrastructure.

Who Manages the Managed?

In the not-so-distant past, there was a reasonable solution to quickly launching managed services: The Heroku Button

With one click, you could instantly get applications running within your Heroku account, along with whatever databases or queues you’d need. No obscure cron jobs, kernel tuning, or arcane init scripts required.

And that was great. IF you run your infrastructure on Heroku. But if you don’t, it’s a totally different story. 

If you’re running on AWS or GCE, now you also have to set up the proper security and network permissions, pay for additional data transfer outside your AWS account, and make sure that Heroku is actually scaling properly. Oh, and it probably uses totally separate monitoring and logging pipelines.

Unless you end up booting that infrastructure as a one-off or totally separate service, there’s still a lot of work to be done to mold an open source project to run successfully within your infrastructure. We’ve done this three or four times now for Segment–and it’s invariably annoying.

As more companies have transitioned off Heroku and onto the bigger clouds, the simplicity and speed of the Heroku button has really fallen by the wayside. And the need for cross-cloud managed infrastructure has appeared. 

So let’s explore the offerings on the market. Each is attacking this problem from a slightly different angle, at different layers of the stack.

An Example: Discourse

Before looking at the different stacks, I’d like to quickly introduce Discourse as a sort of prototypical app that can really benefit from managed infrastructure. 

Discourse is an open source platform for community discussion that you can run within your own infrastructure. 

Users wanting to run Discourse require three separate processes:

  • a Ruby on Rails server running Discourse on a single machine

  • a Postgres server

  • a Redis server

If this is an internal tool under relatively light load, that’s probably it. But if Discourse is running with a large number of clients, we’ll also need:

  • a load balancer

  • auto-scaling rules for the web service

And oh yeah, we’ll probably want some way of backing up that Postgres instance so we don’t drop everything accidentally, as well as a nice way to scale up the number of clients and the size of the database.

Now it’s starting to sound like we either require a full-fledged ops person to spin up this infrastructure, or a nice way of outsourcing it to managed products.

So how can we quickly provision it? 

There are a few major players who all seem well-positioned to tackle this problem: Terraform, Kubernetes, and a handful of cloud startups. Each has a different strategic position, so I’d like to explore how each tool can provide cross-cloud managed infrastructure.

The Terraform Solution

The product with the most cross-cloud adoption to date is Hashicorp’s Terraform.

First, a 30-second intro of Terraform: Terraform is a CLI tool for managing cloud resources. You can use it to provision, and then change the configuration of load balancers, auto-scaling groups, instances, and more. 

Terraform uses a static DSL to create resources. It then tracks these resources in a “state file” and creates “diffs” between your desired state and the current state of your infrastructure. 

Creating instances is simple. As a quick example, the configuration for a bastion resource might look like this:
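
A minimal sketch (the AMI ID and subnet variable are illustrative placeholders):

resource "aws_instance" "bastion" {
  ami           = "ami-0123456789abcdef0"
  instance_type = "t2.micro"
  subnet_id     = "${var.public_subnet_id}"

  tags {
    Name = "bastion"
  }
}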

It has a type (aws_instance) and an identifier (bastion), as well as a bunch of attributes that are passed in as configuration. Whenever we plan and apply, Terraform makes the appropriate AWS API calls to update our resource to match the desired configuration.

What’s more interesting is that groups of resources can be created together using modules. Modules are re-usable pieces of configuration that will automatically create and manage all the resources within it. 

Modules effectively serve as a higher-level API to collections of managed infrastructure. Just what we needed!

As an example for Discourse, we could imagine that the repo itself could package a Terraform folder:
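
Something like the following layout, purely as a sketch:

discourse/
  terraform/
    aws/
      main.tf
      variables.tf
      outputs.tf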

And then anyone interested in booting up Discourse within their own infrastructure could simply reference the module and pass in their own VPC ID.
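
A hypothetical module reference might look like this (the source path and variable names are illustrative):

module "discourse" {
  source = "github.com/discourse/discourse//terraform/aws"
  vpc_id = "${var.vpc_id}"
}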

Internally, this module would then create all of the resources that it needs.

Under the hood, it would give us:

  • auto-scaling instances with an AMI running discourse

  • a managed RDS instance running postgres with hourly backups

  • a managed elasticache instance running redis

And voila! We have our production infrastructure, ready to go! By referencing the module, and passing in our required infrastructure–Terraform will automatically boot all the pieces that we need.

The best part here is that modules can support multiple providers simply by packaging different Terraform configuration for each.

Need the GCE version instead? If the author has built an adapter for it, just reference a new path:
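
Again, purely illustrative:

module "discourse" {
  source  = "github.com/discourse/discourse//terraform/gce"
  network = "${var.network}"
}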

Still want our good ol’ Heroku button? No problem, Terraform supports that too (with managed addons):

Under the hood, this might look something like this:
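
A sketch using the Heroku provider (the app name and addon plans are illustrative):

resource "heroku_app" "discourse" {
  name   = "our-discourse"
  region = "us"
}

resource "heroku_addon" "postgres" {
  app  = "${heroku_app.discourse.name}"
  plan = "heroku-postgresql:standard-0"
}

resource "heroku_addon" "redis" {
  app  = "${heroku_app.discourse.name}"
  plan = "heroku-redis:premium-0"
}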

It solves both the problems of running a service within your own infrastructure and relying on managed infrastructure for any team, regardless of whether they use Heroku or not. 

The Kubernetes Solution

Where Terraform is primarily used to provision the ‘base infrastructure’ and machine images, Kubernetes lives at a different part of the stack. It’s a scheduler and service orchestrator–designed to coordinate services both locally and in the cloud.

As a user, you first describe services within Kubernetes, and then install a Kubelet (the Kubernetes agent) on each host. Kubernetes will then determine how to optimally utilize the cluster such that multiple containers are run and exposed on each machine.

In short, Kubernetes is a scheduler that runs applications and handles service discovery, load balancing, and configuration–all across a cluster of machines.

Ordinarily I wouldn’t even put Terraform and Kubernetes in the same category–since Kubernetes is focused on orchestrating infrastructure that you are responsible for running yourself. It’s not really about booting up managed infrastructure at all. 

However, Kubernetes does support one important cloud primitive: managed load balancers. When booting your service, you can specify as part of the pod that it should manage a load balancer to send traffic to:
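
In Kubernetes terms this is usually expressed as a Service of type LoadBalancer; here’s a sketch with illustrative names, labels, and ports:

apiVersion: v1
kind: Service
metadata:
  name: discourse
spec:
  type: LoadBalancer
  selector:
    app: discourse
  ports:
    - port: 80
      targetPort: 3000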

This load balancer will use the managed load balancers within any of the big clouds: AWS, GCE, Azure, and Openstack to name a few.

By combining primitives like the LoadBalancer with simple package management using Helm charts, Kubernetes is starting to become the common ‘substrate’ that any application can build upon.

The CLI command to install our discourse app might look like this:
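
Assuming someone has published a chart for it (the chart name below is hypothetical):

helm install stable/discourse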

As a developer who wants to support Helm, all we’d have to do is add the corresponding Helm charts: 
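
In other words, check a chart into the repo. A typical chart layout looks roughly like this:

discourse/
  Chart.yaml
  values.yaml
  templates/
    deployment.yaml
    service.yaml
    ingress.yaml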

And we’d allow anyone running Kubernetes to immediately run Discourse within their infrastructure.

What makes this approach interesting is the fact that Kubernetes has been gaining so much traction at the application level. It feels like it would be relatively doable to build a Kubernetes service type of Database or ObjectStore that maps to RDS or S3. It could be the trojan horse that slowly begins to spread Kubernetes pods as the ‘unit’ of managed hosting. 

That said, there’s an immense amount of work involved to gain the sort of provider coverage that Terraform has. Even with the tight application integration that Kubernetes can provide, moving to support the managed infrastructure of just the major clouds would still be a major undertaking. 

The Managed Managed Solution?

As some last food for thought–it seems like there’s a big opportunity here for a startup that manages to smooth over the wrinkles and inconsistencies between various cloud providers. 

Imagine if you really could run your infrastructure dynamically on whichever cloud was cheapest. Or whichever provided the lowest latency. 

It effectively turns the decision of which cloud to use from an annual procurement to a real-time auction. 

There are a handful of startups doing interesting things when it comes to making managed infrastructure more user-friendly:

Convox: Convox provides a CLI and layer for working over AWS. It’s an open source tool where the fundamental abstraction is ‘applications’, just like Heroku. But unlike Heroku, Convox is built entirely atop an AWS pipeline, leveraging ECS for scheduling, Cloudwatch for logs, and Lambda to coordinate scheduled jobs. Because apps and logs are the core abstraction, Convox isn’t tied to the underlying infrastructure, though it certainly leverages AWS utilities heavily. 

Skyliner: Like Convox, Skyliner launched more recently as a nice UI and deploy pipeline that layers over AWS. Their whole pitch is that they will give you a ‘best practice’ AWS setup that’s extremely easy to interface with. While they don’t yet support other clouds, being the ‘owner’ of the customer opens them up to start moving and building higher level primitives that can work across clouds. They ‘own’ the customer in the sense that the user really only interfaces with Skyliner, not AWS.

Zeit: Zeit’s Now takes the most radical approach (and the most similar to  ‘serverless’) of these three providers. Zeit provides the user the ability to upload a node module that it will run and host using only a single API command. Zeit abstracts almost all managed infrastructure away from the user, so they aren’t really sure what cloud they are running upon, or what resources are being consumed under the hood. It’s less about managing cloud infrastructure, and more about solving the problem of running code with applications as the focus. 

We’ve still yet to see these approaches gain more widespread adoption. But it’s still early days for most of them. 

The Second-order Cycle

Now, amongst all the rosy ideas of a universal package manager for the cloud, I intentionally skipped past the elephant in the room. Creating ‘one package manager for any cloud’ is incredibly difficult because cloud providers are incentivized to lock-in customers to their platform.

And overcoming that incentive is tough. It’s a good part of the reason why each cloud provider maintains idiosyncratic APIs, workflows, and specialized tooling. It certainly makes the job of a ‘blanket’ API that papers over these inconsistencies difficult at best.

This barrier is particularly acute with pieces of technology that are “hard” to develop. I’m talking about the R&D-heavy efforts behind proprietary platforms like DynamoDB, Cloud Spanner, BigQuery and Redshift.

But that said, there’s a second-order cycle at play with these sorts of expensive pieces of proprietary software. In many cases, we see databases and streaming systems mature along the following milestones:

  1. the software is developed internally at a larger tech company (Google, FB, etc)

  2. the software architecture is published as an academic paper

  3. an open-sourced reference implementation is released

  4. commercial and cloud support follow the tool’s user adoption

You can see this cycle again and again with the likes of Hadoop, Cassandra, HBase, Kafka, and Kubernetes. All these tools were born out of companies initially, and then gained widespread adoption over time, either as Apache projects or sponsored directly by the company that conceived them.

As a textbook example, take Hadoop. 

Hadoop’s journey from open source project to managed infrastructure followed these same milestones [2].

And Hadoop isn’t the only example: we’ve started to see the emergence of hosted Kafka on Heroku and Spark on AWS.

It’s software in this cycle that does stand a fighting chance to become cross-cloud infrastructure. Databases like Redis, MySQL, and Postgres are so ubiquitous that they have effectively become table stakes that cloud providers must offer as managed products. 

We’ll continue to see these sorts of products emerge as new technology gains widespread adoption. It would come as no surprise if we started seeing hosted Kubernetes deployments emerging outside of just GCE.

Now, there’s certainly still a lot of ground to be made up when it comes to booting up infrastructure “anywhere” and running applications at a higher level of abstraction than instances and networks. But in a world where it’s so easy to boot similar types of infrastructure in different clouds–there’s a huge opportunity to pave over all those differences and create a unified abstraction layer across all of them. 

Instead of treating infrastructure at the machine or even container level–I’m expecting that we’ll start seeing infrastructure where “applications” are the core unit of abstraction. And I’m excited to see how that world plays out.


[1]: Kafka is a great example of this. Simple architecture combined with extremely high throughput. The Kubernetes architecture is another one that allows people to build on it quickly. But these seem to be the exception rather than the rule.

[2]: https://en.wikipedia.org/wiki/Apache_Hadoop#Timeline

Peter Reinhardt on April 12th 2017

Running QA tests for Segment’s UI was taking way too long. Sure, we had strong component-level tests for our UI kit. But to test our whole app we needed to painstakingly poke around looking for oddities.

Manual testing like this is extremely time-consuming, and you can easily miss accidental, small visual differences that degrade the user experience. Shipping even the smallest of these bugs to production then creates an even costlier bug reporting cycle that involves customers and the support team. That’s no good, we wanted a better way!

So we began experimenting with perceptual diffing. Perceptual diffing compares screenshots of new releases pixel-by-pixel, and then highlights the differences.

This article explains exactly what perceptual diffing looks like and how to set up perceptual diffing easily with Nightmare and Niffy — a new open-source library we’re releasing today.

Let’s Play a Game… Can you Spot the Regression?

Below is a real release of Segment’s UI from September 2015. This is a screenshot of our “Workspaces” page on staging (left) and production (right):

Can you see the regression?

Well, there’s actually two regressions! And I didn’t see either of them when I was testing this manually in 2015.

This is where perceptual diffing comes in: it highlights every pixel change. Here’s what Niffy sees:

As you can see, perceptual diffing makes both regressions immediately obvious:

  1. The lock icon is missing from the bottom paragraph of text.

  2. The “Enterprise Plan” text under the “Segment” workspace has been replaced with “Business Plan” (broken logic that should standardize the naming).

That said, not all perceptual diffing highlights are regressions. If you ship an update to part of the product, the diffing will go nuts with red highlights. But that’s a good thing! Perceptual diffing really shines by catching bugs on all the other views, where you expect to see zero changes.

Implementation

When we first heard about perceptual diffing Somewhere on the Internet™, we were quite intrigued. Demos like the one above felt extremely promising for reducing our manual testing burden, and we wanted to get this working for Segment. But as we researched the available tools, they fell into two groups: (1) hosted tools like VisualPing, designed for change detection on public static sites, and (2) open source tools like pdiff, which are aging and also work best on public static sites. The existing tools weren’t the right solution for us because they weren’t able to navigate into our app, click around, and test workflows.

So we decided to build a lightweight perceptual diffing layer on top of Nightmare, our browser automation library. It’s called Niffy and we’ll show you how to use both Nightmare and Niffy below.

The Basics with Nightmare

Perceptual diffing has three main steps:

  1. Capture screenshots of pages and views in your app.

  2. Diff two sets of screenshots and produce a diff-highlight.

  3. Trigger these capture and diff steps at the appropriate moment in the release process.

Capture

Nightmare makes it easy to capture a screenshot. Here’s a fully-functional example:
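
Something like the following (any public URL and output path will do):

const Nightmare = require('nightmare')
const nightmare = new Nightmare()

nightmare
  .goto('https://segment.com')            // any public URL works here
  .screenshot('/tmp/segment-home.png')    // write the capture to disk
  .end()
  .then(() => console.log('captured!'))
  .catch(err => console.error(err))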

But capturing static urls is not that interesting. Where Nightmare really shines is more complex interactions and app states. For example, you likely want to (1) login, (2) navigate to some part of the app, (3) open up a modal and then take screenshots to make sure core workflows are tested. 

Here’s a working example you can copy+paste and run:
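
A sketch along those lines, reusing the Hoodie demo app selectors that appear elsewhere on this blog; swap in your own URL, selectors, and credentials:

const Nightmare = require('nightmare')
const nightmare = new Nightmare({ show: true })

nightmare
  .goto('https://gethoodie.com/auth')
  .type('.login-email-input', 'you@example.com')   // use real credentials here
  .type('.login-password-input', 'your password')
  .click('.login-submit')
  .wait(2000)                                      // give the app a moment to load
  .screenshot('/tmp/logged-in.png')                // capture the logged-in state
  .end()
  .then(() => console.log('captured!'))
  .catch(err => console.error(err))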

With simple Nightmare scripts like this you’re able to get the UI into complex states and easily capture screenshots.

Diff

Once you have matched screenshots of two versions of the same UI, you need a way to generate a highlighted difference. The naive solution is to just take the difference of the pixel values and display that, but this turns out to be unreadable because you just get a giant black image. If you average opacity value instead of taking the difference, you still just get a few randomly colored pixels here and there:

So we dug into other perceptual diffing tools more closely, and then approximately copied what they do: copy over equivalent pixels with partial transparency, and make mismatched pixels red (you can see the exact diffing algorithm we use in Niffy here).
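
The gist of that approach, sketched over raw RGBA pixel buffers (this is illustrative, not Niffy’s exact code):

// a, b, and out are RGBA pixel buffers (e.g. Uint8ClampedArray) of equal length
function highlightDiff (a, b, out) {
  for (let i = 0; i < a.length; i += 4) {
    const same = a[i] === b[i] && a[i + 1] === b[i + 1] && a[i + 2] === b[i + 2]
    if (same) {
      // copy matching pixels over at partial opacity so the layout stays readable
      out[i] = a[i]; out[i + 1] = a[i + 1]; out[i + 2] = a[i + 2]; out[i + 3] = 85
    } else {
      // mismatched pixels become solid red
      out[i] = 255; out[i + 1] = 0; out[i + 2] = 0; out[i + 3] = 255
    }
  }
}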

Trigger

We’ve looked at several different triggers for doing this perceptual diffing. There are a few challenges:

  1. Where do you (reliably) store screenshots of sequential versions?

  2. When exactly is your new release deployed to staging and ready to be diffed?

The answers to these questions are pretty different depending on each company’s cloud provider, continuous integration environment, and deployment process. We’ve found so far that the simplest trigger is to run the diffing manually (make test), comparing staging and production. This is the method we outline next with Niffy.

Niffy Makes this Simple

Niffy is designed to bundle up the capture and diff steps into a library that can be easily used in a mocha test. Niffy exposes the internal Nightmare instance so that you can do arbitrary clicking, typing, checkboxing, etc. before you take your diffing screenshot (see Logged In example below).

Here’s the output of our Niffy tests run at time-of-writing:

All you need to do is run those open /tmp/niffy/…  commands to see immediately what broke…

First, it looks like our Settings Overview page got a big update!

And second, we’re seeing an error alert on staging on the Settings Move Source page that we should fix in our staging environment for better testing:

Setting up Niffy

To help you get started with Niffy, here’s an abbreviated snippet from the diffing test suite (test/index.js) we use on Segment itself (and there’s a ready-made example test suite in the Niffy repo that you can run with make test):

With Makefile:

And test/mocha.opts:

And package.json:

To get started with perceptual diffing, head over to the Niffy repo, or use Nightmare directly. And lastly, if you like building software to solve complicated business problems like this, we’re hiring! Or if you’re working on open source full-time, check out our Open Fellowship!

Fouad Matin on March 23rd 2017

Today we’re proud to announce the Segment Open Fellowship. The Fellowship is a three month long program supporting three to five open-source developers with $8k per month to focus full-time on their project, no other strings attached.

Since the very beginning, open-source has played a critical role in Segment’s journey. As we’ve grown, we’ve continued to try and create useful tools and utilities that we open source to continue giving back to the community that gave us our start. As Segment scales, we’d like to continue scaling our support for the open-source community even beyond our internal needs.

We’ll award grants to 3-5 developers to work on an open-source project for 3 months. They can work out of our office in SF or remotely.

The primary goal of the fellowship is to enable participants to fully dedicate themselves to a project for a few months. We’re hoping to give them a chance to speed the adoption of a new, fast-growing project. Or maybe help them build some long-awaited key features of a library that’s already widely used. Or perhaps even jump-start an entirely new idea altogether.

It’s certainly an experiment–but one which we believe has the potential to benefit developers all over the world. And we couldn’t be more excited to help support progress in the open-source community.

Get the details and apply here →

Know someone who should apply? Let them know on Twitter or on Facebook

Thank you Stripe for the Open Source Retreat and Google for Summer of Code as inspiration for this program.

Achille Roussel, Rick Branson on March 14th 2017

For an early startup, using the cloud isn’t even a question these days. No RFPs, provisioning orders, or physical shipments of servers. Just the promise of getting up and running on “infinitely scalable” compute power within minutes.

But, the ability to provision thousands of dollars worth of infrastructure with a single API call comes with a very large hidden cost. And it’s something you won’t find on any pricing page.

Because outsourcing infrastructure is so damn easy (RDS, Redshift, S3, etc), it’s easy to fall into a cycle where the first response to any problem is to spend more money.

And if your startup is trying to move as quickly as possible, the company may soon be staring at a five, six, or seven figure bill at the end of every month.

At Segment, we found ourselves in a similar situation near the end of last year. We were hitting the classic startup scaling problems, and our costs were starting to grow a bit too quickly. So we decided to focus on reducing the primary contributor: our AWS bill.

After three months of focused work, we managed to cut our AWS bill by over one million dollars annually. Here is the story of how we did it.

Cash rules everything around me

Before diving in, it’s worth explaining the business reasons that really pushed us to build discipline around our infrastructure costs.

The costs for most SaaS products tend to find economies of scale early. If you are just selling software, distribution is essentially free, and you can support millions of users after the initial development. But the cost for infrastructure-as-a-service products (like Segment) tends to grow linearly with adoption. Not sub-linearly.

As a concrete example: a single Salesforce server supports thousands or millions of users, since each user generates a handful of requests per second. A single Segment container, on the other hand, has to process thousands of messages per second–all of which may come from a single customer.

By the end of Q3 2016, two thirds of our cost of goods sold (COGS) was the bill from AWS. Here’s the graph of the spend on a monthly basis, normalized against our May spend.

Our infrastructure cost was unacceptably high, and starting to impact our efforts to create a sustainable long-term business. It was time for a change.

Getting a lay of the land

If the first step in cost reduction is “admitting you have a problem”, the second is “identifying potential savings.” And with AWS, that turns out to be a surprisingly hard thing to do.

How do you determine the costs of an environment that is billed hourly with blended annual commits, auto-scaling instances, and bandwidth costs?

There are plenty of tools out there that promise to help optimize your infrastructure spend, but let’s get this out of the way: there is no magic bullet.

In our case, this meant digging through the bill line-by-line and scrutinizing every single resource.

To do this, we enabled AWS Detailed billing. It dumps the full raw logs of instance-hours, provisioned databases, and all other resources into S3. In turn, we then imported that data into Redshift using Heroku’s AWSBilling worker for further analysis.

It was a messy dataset, but some deep analysis netted a list of the top ~15 problem areas, which totaled up to around 40% of our monthly bill.

Some issues were fairly pedestrian: hundreds of large EBS drives, plus over-provisioned cache and RDS instances. These were relics left over from incidents of increased load that had never been sized back down.

But some issues required clear investment and dedicated engineering effort to solve. Of these, there were three fixes which stood out to us above all else:

  • DynamoDB hot shards ($300,000 annually)

  • Service auto-scaling ($60,000 annually)

  • Bin-packing and consolidating instance types ($240,000 annually)

The long-tail of cost reductions accounted for the remaining $400,000/year. And while there were a handful of lessons from eliminating those pieces, we’ll focus on the top three.

DynamoDB hot shards

Segment makes heavy use of DynamoDB for various parts of our processing pipeline. Dynamo is Amazon’s hosted version of Cassandra–it’s a NoSQL database that acts as a combination K/V and document store. It has support for secondary indexes to do multiple queries and scans efficiently, and abstracts away the underlying partitioning and replication schemes.

The Dynamo pricing model works in terms of throughput. As a user, you pay for a certain capacity on a given table (in terms of reads and writes per second), and Dynamo will throttle any reads or writes that go over your capacity. At face value, it feels like a fairly straightforward model: the more you pay, the more throughput you get.

However, correctly provisioning the throughput required is a bit more nuanced, and requires understanding what’s going on under the hood.

According to the official documentation, DynamoDB splits data across partitions based upon a consistent hashing scheme.

Under the hood, that means that all writes for a given key will go to the same server and same partition.

Now, it makes common sense that we should distribute reads and writes so they are uniformly distributed. You don’t want a hot partition or single server which is being constantly overloaded with writes, while your other servers are sitting idle.

Unfortunately, we were seeing a ton of throttling even though we’d provisioned significantly more capacity on our DynamoDB instances.

To get an understanding of the upstream events, our dynamo setup looks something like this:

We have a bunch of unpartitioned, randomly distributed queues that are read by multiple consumers. These objects are then written into Dynamo. If Dynamo slowed down, it would cause the entire queue to back up. And what’s more, we would have to increase throughput capacity far more significantly than the required write throughput in order to drain the queue.

What had us confused was that our keys are partitioned by the tracked end user, and keys spread across hundreds of millions of users per day should distribute the write load uniformly. We’d followed the exact recommendation from the AWS documentation.

So why was Dynamo still getting throttled? It appeared there were two answers.

The first was the fact that the throughput pricing for dynamo actually dictates the number of partitions rather than the total throughput.

It’s easy to overlook, but the Amazon DynamoDB docs state the following when it comes to partitions:

A single partition can hold approximately 10 GB of data, and can support a maximum of 3,000 read capacity units or 1,000 write capacity units.

The implication here is that you aren’t paying for total throughput, but rather for partition count. And if a few keys saturate the same individual partition, you have to double the table’s capacity just to split that hot partition in two, rather than scaling capacity linearly with your needs. Even then, a single key is still limited to the throughput of one partition.
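
To make that concrete with illustrative numbers: provision a table at 6,000 write units and DynamoDB will spread it across at least six partitions, so each partition only gets about 1,000 of those units. A single hot key lives on exactly one partition, which means it tops out around 1,000 writes per second no matter how much total capacity you buy.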

When we talked with the AWS team, their internal monitoring told a different story than our imagined ‘uniform distribution’. And it explained why we were seeing throughput far below what we had provisioned:

This is a heatmap they provided of the total partitions, along with the key pressure on each. The Y-axis maps partitions (we had 647 partitions on this table) and the X-axis marks time over the course of the hour. More frequently accessed ‘hot’ partitions show up as red, while partitions that aren’t accessed show up as blue.

Vertical, non-blue, lines are good–they indicate that a bulk load happened, and was evenly spread across the keyspace, maximizing our throughput. However, if you look down at the 19th partition, you can see a thin streak of red:

Uh oh. We’d found our smoking gun: a single slow partition.

It was clear something needed to be done. The heat map they provided was a major key, but its granularity is at the partition level, not the key level. And unfortunately, there’s no out-of-the-box way to identify hot keys (hint hint!).

So we dreamt up a simple hack to give us the data we needed: anytime we were throttled by DynamoDB, we logged the key. The table’s provisioned capacity was temporarily reduced to induce the throttling behavior. And then logs were aggregated together and the top keys were extracted.
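
The hack itself is only a few lines. A minimal sketch in Node (the table and attribute names are illustrative, and our pipeline isn’t literally this code):

const AWS = require('aws-sdk')
const dynamo = new AWS.DynamoDB.DocumentClient()

function write (item) {
  return dynamo.put({ TableName: 'events', Item: item }).promise()
    .catch(err => {
      if (err.code === 'ProvisionedThroughputExceededException') {
        // log the partition key so hot keys can be aggregated offline
        console.log('throttled key:', item.userId)
      }
      throw err
    })
}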

The findings? A number of keys that were the result of, shall-we-say, “creative” uses of Segment.

Here’s an example of what we were seeing:

Spot the issue?

At a certain time every day, it appeared as though a daily automated test was run against our production API, resulting in a burst of hundreds of thousands of events attached to a single userId (literally user_id in this case). And that userId was either set statically, or incorrectly interpolated.

While we can fix bugs in our own code, we can’t control our customers.

It was clear from examining each case that there was no value in properly handling this data, so a set of blocked keys (“userId”, “user_id”, “#{user_id}” and variants) was built from the throttling logs. Over a few days we slowly decreased the provisioned capacity, blocking any newly discovered badly behaved keys. Eventually we reduced capacity by 4x.

Of course, fixing individual partitions and blacklisting keys is only half the battle. We’re in the process of moving from NSQ to Kafka, which will provide proper partitioning upstream of Dynamo. That partitioning will ensure that we are batching writes efficiently and merging changes on a small subset of servers, rather than spreading writes globally.

Service auto-scaling

A little bit of background on our stack: Segment adopted a micro-service architecture early on. We were among the first users of ECS (EC2 Container Service) for container orchestration, and Terraform for managing all of our AWS resources.

ECS manages all of our container scheduling. It’s a hosted AWS service, which requires each instance to run a local ECS-agent. You submit jobs to the ECS API, and it communicates with the agent running on each host to determine which containers should run on which instances.

When we first started using ECS, it was easy to auto-scale instances, but there was no convenient way to auto-scale individual containers.

The recommended approach was to build a frankensteinian pipeline of Cloudwatch alerts which would trigger a Lambda function that updated the ECS API. But in May 2016, the ECS team launched first class auto-scaling for services.

The approach is fairly simple. It’s effectively the same as the automated approach, but requires a lot fewer moving parts.

Step one: set limits on CPU and memory thresholds for the ECS service:
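
In the console this is just a form; if you prefer to script it, the roughly equivalent CLI calls look like this (the cluster, service, role ARN, and capacity numbers are illustrative):

# Register the service's desired count as a scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/our-cluster/our-api \
  --role-arn arn:aws:iam::123456789012:role/ecsAutoscaleRole \
  --min-capacity 4 \
  --max-capacity 40

# Then attach scaling policies that react to the service's CPU and memory
# CloudWatch metrics (omitted here for brevity)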

It takes about 30 seconds to do, and then the service will automatically scale the number of tasks up and down in relation to the amount of resources it’s using.

Step two: we enabled our instances to scale based upon the desired ECS resource allocation. That means if a cluster no longer had enough CPU or memory to place a given task, AWS would automatically add a new instance to the auto-scaling-group (ASG).

How are the results?

In practice this works really well (as modeled by our API containers):

Our traffic load pretty closely follows the U.S. peaks and troughs (large rise at 9:00am EST). Because we only have 60% of peak traffic at nights and on weekends, we’ve been able to save substantially by adding auto-scaling, and not have to worry about sudden traffic spikes.

The additional benefit has been automatically scaling down after over-provisioning to deal with excess load. We no longer have to run at 2x the capacity, since the capacity is set dynamically. Which brings us to the last improvement: bin packing.

Bin packing and consolidating instance types

We’d long contemplated switching to bigger instances, and then packing them with containers. But until we started on “project benjamin” (the internal name for our cost-cutting effort), we didn’t have a clear plan to get there.

There’s been a lot written about getting better performance from running on bigger virtual hosts. The general argument is that you can get less steal from noisy neighbors if you are the only one on a physical machine. And there’s more likelihood that you will be the sole VM on a physical machine if you are running at the largest possible instance size.

There’s a handful of additional benefits as well: fewer hosts means a lower cost of per-host monitoring and quicker image rollouts.

Moreover, if you are using the same instance type (big or small) you can get a much cheaper bill using reserved instances. Reserved instances are nearly 40% off the per-hour price, but require an annual commit.

So, we realized it was in our best interest to start consolidating the instances we were running on, and start building an army of c4.8xlarges (our workload is largely compute and I/O bound). But to get there, we needed one prerequisite: moving off elastic load balancers (ELBs) to the new application load balancers (ALBs).

To understand what moving to ALBs gives us vs the classic ELB, it’s worth talking through how they work under the hood.

From our best estimation, ELBs are essentially built atop an army of small, auto-scaling instances running HAProxy.

When using ECS with ELBs, each container runs on a single host port specified by the service definition. The ELB then connects to that port and forwards traffic to each instance.

This has three major ramifications:

  1. If you want to run more than one service on a given host, each service must listen on a unique port so they don’t collide.

  2. You cannot run two containers of the same service on a single host because they will collide and attempt to listen on the same port. (no bin packing)

  3. If you have n running containers, you must keep n+1 hosts available to deploy new containers (assuming that you want to maintain 100% healthy containers during deploys).

In short, using ELBs in combination with ECS required us to over-provision instances and stack only a few services per instance. Hello cost city, population: us.

Fortunately for us, the port collision problem was solved with the introduction of the ALB.

The ALB allows ECS to set ports dynamically for individual containers, and then pack as many containers as can fit onto a given instance. Additionally, the ALB uses a mesh routing system vs individual hosts, meaning that it does not need to be ‘pre-warmed’ and can scale automatically to meet traffic demands.

In some cases, we’re currently packing 100-200 containers per instance. It’s dramatically increased our utilization and cut the number of instances required to run our infrastructure (at the same time as we 4x’d api volume).

Utilization over time

Easy by default with Terraform

Of course, it’s easy to cut costs with these sorts of focused ‘one-time’ efforts. The hardest part of maintaining solid margins is systematically keeping costs low as your team and product scale. Otherwise, we knew we would be doomed to repeat the process in another 6 months.

To do that, we had to make the easy way, the right way. Whenever a member of the eng team wanted to add a new service, we had to ensure that it would get all of our efficiency measures for free without extra boilerplate or configuration.

That’s where Terraform comes in. It’s the configuration language we use at Segment to provision and apply changes to our production infrastructure.

As part of our efforts, we created the following modules to give our teammates a high-level set of primitives that are “efficient by default”. They don’t have to supply any extra configuration, and they’ll automatically get the following by using our modules:

  • Clusters to configure an Auto Scaling Group linked to an ECS cluster.

  • Services to setup ECS services that are exposed behind an ALB (Application Load Balancer).

  • Workers to setup ECS services that consume jobs from queues but don’t expose a remote API.

  • Auto-Scaling as a default behavior for all hosts and containers running on the infrastructure.

If you’re curious about how they fit together, you can check out our open-sourced version on Github: The Segment Stack. It contains all of these pieces out of the box, and will soon support per-service autoscaling automatically.
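
To give a flavor of what “efficient by default” looks like to a service owner, here’s a purely illustrative sketch; the real module names and variables live in the Stack repo:

# Illustrative only; consult the Segment Stack repo for the real interface.
# The point is that a service owner writes a few lines and inherits
# auto-scaling, ALB wiring, and bin-packing defaults.
module "my-service" {
  source  = "github.com/segmentio/stack//service"
  name    = "my-service"
  image   = "my-org/my-service"
  port    = 3000
  cluster = "${module.stack.cluster}"
}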

Takeaways

After being in the weeds for three months, we managed to hit our goal. We eliminated over $1M in annual spend from our AWS bill, and managed to increase our average utilization by 20%.

While we hoped to share some insights behind a few of the very specific issues we encountered in our effort to reduce costs, there are a few bigger takeaways that should be useful for anyone looking to increase the efficiency of their infrastructure:

Efficient By Default: It’s important that efficiency efforts aren’t just a rule book or a one-time strategy. While cost management does require ongoing vigilance, the most important investment is to prevent problems from occurring in the first place. The easy mode should be efficient. We accomplished this by providing an environment and building blocks in Terraform that made services efficient by default.

However, this extends beyond configuration tools, and includes picking infrastructure that simplifies capacity planning. S3 is notoriously great at this: it requires zero up-front capacity planning. When considering a SQL database, where the team may have picked MySQL or PostgreSQL, consider using something like Amazon’s Aurora. Aurora automatically scales disk capacity in 10GB increments, eliminating the need to plan capacity ahead of time. After this project, efficiency became our default, and is now part of how our infrastructure is planned.

Auto-scaling: During this effort we found that auto-scaling was incredibly important for efficiency, but not only for the obvious reason of scaling along with demand. In practice, engineers would configure their service to give them a few months of headroom before they had to re-evaluate their capacity allocation. This meant that services were actually being allocated far above their weekly peak requirements. That configuration itself is often imperfect, and wastes precious engineering time tuning these settings. At this point, we’d say that ubiquitous auto-scaling is a practical requirement for a micro-services architecture. It’s relatively easy to manage capacity for a monolithic system, but with dozens of services, this becomes a nightmare.

Elbow Grease: There are some tools that aid with cloud efficiency efforts, but in practice it requires serious effort from the engineering team. Don’t fall for vendor hype. Only you know your systems, your requirements, your financial objectives, and thus the right trade-offs to make. Tools can make this process easier, but they’re no magic bullet.


For any growing startup, cost management is a discipline that has to be built over time. And like security or internal policies, it’s often far easier to institute the earlier you start measuring it.

Now that all is said and done, we’re glad cost-management and measurement is a muscle we’ve started exercising early. And it should continue to have compounding effects as we continue to scale and grow.

Peter Reinhardt on February 27th 2017

Nightmare is a browser automation library for node.js, designed to be much simpler and easier to use than Phantomjs. We originally built Nightmare to create integration logos with 99Designs Tasks before they had an API, and we still use it in Sherlock. But the vast majority of Nightmare developers—now 55k+ downloads per month—use it for web UI testing and crawling.

This article is a quick introduction to using Nightmare for web UI testing. It uses Mocha as the testing framework, but you could similarly use Jest.

Overview

Nightmare’s API methods are designed to mimic real user actions:

  • .goto(url)

  • .type(elementSelector, text)

  • .click(elementSelector)

This makes testing with Nightmare very similar to how a human tester would navigate, click and type into your actual web app. In the next few sections we’ll dive into how to set up your repo, then how to test page loads, form submissions, and app interactions.

Repo Setup

First we need to install mocha and nightmare, and make sure our basic test harness is working.

Starting on the command line in your repo folder…
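
Something along these lines should do it (exact versions will differ):

mkdir -p test
npm install --save-dev mocha nightmare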

In test/test.js you can get started with:

const Nightmare = require('nightmare')
const assert = require('assert')

describe('Load a Page', function() {
  // Recommended: 5s locally, 10s to remote server, 30s from airplane ¯\_(ツ)_/¯
  this.timeout('30s')

  let nightmare = null
  beforeEach(() => {
    nightmare = new Nightmare()
  })

  describe('/ (Home Page)', () => {
    it('should load without error', done => {
      // your actual testing urls will likely be `http://localhost:port/path`
      nightmare.goto('https://gethoodie.com')
        .end()
        .then(function (result) { done() })
        .catch(done)
    })
  })
})

Add mocha as the test script to your package.json:

"scripts": { "test": "mocha" }

Finally, to test this complete setup you can run npm test on the command line…

npm test

> Load a Page
>   ✓ should load a web page (12223ms)
> 1 passing (12s)

Loading a Page

Most web products have a set of public pages used for documentation, support, marketing, authentication and signup. Here’s how you can test that these pages load successfully:

describe('Public Pages', function() {
  // Recommended: 5s locally, 10s to remote server, 30s from airplane ¯\_(ツ)_/¯
  this.timeout('30s')

  let nightmare = null
  beforeEach(() => {
    nightmare = new Nightmare()
  })

  describe('/ (Home Page)', () => {
    it('should load without error', done => {
      // your actual testing urls will likely be `http://localhost:port/path`
      nightmare.goto('https://gethoodie.com')
        .end()
        .then(function (result) { done() })
        .catch(done)
    })
  })

  describe('/auth (Login Page)', () => {
    it('should load without error', done => {
      nightmare.goto('https://gethoodie.com/auth')
        .end()
        .then(result => { done() })
        .catch(done)
    })
  })
})

Submitting a Form

This example tests that Hoodie’s login function fails with bad credentials. It’s always worth testing failed states as well as successful states. 🤖

describe('Login Page', function () {
  this.timeout('30s')

  let nightmare = null
  beforeEach(() => {
    // show true lets you see wth is actually happening :)
    nightmare = new Nightmare({ show: true })
  })

  describe('given bad data', () => {
    it('should fail', done => {
      nightmare
        .goto('https://gethoodie.com/auth')
        .on('page', (type, message) => { if (type == 'alert') done() })
        .type('.login-email-input', 'notgonnawork')
        .type('.login-password-input', 'invalid password')
        .click('.login-submit')
        .wait(2000)
        .end()
        .then()
        .catch(done)
    })
  })
})

Using the App

This example is more involved, and includes signing up with text fields, select fields, and clicking and waiting through a flow that spans multiple pages.

describe('Using the App', function () {
  this.timeout('60s')

  let nightmare = null
  beforeEach(() => {
    // show true lets you see wth is actually happening :)
    nightmare = new Nightmare({ show: true })
  })

  describe('signing up and finishing setup', () => {
    it('should work without timing out', done => {
      nightmare
        .goto('https://gethoodie.com/auth')
        .type('.signup-email-input', 't'+Math.round(Math.random()*100000)+'@test.com')
        .type('.signup-password-input', 'valid password')
        .type('.signup-password-confirm-input', 'valid password')
        .click('.signup-submit')
        .wait(2000)
        .select('.sizes-jeans-select', '30W x 30L')
        .select('.sizes-shoes-select', '9.5')
        .click('.sizes-submit')
        .wait('.shipit') // this selector only appears on the catalog page
        .end()
        .then(result => { done() })
        .catch(done)
    })
  })
})

All Together Now

The final example ties all these together into a cleanly formatted test/test.js:

const Nightmare = require('nightmare')
const assert = require('assert')

describe('UI Flow Tests', function() {
  this.timeout('60s')

  let nightmare = null
  beforeEach(() => {
    nightmare = new Nightmare({ show: true })
  })

  describe('Public Pages', function() {
    describe('/ (Home Page)', () => {
      it('should load without error', done => {
        // your actual testing urls will likely be `http://localhost:port/path`
        nightmare.goto('https://gethoodie.com')
          .end()
          .then(function (result) { done() })
          .catch(done)
      })
    })

    describe('/auth (Login Page)', () => {
      it('should load without error', done => {
        nightmare.goto('https://gethoodie.com/auth')
          .end()
          .then(result => { done() })
          .catch(done)
      })
    })
  })

  describe('Login Page', function () {
    describe('given bad data', () => {
      it('should fail', done => {
        nightmare
          .goto('https://gethoodie.com/auth')
          .on('page', (type, message) => { if (type == 'alert') done() })
          .type('.login-email-input', 'notgonnawork')
          .type('.login-password-input', 'invalid password')
          .click('.login-submit')
          .wait(2000)
          .end()
          .then()
          .catch(done)
      })
    })
  })

  describe('Using the App', function () {
    describe('signing up and finishing setup', () => {
      it('should work without timing out', done => {
        nightmare
          .goto('https://gethoodie.com/auth')
          .type('.signup-email-input', 'test+'+Math.round(Math.random()*1000000)+'@test.com')
          .type('.signup-password-input', 'valid password')
          .type('.signup-password-confirm-input', 'valid password')
          .click('.signup-submit')
          .wait(2000)
          .select('.sizes-jeans-select', '30W x 30L')
          .select('.sizes-shoes-select', '9.5')
          .click('.sizes-submit')
          .wait('.shipit') // this selector only appears on the catalog page
          .end()
          .then(result => { done() })
          .catch(done)
      })
    })
  })
})

If you have additional questions or want to join the 90+ people who have contributed to Nightmare, head over to the Github repo. Happy testing.

Peter Reinhardt on October 19th 2016

At Segment, focus is one of our four core values. But it was difficult for team members to focus in the office, so in June we ran an internal team survey about what helps and hurts focus. The results showed that “chatter and noise” was one of the biggest culprits for distraction around the office. “Slack group channels” came in second.

These answers left us with two difficult questions: how do you solve a noise problem in an open floor plan? And where is the noise even coming from?

To get to the bottom of it, I decided to build an iOS app to collect decibel levels from around the office. We found that noise levels varied widely throughout the office, and using the new data, we changed the office layout to increase our ability to focus. Numerically speaking, the increased focused time (as measured by survey) has been equivalent to hiring 10–15 teammates. And beyond the numbers, it feels great to focus more. 😃

Where is the noise coming from?

At first we thought we were just being a bit too chatty. But demanding “be quiet!” is horrible in a collaborative work environment. We also noticed something odd: when people walked into our office they’d often say, “Wow! This is one of the quietest offices I’ve ever been to.” Of course, the survey said the opposite… that the office was noisy and distracting.

This discrepancy was particularly confusing because our office is an open floor plan. Sound ought to travel well around such a big open space. It was strange that we had two widely divergent stories around the quietness and loudness of the office.

Here’s a picture showing the high ceilings, plants and open layout:

And here’s our floorplan showing the lack of walls… lots of ways for sound to bounce around (view above shown in blue):

The conflicting anecdotal stories from visitors and teammates were perplexing. To get to the bottom of things, we needed hard data. And what better way to collect that data than the ambient sensors called iPhones already sitting around our office? So I built an app to record decibel levels in different areas and give us some real data to understand the situation.

The iOS app is tiny: it uses the Apple AVAudioRecorder class’s level-metering to passively collect and report average and maximum ambient decibel levels every 10 seconds to our server monitoring tool Datadog. We’ve open-sourced our Decibel noise-recording app for you to use. Just add your Datadog API key and off you go.

Originally I planned to ask a bunch of people to install it on their phones around the office. But then our VP Engineering had a much better idea: deploy it on the iPads we have outside every conference room.

Below you can see our data collection points (iPad minis) as red dots outside each room:

The graph below shows measurements from August 31, 2016, clearly showing spikes of noise in the office throughout the day (the absolute values are arbitrary, but the data is good for relative comparison):

Among other things, you can see a full 10 dB difference between the quietest and loudest parts of the office! 10 dB feels roughly “twice as loud,” so this is a big difference!

Finally, here are the average noise level results (plus some manual interpolation of the sparse data collection points) overlaid on the floor plan of the office (red loudest, green quietest):

This resolved the mystery. The front of the office where visitors hang out was twice as quiet (10dB quieter) compared to the areas where the people work. Both sets of anecdotal stories were right. When I showed this graph to Tony, one of our security guards, he said “Oh yeah, it’s WAY quieter up at the front of the office, even at night.”

What to do about it?

We can’t immediately ditch our open floor plan (although we’re looking at various options for our next office.) But this new noise level data gave us an obvious way to reduce distractions: the teams needing the quietest work area (engineering, product and design) should move to the quietest part of the office.

Last month we made the big move. The teams needing the most verbal collaboration — Segment’s sales, support, and marketing teams — moved to the naturally louder parts of the office. The teams needing the most quiet — engineering, product, and design — moved to the quietest parts of the office.

We’re still dialing in parts of the office that became a bit cramped, but a re-run of our original focus survey with the team showed that total focus time had increased from 45% to 60% of time in the office! In a purely numerical sense, you could equate that to hiring 10–15 people. It also feels awesome to focus.

The combination of survey and noise data has been incredibly helpful in iterating towards a more productive office, and we still have lots of ideas to test. For example, we’ve started an experiment with some teams in a new “war room” layout, and will keep looking for other ways to optimize the environment to be productive and fun, measuring results as we go. We’d love to compare results if you’ve experimented and measured in your own space.

Lauren Venell on October 11th 2016

When it comes to your app, size makes a difference. Bigger apps have fewer downloads, worse reviews, and a harder time penetrating the international market. We measured the exact impact of increased app size, shown below. We’ve also included learnings on how to prevent bloat in your own app.
