Leif Dreizler on March 31st 2020
Segment receives billions of events every day from thousands of customers that trust Segment to keep their data safe. At Segment, we believe that good security is an essential part of building high-quality software, similar to reliability or scalability. In addition to the tools and processes developed by our Security Org to help software engineers make good security choices, we also rely on reviews from both traditional and crowdsourced security companies.
Bug bounties are often seen as a significant burden for the security teams that run them. You’ll hear horror stories about companies launching a bug bounty, their security team getting inundated with low-quality reports, duplicate submissions, and researchers going outside the scope of the program.
Shortly after, you'll hear about their engineering team being overwhelmed with newly discovered vulnerabilities to fix. From those that survive the opening salvo, you may hear complaints that, over time, they have stopped receiving impactful submissions.
A few years ago, when the space was less mature, critics questioned whether running a program was worth it. Now, it is expected that organizations of a certain size and maturity run a bug bounty program.
In late 2019, the Cybersecurity and Infrastructure Security Agency (CISA) of the United States published a draft directive recommending all agencies develop a mechanism to accept information related to security vulnerabilities from external parties.
In this blog post, we’ll break down how to start and manage a bug bounty program, consistently achieve good results, and maintain healthy relationships with the people that power the program.
If you’re short on time, check out the “Top Tips” section at the bottom of this post.
A bug bounty program is like a Wanted Poster for security vulnerabilities.
Companies running bug bounty programs pay independent security researchers from across the world for security vulnerabilities in their own products and infrastructure.
I assume most readers are already bought into the benefits of running a bug bounty.
Most companies that have an internet presence or make an internet-connected device should consider running a bounty, or at least have a way for security researchers to report security issues.
It is also part of the Vendor Security Alliance questionnaire, so it may be something your customers ask you about if you are in the B2B space.
If you don’t have a way for researchers to report issues, they will email people at your company (or any alias they can find on your website), connect with you on LinkedIn, or just tweet about the issues they think they’ve found.
It’s a much better experience for researchers, your employees, and your customers if you give the security community a clear avenue to report vulnerabilities.
Your security and engineering orgs will be regularly impressed by the creativity of the researcher community. These are people that, without internal knowledge, can find critical vulnerabilities in organizations across the world.
I strongly recommend using a bug bounty platform like HackerOne or Bugcrowd (we use Bugcrowd here at Segment) to help manage this process. These companies provide a platform and services to help run an efficient program.
Severity baselines make it easier to tell how serious a vulnerability is, and how much time you should be spending on review and remediation.
When running a program on your own, you’ll frequently have researchers overhyping vulnerabilities. Platforms have a guided submission form, which helps researchers pick the appropriate category and rating.
The reputation systems reward researchers that accurately rank vulnerabilities and create a competitive environment that benefits both researchers and program owners.
It also helps reinforce good behavior. Any researcher discipline issues have stricter consequences. If a researcher misbehaves, they may be banned from the platform.
To submit vulnerabilities via these platforms, researchers have to agree not to disclose the vulnerability without approval from your company.
Both platforms also provide triage services, which I highly recommend paying for. These are the first line of defense for your internal security resources. These globally distributed teams will help clean up researcher reports, mark submissions as duplicates, and filter out low-quality reports.
These companies also serve as a knowledge base for you to learn about running a program and ask questions. You can bounce ideas off of someone that works at a company running hundreds of programs.
Platforms have structured input with required fields and integrations with popular tools, like Jira. These make it much easier to turn a submission into a ticket for your engineering org.
For most companies, it isn’t possible to run a private program without the help of a bug bounty platform.
We’ll talk about private programs in more depth later, but this is the recommended starting point for companies launching a bug bounty for the first time.
All of the above features free your team to focus on the security challenges unique to your business.
Having a successful program starts with a good foundation, and it’s your job as a program owner to help set your organization up for success.
Think about your current process for handling a security vulnerability. What happens when someone internally or externally finds a security bug?
You will inevitably get more of these after starting a bug bounty program, so you must have a good way to handle these reports.
Your vulnerability process doesn’t have to be perfect, but you should have a way to prioritize and assign bug reports to the appropriate engineering team without high overhead.
As someone starting a program, you’ll also need to get buy-in from your Engineering org. They are the ones that will have to fix the issues and will likely be the ones responding to alerts triggered by researchers.
Your alerting story doesn’t need to be perfect. But you also don’t want engineers to be woken up every time someone triggers an error because some input validation was working correctly and stopped a researcher from submitting a < symbol into an email address field.
Remember, your team doesn't have to fix every valid vulnerability immediately.
Vulnerabilities are just bugs and should be prioritized appropriately. If you’re having trouble getting your engineering org to fix critical vulnerabilities in a timely manner, you may want to direct your efforts to job-hunting instead of starting a bug bounty program 🙃
Once your organization is bought-in, you can focus on getting things ready for the researchers.
Your bounty brief is what researchers will read to determine if they’re interested in working on your program. It’s part instructions, part rules, and part advertisement.
Keep in mind you’re competing for researchers’ time; they don’t have to work on your program when there are so many other programs.
Your bounty brief should be clear, concise, and should set expectations with the researchers. You can find the Segment program hosted on Bugcrowd.
Where do you want researchers to focus their testing? What’s in scope? What’s out of scope?
I recommend starting with assets that are easy for the researchers to access, ideally something free, that anyone can sign up for.
You should also try to pick a target that has at least medium size, complexity, and business impact. This will help you show value early, which will help you expand the program.
How do researchers get access to the scope? Are there docs they can read to help them get up to speed? We instruct researchers to sign up for our app using their @bugcrowdninja.com email address and to include -bugcrowdninja as part of their workspace slug.
This makes it easier for us to determine if someone is part of our Bugcrowd program when we review logs and alerts. If we notice someone causing problems in our app, we can ask Bugcrowd to provide researcher coaching.
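As a rough illustration of how that naming convention can be used when reviewing logs and alerts, the sketch below flags accounts that follow it. The function name and the idea of checking both fields are assumptions made for this example, not a description of Segment's internal tooling.

```python
# Markers taken from the bounty brief's sign-up instructions.
BOUNTY_EMAIL_DOMAIN = "@bugcrowdninja.com"
BOUNTY_SLUG_MARKER = "-bugcrowdninja"


def is_bounty_researcher(email: str, workspace_slug: str) -> bool:
    """Return True if the email or the workspace slug follows the brief's convention.

    Hypothetical helper: either signal alone is treated as enough, because
    researchers may follow only one of the two instructions.
    """
    email_matches = email.strip().lower().endswith(BOUNTY_EMAIL_DOMAIN)
    slug_matches = BOUNTY_SLUG_MARKER in workspace_slug.strip().lower()
    return email_matches or slug_matches
```

A check like this could tag an alert as likely bounty traffic before a human looks at it, so the reviewer knows whether to follow up through the platform.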
How are you going to rate submissions? Consistent severity is important because it impacts how much the researcher gets paid for a submission. HackerOne uses MITRE’s Common Weakness Enumeration (CWE) and Bugcrowd uses the Bugcrowd Vulnerability Rating Taxonomy (VRT).
How much are you going to pay for vulnerabilities? Researchers need to know upfront how much you’ll pay for vulnerabilities so they can assess if it is worth their time to hunt on your program.
Think about using different reward ranges for different assets. This can help control costs and also helps researchers understand which targets are more important. For example, we describe specific objects that will net a higher reward:
A handful of years ago, getting a T-shirt as a reward was pretty standard. I’d strongly encourage anyone thinking about running a swag-based bounty to reconsider.
T-shirts don’t pay rent and are more work for your team than sending a researcher money. What do you do when that T-shirt you sent to another continent is lost in the mail or doesn’t fit?
We reserve swag for our top performers. Sending a T-shirt requires the researcher to trust us enough to give us their address and requires me to go to the post office.
Take the time to explain what a bug bounty is, why it's important, and have a few examples from recognizable organizations ready to show them.
Learn a little bit about the platform you’re using. Your actions on the platform impact the researcher. If you mistreat researchers, they will go elsewhere; without researchers, your program isn’t providing value to your organization.
The same report status can have different meanings and impact on different platforms.
For example, on HackerOne, Not Applicable reduces the researcher’s site-wide score by 5 points, and should be used for reports that don’t contain a valid issue. On Bugcrowd, Not Applicable does not impact the researcher’s score, and is commonly used for reports that should be neither accepted nor rejected. To achieve this result on HackerOne, you would use a different status.
If you have any questions about the platform you’re using, I strongly recommend reviewing documentation or reaching out to your account manager for help.
Regardless of how big your company’s internet footprint is, you can start with a small scope open only to a handful of individuals as part of a private program.
In mid-2017, Segment was running a private program with 25 researchers and a single target: our app.
The early researchers invited will be some of the platform’s most trusted, and they will generally be more accepting of companies that are learning how to manage a program, as long as you pay them fairly and treat them with respect.
Starting small allows your organization to learn how to run a program in a safer environment. If your vulnerability management program has some gaps, you can fix them; if your bounty brief is unclear, you can rewrite it; if your alerts aren’t tuned properly, you can invest time into improving them. If you need to pause your program, you can relaunch later with a less negative impact.
Even while we had a private program, we would direct researchers that reached out via email to our Bugcrowd program. This allowed us to receive the benefits of the platform and triage services for all submissions before running a public program.
It’s much easier to explain to a researcher why you won’t be paying for a low-effort submission when you have a prioritization system established and enforced by a third party.
Like any multi-year project, your bug bounty will evolve and require ongoing effort to keep it healthy.
Researchers love testing new features and assets; in most bug bounty programs, only the first person to find a vulnerability receives a monetary reward.
If you started with a small scope, your program is steady, and you’re ready for more submissions, this is a great time to add more targets to your brief.
As you add scope, keep in mind that not all assets are of equal value to an adversary. It is encouraged to specify different reward ranges for different assets based on their security maturity and value.
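One way to picture per-asset reward ranges is as a small lookup table keyed by asset and priority. Everything in the sketch below is hypothetical: the asset names, the dollar amounts, and the catch-all entry are placeholders chosen for illustration, and the P1 to P4 labels only mirror the style of priority ratings used in Bugcrowd's VRT.

```python
from typing import Optional, Tuple

# Hypothetical reward table: asset names and amounts are illustrative only.
# Each entry maps a priority label to a (minimum, maximum) reward in dollars.
REWARD_RANGES = {
    "core-app": {"P1": (2000, 5000), "P2": (1000, 2000), "P3": (400, 1000), "P4": (100, 400)},
    "marketing-site": {"P1": (1000, 2000), "P2": (500, 1000), "P3": (200, 500), "P4": (50, 200)},
    "other": {"P1": (500, 1000), "P2": (250, 500), "P3": (100, 250), "P4": (50, 100)},
}


def reward_range(asset: str, priority: str) -> Optional[Tuple[int, int]]:
    """Look up the reward range for an asset, falling back to the catch-all tier.

    Returns None when the priority has no reward in this sketch (for example,
    an informational finding).
    """
    asset_table = REWARD_RANGES.get(asset, REWARD_RANGES["other"])
    return asset_table.get(priority)
```

Publishing the table in the bounty brief, rather than keeping it internal, is what lets researchers see which targets matter most.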
Over time, you should also consider an “open scope” bounty if it is appropriate for your organization. We have listed as a target, “Any host/web property verified to be owned by Segment (domains/IP space/etc.),” which serves as a catch-all for anything not explicitly listed in our “In Scope” or “Out of Scope” sections.
Having an open scope bounty is enticing to researchers. Not only does it show you take running a bug bounty program seriously, it also shows that regardless of where they find a vulnerability, it will likely be rewarded (assuming it is valid, unique, and not out of scope).
Many researchers specialize in finding forgotten internet-facing assets as part of open-scope programs, and have developed their own infrastructure to identify assets and vulnerabilities to be able to efficiently earn rewards.
It’s also worth noting that there is no scope for an attacker trying to compromise your company’s security. Working towards an open scope means that it is more likely a bug bounty researcher will find and report a vulnerability before an attacker exploits it.
Over time, you’ll build trust and form relationships with particular researchers. These are great people to give early access to upcoming features. Many times, these features require manual provisioning, making them less suitable for wide-scale testing.
Early access is a mutually beneficial system in which you will receive security vulnerabilities prior to release, which makes them easier to fix. Researchers will be able to test features with less competition, which makes them more likely to earn a reward and continue testing on your program.
If the effort to set up these features is medium or higher, consider paying the researcher a grant before they start working.
Clearly communicate what you're looking for and what you expect from them. When offering a researcher grant, we want to see a short write-up of what didn't work in addition to any findings they submit. Rewardable findings should be paid in addition to the grant.
Once you’ve been running a program for a while and are confident in your company’s ability to receive vulnerabilities from the global researcher community, you should consider evolving it into a public program.
If you don’t have a wide scope, this is a great time to revisit that decision.
Maximizing your scope (while private) will reduce the uptick in submissions when your program is launched publicly. You should also invite as many researchers as possible to your private program before going public for the same reason.
Because public programs are open to anyone, you will inevitably receive testing from a lot of newer folks that will pay less attention to your bounty brief, so having a wide scope helps in this regard as well.
Segment has run a public bug bounty program since late 2018, roughly 18 months after launching our private program.
Hopefully over time, you will think of your outsourced triage team as an extension of your internal team. Spending the time to let them know how you want your program run will pay dividends in the future. Any submission they can validate without asking your team questions saves time for everyone involved.
Here are some examples of guidance we’ve given to the Bugcrowd triage team:
Identify duplicates for non-rewardable submissions
Many programs do not bother to mark informational, out of scope, or other non-rewardable submissions as duplicates. We do this for two reasons:
The first is that if we decide to fix one of these issues later, we can go back and mark the original submission as resolved and pay the researcher. Any duplicates of this issue will still receive points.
The second is that when there is a false positive identified by a tool commonly used by bug bounty hunters, you will get this submitted to your program a lot.
If we repeatedly see an out-of-scope or not reproducible submission, we can add a specific item in our bounty brief to warn researchers; it will be marked as out-of-scope or not reproducible without a working proof of concept.
Don’t be afraid to deduct points for undesired behavior
While we are generally laid-back and understanding program owners, we aren’t afraid to deduct points from a researcher’s score when it’s warranted.
Many programs shy away from deducting points, but we want to ensure participants in our program thoroughly read our brief, and we think that it helps the larger bug bounty community to slightly penalize those that disregard clearly documented rules.
Two of the common arguments against bug bounty programs are that the submissions are often low-value and that researchers don’t respect scope.
For example, we have a very small out-of-scope section, which includes:
CORS or crossdomain.xml issues on api.segment.io without proof of concept.
This issue, which is flagged by a tool commonly used by bug bounty participants, is a finding we have received dozens of times, but never with any impact.
We do this to save time for both researchers and our triage team. If a researcher submits this finding without a proof of concept, we encourage Bugcrowd to mark this as out-of-scope. If a researcher submits a finding that shows impact, we would be more than happy to reward, fix, and update our brief.
If you need to deviate from the baseline rating established in your bounty brief, take the time to explain to the researcher why the rating and reward are higher or lower than they might expect.
Researchers are generally understanding, as long as your rating, reward, and explanation are fair and make sense. If you find yourself commonly deviating from the ratings, it may be time to make changes to your bounty brief so that researchers know what to expect in advance. If you make severity or scope changes as the result of a submission, reward the researcher at whichever rate is more favorable to them.
In addition to explaining why something was rated lower than the baseline, take the time to explain why something was rated higher than the baseline. This is a great way to encourage further testing in these areas and is a great way to build trust with a researcher.
These explanations also help your triage team learn more about your program, and allow them to more accurately triage future submissions.
Take time to build relationships and trust with researchers, especially those that repeatedly submit to your program. Treat researchers fairly, with respect, and consider paying for anything that brings value to your organization.
You’re competing for a researcher’s time, especially the ones that are the most talented. They can likely work on almost any bug bounty program available; think about ways you can encourage them to work on yours.
Keep in mind that all researchers, even those that are unskilled, are human beings. Assume that they want to help you secure your organization, learn more about security and technology, and get paid.
If there is one sentence you remember from this blog, I hope it is “pay for anything that brings value.”
Bug bounty hunters put in a lot of time and effort that doesn’t result in getting paid. This could be time spent developing tooling, hunting without finding any bugs, or having a valid bug marked as a duplicate.
Try to avoid thinking about individual bug costs. Instead, think about the overall value the program brings to your organization in terms of bugs found, time saved, and peace of mind. If you’re debating between two severities, pick the higher one and pay the researcher at that rate. You can always change the severity in your internal ticketing system later.
Once you’ve received a rewardable submission, try to triage and pay quickly. Sometimes determining the full impact takes time; if this is the case, add a comment letting the researcher know you appreciate their work but need some extra time to determine the appropriate reward.
Work collaboratively with the researcher
As an employee of your company, you should know more about the codebase and infrastructure than a security researcher outside your organization (although occasionally I question this based on the creative and impactful submissions we receive 😅).
Sometimes when running a bug bounty program, you’ll get a submission that makes you think, “What’s the next level the researcher could get to?” If this is a researcher you trust, it may be appropriate to give them some hints to help further their testing. If you give them hints, you can also issue some cautionary advice to help them continue in a way that is safe for your organization and customers.
Giving the researcher hints helps show them you value their testing and saves your team from spending time on something that may not be possible. If the hint is helpful, the researcher will be submitting a higher-severity finding, which positively impacts their researcher score and earns a higher monetary reward. It also allows you to get the vulnerability fixed faster due to the higher severity.
Sometimes, it isn’t appropriate to include the researcher in this phase of the process. If our team continues the investigation, and it leads to the discovery of a higher-impact weakness in our systems, we reward the researcher as if their report contained the full impact. We also explain why we paid them at this higher rate, but let them know we are unable to share the details. This is a great way to show the researcher you value their work and build trust.
Share progress with the researcher
If a researcher submits a vulnerability that leads to a systemic fix for a vulnerability class, share this with them! Researchers are generally excited to hear that their work led to meaningful change within your organization. It also is a cue for them to attempt to bypass the new protections.
Pay for Dupes
At Segment, we commonly pay researchers for well-written duplicates, and frequently reach out to let them know that we appreciated their submission. We also let them know that we don’t always pay for duplicates to make sure that expectations are set appropriately.
This has worked out incredibly well for us. All of our most critical submissions have come from researchers that were originally rewarded for a well-written duplicate. Segment is a complex product that takes time to set up and fully understand. Researchers that put in the effort to fully set up a Segment workspace have additional context and understanding that take time to acquire—these people are hard to replace, and you want to keep them happy.
Pay bonuses for well-written reports
We also pay extra for well-written reports. Valid submissions need to get turned into Jira tickets which are assigned to engineering teams. Reports that are concise, easy to follow, have clear impact, and are well-formatted take less time for us to turn into tickets. We want to encourage researchers to save us time so we make sure to reward appropriately and let them know that we appreciate their efforts.
Running a successful bug bounty program requires consistent effort from your team, but can bring tremendous value to your company and customers. Any vulnerability reported and fixed is one fewer vulnerability an attacker could use to get a foothold in your organization. Bug bounty submissions can help illuminate vulnerability trends, which can help prioritize where you spend resources to fix systemic issues in your applications or infrastructure.
Bug bounty programs are people-powered. Spend the time to make those involved in your program feel valued, help them understand the motivations behind your decisions, and be excellent to each other!
Thanks for taking the time to read my post! I hope you learned a few things to help your company run a successful program. Here are some of my top tips to reference later:
Pay for anything that brings value
Pay extra for well-written reports, even if they’re dupes
Avoid thinking about individual bug costs
Partner with a bug bounty platform and pay for triage services
If you make changes to your bounty brief as the result of a submission, reward the researcher at the more favorable rate
Invest time into building and maintaining relationships with your researchers and triage team
Don’t be afraid to deduct points for bad behavior
Start small and partner early with Engineering
Write a clear and concise bounty brief to set expectations with the researchers
A special thanks to the Segment Engineering organization for fixing vulnerabilities and responding to alerts.
To Edis from Bugcrowd for helping us triage three years of vulnerabilities and truly being an extension of our team.
To all the researchers that have helped keep our customers safe by looking for vulnerabilities as part of our program.
And finally, to researchers danieloizo and sheddow, you have both submitted numerous well-written and high impact findings and are an absolute pleasure to work with.
Peter Reinhardt on October 5th 2016
A few months ago we were at a conference in Half Moon Bay talking with a general manager at a large mobile company, and he said, “One of my projects for the next six months is to reduce SDK bloat in all our apps.” Six months is a substantial investment! So we asked why this mattered so much to him.
He gave three reasons:
They’re up against the 100MB app size cellular download limit, and they need to find ways to reduce the size of their app so that they don’t take a hit on installs.
Occasionally their SDK vendors have bugs that crash their app, giving them bad reviews and lowering installs.
Often their SDK vendors are unprepared for major iOS version upgrades, blocking key releases and big announcements.
While it makes intuitive sense that a large app download size would demotivate people and reduce installs, we wanted to dig deeper into these reasons and really assess the quantitative impact. Are we sure app size reduces installs? How big is the effect?
So, over the last few months we’ve done some extensive research into these issues… and in this article we’re going to share some new experimental results on how app size affects install rate.
To measure the impact of increased app size, we needed to buy a small app, with no active marketing activities but significant steady downloads. Then we needed to increase the app’s size, leaving everything else constant, and observe the impact on app install rate. To the best of our abilities, this would simulate the impact of SDK bloat, or anything else that just makes an app bulky.
So, we bought the Mortgage Calculator Free iOS app through some friends in the YC Founders network. It was a minuscule 3MB, had a steady pattern of organic installs (~50 installs per day for several years), and had no active marketing activities.
Then, without making any additional changes, we bloated the app from 3MB to 99MB, 123MB and then finally 150MB, observing the isolated impact on install rate with each change in app size. In the real world, app sizes can increase substantially with the addition of seemingly simple things, like an SDK, an explainer video, a bunch of fonts, or a beautiful background picture for your loading screen. For the purposes of our experiment, we just bloated the app with a ton of hidden album art from our engineering team’s favorite artist.
To quantitatively measure the impact of each successive bloating, we looked at data provided directly by Apple in iTunes Analytics; specifically conversion from “Product Page Views” to “App Units” (or colloquially “installs”/“install rate”).
With the larger app sizes, we saw substantial losses in product page to app install rate. In particular, there was a substantial drop around the cellular download limit (~100MB), above which Apple does not let users download the app over 3G or 4G. (Note: the conversion rate can be greater than 100% due to installs direct from search results that skip product page view.)
From these results we estimate a linear decrease in install conversion rate below the cellular download limit from 3–99MB at 0.45% per MB. Above the cellular download limit we estimate a linear decrease in install rate of 0.32% per MB. To our best estimate, the gap between the two lines is covered by a 10% instantaneous install rate drop across the cellular download limit. (Although Apple says the cellular download limit is 100MB, we found in practice that a 101MB IPA did not trigger the cellular download block… the actual limit was somewhere between 101MB and 123MB and varied depending on the exact build.)
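The estimates above can be combined into a rough piecewise-linear model. This is only a sketch under stated assumptions: it treats the per-MB figures as percentage points of installs lost relative to the 3MB baseline, and it places the cellular-limit drop at 99MB, even though the measured limit fell somewhere between 101MB and 123MB. The function and constant names are made up for this example.

```python
BASELINE_MB = 3.0          # original app size
CELLULAR_LIMIT_MB = 99.0   # assumption: breakpoint placed at the top of the 3-99MB range
SLOPE_BELOW = 0.45         # estimated % of installs lost per MB below the limit
SLOPE_ABOVE = 0.32         # estimated % of installs lost per MB above the limit
LIMIT_DROP = 10.0          # estimated one-time % drop when crossing the limit


def estimated_install_loss(size_mb: float) -> float:
    """Estimate the percentage of installs lost relative to the 3MB baseline."""
    if size_mb <= BASELINE_MB:
        return 0.0
    if size_mb <= CELLULAR_LIMIT_MB:
        return SLOPE_BELOW * (size_mb - BASELINE_MB)
    # Above the limit: loss accumulated up to the limit, plus the one-time
    # drop, plus the shallower slope for every additional MB.
    loss_at_limit = SLOPE_BELOW * (CELLULAR_LIMIT_MB - BASELINE_MB)
    return loss_at_limit + LIMIT_DROP + SLOPE_ABOVE * (size_mb - CELLULAR_LIMIT_MB)
```

Under these assumptions the sketch gives a loss of about 43% at 99MB and roughly 70% at 150MB, so it should be read as a rough approximation of the fitted lines rather than a reproduction of the measured totals.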
Increasing the size of our app from 3MB to 99MB reduced installs by 43%, and the increase to 150MB reduced installs by 66% in total. For mobile companies striving for growth, an increase in app size is extremely costly.
How to destroy an app
In an attempt to be proper growth scientists, we tried to replicate the experiment by returning the app to its original 3MB size (plus other intermediate sizes) and re-measuring the install rate. Unfortunately, as a result of our earlier bloating, the app attracted several critical ratings and reviews, which stick around forever:
In our measurements, the app’s growth appears to be semi-permanently damaged and we only saw a minor rebound to 59% install conversion rate for the 16 days the 3MB version was available. (As a side note you can see these customers cite 140MB and 181MB as the download sizes… the true download size varies depending on the customer’s device and OS version.)
As an example, we randomly chose to inspect the NBC Sports app that’s been popular and featured on the App Store during the Olympic games. The app is 90.5MB in total. By far the biggest part of the app’s size is images, accounting for 51% of the entire app’s size, with code (23%), fonts (16%) and video (9%) accounting for most of the rest.
Images were both in the raw app package and in the asset catalog (hidden away inside assets.car). There are huge numbers of local station & team logos, startup screens, and soccer field layouts.
Within the “code” category (which is a single encrypted file) are around a dozen different SDKs, a handful of which we could detect and estimate for size contribution. SDKs contributed roughly 3.5MB to the total app size, with an estimated impact of -1.54% install rate.
Overall, there’s an incredible amount of low-hanging fruit in optimizing an app’s size. One particularly honorable mention is the (presumably accidental) inclusion of the Adobe VideoHeartbeat SDK docs (955KB) in the production IPA package downloaded through iTunes.
While most SDKs are small individually, the net impact of installing many SDKs in your app is a significant increase to the app’s size, and therefore a meaningful negative impact on install rate, not to mention engineering time & maintenance. SDK vendors can be unprepared for new iOS releases, which can block key releases and announcements. Or, SDKs can occasionally introduce bugs, and even a single crash can mean a would-be user will never use your app again.
Therefore, it makes sense to limit the number of SDKs you bundle into your app in order to optimize for performance. But you still need to make sure you’re tracking and collecting key lifecycle events. Segment built an easy, lightweight solution called the Native Mobile Spec. The Native Mobile Spec automatically collects events that allow you to measure top mobile metrics without any tracking code.
Now that we conclusively know how to destroy an app, we also know how to improve one. Here are a few key steps that you can take to make sure your app is ready for the holiday season:
Use our mobile app size calculator to find out the exact size of your app.
Review our guide for getting your app ready for launch, for a complete set of tips and advice on how to make your app great.
Read about how Segment can help mobile teams.
A version of this blog post originally appeared on Recode.
Andy Jiang on July 28th 2016
At Segment, we’re working hard to make our mobile SDKs the best possible collection options for your analytics data. An SDK can make your data more durable, minimize data transfer, and optimize your app’s battery usage. In this article, we’ll take you through what happens under the hood as a piece of data flows through our iOS and Android SDKs from a button handler in your app all the way through to our API.
When you initialize Segment’s iOS and Android library, it will automatically start tracking a few app lifecycle events that are part of our Native Mobile Spec.
Application Installed — User installs your application for the first time.
Application Opened — User opens your application.
Application Updated — User updates to a newer version of your app.
The automatic lifecycle tracking allows you to focus on building your app and thinking less about your analytics tracking plan.
Now that we’ve initialized the library, we’re ready for user interaction.
Let’s take a look at a fictitious mobile app, Bluth’s Banana Stand, and see what happens with our SDKs when the user completes an order.
Listening on the purchase button handler, you’ll fire off a .track() call.
It’s safe to invoke this call directly from your button handler, as our SDKs automatically dispatch it to a background queue thread.
These queues represent various threads on the mobile device. The .track() event is moved to the background queue so the UI queue can continue to operate for the user.
Once dispatched to the background thread, the event is immediately written to the device’s disk to maximize the chances of deliverability. In a world where power can be lost at any time, this method maximizes the durability of your data. (You can read about why we chose QueueFile for reliable request batching on Android).
On the way out of the queue and onto Segment’s tracking API, Segment’s SDKs use two different strategies to optimize your app’s bandwidth and power usage:
Batching — Combining multiple events into each outbound request allows us to minimize the number of times we power up the wireless hardware.
Compression — Compressing each request with gzip allows us to drastically reduce the amount of bandwidth used.
Before a request is sent to our servers, each event is batched together with other events. Powering up the radio on every call wastes battery life, so auto-batching lets our SDKs send events in groups instead.
Visual representation of batched events, with the green circle representing a
We batch our requests to minimize the number of times we access the underlying hardware. This allows us to send fewer requests, spaced further apart, and therefore allows the wireless hardware to remain offline for the majority of the time.
Next, we’ll compress the outbound request body with gzip, reducing the total size drastically.
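As a rough illustration of why batching and gzip work well together, the Python sketch below serializes a set of similar events into one body and compresses it. The payload shape (a top-level `batch` key) and the event fields are assumptions made for the example, not a description of the exact wire format.

```python
import gzip
import json


def build_batch(events):
    """Serialize a list of events as one JSON request body."""
    return json.dumps({"batch": events}).encode("utf-8")


# 100 similar events, as a button handler might produce over a session.
events = [
    {"type": "track", "event": "Order Completed", "userId": f"user-{i}"}
    for i in range(100)
]

# Unbatched: one request body per event.
unbatched_bytes = sum(len(json.dumps(e).encode("utf-8")) for e in events)

# Batched and gzip-compressed: a single request body.
body = build_batch(events)
compressed = gzip.compress(body)

print(unbatched_bytes, len(body), len(compressed))
```

Because analytics events repeat the same keys and many of the same values, a batch compresses far better than any single event would on its own.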
In our testing, we sent one thousand .track() calls without the SDK (directly to our API) and with the SDK. The amount of data on the wire was 1.1 MB without the SDK and 63 KB with it, roughly a 17x saving in bandwidth. We’re able to accomplish this via intelligent batching and aggressive data “middle-out compression” technologies.
These are the bandwidth impacts realized with our SDKs’ batching and compression.
The battery story is even better—we’re able to reduce the wasted energy by almost 3x from 56% overhead to 20% overhead. The net result of this is low average energy impact on the app and much more efficient battery usage—and happier users!
These are the battery enhancements our SDKs enable. (Xcode).
Finally, the request leaves the mobile device and makes its way to our servers.
By tracing the SDKs’ functionality from tap to API, we’re able to see how a few simple data transfer strategies can significantly strengthen your mobile analytics stack:
Increase durability — Immediately persist every message to a disk-backed queue, preventing you from losing data in the event of battery or network loss.
Auto-Retries — If the network is spotty or not available, our SDKs will retry transferring the batch until the request is successful. This drastically improves data deliverability.
Reduce Network Usage — Each batch is gzip compressed, decreasing the number of bytes on the wire by 10x-20x.
Save Battery — Because of data batching and compression, Segment’s SDKs reduce energy overhead by 2-3x which means longer battery life for your app’s users.
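The retry behavior in the list above can be sketched in a few lines of Python. This is a simplified illustration, not the SDK code: the attempt cap, the exponential backoff, and the `flaky_send` transport are all assumptions added for the example.

```python
import time


def send_with_retries(send, batch, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call send(batch) until it succeeds or max_attempts is reached.

    Returns the number of attempts used; re-raises the last error on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(batch)
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical flaky transport that fails twice before succeeding.
calls = {"count": 0}


def flaky_send(batch):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("network unavailable")


attempts = send_with_retries(
    flaky_send, [{"event": "Order Completed"}], sleep=lambda seconds: None
)
```

Since the batch is already persisted on disk, a failed attempt costs nothing but time: the same batch is simply sent again.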
These features would not have been possible without the help of our incredibly supportive open source community. With 80+ contributors, 550+ pull requests, and 240+ releases, our iOS and Android libraries have just kept improving. Thanks to their work, more than 3,000 apps now collect customer data using Segment’s mobile SDKs every day.
We can’t wait to keep improving the mobile analytics ecosystem alongside you. If you have any ideas, we’re all ears! Send a pull request or email us at firstname.lastname@example.org. 😃
Want to learn about and grow your mobile users? Sign up today.
Interested in building infrastructural components for mobile apps and SDKs? We’re hiring.
Calvin French-Owen on June 23rd 2016
As part of our push to open up what’s going on internally at Segment – we’d like to share how we run our CI builds. Most of our approaches follow standard practices, but we wanted to share a few tips and tricks we use to speed up our build pipeline.
Powering all of our builds are CircleCI, Github, and Docker Hub. Whenever there’s a push to Github, the repository triggers a build on CircleCI. If that build is a tagged release, and passes the tests, we build an image for that container.
The image is then pushed to Docker Hub, and is ready to be deployed to our production infrastructure.
Before going any further, I’d like to talk about the elephant in the room: Travis CI. Pretty much every discussion of CI tools has some debate around using Travis CI vs CircleCI. Both are sleek, hosted, and responsive. And both are incredibly easy to use.
Honestly, we love both tools. We use Travis CI for a lot of our open source libraries, while CircleCI powers many of our private repos. Both products work extremely well and do a fantastic job meeting our needs.
However, there’s one feature we really love about CircleCI: SSH access.
Most of the time, we don’t have any problems configuring our test environments. But when we do, the ability to SSH into the container running your code is invaluable.
To put this in perspective, Segment runs our entire infrastructure with hundreds of different microservices. Each one is pulled from a different repo, and runs with a few different dependencies via docker-compose (more on that later).
Most of our CI is relatively standard, but occasionally setting up a service with a fresh environment requires some custom work. It’s in a new repo, and will require its own set of dependencies and build steps. And that’s where being able to run commands from within the environment you’re testing against is so handy – you can tweak configuration right on the box.
No more hundreds of “fixing CI” commits!
To work with all of these different repos, we wanted to make it trivially easy to set up a repo so that it has CI enabled. We have three different circle commands we use regularly, which are shared amongst our common dotfiles. First, there’s circle(), which sets up all the right environment variables and automatically enables our Slack notifications.
Additionally, we have a circle.open() command, which automatically opens the test results in your browser right from your CLI.
And finally, there’s the circle.badge() command to automatically add a badge to a repo.
Now given the fact that we have hundreds of repos, we have the task of keeping all the testing scripts and repositories in sync when we make changes to our circle.yml files.
Maintaining the same behavior across a few hundred repos is annoying, but we’ve decided we’d rather trade abstraction problems (hard to solve) for investing more heavily in tooling (generally easier).
For that, we use a common set of scripts inside a shared git repo. The scripts are pulled down every time the test runs, and handle the shared packaging and deployments. Each service’s circle.yml file looks something like this:
It means that if we change our deploy scheme, we only have to update the code in one place rather than updating each individual repo’s circle.yml. We can then reference different scripts depending on what sorts of builds we need within the individual service repo.
Finally, the entire build process wouldn’t be possible without Docker containers. Containers have greatly simplified how we push code to production, test against our internal services, and develop locally.
When testing our services, we make use of docker-compose.yml files to run our tests. That way, a given service can actually test against the exact same images in CI as are running in production. It reduces the need for mocks or stubs.
What’s more, when the images are built by CI–we can pull those same images down and run them locally as well.
To actually build that code and push it to production, CircleCI will first run the tests, then check whether the build is a tagged release. For any tagged releases, we have CircleCI build the container via a Dockerfile, then tag it and push the deploy to Docker Hub.
Instead of using latest everywhere, we explicitly deploy the tagged image to Docker Hub, along with the major version (1.x) and minor version (1.2.x).
This way, we’re able to specify rollbacks to a specific version when we need it – or deploy the latest build of a certain release branch if we don’t need a specific version (useful for local development and in docker-compose files).
The code to do this is relatively straightforward. First, we detect the versions:
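As a rough sketch of that version detection, written in Python purely for illustration, the function below derives the exact, major, and minor tags from a release tag. The function name `image_tags` and the accepted tag format (an optional `v` followed by `major.minor.patch`) are assumptions.

```python
import re


def image_tags(git_tag):
    """Return Docker tags for a release tag like 'v1.2.3' or '1.2.3'.

    Produces the exact version plus floating major ('1.x') and minor ('1.2.x') tags.
    """
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", git_tag)
    if match is None:
        raise ValueError(f"not a release tag: {git_tag!r}")
    major, minor, patch = match.groups()
    return [f"{major}.{minor}.{patch}", f"{major}.x", f"{major}.{minor}.x"]


tags = image_tags("v1.2.3")
print(tags)
```

Each returned tag would then be applied to the same built image, so a rollback can pin the exact version while local development follows a floating one.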
And then we build, tag, and push our docker images:
Once our images are pushed to Docker Hub, we’re guaranteed to have the right version of the code built so that we can deploy it to production and run it inside ECS.
Thanks to containers, our CI pipeline gives us much better confidence when deploying our microservices to production.
So there you have it: our CI build pipeline, heavily powered by Github, CircleCI, and Docker.
While we’re constantly trying to find ways to make the entire pipeline a bit more seamless, we’ve been happy with the low maintenance, parallelization and isolation provided by using third-party tools.
On that note, if you’re managing a large number of repos, we’d love to hear about your own techniques for managing your build pipeline. Drop us a note by email (friends@segment) or on Twitter!
Calvin French-Owen on June 15th 2016
AWS is the default for running production infrastructure. It’s cheap, scalable, and flexible to whatever configuration you’d like to run on top of it. But that flexibility comes with a cost: it makes AWS endlessly configurable.
You can build whatever you want on top of AWS, but that means it’s difficult to know whether you’re taking the right approach. Pretty much every startup we talk with has the same question: “What’s the right way to set up our infrastructure?”
To help solve that problem, we’re excited to open source the Segment AWS Stack. It’s our first pass at building a collection of Terraform modules for creating production-ready architecture on AWS. It’s largely based on the service architecture we use internally to process billions of messages every month, but built solely on AWS.
The steps are incredibly simple. Add 5 lines of Terraform, run terraform apply, and you’ll have your base infrastructure up and running in just three minutes.
It’s like a mini-Heroku that you host yourself. No magic, just AWS.
Our major goals with Stack are:
to provide a good set of defaults for production infrastructure
to make the AWS setup process incredibly simple
to allow users to easily customize the core abstractions and run their own infrastructure
To achieve those goals, Stack is built with Hashicorp’s Terraform.
Terraform provides a means of configuring infrastructure as code. You write code that represents things like EC2 instances, S3 buckets, and more–and then use Terraform to create them.
Terraform manages the state of your infrastructure internally by building a dependency graph of which resources depend on one another:
and then applies only the “diff” of changes to your production environment. Terraform makes changing your infrastructure incredibly seamless because it already knows which resources have to be re-created and which can remain untouched.
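The idea of a resource dependency graph can be illustrated with a small Python sketch. This is not how Terraform is implemented; the resource names are hypothetical, and the "affected" calculation is a deliberately naive stand-in for Terraform's planning, which often updates resources in place.

```python
from graphlib import TopologicalSorter

# Hypothetical resources mapped to the resources they depend on.
depends_on = {
    "vpc": set(),
    "subnet": {"vpc"},
    "security_group": {"vpc"},
    "instance": {"subnet", "security_group"},
}


def creation_order(graph):
    """Order resources so each one comes after everything it depends on."""
    return list(TopologicalSorter(graph).static_order())


def affected_by(graph, changed):
    """Return the changed resource plus everything that transitively depends on it."""
    affected = {changed}
    grew = True
    while grew:
        grew = False
        for resource, deps in graph.items():
            if resource not in affected and deps & affected:
                affected.add(resource)
                grew = True
    return affected


order = creation_order(depends_on)
to_revisit = affected_by(depends_on, "subnet")
print(order, to_revisit)
```

With the graph in hand, a tool can create resources in a safe order and limit a change to the resources downstream of it, leaving everything else untouched.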
Terraform provides easy-to-use, high level abstractions for provisioning cloud infrastructure, but also exposes the low-level AWS resources for custom configuration. This low-level access provides a marvelous “escape hatch” for truly custom needs.
To give you a flavor of what the setup process looks like, run terraform apply against this basic configuration:
It will automatically create all of the following:
Networking: Stack includes a new VPC, with public and private subnets. All routing tables, Internet Gateways, NAT Gateways, and basic security groups are automatically provisioned.
Auto-scaling default cluster: Stack ships with an auto-scaling group and basic lifecycle rules to automatically add new instances to the default cluster as they are needed.
ECS configuration: in Stack, all services run atop ECS. Simply create a new service, and the auto-scaling default cluster will automatically pick it up. Each instance ships with Docker and the latest ecs-agent.
CloudWatch logging & metrics: Stack sends all container logs to CloudWatch. Because all requests between services go through ELBs, metrics around latency and status codes are automatically collected as well.
Bastion: Stack also includes a bastion host for manual SSH access to your cluster. Besides the public services, it’s the only instance exposed to the outside world and acts as the “jump point” for manual access.
This basic setup uses the stack module as a unit, but Terraform can also reference the components of Stack individually.
That means that you can reference any of the internal modules that the stack uses, while continuing to use your own custom networking and instance configuration.
Want to only create Stack services, but bring your own VPC? Just source the service module and pass in your existing VPC ID. Don’t need a bastion and want custom security groups? Source only the vpc and cluster modules to set up only the default networking.
You’re free to take the pieces you want and leave the rest.
If you’d like to dig into more about how this works in-depth, and each of the modules individually, check out the Architecture section of the Readme.
Now, let’s walk through how to provision a new app and add our internal services.
Note: this walkthrough assumes you have an AWS account and Terraform installed. If not, first get the pre-requisites from the requirements section.
For this tutorial, we’ll reference the pieces of the demo app we’ve built: Pingdummy, a web-based uptime monitoring system.
All of the Docker images we use in this example are public, so you can try them yourself!
The Pingdummy infrastructure runs a few different services to demonstrate how services can be deployed and integrated using Stack.
the pingdummy-frontend is the main webpage users hit to register and create healthchecks. It uses the web-service module to run as a service that is publicly accessible to the internet.
the pingdummy-beacon is an internal service which makes requests to other third-party services, and responds with their status. It uses the service module, and is not internet facing. (Though it’s used here for example purposes, this service could eventually be run in many regions for HA requests.)
the pingdummy-worker is a worker which periodically sends requests to the pingdummy-beacon service. It uses the worker module as it only needs a service definition, not a load balancer.
an RDS instance used for persistence
First, you’ll want to add a Terraform file to define all of the pieces of your infrastructure on AWS. Start by creating a terraform.tf file in your project directory.
Then, copy the basic stack setup to it:
And then use the Terraform CLI to actually apply the infrastructure:
This will create all the basic pieces of infrastructure we described in the first section.
Note: for managing Terraform’s remote state with more than a single user, we recommend configuring the remote state to use Terraform Enterprise or S3. You can use our pingdummy repo’s Makefile as an example.
Now we’ll add RDS as our persistence layer. We can pull the rds module from Stack, and then reference the outputs of the base networking and security groups we’ve already created. Terraform will automatically interpolate these and set up a dependency graph to re-create the resources if they change.
We’ll need to run plan and apply again to create the new resources:
And presto! Our VPC now has an RDS cluster to use for persistence, managed by Terraform.
Now that we have our persistence and base layers setup, it’s time to add the services that run the Pingdummy app.
We can start with the internal beacon service for our health-checks. This service listens on port 3001 and makes outbound HTTP requests to third-parties to check if a given URL is responding properly.
We’ll need to use the service module, which creates an internal service that sits behind an ELB. That ELB will be automatically addressable at beacon.stack.local, and ECS will automatically add the service containers to the ELB once they pass the health check.
Next, we’ll add the pingdummy-worker service. It is responsible for making requests to our internal beacon service.
As you can see, we’ve used the worker module since this program doesn’t need a load balancer or DNS name. We can also pass custom configuration to it via environment variables or command line flags. In this case, it’s passed the address of the beacon service.
Finally, we can add our pingdummy-frontend web app which will be Internet-accessible. This will use the web-service module so that the ELB can serve requests from the public subnet.
In order to make the frontend work, we need a few extra pieces of configuration beyond just what the base web-service module provides.
We’ll first need to add an SSL certificate that’s been uploaded to AWS. Sadly, there’s no Terraform configuration for doing this (it requires a manual step), but you can find instructions in the AWS docs.
From there, we can tell our module that we’d like it to be accessible on the public subnets and security groups and be externally facing. The stack module creates these all individually, so we can merely pass them in and we’ll be off to the races.
Finally, run the plan and apply commands one more time:
And we’re done! Just like that, we have a multi-AZ microservice architecture running on vanilla AWS.
Looking in the AWS console, you should see logs streaming into CloudWatch from our brand new services. And whenever a request is made to the service, you should see HTTP metrics on each of the service ELBs.
To deploy new versions of these services, simply change the versions in the Terraform configuration, then re-apply. New task definitions will be created and the appropriate containers will be cycled with zero downtime.
There are a few other pieces you’ll need to add, which you can see examples for in the main Pingdummy Terraform file. Keep in mind that the example is a dummy app, and is not how we’d recommend doing things like security groups or configuration in production. We’ll have more on that in terraform to come :).
Additionally, we’re excited to open source a few other pieces that were involved in releasing the Stack:
Amir Abu Shareb created terraform-docs, a command-line tool to automatically generate documentation for Terraform modules. You can think of it as the godoc of the Terraform world, automatically extracting inputs, outputs, and module usage in an easily consumable format.
We use terraform-docs to build all of the module reference documentation for Stack.
Achille Roussel created ecs-logs, an agent for sending logs from journald to CloudWatch. It provides all the built-in logging for Stack, and makes sure to create a log group for each service and a single log stream per container.
It’s our hope that this post gave you a brief look at the raw power of what can be achieved with the AWS APIs these days. The ease of Terraform paired with the flexibility and scale of AWS is an extremely powerful combination.
Stack is a “first pass” of what combining these technologies can achieve. It’s by no means finished, and only provides the foundation for many of the ideas that we’ve put into production. Additionally, we’re trying some new experiments around log drivers and instances (reflected by the 0.1 tag) which we think will pay off in the future.
Nonetheless, we’ve open sourced Stack today as the first step toward gathering as much community wisdom as possible around running infrastructure atop AWS.
In that vein, we’ll happily accept pull requests for new modules that fall within the spirit of the project. It’s our goal to provide the community with a good set of Terraform modules that provide sane defaults and simpler abstractions on top of the raw AWS infrastructure.
So go ahead and try out the Stack today, and please let us know what you think!
Prateek Srivastava on June 14th 2016
Segment’s mobile SDKs are designed to track behavioral data from your app and translate and route that data to hundreds of downstream integrations. One of the SDK’s core tasks is to upload behavioral data to our servers. Since every network request requires your app to power up the device’s radio, uploading this data in real-time can quickly drain a battery.
To minimize the impact of the SDK on battery life, we queue these behavioral events and upload them in batches periodically. This results in 3x less battery drain over an implementation without batching.
Our Android queueing system is built on QueueFile, which was developed by Square. In this article we’re going to run through our queueing requirements and some of the traditional solutions, and explain in detail why we chose to build on QueueFile.
Queues are a deceptively simple concept, but there are two main considerations that make them complicated in practice: durability and atomicity.
Durability - An element that has been added to the queue will survive permanently. An in-memory queue is easy to implement, but it lacks durability — events are queued until the process dies, and lost thereafter.
Atomicity - An element is either added to the queue or it isn’t; the queue won’t ever be in a partial/invalid state. For our purposes, we needed to save events to disk as quickly and reliably as possible, and then worry about uploading them later.
We were looking for a solution that would guarantee these contracts would be honored even in the face of process deaths or system crashes. Such scenarios are inevitable on mobile, e.g. the user could lose battery power in the middle of an operation, or the operating system could kill your application to reclaim memory.
File: The most obvious way to persist data to disk is to use a plain old Java file. However, writing to a file is not atomic (in most cases), and it’s easy to run into cases of a corrupted file. A trivial implementation would also keep the entire queue in memory, which is not ideal for devices that might go offline for a long time.
AtomicFile: AtomicFile (also available from the support library) is a simple helper that can guarantee atomic writes on a regular file. It guarantees atomicity by creating a backup of the original file before writing any changes, and waiting until the write is completely written to disk before deleting the backup. As long as the backup file exists, the original file is considered to be invalid. However, AtomicFile requires the caller to keep track of the corrupted state and restore itself from one.
SharedPreferences: The simplest way to persist data onto disk is by using SharedPreferences from the Android framework. Although designed to store small key-value pairs, it can certainly be used to store any data in a pinch. The SharedPreferences class is a high level wrapper around its own implementation of an atomic file and an in-memory cache. Writes are committed to a map in memory first, and then saved to disk. However, writing to SharedPreferences is not durable. SharedPreferences silently swallows any disk write error, leaving the memory cache and disk out of sync, with callers left guessing about the result of an operation.
SQLite: The typical solution to designing a reliable disk queue on Android is to layer it on top of a SQLite database. SQLite works great for large datasets that need multithreaded access, but being a fully featured database, it is designed with optimizations for advanced querying and not for the purposes of a queue. SQLite is complex with many moving parts, and this made us wary of relying on it to back such a critical section of our code.
Although all of the above solutions could have been coerced to work for us, none of them were designed specifically to be used as queues. Luckily, Square had also run into this problem for accepting payments, and they built QueueFile to accomplish this. QueueFile guarantees that all operations are atomic and that writes are durable, and it is designed to survive process and even system-level crashes.
It was a solution tailor-made for our use case. Adding and removing elements from QueueFile is ordered first in first out (FIFO) and takes constant time, both of which are a nice bonus. QueueFile’s tiny size also makes it perfect for us to embed in our SDK.
There are a few guarantees that the filesystem provides:
renaming a file is an atomic operation
fsync is durable
segment writes are atomic
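The first two guarantees are the basis of a common "write to a temporary file, fsync, then rename" pattern, sketched below in Python. This is a general illustration of those guarantees rather than how QueueFile works (QueueFile relies on the third guarantee instead), and `atomic_write` is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at path with data, or leave the old contents untouched."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_path, path)  # rename over the target in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "queue.dat")
atomic_write(demo_path, b"first")
atomic_write(demo_path, b"second")
with open(demo_path, "rb") as f:
    contents = f.read()
```

A reader of `queue.dat` sees either the old bytes or the new bytes, never a half-written mix. The cost is rewriting the whole file on every change, which is one reason a queue benefits from a different design.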
QueueFile is particularly clever about the way it stores and updates data — Bob Lee has an excellent talk on the subject. QueueFile consists of a 16 byte file header, and a series of items called elements, as a circular buffer. The grey area (not to scale) represents empty space in the file.
The file header consists of four 4-byte integers that represent the length of the file, the number of elements in the file, and a pointer to the location of the first and last elements. Since the length of the file header is only 16 bytes (smaller than the size of a segment), changes to the file header are atomic. QueueFile relies on this by making modifications to the file visible only when the header is committed as well.
The components of a file header.
Each element itself is comprised of a 4-byte element header that stores the length of the element, and the element data (variable length) itself.
The components of an element.
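To make the layout concrete, here is a minimal Python sketch that packs a 16-byte file header and a length-prefixed element with the `struct` module. The field order follows the description above; the big-endian byte order and the use of unsigned integers are assumptions for the example.

```python
import struct

# Assumption: big-endian, unsigned 4-byte integers; field order follows the prose.
HEADER = struct.Struct(">IIII")       # file length, element count, first, last
ELEMENT_HEADER = struct.Struct(">I")  # length of the element data


def pack_header(file_length, count, first, last):
    """Build the 16-byte file header."""
    return HEADER.pack(file_length, count, first, last)


def pack_element(data):
    """Prefix the element data with its 4-byte length."""
    return ELEMENT_HEADER.pack(len(data)) + data


header = pack_header(4096, 1, 16, 16)
element = pack_element(b"hello")
print(len(header), element)
```

Because the whole header fits in 16 bytes, rewriting it is the single small write that makes a change visible.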
Consider adding an element to the QueueFile below.
QueueFile first writes (and fsyncs) the element and its length. Notice that the file header remains unchanged. If writing the second element fails, the QueueFile is still left in a valid state since the header hasn’t been updated (even though the new data may be on disk). Upon restart, the QueueFile would still report that the queue contains only a single element.
When the write operation completes successfully, the header is committed (and fsynced) as well. If updating the header fails, then the QueueFile remains the same as above, and the change is aborted. Otherwise, we’ve successfully added our data!
Calling fsync after every write prevents the filesystem from reordering our writes, and makes the transaction durable. This ensures that the QueueFile is never left in an invalid or corrupted state.
Removing and clearing do the reverse. Let’s start from the result of our previous operation.
QueueFile writes (and fsyncs) the header first. If committing the header fails, then the change is simply aborted and the QueueFile remains the same as above. When the header is written successfully, the removal is committed, and the queue now has a size of 1 (even though the removed data is still on disk).
QueueFile goes one step further and zeroes out the removed data, which leaves us just with the element we added previously.
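The ordering of the add and remove steps can be sketched in Python as follows. This is a deliberately simplified illustration, not the QueueFile implementation: the file is fixed-size and non-circular, the class name `TinyQueueFile` is hypothetical, and the big-endian header layout is an assumption.

```python
import os
import struct
import tempfile

HEADER = struct.Struct(">IIII")  # file length, element count, first offset, last offset
LENGTH = struct.Struct(">I")     # per-element length prefix


class TinyQueueFile:
    """Simplified, non-circular sketch of the commit ordering described above."""

    def __init__(self, path, capacity=4096):
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.write(HEADER.pack(capacity, 0, 0, 0).ljust(capacity, b"\0"))
        self.f = open(path, "r+b")

    def _read_header(self):
        self.f.seek(0)
        return HEADER.unpack(self.f.read(HEADER.size))

    def _write(self, offset, data):
        self.f.seek(offset)
        self.f.write(data)
        self.f.flush()
        os.fsync(self.f.fileno())  # make this step durable before the next one

    def _element_length(self, offset):
        self.f.seek(offset)
        return LENGTH.unpack(self.f.read(LENGTH.size))[0]

    def size(self):
        return self._read_header()[1]

    def add(self, data):
        capacity, count, first, last = self._read_header()
        # The new element goes right after the last one (or right after the header).
        offset = HEADER.size if count == 0 else last + LENGTH.size + self._element_length(last)
        if offset + LENGTH.size + len(data) > capacity:
            raise OSError("queue file is full")  # this sketch neither grows nor wraps
        # Step 1: write the element; the header still describes the old queue.
        self._write(offset, LENGTH.pack(len(data)) + data)
        # Step 2: commit by rewriting the 16-byte header.
        self._write(0, HEADER.pack(capacity, count + 1, offset if count == 0 else first, offset))

    def peek(self):
        _, count, first, _ = self._read_header()
        if count == 0:
            return None
        length = self._element_length(first)
        self.f.seek(first + LENGTH.size)
        return self.f.read(length)

    def remove(self):
        capacity, count, first, last = self._read_header()
        if count == 0:
            raise IndexError("queue is empty")
        length = self._element_length(first)
        # Step 1: commit the removal in the header first.
        if count == 1:
            self._write(0, HEADER.pack(capacity, 0, 0, 0))
        else:
            self._write(0, HEADER.pack(capacity, count - 1, first + LENGTH.size + length, last))
        # Step 2: zero out the bytes of the removed element.
        self._write(first, b"\0" * (LENGTH.size + length))

    def close(self):
        self.f.close()


path = os.path.join(tempfile.mkdtemp(), "events.queue")
q = TinyQueueFile(path)
q.add(b"first")
q.add(b"second")
q.remove()
remaining = q.peek()
size_after = q.size()
q.close()
```

In both operations the header write is the commit point: a crash before it leaves the old queue intact, and a crash after it leaves the new one.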
QueueFile has been a key part of the queueing system we’ve built, which powers data collection for over 1,400 Android apps running on more than 200 million devices. While the incredibly small size of the library helps prevent SDK bloat, its simplicity does also create a few limitations. For instance, QueueFile is limited to a size of 1GB, and it cannot be used by multiple processes concurrently. Neither was a deal breaker for us (we don’t want your queued analytics data using too much disk space, and we can create separate queues for different processes), but both are good to be aware of.
We’re working on porting this approach over to iOS and have some ideas to expand QueueFile to work on a broader range of platforms like iOS and desktop apps. If you’re interested in helping us build infrastructural components like this for mobile apps and SDKs, we’re hiring.
Stephen Mathieson, Calvin French-Owen on May 26th 2016
For the past year, we’ve been heavy users of Amazon’s EC2 Container Service (ECS). It’s given us an easy way to run and deploy thousands of containers across our infrastructure.
ECS gives us free CloudWatch metrics, automatic healthchecks, and scheduling across hundreds of nodes. It’s radically simplified how we deploy code to production. In short, it’s pretty awesome.
But ECS had one major shortcoming: navigating the AWS Console is a massive pain. It’s hard to find which containers are running and what version of an image is deployed.
That’s why we built Specs.
Specs gives you a high-level window into your ECS clusters and services.
It provides a handy search box to find particular services in production, and then helps diagnose the service’s exact state. Specs lets us know if a container isn’t able to be placed due to insufficient capacity or if all containers in a service are dead.
It gives the entire development team a better picture of how our production environment is configured.
We run Specs behind an internal Google OAuth server, so any engineer can view it even if they don’t have an IAM account. It’s allowed a lot more of the team to debug issues without bothering the core infra team.
Getting started with Specs couldn’t be easier. Assuming you’re running Docker and have your AWS credentials exported, just run the following command:
You’ll get convenient dashboards with zero configuration.
If you’d like to help contribute, please check out the github repo. We’re excited to add features like log tailing and a real-time events feed to make working with ECS a truly seamless experience.
Lipei Wang on May 23rd 2016
Python, one of the most popular scripting languages, is also one of the most preferred tools for data analysis and visualization. In addition to the broader Python developer community, there is also a significant group that uses Python to analyze data, draw actionable insights, and make decisions.
With its extensive collection of helper libraries and platforms, Python is a great tool for quick, iterative data exploration. Python’s set of libraries includes everything from visualization to statistical analysis, making it convenient for its users to jump into the data and begin identifying patterns.
Together with the ability to iterate quickly in data and statistical analysis, there are great open source tools for managing data pipelines and workflows. A growing community of analysts is finding new ways of using Python to crunch numbers and understand their data.
We had a chance to catch up with the Chief Analyst at Mode, Benn Stancil, and ask him about the importance of Python, how to use it in day-to-day analysis, and to tell us about some key features of Mode’s new product, Mode Python Notebooks. Mode Python Notebooks is a hosted solution that allows analysts to use Python for exploratory analysis.
I’m interested in Python for the same reasons I like SQL: It gives me the power and flexibility to answer any question. The community is great and adoption is on the rise.
There are many easy to use Python libraries to make data exploration convenient and immediate. This allows for iterative data analysis. With Python, you can really chase your curiosities down the rabbit hole.
Lastly, Python’s utility and flexibility allow it to be used for a variety of tasks within the data science stack. For example, Luigi and Airflow both allow for managing data pipelines and workflows in Python. When exploratory analysis is done in Python, the work can sometimes carry over into production.
What are the most popular use cases for SQL as opposed to Python?
SQL is designed to query and extract data from a database. It’s a necessary first step to get the data into a usable format. For instance, SQL allows you to easily join several data sets to create a table that you can explore further.
SQL isn’t really designed for manipulating or transforming data in certain ways. Higher level data manipulation that is common with data science, such as statistical analysis, regressions, trend lines, and working with time series data, isn’t easy in SQL.
Despite these limitations, because SQL is necessary for extracting data, it’s still commonly used for complex operations. The query below, which calculates quantiles for different series in a data set, is something I’ve used versions of many times.
When would Python come in?
Python has a ton of libraries (e.g. Pandas, StatsModels, and SciPy) that are designed for statistical and mathematical analysis. The libraries also do a great job of abstracting away the details so that you don’t need to calculate all the underlying math by hand. Moreover, you can get your results immediately, so you can use Python iteratively to explore your data.
Rather than saying “I want to do a regression analysis” and sitting down for half an hour figuring out where to begin in SQL, the Python libraries make it so that you can just run the analysis, see the results, and continue exploring the path your curiosity takes you down. With Python, there is not much lag between inspiration and action. With SQL, on the other hand, I often think twice before going down a path that may or may not be fruitful.
For example, I’d only write the query above if I really knew I wanted to present the quantiles of that dataset. Because the entire thing can be accomplished with the one line of Python below, I’d do it much earlier in my analytical process–and may discover something I wasn’t looking for as a result.
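To give a sense of how short this can be, here is a minimal sketch that uses only the standard library's `statistics.quantiles`. The sample values and variable names are made up for illustration, and an analyst working with Pandas would more likely call the equivalent method on a DataFrame.

```python
from statistics import quantiles

# Hypothetical sample: one numeric series pulled from a query result.
response_times = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# The one-liner: quartile cut points (25th, 50th, and 75th percentiles).
quartiles = quantiles(response_times, n=4, method="inclusive")

print(quartiles)
```

Changing `n` gives deciles or percentiles instead, which is the kind of cheap follow-up question that is tedious to ask in SQL.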
Another way to think about the difference between Python and SQL is that Python allows you to start with one large table, from which you branch off different analyses in different directions. One avenue of inspiration can bring you to another avenue and to another avenue. The speed and flexibility of analysis makes it easy to go down many exploratory paths.
Those kinds of analyses sound very different. Why combine SQL and Python in one place?
Because SQL and Python each have individual strengths and weaknesses. Tying the languages together gives analysts the best of both worlds.
First, SQL is needed to build the data set into a final table that has all of the necessary attributes. Then, from this large data set, you can use Python to spin off deeper analysis.
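As a rough sketch of that two-step workflow, the example below uses an in-memory SQLite database from Python's standard library. The `users` and `orders` tables and their contents are hypothetical; the point is only that SQL does the join and Python takes over from the resulting table.

```python
import sqlite3
from statistics import mean

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, plan TEXT);
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'free'), (2, 'paid'), (3, 'paid');
    INSERT INTO orders VALUES (1, 10.0), (2, 40.0), (2, 60.0), (3, 20.0);
""")

# Step 1 (SQL): join the raw tables into one wide result set.
rows = conn.execute("""
    SELECT u.plan, o.amount
    FROM orders AS o
    JOIN users AS u ON u.id = o.user_id
""").fetchall()
conn.close()

# Step 2 (Python): branch off an analysis from that single table.
amounts_by_plan = {}
for plan, amount in rows:
    amounts_by_plan.setdefault(plan, []).append(amount)

average_by_plan = {plan: mean(values) for plan, values in amounts_by_plan.items()}
print(average_by_plan)
```

From the same `rows`, other analyses (quantiles, trends, regressions) can branch off without going back to the database.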
What would it take for a SQL analyst to learn Python?
Like many skills, the best way to learn how to use Python for analysis is by diving in to work on a problem you’re interested in, care about, and with which you’re somewhat familiar.
When you work on something that you’re interested in, you tend to go deeper. You uncover something in the core problem that piques your interest, and you want to learn more by analyzing the data set in a different way. You begin asking more and more questions. This curiosity can push you further than you would go otherwise, and is a source of a lot of real learning.
You also should work on data that you’re somewhat familiar with, so you know when you do something wrong. You’ll have better instincts about what’s going on and what to expect. Compare this to when you’re working on data about which you know nothing—like flower petal sizes (an unusually popular data set found in many Python examples). If your analysis concludes that “all of these flowers have two centimeter-long petals” and you have no idea whether that is reasonable, you may just assume it’s right and move on.
Mode is also releasing a new Python tutorial that aims to help SQL users learn how and where to integrate Python into their workflow. In addition, the Python tutorials provide problems that are familiar to those in business settings, instead of academic problems.
How would you expect learning Python to help benefit analysts in their job and their careers?
Learning Python definitely can augment an analyst’s skill set.
An analyst needs to communicate business value through data. One part of the job is to find insights in the data, but doing the job well also means including the right context and narrative around those insights, so that they compel your teammates toward action. And since using data and analytics to make decisions is becoming more important in the workplace, the role of the analyst in delivering comprehensive analysis is more important than ever.
Is it easy to get started with Python? What kind of setup and tooling do you need?
Until recently, getting started with Python for analysis required installing a few things: Python, several main statistics and data analysis libraries, and Notebook software that’ll run the analysis locally on your computer. Then, you’d have to run the Notebook, starting a local server to execute your Python commands.
The results generated from your commands on Notebooks will exist on your desktop. In order to share it, you basically download a Python Notebook HTML file (which has a mix of code and results) and send that around. In order to easily parse the results from the HTML, your colleagues would have to open that file in a browser. And at that point, unless they have Python set up too, the code isn’t re-executable.
When you run Python locally, you’re limited to the power of your computer, you have to leave your computer open to run it, and running scripts can slow other things down. By running it remotely, you can run it from any machine and you can run something, close your computer and walk away, and still have your results waiting when you get back.
With Mode Python Notebooks, all of that setup and hosting is taken care of for you. If you’re using Mode to query an existing database, there’s a tab for a Python Notebook. You can open it up and the query results table will be automatically populated. And after generating plots, time series analysis, summary statistics, etc., in the Notebook, it’s easy to curate the results into an instantly shareable report anyone can re-run. Moreover, you don’t have to run it on your local machine, which saves your computer’s processing power and memory. Finally, setting it all up is as easy as setting up a database connection.
Thanks for your time, Benn!
If you’re interested in learning how to run data analysis in Python, check out Mode’s new Python tutorial. And, if you’re a Segment customer looking to run Python scripts on your event and cloud app data, check out Mode Python Notebooks.
Calvin French-Owen on May 16th 2016
Since Segment’s first launch in 2012, we’ve used queues everywhere. Our API queues messages immediately. Our workers communicate by consuming from one queue and then publishing to another. It’s given us a ton of leeway when it comes to dealing with sudden batches of events or ensuring fault tolerance between services.
We first started out with RabbitMQ, and a single Rabbit instance handled all of our pubsub. Rabbit had a lot of nice tooling around message delivery, but it turned out to be really tough to cluster and scale as we grew. It didn’t help that our client library was a bit of a mess, and frequently dropped messages (anytime you have to do a seven-way handshake for a protocol that’s not TLS… maybe re-think the tech you’re using).
So in January 2014, we started the search for a new queue. We evaluated a few different systems: Darner, Redis, Kestrel, and Kafka (more on that later). Each queue had different delivery guarantees, but none of them seemed both scalable and operationally simple. And that’s when NSQ entered the picture… and it’s worked like a charm.
As of today, we’ve pushed 750 billion messages through NSQ, and we’re adding around 150,000 more every second.
Before discussing how NSQ works in practice, it’s worth understanding how the queue is architected. The design is so simple, it can be understood with only a few core concepts:
topics - a topic is the logical key where a program publishes messages. Topics are created when programs first publish to them.
channels - channels group related consumers and load balance between them–channels are the “queues” in a sense. Every time a publisher sends a message to a topic, that message is copied into each channel that consumes from it. Consumers will read messages from a particular channel and actually create the channel on the first subscription. Channels will queue messages (first in memory, and spill over to disk) if no consumers are reading from them.
messages - messages form the backbone of our data flow. Consumers can choose to finish messages, indicating they were processed normally, or requeue them to be delivered later. Each message contains a count for the number of delivery attempts. Clients should discard messages which pass a certain threshold of deliveries or handle them out of band.
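The relationship between topics, channels, and consumers can be illustrated with a toy in-memory model. This is only a sketch of the concepts above, not how nsqd is implemented: the class and function names are invented, and simple round-robin stands in for NSQ's real delivery logic.

```python
from collections import defaultdict, deque
from itertools import cycle


class Topic:
    """Toy model of an NSQ topic: every message is copied to every channel."""

    def __init__(self, name):
        self.name = name
        self.channels = defaultdict(deque)  # channel name -> queued messages

    def subscribe(self, channel_name):
        # A channel is created the first time a consumer subscribes to it.
        return self.channels[channel_name]

    def publish(self, message):
        # Fan out: each channel receives its own copy of the message.
        for queue in self.channels.values():
            queue.append(message)


def drain(queue, consumers):
    """Load balance one channel's messages across its consumers (round-robin)."""
    delivered = defaultdict(list)
    for consumer in cycle(consumers):
        if not queue:
            break
        delivered[consumer].append(queue.popleft())
    return dict(delivered)


events = Topic("events")
events.subscribe("archives")
events.subscribe("metrics")
for body in ["a", "b", "c", "d"]:
    events.publish(body)

# Two workers share the "archives" channel; "metrics" keeps its own copies.
archives = drain(events.channels["archives"], ["worker-1", "worker-2"])
print(archives)
print(list(events.channels["metrics"]))
```

Each message is seen once per channel, and within a channel it goes to exactly one consumer, which is why channels behave like the “queues” of the system.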
NSQ also runs two programs during operation:
nsqd - the nsqd daemon is the core part of NSQ. It’s a standalone binary that listens for incoming messages on a single port. Each nsqd node operates independently and doesn’t share any state. When a node boots up, it registers with a set of nsqlookupd nodes and broadcasts which topics and channels are stored on the node.
Clients can publish or read from the nsqd daemon. Typically publishers will publish to a single, local nsqd. Consumers read remotely from the connected set of nsqd nodes with that topic. If you don’t care about adding more nodes dynamically, you can run nsqds standalone.
nsqlookupd – the nsqlookupd servers work like consul or etcd, only without coordination or strong consistency (by design). Each one acts as an ephemeral datastore that individual nsqd nodes register to. Consumers connect to these nodes to determine which nsqd nodes to read from.
Let’s walk through a more concrete example of how this works in practice.
NSQ recommends co-locating publishers with their corresponding nsqd instances. That means even in the face of a network partition, messages are stored locally until they are read by a consumer. What’s more, publishers don’t need to discover other nsqd nodes–they can always publish to the local instance.
First, a publisher sends a message to its local nsqd. To do this, it first opens up a connection, and then sends a PUB command with the topic and message body. In this case, we publish our messages to the events topic to be fanned out to our different workers.
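To make the wire format concrete, here is a minimal sketch of how a PUB frame could be encoded. It assumes the framing described in the NSQ TCP protocol spec (a command line, then a 4-byte big-endian body size, then the body) and a `  V2` magic identifier sent once after connecting. The function name and sample payload are ours, and the sketch only builds bytes rather than opening a socket.

```python
import struct

# Protocol version identifier a client sends once per new connection.
MAGIC_V2 = b"  V2"


def encode_pub(topic: str, body: bytes) -> bytes:
    """Build a PUB frame: command line, 4-byte big-endian size, then the body."""
    header = b"PUB " + topic.encode("ascii") + b"\n"
    return header + struct.pack(">I", len(body)) + body


frame = encode_pub("events", b'{"type":"track"}')
print(frame)
```

A real client would write `MAGIC_V2` followed by frames like this one to the local nsqd and then read the response, which is where a client library earns its keep.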
The events topic will copy the message and queue it in each of the channels linked to the topic. In our case, there are three channels, one of them being the archives channel. Consumers will take these messages and upload them to S3.
Messages in each of the channels will queue until a worker consumes them. If the queue goes over the in-memory limit, messages will be written to disk.
The nsqd nodes will first broadcast their location to the nsqlookupds. Once they are registered, workers will discover all the nsqd nodes with the events topic from the lookup servers.
Then each worker subscribes to each nsqd host, indicating that it’s ready to receive messages. We don’t need a fully connected graph here, but we do need to ensure that individual nsqd instances have enough consumers to drain their messages (or else the channels will queue up).
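The discovery step can be modeled in a few lines. The sketch below is a toy stand-in for nsqlookupd, with invented class names and addresses, and it ignores the real HTTP interface entirely. It only shows that each lookup server holds a partial registry and that consumers take the union of the answers.

```python
from collections import defaultdict


class LookupServer:
    """Toy stand-in for nsqlookupd: an ephemeral topic -> nsqd address registry."""

    def __init__(self):
        self.producers = defaultdict(set)

    def register(self, nsqd_address, topic):
        # Called by an nsqd node when it boots or first sees a topic.
        self.producers[topic].add(nsqd_address)

    def lookup(self, topic):
        return set(self.producers.get(topic, set()))


def discover(lookup_servers, topic):
    """Union the answers from every lookup server; none needs a global view."""
    found = set()
    for server in lookup_servers:
        found |= server.lookup(topic)
    return found


lookup_a, lookup_b = LookupServer(), LookupServer()
lookup_a.register("10.0.0.1:4150", "events")
lookup_b.register("10.0.0.2:4150", "events")

# A worker would then open one connection per discovered nsqd.
nsqd_hosts = discover([lookup_a, lookup_b], "events")
print(sorted(nsqd_hosts))
```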
Separate from the client library, here’s an example of what our message handling code might look like:
If the third party fails for any reason, we can handle the failure. In this snippet, we have three behaviors:
discard the message if it’s passed a certain number of delivery attempts
finish the message if it’s been processed successfully
requeue the message to be delivered later if an error occurs
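The three behaviors above can be sketched in Python roughly as follows. The `Message` class is a minimal stand-in rather than the API of any real NSQ client, and the constant values and the `send_to_integration` callback are hypothetical.

```python
MAX_DELIVERY_ATTEMPTS = 5   # hypothetical threshold
BACKOFF_TIME_MINUTES = 1    # hypothetical delay before a requeued message returns


class Message:
    """Minimal stand-in for a client library's message object."""

    def __init__(self, body, attempts=1):
        self.body = body
        self.attempts = attempts
        self.outcome = None

    def finish(self):
        self.outcome = "finished"

    def requeue(self, delay_minutes):
        self.outcome = f"requeued after {delay_minutes}m"


def handle(message, send_to_integration):
    # Discard: too many delivery attempts, so stop retrying.
    if message.attempts > MAX_DELIVERY_ATTEMPTS:
        message.finish()
        message.outcome = "discarded"
        return
    try:
        send_to_integration(message.body)
    except Exception:
        # Requeue: the integration failed, so try again later.
        message.requeue(BACKOFF_TIME_MINUTES)
        return
    # Finish: the message was processed successfully.
    message.finish()


ok = Message({"event": "signup"})
handle(ok, lambda body: None)
print(ok.outcome)
```

Note that a discard is still a finish from the queue's point of view; the handler simply chooses not to do the work.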
As you can see, the queue behavior is both simple and explicit.
In our example, it’s easy to reason that we’ll tolerate MAX_DELIVERY_ATTEMPTS * BACKOFF_TIME minutes of failure from our integration before discarding messages.
At Segment, we keep statsd counters for message attempts, discards, requeues, and finishes to ensure that we’re achieving a good QoS. We’ll alert for services any time the number of discards exceeds the thresholds we’ve set.
In production, we run nsqd daemons on pretty much all of our instances, co-located with the publishers that write to them. There are a few reasons NSQ works so well in practice:
Simple protocol - this isn’t a huge issue if you already have a good client library for your queue. But, it can make a massive difference if your existing client libraries are buggy or outdated.
NSQ has a fast binary protocol that is easy to implement with just a few days of work. We built our own pure-JS Node driver (at the time, only a CoffeeScript driver existed), which has been stable and reliable.
Easy to run - NSQ doesn’t have complicated watermark settings or JVM-level configuration. Instead, you can configure the number of in-memory messages and the max message size. If a queue fills up past this point, the messages will spill over to disk.
Distributed - because NSQ doesn’t share information between individual daemons, it’s built for distributed operation from the beginning. Individual machines can go up and down without affecting the rest of the system. Publishers can publish locally, even in the face of network partitions.
This ‘distributed-first’ design means that NSQ can essentially scale forever. Need more throughput? Add more nsqds!
The only shared state is kept in the lookup nodes, and even those don’t require a “global view of the universe.” It’s trivial to set up configurations where certain nsqds register with certain lookups. The only key is that the consumers can query to get the complete set.
Clear failure cases - NSQ sets up a clear set of trade-offs around components which might fail, and what that means for both deliverability and recovery.
I’m a firm believer in the principle of least surprise, particularly when it comes to distributed systems. Systems fail, we get it. But systems that fail in unexpected ways are impossible to build upon. You end up ignoring most failure cases because you can’t even begin to account for them.
While they might not be as strict of guarantees as a system like Kafka can provide, the simplicity of operating NSQ makes the failure conditions exceedingly apparent.
UNIX-y tooling - NSQ is a good general purpose tool. So it’s not surprising that the included utilities are multi-purpose and composable.
In addition to the TCP protocol, NSQ provides an HTTP interface for simple cURL-able maintenance operations. It ships with binaries for piping from the CLI, tailing a queue, piping from one queue to another, and HTTP pubsub.
There’s even an admin dashboard for monitoring and pausing queues (including the sweet animated counter above!) that wraps the HTTP API.
As I mentioned, that simplicity doesn’t come without trade-offs:
No replication - unlike other queues, NSQ doesn’t provide any sort of replication or clustering. This is part of what makes running it so simple, but it does force some hard guarantees on reliability for published messages.
We partially get around this by lowering the file sync time (configurable via a flag) and backing our queues with EBS. But there’s still the possibility that a queue could indicate it’s published a message and then die immediately, effectively losing the write.
Basic message routing - with NSQ, topics and channels are all you get. There’s no concept of routing or affinity based upon a key. It’s something that we’d love to support for various use cases, whether it’s to filter individual messages, or route certain ones conditionally. Instead we end up building routing workers, which sit in-between queues and act as a smarter pass-through filter.
No strict ordering - though Kafka is structured as an ordered log, NSQ is not. Messages could come through at any time, in any order. For our use case, this is generally okay as all the data is timestamped, but doesn’t fit cases which require strict ordering.
No de-duplication - Aphyr has talked extensively in his posts about the dangers of timeout-based systems. NSQ also falls into this trap by using a heartbeat mechanism to detect whether consumers are alive or dead. We’ve previously written about various reasons that would cause our workers to fail heartbeat checks, so there has to be a separate step in the workers to ensure idempotency.
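One common way to add that idempotency step is to remember which message IDs have already been handled. The sketch below assumes each message carries a stable ID; the class name is invented, and the in-memory set is a simplification, since a real deployment would likely need a shared store with expiry.

```python
class IdempotentWorker:
    """Wraps a handler so that redelivered messages are processed only once."""

    def __init__(self, handler):
        self.handler = handler
        self.seen_ids = set()  # simplification: a process-local, unbounded set

    def process(self, message_id, body):
        if message_id in self.seen_ids:
            return False  # duplicate delivery, skip the side effect
        self.handler(body)
        self.seen_ids.add(message_id)  # mark only after the handler succeeded
        return True


processed = []
worker = IdempotentWorker(processed.append)
first = worker.process("msg-1", "hello")
second = worker.process("msg-1", "hello")  # same message delivered again
print(first, second, processed)
```

Because the ID is recorded only after the handler returns, a crash mid-handler leads to a retry rather than a silently dropped message.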
As you can see, the underlying motif behind all of these benefits is simplicity. NSQ is a simple queue, which means that it’s easy to reason about and easy to spot bugs. Consumers can handle their own failure cases with confidence about how the rest of the system will behave.
In fact, simplicity was one of the primary reasons we decided to adopt NSQ in the first place (along with many of our other software choices). The payoff has been a queueing layer that has performed flawlessly even as we’ve increased throughput by several orders of magnitude.
Today, we face a more complex future. More and more of our workers require a stricter set of reliability and ordering guarantees than NSQ can easily provide.
We plan to start swapping out NSQ for Kafka in those pieces of the infrastructure and get much better at running the JVM in production. There’s a definite trade-off with Kafka; we’ll have to shoulder a lot more operational complexity ourselves. On the other hand, having a replicated, ordered log would provide much better behavior for many of our services.
But for any workers which fit NSQ’s trade-offs, it’s served us amazingly well. We’re looking forward to continuing to build on its rock-solid foundation.
TJ Holowaychuk for creating nsq.js and helping drive its adoption at Segment.
Jehiah Czebotar and Matt Reiferson for building NSQ and reading drafts of this article.
Julian Gruber, Garrett Johnson, and Amir Abu Shareb for maintaining nsq.js.
Tejas Manohar, Vince Prignano, Steven Miller, Tido Carriero, Andy Jiang, Achille Roussel, Peter Reinhardt, Stephen Mathieson, Brent Summers, Nathan Houle, Garrett Johnson, and Amir Abu Shareb for giving feedback on this article.
Calvin French-Owen on March 15th 2016
I recently jumped back into frontend development for the first time in months, and I was immediately struck by one thing: everything had changed.
When I was more active in the frontend community, the changes seemed minor. We’d occasionally make switches in packaging (RequireJS → Browserify), or frameworks (Backbone → Components). And sometimes we’d take advantage of new Node/v8 features. But for the most part, the updates were all incremental.
Years ago, a friend and I discussed choosing the ‘right’ module system for his company. At the time, he was leaning towards going with RequireJS–and I urged him to look at Browserify or Component (having just abandoned RequireJS ourselves).
We talked again last night, and he said that he’d chosen RequireJS. By now, his company had built a massive codebase around it: “I guess we bet on the wrong horse there.”
Most of the tools we use today didn’t even really exist a year ago: React, JSX, Flux, Redux, ES6, Babel, etc. Even setting up a ‘modern’ project requires installing a swath of dependencies and build tools that are all new. No other language does anything remotely resembling that kind of thing. It’s enough to even warrant a “State of the Art” post so everyone knows what to use.
As it turns out, there are a ton of interesting dynamics at play: corporate self-interest, cries for open standards, Mozilla pushing language development, lagging IE releases that dominate the market, and tools that pave over a lot of the fragmentation.
ECMAScript has a weird and fascinating history of its own, largely broken up into different ‘major’ editions.
apply, and various
ES2 was released one year later in June 1998, but the changes were purely cosmetic. They had been added to comply with a corresponding ISO version of the spec without changing the specification for the language itself, so it’s typically omitted from compatibility tables.
ES3, released in December 1999, marked the first major additions to the language. It added support for regexes, .bind, and try/catch error handling. ES3 became the “standard” for what the majority of browsers would support, underscored by the popularity of older IE browsers. Even in February 2012 (13 years after the ES3 release!!), ES3 IE browsers still had over 20% of the browser market.
But just as the spec was nearing completion, trouble struck! On the one hand Microsoft’s IE platform architect Chris Wilson, argued that the changes would effectively “break the web”. He stated that the ES4 changes were so backwards incompatible that they would significantly hurt adoption.
On the other hand, Brendan Eich, now the Mozilla CTO, argued for the changes. In an open letter to Chris Wilson, he objected to the fact that Microsoft was just now withdrawing support for a spec which had been in the works for years.
After meeting in Oslo, Brendan Eich sent a message to es-discuss (still the primary point for language development) outlining a plan for a near-term incremental release, along with a bigger reform to the language spec that would be known as ‘harmony’. ES5 was finalized and published in December 2009.
With ES5 finally out the door, the committee started moving full-speed ahead on the next set of language extensions which they’d hoped to start incorporating nearly a decade earlier.
By now, browser-makers and committee members alike had seen the danger of adding a giant release without testing the features. So drafts of ES6 have been published frequently since 2011, and available behind flags (--harmony) in Node and many of the major browsers. Additionally, transpilers had become a part of the modern build chain (more on that later), allowing developers to use the bleeding edge of the spec without worrying about breaking old browsers. It was eventually released in June 2015.
So we have:
ECMAScript versions driven by the planning committee to standardize the language
With experiments and language additions to each dialect pushing the ES standard ahead.
Of course, we can’t talk about the full history of JS without mentioning the platform it’s running on: browsers.
But an interesting thing happened when Chrome first emerged onto the scene in December, 2008. While Chrome had numerous features that made it technically interesting (tabs as processes, a new fast js runtime, and others), perhaps the thing that made it most revolutionary was its release schedule.
Chrome shipped from day one with auto-updates, first tested by a wide pool of early adopters using the dev and beta channels. If you wanted to use the bleeding edge, you could easily do so. Nowadays, updates to the stable channel happen every six weeks, with thousands of users testing and automatically reporting bugs they encounter on the dev and beta channels.
It meant that Chrome was able to ship updates far more frequently than its competitors. Where the other browsers would tell users to update every 6-12 months, Chrome updated the three channels weekly, monthly, and released updates on the stable channel every six weeks.
Most of the features are still not “safe” to use and hidden behind flags, but the fact that browsers will continue to auto-update means that the velocity of new JS features has increased dramatically in the last few years.
Imagine now that you’re a developer just learning JS, and knowing none of that history. It’d be pretty much impossible to keep track of which features are supported in what browsers, what’s to spec and what’s not, and what parts of the language should even be used.
Realizing this was a problem, John Resig released the first version of jQuery in 2006. It was a drop-in library designed to pave over a lot of the inconsistencies between the different browsers. And for the most part, developers became comfortable with always bundling jQuery into their projects. It became the de facto library for DOM manipulation, AJAX requests, and more.
But working with many different scripts was tedious. They had to be loaded in a certain order, or else bundle duplicated dependencies many times over. Individual libraries would overwrite the global scope, causing conflicts and monkey-patching default implementations.
So James Burke created RequireJS in 2009, spawned out of his work with Dojo. He created a module loader designed to specify dependencies, and load them asynchronously in the browser. It was one of the first frameworks that actually introduced the idea of a “module system” for isolating pieces of code and loading them asynchronously–dubbed AMD (Asynchronous Module Definition).
While AMD was easy to load on the fly, it was also verbose. It essentially added dependency injection for every single library you’d want to load:
The simplicity had the benefit that the browser could load libraries on-the-fly, but it was a lot of overhead for the programmer to really understand.
Since Node ran on the server, making requests for additional scripts was cheap. The scripts were cached, so it was as easy as reading and parsing an additional file.
Instead of using AMD modules, Node popularized the CommonJS format (the competing module format at the time), using synchronous require statements for loading dependencies. It mirrored the same way that other languages worked, grabbing dependencies synchronously and then caching them. Isaac Schlueter built npm around it as the de facto way to manage and install dependencies in Node. Soon, everyone was writing node scripts that looked like this:
Yet the frontend still lagged. Developers wanted to
So a new set of tools appeared on the frontend developer’s toolchain. Bower for pure dependency management, Browserify and Component/Duo for actually building scripts and bundling them together. Webpack promised a new build system that also handled CSS, and Grunt and Gulp for actually orchestrating them all and making them play nicely together.
After all, there are hundreds of programming languages that can run on the server, but there’s only one widely-supported language for the browser.
But it didn’t end there.
I didn’t even go into changes in frameworks (Ember, Backbone, Angular, React) or how we structure asynchronous programming (Callbacks, Promises, Iterators, Async/Await). But those have all churned relatively regularly over the years.
The most interesting thing about having a language where everyone is comfortable with build tools, transpilers, and syntax additions is that the language can advance at an astonishing rate. Really the only thing holding back new language features is consensus. As quickly as people agree and implement the spec, there’s code to support it.
I predict we’ll continue to see the cycle of rapid iteration for the next few years at least. Companies will be forced to update their codebase, or be happy with the horse they have.