Leif Dreizler on March 2nd 2021
Calvin French-Owen on December 15th 2015
At Segment, we’ve fully embraced the idea of microservices; but not for the reasons you might think.
The microservices vs. monoliths debate has been pretty thoroughly discussed, so I won’t completely re-hash it here. Microservices proponents say that they provide better scalability and are the best way to split responsibilty across software engineering teams. While the pro-monolith group say that microservices are too operationally complex to begin with.
But a major benefit of running microservices is largely absent from today’s discussions: visibility.
When we’re getting paged at 3am on a Tuesday, it’s a million times easier to see that a given worker is backing up compared to adding tracing through every single function call of a monolithic app.
That’s not to say you can’t get good visiblity from more tightly coupled code, it’s just rarer to have all the right visibility from day one.
Where does that visibility come from? Consider for a moment the standard tools that are part of our ops arsenal:
None of them monitor individual program execution: hot codepaths, stack size, etc. The battle-tested tools we’ve built over the past 20 years are all built around the concepts of hosts, processes, or drives.
With a distributed system, we can add in requests and network throughput to our metrics, but most tools still tend to aggregate at a host or service level.
The process-centric nature of monitoring tools makes it really difficult to get a sense of where a program is actually spending time. With a monolithic app, our best options to debug are either to run the program against a profiler or to implement our own timing metrics.
Now that’s kind of crazy when you think about it. Most of the reason flamegraphs are so useful is that we don’t have that detailed amount of monitoring at the level of individual function calls.
So instead of trying to shoe-horn lots of functionality into monoliths, at Segment we’ve doubled down on microservices. We’re betting that container scheduling and orchestration will continue to get easier and more powerful, while most metrics and monitoring will continue to be dominated by the idea of ‘hosts’ and ‘services’.
The caveat here is that microservices only work so long as it’s actually easy to create new services. Otherwise we’ve just traded a visibility problem for a provisioning problem.
In other posts, we’ve talked a little bit about what our services look like, and how we build them with terraform. And now, we’ve started splitting each service into modules, so we can re-use the exact configuration between stage and prod.
Here’s an example of a simple auth service, using terraform as our configuration to set up all of our resources:
For the curious, you can check out an example of the full module definition.
As long as there’s a singificant benefit (free metrics) and low cost (10-line terraform script), we remove the temptation to tack on different functionality into an existing service.
And so far, that approach has been working quite well.
Segment is a bit unusual–instead of microservices which are coordinating together, we have a lot of what I’d call “microworkers.” Fundamentally, it’s the same concept, but the worker doesn’t serve requests to clients. Instead, the typical Segment worker reads some data from a queue, does some processing on it, and then acks the message.
These workers end up being a lot simpler than services because there are no dependencies. There’s no coupling or worrying that a given problem with one worker will compound and disrupt the rest of a system. If a service is acting up, there’s just a single queue that ends up backing up. And we can scale additional workers to handle the load.
There are a few forces at work which make tiny workers the right call for us. But the biggest comes from our team size and relative complexity of what we’re trying to build.
Microservices are usually touted when the team grows to a size where there are too many people working on the same codebase. At that point, it makes sense to shard ownership of the codebase by team. But we’ve seen it be equally helpful with a small team as well.
Most folks I talk with are surprised at how small our engineering team is. To give you a rough sense of our scale:
400 private repos
70 different services (workers)
We’re in the postion of having a large product scope and a small engineering team. So if I’m currently on-call and get paged, it could be for code that I wrote 6-months ago and haven’t touched since.
And that’s the place where tiny, well-defined, services shine.
Here’s the typical scenario: first there’s an alert which gets triggered because a particular queue depth is backing up.
We can verify this is really the case (and isn’t getting better) by checking the queue depth in our monitoring tools.
At that point we know exactly which worker is backing up (since each worker subscribes to a single queue), and which logs to look at. Each service logs with its own tags, so we don’t have to worry about unrelated logs interleaving within a single app for multiple requests.
We can look at Datadog for a single dashboard containing that worker’s CPU, Memory, and the responses and latency coming from it’s ELB. Once we’ve identified the problem, it’s a question of reading through 50-100 line file to isolate exactly where the problem is happening (let’s play spot the memory leak!).
With a monolith, we could add individual monitoring specifically for each endpoint. But why bother when we get it for free by running code as part of its own process?
Not to mention the fact that we also get isolated CPU, memory, and latency (if the service sits behind an ELB) out of the box. It’s infinitely easier to track down a memory leak in a hundred-line worker with a single codepath than it is in a monolithic app with hundreds of endpoints.
I understand this approach won’t work for everyone. And it requires a pretty significant investment in up-front tooling to make sure that creating a new service from scratch has everything it needs. Depending on your team, workload, and product scope, it might not make sense.
But for any product operating with a high level of operational complexity and load, I’d choose the microservice architecture every time. It’s made our infrastructure flexible, scalable, and far easier to monitor–without sacrificing developer productivity.
Calvin French-Owen on November 20th 2015
Every month, Segment collects, transforms and routes over 50 billion API calls to hundreds of different business-critical applications. We’ve come a long way from the early days, where my co-founders and I were running just a handful of instances.
Today, we have a much deeper understanding of the problems we’re solving, and we’ve learned a ton. To keep moving quickly and avoid past mistakes, our team has started developing a list of engineering best practices.
Now that a lot of these “pro tips” have been tested, deployed and are currently in production… we wanted to share them with you. It’s worth noting that we’re standing on the shoulders of giants here, to The Zen of Python, Hints for Computer System Design, and the Twelve-Factor App for the inspiration.
Editor’s Note: This post was based off an internal wiki page for Segment “Pro Tips”. There are more tips recorded there, but we chose a handful that seemed most broadly applicable. They’re written as fact, but internally we treat them as guidelines, always weighing other trade-offs within the organization. Each practice is followed by a few bullet-points underscoring the main takeaways.
The best practices listed below have helped our engineering team move more efficiently and effectively, minimizing waste and boosting collaboration. Your engineering team might want to tweak these practices based on your goals or priorities, but by following the general principles laid out below you’ll have a good start.
When we first started out, we had one massive repo. Every module was filled with tightly coupled dependencies and was completely unversioned. Changing a single API required changing code globally. Developing with more than a handful of people would’ve been a nightmare.
So one of our first changes as the engineering team grew was splitting out the modules into separate repos (thanks TJ!). It was a massive task but it had huge payoff by making development with a larger team actually sane. Unfortunately, it was way harder than it should have been because we lumped everything together at the start.
It turns out this temptation to combine happens everywhere: in services, libraries, repos and tools. It’s so easy to add (just) one more feature to an existing codebase. But it has a long-term cost. Separation of concerns is the exact reason why UNIX-style systems are so successful; they give you the tools to compose many small building blocks into more complex programs.
structure code so that it’s easy to be split (or split from the beginning)
if a service or library doesn’t share concerns with existing ones, create a new one rather than shoe-horning it into an existing piece of code
testing and documenting libraries which perform a single function is much easier to understand
keep uptime, resource consumption and monitoring in mind when combining read/write concerns of a service
prefer libraries to frameworks, composing them together where possible
“Clever” code usually means “complicated” code. It’s hard to search for, and tough to track down where bugs are happening. We prefer simple code that’s explicit in it’s purpose rather than trying to create a magical API that relies on convention (go’s lack of “magic” is actually one of our favorite things about it).
As part of being explicit, always consider the “grep-ability” of your code. Imagine that you’re trying to find out where the implementation for the
post method lives, which is easier to find in a codebase?
Where possible, write code that is short, straightforward and easy to understand. Often that will come down to single functions that are easy to test and easy to document. Even libraries can perform just a single function and then be combined for more powerful functionality.
With comments, describe the “why” versus the typical “what” for a given process or routine. If a routine seems out of place but is necessary, it’s sometimes worth leaving a quick note as to why it exists at all.
avoid generating code dynamically or being overly ‘clever’ to shorten the line count
aim for functions that are <7 lines and <2 nested callbacks
Running code in production without metrics or alerting is flying blind. This has bitten our team more times than I’d care to admit, so we’ve increased our test coverage and monitoring extensively. Every time a user encounters a bug before we do, it damages their trust in us as a company. And that sucks.
Trust in our product is perhaps most valuable asset we have as a company. Losing that is almost completely irrecoverable; it’s the way we lose as a business. Our brand is built around data, and reliability is paramount to our success.
write test cases first to check for the broken behavior, then write the fix
all top-level apps should ship with metrics and monitoring
create ‘warning’ alerts for when an internal system is acting up, ‘critical’ ones when it starts affecting end customers
try to keep unrealistic failure scenarios in mind when designing the alerts
When building a product, there are three aspects you can optimize: Speed, quality, and scope. The catch… is that you can’t ever juggle all three simultaneously. Sacrificing quality by adding hacky fixes increases the amount of technical debt. It slows us down over the long-term, and we risk losing customer trust in the product. Not to mention, hacks are a giant pain to work on later.
At the same time, we can’t sacrifice speed either–that’s our main advantage as a startup. Long-running projects tend to drag on, use up a ton of resources and have no clearly defined “end.” By the time a monolithic project is finally ready to launch, releasing the finished product to customers becomes a daunting process.
When push comes to shove, it’s usually best to cut scope. It allows us to split shipments into smaller, more manageable chunks, and really focus on making each one great.
evaluate features for their benefit versus their effort
identify features that could be easily layered in later
cut features that create obvious technical debt
Separate code paths almost always become out of sync. One will get updated while another doesn’t, which makes for inconsistent behavior. At the architecture level, we want to try and optimize for a single code path.
Note that this is still consistent with splitting things apart, it just means that we need smaller pieces which are flexible enough to be combined together in different ways. If two pieces of code rely on the same functionality, they should use the same code path.
have a peer review your code; an objective opinion will almost always help
get someone else to sign-off on non-trivial pull-requests
if you ever find yourself copy-pasting code, consider pulling it into a library
if you need to frequently update a library, or keep state around, turn it into a service
Creating a loose mockup of a program is often the quickest way to understand the problem you’re solving. When you’re ready to write the real thing just
`rm -fr .git ` to start with a clean slate and better context.
Building something helps you learn more than you could ever hope to uncover through theorizing. Trust me, prototyping helps discover strange edge-cases and bottlenecks which may require you to rearchitect the solution. This process minimizes the impact of architectural changes.
don’t spend a lot of time with commit messages, keep them short but sensical
refactors typically come from a better understanding of the problem, the best way to get there is by building a version to “throw away”
Early on, it’s easy to write off automation as unimportant. But if you’ve done any time-consuming task more than 3 times you’ll probably want to automate it.
A key example of where we failed at this in the past was with Redshift’s cluster management. Investing in the tooling around provisioning clusters was a big endeavor, but it would have saved a ton of time if we’d started it sooner.
if you find yourself repeatedly spending more than a few minutes on a task, take a step back and consider tooling around it
ask yourself if you could be 20% more efficient, or if automation would help
share tools in dotfiles, vm, or task runner so the whole team can use them
Whenever you’re building out a new project or library, it’s worth considering which pieces can be pulled out and open sourced. At face value, it sounds like an extra constraint that doesn’t help ship product. But in practice, it actually creates much cleaner code. We’re guaranteed that the code’s API isn’t tightly coupled to anything we’re building internally, and that it’s more easily re-used across projects.
Open sourced code typically has a well-documented Readme, tests, CI, and more closely resembles the rest of the ecosystem. It’s a good sanity check that we’re not doing anything too weird internally, and the code is easier to forget about and re-visit 6-months later.
if you build a general purpose library without any dependencies, it’s a prime target for open sourcing
try and de-couple code so that it can be used standalone with a clear interface
never include custom configuration in a library, allow it to be passed in with sane defaults
Sometimes big problems arise in code and it may seem easier to write a work-around. Don’t do that. Hacking around the outskirts of a problem is only going to create a rat’s nest that will become an even bigger problem in the future. Tackle the root cause head-on.
A textbook example of this came from the first version of our integrations product. We proxied and transformed analytics calls through our servers to 30–40 different services, depending on what integrations the customer had enabled. On the backend, we had a single pool of integration workers that would read each incoming event from the queue, look up which settings were enabled, and then send copies of the event each enabled integration.
It worked great for the first year, but over time we started running into more and more problems. Because the workers were all shared, a single slow endpoint would grind the entire pool of workers to a halt. We kept adjusting and tweaking individual timeouts to no end, but the backlogs kept occurring. Since then, we’ve fixed the underlying issue by partitioning the data processing queues by endpoint so they operate completely independently. It was a large project, but one that had immediate pay-off, allowing us to scale our integrations platform.
Sometimes it’s worth taking a step back to solve the root cause or upstream problem rather than hacking around the periphery. Even if it requires a more significant restructuring, it can save you a lot of time and headache down the road, allowing you to achieve much greater scale.
whenever fixing a bug or infrastructure issue, ask yourself whether it’s a core fix or just a band-aid over one of the symptoms
keep tabs on where you’re spending the most time, if code is continually being tweaked, it probably needs a bigger overhaul
if there’s some bug or alert we didn’t catch, make sure the upstream cause is being monitored
When designing applications, coming up with a data model is one of the trickiest parts of implementation. The frontend, naturally, wants to match the user’s idea of how the data is formatted. Out of necessity, the backend has to match the actual data format. It must be stored in a way that is fast, performant and flexible.
So when starting with a new design, it’s best to first look at the user requirements and ask “which goals do we want to meet?” Then, look at the data we already have (or decide what new data you need) and figure out how it should be combined.
The frontend models should match the user’s idea of the data. We don’t want to have to change the data model every time we change the UI. It should remain the same, regardless of how the interface changes.
The service and backend models should allow for a flexible API from the programmer’s perspective, in a way that’s fast and efficient. It should be easy to combine individual services to build bigger pieces of functionality.
The controllers are the translation layer, tying together individual services into a format which makes sense to the frontend code. If there’s a piece of complicated logic which makes sense to be re-used, then it should be split into it’s own service.
the frontend models should match the user’s conception of the data
the services need to map to a data model that is performant and flexible
controllers can map between services and the frontend to assemble data
It’s easy to talk at length about best practices but actually following them requires discipline. Sometimes it’s tempting to cut corners or skip a step; but that doesn’t help long-term.
Now that we’ve codified these engineering best practices and the rationale behind each one, they have made their way into our default mode of operation. The act of explicitly writing them down has both clarified our thinking and helped us avoid making the same short-term mistakes over and over.
In practice, this means that we invest heavily in good tooling, modular libraries and microservices. In development, we keep a shared VM that auto-updates, with shared dotfiles for easily navigating our many small repositories. We put a focus on creating projects which increase functionality through composability rather than inheritance. And we’ve worked hard to streamline our process for running services in production.
All of this keeps our development team moving quickly and increases the quality of the product we ship. We’re able to accomplish a lot more with a lot less effort. And we’ll continue trying to improve and share that tooling with the community as it matures.
Andy Jiang, Vince Prignano on November 17th 2015
Growing a business is hard and growing the engineering team to support that is arguably harder, but doing both of those without a stable infrastructure is basically impossible. Particularly for high growth businesses, where every engineer must be empowered to write, test, and ship code with a high degree of autonomy.
Over the past year, we’ve added ~60 new integrations (to over 160), built a platform for partners to write their own integrations, released a Redshift integration, and have a few big product announcements on the way. And in that time, we’ve had many growing pains around managing multiple environments, deploying code, and general development workflows. Since our engineers are happiest and most productive when their time is spent shipping product, building tooling, and scaling services, it’s paramount that the development workflow and its supporting infrastructure are simple to use and flexible.
And that’s why we’ve automated many facets of our infrastructure. We’ll share our current setup in greater detail below, covering these main areas:
Let’s dive in!
As the code complexity and the engineering team grow, it can become harder to keep dev environments consistent across all engineers.
Before our current solution, one big problem our engineering team faced was keeping all dev environments in sync. We had a GitHub repo with a set of shell scripts that all new engineers executed to install the necessary tools and authentication tokens onto their local machines. These scripts would also setup Vagrant and a VM.
But this VM was built locally on each computer. If you modified the state of your VM, then in order to get it back to the same VM as the other engineers, you’d have to build everything again from scratch. And when one engineer updates the VM, you have to tell everyone on Slack to pull changes from our GitHub VM repo and rebuild. An awfully painful process, since Vagrant can be slow.
Not a great solution for a growing team that is trying to move fast.
When we first played with Docker, we liked the ability to run code in a reproducible and isolated environment. We wanted to reuse these Docker principles and experience in maintaining consistent dev environments across a growing engineering team.
We wrote a bunch of tools to set up the VM for new engineers to upgrade or to reset from the basic image state. When our engineers set up the VM for the first time, it asks for their GitHub credentials and AWS tokens, then pulls and builds from the latest image in Docker Hub.
On each run, we make sure that the VM is up-to-date by querying the Docker Hub API. This process updates packages, tools, etc. that our engineers use everyday. It takes around 5 seconds and is needed in order to make sure that everything is running correctly for the user.
Additionally, since our engineers use Macs, we switched from boot2dockervirtualbox machine to a Vagrant hosted boot2docker instance so that we could take advantage of NFS to share the volumes between the host and guest. Using NFS provides massive performance gains during local development. Lastly, NFS allows any changes our engineers make outside of the VM to be instantaneously reflected within the VM.
With this solution we have vastly reduced the number of dependencies needed to be installed on the host machine. The only things needed now are Docker, Docker Compose, Go, and a
The ideal situation is dev and prod environments running the same code, yet separated so code running on dev may never affect code running production.
Before we had the AWS state (generated by Terraform) stored alongside the Terraform files, but this wasn’t a perfect system. For example if two people asynchronously plan and apply different changes, the state will be modified and who pushes last is going to have hard times to figure out the merge collisions.
We achieved mirroring staging and production in the simplest way possible: copying files from one folder to another. Terraform enabled us to reduce the amount of hours taken to modify the infrastructure, deploy new services and making improvements.
We integrated Terraform with CircleCI writing a custom build process and ensuring that the right amount of security was taken in consideration before applying.
At the moment, we have one single repository hosted on GitHub named
infrastructure, which contains a collection of Terraform scripts that configure environmental variables and settings for each of our containers.
When we want to change something in our infrastructure, we make the necessary changes to the Terraform scripts and run them before opening a new pull request for someone else on the infra-team to review it. Once the pull request gets merged to master, CircleCI will start the deployment process: the state gets pulled, modified locally, and stored again on S3.
When developing locally, it’s important to populate local data stores with dummy data, so our app looks more realistic. As such, seeding databases is a common part of setting up the dev environment.
We rely on CircleCI, Docker, and volume containers to provide easy access to dummy data. Volume containers are portable images of static data. We decided to use volume containers because the data model and logic becomes less coupled and easier to maintain. Also just in case this data is useful in other places in our infrastructure (testing, etc., who knows).
Loading seed data into our local dev environment occurs automatically when we start the app server in development. For example, when the
app (our main application) container is started in a dev environment,
app‘s docker-compose.yml script will pull the latest
seed image from Docker Hub and mount the raw data in the VM.
seed image from Docker Hub is created from a GitHub repo
seed, that is just a collection of JSON files as the raw objects we import into our databases. To update the seed data, we have CircleCI setup on the repo so that any publishes to master will build (grabbing our mongodb and redis containers from Docker Hub) and publish a new
seed image to Docker Hub, which we can use in the app.
Due to the data-heavy nature of Segment, our app already relies on several microservices (db service, redis, nsq, etc). In order for our engineers to work on the app, we need to have an easy way to create these services locally.
Again, Docker makes this workflow extremely easy.
Similar to how we use
seed volume containers to mount data into the VM locally, we do the same with microservices. We use the docker compose file to grab images from Docker Hub to create locally, set addresses and aliases, and ultimately reduce the complexity to a single terminal command to get everything up and running.
If you write code, but never ship it to production, did it ever really happen? 😃
Deploying code to production is an integral part of the development workflow. At Segment, we prioritize easiness and flexibility around shipping code to production, since that encourages our engineers to move quickly and be productive. We’ve also created adequate tooling around safeguarding for errors, rolling back, and monitoring build statuses.
We use Docker, ECS, CircleCI, and Terraform to automate as much of the continuous deployment process as possible.
Whenever code is pushed or merged into its
master branch, the CircleCI script build the container and push it to Docker Hub.
With this setup, we can define the configuration once for any service, making it extremely easy for our engineers to create and deploy new microservices. As Calvin mentioned in a previous post, “Rebuilding Our Infrastructure with Docker, ECS, and Terraform”:
We no longer have a complex set of provisioning scripts or AMIs—we just hand the production cluster an image, and it runs. There’s no more stateful instances, and we’re guaranteed to run the same exact code on both staging and prod.
The automation and ease of use around deployment have positively impacted more than just our engineers. Our success and marketing teams can update markdown files in a handful of repos that, when merged to master, kick off an auto deploy process so that changes can be live in minutes.
Because we chose to invest effort into rethinking and automating our dev workflow and its supporting infrastructure, our engineering team move fasters and more confidently. We spend more time doing high leverage jobs that we love—shipping product and building internal tools—and less time yak shaving.
That said, this is by no means the final iteration of our infrastructure automation. We are constantly playing with new tools and testing new ideas, seeing what further efficiencies we can eek out.
This has been a tremendous learning process for us and we’d love to hear what others in the community have done with their dev workflows. If you end up implementing something like this (or have already), let us know! We’d love to hear what you’ve done, and what’s worked or hasn’t for others with similar problems.
Andy Jiang on October 20th 2015
A little while ago we open-sourced a static site generator called Metalsmith. We built Metalsmith to be flexible enough that it could build blogs (like the one you’re reading now), knowledge bases, and most importantly our technical documentation.
Using Metalsmith to build a simple blog is one thing, but building easy-to-maintain documentation isn’t as simple. There are different sections, libraries, various reusable code snippets and content that live in multiple places, and other stuff to consider. Metalsmith simplifies maintaining all of these moving parts and let’s us focus purely on creating helpful content. Here’s how we did it!
The first thing to know is that Metalsmith works by running a series of transformations on a directory of files. In the case of our docs, that directory is just a bunch of folders with Markdown files in them.
The directory structure mimics the URL structure:
And an individual Markdown file might look like this:
It’s structured this way because it makes the actual content easy to maintain. They are just regular folders with plain old Markdown files in them. That means that anyone on the team can easily edit the docs without any technical knowledge. You can even do it right form the GitHub editor:
So far so good. But how do just those simple Markdown files get transformed into our entire technical documentation? That’s where Metalsmith comes in. Like I mentioned earlier, Metalsmith is really just a series of transformations to run on the directory of Markdown files.
I’ll walk you through each of the transformations we use in order.
The first transformation we use is a custom plugin that takes all the files in a directory and exposes them as partials in Handlebars. That means we can keep content that we repeat a lot in a single place for easier maintenance.
The next transformation is a plugin that groups files together into “collections”. In our case, those collections are built into our sub-directories, so we have collections like: Libraries, Plugins, Tutorials, etc.
We also pass our own custom
sorter function that will return the order specified in the array and append the remaining files pseudo-alphabetized.
Having all of the collections grouped as simple arrays makes it easy for us to do things like automatically generate a top-level navigation to get to every collection:
Or to automatically generate a collection-level navigation for navigating between pages:
The plugin categorizes all files that fit the provided definition (in our case, providing file path patterns), adds a
collection array to each file that contains the name of the collection, and finally adds a
previous properties to files that points to the sibling file in the collection. This allows us to easily render collections later on with handlebars:
The key is that all of those pieces are automatically generated, so you never need to worry about remembering to link between pages.
Note that this plugin does not determine the final directory structure. By default, the directory structure is preserved from start to end, unless a plugin specifically modifies this.
The third transformation step is to template all of our Markdown files in placewith metalsmith-in-place. By that, I mean that we just run Handlebars over our Markdown files right where they are, so that we can take advantage of a bunch of helpers we’ve added.
For example, in any of our Markdown files we can use an
api-example helper like so:
Which, will render a language-agnostic code snippet that remembers the user’s language preference:
You can find the above code snippet here.
Then, we transform all of those Markdown files into HTML files with the metalsmith-markdown plugin. Pretty self-explanatory!
Now that we have all of our files as
.html instead of
.md, the next transformation is pretty simple using metalsmith-headings. It iterates over all of the files once more, extracting the text of all the
<h2> tags and adding that array as metadata of the file. So you might end up with a file object that looks like this:
Why would we want to do that? Because it means we can build the navigation in the sidebar automatically from the content of the file itself:
So you never need to worry about remembering to update the navigation yourself.
The next step is to use the permalinks plugin to transform files so that all of the content lives in
index.html files, so they can be served statically. For example, given a source directory like this:
The permalinks plugin would transform that into:
So that NGINX can serve those static files as:
The last step is to template all of our source files again (they’re not
.md anymore, they’re all
.html at this point) by rendering them into our top-level layout.
layout.html file is where all of the navigation rendering logic is contained, and we just dump the contents of each of the pages that started as Markdown into the global template, like so:
Once that’s done, we’re done! All of those files that started their life as simple Markdown have been run through a bunch of transformations. They now live as a bunch of static HTML files that each have automatically-generated navigations and sidebars (with active states too).
The last step is to deploy our documentation. This step isn’t to be forgotten, because our goal was to make our docs so simple to edit that everyone on the team can apply fixes as customers report problems.
To make our team as efficient as possible about shipping fixes and updates to our docs, we have our repo setup so that any branch merged to
master will kick off CircleCI to build and publish to production. Anyone can then make edits in a separate branch, submit a PR, then merge to
master, which will then automatically deploy the changes.
For the vast majority of text-only updates, this is perfect. Though, occasionally we may need more complex things.
For more information on the tech we use for our backend, check out Rebuilding Our Infrastructure with Docker, ECS, and Terraform.
Before we converted our docs to Metalsmith, they lived in a bunch of Jade files that were a pain in the butt to change. Because we had little incentive to edit them, we let typos run rampant and waited too long to fix misinformation. Obviously this was a bad situation for our customers.
Now that our docs are easy to edit in Markdown and quick to deploy, we fix problems much faster. The quickest way to fix docs issues is to make a permanent change, rather than repeat ourselves in ticket after ticket. With a simpler process, we’re able to serve our customers much better, and we hope you can too!
Calvin French-Owen on October 7th 2015
In Segment’s early days, our infrastructure was pretty hacked together. We provisioned instances through the AWS UI, had a graveyard of unused AMIs, and configuration was implemented three different ways.
As the business started taking off, we grew the size of the eng team and the complexity of our architecture. But working with production was still limited to a handful of folks who knew the arcane gotchas. We’d been improving the process incrementally, but we needed to give our infrastructure a deeper overhaul to keep moving quickly.
So a few months ago, we sat down and asked ourselves: “What would an infrastructure setup look like if we designed it today?”
Over the course of 10 weeks, we completely re-worked our infrastructure. We retired nearly every single instance and old config, moved our services to run in Docker containers, and switched over to use fresh AWS accounts.
We spent a lot of time thinking about how we could make a production setup that’s auditable, simple, and easy to use–while still allowing for the flexibility to scale and grow.
Here’s our solution.
Instead of using regions or tags to separate different staging and prod instances, we switched over totally separate AWS accounts. We need to ensure that our provisioning scripts wouldn’t affect our currently running services, and using fresh accounts meant that we had a blank slate to start with.
ops account serves as the jump point and centralized login. Everyone in the organization can have a IAM account for it.
The other environments have a set of IAM roles to switch between them. It means there’s only ever one login point for our admin accounts, and a single place to restrict access.
As an example, Alice might have access to all three environments, but Bob can only access dev (ever since he deleted the production load balancer). But they both enter through the
Instead of having complex IAM settings to restrict access, we can easily lock down users by environment and group them by role. Using each account from the interface is as simple as switching the currently active role.
Instead of worrying that a staging box might be unsecured or alter a production database, we get true isolation for free. No extra configuration required.
There’s the additional benefit of being able to share configuration code so that our staging environment actually mirrors prod. The only difference in configuration are the sizes of the instances and the number of containers.
Finally, we’ve also enabled consolidated billing across the accounts. We pay our monthly bill with the same invoicing and see a detailed breakdown of the costs split by environment.
As of today, we’re now running the majority of our services inside Docker containers, including our API and data pipeline. The containers receive thousands of requests per second and process 50 billion events every month.
The biggest single benefit of Docker is the extent that it’s empowered the team to build services from scratch. We no longer have a complex set of provisioning scripts or AMIs—we just hand the production cluster an image, and it runs. There’s no more stateful instances, and we’re guaranteed to run the same exact code on both staging and prod.
After configuring our services to run in containers, we chose ECS as the scheduler.
At a high level, ECS is responsible for actually running our containers in production. It takes care of scheduling services, placing them on separate hosts, and zero-downtime reloads when attached to an ELB. It can even schedule across AZs for better availability. If a container dies, ECS will make sure it’s re-scheduled on a new instance within that cluster.
The switch to ECS has vastly simplified running a service without needing to worry about upstart jobs or provisioning instances. It’s as easy as adding a Dockerfile, setting up the task definition, and associating it with a cluster.
In our setup, the Docker images are built by CI, and then pushed to Docker Hub. When a service boots up, it pulls the image from Docker Hub, and then ECS schedules it across machines.
We group our service clusters by their concern and load profile (e.g. different clusters for API, CDN, App, etc). Having separate clusters means that we get better visibility and can decide to use different instance types for each (since ECS has no concept of instance affinity).
Each service has a particular task definition indicating which version of the container to run, how many instances to run on, and which cluster to choose.
During operation, the service registers itself with an ELB and uses a healthcheck to confirm that the container is actually ready to go. We point a local Route53 entry at the ELB, so that services can talk to each other and simply reference via DNS.
The setup is nice because we don’t need any service discovery. The local DNS does all the bookkeeping for us.
ECS runs all the services and we get free cloudwatch metrics from the ELBs. It’s been a lot simpler than having to register services with a centralized authority at boot-time. And the best part is that we don’t have to deal with state conflicts ourselves.
Where Docker and ECS describe how to run each of our services, Terraform is the glue that holds them together. At a high level, it’s a set of provisioning scripts that create and update our infrastructure. You can think of it like a version of Cloudformation on steroids–but it doesn’t make you want to poke your eyes out.
Rather than running a set of servers for maintaining state, there’s just a set of scripts that describe the cluster. Configuration is run locally (and in the future, via CI) and committed to git, so we have a continuous record of what our production infrastructure actually looks like.
Here’s an sample of our Terraform module for setting up our bastion nodes. It creates all the security groups, instances, and AMIs, so that we’re able to easily set up new jump points for future environments.
We use the same module in both stage and prod to set up our individual bastions. The only thing we need to switch out are the IAM keys, and we’re ready to go.
Making changes is also painless. Instead of always tearing down the entire infrastructure, Terraform will make updates where it can.
When we wanted to change our ELB draining timeout to 60 seconds, it took a simple find/replace followed by a
terraform apply. And voilà, two minutes later we had a fully altered production setup for all of our ELBs.
It’s reproduceable, auditable, and self-documenting. No black boxes here.
We’ve put all the config in a central
infrastructure repo, so it’s easy to discover how a given service is setup.
We haven’t quite reached the holy grail yet though. We’d like to convert more of our Terraform config to take advantage of modules so that individual files can be combined and reduce the amount of shared boilerplate.
Along the way we found a few gotchas around the
.tfstate, since Terraform always first reads from the existing infrastructure and complains if the state gets out of sync. We ended up just committing our
.tfstate to the repo, and pushing it after making any changes, but we’re looking into Atlas or applying via CI to solve that problem.
By this point, we had our infrastructure, our provisioning, and our isolation. The last things left were metrics and monitoring to keep track of everything running in production.
In our new environment, we’ve switched all of our metrics and monitoring over to Datadog, and it’s been fantastic.
We’ve been incredibly happy with Datadog’s UI, API, and complete integration with AWS, but getting the most out of the tool comes from a few key pieces of setup.
The first thing we did was integrate with AWS and Cloudtrail. It gives a 10,000 foot view of what’s going on in each of our environments. Since we’re integrating with ECS, the Datadog feed updates everytime a task definition updates, so we end up getting notifications for deploys for free. Searching the feed is surprisingly snappy, and makes it easy to trace down the last time a service was deployed or rescheduled.
Next, we made sure to add the Datadog-agent as a container to our base AMI (datadog/docker-dd-agent). It not only gathers metrics from the host (CPU, Memory, etc) but also acts as a sink for our statsd metrics. Each of our services collects custom metrics on queries, latencies, and errors so that we can explore and alert on the in Datadog. Our go toolkit (soon to be open sourced) automatically collects the output of
pprof on a ticker and sends it as well, so we can monitor memory and goroutines.
What’s even cooler is that the agent can visualize instance utilization across hosts in the environment, so we can get a high level overview of instances or clusters which might be having issues:
Additionally, my teammate Vince created a Terraform provider for Datadog, so we can completely script our alerting against the actual production configuration. Our alerts will be recorded and stay in sync with what’s running in prod.
By convention, we specify two alert levels:
warning is there to let anyone currently online know that something looks suspicious and should be triggered well in advance of any potential problems. The
criticalalerts are reserved for ‘wake-you-up-in-the-middle-of-the-night’ problems where there’s a serious system failure.
What’s more, once we transition to Terraform modules and add the Datadog provider to our service description, then all services end up getting alerts for free. The data will be powered directly by our internal toolkit and Cloudwatch metrics.
Once we had all these pieces in place, the day had finally come to make the switch.
We first set up a VPC peering connection between our new production environment and our legacy one–allowing us to cluster databases and replicate across the two.
Next, we pre-warmed the ELBs in the new environment to make sure that they could handle the load. Amazon won’t provision automatically sized ELBs, so we had to ask them to ramp it ahead of time (or slowly scale it oursleves) to deal with the increased load.
From there, it was just a matter of steadily ramping up traffic from our old environment to our new one using weighted Route53 routes, and continuously monitoring that everything looked good.
Today, our API is humming along, handling thousands of requests per second and running entirely inside Docker containers.
But we’re not done yet. We’re still fine-tuning our service creation, and reducing the boilerplate so that anyone on the team can easily build services with proper monitoring and alerting. And we’d like to improve our tooling around working with containers, since services are no longer tied to instances.
We also plan to keep an eye on promising tech for this space. The Convox team is building awesome tooling around AWS infrastructure. Kubernetes, Mesosphere, Nomad, and Fleet seemed like incredibly cool schedulers, though we liked the simplicity and integration of ECS. It’s going to be exciting to see how they all shake out, and we’ll keep following them to see what we can adopt.
After all of these orchestration changes, we believe more strongly than ever in outsourcing our infrastructure to AWS. They’ve changed the game by productizing a lot of core services, while maintaining an incredibly competitive price point. It’s creating a new breed of startups that can build products efficiently and cheaply while spending less time on maintenance. And we’re bullish on the tools that will be built atop their ecosystem.
Calvin French-Owen, Chris Sperandio on August 4th 2015
It wasn’t long ago that building out an analytics pipeline took serious engineering chops. Buying racks and drives, scaling to thousands of requests a second, running ETL jobs, cleaning the data, etc. A team of engineers could easily spend months on it.
But these days, it’s getting easier and cheaper. We’re seeing the UNIX-ification of hosted services: each one designed to do one thing and do it well. And they’re all composable.
It made us wonder: just how quickly could a single person build their own pipeline without having to worry about maintaining it? An entirely managed data processing stream?
It sounded like analytics Zen. So we set out to find our inner joins peace armed with just a few tools: Terraform, Segment, DynamoDB and Lambda.
make (check it out on Github).
As a toy example, our data pipeline takes events uploaded to S3 and increments a time-series count for each event in Dynamo. It’s the simplest rollup we could possibly do to answer questions like: “How many purchases did I get in the past hour?” or “Are my signups increasing month over month?”
Here’s the general dataflow:
Event data enters your S3 bucket through Segment’s integration. The integration uploads all the analytics data sent to the Segment API on an hourly basis.
That’s where the composability of AWS comes in. S3 has a little-known feature called “Event Notifications”. We can configure the bucket to push a notification to a Lambda function on every file upload.
In theory, our Lambda function could do practically anything once it gets a file. In our example, we’ll extract the individual events, and then increment each
<event, hour> pair in Dynamo.
Once our function is up and running, we’ll have a very rudimentary timeseries database to query event counts over time.
From there, we’ll handle the incoming events, and update each item in Dynamo:
And finally, we can query for the events in our database using the CLI:
We could also build dashboards on it a la google analytics or geckoboard.
Even though we have our architecture and lambda function written, there’s still the task of having to describe and provision the pipeline on AWS.
Configuring these types of resources has been kind of a pain for us in the past. We’ve tried Cloudformation templates (verbose) and manually creating all the resources through the AWS UI (slooooow).
Neither of these options has been much fun for us, so we’ve started using Terraform as an alternative.
If you haven’t heard of Terraform, definitely give it a spin. It’s a badass tool designed to help provision and describe your infrastructure. It uses a much simpler configuration syntax than Cloudformation, and is far less error-prone than using the AWS Console UI.
As a taste, here’s what our
lambda.tf file looks like to provision our Lambda function:
The Terraform plan for this project creates an S3 bucket, the Lambda function, the necessary IAM roles and policies, and the Dynamo database we’ll use for storing the data. It runs in under 10 seconds and immediately sets up our infrastructure so that everything is working properly.
If we ever want to make a change, a simple
terraform apply, will cause the changes to update in our production environment. At Segment, we commit our configurations to source control so that they are easily audited and changelog’d.
We just walked through a basic example, but with Lambda there’s really no limit to what your functions might do. You could publish the events to Kinesis with additional Lambda handlers for further processing, or pull in extra fields from your database. The possibilities are pretty much endless thanks the APIs Amazon has created.
If you’d like to build your own pipeline, just clone or fork our example Github repo.
And that’s it! We’ll keep streaming. You keep dreaming.
Orta Therox on June 24th 2015
We’re excited to welcome Orta from CocoaPods to the blog to discuss the new Stats feature! We’re big fans of CocoaPods and are excited to help support the project.
CocoaPods is the Dependency Manager for iOS and Mac projects. It works similar to npm, rubygems, gradle or nuget. We’ve been running the open source project for 5 years and we’ve tried to keep the web infrastructure as minimal as possible.
Our users have been asking for years about getting feedback on how many downloads their libraries have received. We’ve been thinking about the problem for a while, and finally ended up asking Segment if they would sponsor a backend for the project.
It wasn’t just enough to offer just download counts. We spend a lot of time working around Xcode (Apple’s developer tool) project file intricacies, however in this context, it provides us with foundations for a really nice feature. CocoaPods Stats will be able to keep track of the unique number of installs within Apps / Watch Apps / Extensions / Unit Tests.
This means is that developers using continuous integration only register as 1 install, even if the server runs
pod install each time, separating total installations vs actual downloads.
Let’s go over how we check which pods get sent up for analytics, and how we do the unique installs. CocoaPods-Stats is a plugin that will be bundled with CocoaPods within a version or two. It registers as a post-install plugin and runs on every
pod install or
We’re very pessimistic about sending a Pod up to our stats server. We ensure that you have a CocoaPods/Specs repo set up as your master repo, then ensure that each pod to be sent is inside that repo before accepting it as a public domain pod.
First up, we don’t want to know anything about your app. So in order to know unique targets we use your project’s target UUID as an identifier. These are a hash of your MAC address, Xcode’s process id and the time of target creation (but we only know the UUID/hash, so your MAC address is unknown to us). These UUIDs never change in a project’s lifetime (contrary to, for example, the bundle identifier). We double hash it just to be super safe.
We then also send along the CocoaPods version that was used to generate this installation and about whether this installation is a
pod try [pod] rather than a real install.
My first attempt at a stats architecture was based on how npm does stats, roughly speaking they send all logs to S3 where they are map-reduced on a daily basis into individual package metrics. This is an elegant solution for a companywith people working full time on up-time and stability. As someone who wants to be building iOS apps, and not maintaining more infrastructure in my spare time, I wanted to avoid this.
We use Segment at Artsy, where I work, and our analytics team had really good things to say about Segment’s Redshift infrastructure. So I reached out about having Segment host the stats infrastructure for CocoaPods.
We were offered a lot of great advice around the data-modelling and were up and running really quickly. So you already know about the CocoaPods plugin, but from there it sends your anonymous Pod stats up to stats.cocoapods.org. This acts as a conduit sending analytics events to Segment. A daily task is triggered on the web site, this makes SQL requests against the Redshift instance which is then imported into metrics.cocoapods.org.
If you want to learn more about CocoaPods, check us out here.
Anthony Short on May 11th 2015
Over the past few months at Segment we’ve been rebuilding large parts of our app UI. A lot of it had become impossible to maintain because we were relying on models binding to the DOM via events.
Views that are data-bound to the DOM sound great but they are difficult to follow once they become complex and bi-directional. You’d often forget to bind some events and a portion of your UI would be out of sync, or you’d add a new feature and break 3 others.
So we decided to take on the challenge to build our own functional alternative to React.
We managed to get a prototype working in about a month. It could render DOM elements and the diffing wasn’t too bad. However, the only way to know if it was any good was to throw it into a real project. So that’s what we did. We built the Tracking Plan using the library. At this point it didn’t even have a real name.
It started simple, we found bugs and things we’d overlooked, then we started seeing patterns arising and ways to make the development experience better.
We were able to quickly try some ideas and trash them if they didn’t work. At first we started building it like a game engine. It had a rendering loop that would check to see if components were dirty and re-render on every frame, and a
scenethat managed all the components and inputs like a game world. This turned out to be annoying for debugging and made it overly complex.
Thanks to this process of iteration we were able to cut scope. We never needed
refs like React, so we didn’t add it. We started with a syntax that used prototypes and constructors but it was unnecessarily verbose. We haven’t had to worry about maintaining text selection because we haven’t run across it in real-world use. We also haven’t had any issues with element focus because we’re only supporting newer browsers.
We spent many late nights discussing the API on a white board and it’s something we care about a lot. We wanted it to be so simple that it would be almost invisible to the user. An API is just UI for developers so we treated it like any other design problem at Segment — build, test, iterate.
Performance is the most important feature of any UI library. We couldn’t be sure if the library was on the right path until we’d seen it running in a real app with real data and constraints. We managed to get decent performance on the first try and we’ve been fine-tuning performance as we add and remove new features.
We first ran into performance issues when we had to re-build the debugger. Some customers were sending hundreds of events per second and the animation wouldn’t work correctly if we were just trashing DOM elements every frame. We implemented a more optimized key diffing algorithm and now it renders hundreds of events per second at a smooth 60 fps with ease. Animations included.
Eventually everything started to settle down. We took the risk and implemented our own library and it now powers the a large portion of our app. We’ve stripped thousands of lines of code and now it’s incredibly easy to add new features and maintain thanks to this new library.
Finally, we think it’s ready to share with everyone else.
Deku is our library for building user interfaces. It supports many of the features you’re familiar with in React but aims to be small and functional. You define your UI as a tree of components and whenever a state change occurs it re-renders the entire tree to patch the DOM using a highly optimized diffing algorithm.
The whole library weighs in at less than 10kb and is easy to follow. It’s also using npm so some of those modules are probably being used elsewhere in your code anyway.
It uses the same concept of components as React. However, we don’t support older browsers, so the codebase is small and component API is almost non-existent. It even supports JSX thanks to Babel.
Here’s what a component looks like in Deku:
Then you can import that component and render your app:
You’ll notice there is no concept of classes or use of
this. We’re using plain objects and functions. The ES6 module syntax is used to define components and every lifecycle hook is passed the
component object which holds the
state you’ll use to render your template.
We never really needed classes. What’s the point when you never initialize them anyway? The beauty of using plain functions is that the user can use the ES6 module system to define them however they want! Best of all, there’s is no new syntax to learn.
Deku has many of the same lifecycle hooks but with two new ones -
afterRender. These are called every single render, including the first, unlike the update hooks. We’ve found these let us stop thinking about the lifecycle state so much.
Some of the lifecycle hooks are passed the
setState function so you can trigger side-effects to update state and re-render the app. DOM events are delegated to the root element and we don’t need to use any sort of synthesized event system because we’re not supporting IE9 and below. This means you never need to worry about handling or optimizing event binding.
To render the component to the DOM we need to create a
tree will manage loading data, communicating between components and allows us to use plugins on the entire application. For us it has eliminated the need for anything like Flux and there are no singletons in sight.
You can render the component tree anyway you’d like — you just need a renderer for it. So far we have a HTML renderer for the server and a DOM renderer for the client since those are the two we’ve needed. It would be possible to build a canvas or WebGL renderer.
The dbmonster performance mini-app written in Deku is also very fast and renders at roughly 15-16 fps compared to most other libraries which render at 11-12 fps. We’re always looking for more ways to optimize the diffing algorithm even further but it’s already we think it’s fast enough.
The first thing we usually get asked when we tell people about Deku is “Why didn’t you just use React?”. It could seem like a classic case of NIH syndrome.
We originally looked into this project because we use Duo as a front-end build tool. Duo is like npm, but just uses Github. It believes in small modules doing one thing well. React was a ‘big thing’ doing many things within a black box. We like knowing in detail how code works, so we feel comfortable with it and can debug it when something goes wrong. It’s very hard to do that with React or any big framework.
We ended up using React for a short time but the API forced us to use a class-like syntax that would lock us into the framework. We also found that we kept fighting with function context all the time which is waste of brain energy. React has some functional aspects to it but it still feels very object-oriented. You’re always concerning yourself with implicit environment state thanks to
this and the class system. If you don’t use classes you never need to worry about
this, you never need decorators and you force people to think about their logic in a functional way.
What started as a hack project to see if we could better understand the concept behind React has developed into a library that is replacing thousands of lines of code and has become the backbone of our entire UI. It’s also a lot of fun!
We’ve come a long way in the past few months. Next we’re going to look at a few ways we could add animation states to components to solve a problem that plagues every component system using virtual DOM.
In our next post on Deku, we’ll explain how we structure our components and how we deal with CSS. We’ll also show off our UIKit — the set of components we’ve constructed to rapidly built out our UI.
Steven Miller, Dominic Barne on April 9th 2015
Last week, we open sourced Sherlock, a pluggable tool for detecting third-party services on a given web page. You might use this to detect analytics trackers (eg: Google Analytics, Mixpanel, etc.), or social media widgets (eg: Facebook, Twitter, etc.) on your site.
We know that setting up your integrations has required some manual work. You’ve had to gather all your API keys and enter them into your Segment project one by one. We wanted to make this process easier for you, and thought that a “detective” to find your existing integrations would help!
Enter Sherlock. When you tell us your project’s url, Sherlock searches through your web page and finds the integrations you’re already using. Then, he automatically enters your integrations’ settings, which makes turning on new tools a bit easier.
Here’s a code sample of Sherlock in action:
Since there are no services baked into Sherlock itself, we’re adding a Twitter plugin here manually. Sherlock opens the
url and if
widgets.js is present on the page, then it will be added to
The above example is admittedly trivial. Here’s a more realistic use-case:
Here, we are adding sherlock-segment, a collection of plugins for about 20 of the integrations on our platform. Now,
results will look like this:
To make your own plugin, simply add the following details to your
package.json: (feel free to use sherlock-segment as a starting point)
name should include “sherlock-“ as a prefix
keywords should include “sherlock”
Your plugin should export an array of service configuration objects, each object can support the following keys:
name should be a human-readable string
script can be a string, regular expression, or a function that matches the src attribute of a script tag
settings is an optional function that is run on the page to extract configuration
Here is an example service configuration: