Benjamin Yolken on September 15th 2020
Stephen Mathieson, Calvin French-Owen on May 26th 2016
For the past year, we’ve been heavy users of Amazon’s EC2 Container Service (ECS). It’s given us an easy way to run and deploy thousands of containers across our infrastructure.
ECS gives us free Cloudwatch metrics, automatic healthchecks, and scheduling across hundreds of nodes. It’s radically simplified how we deploy code to production. In short, it’s pretty awesome.
But ECS had one major shortcoming: navigating the AWS Console is a massive pain. It’s hard to find which containers are running and what version of an image is deployed.
That’s why we built Specs
Specs gives you a high-level window into your ECS clusters and services.
It provides a handy search box to find particular services in production, and then helps diagnose what the service’s exact state. Specs lets us know if a container isn’t able to be placed due to insufficient capacity or if all containers in a service are dead.
It gives the entire development team a better picture of how our production environment is configured.
We run Specs behind an internal Google oAuth server, so any engineer can view it even if they don’t have an IAM account. It’s allowed a lot more of the team to debug issues without bothering the core infra team.
Getting started with Specs couldn’t be easier. Assuming you’re running Docker and have your AWS credentials exported, just run the following command:
You’ll get convenient dashboards with zero configuration.
If you’d like to help contribute, please check out the github repo. We’re excited to add features like log tailing and a real-time events feed to make working with ECS a truly seamless experience.
Lipei Wang on May 23rd 2016
Python, one of the most popular scripting languages, is also one of the most preferred tools for data analysis and visualization. In addition to the broader Python developer community, there is also a significant group that uses Python to analyze data, draw actionable insights, and make decisions.
With its extensive collection of helper libraries and platforms, Python is a great tool for quick, iterative data exploration. Python’s set of libraries includes everything from visualization to statistical analysis, making it convenient for its users to jump into the data and begin identifying patterns.
Together with the ability to iterate quickly in data and statistical analysis, there are great open source tools on managing data pipelines and workflows. A growing community of analysts are finding new ways of using Python to crunch numbers and understand their data.
We had a chance to catch up with the Chief Analyst at Mode, Benn Stancil, and ask him about the importance of Python, how to use it in day-to-day analysis, and to tell us about some key features of Mode’s new product, Mode Python Notebooks. Mode Python Notebooks is a hosted solution that allows analysts to use Python for exploratory analysis.
I’m interested in Python for the same reasons I like SQL: It gives me the power and flexibility to answer any question. The community is great and adoption is on the rise.
There are many easy to use Python libraries to make data exploration convenient and immediate. This allows for iterative data analysis. With Python, you can really chase your curiosities down the rabbit hole.
Lastly, Python’s utility and flexibility allows it to be used for a variety of tasks within the data science stack. For example, Luigi and Airflow both allow for managing data pipelines and workflows in Python. By completing exploratory analysis in Python, there can be times where the work carries over into production.
What are the most popular use cases for SQL as opposed to Python?
SQL is designed to query and extract data from a database. It’s a necessary first step to get the data into a usable format. For instance, SQL allows you to easily join several data sets to create a table that you can explore further.
SQL isn’t really designed for manipulating or transforming data in certain ways. Higher level data manipulation that is common with data science, such as statistical analysis, regressions, trend lines, and working with time series data, isn’t easy in SQL.
Despite these limitations, because SQL is necessary for extracting data, it’s still commonly used for complex operations. The query below, which calculates quantiles for different series in a data, is something I’ve used versions of many times.
When would Python come in?
Python has a ton of libraries (e.g. Pandas, StatsModel, and SciPy) that are designed for statistical and mathematical analysis. The libraries also do a great job of abstracting away the details so that you don’t need to calculate all the underlying math by hand. Moreover, you can get your results immediately, so you can use Python iteratively to explore your data.
Rather than saying “I want to do a regression analysis” and sitting down for half an hour figuring out where to begin in SQL, the Python libraries make it so that you can just run the analysis, see the results, and continue exploring the path your curiosity takes you down. With Python, there is not much lag between inspiration and action. With SQL, on the other hand, I often think twice before going down a path that may or may not be fruitful.
For example, I’d only write the query above if I really knew I wanted to present the quantiles of that dataset. Because the entire thing can be accomplished with the one line of Python below, I’d do it much earlier in my analytical process–and may discover something I wasn’t looking for as a result.
Another way to think about the difference between Python and SQL is that Python allows you to start with one large table, from which you branch off different analyses in different directions. One avenue of inspiration can bring you to another avenue and to another avenue. The speed and flexibility of analysis makes it easy to go down many exploratory paths.
Those kinds of analyses sound very different. Why combine SQL and Python in one place?
Because SQL and Python each have individual strengths and weaknesses. Tying the languages together gives analysts the best of both worlds.
First, SQL is needed to build the data set into a final table that has all of the necessary attributes. Then, from this large data set, you can use Python to spin off deeper analysis.
What would it take for a SQL analyst to learn Python?
Like many skills, the best way to learn how to use Python for analysis is by diving in to work on a problem you’re interested in, care about, and with which you’re somewhat familiar.
When you work on something that you’re interested in, you tend to go deeper. You uncover something in the core problem that piques your interest, and you want to learn more by analyzing the data set in a different way. You begin asking more and more questions. This curiosity can push you further than you would go otherwise, and a is a source of a lot of real learning.
You also should work on data that you’re somewhat familiar with, so you know when you do something wrong. You’ll have better instincts about what’s going on and what to expect. Compare this to when you’re working on data about which you know nothing—like flower petal sizes (an unusually popular data set found in many Python examples). If your analysis concludes that “all of these flowers have two centimeter-long petals” and you have no idea whether that is reasonable, you may just assume it’s right and move on.
Mode is also releasing a new Python tutorial that aims to help SQL users learn how and where to integrate Python into their workflow. In addition, the Python tutorials provide problems that are familiar to those in business settings, instead of academic problems.
How would you expect learning Python to help benefit analysts in their job and their careers?
Learning Python definitely can augment an analyst’s skill set.
An analyst needs to communicate the business value through data. One part of the job is to find the insights from the data, but the more effective job is also to include the right context and narrative around the insights that can compel your teammates towards action. And since using data and analytics to make decisions is becoming more important in the workplace, the role of the analyst to deliver comprehensive analysis is more important than ever.
Is it easy to get started with Python? What kind of setup and tooling do you need?
Until recently, getting started with Python for analysis requires installing a few things—Python, several main statistics and data analysis libraries, and Notebooksoftware that’ll run the analysis locally on your computer. Then, you’d have to run the Notebook, starting a local server to execute your Python commands.
The results generated from your commands on Notebooks will exist on your desktop. In order to share it, you basically download a Python Notebook HTML file (which has a mix of code and results) and send that around. In order to easily parse the results from the HTML, your colleagues would have to open that file in a browser. And at that point, unless they have Python set up too, the code isn’t re-executable.
When you run Python locally, you’re limited to the power of your computer, you have to leave your computer open to run it, and running scripts can slow other things down. By running it remotely, you can run it from any machine and you can run something, close your computer and walk away, and still have your results waiting when you get back.
With Mode Python Notebooks, all of that setup and hosting is taken care of for you. If you’re using Mode to query an existing database, there’s a tab for a Python Notebook. You can open it up and the query results table will be automatically populated. And after generating plots, time series analysis, summary statistics, etc., in the Notebook, it’s easy to curate the results into an instantly shareable report anyone can re-run. Moreover, you don’t have to run it on your local machine, which saves your computer’s processing power and memory. Finally, setting it all up is as easy as setting up a database connection.
Thanks for your time, Benn!
If you’re interested in learning how to run data analysis in Python, check out Mode’s new Python tutorial. And, if you’re a Segment customer looking to run Python scripts on your event and cloud app data check out Mode Python Notebooks.
Calvin French-Owen on May 16th 2016
Since Segment’s first launch in 2012, we’ve used queues everywhere. Our API queues messages immediately. Our workers communicate by consuming from one queue and then publishing to another. It’s given us a ton of leeway when it comes to dealing with sudden batches of events or ensuring fault tolerance between services.
We first started out with RabbitMQ, and a single Rabbit instance handled all of our pubsub. Rabbit had lot of nice tooling around message delivery, but it turned out to be really tough to cluster and scale as we grew. It didn’t help that our client library was a bit of a mess, and frequently dropped messages (anytime you have to do a seven-way handshake for a protocol that’s not TLS… maybe re-think the tech you’re using).
So in January 2014, we started the search for a new queue. We evaluated a few different systems: Darner, Redis, Kestrel, and Kafka (more on that later). Each queue had different delivery guarantees, but none of them seemed both scalable and operationally simple. And that’s when NSQ entered the picture… and it’s worked like a charm.
As of today, we’ve pushed 750 billion messages through NSQ, and we’re adding around 150,000 more every second.
Before discussing how NSQ works in practice, it’s worth understanding how the queue is architected. The design is so simple, it can be understood with only a few core concepts:
topics - a topic is the logical key where a program publishes messages. Topics are created when programs first publish to them.
channels - channels group related consumers and load balance between them–channels are the “queues” in a sense. Every time a publisher sends a message to a topic, that message is copied into each channel that consumes from it. Consumers will read messages from a particular channel and actually create the channel on the first subscription. Channels will queue messages (first in memory, and spill over to disk) if no consumers are reading from them.
messages - messages form the backbone of our data flow. Consumers can choose to finish messages, indicating they were processed normally, or requeue them to be delivered later. Each message contains a count for the number of delivery attempts. Clients should discard messages which pass a certain threshold of deliveries or handle them out of band.
NSQ also runs two programs during operation:
nsqd - the nsqd daemon is the core part of NSQ. It’s a standalone binary that listens for incoming messages on a single port. Each nsqd node operates independently and doesn’t share any state. When a node boots up, it registers with a set of nsqlookupd nodes and broadcasts which topics and channels are stored on the node.
Clients can publish or read from the nsqd daemon. Typically publishers will publish to a single, local nsqd. Consumers read remotely from the connected set of nsqd nodes with that topic. If you don’t care about adding more nodes dynamically, you can run nsqds standalone.
nsqlookupd – the nsqlookupd servers work like consul or etcd, only without coordination or strong consistency (by design). Each one acts as an ephemeral datastore that individual nsqd nodes register to. Consumers connect to these nodes to determine which nsqd nodes to read from.
Let’s walk through a more concrete example of how this works in practice.
NSQ recommends co-locating publishers with their corresponding nsqd instances. That means even in the face of a network partition, messages are stored locally until they are read by a consumer. What’s more, publishers don’t need to discover other nsqd nodes–they can always publish to the local instance.
First, a publisher sends a message to its local nsqd. To do this, it first opens up a connection, and then sends a PUB command with the topic and message body. In this case, we publish our messages to the events topic to be fanned out to our different workers.
The events topic will copy the message and queue it in each of the channels linked to the topic. In our case, there are three channels, one of them being the archives channel. Consumers will take these messages and upload them to S3.
Messages in each of the channels will queue until a worker consumes them. If the queue goes over the in-memory limit, messages will be written to disk.
The nsqd nodes will first broadcast their location to the nsqlookupds. Once they are registered, workers will discover all the nsqd nodes with the events topic from the lookup servers.
Then each worker subscribes to each nsqd host, indicating that it’s ready to receive messages. We don’t need a fully connected graph here, but we do need to ensure that individual nsqd instances have enough consumers to drain their messages (or else the channels will queue up).
Separate from the client library, here’s an example of what our message handling code might look like:
If the third party fails for any reason, we can handle the failure. In this snippet, we have three behaviors:
discard the message if it’s passed a certain number of delivery attempts
finish the message if it’s been processed successfully
requeue the message to be delivered later if an error occurs
As you can see, the queue behavior is both simple and explicit.
In our example, it’s easy to reason that we’ll tolerate
MAX_DELIVERY_ATTEMPTS * BACKOFF_TIME minutes of failure from our integration before discarding messages.
At Segment, we keep statsd counters for message attempts, discards, requeues, and finishes to ensure that we’re achieving a good QoS. We’ll alert for services any time the number of discards exceeds the thresholds we’ve set.
In production, we run nsqd daemons on pretty much all of our instances, co-located with the publishers that write to them. There are a few reasons NSQ works so well in practice:
Simple protocol - this isn’t a huge issue if you already have a good client library for your queue. But, it can make a massive difference if your existing client libraries are buggy or outdated.
NSQ has a fast binary protocol that is easy to implement with just a few days of work. We built our own pure-JS node driver, (at the time, only a coffeescript driver existed) which has been stable and reliable.
Easy to run - NSQ doesn’t have complicated watermark settings or JVM-level configuration. Instead, you can configure the number of in-memory messages and the max message size. If a queue fills up past this point, the messages will spill over to disk.
Distributed - because NSQ doesn’t share information between individual daemons, it’s built for distributed operation from the beginning. Individual machines can go up and down without affecting the rest of the system. Publishers can publish locally, even in the face of network partitions.
This ‘distributed-first’ design means that NSQ can essentially scale forever. Need more throughput? Add more nsqds!
The only shared state is kept in the lookup nodes, and even those don’t require a “global view of the universe.” It’s trivial to set up configurations where certain nsqds register with certain lookups. The only key is that the consumers can query to get the complete set.
Clear failure cases - NSQ sets up a clear set of trade-offs around components which might fail, and what that means for both deliverability and recovery.
I’m a firm believer in the principle of least surprise, particularly when it comes to distributed systems. Systems fail, we get it. But systems that fail in unexpected ways are impossible to build upon. You end up ignoring most failure cases because you can’t even begin to account for them.
While they might not be as strict of guarantees as a system like Kafka can provide, the simplicity of operating NSQ makes the failure conditions exceedingly apparent.
UNIX-y tooling - NSQ is a good general purpose tool. So it’s not surprising that the included utilities are multi-purpose and composable.
In addition to the TCP protocol, NSQ provides an HTTP interface for simple cURL-able maintenance operations. It ships with binaries for piping from the CLI, tailing a queue, piping from one queue to another, and HTTP pubsub.
There’s even an admin dashboard for monitoring and pausing queues (including the sweet animated counter above!) that wraps the HTTP API.
As I mentioned, that simplicity doesn’t come without trade-offs:
No replication - unlike other queues, NSQ doesn’t provide any sort of replication or clustering. This is part of what makes running it so simple, but it does force some hard guarantees on reliability for published messages.
We partially get around this by lowering the file sync time (configurable via a flag) and backing our queues with EBS. But there’s still the possibility that a queue could indicate it’s published a message and then die immediately, effectively losing the write.
Basic message routing - with NSQ, topics and channels are all you get. There’s no concept of routing or affinity based upon on key. It’s something that we’d love to support for various use cases, whether it’s to filter individual messages, or route certain ones conditionally. Instead we end up building routing workers, which sit in-between queues and act as a smarter pass-through filter.
No strict ordering - though Kafka is structured as an ordered log, NSQ is not. Messages could come through at any time, in any order. For our use case, this is generally okay as all the data is timestamped, but doesn’t fit cases which require strict ordering.
No de-duplication - Aphyr has talked extensively in his posts about the dangers of timeout-based systems. NSQ also falls into this trap by using a heartbeat mechanism to detect whether consumers are alive or dead. We’ve previously written about various reasons that would cause our workers to fail heartbeat checks, so there has to be a separate step in the workers to ensure idempotency.
As you can see, the underlying motif behind all of these benefits is simplicity. NSQ is a simple queue, which means that it’s easy to reason about and easy to spot bugs. Consumers can handle their own failure cases with confidence about how the rest of the system will behave.
In fact, simplicity was one of the primary reasons we decided to adopt NSQ in the first place (along with many of our other software choices). The payoff has been a queueing layer that has performed flawlessly even as we’ve increased throughput by several orders of magnitude.
Today, we face a more complex future. More and more of our workers require a stricter set of reliability and ordering guarantees than NSQ can easily provide.
We plan to start swapping out NSQ for Kafka in those pieces of the infrastructure and get much better at running the JVM in production. There’s a definite trade-off with Kafka; we’ll have to shoulder a lot more operational complexity ourselves. On the other hand, having a replicated, ordered log would provide much better behavior for many of our services.
But for any workers which fit NSQ’s trade-offs, it’s served us amazingly well. We’re looking forward to continuing to build on its rock-solid foundation.
TJ Holowaychuk for creating nsq.js and helping drive its adoption at Segment.
Jehiah Czebotar and Matt Reiferson for building NSQ and reading drafts of this article.
Julian Gruber, Garrett Johnson, and Amir Abu Shareb for maintaining nsq.js.
Tejas Manohar, Vince Prignano, Steven Miller, Tido Carriero, Andy Jiang, Achille Roussel, Peter Reinhardt, Stephen Mathieson, Brent Summers, Nathan Houle, Garrett Johnson, and Amir Abu Shareb for giving feedback on this article.
Calvin French-Owen on March 15th 2016
I recently jumped back into frontend development for the first time in months, and I was immediately struck by one thing: everything had changed.
When I was more active in the frontend community, the changes seemed minor. We’d occasionally make switches in packaging (RequireJS → Browserify), or frameworks (Backbone → Components). And sometimes we’d take advantage of new Node/v8 features. But for the most part, the updates were all incremental.
Years ago, a friend and I discussed choosing the ‘right’ module system for his company. At the time, he was leaning towards going with RequireJS–and I urged him to look at Browserify or Component (having just abandoned RequireJS ourselves).
We talked again last night, and he said that he’d chosen RequireJS. By now, his company had built a massive codebase around it –- “I guess we bet on the wrong horse there.”
Most of the tools we use today didn’t even really exist a year ago: React, JSX, Flux, Redux, ES6, Babel, etc. Even setting up a ‘modern’ project requires installing a swath of dependencies and build tools that are all new. No other language does anything remotely resembling that kind of thing. It’s enough to even warrant a “State of the Art” post so everyone knows what to use.
As it turns out, there are a ton of interesting dynamics at play: corporate self-interest, cries for open standards, Mozilla pushing language development, lagging IE releases that dominate the market, and tools that pave over a lot of the fragmentation.
ECMAScript has a weird and fascinating history of its own, largely broken up into different ‘major’ editions.
apply, and various
ES2 was released one year later in June 1998, but the changes were purely cosmetic. They had been added to comply with a corresponding ISO version of the spec without changing the specification for the language itself, so it’s typically omitted from compatibility tables.
ES3, released in December 1999, marked the first major additions to the language. It added support for regexes,
.bind, and try/catch error handling. ES3 became the “standard” for what the majority of browsers would support, underscored by the popularity of older IE browsers. Even in February 2012 (13 years after the ES3 release!!), ES3 IE browsers still had over 20% of the browser market.
But just as the spec was nearing completion, trouble struck! On the one hand Microsoft’s IE platform architect Chris Wilson, argued that the changes would effectively “break the web”. He stated that the ES4 changes were so backwards incompatible that they would significantly hurt adoption.
On the other hand, Brendan Eich, now the Mozilla CTO, argued for the changes. In an open letter to Chris Wilson, he objected to the fact that Microsoft was just now withdrawing support for a spec which had been in the works for years.
After meeting in Oslo, Brendan Eich sent a message to es-discuss (still the primary point for language development) outlining a plan for a near-term incremental release, along with a bigger reform to the language spec that would be known as ‘harmony’.. ES5 was finalized and published in December 2009.
With ES5 finally out the door, the committee started moving full-speed ahead on the next set of language extensions which they’d hoped to start incoporating nearly a decade earlier.
By now, browser-makers and committee members alike had seen the danger of adding a giant release without testing the features. So drafts of ES6 have been published frequently since 2011, and available behind flags (
--harmony) in Node and many of the major browsers. Additionally transpilers had become a part of the modern build chain (more on that later), allowing developers to use the bleeding-edge of the spec, without worrying about breaking old browsers. It was eventually released in June 2015.
So we have:
ECMAScript versions driven by the planning committee to standardize the language
With experiments and language additions to each dialect pushing the ES standard ahead.
Of course, we can’t talk just about the full history of JS without mentioning the platform it’s running on: browsers.
But an interesting thing happened when Chrome first emerged onto the scene in December, 2008. While Chrome had numerous features that made it technically interesting (tabs as processes, a new fast js runtime, and others), perhaps the thing that made it most revolutionary was its release schedule.
Chrome shipped from day-1 with auto-updates, first tested from a widespread number of early adopters using the dev and beta channels. If you wanted to use the bleeding edge, you could easily do so. Nowadays, updates to the stable channel happen every six weeks, with thousands of users testing and automatically reporting bugs they encounter on the dev and beta channels.
It meant that Chrome was able to ship updates far more frequently than its competitors. Where the other browsers would tell users to update every 6-12 months, Chrome updated the three channels weekly, monthly, and released updates on the stable channel every six weeks.
Most of the features are still not “safe” to use and hidden behind flags, but the fact that browsers will continue to auto-update means that the velocity of new JS features as increased dramatically in the last few years.
Imagine now that you’re a developer just learning JS, and knowing none of that history. It’d be pretty much impossible to keep track of which features are supported in what browsers, what’s to spec and what’s not, and what parts of the language should even be used.
Realizing this was a problem, John Resig released the first version of JQuery in 2006. It was a drop-in library designed to pave over a lot of the inconsistencies between the different browsers. And for the most part, developers became comfortable with always bundling jQuery into their projects. It became the de facto library for DOM manipulation, AJAX requests, and more.
But working with many different scripts was tedious. They had to be loaded in a certain order, or else bundle duplicated dependencies many times over. Individual libraries would overwrite the global scope, causing conflicts and monkey-patching default implementations.
So James Burke created RequireJS in 2009, spawned out of his work with Dojo. He created a module loader designed to specify dependencies, and load them asynchonously in the browser. It was one of the first frameworks that actually introduced the idea of a “module system” for isolating pieces of code and loading them asynchronously–dubbed AMD (Asynchronous Module Definition).
While AMD was easy to load on the fly, it was also verbose. It essentially added dependency injection for every single library you’d want to load:
The simplicity had the benefit that the browser could load libraries on-the-fly, but it was a lot of overhead for the programmer to really understand.
Since Node ran on the server, making requests for additional scripts was cheap. The scripts were cached, so it was as easy as reading and parsing an additional file.
Instead of using AMD modules, Node popularized the CommonJS format (the competing module format at the time), using synchronous
require statements for loading dependencies. It mirrored the same way that other languages worked, grabbing dependencies synchronously and then caching them. Isaac Schuetler built
npm around it as the de facto way to manage and install dependencies in Node. Soon, everyone was writing node scripts that looked like this:
Yet the frontend still lagged. Developers wanted to
So a new set of tools appeared on the frontend developer’s toolchain. Bower for pure dependency management, Browserify and Component/Duo for actually building scripts and bundling them together. Webpack promised a new build system that also handled CSS, and Grunt and Gulp for actually orchestrating them all and making them play nicely together.
After all, there are hundreds of programming languages that can run on the server, but there’s only one widely-supported language for the browser.
But it didn’t end there.
I didn’t even go into changes in frameworks (Ember, Backbone, Angular, React) or how we structure asynchronous programming (Callbacks, Promises, Iterators, Async/Await). But those have all churned relatively regularly over the years.
The most interesting thing about having a language where everyone is comfortable with build tools, transpilers, and syntax additions is that the language can advance at an astonishing rate. Really the only thing holding back new language features is consensus. As quickly as people agree and implement the spec, there’s code to support it.
I predict we’ll continue to see the cycle of rapid iteration for the next few years at least. Companies will be forced to update their codebase, or be happy with the horse they have.
Calvin French-Owen on December 15th 2015
At Segment, we’ve fully embraced the idea of microservices; but not for the reasons you might think.
The microservices vs. monoliths debate has been pretty thoroughly discussed, so I won’t completely re-hash it here. Microservices proponents say that they provide better scalability and are the best way to split responsibilty across software engineering teams. While the pro-monolith group say that microservices are too operationally complex to begin with.
But a major benefit of running microservices is largely absent from today’s discussions: visibility.
When we’re getting paged at 3am on a Tuesday, it’s a million times easier to see that a given worker is backing up compared to adding tracing through every single function call of a monolithic app.
That’s not to say you can’t get good visiblity from more tightly coupled code, it’s just rarer to have all the right visibility from day one.
Where does that visibility come from? Consider for a moment the standard tools that are part of our ops arsenal:
None of them monitor individual program execution: hot codepaths, stack size, etc. The battle-tested tools we’ve built over the past 20 years are all built around the concepts of hosts, processes, or drives.
With a distributed system, we can add in requests and network throughput to our metrics, but most tools still tend to aggregate at a host or service level.
The process-centric nature of monitoring tools makes it really difficult to get a sense of where a program is actually spending time. With a monolithic app, our best options to debug are either to run the program against a profiler or to implement our own timing metrics.
Now that’s kind of crazy when you think about it. Most of the reason flamegraphs are so useful is that we don’t have that detailed amount of monitoring at the level of individual function calls.
So instead of trying to shoe-horn lots of functionality into monoliths, at Segment we’ve doubled down on microservices. We’re betting that container scheduling and orchestration will continue to get easier and more powerful, while most metrics and monitoring will continue to be dominated by the idea of ‘hosts’ and ‘services’.
The caveat here is that microservices only work so long as it’s actually easy to create new services. Otherwise we’ve just traded a visibility problem for a provisioning problem.
In other posts, we’ve talked a little bit about what our services look like, and how we build them with terraform. And now, we’ve started splitting each service into modules, so we can re-use the exact configuration between stage and prod.
Here’s an example of a simple auth service, using terraform as our configuration to set up all of our resources:
For the curious, you can check out an example of the full module definition.
As long as there’s a singificant benefit (free metrics) and low cost (10-line terraform script), we remove the temptation to tack on different functionality into an existing service.
And so far, that approach has been working quite well.
Segment is a bit unusual–instead of microservices which are coordinating together, we have a lot of what I’d call “microworkers.” Fundamentally, it’s the same concept, but the worker doesn’t serve requests to clients. Instead, the typical Segment worker reads some data from a queue, does some processing on it, and then acks the message.
These workers end up being a lot simpler than services because there are no dependencies. There’s no coupling or worrying that a given problem with one worker will compound and disrupt the rest of a system. If a service is acting up, there’s just a single queue that ends up backing up. And we can scale additional workers to handle the load.
There are a few forces at work which make tiny workers the right call for us. But the biggest comes from our team size and relative complexity of what we’re trying to build.
Microservices are usually touted when the team grows to a size where there are too many people working on the same codebase. At that point, it makes sense to shard ownership of the codebase by team. But we’ve seen it be equally helpful with a small team as well.
Most folks I talk with are surprised at how small our engineering team is. To give you a rough sense of our scale:
400 private repos
70 different services (workers)
We’re in the postion of having a large product scope and a small engineering team. So if I’m currently on-call and get paged, it could be for code that I wrote 6-months ago and haven’t touched since.
And that’s the place where tiny, well-defined, services shine.
Here’s the typical scenario: first there’s an alert which gets triggered because a particular queue depth is backing up.
We can verify this is really the case (and isn’t getting better) by checking the queue depth in our monitoring tools.
At that point we know exactly which worker is backing up (since each worker subscribes to a single queue), and which logs to look at. Each service logs with its own tags, so we don’t have to worry about unrelated logs interleaving within a single app for multiple requests.
We can look at Datadog for a single dashboard containing that worker’s CPU, Memory, and the responses and latency coming from it’s ELB. Once we’ve identified the problem, it’s a question of reading through 50-100 line file to isolate exactly where the problem is happening (let’s play spot the memory leak!).
With a monolith, we could add individual monitoring specifically for each endpoint. But why bother when we get it for free by running code as part of its own process?
Not to mention the fact that we also get isolated CPU, memory, and latency (if the service sits behind an ELB) out of the box. It’s infinitely easier to track down a memory leak in a hundred-line worker with a single codepath than it is in a monolithic app with hundreds of endpoints.
I understand this approach won’t work for everyone. And it requires a pretty significant investment in up-front tooling to make sure that creating a new service from scratch has everything it needs. Depending on your team, workload, and product scope, it might not make sense.
But for any product operating with a high level of operational complexity and load, I’d choose the microservice architecture every time. It’s made our infrastructure flexible, scalable, and far easier to monitor–without sacrificing developer productivity.
Calvin French-Owen on November 20th 2015
Every month, Segment collects, transforms and routes over 50 billion API calls to hundreds of different business-critical applications. We’ve come a long way from the early days, where my co-founders and I were running just a handful of instances.
Today, we have a much deeper understanding of the problems we’re solving, and we’ve learned a ton. To keep moving quickly and avoid past mistakes, our team has started developing a list of engineering best practices.
Now that a lot of these “pro tips” have been tested, deployed and are currently in production… we wanted to share them with you. It’s worth noting that we’re standing on the shoulders of giants here, to The Zen of Python, Hints for Computer System Design, and the Twelve-Factor App for the inspiration.
Editor’s Note: This post was based off an internal wiki page for Segment “Pro Tips”. There are more tips recorded there, but we chose a handful that seemed most broadly applicable. They’re written as fact, but internally we treat them as guidelines, always weighing other trade-offs within the organization. Each practice is followed by a few bullet-points underscoring the main takeaways.
When we first started out, we had one massive repo. Every module was filled with tightly coupled dependencies and was completely unversioned. Changing a single API required changing code globally. Developing with more than a handful of people would’ve been a nightmare.
So one of our first changes as the engineering team grew was splitting out the modules into separate repos (thanks TJ!). It was a massive task but it had huge payoff by making development with a larger team actually sane. Unfortunately, it was way harder than it should have been because we lumped everything together at the start.
It turns out this temptation to combine happens everywhere: in services, libraries, repos and tools. It’s so easy to add (just) one more feature to an existing codebase. But it has a long-term cost. Separation of concerns is the exact reason why UNIX-style systems are so successful; they give you the tools to compose many small building blocks into more complex programs.
structure code so that it’s easy to be split (or split from the beginning)
if a service or library doesn’t share concerns with existing ones, create a new one rather than shoe-horning it into an existing piece of code
testing and documenting libraries which perform a single function is much easier to understand
keep uptime, resource consumption and monitoring in mind when combining read/write concerns of a service
prefer libraries to frameworks, composing them together where possible
“Clever” code usually means “complicated” code. It’s hard to search for, and tough to track down where bugs are happening. We prefer simple code that’s explicit in it’s purpose rather than trying to create a magical API that relies on convention (go’s lack of “magic” is actually one of our favorite things about it).
As part of being explicit, always consider the “grep-ability” of your code. Imagine that you’re trying to find out where the implementation for the
post method lives, which is easier to find in a codebase?
Where possible, write code that is short, straightforward and easy to understand. Often that will come down to single functions that are easy to test and easy to document. Even libraries can perform just a single function and then be combined for more powerful functionality.
With comments, describe the “why” versus the typical “what” for a given process or routine. If a routine seems out of place but is necessary, it’s sometimes worth leaving a quick note as to why it exists at all.
avoid generating code dynamically or being overly ‘clever’ to shorten the line count
aim for functions that are <7 lines and <2 nested callbacks
Running code in production without metrics or alerting is flying blind. This has bitten our team more times than I’d care to admit, so we’ve increased our test coverage and monitoring extensively. Every time a user encounters a bug before we do, it damages their trust in us as a company. And that sucks.
Trust in our product is perhaps most valuable asset we have as a company. Losing that is almost completely irrecoverable; it’s the way we lose as a business. Our brand is built around data, and reliability is paramount to our success.
write test cases first to check for the broken behavior, then write the fix
all top-level apps should ship with metrics and monitoring
create ‘warning’ alerts for when an internal system is acting up, ‘critical’ ones when it starts affecting end customers
try to keep unrealistic failure scenarios in mind when designing the alerts
When building a product, there are three aspects you can optimize: Speed, quality, and scope. The catch… is that you can’t ever juggle all three simultaneously. Sacrificing quality by adding hacky fixes increases the amount of technical debt. It slows us down over the long-term, and we risk losing customer trust in the product. Not to mention, hacks are a giant pain to work on later.
At the same time, we can’t sacrifice speed either–that’s our main advantage as a startup. Long-running projects tend to drag on, use up a ton of resources and have no clearly defined “end.” By the time a monolithic project is finally ready to launch, releasing the finished product to customers becomes a daunting process.
When push comes to shove, it’s usually best to cut scope. It allows us to split shipments into smaller, more manageable chunks, and really focus on making each one great.
evaluate features for their benefit versus their effort
identify features that could be easily layered in later
cut features that create obvious technical debt
Separate code paths almost always become out of sync. One will get updated while another doesn’t, which makes for inconsistent behavior. At the architecture level, we want to try and optimize for a single code path.
Note that this is still consistent with splitting things apart, it just means that we need smaller pieces which are flexible enough to be combined together in different ways. If two pieces of code rely on the same functionality, they should use the same code path.
have a peer review your code; an objective opinion will almost always help
get someone else to sign-off on non-trivial pull-requests
if you ever find yourself copy-pasting code, consider pulling it into a library
if you need to frequently update a library, or keep state around, turn it into a service
Creating a loose mockup of a program is often the quickest way to understand the problem you’re solving. When you’re ready to write the real thing just
`rm -fr .git ` to start with a clean slate and better context.
Building something helps you learn more than you could ever hope to uncover through theorizing. Trust me, prototyping helps discover strange edge-cases and bottlenecks which may require you to rearchitect the solution. This process minimizes the impact of architectural changes.
don’t spend a lot of time with commit messages, keep them short but sensical
refactors typically come from a better understanding of the problem, the best way to get there is by building a version to “throw away”
Early on, it’s easy to write off automation as unimportant. But if you’ve done any time-consuming task more than 3 times you’ll probably want to automate it.
A key example of where we failed at this in the past was with Redshift’s cluster management. Investing in the tooling around provisioning clusters was a big endeavor, but it would have saved a ton of time if we’d started it sooner.
if you find yourself repeatedly spending more than a few minutes on a task, take a step back and consider tooling around it
ask yourself if you could be 20% more efficient, or if automation would help
share tools in dotfiles, vm, or task runner so the whole team can use them
Whenever you’re building out a new project or library, it’s worth considering which pieces can be pulled out and open sourced. At face value, it sounds like an extra constraint that doesn’t help ship product. But in practice, it actually creates much cleaner code. We’re guaranteed that the code’s API isn’t tightly coupled to anything we’re building internally, and that it’s more easily re-used across projects.
Open sourced code typically has a well-documented Readme, tests, CI, and more closely resembles the rest of the ecosystem. It’s a good sanity check that we’re not doing anything too weird internally, and the code is easier to forget about and re-visit 6-months later.
if you build a general purpose library without any dependencies, it’s a prime target for open sourcing
try and de-couple code so that it can be used standalone with a clear interface
never include custom configuration in a library, allow it to be passed in with sane defaults
Sometimes big problems arise in code and it may seem easier to write a work-around. Don’t do that. Hacking around the outskirts of a problem is only going to create a rat’s nest that will become an even bigger problem in the future. Tackle the root cause head-on.
A textbook example of this came from the first version of our integrations product. We proxied and transformed analytics calls through our servers to 30–40 different services, depending on what integrations the customer had enabled. On the backend, we had a single pool of integration workers that would read each incoming event from the queue, look up which settings were enabled, and then send copies of the event each enabled integration.
It worked great for the first year, but over time we started running into more and more problems. Because the workers were all shared, a single slow endpoint would grind the entire pool of workers to a halt. We kept adjusting and tweaking individual timeouts to no end, but the backlogs kept occurring. Since then, we’ve fixed the underlying issue by partitioning the data processing queues by endpoint so they operate completely independently. It was a large project, but one that had immediate pay-off, allowing us to scale our integrations platform.
Sometimes it’s worth taking a step back to solve the root cause or upstream problem rather than hacking around the periphery. Even if it requires a more significant restructuring, it can save you a lot of time and headache down the road, allowing you to achieve much greater scale.
whenever fixing a bug or infrastructure issue, ask yourself whether it’s a core fix or just a band-aid over one of the symptoms
keep tabs on where you’re spending the most time, if code is continually being tweaked, it probably needs a bigger overhaul
if there’s some bug or alert we didn’t catch, make sure the upstream cause is being monitored
When designing applications, coming up with a data model is one of the trickiest parts of implementation. The frontend, naturally, wants to match the user’s idea of how the data is formatted. Out of necessity, the backend has to match the actual data format. It must be stored in a way that is fast, performant and flexible.
So when starting with a new design, it’s best to first look at the user requirements and ask “which goals do we want to meet?” Then, look at the data we already have (or decide what new data you need) and figure out how it should be combined.
The frontend models should match the user’s idea of the data. We don’t want to have to change the data model every time we change the UI. It should remain the same, regardless of how the interface changes.
The service and backend models should allow for a flexible API from the programmer’s perspective, in a way that’s fast and efficient. It should be easy to combine individual services to build bigger pieces of functionality.
The controllers are the translation layer, tying together individual services into a format which makes sense to the frontend code. If there’s a piece of complicated logic which makes sense to be re-used, then it should be split into it’s own service.
the frontend models should match the user’s conception of the data
the services need to map to a data model that is performant and flexible
controllers can map between services and the frontend to assemble data
It’s easy to talk at length about best practices but actually following them requires discipline. Sometimes it’s tempting to cut corners or skip a step; but that doesn’t help long-term.
Now that we’ve codified these engineering best practices and the rationale behind each one, they have made their way into our default mode of operation. The act of explicitly writing them down has both clarified our thinking and helped us avoid making the same short-term mistakes over and over.
In practice, this means that we invest heavily in good tooling, modular libraries and microservices. In development, we keep a shared VM that auto-updates, with shared dotfiles for easily navigating our many small repositories. We put a focus on creating projects which increase functionality through composability rather than inheritance. And we’ve worked hard to streamline our process for running services in production.
All of this keeps our development team moving quickly and increases the quality of the product we ship. We’re able to accomplish a lot more with a lot less effort. And we’ll continue trying to improve and share that tooling with the community as it matures.
Andy Jiang, Vince Prignano on November 17th 2015
Growing a business is hard and growing the engineering team to support that is arguably harder, but doing both of those without a stable infrastructure is basically impossible. Particularly for high growth businesses, where every engineer must be empowered to write, test, and ship code with a high degree of autonomy.
Over the past year, we’ve added ~60 new integrations (to over 160), built a platform for partners to write their own integrations, released a Redshift integration, and have a few big product announcements on the way. And in that time, we’ve had many growing pains around managing multiple environments, deploying code, and general development workflows. Since our engineers are happiest and most productive when their time is spent shipping product, building tooling, and scaling services, it’s paramount that the development workflow and its supporting infrastructure are simple to use and flexible.
And that’s why we’ve automated many facets of our infrastructure. We’ll share our current setup in greater detail below, covering these main areas:
Let’s dive in!
As the code complexity and the engineering team grow, it can become harder to keep dev environments consistent across all engineers.
Before our current solution, one big problem our engineering team faced was keeping all dev environments in sync. We had a GitHub repo with a set of shell scripts that all new engineers executed to install the necessary tools and authentication tokens onto their local machines. These scripts would also setup Vagrant and a VM.
But this VM was built locally on each computer. If you modified the state of your VM, then in order to get it back to the same VM as the other engineers, you’d have to build everything again from scratch. And when one engineer updates the VM, you have to tell everyone on Slack to pull changes from our GitHub VM repo and rebuild. An awfully painful process, since Vagrant can be slow.
Not a great solution for a growing team that is trying to move fast.
When we first played with Docker, we liked the ability to run code in a reproducible and isolated environment. We wanted to reuse these Docker principles and experience in maintaining consistent dev environments across a growing engineering team.
We wrote a bunch of tools to set up the VM for new engineers to upgrade or to reset from the basic image state. When our engineers set up the VM for the first time, it asks for their GitHub credentials and AWS tokens, then pulls and builds from the latest image in Docker Hub.
On each run, we make sure that the VM is up-to-date by querying the Docker Hub API. This process updates packages, tools, etc. that our engineers use everyday. It takes around 5 seconds and is needed in order to make sure that everything is running correctly for the user.
Additionally, since our engineers use Macs, we switched from boot2dockervirtualbox machine to a Vagrant hosted boot2docker instance so that we could take advantage of NFS to share the volumes between the host and guest. Using NFS provides massive performance gains during local development. Lastly, NFS allows any changes our engineers make outside of the VM to be instantaneously reflected within the VM.
With this solution we have vastly reduced the number of dependencies needed to be installed on the host machine. The only things needed now are Docker, Docker Compose, Go, and a
The ideal situation is dev and prod environments running the same code, yet separated so code running on dev may never affect code running production.
Before we had the AWS state (generated by Terraform) stored alongside the Terraform files, but this wasn’t a perfect system. For example if two people asynchronously plan and apply different changes, the state will be modified and who pushes last is going to have hard times to figure out the merge collisions.
We achieved mirroring staging and production in the simplest way possible: copying files from one folder to another. Terraform enabled us to reduce the amount of hours taken to modify the infrastructure, deploy new services and making improvements.
We integrated Terraform with CircleCI writing a custom build process and ensuring that the right amount of security was taken in consideration before applying.
At the moment, we have one single repository hosted on GitHub named
infrastructure, which contains a collection of Terraform scripts that configure environmental variables and settings for each of our containers.
When we want to change something in our infrastructure, we make the necessary changes to the Terraform scripts and run them before opening a new pull request for someone else on the infra-team to review it. Once the pull request gets merged to master, CircleCI will start the deployment process: the state gets pulled, modified locally, and stored again on S3.
When developing locally, it’s important to populate local data stores with dummy data, so our app looks more realistic. As such, seeding databases is a common part of setting up the dev environment.
We rely on CircleCI, Docker, and volume containers to provide easy access to dummy data. Volume containers are portable images of static data. We decided to use volume containers because the data model and logic becomes less coupled and easier to maintain. Also just in case this data is useful in other places in our infrastructure (testing, etc., who knows).
Loading seed data into our local dev environment occurs automatically when we start the app server in development. For example, when the
app (our main application) container is started in a dev environment,
app‘s docker-compose.yml script will pull the latest
seed image from Docker Hub and mount the raw data in the VM.
seed image from Docker Hub is created from a GitHub repo
seed, that is just a collection of JSON files as the raw objects we import into our databases. To update the seed data, we have CircleCI setup on the repo so that any publishes to master will build (grabbing our mongodb and redis containers from Docker Hub) and publish a new
seed image to Docker Hub, which we can use in the app.
Due to the data-heavy nature of Segment, our app already relies on several microservices (db service, redis, nsq, etc). In order for our engineers to work on the app, we need to have an easy way to create these services locally.
Again, Docker makes this workflow extremely easy.
Similar to how we use
seed volume containers to mount data into the VM locally, we do the same with microservices. We use the docker compose file to grab images from Docker Hub to create locally, set addresses and aliases, and ultimately reduce the complexity to a single terminal command to get everything up and running.
If you write code, but never ship it to production, did it ever really happen? 😃
Deploying code to production is an integral part of the development workflow. At Segment, we prioritize easiness and flexibility around shipping code to production, since that encourages our engineers to move quickly and be productive. We’ve also created adequate tooling around safeguarding for errors, rolling back, and monitoring build statuses.
We use Docker, ECS, CircleCI, and Terraform to automate as much of the continuous deployment process as possible.
Whenever code is pushed or merged into its
master branch, the CircleCI script build the container and push it to Docker Hub.
With this setup, we can define the configuration once for any service, making it extremely easy for our engineers to create and deploy new microservices. As Calvin mentioned in a previous post, “Rebuilding Our Infrastructure with Docker, ECS, and Terraform”:
We no longer have a complex set of provisioning scripts or AMIs—we just hand the production cluster an image, and it runs. There’s no more stateful instances, and we’re guaranteed to run the same exact code on both staging and prod.
The automation and ease of use around deployment have positively impacted more than just our engineers. Our success and marketing teams can update markdown files in a handful of repos that, when merged to master, kick off an auto deploy process so that changes can be live in minutes.
Because we chose to invest effort into rethinking and automating our dev workflow and its supporting infrastructure, our engineering team move fasters and more confidently. We spend more time doing high leverage jobs that we love—shipping product and building internal tools—and less time yak shaving.
That said, this is by no means the final iteration of our infrastructure automation. We are constantly playing with new tools and testing new ideas, seeing what further efficiencies we can eek out.
This has been a tremendous learning process for us and we’d love to hear what others in the community have done with their dev workflows. If you end up implementing something like this (or have already), let us know! We’d love to hear what you’ve done, and what’s worked or hasn’t for others with similar problems.
Andy Jiang on October 20th 2015
A little while ago we open-sourced a static site generator called Metalsmith. We built Metalsmith to be flexible enough that it could build blogs (like the one you’re reading now), knowledge bases, and most importantly our technical documentation.
Using Metalsmith to build a simple blog is one thing, but building easy-to-maintain documentation isn’t as simple. There are different sections, libraries, various reusable code snippets and content that live in multiple places, and other stuff to consider. Metalsmith simplifies maintaining all of these moving parts and let’s us focus purely on creating helpful content. Here’s how we did it!
The first thing to know is that Metalsmith works by running a series of transformations on a directory of files. In the case of our docs, that directory is just a bunch of folders with Markdown files in them.
The directory structure mimics the URL structure:
And an individual Markdown file might look like this:
It’s structured this way because it makes the actual content easy to maintain. They are just regular folders with plain old Markdown files in them. That means that anyone on the team can easily edit the docs without any technical knowledge. You can even do it right form the GitHub editor:
So far so good. But how do just those simple Markdown files get transformed into our entire technical documentation? That’s where Metalsmith comes in. Like I mentioned earlier, Metalsmith is really just a series of transformations to run on the directory of Markdown files.
I’ll walk you through each of the transformations we use in order.
The first transformation we use is a custom plugin that takes all the files in a directory and exposes them as partials in Handlebars. That means we can keep content that we repeat a lot in a single place for easier maintenance.
The next transformation is a plugin that groups files together into “collections”. In our case, those collections are built into our sub-directories, so we have collections like: Libraries, Plugins, Tutorials, etc.
We also pass our own custom
sorter function that will return the order specified in the array and append the remaining files pseudo-alphabetized.
Having all of the collections grouped as simple arrays makes it easy for us to do things like automatically generate a top-level navigation to get to every collection:
Or to automatically generate a collection-level navigation for navigating between pages:
The plugin categorizes all files that fit the provided definition (in our case, providing file path patterns), adds a
collection array to each file that contains the name of the collection, and finally adds a
previous properties to files that points to the sibling file in the collection. This allows us to easily render collections later on with handlebars:
The key is that all of those pieces are automatically generated, so you never need to worry about remembering to link between pages.
Note that this plugin does not determine the final directory structure. By default, the directory structure is preserved from start to end, unless a plugin specifically modifies this.
The third transformation step is to template all of our Markdown files in placewith metalsmith-in-place. By that, I mean that we just run Handlebars over our Markdown files right where they are, so that we can take advantage of a bunch of helpers we’ve added.
For example, in any of our Markdown files we can use an
api-example helper like so:
Which, will render a language-agnostic code snippet that remembers the user’s language preference:
You can find the above code snippet here.
Then, we transform all of those Markdown files into HTML files with the metalsmith-markdown plugin. Pretty self-explanatory!
Now that we have all of our files as
.html instead of
.md, the next transformation is pretty simple using metalsmith-headings. It iterates over all of the files once more, extracting the text of all the
<h2> tags and adding that array as metadata of the file. So you might end up with a file object that looks like this:
Why would we want to do that? Because it means we can build the navigation in the sidebar automatically from the content of the file itself:
So you never need to worry about remembering to update the navigation yourself.
The next step is to use the permalinks plugin to transform files so that all of the content lives in
index.html files, so they can be served statically. For example, given a source directory like this:
The permalinks plugin would transform that into:
So that NGINX can serve those static files as:
The last step is to template all of our source files again (they’re not
.md anymore, they’re all
.html at this point) by rendering them into our top-level layout.
layout.html file is where all of the navigation rendering logic is contained, and we just dump the contents of each of the pages that started as Markdown into the global template, like so:
Once that’s done, we’re done! All of those files that started their life as simple Markdown have been run through a bunch of transformations. They now live as a bunch of static HTML files that each have automatically-generated navigations and sidebars (with active states too).
The last step is to deploy our documentation. This step isn’t to be forgotten, because our goal was to make our docs so simple to edit that everyone on the team can apply fixes as customers report problems.
To make our team as efficient as possible about shipping fixes and updates to our docs, we have our repo setup so that any branch merged to
master will kick off CircleCI to build and publish to production. Anyone can then make edits in a separate branch, submit a PR, then merge to
master, which will then automatically deploy the changes.
For the vast majority of text-only updates, this is perfect. Though, occasionally we may need more complex things.
For more information on the tech we use for our backend, check out Rebuilding Our Infrastructure with Docker, ECS, and Terraform.
Before we converted our docs to Metalsmith, they lived in a bunch of Jade files that were a pain in the butt to change. Because we had little incentive to edit them, we let typos run rampant and waited too long to fix misinformation. Obviously this was a bad situation for our customers.
Now that our docs are easy to edit in Markdown and quick to deploy, we fix problems much faster. The quickest way to fix docs issues is to make a permanent change, rather than repeat ourselves in ticket after ticket. With a simpler process, we’re able to serve our customers much better, and we hope you can too!
Calvin French-Owen on October 7th 2015
In Segment’s early days, our infrastructure was pretty hacked together. We provisioned instances through the AWS UI, had a graveyard of unused AMIs, and configuration was implemented three different ways.
As the business started taking off, we grew the size of the eng team and the complexity of our architecture. But working with production was still limited to a handful of folks who knew the arcane gotchas. We’d been improving the process incrementally, but we needed to give our infrastructure a deeper overhaul to keep moving quickly.
So a few months ago, we sat down and asked ourselves: “What would an infrastructure setup look like if we designed it today?”
Over the course of 10 weeks, we completely re-worked our infrastructure. We retired nearly every single instance and old config, moved our services to run in Docker containers, and switched over to use fresh AWS accounts.
We spent a lot of time thinking about how we could make a production setup that’s auditable, simple, and easy to use–while still allowing for the flexibility to scale and grow.
Here’s our solution.
Instead of using regions or tags to separate different staging and prod instances, we switched over totally separate AWS accounts. We need to ensure that our provisioning scripts wouldn’t affect our currently running services, and using fresh accounts meant that we had a blank slate to start with.
ops account serves as the jump point and centralized login. Everyone in the organization can have a IAM account for it.
The other environments have a set of IAM roles to switch between them. It means there’s only ever one login point for our admin accounts, and a single place to restrict access.
As an example, Alice might have access to all three environments, but Bob can only access dev (ever since he deleted the production load balancer). But they both enter through the
Instead of having complex IAM settings to restrict access, we can easily lock down users by environment and group them by role. Using each account from the interface is as simple as switching the currently active role.
Instead of worrying that a staging box might be unsecured or alter a production database, we get true isolation for free. No extra configuration required.
There’s the additional benefit of being able to share configuration code so that our staging environment actually mirrors prod. The only difference in configuration are the sizes of the instances and the number of containers.
Finally, we’ve also enabled consolidated billing across the accounts. We pay our monthly bill with the same invoicing and see a detailed breakdown of the costs split by environment.
As of today, we’re now running the majority of our services inside Docker containers, including our API and data pipeline. The containers receive thousands of requests per second and process 50 billion events every month.
The biggest single benefit of Docker is the extent that it’s empowered the team to build services from scratch. We no longer have a complex set of provisioning scripts or AMIs—we just hand the production cluster an image, and it runs. There’s no more stateful instances, and we’re guaranteed to run the same exact code on both staging and prod.
After configuring our services to run in containers, we chose ECS as the scheduler.
At a high level, ECS is responsible for actually running our containers in production. It takes care of scheduling services, placing them on separate hosts, and zero-downtime reloads when attached to an ELB. It can even schedule across AZs for better availability. If a container dies, ECS will make sure it’s re-scheduled on a new instance within that cluster.
The switch to ECS has vastly simplified running a service without needing to worry about upstart jobs or provisioning instances. It’s as easy as adding a Dockerfile, setting up the task definition, and associating it with a cluster.
In our setup, the Docker images are built by CI, and then pushed to Docker Hub. When a service boots up, it pulls the image from Docker Hub, and then ECS schedules it across machines.
We group our service clusters by their concern and load profile (e.g. different clusters for API, CDN, App, etc). Having separate clusters means that we get better visibility and can decide to use different instance types for each (since ECS has no concept of instance affinity).
Each service has a particular task definition indicating which version of the container to run, how many instances to run on, and which cluster to choose.
During operation, the service registers itself with an ELB and uses a healthcheck to confirm that the container is actually ready to go. We point a local Route53 entry at the ELB, so that services can talk to each other and simply reference via DNS.
The setup is nice because we don’t need any service discovery. The local DNS does all the bookkeeping for us.
ECS runs all the services and we get free cloudwatch metrics from the ELBs. It’s been a lot simpler than having to register services with a centralized authority at boot-time. And the best part is that we don’t have to deal with state conflicts ourselves.
Where Docker and ECS describe how to run each of our services, Terraform is the glue that holds them together. At a high level, it’s a set of provisioning scripts that create and update our infrastructure. You can think of it like a version of Cloudformation on steroids–but it doesn’t make you want to poke your eyes out.
Rather than running a set of servers for maintaining state, there’s just a set of scripts that describe the cluster. Configuration is run locally (and in the future, via CI) and committed to git, so we have a continuous record of what our production infrastructure actually looks like.
Here’s an sample of our Terraform module for setting up our bastion nodes. It creates all the security groups, instances, and AMIs, so that we’re able to easily set up new jump points for future environments.
We use the same module in both stage and prod to set up our individual bastions. The only thing we need to switch out are the IAM keys, and we’re ready to go.
Making changes is also painless. Instead of always tearing down the entire infrastructure, Terraform will make updates where it can.
When we wanted to change our ELB draining timeout to 60 seconds, it took a simple find/replace followed by a
terraform apply. And voilà, two minutes later we had a fully altered production setup for all of our ELBs.
It’s reproduceable, auditable, and self-documenting. No black boxes here.
We’ve put all the config in a central
infrastructure repo, so it’s easy to discover how a given service is setup.
We haven’t quite reached the holy grail yet though. We’d like to convert more of our Terraform config to take advantage of modules so that individual files can be combined and reduce the amount of shared boilerplate.
Along the way we found a few gotchas around the
.tfstate, since Terraform always first reads from the existing infrastructure and complains if the state gets out of sync. We ended up just committing our
.tfstate to the repo, and pushing it after making any changes, but we’re looking into Atlas or applying via CI to solve that problem.
By this point, we had our infrastructure, our provisioning, and our isolation. The last things left were metrics and monitoring to keep track of everything running in production.
In our new environment, we’ve switched all of our metrics and monitoring over to Datadog, and it’s been fantastic.
We’ve been incredibly happy with Datadog’s UI, API, and complete integration with AWS, but getting the most out of the tool comes from a few key pieces of setup.
The first thing we did was integrate with AWS and Cloudtrail. It gives a 10,000 foot view of what’s going on in each of our environments. Since we’re integrating with ECS, the Datadog feed updates everytime a task definition updates, so we end up getting notifications for deploys for free. Searching the feed is surprisingly snappy, and makes it easy to trace down the last time a service was deployed or rescheduled.
Next, we made sure to add the Datadog-agent as a container to our base AMI (datadog/docker-dd-agent). It not only gathers metrics from the host (CPU, Memory, etc) but also acts as a sink for our statsd metrics. Each of our services collects custom metrics on queries, latencies, and errors so that we can explore and alert on the in Datadog. Our go toolkit (soon to be open sourced) automatically collects the output of
pprof on a ticker and sends it as well, so we can monitor memory and goroutines.
What’s even cooler is that the agent can visualize instance utilization across hosts in the environment, so we can get a high level overview of instances or clusters which might be having issues:
Additionally, my teammate Vince created a Terraform provider for Datadog, so we can completely script our alerting against the actual production configuration. Our alerts will be recorded and stay in sync with what’s running in prod.
By convention, we specify two alert levels:
warning is there to let anyone currently online know that something looks suspicious and should be triggered well in advance of any potential problems. The
criticalalerts are reserved for ‘wake-you-up-in-the-middle-of-the-night’ problems where there’s a serious system failure.
What’s more, once we transition to Terraform modules and add the Datadog provider to our service description, then all services end up getting alerts for free. The data will be powered directly by our internal toolkit and Cloudwatch metrics.
Once we had all these pieces in place, the day had finally come to make the switch.
We first set up a VPC peering connection between our new production environment and our legacy one–allowing us to cluster databases and replicate across the two.
Next, we pre-warmed the ELBs in the new environment to make sure that they could handle the load. Amazon won’t provision automatically sized ELBs, so we had to ask them to ramp it ahead of time (or slowly scale it oursleves) to deal with the increased load.
From there, it was just a matter of steadily ramping up traffic from our old environment to our new one using weighted Route53 routes, and continuously monitoring that everything looked good.
Today, our API is humming along, handling thousands of requests per second and running entirely inside Docker containers.
But we’re not done yet. We’re still fine-tuning our service creation, and reducing the boilerplate so that anyone on the team can easily build services with proper monitoring and alerting. And we’d like to improve our tooling around working with containers, since services are no longer tied to instances.
We also plan to keep an eye on promising tech for this space. The Convox team is building awesome tooling around AWS infrastructure. Kubernetes, Mesosphere, Nomad, and Fleet seemed like incredibly cool schedulers, though we liked the simplicity and integration of ECS. It’s going to be exciting to see how they all shake out, and we’ll keep following them to see what we can adopt.
After all of these orchestration changes, we believe more strongly than ever in outsourcing our infrastructure to AWS. They’ve changed the game by productizing a lot of core services, while maintaining an incredibly competitive price point. It’s creating a new breed of startups that can build products efficiently and cheaply while spending less time on maintenance. And we’re bullish on the tools that will be built atop their ecosystem.