Calvin French-Owen on May 16th 2016
Since Segment’s first launch in 2012, we’ve used queues everywhere. Our API queues messages immediately. Our workers communicate by consuming from one queue and then publishing to another. It’s given us a ton of leeway when it comes to dealing with sudden batches of events or ensuring fault tolerance between services.
We first started out with RabbitMQ, and a single Rabbit instance handled all of our pubsub. Rabbit had a lot of nice tooling around message delivery, but it turned out to be really tough to cluster and scale as we grew. It didn’t help that our client library was a bit of a mess, and frequently dropped messages (anytime you have to do a seven-way handshake for a protocol that’s not TLS… maybe re-think the tech you’re using).
So in January 2014, we started the search for a new queue. We evaluated a few different systems: Darner, Redis, Kestrel, and Kafka (more on that later). Each queue had different delivery guarantees, but none of them seemed both scalable and operationally simple. And that’s when NSQ entered the picture… and it’s worked like a charm.
As of today, we’ve pushed 750 billion messages through NSQ, and we’re adding around 150,000 more every second.
Before discussing how NSQ works in practice, it’s worth understanding how the queue is architected. The design is so simple, it can be understood with only a few core concepts:
topics - a topic is the logical key where a program publishes messages. Topics are created when programs first publish to them.
channels - channels group related consumers and load balance between them–channels are the “queues” in a sense. Every time a publisher sends a message to a topic, that message is copied into each channel that consumes from it. Consumers will read messages from a particular channel and actually create the channel on the first subscription. Channels will queue messages (first in memory, and spill over to disk) if no consumers are reading from them.
messages - messages form the backbone of our data flow. Consumers can choose to finish messages, indicating they were processed normally, or requeue them to be delivered later. Each message contains a count for the number of delivery attempts. Clients should discard messages which pass a certain threshold of deliveries or handle them out of band.
NSQ also runs two programs during operation:
nsqd - the nsqd daemon is the core part of NSQ. It’s a standalone binary that listens for incoming messages on a single port. Each nsqd node operates independently and doesn’t share any state. When a node boots up, it registers with a set of nsqlookupd nodes and broadcasts which topics and channels are stored on the node.
Clients can publish or read from the nsqd daemon. Typically publishers will publish to a single, local nsqd. Consumers read remotely from the connected set of nsqd nodes with that topic. If you don’t care about adding more nodes dynamically, you can run nsqds standalone.
nsqlookupd - the nsqlookupd servers work like Consul or etcd, only without coordination or strong consistency (by design). Each one acts as an ephemeral datastore that individual nsqd nodes register to. Consumers connect to these nodes to determine which nsqd nodes to read from.
Let’s walk through a more concrete example of how this works in practice.
NSQ recommends co-locating publishers with their corresponding nsqd instances. That means even in the face of a network partition, messages are stored locally until they are read by a consumer. What’s more, publishers don’t need to discover other nsqd nodes–they can always publish to the local instance.
First, a publisher sends a message to its local nsqd. To do this, it first opens up a connection, and then sends a PUB command with the topic and message body. In this case, we publish our messages to the events topic to be fanned out to our different workers.
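The PUB frame itself is simple enough to sketch. This encoder is purely illustrative (a real driver also handles the protocol magic, responses, and error frames), but it shows the wire format: the command line, a 4-byte big-endian length, then the body:

```javascript
// Sketch of an NSQ PUB frame: "PUB <topic>\n" followed by a
// 4-byte big-endian body size, then the raw message body.
function encodePub(topic, body) {
  const header = Buffer.from(`PUB ${topic}\n`, 'utf8');
  const payload = Buffer.from(body, 'utf8');
  const size = Buffer.alloc(4);
  size.writeUInt32BE(payload.length, 0);
  return Buffer.concat([header, size, payload]);
}

// Publishing an event to the "events" topic:
const frame = encodePub('events', '{"type":"track"}');
```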
The events topic will copy the message and queue it in each of the channels linked to the topic. In our case, there are three channels, one of them being the archives channel. Consumers will take these messages and upload them to S3.
Messages in each of the channels will queue until a worker consumes them. If the queue goes over the in-memory limit, messages will be written to disk.
The nsqd nodes will first broadcast their location to the nsqlookupds. Once they are registered, workers will discover all the nsqd nodes with the events topic from the lookup servers.
Then each worker subscribes to each nsqd host, indicating that it’s ready to receive messages. We don’t need a fully connected graph here, but we do need to ensure that individual nsqd instances have enough consumers to drain their messages (or else the channels will queue up).
Separate from the client library, here’s an example of what our message handling code might look like:
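A hedged sketch of that handling logic; the `finish`/`requeue`/`attempts` message API follows nsq.js conventions, but the constants and the `uploadToS3` function are hypothetical stand-ins:

```javascript
// Illustrative handler for the archives channel. msg.finish(), msg.requeue(),
// and msg.attempts mirror the nsq.js message API; MAX_DELIVERY_ATTEMPTS,
// BACKOFF_TIME, and uploadToS3 are stand-ins, not our production code.
const MAX_DELIVERY_ATTEMPTS = 8;
const BACKOFF_TIME = 60 * 1000; // requeue delay, in milliseconds

function handleMessage(msg, uploadToS3) {
  // 1. discard if it's passed a certain number of delivery attempts
  if (msg.attempts > MAX_DELIVERY_ATTEMPTS) {
    return Promise.resolve(msg.finish()); // ack it so it stops redelivering
  }
  return uploadToS3(msg.body)
    .then(() => msg.finish())                // 2. processed successfully
    .catch(() => msg.requeue(BACKOFF_TIME)); // 3. deliver again later
}
```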
If the third party fails for any reason, we can handle the failure. In this snippet, we have three behaviors:
discard the message if it’s passed a certain number of delivery attempts
finish the message if it’s been processed successfully
requeue the message to be delivered later if an error occurs
As you can see, the queue behavior is both simple and explicit.
In our example, it’s easy to reason that we’ll tolerate MAX_DELIVERY_ATTEMPTS * BACKOFF_TIME minutes of failure from our integration before discarding messages.
At Segment, we keep statsd counters for message attempts, discards, requeues, and finishes to ensure that we’re achieving a good QoS. We alert on a service any time its number of discards exceeds the thresholds we’ve set.
In production, we run nsqd daemons on pretty much all of our instances, co-located with the publishers that write to them. There are a few reasons NSQ works so well in practice:
Simple protocol - this isn’t a huge issue if you already have a good client library for your queue. But, it can make a massive difference if your existing client libraries are buggy or outdated.
NSQ has a fast binary protocol that is easy to implement with just a few days of work. We built our own pure-JS node driver (at the time, only a CoffeeScript driver existed), which has been stable and reliable.
Easy to run - NSQ doesn’t have complicated watermark settings or JVM-level configuration. Instead, you can configure the number of in-memory messages and the max message size. If a queue fills up past this point, the messages will spill over to disk.
Distributed - because NSQ doesn’t share information between individual daemons, it’s built for distributed operation from the beginning. Individual machines can go up and down without affecting the rest of the system. Publishers can publish locally, even in the face of network partitions.
This ‘distributed-first’ design means that NSQ can essentially scale forever. Need more throughput? Add more nsqds!
The only shared state is kept in the lookup nodes, and even those don’t require a “global view of the universe.” It’s trivial to set up configurations where certain nsqds register with certain lookups. The only key is that the consumers can query to get the complete set.
Clear failure cases - NSQ sets up a clear set of trade-offs around components which might fail, and what that means for both deliverability and recovery.
I’m a firm believer in the principle of least surprise, particularly when it comes to distributed systems. Systems fail, we get it. But systems that fail in unexpected ways are impossible to build upon. You end up ignoring most failure cases because you can’t even begin to account for them.
While they might not be as strict of guarantees as a system like Kafka can provide, the simplicity of operating NSQ makes the failure conditions exceedingly apparent.
UNIX-y tooling - NSQ is a good general purpose tool. So it’s not surprising that the included utilities are multi-purpose and composable.
In addition to the TCP protocol, NSQ provides an HTTP interface for simple cURL-able maintenance operations. It ships with binaries for piping from the CLI, tailing a queue, piping from one queue to another, and HTTP pubsub.
There’s even an admin dashboard for monitoring and pausing queues that wraps the HTTP API.
As I mentioned, that simplicity doesn’t come without trade-offs:
No replication - unlike other queues, NSQ doesn’t provide any sort of replication or clustering. This is part of what makes running it so simple, but it does mean accepting some hard trade-offs around reliability for published messages.
We partially get around this by lowering the file sync time (configurable via a flag) and backing our queues with EBS. But there’s still the possibility that a queue could indicate it’s published a message and then die immediately, effectively losing the write.
Basic message routing - with NSQ, topics and channels are all you get. There’s no concept of routing or affinity based upon a key. It’s something that we’d love to support for various use cases, whether it’s to filter individual messages, or route certain ones conditionally. Instead we end up building routing workers, which sit in-between queues and act as a smarter pass-through filter.
No strict ordering - though Kafka is structured as an ordered log, NSQ is not. Messages could come through at any time, in any order. For our use case, this is generally okay as all the data is timestamped, but doesn’t fit cases which require strict ordering.
No de-duplication - Aphyr has talked extensively in his posts about the dangers of timeout-based systems. NSQ also falls into this trap by using a heartbeat mechanism to detect whether consumers are alive or dead. We’ve previously written about various reasons that would cause our workers to fail heartbeat checks, so there has to be a separate step in the workers to ensure idempotency.
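That idempotency step can be as simple as checking a message ID before processing. This sketch uses an in-memory Set purely for illustration; a real worker would back the check with a shared store, since restarts would otherwise wipe the state:

```javascript
// At-least-once delivery means a message can arrive twice; dedupe by id.
// A real worker would back this with a shared store (e.g. Redis); the
// in-memory Set here only illustrates the idempotency check itself.
function makeIdempotent(process, seen = new Set()) {
  return function (msg) {
    if (seen.has(msg.id)) return 'duplicate'; // already handled, just ack
    seen.add(msg.id);
    process(msg);
    return 'processed';
  };
}
```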
As you can see, the underlying motif behind all of these benefits is simplicity. NSQ is a simple queue, which means that it’s easy to reason about and easy to spot bugs. Consumers can handle their own failure cases with confidence about how the rest of the system will behave.
In fact, simplicity was one of the primary reasons we decided to adopt NSQ in the first place (along with many of our other software choices). The payoff has been a queueing layer that has performed flawlessly even as we’ve increased throughput by several orders of magnitude.
Today, we face a more complex future. More and more of our workers require a stricter set of reliability and ordering guarantees than NSQ can easily provide.
We plan to start swapping out NSQ for Kafka in those pieces of the infrastructure and get much better at running the JVM in production. There’s a definite trade-off with Kafka; we’ll have to shoulder a lot more operational complexity ourselves. On the other hand, having a replicated, ordered log would provide much better behavior for many of our services.
But for any workers which fit NSQ’s trade-offs, it’s served us amazingly well. We’re looking forward to continuing to build on its rock-solid foundation.
Thanks to TJ Holowaychuk for creating nsq.js and helping drive its adoption at Segment.
Jehiah Czebotar and Matt Reiferson for building NSQ and reading drafts of this article.
Julian Gruber, Garrett Johnson, and Amir Abu Shareb for maintaining nsq.js.
Tejas Manohar, Vince Prignano, Steven Miller, Tido Carriero, Andy Jiang, Achille Roussel, Peter Reinhardt, Stephen Mathieson, Brent Summers, Nathan Houle, Garrett Johnson, and Amir Abu Shareb for giving feedback on this article.
Calvin French-Owen on March 15th 2016
I recently jumped back into frontend development for the first time in months, and I was immediately struck by one thing: everything had changed.
When I was more active in the frontend community, the changes seemed minor. We’d occasionally make switches in packaging (RequireJS → Browserify), or frameworks (Backbone → Components). And sometimes we’d take advantage of new Node/v8 features. But for the most part, the updates were all incremental.
Years ago, a friend and I discussed choosing the ‘right’ module system for his company. At the time, he was leaning towards going with RequireJS–and I urged him to look at Browserify or Component (having just abandoned RequireJS ourselves).
We talked again last night, and he said that he’d chosen RequireJS. By now, his company had built a massive codebase around it –- “I guess we bet on the wrong horse there.”
Most of the tools we use today didn’t even really exist a year ago: React, JSX, Flux, Redux, ES6, Babel, etc. Even setting up a ‘modern’ project requires installing a swath of dependencies and build tools that are all new. No other language does anything remotely resembling that kind of thing. It’s enough to even warrant a “State of the Art” post so everyone knows what to use.
As it turns out, there are a ton of interesting dynamics at play: corporate self-interest, cries for open standards, Mozilla pushing language development, lagging IE releases that dominate the market, and tools that pave over a lot of the fragmentation.
ECMAScript has a weird and fascinating history of its own, largely broken up into different ‘major’ editions.
ES1, the first edition of the spec, was released in June 1997, establishing the groundwork for the language and various features we still use today.
ES2 was released one year later in June 1998, but the changes were purely cosmetic. They had been added to comply with a corresponding ISO version of the spec without changing the specification for the language itself, so it’s typically omitted from compatibility tables.
ES3, released in December 1999, marked the first major additions to the language. It added support for regexes, switch and do/while statements, try/catch error handling, and function methods like .call and .apply. ES3 became the “standard” for what the majority of browsers would support, underscored by the popularity of older IE browsers. Even in February 2012 (13 years after the ES3 release!), ES3-era IE browsers still had over 20% of the browser market.
But just as the ES4 spec was nearing completion, trouble struck! On the one hand, Microsoft’s IE platform architect, Chris Wilson, argued that the changes would effectively “break the web”. He stated that the ES4 changes were so backwards-incompatible that they would significantly hurt adoption.
On the other hand, Brendan Eich, now the Mozilla CTO, argued for the changes. In an open letter to Chris Wilson, he objected to the fact that Microsoft was just now withdrawing support for a spec which had been in the works for years.
After meeting in Oslo, Brendan Eich sent a message to es-discuss (still the primary forum for language development) outlining a plan for a near-term incremental release, along with a bigger reform to the language spec that would be known as ‘harmony’. ES5 was finalized and published in December 2009.
With ES5 finally out the door, the committee started moving full-speed ahead on the next set of language extensions, which they’d hoped to start incorporating nearly a decade earlier.
By now, browser-makers and committee members alike had seen the danger of shipping a giant release without testing the features. So drafts of ES6 were published frequently starting in 2011, and its features made available behind flags (--harmony) in Node and many of the major browsers. Additionally, transpilers had become a part of the modern build chain (more on that later), allowing developers to use the bleeding edge of the spec without worrying about breaking old browsers. ES6 was eventually released in June 2015.
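For a flavor of what those flags and transpilers unlocked, here’s a small sketch using a few of the ES6 additions (arrow functions, destructuring, template literals, const):

```javascript
// A taste of the ES6 additions transpilers made usable early:
// const, arrow functions, destructuring, and template literals.
const events = [{ type: 'track' }, { type: 'page' }, { type: 'track' }];

// Destructure `type` out of each event while counting occurrences.
const counts = events.reduce((acc, { type }) => {
  acc[type] = (acc[type] || 0) + 1;
  return acc;
}, {});

const summary = `${counts.track} track, ${counts.page} page`;
```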
So we have:
ECMAScript versions driven by the planning committee to standardize the language
With experiments and language additions to each dialect pushing the ES standard ahead.
Of course, we can’t tell the full history of JS without mentioning the platform it’s running on: browsers.
But an interesting thing happened when Chrome first emerged onto the scene in December 2008. While Chrome had numerous features that made it technically interesting (tabs as processes, a new fast JS runtime, and others), perhaps the thing that made it most revolutionary was its release schedule.
Chrome shipped from day-1 with auto-updates, first tested from a widespread number of early adopters using the dev and beta channels. If you wanted to use the bleeding edge, you could easily do so. Nowadays, updates to the stable channel happen every six weeks, with thousands of users testing and automatically reporting bugs they encounter on the dev and beta channels.
It meant that Chrome was able to ship updates far more frequently than its competitors. Where the other browsers would tell users to update every 6-12 months, Chrome updated its dev and beta channels weekly and monthly, and released updates on the stable channel every six weeks.
Most of the features are still not “safe” to use and hidden behind flags, but the fact that browsers will continue to auto-update means that the velocity of new JS features has increased dramatically in the last few years.
Imagine now that you’re a developer just learning JS, and knowing none of that history. It’d be pretty much impossible to keep track of which features are supported in what browsers, what’s to spec and what’s not, and what parts of the language should even be used.
Realizing this was a problem, John Resig released the first version of jQuery in 2006. It was a drop-in library designed to pave over a lot of the inconsistencies between the different browsers. And for the most part, developers became comfortable with always bundling jQuery into their projects. It became the de facto library for DOM manipulation, AJAX requests, and more.
But working with many different scripts was tedious. They had to be loaded in a certain order, or else bundle duplicated dependencies many times over. Individual libraries would overwrite the global scope, causing conflicts and monkey-patching default implementations.
So James Burke created RequireJS in 2009, spawned out of his work with Dojo. He created a module loader designed to specify dependencies, and load them asynchronously in the browser. It was one of the first frameworks that actually introduced the idea of a “module system” for isolating pieces of code and loading them asynchronously–dubbed AMD (Asynchronous Module Definition).
While AMD was easy to load on the fly, it was also verbose. It essentially added dependency injection for every single library you’d want to load:
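A rough sketch of that AMD shape, with a toy `define` shim so the example runs standalone (RequireJS supplies the real, asynchronous implementation):

```javascript
// Toy define() so the sketch is self-contained; RequireJS provides the
// real one, which fetches each dependency asynchronously in the browser.
const registry = {};
function define(name, deps, factory) {
  registry[name] = factory.apply(null, deps.map((d) => registry[d]));
}

// Every dependency gets injected as a callback argument:
define('math', [], () => ({ add: (a, b) => a + b }));
define('app', ['math'], (math) => ({ sum: math.add(2, 3) }));
```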
The design had the benefit that the browser could load libraries on the fly, but it added a lot of overhead for the programmer.
Since Node ran on the server, making requests for additional scripts was cheap. The scripts were cached, so it was as easy as reading and parsing an additional file.
Instead of using AMD modules, Node popularized the CommonJS format (the competing module format at the time), using synchronous require statements for loading dependencies. It mirrored the way other languages worked, grabbing dependencies synchronously and then caching them. Isaac Schlueter built npm around it as the de facto way to manage and install dependencies in Node. Soon, everyone was writing node scripts that looked like this:
Yet the frontend still lagged. Developers wanted to use the same modules and packages in the browser that they were already using on the server.
So a new set of tools appeared on the frontend developer’s toolchain: Bower for pure dependency management; Browserify and Component/Duo for actually building scripts and bundling them together; Webpack, promising a new build system that also handled CSS; and Grunt and Gulp for orchestrating them all and making them play nicely together.
After all, there are hundreds of programming languages that can run on the server, but there’s only one widely-supported language for the browser.
But it didn’t end there.
I didn’t even go into changes in frameworks (Ember, Backbone, Angular, React) or how we structure asynchronous programming (Callbacks, Promises, Iterators, Async/Await). But those have all churned relatively regularly over the years.
The most interesting thing about having a language where everyone is comfortable with build tools, transpilers, and syntax additions is that the language can advance at an astonishing rate. Really the only thing holding back new language features is consensus. As quickly as people agree and implement the spec, there’s code to support it.
I predict we’ll continue to see the cycle of rapid iteration for the next few years at least. Companies will be forced to update their codebase, or be happy with the horse they have.
Calvin French-Owen on December 15th 2015
At Segment, we’ve fully embraced the idea of microservices; but not for the reasons you might think.
The microservices vs. monoliths debate has been pretty thoroughly discussed, so I won’t completely re-hash it here. Microservices proponents say that they provide better scalability and are the best way to split responsibility across software engineering teams. The pro-monolith group says that microservices are too operationally complex to begin with.
But a major benefit of running microservices is largely absent from today’s discussions: visibility.
When we’re getting paged at 3am on a Tuesday, it’s a million times easier to see that a given worker is backing up compared to adding tracing through every single function call of a monolithic app.
That’s not to say you can’t get good visibility from more tightly coupled code, it’s just rarer to have all the right visibility from day one.
Where does that visibility come from? Consider for a moment the standard tools that are part of our ops arsenal.
None of them monitor individual program execution: hot codepaths, stack size, etc. The battle-tested tools we’ve built over the past 20 years are all built around the concepts of hosts, processes, or drives.
With a distributed system, we can add in requests and network throughput to our metrics, but most tools still tend to aggregate at a host or service level.
The process-centric nature of monitoring tools makes it really difficult to get a sense of where a program is actually spending time. With a monolithic app, our best options to debug are either to run the program against a profiler or to implement our own timing metrics.
Now that’s kind of crazy when you think about it. Most of the reason flamegraphs are so useful is that we don’t have that detailed amount of monitoring at the level of individual function calls.
So instead of trying to shoe-horn lots of functionality into monoliths, at Segment we’ve doubled down on microservices. We’re betting that container scheduling and orchestration will continue to get easier and more powerful, while most metrics and monitoring will continue to be dominated by the idea of ‘hosts’ and ‘services’.
The caveat here is that microservices only work so long as it’s actually easy to create new services. Otherwise we’ve just traded a visibility problem for a provisioning problem.
In other posts, we’ve talked a little bit about what our services look like, and how we build them with terraform. And now, we’ve started splitting each service into modules, so we can re-use the exact configuration between stage and prod.
Here’s an example of a simple auth service, using terraform as our configuration to set up all of our resources:
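The snippet below is a reconstruction rather than our actual config: the module source and variable names are hypothetical, but it mirrors the shape of defining a service from a shared terraform module:

```hcl
# Hypothetical module usage; the source path and variables here are
# illustrative, not our production definitions.
module "auth-service" {
  source = "../../modules/service"

  name   = "auth"
  image  = "segment/auth"
  cpu    = 512
  memory = 1024

  # the same module is reused between stage and prod
  cluster = "${var.cluster}"
}
```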
For the curious, you can check out an example of the full module definition.
As long as there’s a significant benefit (free metrics) and low cost (a 10-line terraform script), we remove the temptation to tack different functionality onto an existing service.
And so far, that approach has been working quite well.
Segment is a bit unusual–instead of microservices which are coordinating together, we have a lot of what I’d call “microworkers.” Fundamentally, it’s the same concept, but the worker doesn’t serve requests to clients. Instead, the typical Segment worker reads some data from a queue, does some processing on it, and then acks the message.
These workers end up being a lot simpler than services because there are no dependencies. There’s no coupling or worrying that a given problem with one worker will compound and disrupt the rest of a system. If a service is acting up, there’s just a single queue that ends up backing up. And we can scale additional workers to handle the load.
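The shape of these workers is small enough to sketch. The queue interface here is illustrative (our real workers sit on NSQ channels), but the loop is the whole pattern: pull, process, ack:

```javascript
// The entire microworker pattern: read a message, process it, ack it.
// `queue` is an illustrative interface, not a real NSQ client; pull()
// returning null stands in for "nothing left" so this sketch terminates.
async function runWorker(queue, process) {
  let handled = 0;
  for (;;) {
    const msg = await queue.pull();
    if (msg === null) return handled;
    await process(msg);
    await queue.ack(msg); // only acked after successful processing
    handled += 1;
  }
}
```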
There are a few forces at work which make tiny workers the right call for us. But the biggest comes from our team size and relative complexity of what we’re trying to build.
Microservices are usually touted when the team grows to a size where there are too many people working on the same codebase. At that point, it makes sense to shard ownership of the codebase by team. But we’ve seen it be equally helpful with a small team as well.
Most folks I talk with are surprised at how small our engineering team is. To give you a rough sense of our scale:
400 private repos
70 different services (workers)
We’re in the position of having a large product scope and a small engineering team. So if I’m currently on-call and get paged, it could be for code that I wrote 6 months ago and haven’t touched since.
And that’s the place where tiny, well-defined, services shine.
Here’s the typical scenario: first there’s an alert which gets triggered because a particular queue depth is backing up.
We can verify this is really the case (and isn’t getting better) by checking the queue depth in our monitoring tools.
At that point we know exactly which worker is backing up (since each worker subscribes to a single queue), and which logs to look at. Each service logs with its own tags, so we don’t have to worry about unrelated logs interleaving within a single app for multiple requests.
We can look at Datadog for a single dashboard containing that worker’s CPU, memory, and the responses and latency coming from its ELB. Once we’ve identified the problem, it’s a question of reading through a 50-100 line file to isolate exactly where the problem is happening (let’s play spot the memory leak!).
With a monolith, we could add individual monitoring specifically for each endpoint. But why bother when we get it for free by running code as part of its own process?
Not to mention the fact that we also get isolated CPU, memory, and latency (if the service sits behind an ELB) out of the box. It’s infinitely easier to track down a memory leak in a hundred-line worker with a single codepath than it is in a monolithic app with hundreds of endpoints.
I understand this approach won’t work for everyone. And it requires a pretty significant investment in up-front tooling to make sure that creating a new service from scratch has everything it needs. Depending on your team, workload, and product scope, it might not make sense.
But for any product operating with a high level of operational complexity and load, I’d choose the microservice architecture every time. It’s made our infrastructure flexible, scalable, and far easier to monitor–without sacrificing developer productivity.
Calvin French-Owen on November 20th 2015
Every month, Segment collects, transforms and routes over 50 billion API calls to hundreds of different business-critical applications. We’ve come a long way from the early days, where my co-founders and I were running just a handful of instances.
Today, we have a much deeper understanding of the problems we’re solving, and we’ve learned a ton. To keep moving quickly and avoid past mistakes, our team has started developing a list of engineering best practices.
Now that a lot of these “pro tips” have been tested, deployed and are currently in production… we wanted to share them with you. It’s worth noting that we’re standing on the shoulders of giants here; thanks to The Zen of Python, Hints for Computer System Design, and the Twelve-Factor App for the inspiration.
Editor’s Note: This post was based off an internal wiki page for Segment “Pro Tips”. There are more tips recorded there, but we chose a handful that seemed most broadly applicable. They’re written as fact, but internally we treat them as guidelines, always weighing other trade-offs within the organization. Each practice is followed by a few bullet-points underscoring the main takeaways.
The best practices listed below have helped our engineering team move more efficiently and effectively, minimizing waste and boosting collaboration. Your engineering team might want to tweak these practices based on your goals or priorities, but by following the general principles laid out below you’ll have a good start.
When we first started out, we had one massive repo. Every module was filled with tightly coupled dependencies and was completely unversioned. Changing a single API required changing code globally. Developing with more than a handful of people would’ve been a nightmare.
So one of our first changes as the engineering team grew was splitting out the modules into separate repos (thanks TJ!). It was a massive task but it had huge payoff by making development with a larger team actually sane. Unfortunately, it was way harder than it should have been because we lumped everything together at the start.
It turns out this temptation to combine happens everywhere: in services, libraries, repos and tools. It’s so easy to add (just) one more feature to an existing codebase. But it has a long-term cost. Separation of concerns is the exact reason why UNIX-style systems are so successful; they give you the tools to compose many small building blocks into more complex programs.
structure code so that it’s easy to be split (or split from the beginning)
if a service or library doesn’t share concerns with existing ones, create a new one rather than shoe-horning it into an existing piece of code
libraries which perform a single function are much easier to test and document
keep uptime, resource consumption and monitoring in mind when combining read/write concerns of a service
prefer libraries to frameworks, composing them together where possible
“Clever” code usually means “complicated” code. It’s hard to search for, and tough to track down where bugs are happening. We prefer simple code that’s explicit in its purpose rather than trying to create a magical API that relies on convention (Go’s lack of “magic” is actually one of our favorite things about it).
As part of being explicit, always consider the “grep-ability” of your code. Imagine that you’re trying to find out where the implementation for the post method lives. Which is easier to find in a codebase?
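A contrived sketch of the difference. Searching the codebase for “post” finds the explicit version immediately; in the generated version, the method name never appears in the source at all:

```javascript
// Hard to grep: the post method is conjured at runtime from a list of
// verb names, so searching for "post" never lands on its definition.
const magic = {};
['get', 'post', 'put'].forEach((verb) => {
  magic[verb] = (url) => `${verb.toUpperCase()} ${url}`;
});

// Easy to grep: searching for "post" lands right here.
const explicit = {
  post(url) {
    return `POST ${url}`;
  },
};
```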
Where possible, write code that is short, straightforward and easy to understand. Often that will come down to single functions that are easy to test and easy to document. Even libraries can perform just a single function and then be combined for more powerful functionality.
With comments, describe the “why” versus the typical “what” for a given process or routine. If a routine seems out of place but is necessary, it’s sometimes worth leaving a quick note as to why it exists at all.
avoid generating code dynamically or being overly ‘clever’ to shorten the line count
aim for functions that are <7 lines and <2 nested callbacks
Running code in production without metrics or alerting is flying blind. This has bitten our team more times than I’d care to admit, so we’ve increased our test coverage and monitoring extensively. Every time a user encounters a bug before we do, it damages their trust in us as a company. And that sucks.
Trust in our product is perhaps the most valuable asset we have as a company. Losing that is almost completely irrecoverable; it’s the way we lose as a business. Our brand is built around data, and reliability is paramount to our success.
write test cases first to check for the broken behavior, then write the fix
all top-level apps should ship with metrics and monitoring
create ‘warning’ alerts for when an internal system is acting up, ‘critical’ ones when it starts affecting end customers
try to keep unrealistic failure scenarios in mind when designing the alerts
When building a product, there are three aspects you can optimize: Speed, quality, and scope. The catch… is that you can’t ever juggle all three simultaneously. Sacrificing quality by adding hacky fixes increases the amount of technical debt. It slows us down over the long-term, and we risk losing customer trust in the product. Not to mention, hacks are a giant pain to work on later.
At the same time, we can’t sacrifice speed either–that’s our main advantage as a startup. Long-running projects tend to drag on, use up a ton of resources and have no clearly defined “end.” By the time a monolithic project is finally ready to launch, releasing the finished product to customers becomes a daunting process.
When push comes to shove, it’s usually best to cut scope. It allows us to split shipments into smaller, more manageable chunks, and really focus on making each one great.
evaluate features for their benefit versus their effort
identify features that could be easily layered in later
cut features that create obvious technical debt
Separate code paths almost always become out of sync. One will get updated while another doesn’t, which makes for inconsistent behavior. At the architecture level, we want to try and optimize for a single code path.
Note that this is still consistent with splitting things apart, it just means that we need smaller pieces which are flexible enough to be combined together in different ways. If two pieces of code rely on the same functionality, they should use the same code path.
have a peer review your code; an objective opinion will almost always help
get someone else to sign-off on non-trivial pull-requests
if you ever find yourself copy-pasting code, consider pulling it into a library
if you need to frequently update a library, or keep state around, turn it into a service
Creating a loose mockup of a program is often the quickest way to understand the problem you’re solving. When you’re ready to write the real thing, just rm -fr .git to start with a clean slate and better context.
Building something helps you learn more than you could ever hope to uncover through theorizing. Trust me, prototyping helps discover strange edge-cases and bottlenecks which may require you to rearchitect the solution. This process minimizes the impact of architectural changes.
don’t spend a lot of time with commit messages, keep them short but sensical
refactors typically come from a better understanding of the problem, the best way to get there is by building a version to “throw away”
Early on, it’s easy to write off automation as unimportant. But if you’ve done any time-consuming task more than 3 times you’ll probably want to automate it.
A key example of where we failed at this in the past was with Redshift’s cluster management. Investing in the tooling around provisioning clusters was a big endeavor, but it would have saved a ton of time if we’d started it sooner.
if you find yourself repeatedly spending more than a few minutes on a task, take a step back and consider tooling around it
ask yourself if you could be 20% more efficient, or if automation would help
share tools in dotfiles, vm, or task runner so the whole team can use them
Whenever you’re building out a new project or library, it’s worth considering which pieces can be pulled out and open sourced. At face value, it sounds like an extra constraint that doesn’t help ship product. But in practice, it actually creates much cleaner code. We’re guaranteed that the code’s API isn’t tightly coupled to anything we’re building internally, and that it’s more easily re-used across projects.
Open sourced code typically has a well-documented Readme, tests, CI, and more closely resembles the rest of the ecosystem. It’s a good sanity check that we’re not doing anything too weird internally, and the code is easier to forget about and revisit 6 months later.
if you build a general purpose library without any dependencies, it’s a prime target for open sourcing
try and de-couple code so that it can be used standalone with a clear interface
never include custom configuration in a library, allow it to be passed in with sane defaults
Sometimes big problems arise in code and it may seem easier to write a work-around. Don’t do that. Hacking around the outskirts of a problem is only going to create a rat’s nest that will become an even bigger problem in the future. Tackle the root cause head-on.
A textbook example of this came from the first version of our integrations product. We proxied and transformed analytics calls through our servers to 30–40 different services, depending on what integrations the customer had enabled. On the backend, we had a single pool of integration workers that would read each incoming event from the queue, look up which settings were enabled, and then send copies of the event to each enabled integration.
It worked great for the first year, but over time we started running into more and more problems. Because the workers were all shared, a single slow endpoint would grind the entire pool of workers to a halt. We kept adjusting and tweaking individual timeouts to no end, but the backlogs kept occurring. Since then, we’ve fixed the underlying issue by partitioning the data processing queues by endpoint so they operate completely independently. It was a large project, but one that had immediate pay-off, allowing us to scale our integrations platform.
Sometimes it’s worth taking a step back to solve the root cause or upstream problem rather than hacking around the periphery. Even if it requires a more significant restructuring, it can save you a lot of time and headache down the road, allowing you to achieve much greater scale.
whenever fixing a bug or infrastructure issue, ask yourself whether it’s a core fix or just a band-aid over one of the symptoms
keep tabs on where you’re spending the most time, if code is continually being tweaked, it probably needs a bigger overhaul
if there’s some bug or alert we didn’t catch, make sure the upstream cause is being monitored
When designing applications, coming up with a data model is one of the trickiest parts of implementation. The frontend, naturally, wants to match the user’s idea of how the data is formatted. Out of necessity, the backend has to match the actual data format. It must be stored in a way that is fast, performant and flexible.
So when starting with a new design, it’s best to first look at the user requirements and ask “which goals do we want to meet?” Then, look at the data we already have (or decide what new data you need) and figure out how it should be combined.
The frontend models should match the user’s idea of the data. We don’t want to have to change the data model every time we change the UI. It should remain the same, regardless of how the interface changes.
The service and backend models should allow for a flexible API from the programmer’s perspective, in a way that’s fast and efficient. It should be easy to combine individual services to build bigger pieces of functionality.
The controllers are the translation layer, tying together individual services into a format which makes sense to the frontend code. If there’s a piece of complicated logic which makes sense to be re-used, then it should be split into its own service.
the frontend models should match the user’s conception of the data
the services need to map to a data model that is performant and flexible
controllers can map between services and the frontend to assemble data
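A minimal sketch of that layering, with hypothetical names (none of this is from the original post): a service returns a storage-friendly model, and a controller translates it into the frontend's conception of the data.

```javascript
// Service model: optimized for storage and flexibility, not for display.
const userService = {
  find(id) {
    return {
      id,
      first_name: 'Ada',
      last_name: 'Lovelace',
      created_at: '2015-01-01T00:00:00Z',
    };
  },
};

// Controller: the translation layer between the service and the frontend.
function getUserForFrontend(id) {
  const user = userService.find(id);
  return {
    id: user.id,
    name: user.first_name + ' ' + user.last_name,
    memberSince: new Date(user.created_at).getUTCFullYear(),
  };
}

console.log(getUserForFrontend(1)); // { id: 1, name: 'Ada Lovelace', memberSince: 2015 }
```

If the translation ever grows complicated enough to be reused elsewhere, that's the cue to promote it into its own service.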
It’s easy to talk at length about best practices, but actually following them requires discipline. Sometimes it’s tempting to cut corners or skip a step, but that doesn’t help long-term.
Now that we’ve codified these engineering best practices and the rationale behind each one, they have made their way into our default mode of operation. The act of explicitly writing them down has both clarified our thinking and helped us avoid making the same short-term mistakes over and over.
In practice, this means that we invest heavily in good tooling, modular libraries and microservices. In development, we keep a shared VM that auto-updates, with shared dotfiles for easily navigating our many small repositories. We put a focus on creating projects which increase functionality through composability rather than inheritance. And we’ve worked hard to streamline our process for running services in production.
All of this keeps our development team moving quickly and increases the quality of the product we ship. We’re able to accomplish a lot more with a lot less effort. And we’ll continue trying to improve and share that tooling with the community as it matures.
Andy Jiang, Vince Prignano on November 17th 2015
Growing a business is hard and growing the engineering team to support that is arguably harder, but doing both of those without a stable infrastructure is basically impossible. Particularly for high growth businesses, where every engineer must be empowered to write, test, and ship code with a high degree of autonomy.
Over the past year, we’ve added ~60 new integrations (to over 160), built a platform for partners to write their own integrations, released a Redshift integration, and have a few big product announcements on the way. And in that time, we’ve had many growing pains around managing multiple environments, deploying code, and general development workflows. Since our engineers are happiest and most productive when their time is spent shipping product, building tooling, and scaling services, it’s paramount that the development workflow and its supporting infrastructure are simple to use and flexible.
And that’s why we’ve automated many facets of our infrastructure. We’ll share our current setup in greater detail below, covering these main areas:
Let’s dive in!
As the code complexity and the engineering team grow, it can become harder to keep dev environments consistent across all engineers.
Before our current solution, one big problem our engineering team faced was keeping all dev environments in sync. We had a GitHub repo with a set of shell scripts that all new engineers executed to install the necessary tools and authentication tokens onto their local machines. These scripts would also setup Vagrant and a VM.
But this VM was built locally on each computer. If you modified the state of your VM, then in order to get it back to the same VM as the other engineers, you’d have to build everything again from scratch. And when one engineer updates the VM, you have to tell everyone on Slack to pull changes from our GitHub VM repo and rebuild. An awfully painful process, since Vagrant can be slow.
Not a great solution for a growing team that is trying to move fast.
When we first played with Docker, we liked the ability to run code in a reproducible and isolated environment. We wanted to reuse these Docker principles and experience in maintaining consistent dev environments across a growing engineering team.
We wrote a bunch of tooling around the VM so that new engineers can set it up, upgrade it, or reset it back to the base image state. When our engineers set up the VM for the first time, it asks for their GitHub credentials and AWS tokens, then pulls and builds from the latest image in Docker Hub.
On each run, we make sure that the VM is up-to-date by querying the Docker Hub API. This process updates packages, tools, etc. that our engineers use everyday. It takes around 5 seconds and is needed in order to make sure that everything is running correctly for the user.
Additionally, since our engineers use Macs, we switched from a boot2docker VirtualBox machine to a Vagrant-hosted boot2docker instance so that we could take advantage of NFS to share volumes between the host and guest. Using NFS provides massive performance gains during local development. Lastly, NFS means any changes our engineers make outside of the VM are instantaneously reflected within the VM.
With this solution we have vastly reduced the number of dependencies needed to be installed on the host machine. The only things needed now are Docker, Docker Compose, Go, and a
The ideal situation is dev and prod environments running the same code, yet separated so that code running in dev can never affect code running in production.
Previously, we stored the AWS state (generated by Terraform) alongside the Terraform files themselves, but this wasn’t a perfect system. For example, if two people plan and apply different changes concurrently, the state gets modified in two places, and whoever pushes last has a hard time untangling the merge conflicts.
We achieved mirroring staging and production in the simplest way possible: copying files from one folder to another. Terraform has dramatically reduced the number of hours it takes us to modify our infrastructure, deploy new services, and make improvements.
We integrated Terraform with CircleCI by writing a custom build process, and we make sure the proper security checks are in place before any change is applied.
At the moment, we have a single repository hosted on GitHub named infrastructure, which contains a collection of Terraform scripts that configure environment variables and settings for each of our containers.
When we want to change something in our infrastructure, we make the necessary changes to the Terraform scripts and run them before opening a new pull request for someone else on the infra-team to review it. Once the pull request gets merged to master, CircleCI will start the deployment process: the state gets pulled, modified locally, and stored again on S3.
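Storing the shared state in S3 is a one-time configuration step; a minimal sketch of what that looked like with the Terraform of the era (bucket and key names are illustrative, not from the post):

```shell
# Point Terraform at shared state in S3 so that CI, not an individual
# laptop, is the source of truth.
terraform remote config \
  -backend=s3 \
  -backend-config="bucket=example-terraform-state" \
  -backend-config="key=prod/terraform.tfstate" \
  -backend-config="region=us-east-1"
```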
When developing locally, it’s important to populate local data stores with dummy data, so our app looks more realistic. As such, seeding databases is a common part of setting up the dev environment.
We rely on CircleCI, Docker, and volume containers to provide easy access to dummy data. Volume containers are portable images of static data. We decided to use volume containers because the data model and logic becomes less coupled and easier to maintain. Also just in case this data is useful in other places in our infrastructure (testing, etc., who knows).
Loading seed data into our local dev environment happens automatically when we start the app server in development. For example, when the app (our main application) container is started in a dev environment, app’s docker-compose.yml script will pull the latest seed image from Docker Hub and mount the raw data in the VM.
The seed image on Docker Hub is built from a GitHub repo, seed, which is just a collection of JSON files containing the raw objects we import into our databases. To update the seed data, we have CircleCI set up on the repo so that any push to master will build (grabbing our mongodb and redis containers from Docker Hub) and publish a new seed image to Docker Hub, which we can then use in the app.
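The compose wiring might look roughly like the following sketch (docker-compose v1 syntax, which was current at the time; service and image names are illustrative, and the volume-container pattern is simplified):

```yaml
seed:
  image: example/seed:latest     # volume container built by CircleCI
mongo:
  image: mongo:3.0
  volumes_from:
    - seed                       # mount the raw JSON seed data for import
redis:
  image: redis:2.8
app:
  build: .
  links:
    - mongo
    - redis
```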
Due to the data-heavy nature of Segment, our app already relies on several microservices (db service, redis, nsq, etc). In order for our engineers to work on the app, we need to have an easy way to create these services locally.
Again, Docker makes this workflow extremely easy.
Similar to how we use seed volume containers to mount data into the VM locally, we do the same with microservices. We use the docker-compose file to grab images from Docker Hub, create them locally, set addresses and aliases, and ultimately reduce the complexity of getting everything up and running to a single terminal command.
If you write code, but never ship it to production, did it ever really happen? 😃
Deploying code to production is an integral part of the development workflow. At Segment, we prioritize ease and flexibility around shipping code to production, since that encourages our engineers to move quickly and be productive. We’ve also created adequate tooling around safeguarding for errors, rolling back, and monitoring build statuses.
We use Docker, ECS, CircleCI, and Terraform to automate as much of the continuous deployment process as possible.
Whenever code is pushed or merged into its master branch, the CircleCI script builds the container and pushes it to Docker Hub.
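A minimal circle.yml for this flow might look like the following sketch (CircleCI 1.0 syntax; repository and image names are illustrative, not from the post):

```yaml
machine:
  services:
    - docker

deployment:
  production:
    branch: master
    commands:
      - docker build -t example/service:$CIRCLE_SHA1 .
      - docker login -e $DOCKER_EMAIL -u $DOCKER_USER -p $DOCKER_PASS
      - docker push example/service:$CIRCLE_SHA1
```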
With this setup, we can define the configuration once for any service, making it extremely easy for our engineers to create and deploy new microservices. As Calvin mentioned in a previous post, “Rebuilding Our Infrastructure with Docker, ECS, and Terraform”:
We no longer have a complex set of provisioning scripts or AMIs—we just hand the production cluster an image, and it runs. There’s no more stateful instances, and we’re guaranteed to run the same exact code on both staging and prod.
The automation and ease of use around deployment have positively impacted more than just our engineers. Our success and marketing teams can update markdown files in a handful of repos that, when merged to master, kick off an auto deploy process so that changes can be live in minutes.
Because we chose to invest effort into rethinking and automating our dev workflow and its supporting infrastructure, our engineering team moves faster and more confidently. We spend more time doing high leverage jobs that we love—shipping product and building internal tools—and less time yak shaving.
That said, this is by no means the final iteration of our infrastructure automation. We are constantly playing with new tools and testing new ideas, seeing what further efficiencies we can eke out.
This has been a tremendous learning process for us and we’d love to hear what others in the community have done with their dev workflows. If you end up implementing something like this (or have already), let us know! We’d love to hear what you’ve done, and what’s worked or hasn’t for others with similar problems.
Andy Jiang on October 20th 2015
A little while ago we open-sourced a static site generator called Metalsmith. We built Metalsmith to be flexible enough that it could build blogs (like the one you’re reading now), knowledge bases, and most importantly our technical documentation.
Using Metalsmith to build a simple blog is one thing, but building easy-to-maintain documentation isn’t as simple. There are different sections, libraries, various reusable code snippets and content that live in multiple places, and other stuff to consider. Metalsmith simplifies maintaining all of these moving parts and lets us focus purely on creating helpful content. Here’s how we did it!
The first thing to know is that Metalsmith works by running a series of transformations on a directory of files. In the case of our docs, that directory is just a bunch of folders with Markdown files in them.
The directory structure mimics the URL structure:
And an individual Markdown file might look like this:
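The original examples didn't survive the formatting here, but the idea is simple; a hypothetical tree (paths are illustrative):

```
docs/
├── api/
│   └── tracking-api.md     →  /docs/api/tracking-api
└── libraries/
    ├── analytics.js.md     →  /docs/libraries/analytics.js
    └── node.md             →  /docs/libraries/node
```

and each file is just YAML front matter followed by plain Markdown (contents illustrative):

```markdown
---
title: Node.js
---

# Node.js

Our Node.js library lets you record analytics data from your server-side code...
```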
It’s structured this way because it makes the actual content easy to maintain. They are just regular folders with plain old Markdown files in them. That means that anyone on the team can easily edit the docs without any technical knowledge. You can even do it right from the GitHub editor:
So far so good. But how do just those simple Markdown files get transformed into our entire technical documentation? That’s where Metalsmith comes in. Like I mentioned earlier, Metalsmith is really just a series of transformations to run on the directory of Markdown files.
I’ll walk you through each of the transformations we use in order.
The first transformation we use is a custom plugin that takes all the files in a directory and exposes them as partials in Handlebars. That means we can keep content that we repeat a lot in a single place for easier maintenance.
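A simplified, self-contained sketch of that idea (the real plugin registers the results with Handlebars via Handlebars.registerPartial; names here are illustrative):

```javascript
// Walk every file in a directory of the build and collect its contents
// under the file's name, so templates can reference shared snippets
// that are maintained in a single place.
function partials(files) {
  const registered = {};
  Object.keys(files).forEach((path) => {
    const name = path.replace(/\.html$/, '');
    registered[name] = files[path].contents.toString();
  });
  return registered;
}

const files = {
  'signup-cta.html': { contents: Buffer.from('<a href="/signup">Sign up</a>') },
};
console.log(partials(files)['signup-cta']); // <a href="/signup">Sign up</a>
```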
The next transformation is a plugin that groups files together into “collections”. In our case, those collections are built into our sub-directories, so we have collections like: Libraries, Plugins, Tutorials, etc.
We also pass our own custom sorter function, which returns files in the order specified in the array and appends the remaining files pseudo-alphabetized.
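The sorter itself wasn't preserved, but a reconstruction of the described behavior might look like this (a sketch: pinned titles come first in array order, everything else follows alphabetically; the comparator shape matches what a collections plugin's sortBy option expects):

```javascript
function sorter(order) {
  return (a, b) => {
    const ai = order.indexOf(a.title);
    const bi = order.indexOf(b.title);
    if (ai !== -1 && bi !== -1) return ai - bi; // both pinned: array order
    if (ai !== -1) return -1;                   // pinned files first
    if (bi !== -1) return 1;
    return a.title < b.title ? -1 : 1;          // the rest, alphabetized
  };
}

const files = [{ title: 'Zoo' }, { title: 'Node' }, { title: 'Analytics.js' }, { title: 'Api' }];
files.sort(sorter(['Analytics.js', 'Node']));
console.log(files.map((f) => f.title)); // [ 'Analytics.js', 'Node', 'Api', 'Zoo' ]
```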
Having all of the collections grouped as simple arrays makes it easy for us to do things like automatically generate a top-level navigation to get to every collection:
Or to automatically generate a collection-level navigation for navigating between pages:
The plugin categorizes all files that fit the provided definition (in our case, matching file path patterns), adds a collection array to each file containing the names of the collections it belongs to, and finally adds next and previous properties to each file that point to its sibling files in the collection. This allows us to easily render collections later on with Handlebars:
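The Handlebars snippet didn't survive, but rendering a collection plus its previous/next links is along these lines (collection and property names are illustrative):

```handlebars
{{#each collections.libraries}}
  <a href="/docs/{{ this.path }}">{{ this.title }}</a>
{{/each}}

{{#if previous}}<a href="/docs/{{ previous.path }}">Previous</a>{{/if}}
{{#if next}}<a href="/docs/{{ next.path }}">Next</a>{{/if}}
```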
The key is that all of those pieces are automatically generated, so you never need to worry about remembering to link between pages.
Note that this plugin does not determine the final directory structure. By default, the directory structure is preserved from start to end, unless a plugin specifically modifies this.
The third transformation step is to template all of our Markdown files in place with metalsmith-in-place. By that, I mean that we just run Handlebars over our Markdown files right where they are, so that we can take advantage of a bunch of helpers we’ve added.
For example, in any of our Markdown files we can use an api-example helper, which will render a language-agnostic code snippet that remembers the user’s language preference:
You can find the above code snippet here.
Then, we transform all of those Markdown files into HTML files with the metalsmith-markdown plugin. Pretty self-explanatory!
Now that we have all of our files as .html instead of .md, the next transformation is pretty simple using metalsmith-headings. It iterates over all of the files once more, extracting the text of all the <h2> tags and adding that array as metadata on the file. So you might end up with a file object that looks like this:
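The original example object wasn't preserved; its shape is roughly this (contents and heading text are illustrative):

```javascript
// After metalsmith-headings runs, each file's metadata includes the
// extracted <h2> text, which the sidebar template iterates over.
const file = {
  contents: '<h2>Getting Started</h2>...<h2>API Reference</h2>...',
  headings: [
    { text: 'Getting Started' },
    { text: 'API Reference' },
  ],
};

console.log(file.headings.map((h) => h.text)); // [ 'Getting Started', 'API Reference' ]
```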
Why would we want to do that? Because it means we can build the navigation in the sidebar automatically from the content of the file itself:
So you never need to worry about remembering to update the navigation yourself.
The next step is to use the permalinks plugin to transform files so that all of the content lives in index.html files, so they can be served statically. For example, given a source directory like this:
The permalinks plugin would transform that into:
So that NGINX can serve those static files as:
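Reconstructing the lost examples, the transformation looks roughly like this (paths are illustrative):

```
build/libraries/node.html    →  build/libraries/node/index.html
build/libraries/python.html  →  build/libraries/python/index.html
```

so NGINX can serve them at the clean URLs /libraries/node and /libraries/python.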
The last step is to template all of our source files again (they’re not .md anymore, they’re all .html at this point) by rendering them into our top-level layout. The layout.html file is where all of the navigation rendering logic is contained, and we just dump the contents of each of the pages that started as Markdown into the global template, like so:
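The template itself wasn't preserved, but the core of such a layout is just a handful of partials plus a triple-stash dump of the page contents; a sketch (partial and property names are illustrative):

```handlebars
<body>
  {{> header }}
  <aside>
    {{#each headings}}
      <a href="#{{ id }}">{{ text }}</a>
    {{/each}}
  </aside>
  <main>{{{ contents }}}</main>
</body>
```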
Once that’s done, we’re done! All of those files that started their life as simple Markdown have been run through a bunch of transformations. They now live as a bunch of static HTML files that each have automatically-generated navigations and sidebars (with active states too).
The last step is to deploy our documentation. This step isn’t to be forgotten, because our goal was to make our docs so simple to edit that everyone on the team can apply fixes as customers report problems.
To make our team as efficient as possible about shipping fixes and updates to our docs, we have our repo set up so that any branch merged to master will kick off CircleCI to build and publish to production. Anyone can make edits in a separate branch, submit a PR, and merge to master, which will then automatically deploy the changes.
For the vast majority of text-only updates, this is perfect. Though, occasionally we may need more complex things.
For more information on the tech we use for our backend, check out Rebuilding Our Infrastructure with Docker, ECS, and Terraform.
Before we converted our docs to Metalsmith, they lived in a bunch of Jade files that were a pain in the butt to change. Because we had little incentive to edit them, we let typos run rampant and waited too long to fix misinformation. Obviously this was a bad situation for our customers.
Now that our docs are easy to edit in Markdown and quick to deploy, we fix problems much faster. The quickest way to fix docs issues is to make a permanent change, rather than repeat ourselves in ticket after ticket. With a simpler process, we’re able to serve our customers much better, and we hope you can too!
Calvin French-Owen on October 7th 2015
In Segment’s early days, our infrastructure was pretty hacked together. We provisioned instances through the AWS UI, had a graveyard of unused AMIs, and configuration was implemented three different ways.
As the business started taking off, we grew the size of the eng team and the complexity of our architecture. But working with production was still limited to a handful of folks who knew the arcane gotchas. We’d been improving the process incrementally, but we needed to give our infrastructure a deeper overhaul to keep moving quickly.
So a few months ago, we sat down and asked ourselves: “What would an infrastructure setup look like if we designed it today?”
Over the course of 10 weeks, we completely re-worked our infrastructure. We retired nearly every single instance and old config, moved our services to run in Docker containers, and switched over to use fresh AWS accounts.
We spent a lot of time thinking about how we could make a production setup that’s auditable, simple, and easy to use–while still allowing for the flexibility to scale and grow.
Here’s our solution.
Instead of using regions or tags to separate different staging and prod instances, we switched over to totally separate AWS accounts. We needed to ensure that our provisioning scripts wouldn’t affect our currently running services, and using fresh accounts meant that we had a blank slate to start with.
The ops account serves as the jump point and centralized login. Everyone in the organization can have an IAM account for it.
The other environments have a set of IAM roles to switch between them. It means there’s only ever one login point for our admin accounts, and a single place to restrict access.
As an example, Alice might have access to all three environments, but Bob can only access dev (ever since he deleted the production load balancer). But they both enter through the ops account.
Instead of having complex IAM settings to restrict access, we can easily lock down users by environment and group them by role. Using each account from the interface is as simple as switching the currently active role.
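In practice, this kind of role switching can be encoded directly in the AWS CLI config; a sketch with made-up account IDs and role names:

```ini
# ~/.aws/config -- illustrative only. Every profile assumes a role in its
# environment from the central ops account.
[profile ops]
region = us-east-1

[profile stage]
role_arn = arn:aws:iam::111111111111:role/admin
source_profile = ops

[profile prod]
role_arn = arn:aws:iam::222222222222:role/admin
source_profile = ops
```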
Instead of worrying that a staging box might be unsecured or alter a production database, we get true isolation for free. No extra configuration required.
There’s the additional benefit of being able to share configuration code so that our staging environment actually mirrors prod. The only difference in configuration are the sizes of the instances and the number of containers.
Finally, we’ve also enabled consolidated billing across the accounts. We pay our monthly bill with the same invoicing and see a detailed breakdown of the costs split by environment.
As of today, we’re now running the majority of our services inside Docker containers, including our API and data pipeline. The containers receive thousands of requests per second and process 50 billion events every month.
The biggest single benefit of Docker is the extent that it’s empowered the team to build services from scratch. We no longer have a complex set of provisioning scripts or AMIs—we just hand the production cluster an image, and it runs. There’s no more stateful instances, and we’re guaranteed to run the same exact code on both staging and prod.
After configuring our services to run in containers, we chose ECS as the scheduler.
At a high level, ECS is responsible for actually running our containers in production. It takes care of scheduling services, placing them on separate hosts, and zero-downtime reloads when attached to an ELB. It can even schedule across AZs for better availability. If a container dies, ECS will make sure it’s re-scheduled on a new instance within that cluster.
The switch to ECS has vastly simplified running a service without needing to worry about upstart jobs or provisioning instances. It’s as easy as adding a Dockerfile, setting up the task definition, and associating it with a cluster.
In our setup, the Docker images are built by CI, and then pushed to Docker Hub. When a service boots up, it pulls the image from Docker Hub, and then ECS schedules it across machines.
We group our service clusters by their concern and load profile (e.g. different clusters for API, CDN, App, etc). Having separate clusters means that we get better visibility and can decide to use different instance types for each (since ECS has no concept of instance affinity).
Each service has a particular task definition indicating which version of the container to run, how many instances to run, and which cluster to choose.
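A task definition is just a JSON document; a stripped-down sketch (family, image, and values are illustrative, not Segment's actual configuration):

```json
{
  "family": "api",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "example/api:v1.2.3",
      "cpu": 512,
      "memory": 1024,
      "portMappings": [{ "containerPort": 3000, "hostPort": 0 }],
      "essential": true
    }
  ]
}
```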
During operation, the service registers itself with an ELB and uses a healthcheck to confirm that the container is actually ready to go. We point a local Route53 entry at the ELB, so that services can talk to each other and simply reference via DNS.
The setup is nice because we don’t need any service discovery. The local DNS does all the bookkeeping for us.
ECS runs all the services and we get free cloudwatch metrics from the ELBs. It’s been a lot simpler than having to register services with a centralized authority at boot-time. And the best part is that we don’t have to deal with state conflicts ourselves.
Where Docker and ECS describe how to run each of our services, Terraform is the glue that holds them together. At a high level, it’s a set of provisioning scripts that create and update our infrastructure. You can think of it like a version of Cloudformation on steroids–but it doesn’t make you want to poke your eyes out.
Rather than running a set of servers for maintaining state, there’s just a set of scripts that describe the cluster. Configuration is run locally (and in the future, via CI) and committed to git, so we have a continuous record of what our production infrastructure actually looks like.
Here’s a sample of our Terraform module for setting up our bastion nodes. It creates all the security groups, instances, and AMIs, so that we’re able to easily set up new jump points for future environments.
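The sample itself isn't reproduced here, but invoking such a module looks roughly like this (module path and variables are illustrative, written in the 0.6-era interpolation syntax):

```hcl
module "bastion" {
  source      = "./modules/bastion"
  environment = "prod"
  vpc_id      = "${var.vpc_id}"
  key_name    = "${var.key_name}"
}
```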
We use the same module in both stage and prod to set up our individual bastions. The only thing we need to switch out are the IAM keys, and we’re ready to go.
Making changes is also painless. Instead of always tearing down the entire infrastructure, Terraform will make updates where it can.
When we wanted to change our ELB draining timeout to 60 seconds, it took a simple find/replace followed by a
terraform apply. And voilà, two minutes later we had a fully altered production setup for all of our ELBs.
It’s reproducible, auditable, and self-documenting. No black boxes here.
We’ve put all the config in a central infrastructure repo, so it’s easy to discover how a given service is set up.
We haven’t quite reached the holy grail yet though. We’d like to convert more of our Terraform config to take advantage of modules so that individual files can be combined and reduce the amount of shared boilerplate.
Along the way we found a few gotchas around the .tfstate, since Terraform always first reads from the existing infrastructure and complains if the state gets out of sync. We ended up just committing our .tfstate to the repo, and pushing it after making any changes, but we’re looking into Atlas or applying via CI to solve that problem.
By this point, we had our infrastructure, our provisioning, and our isolation. The last things left were metrics and monitoring to keep track of everything running in production.
In our new environment, we’ve switched all of our metrics and monitoring over to Datadog, and it’s been fantastic.
We’ve been incredibly happy with Datadog’s UI, API, and complete integration with AWS, but getting the most out of the tool comes from a few key pieces of setup.
The first thing we did was integrate with AWS and CloudTrail. It gives a 10,000-foot view of what’s going on in each of our environments. Since we’re integrating with ECS, the Datadog feed updates every time a task definition updates, so we end up getting notifications for deploys for free. Searching the feed is surprisingly snappy, and makes it easy to trace down the last time a service was deployed or rescheduled.
Next, we made sure to add the Datadog agent as a container to our base AMI (datadog/docker-dd-agent). It not only gathers metrics from the host (CPU, memory, etc.) but also acts as a sink for our statsd metrics. Each of our services collects custom metrics on queries, latencies, and errors so that we can explore and alert on them in Datadog. Our go toolkit (soon to be open sourced) automatically collects the output of
pprof on a ticker and sends it as well, so we can monitor memory and goroutines.
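To give a feel for the statsd side of that setup, here is a minimal sketch of how a service might emit a custom metric to the local agent's statsd listener on UDP port 8125. The metric names and tags are illustrative, and this uses a raw socket rather than any particular client library:

```python
import socket

# The Datadog agent container listens for statsd datagrams on localhost:8125.
STATSD_ADDR = ("127.0.0.1", 8125)

def format_metric(name, value, kind, tags=None):
    """Build a DogStatsD-style payload, e.g. 'api.errors:1|c|#service:api'."""
    payload = "%s:%s|%s" % (name, value, kind)
    if tags:
        payload += "|#" + ",".join(tags)
    return payload

def send_metric(name, value, kind, tags=None):
    """Fire-and-forget over UDP, so a down agent never blocks the service."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_metric(name, value, kind, tags).encode("utf-8"), STATSD_ADDR)
    finally:
        sock.close()

# e.g. send_metric("api.errors", 1, "c", tags=["service:api"])
```

Because the transport is UDP, instrumentation stays cheap and failure-isolated: the service never waits on the agent.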
What’s even cooler is that the agent can visualize instance utilization across hosts in the environment, so we can get a high level overview of instances or clusters which might be having issues:
Additionally, my teammate Vince created a Terraform provider for Datadog, so we can completely script our alerting against the actual production configuration. Our alerts will be recorded and stay in sync with what’s running in prod.
By convention, we specify two alert levels:
warning alerts are there to let anyone currently online know that something looks suspicious, and should be triggered well in advance of any potential problems. critical alerts are reserved for ‘wake-you-up-in-the-middle-of-the-night’ problems where there’s a serious system failure.
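A hedged sketch of what an alert like that looks like through the Datadog provider (the monitor name, query, and thresholds here are illustrative, not our actual config):

```hcl
# Illustrative monitor scripted against production config -- recorded in
# source control and kept in sync with what's running in prod.
resource "datadog_monitor" "api_error_rate" {
  name    = "API error rate"
  type    = "metric alert"
  query   = "avg(last_5m):avg:api.errors{service:api} > 100"
  message = "High error rate on the API."

  thresholds {
    warning  = 50   # something looks suspicious: ping whoever is online
    critical = 100  # serious failure: wake someone up
  }
}
```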
What’s more, once we transition to Terraform modules and add the Datadog provider to our service description, then all services end up getting alerts for free. The data will be powered directly by our internal toolkit and Cloudwatch metrics.
Once we had all these pieces in place, the day had finally come to make the switch.
We first set up a VPC peering connection between our new production environment and our legacy one–allowing us to cluster databases and replicate across the two.
Next, we pre-warmed the ELBs in the new environment to make sure that they could handle the load. Amazon won’t automatically provision ELBs at a given size, so we had to ask them to pre-warm ours ahead of time (or slowly ramp the traffic ourselves) to deal with the increased load.
From there, it was just a matter of steadily ramping up traffic from our old environment to our new one using weighted Route53 routes, and continuously monitoring that everything looked good.
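The weighted cutover can be sketched in Terraform as two records for the same name (the domain, ELB hostnames, and weights below are illustrative): traffic shifts as the weights are adjusted and re-applied.

```hcl
# Illustrative weighted Route53 records for a gradual cutover.
resource "aws_route53_record" "api_legacy" {
  zone_id        = "${var.zone_id}"
  name           = "api.example.com"
  type           = "CNAME"
  ttl            = "60"
  weight         = 90    # steadily ramped down...
  set_identifier = "legacy"
  records        = ["legacy-elb.us-east-1.elb.amazonaws.com"]
}

resource "aws_route53_record" "api_new" {
  zone_id        = "${var.zone_id}"
  name           = "api.example.com"
  type           = "CNAME"
  ttl            = "60"
  weight         = 10    # ...while this one ramps up
  set_identifier = "new"
  records        = ["new-elb.us-east-1.elb.amazonaws.com"]
}
```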
Today, our API is humming along, handling thousands of requests per second and running entirely inside Docker containers.
But we’re not done yet. We’re still fine-tuning our service creation, and reducing the boilerplate so that anyone on the team can easily build services with proper monitoring and alerting. And we’d like to improve our tooling around working with containers, since services are no longer tied to instances.
We also plan to keep an eye on promising tech for this space. The Convox team is building awesome tooling around AWS infrastructure. Kubernetes, Mesosphere, Nomad, and Fleet seemed like incredibly cool schedulers, though we liked the simplicity and integration of ECS. It’s going to be exciting to see how they all shake out, and we’ll keep following them to see what we can adopt.
After all of these orchestration changes, we believe more strongly than ever in outsourcing our infrastructure to AWS. They’ve changed the game by productizing a lot of core services, while maintaining an incredibly competitive price point. It’s creating a new breed of startups that can build products efficiently and cheaply while spending less time on maintenance. And we’re bullish on the tools that will be built atop their ecosystem.
Calvin French-Owen, Chris Sperandio on August 4th 2015
It wasn’t long ago that building out an analytics pipeline took serious engineering chops. Buying racks and drives, scaling to thousands of requests a second, running ETL jobs, cleaning the data, etc. A team of data engineers could easily spend months on it.
But these days, it’s getting easier and cheaper. We’re seeing the UNIX-ification of hosted services: each one designed to do one thing and do it well. And they’re all composable.
It made us wonder: just how quickly could a single person build their own data analytics pipeline, with scalability in mind, without having to worry about maintaining it? An entirely managed data processing stream?
It sounded like analytics Zen. So we set out to find our inner joins peace armed with just a few tools: Terraform, Segment, DynamoDB and Lambda.
Everything can be stood up by running a single make command (check it out on Github).
An analytics pipeline is a series of tools for data collection, processing, and analysis that businesses use to better understand their customers and assess their business performance. Analytics pipelines collect data from different sources, classify it, and send it to the proper storage facility (like a data warehouse) where it’s made accessible to analytics tools.
As a toy example, our data pipeline takes events uploaded to S3 and increments a time-series count for each event in Dynamo. It’s the simplest rollup we could possibly do to answer questions like: “How many purchases did I get in the past hour?” or “Are my signups increasing month over month?”
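The rollup itself can be modeled in a few lines: each event is bucketed by name and hour, and the count for that <event, hour> pair is incremented. A minimal local sketch (the event shape here is an assumption for illustration):

```python
from collections import Counter
from datetime import datetime

def hour_bucket(timestamp):
    """Truncate an ISO-8601 timestamp to its hour, e.g. '2015-08-04T12:00:00'."""
    dt = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S")
    return dt.replace(minute=0, second=0).isoformat()

def rollup(events):
    """Count each <event, hour> pair -- the same shape we'd store in Dynamo."""
    counts = Counter()
    for event in events:
        counts[(event["event"], hour_bucket(event["timestamp"]))] += 1
    return counts

# e.g. rollup([{"event": "Signed Up", "timestamp": "2015-08-04T12:31:08"}])
```

The same keys answer both example questions: sum an event's buckets over the last hour, or over each month, and compare.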
Here’s the general dataflow:
Event data enters your S3 bucket through Segment’s integration. The integration uploads all the analytics data sent to the Segment API on an hourly basis.
That’s where the composability of AWS comes in. S3 has a little-known feature called “Event Notifications”. We can configure the bucket to push a notification to a Lambda function on every file upload.
In theory, our Lambda function could do practically anything once it gets a file. In our example, we’ll extract the individual events, and then increment each
<event, hour> pair in Dynamo.
Once our function is up and running, we’ll have a very rudimentary timeseries database to query event counts over time.
From there, we’ll handle the incoming events, and update each item in Dynamo:
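As a sketch of what that handler might look like, assuming the uploaded S3 objects contain newline-delimited JSON events (the table name and key schema below are illustrative, not the exact production setup):

```python
import json

def parse_events(body):
    """Split a newline-delimited JSON file into event dicts."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]

def hour_of(timestamp):
    # '2015-08-04T12:31:08' -> '2015-08-04T12' -- the hour bucket for the count
    return timestamp[:13]

def handler(event, context):
    """Invoked by the S3 event notification on every file upload."""
    import boto3  # available by default in the Lambda runtime
    s3 = boto3.resource("s3")
    table = boto3.resource("dynamodb").Table("event-counts")  # illustrative name
    for record in event["Records"]:
        obj = s3.Object(record["s3"]["bucket"]["name"],
                        record["s3"]["object"]["key"])
        for e in parse_events(obj.get()["Body"].read().decode("utf-8")):
            # Atomically increment the <event, hour> counter in Dynamo.
            table.update_item(
                Key={"event": e["event"], "hour": hour_of(e["timestamp"])},
                UpdateExpression="ADD #c :one",
                ExpressionAttributeNames={"#c": "count"},
                ExpressionAttributeValues={":one": 1},
            )
```

The `ADD` update expression is what makes this safe: Dynamo increments server-side, so concurrent Lambda invocations don't clobber each other's counts.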
And finally, we can query for the events in our database using the CLI:
We could also build dashboards on it à la Google Analytics or Geckoboard.
Even though we have our architecture and lambda function written, there’s still the task of having to describe and provision the pipeline on AWS.
Configuring these types of resources has been kind of a pain for us in the past. We’ve tried Cloudformation templates (verbose) and manually creating all the resources through the AWS UI (slooooow).
Neither of these options has been much fun for us, so we’ve started using Terraform as an alternative.
If you haven’t heard of Terraform, definitely give it a spin. It’s a badass tool designed to help provision and describe your infrastructure. It uses a much simpler configuration syntax than Cloudformation, and is far less error-prone than using the AWS Console UI.
As a taste, here’s what our
lambda.tf file looks like to provision our Lambda function:
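A hedged sketch of the shape of such a file (the resource names, bucket name, handler path, and runtime here are placeholders, not the exact production file):

```hcl
# Illustrative lambda.tf: the function, the bucket, and the event wiring.
resource "aws_lambda_function" "process_events" {
  function_name = "process-analytics-events"
  filename      = "lambda.zip"
  runtime       = "nodejs"
  handler       = "index.handler"
  role          = "${aws_iam_role.lambda.arn}"
}

resource "aws_s3_bucket" "events" {
  bucket = "example-event-archive"  # illustrative name
}

# The S3 "Event Notification": every object upload invokes the function.
resource "aws_s3_bucket_notification" "on_upload" {
  bucket = "${aws_s3_bucket.events.id}"

  lambda_function {
    lambda_function_arn = "${aws_lambda_function.process_events.arn}"
    events              = ["s3:ObjectCreated:*"]
  }
}
```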
The Terraform plan for this project creates an S3 bucket, the Lambda function, the necessary IAM roles and policies, and the Dynamo database we’ll use for storing the data. It runs in under 10 seconds and immediately sets up our infrastructure so that everything is working properly.
If we ever want to make a change, a simple terraform apply will cause the changes to update in our production environment. At Segment, we commit our configurations to source control so that they are easily audited and changelog’d.
We just walked through a basic example, but with Lambda there’s really no limit to what your functions might do. You could publish the events to Kinesis with additional Lambda handlers for further processing, or pull in extra fields from your database. The possibilities are pretty much endless thanks to the APIs Amazon has created.
If you’d like to build your own pipeline, just clone or fork our example Github repo.
And that’s it! We’ll keep streaming. You keep dreaming.
Orta Therox on June 24th 2015
We’re excited to welcome Orta from CocoaPods to the blog to discuss the new Stats feature! We’re big fans of CocoaPods and are excited to help support the project.
CocoaPods is the Dependency Manager for iOS and Mac projects. It works similarly to npm, rubygems, gradle or nuget. We’ve been running the open source project for 5 years, and we’ve tried to keep the web infrastructure as minimal as possible.
Our users have been asking for years about getting feedback on how many downloads their libraries have received. We’ve been thinking about the problem for a while, and finally ended up asking Segment if they would sponsor a backend for the project.
It wasn’t enough to offer just download counts. We spend a lot of time working around the intricacies of Xcode (Apple’s developer tool) project files; in this context, however, those intricacies provide the foundation for a really nice feature. CocoaPods Stats will be able to keep track of the unique number of installs within Apps / Watch Apps / Extensions / Unit Tests.
This means that developers using continuous integration only register as 1 install, even if the server runs
pod install each time, separating total installations from actual downloads.
Let’s go over how we check which pods get sent up for analytics, and how we do the unique installs. CocoaPods-Stats is a plugin that will be bundled with CocoaPods within a version or two. It registers as a post-install plugin and runs on every
pod install or pod update.
We’re very pessimistic about sending a Pod up to our stats server. We ensure that you have a CocoaPods/Specs repo set up as your master repo, then ensure that each pod to be sent is inside that repo before accepting it as a public domain pod.
First up, we don’t want to know anything about your app. So in order to know unique targets we use your project’s target UUID as an identifier. These are a hash of your MAC address, Xcode’s process id and the time of target creation (but we only know the UUID/hash, so your MAC address is unknown to us). These UUIDs never change in a project’s lifetime (contrary to, for example, the bundle identifier). We double hash it just to be super safe.
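The double hashing is straightforward to sketch. The hash function below is illustrative (the post doesn't specify which one is used); the key property is that only a hash-of-a-hash of the UUID ever leaves the machine:

```python
import hashlib

def anonymized_target_id(target_uuid):
    """Double-hash an Xcode target UUID before sending it to the stats server.

    SHA-256 here is an assumption for illustration. The identifier is stable
    for the project's lifetime (the UUID never changes), yet reveals nothing
    about the MAC address folded into the original UUID.
    """
    first = hashlib.sha256(target_uuid.encode("utf-8")).hexdigest()
    return hashlib.sha256(first.encode("utf-8")).hexdigest()

# The same target always produces the same identifier, which is what makes
# per-target unique-install counting possible without identifying the app.
```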
We then also send along the CocoaPods version that was used to generate this installation and about whether this installation is a
pod try [pod] rather than a real install.
My first attempt at a stats architecture was based on how npm does stats: roughly speaking, they send all logs to S3, where they are map-reduced on a daily basis into individual package metrics. This is an elegant solution for a company with people working full time on uptime and stability. As someone who wants to be building iOS apps, and not maintaining more infrastructure in my spare time, I wanted to avoid this.
We use Segment at Artsy, where I work, and our analytics team had really good things to say about Segment’s Redshift infrastructure. So I reached out about having Segment host the stats infrastructure for CocoaPods.
We were offered a lot of great advice around the data modelling, and were up and running really quickly. You already know about the CocoaPods plugin; from there, it sends your anonymous Pod stats up to stats.cocoapods.org, which acts as a conduit sending analytics events to Segment. A daily task triggered on the website makes SQL requests against the Redshift instance, and the results are imported into metrics.cocoapods.org.
If you want to learn more about CocoaPods, check us out here.