Every month, Segment collects, transforms and routes over 50 billion API calls to hundreds of different business-critical applications. We’ve come a long way from the early days, where my co-founders and I were running just a handful of instances.
Today, we have a much deeper understanding of the problems we’re solving, and we’ve learned a ton. To keep moving quickly and avoid past mistakes, our team has started developing a list of engineering best practices.
Now that a lot of these “pro tips” have been tested and deployed and are running in production, we wanted to share them with you. It’s worth noting that we’re standing on the shoulders of giants here: thanks go to The Zen of Python, Hints for Computer System Design, and the Twelve-Factor App for the inspiration.
Editor’s Note: This post is based on an internal wiki page of Segment “Pro Tips”. There are more tips recorded there, but we chose a handful that seemed most broadly applicable. They’re written as fact, but internally we treat them as guidelines, always weighing other trade-offs within the organization. Each practice is followed by a few bullet points underscoring the main takeaways.
1. It’s easier to combine than to split apart.
When we first started out, we had one massive repo. Every module was filled with tightly coupled dependencies and was completely unversioned. Changing a single API required changing code globally. Developing with more than a handful of people would’ve been a nightmare.
So one of our first changes as the engineering team grew was splitting out the modules into separate repos (thanks TJ!). It was a massive task but it had huge payoff by making development with a larger team actually sane. Unfortunately, it was way harder than it should have been because we lumped everything together at the start.
It turns out this temptation to combine happens everywhere: in services, libraries, repos and tools. It’s so easy to add (just) one more feature to an existing codebase. But it has a long-term cost. Separation of concerns is the exact reason why UNIX-style systems are so successful; they give you the tools to compose many small building blocks into more complex programs.
structure code so that it’s easy to be split (or split from the beginning)
if a service or library doesn’t share concerns with existing ones, create a new one rather than shoe-horning it into an existing piece of code
libraries which perform a single function are much easier to test, document and understand
keep uptime, resource consumption and monitoring in mind when combining read/write concerns of a service
prefer libraries to frameworks, composing them together where possible
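As a rough illustration of composing small building blocks, here is a minimal Python sketch. The names (`compose`, `parse_event`, `drop_empty_fields`, `to_json`) and the `key=value;key=value` input format are hypothetical, chosen only for the example. Each function does one thing and can be tested on its own; the pipeline is nothing more than their composition.

```python
import json
from functools import reduce


def compose(*steps):
    """Chain single-purpose functions left to right into one callable."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)


def parse_event(raw):
    """Turn a 'key=value;key=value' string into a dict."""
    return dict(pair.split("=", 1) for pair in raw.split(";") if pair)


def drop_empty_fields(event):
    """Remove keys whose value is an empty string."""
    return {key: value for key, value in event.items() if value != ""}


def to_json(event):
    """Serialize with sorted keys so the output is stable."""
    return json.dumps(event, sort_keys=True)


# Each step is usable alone; the pipeline is just their composition.
normalize = compose(parse_event, drop_empty_fields, to_json)
```

Swapping a step out, or reusing `parse_event` somewhere else, does not require touching the other two.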
2. Explicit is better than implicit.
“Clever” code usually means “complicated” code. It’s hard to search for, and it makes bugs tough to track down. We prefer simple code that’s explicit in its purpose rather than trying to create a magical API that relies on convention (Go’s lack of “magic” is actually one of our favorite things about it).
As part of being explicit, always consider the “grep-ability” of your code. Imagine that you’re trying to find out where the implementation for the `post` method lives. Which is easier to find in a codebase?
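To make the comparison concrete, here is a hypothetical Python sketch (not the snippet from our codebase; the class and helper names are invented for illustration). Both clients behave the same, but only one of them can be found by searching for `def post`.

```python
def _make_method(verb):
    """Build a request method for one HTTP verb."""
    def method(self, path):
        return f"{verb.upper()} {path}"
    return method


class ImplicitClient:
    """Methods are generated in a loop; searching for 'def post' finds nothing."""


for verb in ("get", "post", "put"):
    setattr(ImplicitClient, verb, _make_method(verb))


class ExplicitClient:
    """Every method is spelled out, so 'def post' leads straight here."""

    def get(self, path):
        return f"GET {path}"

    def post(self, path):
        return f"POST {path}"

    def put(self, path):
        return f"PUT {path}"
```

The explicit version is a few lines longer, but anyone can locate and read the `post` implementation without first understanding the generation loop.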
Where possible, write code that is short, straightforward and easy to understand. Often that will come down to single functions that are easy to test and easy to document. Even libraries can perform just a single function and then be combined for more powerful functionality.
With comments, describe the “why” versus the typical “what” for a given process or routine. If a routine seems out of place but is necessary, it’s sometimes worth leaving a quick note as to why it exists at all.
avoid generating code dynamically or being overly ‘clever’ to shorten the line count
aim for functions that are <7 lines and <2 nested callbacks
3. It doesn’t ship without metrics and tests.
Running code in production without metrics or alerting is flying blind. This has bitten our team more times than I’d care to admit, so we’ve increased our test coverage and monitoring extensively. Every time a user encounters a bug before we do, it damages their trust in us as a company. And that sucks.
Trust in our product is perhaps the most valuable asset we have as a company. Losing that is almost completely irrecoverable; it’s the way we lose as a business. Our brand is built around data, and reliability is paramount to our success.
write test cases first to check for the broken behavior, then write the fix
all top-level apps should ship with metrics and monitoring
create ‘warning’ alerts for when an internal system is acting up, ‘critical’ ones when it starts affecting end customers
try to keep unrealistic failure scenarios in mind when designing the alerts
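The warning/critical split can be sketched as a small classifier. This is only an illustration under assumed inputs: the `Reading` fields and both thresholds are hypothetical, and real values depend on the system being watched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    queue_depth: int            # internal signal: events waiting to be processed
    customer_error_rate: float  # fraction of customer requests that failed


# Hypothetical thresholds, chosen only for the example.
QUEUE_DEPTH_WARNING = 10_000
CUSTOMER_ERROR_RATE_CRITICAL = 0.01


def classify(reading):
    """Return 'critical', 'warning' or 'ok' for one metrics reading."""
    # Customer impact outranks internal trouble, so check it first.
    if reading.customer_error_rate >= CUSTOMER_ERROR_RATE_CRITICAL:
        return "critical"
    if reading.queue_depth >= QUEUE_DEPTH_WARNING:
        return "warning"
    return "ok"
```

Keeping the rule in one small function also makes it easy to write the test first when an alert misfires, then fix the threshold.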
4. Cut scope aggressively.
When building a product, there are three aspects you can optimize: speed, quality, and scope. The catch is that you can’t ever juggle all three simultaneously. Sacrificing quality by adding hacky fixes increases the amount of technical debt. It slows us down over the long term, and we risk losing customer trust in the product. Not to mention, hacks are a giant pain to work on later.
At the same time, we can’t sacrifice speed either; that’s our main advantage as a startup. Long-running projects tend to drag on, use up a ton of resources and have no clearly defined “end.” By the time a monolithic project is finally ready to launch, releasing the finished product to customers becomes a daunting process.
When push comes to shove, it’s usually best to cut scope. It allows us to split shipments into smaller, more manageable chunks, and really focus on making each one great.
evaluate features for their benefit versus their effort
identify features that could be easily layered in later
cut features that create obvious technical debt
5. Maintain a single code path.
Separate code paths almost always become out of sync. One will get updated while another doesn’t, which makes for inconsistent behavior. At the architecture level, we want to try and optimize for a single code path.
Note that this is still consistent with splitting things apart; it just means that we need smaller pieces which are flexible enough to be combined in different ways. If two pieces of code rely on the same functionality, they should use the same code path.
have a peer review your code; an objective opinion will almost always help
get someone else to sign-off on non-trivial pull-requests
if you ever find yourself copy-pasting code, consider pulling it into a library
if you need to frequently update a library, or keep state around, turn it into a service
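A minimal sketch of what a single code path looks like in practice, using hypothetical handlers: both the signup and login paths call the same `normalize_email` function instead of each carrying its own copy of the logic.

```python
def normalize_email(raw):
    """The one place that decides what a canonical email looks like."""
    return raw.strip().lower()


def handle_signup(form):
    """Signup path: stores the canonical email."""
    return {"action": "signup", "email": normalize_email(form["email"])}


def handle_login(form):
    """Login path: looks up by the same canonical email."""
    return {"action": "login", "email": normalize_email(form["email"])}
```

If the normalization rule ever changes, it changes for both paths at once, so the two cannot drift out of sync.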
6. Create rapid prototypes.
Creating a loose mockup of a program is often the quickest way to understand the problem you’re solving. When you’re ready to write the real thing, just `rm -fr .git` to start with a clean slate and better context.
Building something helps you learn more than you could ever hope to uncover through theorizing. Trust me, prototyping helps discover strange edge-cases and bottlenecks which may require you to rearchitect the solution. This process minimizes the impact of architectural changes.
don’t spend a lot of time on commit messages; keep them short but sensible
refactors typically come from a better understanding of the problem, the best way to get there is by building a version to “throw away”
7. Know when to automate.
Early on, it’s easy to write off automation as unimportant. But if you’ve done any time-consuming task more than three times, you’ll probably want to automate it.
A key example of where we failed at this in the past was with Redshift’s cluster management. Investing in the tooling around provisioning clusters was a big endeavor, but it would have saved a ton of time if we’d started it sooner.
if you find yourself repeatedly spending more than a few minutes on a task, take a step back and consider tooling around it
ask yourself if you could be 20% more efficient, or if automation would help
share tools in dotfiles, vm, or task runner so the whole team can use them
8. Aim to open source.
Whenever you’re building out a new project or library, it’s worth considering which pieces can be pulled out and open sourced. At face value, it sounds like an extra constraint that doesn’t help ship product. But in practice, it actually creates much cleaner code. We’re guaranteed that the code’s API isn’t tightly coupled to anything we’re building internally, and that it’s more easily re-used across projects.
Open sourced code typically has a well-documented README, tests, and CI, and it more closely resembles the rest of the ecosystem. It’s a good sanity check that we’re not doing anything too weird internally, and the code is easier to forget about and revisit six months later.
if you build a general purpose library without any dependencies, it’s a prime target for open sourcing
try and de-couple code so that it can be used standalone with a clear interface
never include custom configuration in a library, allow it to be passed in with sane defaults
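The last takeaway can be sketched as follows. The `ClientConfig` and `Client` names, the fields, and the default values are all hypothetical; the point is only that the library reads nothing from its surroundings and every setting can be overridden by the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientConfig:
    """All settings are explicit fields with defaults; nothing is read
    from the environment or from company-specific files."""
    host: str = "localhost"
    port: int = 8080
    timeout_seconds: float = 5.0


class Client:
    def __init__(self, config=None):
        # Fall back to the defaults when the caller passes nothing.
        self.config = config if config is not None else ClientConfig()

    def base_url(self):
        return f"http://{self.config.host}:{self.config.port}"
```

Calling `Client()` works out of the box, while `Client(ClientConfig(host="api.internal", port=9000))` lets an application supply its own values without the library knowing anything about that application.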
9. Solve the root cause.
Sometimes big problems arise in code and it may seem easier to write a work-around. Don’t do that. Hacking around the outskirts of a problem is only going to create a rat’s nest that will become an even bigger problem in the future. Tackle the root cause head-on.
A textbook example of this came from the first version of our integrations product. We proxied and transformed analytics calls through our servers to 30–40 different services, depending on what integrations the customer had enabled. On the backend, we had a single pool of integration workers that would read each incoming event from the queue, look up which settings were enabled, and then send copies of the event to each enabled integration.
It worked great for the first year, but over time we started running into more and more problems. Because the workers were all shared, a single slow endpoint would grind the entire pool of workers to a halt. We kept adjusting and tweaking individual timeouts to no end, but the backlogs kept occurring. Since then, we’ve fixed the underlying issue by partitioning the data processing queues by endpoint so they operate completely independently. It was a large project, but one that had immediate pay-off, allowing us to scale our integrations platform.
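The difference between the two designs can be shown with a toy simulation. This is a simplified model, not our production code: delivery costs are made-up "ticks", the shared pool is a single FIFO worker, and in the partitioned version each endpoint is assumed to get its own worker with its own tick budget.

```python
from collections import defaultdict, deque

# Hypothetical cost, in ticks, of delivering one event to each endpoint.
COST = {"fast": 1, "slow": 10}


def drain(queue, budget):
    """Deliver events in FIFO order until the next one no longer fits the budget."""
    delivered = []
    while queue and COST[queue[0]] <= budget:
        endpoint = queue.popleft()
        budget -= COST[endpoint]
        delivered.append(endpoint)
    return delivered


def shared_pool(events, budget):
    """One queue for everything: a slow endpoint delays every event behind it."""
    return drain(deque(events), budget)


def partitioned(events, budget):
    """One queue per endpoint, each drained independently with its own budget."""
    queues = defaultdict(deque)
    for endpoint in events:
        queues[endpoint].append(endpoint)
    return {endpoint: drain(queue, budget) for endpoint, queue in queues.items()}


events = ["slow", "fast", "slow", "fast", "slow", "fast"]
```

With a budget of 20 ticks, the shared pool delivers only one of the three `fast` events before it stalls on a slow one, while the partitioned version delivers all three regardless of how the slow endpoint is doing.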
Sometimes it’s worth taking a step back to solve the root cause or upstream problem rather than hacking around the periphery. Even if it requires a more significant restructuring, it can save you a lot of time and headache down the road, allowing you to achieve much greater scale.
whenever fixing a bug or infrastructure issue, ask yourself whether it’s a core fix or just a band-aid over one of the symptoms
keep tabs on where you’re spending the most time; if code is continually being tweaked, it probably needs a bigger overhaul
if there’s some bug or alert we didn’t catch, make sure the upstream cause is being monitored
10. Design models by concern.
When designing applications, coming up with a data model is one of the trickiest parts of implementation. The frontend, naturally, wants to match the user’s idea of how the data is formatted. Out of necessity, the backend has to match the actual data format. It must be stored in a way that is fast, performant and flexible.
So when starting with a new design, it’s best to first look at the user requirements and ask “which goals do we want to meet?” Then, look at the data we already have (or decide what new data you need) and figure out how it should be combined.
The frontend models should match the user’s idea of the data. We don’t want to have to change the data model every time we change the UI. It should remain the same, regardless of how the interface changes.
The service and backend models should allow for a flexible API from the programmer’s perspective, in a way that’s fast and efficient. It should be easy to combine individual services to build bigger pieces of functionality.
The controllers are the translation layer, tying together individual services into a format which makes sense to the frontend code. If there’s a piece of complicated logic which makes sense to be re-used, then it should be split into its own service.
the frontend models should match the user’s conception of the data
the services need to map to a data model that is performant and flexible
controllers can map between services and the frontend to assemble data
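Here is a minimal sketch of the three layers, assuming a hypothetical account page. The record types, the billing service, and the plan labels are invented for illustration: backend models are shaped for storage, the frontend model is shaped for the user, and the controller translates between them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserRecord:
    """Backend model: shaped for storage."""
    user_id: int
    first_name: str
    last_name: str


@dataclass(frozen=True)
class PlanRecord:
    """Backend model owned by a separate (hypothetical) billing service."""
    user_id: int
    plan_code: str


@dataclass(frozen=True)
class AccountView:
    """Frontend model: shaped the way the user thinks about their account."""
    display_name: str
    plan_label: str


# Hypothetical mapping from stored plan codes to user-facing labels.
PLAN_LABELS = {"free": "Free", "biz": "Business"}


def account_controller(user, plan):
    """Translate backend records into the view the UI renders."""
    return AccountView(
        display_name=f"{user.first_name} {user.last_name}",
        plan_label=PLAN_LABELS.get(plan.plan_code, "Unknown"),
    )
```

Renaming a plan in the UI only touches `PLAN_LABELS` and the view; the stored records stay exactly as they were.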
Our engineering best practices, in practice.
It’s easy to talk at length about best practices, but actually following them requires discipline. Sometimes it’s tempting to cut corners or skip a step, but that doesn’t help in the long term.
Now that we’ve codified these engineering best practices and the rationale behind each one, they have made their way into our default mode of operation. The act of explicitly writing them down has both clarified our thinking and helped us avoid making the same short-term mistakes over and over.
In practice, this means that we invest heavily in good tooling, modular libraries and microservices. In development, we keep a shared VM that auto-updates, with shared dotfiles for easily navigating our many small repositories. We put a focus on creating projects which increase functionality through composability rather than inheritance. And we’ve worked hard to streamline our process for running services in production.
All of this keeps our development team moving quickly and increases the quality of the product we ship. We’re able to accomplish a lot more with a lot less effort. And we’ll continue trying to improve and share that tooling with the community as it matures.