We’ve been running Node in production for a little over two years now, scaling from a trickle of 30 requests per second up to thousands today. We’ve been hit with almost every kind of weird request pattern under the sun.
First there was the customer who liked to batch their data into a single dump every Friday night (getting called on a Friday night used to be a good thing). Then the user who sent us their visitor’s entire social graph with every request. And finally an early customer who hit us with a `while(true) send(data)` loop and caused a minor emergency.
By now, our ops team has seen the good, the bad, and the ugly of Node. Here’s what we’ve learned.
Beware the event loop
One of the great things about Node is that you don’t have to worry about threading and locking. Since everything runs on a single thread, the state of the world is incredibly simple. At any given time there’s only a single running code block.
But here… there be dragons.
Our API ingests tons of small pieces of customer data. When data comes in, we want any ISO date strings in the JSON to be represented as real dates, so we traverse the JSON we receive and convert any date strings into native `Date` objects. As long as the total size is under 15kb, we pass it through our system.
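To make the traversal concrete, here is a minimal sketch of that kind of date revival. It is not our production code: the `reviveDates` name and the `ISO_DATE` pattern are assumptions made for illustration, and the pattern only accepts full timestamps with a timezone.

```javascript
// Hypothetical pattern: full ISO 8601 timestamps such as 2015-03-10T12:34:56.000Z.
const ISO_DATE = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$/;

// Recursively walk a parsed JSON value and return a copy in which
// every ISO date string has been replaced by a native Date.
function reviveDates(value) {
  if (typeof value === 'string') {
    return ISO_DATE.test(value) ? new Date(value) : value;
  }
  if (Array.isArray(value)) {
    return value.map((item) => reviveDates(item));
  }
  if (value !== null && typeof value === 'object') {
    const out = {};
    for (const key of Object.keys(value)) {
      out[key] = reviveDates(value[key]);
    }
    return out;
  }
  return value; // numbers, booleans, null
}

const event = reviveDates({
  timestamp: '2015-03-10T12:34:56.000Z',
  properties: { tags: ['a', 'b'] }
});
console.log(event.timestamp instanceof Date);
```

Note that the whole walk is synchronous: every nested object and array is visited in one uninterrupted run, so nothing else on the event loop gets a turn until it finishes.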
It seemed innocent enough… until we’d get a massively nested JSON blob and we’d start traversing. It’d take seconds, even minutes, before we chewed through all the queued up function calls. Here’s what the times and sizes would look like after an initial large batch would get rejected:
And then things would only get worse: the problems would start cascading. Our API servers would start failing healthchecks and disconnect from the ELB. The lack of heartbeat would cause the NSQ connection to disconnect so we weren’t actually publishing messages. Our customer’s clients would start retrying, and we’d be hit with a massive flood of requests. Not. Good.
Clearly something had to be done–we had to find out where the blockage was happening and then limit it.
Now we use node-blocked to get visibility into whether our production apps are blocking on the event loop, like this errant worker:
It’s a simple module which checks when the event loop is getting blocked and calls you when it happens. We hooked it up to our logging and statsd monitoring so we can get alerted when a serious blockage occurs.
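The general technique behind this kind of check is simple enough to sketch. The snippet below is not the node-blocked source, and `watchEventLoop`, `intervalMs`, and `thresholdMs` are hypothetical names: it schedules a repeating timer and reports whenever the timer fires noticeably later than it should have, which only happens if something hogged the event loop.

```javascript
// Call onBlocked(lagMs) whenever the timer fires more than
// thresholdMs later than scheduled.
function watchEventLoop(onBlocked, { intervalMs = 100, thresholdMs = 50 } = {}) {
  let last = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    const lag = now - last - intervalMs; // how late this tick was
    last = now;
    if (lag > thresholdMs) onBlocked(lag);
  }, intervalMs);
  timer.unref(); // monitoring alone should not keep the process alive
  return timer;
}

// In a real service the callback would feed logging and metrics;
// console.warn stands in for that here.
const monitor = watchEventLoop((ms) => {
  console.warn('event loop blocked for %dms', ms);
});
```

Calling `clearInterval(monitor)` stops the check.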
We dropped in the module and immediately started seeing the following in our logs:
A customer was sending us really large batches of nested JSON. Applying a few stricter limits to our API (this was back before we had limits) and moving the processing to a background worker fixed the problem for good.
To further avoid event loop problems entirely, we’ve started switching more of our data processing services to Go and using goroutines, but that’s a topic for an upcoming post!
Exceptions: the silent noisy killer
Error handling is tricky in every language–and Node is no exception. Plenty of times, there will be an uncaught exception which–through no fault of your own–bubbles up and kills the whole process.
There are multiple ways around this using the `vm` module or domains. We haven’t perfected error handling, but here’s our take.
Simple exceptions should be caught using a linter. There’s no reason to have bugs for `undefined` vars when they could be caught with some basic automation.
To make that super easy, we started adding make-lint to all of our projects. It catches unhandled errors and undefined variables before they even get pushed to production. Then our makefiles run the linter as the first target of `make test`.
If you’re not already catching exceptions in development, add make-lint today and save yourself a ton of hassle. We tried to make the defaults sane so that it shouldn’t hamper your coding style but still catch errors.
In prod, things get trickier. Connections across the internet fail way more often. The most important thing is that we know when and where uncaught exceptions are happening, which is often easier said than done.
Fortunately, Node has a global `uncaughtException` handler, which we use to detect when the process is totally hosed.
We ship all logs off to a separate server for collection, so we want to make sure to have enough time to log the error before the process dies. Our cleanup could use a bit more sophistication, but typically we’ll attempt to disconnect and then exit after a timeout.
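A rough sketch of that log, disconnect, then exit sequence might look like the following. The `createCrashHandler` helper and its `log`, `disconnect`, `exit`, and `timeoutMs` options are hypothetical names for illustration; only `process.on('uncaughtException', ...)` and `process.exit` are Node's own API.

```javascript
// Build an uncaughtException handler that logs the error, attempts a
// clean disconnect, and exits after a grace period so logs can flush.
function createCrashHandler({ log, disconnect, exit, timeoutMs = 1000 }) {
  return function onUncaughtException(err) {
    log({ level: 'fatal', message: err.message, stack: err.stack });
    try {
      disconnect(); // best effort: we are going down either way
    } catch (disconnectErr) {
      log({ level: 'error', message: disconnectErr.message });
    }
    // Give the log shipper a moment before the process dies.
    setTimeout(() => exit(1), timeoutMs);
  };
}

// Wiring it up once at boot. console.error and the empty disconnect
// are placeholders for a real logger and real connections.
process.on('uncaughtException', createCrashHandler({
  log: (entry) => console.error(JSON.stringify(entry)),
  disconnect: () => {},
  exit: (code) => process.exit(code)
}));
```

Injecting `log`, `disconnect`, and `exit` keeps the handler easy to exercise in tests without actually killing the test process.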
Actually serializing errors also requires some special attention (handled for us by YAL). You’ll want to make sure to include both the `message` and `stack` explicitly, since they are non-enumerable properties and will be missed by simply calling `JSON.stringify` on the error.
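To see the problem and one way around it, here is a small sketch. It is not how YAL does it; `serializeError` is a hypothetical helper written for this example.

```javascript
// Copy the non-enumerable fields by hand, then any enumerable extras
// (such as a `code` set by the networking layer).
function serializeError(err) {
  const plain = {
    name: err.name,
    message: err.message,
    stack: err.stack
  };
  for (const key of Object.keys(err)) {
    plain[key] = err[key];
  }
  return JSON.stringify(plain);
}

const err = new Error('connection reset');
err.code = 'ECONNRESET';

console.log(JSON.stringify(err)); // only the enumerable `code` survives
console.log(serializeError(err)); // name, message, stack, and code
```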
Finally, we’ve also written our own module called oh-crap to automatically dump a heap snapshot for later examination.
The snapshot is easily loaded into the Chrome Developer Tools, and it’s incredibly handy for those times we’re hunting the root cause of a crash. We just drop the module in and we’ve instantly got full access to whatever state killed our beloved workers.
Limiting concurrency
It’s easy to overload the system by setting our concurrency too high. When that happens, the CPU on the box starts pegging, and nothing is able to complete. Node doesn’t do a great job handling this case, so it’s important to know when we’re load testing just how much concurrency we can really deal with.
Our solution is to stick queues between every piece of processing. We have lots of little workers reading from NSQ and each of them sets a `maxInFlight` parameter specifying just how many messages the worker should deal with concurrently.
If we see the CPU thrashing, we’ll adjust the concurrency and reload the worker. It’s way easier to think about the concurrency once at boot rather than constantly tweaking our application code and limiting it across different pipelines.
It also means we get free visibility into where data is queueing, not to mention the ability to pause entire data processing flows if a problem occurs. It gives us much better isolation between processes and makes them easier to reason about.
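The effect of a single in-flight cap can be illustrated without a queue server. The sketch below is an in-process stand-in, not the NSQ client: `createLimitedWorker`, `push`, and `stats` are hypothetical names, and the `maxInFlight` argument only mirrors the idea of the real setting.

```javascript
// Hand at most `maxInFlight` messages to `handler` at a time.
// Everything else waits in `pending` until a slot frees up.
function createLimitedWorker(maxInFlight, handler) {
  const pending = [];
  let inFlight = 0;

  function drain() {
    while (inFlight < maxInFlight && pending.length > 0) {
      const msg = pending.shift();
      let finished = false;
      inFlight++;
      handler(msg, function done() {
        if (finished) return; // ignore double acknowledgements
        finished = true;
        inFlight--;
        drain();
      });
    }
  }

  return {
    push(msg) {
      pending.push(msg);
      drain();
    },
    stats() {
      return { inFlight, queued: pending.length };
    }
  };
}

// Simulated slow handler: each message takes 10ms to process.
const worker = createLimitedWorker(2, (msg, done) => setTimeout(done, 10));
['a', 'b', 'c', 'd'].forEach((msg) => worker.push(msg));
console.log(worker.stats()); // { inFlight: 2, queued: 2 }
```

Because the cap lives in one place, tuning it is a matter of changing a single number at boot, and the `queued` count shows where data is backing up.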
Streams and errors
We moved away from using streams for most of our modules in favor of dedicated queues. But there are a few places where they still make sense.
The biggest overall gotcha with streams is their error handling. By default, piping won’t cause streams to propagate their errors to whatever stream is next.
Take the example of a file processing pipeline which is reading some files, extracting some data and then running some transforms on it:
Looking at this code, it’s easy to miss that we haven’t actually set up our error handling properly. Sure, the resulting pipeline stream has handlers, but if any errors occur in the `Transform` streams, they’ll go uncaught.
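The pitfall is easy to reproduce with Node's built-in streams. In this sketch (a stand-in built from `PassThrough` streams, with the hypothetical name `demoUnpropagatedError`), an error in the middle stream never reaches the handler attached to the stream that `.pipe()` returns.

```javascript
const { PassThrough } = require('stream');

// Report which error handlers actually fire when the middle stream fails.
function demoUnpropagatedError(onDone) {
  const seen = [];
  const read = new PassThrough();
  const transform = new PassThrough();
  const write = new PassThrough();

  // .pipe() returns its destination, so `piped` is just `write`.
  const piped = read.pipe(transform).pipe(write);
  piped.on('error', () => seen.push('pipeline'));

  // Without this listener the error below would be thrown as uncaught.
  transform.on('error', () => seen.push('transform'));

  transform.destroy(new Error('bad record'));
  setImmediate(() => onDone(seen));
}

demoUnpropagatedError((seen) => console.log(seen));
```

Only the listener on the failing stream runs; the handler on the end of the chain stays silent.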
To get around this, we use Julian Gruber’s nifty multipipe module, which provides a nice API over centralized error handling. That way we can attach a single error handler, and be off to the races.
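For readers who want to try the single-handler idea without installing anything, newer versions of Node ship a core helper, `stream.pipeline`, that gives a similar result. To be clear, this is a swapped-in technique and not multipipe's API; `processLines` and the JSON-lines input are assumptions made for the example.

```javascript
const { pipeline, Readable, Transform, Writable } = require('stream');

// Parse an array of JSON lines; any failure anywhere in the chain
// arrives in the one callback passed to pipeline().
function processLines(lines, onDone) {
  const results = [];

  const parse = new Transform({
    objectMode: true,
    transform(line, _encoding, callback) {
      let record;
      try {
        record = JSON.parse(line);
      } catch (err) {
        return callback(err); // surfaces in the pipeline callback
      }
      callback(null, record);
    }
  });

  const collect = new Writable({
    objectMode: true,
    write(record, _encoding, callback) {
      results.push(record);
      callback();
    }
  });

  pipeline(Readable.from(lines), parse, collect, (err) => onDone(err, results));
}

processLines(['{"id":1}', 'not json'], (err) => {
  if (err) console.error('pipeline failed:', err.message);
});
```

`pipeline` also destroys the other streams in the chain when one of them fails, which plain `.pipe()` does not do.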
If you’re also running Node in production and dealing with a highly variable data pipeline, you’ve probably run into a lot of similar issues. For all these gotchas, we’ve been able to scale our Node processes pretty smoothly.
Now we’re starting to move our new plumbing and data infrastructure layers to Go. The language makes different trade-offs, but we’ve been quite impressed so far. We’ve got another post in the works on how it’s working out for us, along with a bunch of open source goodies! (and if you love distributed systems, we’re hiring!)
Have other tips for dealing with a large-scale Node system? Send them our way.
And as always: peace, love, ops, analytics.