Calvin French-Owen on June 5th 2020
This blog should not be construed as legal advice. Please discuss with your counsel what you need to do to comply with the GDPR, CCPA, and other similar laws.
Deletion – all identifying info related to the user must be properly deleted.
Suppression – the user should be able to specify where their data is used and sent (e.g. for a marketing, advertising, or product use case).
When you get a deletion request, it doesn’t just mean deleting a few rows of data in your database. It’s your responsibility to purge data about your users from all of your tools – email, advertising, and push notifications.
Typically, this process is incredibly time-consuming. We have seen companies create custom JIRA workflows, in-depth checklists, and other manual work to comply with the law.
In this article we’ll show you how to automate and easily respect user privacy by:
Managing consent with our open source consent manager.
Issuing DSARs (Data Subject Access Requests) on behalf of your users.
Federating those requests to downstream tools.
Let's dive in.
If you haven't already, you'll want to make sure you have a data source set up on your website and are collecting your user data through Segment.
// when a user first logs in, identify them with name and email
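A minimal sketch of that call (the user ID and traits shown here are illustrative):

```js
analytics.identify('019mr8mf4r', {
  name: 'Jane Kim',
  email: 'jane.kim@example.com'
});
```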
Generally, we recommend you first:
Generate user ID in your database – a user ID should never change! It’s best to generate these in your database, so they can stay constant even if a user changes their email address. We’ll handle anonymous IDs automatically.
Collect the traits you have – you don’t have to worry about collecting all traits with every call. We’ll automatically merge them for you, so just collect what you have.
Start with messaging – if you’re trying to come up with a list of traits to collect, start with email personalization. Most customers start by collecting email, first and last name, age, phone, role, and company info so they can send personalized emails or push notifications.
Once you’ve collected data, you’re ready to start your compliance efforts.
Giving users the ability to control what personal data is collected is a huge part of any privacy compliance regime.
We’ve built an open source drop-in consent manager that automatically works with Analytics.js.
Adding it in is straightforward.
First, you’ll want to remove the two lines from your analytics.js snippet.
analytics.load("<Your Write Key>") // <-- delete me
analytics.page() // <-- delete me
These will automatically be called by the consent manager.
We’ve included some boilerplate configuration, which dictates when the consent manager is shown and what the text looks like. You’ll want to add this somewhere and customize it to your liking.
You’ll also want to add a target container for the manager to load.
You can and should also customize this to your liking.
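As a rough sketch, the boilerplate and target container look something like this; the option names follow the consent manager's README, but treat the exact fields as assumptions to verify against the repo:

```html
<script>
  // illustrative configuration; see the consent-manager README for the full option list
  window.consentManagerConfig = function (exports) {
    return {
      container: '#consent-manager',            // element the banner renders into
      writeKey: '<your-segment-write-key>',
      shouldRequireConsent: exports.inEU,       // e.g. only prompt visitors in the EU
      bannerContent: 'We collect data to improve your experience on our site.'
    };
  };
</script>

<div id="consent-manager"></div>
```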
Finally, we’re ready to load the consent manager.
<script src="https://unpkg.com/@segment/consent-manager/standalone/consent-manager.js" defer></script>
Once you’re done, it should look like this.
Great, now we can let users manage their preferences! They can opt-in to all data collection, or just the portion they want to.
Now it’s time to allow users to delete their data. The simplest way to do this is to start an Airtable sheet to keep track of user requests, and then create a form from it.
At a minimum, you’ll want to have columns for:
The user identifier – either an email or user ID.
A confirmation if your page is public (making sure the user was authenticated).
A checkbox indicating that the deletion was submitted.
From there, we can automatically turn it into an Airtable form to collect this data.
To automate this you can use our GDPR Deletion APIs. You can automatically script these so that you don’t need to worry about public form submissions. We’ve done this internally at Segment.
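A sketch of what that scripting can look like; the endpoint and field names below are drawn from Segment's public Regulations API and should be treated as assumptions to double-check against the current deletion API reference:

```js
// issue a suppress-with-delete regulation for a single user ID
fetch('https://api.segmentapis.com/regulations', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.SEGMENT_API_TOKEN}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    regulationType: 'SUPPRESS_WITH_DELETE',
    subjectType: 'USER_ID',
    subjectIds: ['user_123']
  })
}).then(res => console.log('regulation created:', res.status));
```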
Tip: Make sure deletions are guarded by some sort of confirmation step, or only accessible when the user is logged in.
Now we’re ready to put it all together. We can issue deletion requests within Segment for individual users.
This will remove user records from:
Your warehouses and data lakes
Downstream destinations that support deletion
To do so, simply go to the deletion manager under Workspace Settings > End User Privacy.
This will allow you to make a new request by ID.
Simply select “New Request”, and enter the user ID from your database.
This will automatically kick off deletions in any end tools which support them. You’ll see receipts in Segment indicating that these deletions went through.
As your different destinations begin processing this data, they will send you notifications as well.
And just like that, we’ve built deletion and suppression into our pipeline, all with minimal work!
Here’s what we’ve accomplished in this article. We’ve:
Collected our user data thoughtfully and responsibly by asking for consent with the Segment open source consent manager.
Accepted deletion requests via Airtable or the Segment deletion API.
Automated that deletion in downstream tools with the deletion requests.
Dominic Barnes on April 3rd 2015
Make is awesome! It’s simple, familiar, and compatible with everything. Unfortunately, editing a
Makefile can be challenging because it has a very terse and cryptic syntax. In this post, we will outline how we author them to get simple, yet powerful, build systems.
For the uninitiated, check out this gist by Isaac Schlueter. That gist takes the form of a heavily-commented Makefile, which makes it a great learning tool. In fact, I would recommend checking it out regardless of your skill level before reading the remainder of this post.
Here at Segment, we write a lot of code. One of our philosophies is that the code we write should be beautiful, especially since we’ll be spending literally hours a day looking at it.
By beautiful, we mean that code should not be convoluted and verbose, but instead it should be expressive and concise. This philosophy is even reflected in how we write a Makefile.
We dedicate the top section of each
Makefile as a place to define variables (much like normal source code). These variables will be used to reduce the amount of code used in our recipes, making them far easier to read.
In node projects, we always rely on modules that are installed locally instead of globally. This allows us to give each project its own dependencies, giving us the room to upgrade freely without worrying about compatibility across our many other projects.
This decision requires more typing at first:
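For instance, with mocha standing in for any locally installed tool:

```make
test:
	@./node_modules/.bin/mocha --reporter spec
```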
But it's easily fixed by using a variable:
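A sketch of the same recipe with that variable (BIN is just a conventional name):

```make
BIN := ./node_modules/.bin

test:
	@$(BIN)/mocha --reporter spec
```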
We use this same pattern frequently, as it helps to shorten the code written in a recipe, making the intention far more clear. This makes understanding the recipe much easier, which leads to faster development and maintenance.
Beyond just using variables for the command name, we also put shared flags behind their own variable as well.
This helps keep things dry, but also gives developers a hook to change the flags themselves if needed:
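A sketch of that pattern; using ?= lets a developer override the flags from the command line, and the specific flags here are illustrative:

```make
MOCHA ?= ./node_modules/.bin/mocha
MOCHA_FLAGS ?= --reporter spec --timeout 5000

test:
	@$(MOCHA) $(MOCHA_FLAGS)
```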
When writing code and interacting with developer tools, we seek to avoid noise as much as possible. There are enough things on a programmer's mind, so it's best to avoid adding to that cognitive load unnecessarily.
One example is “echoing” in Make, which basically outputs each command of your recipe as it is being executed. You may notice that we used the
@ prefix on the recipes above, which actually suppresses that behavior. This is a small thing, but it is part of the larger goal.
We also run many commands in “quiet mode”, which basically suppresses all output except errors. This is one case where we definitely want to alert the developer, so they can take the necessary action to correct it.
Now, when we run make, we will only see errors that happened with the corresponding build. If nothing is output, we can assume everything went according to plan!
There are some target names that are so commonly used, they practically become a convention. We didn't invent most of the targets mentioned here; the main principle is that using names consistently throughout an organization improves the experience for developers who are new to a specific project.
Since we have a lot of web projects, the
build/ directory is often reserved as the destination for any files we are bundling to serve to the client.
The clean target is used to delete any transient files from the project. This generally includes:
build/ directory (the generated client assets)
intermediary build files/caches
test coverage reports
Remote dependencies are not part of this process. (see clean-deps)
Depending on the size and complexity of a project, the downloaded dependencies can take a considerable amount of time to completely resolve and download. As a result, they are cleaned using a distinct clean-deps target.
While Make will automatically assume the first target in a Makefile is the default one to run, we adopt the convention of putting a
default target in every
Makefile, just for consistency and flexibility.
For our projects, the
default target is usually synonymous with
build, as it is common practice to enter a project and use
make to kick off the initial build.
The lint target runs static analysis (e.g. JSHint, ESLint) against the source code for this project.
The server target starts up the web server for the given project (in the case of web projects).
The test target is exclusively for running the automated tests within a project. Depending on the complexity of the project, there could also be other related targets, such as test-server. But regardless, test will be the entry-point for a developer to run those tests.
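Pulled together, a minimal Makefile following these conventions might look something like this (browserify and eslint stand in for whatever build and lint tools a given project uses):

```make
BIN := ./node_modules/.bin

default: build

build: node_modules
	@mkdir -p build
	@$(BIN)/browserify index.js -o build/build.js

node_modules: package.json
	@npm install

clean:
	@rm -rf build coverage

clean-deps:
	@rm -rf node_modules

lint:
	@$(BIN)/eslint lib test

server:
	@node server.js

test: lint
	@$(BIN)/mocha --reporter spec

.PHONY: default clean clean-deps lint server test
```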
All in all, Make is a powerful tool suitable for many projects regardless of size, tooling and environment. Other tools like Grunt and Gulp are great, but Make comes out on top for being even more powerful, expressive and portable. It has become a staple in practically all of our projects, and the conventions we follow have helped to create a more predictable workflow for everyone on the team.
Calvin French-Owen on April 1st 2015
We’ve been running Node in production for a little over two years now, scaling from a trickle of 30 requests per second up to thousands today. We’ve been hit with almost every kind of weird request pattern under the sun.
First there was the customer who liked to batch their data into a single dump every Friday night (getting called on a Friday night used to be a good thing). Then the user who sent us their visitor’s entire social graph with every request. And finally an early customer who hit us with a
while(true) send(data) loop and caused a minor emergency.
By now, our ops team has seen the good, the bad, and the ugly of Node. Here’s what we’ve learned.
One of the great things about Node is that you don’t have to worry about threading and locking. Since everything runs on a single thread, the state of the world is incredibly simple. At any given time there’s only a single running code block.
But here… there be dragons.
Our API ingests tons of small pieces of customer data. When we get data, we want to make sure we’re actually taking the JSON and representing any ISO Strings as dates. We traverse the JSON data we’d receive, converting any date strings into native
Date objects. As long as the total size is under
15kb, we’ll pass it through our system.
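A sketch of the kind of traversal involved; it is harmless on small payloads, but it holds the event loop for the entire depth-first walk of a huge nested object:

```js
var ISO_DATE = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}/;

function coerceDates(obj) {
  Object.keys(obj).forEach(function (key) {
    var value = obj[key];
    if (typeof value === 'string' && ISO_DATE.test(value)) {
      obj[key] = new Date(value);
    } else if (value && typeof value === 'object') {
      // recurses synchronously through the entire blob
      coerceDates(value);
    }
  });
  return obj;
}
```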
It seemed innocent enough… until we’d get a massively nested JSON blob and we’d start traversing. It’d take seconds, even minutes, before we chewed through all the queued up function calls. Here’s what the times and sizes would look like after an initial large batch would get rejected:
And then things would only get worse: the problems would start cascading. Our API servers would start failing healthchecks and disconnect from the ELB. The lack of heartbeat would cause the NSQ connection to disconnect so we weren’t actually publishing messages. Our customer’s clients would start retrying, and we’d be hit with a massive flood of requests. Not. Good.
Clearly something had to be done–we had to find out where the blockage was happening and then limit it.
Now we use node-blocked to get visibility into whether our production apps are blocking on the event loop, like this errant worker:
It’s a simple module which checks when the event loop is getting blocked and calls you when it happens. We hooked it up to our logging and statsd monitoring so we can get alerted when a serious blockage occurs.
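Wiring it up takes only a few lines (the logging here is illustrative; in production we feed this into our logger and statsd):

```js
var blocked = require('blocked');

blocked(function (ms) {
  // fires whenever the event loop was blocked for longer than the module's threshold
  console.error('event loop blocked for %dms', ms);
});
```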
We dropped in the module and immediately started seeing the following in our logs:
A customer was sending us really large batches of nested JSON. Applying a few stricter limits to our API (this was back before we had limits) and moving the processing to a background worker fixed the problem for good.
To further avoid event loop problems entirely, we’ve started switching more of our data processing services to Go and using goroutines, but that’s a topic for an upcoming post!
Error handling is tricky in every language–and node is no exception. Plenty of times, there will be an uncaught exception which–through no fault of your own–bubbles up and kills the whole process.
There are multiple ways around this using the
vm module or domains. We haven’t perfected error handling, but here’s our take.
Simple exceptions should be caught using a linter. There’s no reason to have bugs for
undefined vars when they could be caught with some basic automation.
To make that super easy, we started adding make-lint to all of our projects. It catches unhandled errors and undefined variables before they even get pushed to production. Then our makefiles run the linter as the first target of `make test`.
If you’re not already catching exceptions in development, add
make-lint today and save yourself a ton of hassle. We tried to make the defaults sane so that it shouldn’t hamper your coding style but still catch errors.
In prod, things get trickier. Connections across the internet fail way more often. The most important thing is that we know when and where uncaught exceptions are happening, which is often easier said than done.
Fortunately, Node has a global
uncaughtException handler, which we use to detect when the process is totally hosed.
We ship all logs off to a separate server for collection, so we want to make sure to have enough time to log the error before the process dies. Our cleanup could use a bit more sophistication, but typically we’ll attempt to disconnect and then exit after a timeout.
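Ours is roughly this shape: log with as much context as possible, then exit after a short timeout so the logs have a chance to flush (the timeout value is illustrative):

```js
process.on('uncaughtException', function (err) {
  // include message and stack explicitly, since they're non-enumerable
  console.error('uncaught exception', err.message, err.stack);

  // in our real workers we also disconnect from NSQ and flush remote logs here
  setTimeout(function () {
    process.exit(1);
  }, 3000);
});
```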
Actually serializing errors also requires some special attention (handled for us by YAL). You'll want to make sure to include the message and stack explicitly, since they are non-enumerable properties and will be missed by simply calling JSON.stringify on the error.
Finally, we've also written our own module called oh-crap to automatically dump a heap snapshot for later examination.
It’s easily loaded into the chrome developer tools, and incredibly handy for those times we’re hunting the root cause of the crash. We just drop it in and we’ve instantly got full access to whatever state killed our beloved workers.
It’s easy to overload the system by setting our concurrency too high. When that happens, the CPU on the box starts pegging, and nothing is able to complete. Node doesn’t do a great job handling this case, so it’s important to know when we’re load testing just how much concurrency we can really deal with.
Our solution is to stick queues between every piece of processing. We have lots of little workers reading from NSQ and each of them sets a
maxInFlight parameter specifying just how many messages the worker should deal with concurrently.
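A worker sketch with that knob, assuming the nsqjs client (the topic, channel, addresses, and the limit itself are all illustrative):

```js
var nsq = require('nsqjs');

var reader = new nsq.Reader('events', 'date-coercion', {
  lookupdHTTPAddresses: ['127.0.0.1:4161'],
  maxInFlight: 25 // tuned once at boot, instead of sprinkling limits through app code
});

reader.on('message', function (msg) {
  // ...do the actual processing here...
  msg.finish();
});

reader.connect();
```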
If we see the CPU thrashing, we’ll adjust the concurrency and reload the worker. It’s way easier to think about the concurrency once at boot rather than constantly tweaking our application code and limiting it across different pipelines.
It also means we get free visibility into where data is queueing, not to mention the ability to pause entire data processing flows if a problem occurs. It gives us much better isolation between processes and makes them easier to reason about.
We moved away from using streams for most of our modules in favor of dedicated queues. But, there are a few places where they still make sense.
The biggest overall gotcha with streams is their error handling. By default, piping won’t cause streams to propagate their errors to whatever stream is next.
Take the example of a file processing pipeline which is reading some files, extracting some data and then running some transforms on it:
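A sketch of such a pipeline; the transforms are stand-ins, but the shape is the important part:

```js
var fs = require('fs');
var Transform = require('stream').Transform;

// stand-in transform: imagine JSON parsing or date coercion here
function extract() {
  return new Transform({
    transform: function (chunk, encoding, done) {
      done(null, chunk);
    }
  });
}

var pipeline = fs.createReadStream('events.log')
  .pipe(extract())
  .pipe(extract())
  .pipe(fs.createWriteStream('out.log'));

pipeline.on('error', function (err) {
  // only errors from the last stream in the chain land here;
  // errors emitted by the read stream or the transforms go uncaught
  console.error('pipeline error', err);
});
```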
Looking at this code, it’s easy to miss that we haven’t actually setup our error handling properly. Sure, the resulting pipeline stream has handlers, but if any errors occur in the
Transform streams, they’ll go uncaught.
To get around this, we use Julian Gruber’s nifty
multipipe module, which provides a nice API for centralized error handling. That way we can attach a single error handler, and be off to the races.
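With multipipe, the same pipeline gets a single callback that hears about an error from any stage (reusing the stand-in extract() transform from the sketch above):

```js
var fs = require('fs');
var pipe = require('multipipe');

pipe(
  fs.createReadStream('events.log'),
  extract(),
  extract(),
  fs.createWriteStream('out.log'),
  function (err) {
    if (err) console.error('pipeline error', err);
  }
);
```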
If you’re also running Node in production and dealing with a highly variable data pipeline, you’ve probably run into a lot of similar issues. For all these gotchas, we’ve been able to scale our node processes pretty smoothly.
Now we’re starting to move our new plumbing and data infrastructure layers to Go. The language makes different trade-offs, but we’ve been quite impressed so far. We’ve got another post in the works on how it’s working out for us, along with a bunch of open source goodies! (and if you love distributed systems, we’re hiring!)
Have other tips for dealing with a large scale node system? Send them our way.
And as always: peace, love, ops, analytics.
Chris Sperandio on February 19th 2015
Before I joined Segment, I was something of a Github stalker. Which is how I found Segment.
(To be clear, I’m still a Github stalker, only now I work here.)
I snooped through Segment’s projects for I don’t know how long before starting to realize what drew me in so consistently across every project. And it wasn’t until I joined the team and learned the thought-process and ethos behind them that I gained a sincere appreciation for why we have over 1000 repos.
Our approach to software is radically modular, pluggable and composable.
Which makes sense, because in reality, that’s the whole point of Segment.
When you build your tools at the right level of abstraction, incidental complexity is hidden away and edge-cases take care of themselves. Not to mention, you can be a lot more productive. It's what precipitated analytics.js in our early days: the intention to find the right level of abstraction for collecting data about who your users are and what they're doing. It's why people love express, koa, and rework too.
It's also why we're big proponents of the component and duo ecosystem. We even manage customer and partner logos with an extensible and modular system, and our entire front-end is built with components based on ripple and, more recently, deku.
It's hard to communicate the power of this modular and composable approach, but it ends up being disarmingly obvious to developers and product strategists alike (see Rich Hickey's presentation). Rather than attempt to explain it outright, I'll give you a tour of our more popular open source repos to show you how we've tried to make them small, self-contained, and composable.
Let’s dive into some examples!
When building our documentation, academy, blog, job board, and help section, we wanted the speed and simplicity of static sites over the restrictiveness and complexity of a CMS. And though there’s a lot of logic that could be shared between them, each called for its own unique feature-set and build process.
But when we looked at existing static site generators, they all imposed a degree of structure on the content, and weren’t flexible enough for our wide array of use cases. Enter: Metalsmith.
Metalsmith does not impose any assumptions on your content model or build process. In fact, Metalsmith is just an abstraction for manipulating a directory of files.
Breaking static sites down to their core, the underlying abstraction includes content in files (blog entries, job listings, what have you), and these files’ associated metadata. Metalsmith allows you to read a directory of files, then run a series of plugins on that data to transform it exactly the way you need.
For example, you can run markdown files through handlebars templates, create navigation or a table of contents, compress images, concatenate scripts, or anything your heart desires before writing the result to the build directory.
For example, our blog articles are just files with two sections: a header with metadata about the author, date, title and url, and then markdown for the content of the article. Metalsmith transforms the markdown to HTML, wraps the posts in their layout, looks up and inserts the author’s avatar, renders any custom Handlebars helpers, etc. The beauty is that the build process is completely customizable and abstract for many use cases. It’s just a matter of which plugins you choose. And the word is out: the metalsmith plugin ecosystem is booming!
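A stripped-down version of that kind of build (the plugin set here is illustrative rather than our exact pipeline):

```js
var Metalsmith = require('metalsmith');
var markdown = require('metalsmith-markdown');
var layouts = require('metalsmith-layouts');

Metalsmith(__dirname)
  .source('./articles')                     // markdown files with a metadata header
  .destination('./build')                   // static output, ready to serve
  .use(markdown())                          // markdown -> HTML
  .use(layouts({ engine: 'handlebars' }))   // wrap each post in its layout
  .build(function (err) {
    if (err) throw err;
  });
```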
By building our static sites with Metalsmith and hosting the source on Github, our marketing, success, and business teams can create and edit posts right from the Github web interface, or work locally and “sync” their updates with Github for Mac. This workflow closely mirrors a traditional CMS, but gives us the speed and reliability of a pre-built, static site while also lending things like auto-generated code samples for every language and automatically compiled navigation, two places otherwise prone to falling out of date.
As you’ve heard, Segment is committed to a component-driven development model: breaking things into small pieces that can be developed in isolation, and then shared and reused. But when everything is a plugin, that means an awful lot of small repos. So we build tooling to make working with lots of small repos frictionless.
For example, when we or a partner want to add a new integration to Segment, the very first thing we need to do is create a new repo to house that project. In order to enforce common style across file structure and code, we built a project scaffolder that generates a “base” the developer can use to jump right into their project.
This is where Khaos comes in — our own project scaffolder that’s built on Metalsmith, the first example here of building building blocks (with building blocks :). In the source, you can see how even we tried to make Khaos itself composable and modular.
Khaos is really just a CLI wrapper for metalsmith with the following plugin pipeline:
Read template files from a directory.
Parse files for template placeholders.
Prompt user to fill in each placeholder.
Render files with a templating engine.
Write filled-in files to a new directory.
We have khaos templates for new logos, integrations, back-end services, nightmare plugins, etc. Not only does this make getting started easy, but it reinforces cultural values like defaulting yes to MIT licensing.
As you might guess, we use Segment as the backbone of our customer data pipeline to route our data into our third-party tools and to Amazon Redshift.
While we use our partners’ visualization tools to write and share ad-hoc queries against data in Segment SQL, we wanted to make the most important data points accessible in real time throughout the organization. So we built Metrics.
We query the underlying data from Segment Warehouses and services like Stripe and Zendesk, and use Metrics to orchestrate these queries and store the aggregate metrics for each team. On any given team's board you might see ARR, MRR, daily signups, the depth of our queue for new integration requests, number of active Zendesk tickets by department, number of deploys in the last week – the list goes on.
We’ll go into more details about the business motivations and outcomes around Metrics in a future blog post, but what excites me most about Metrics is how it’s designed under the hood. It’s another example of offloading feature scope to plugins.
You can use plugins to define what data gets collected and stored, the interval at which it’s updated, and where those metrics are pushed: to dashboards, spreadsheets, summary emails, or anywhere else your heart desires.
All metrics does is expose an API for orchestrating this dance via plugins.
Check it out on github here!
We have a bit of an obsession with automation and elimination of the mundane. And that’s what drove the development of Hermes.
Raph, our beloved head of sales and first businessperson at Segment, thought Hermes was Ian’s potty-mouthed “ami français” for a few months. Nope. Hermes is a chatbot whose sole feature is, you guessed it, a plugin interface.
When you’re building a new feature for Hermes, like looking up an account’s usage, or fetching a gif from the interwebs, all you need to do is tell Hermes what he’s listening for and what to say back. Everything in between, you define in your own plugin.
Whether we want to announce that lunch is here, check Loggly for errors related to a customer’s project, kickoff a Metalsmith build of the latest blog release, or create an SVG logo for a new integration, we get our boy Hermes Hubeau to do the dirty (repetitive) work.
We were thankful for the plugin approach when we switched from Hipchat to Slack. Instead of rewriting all of Hermes, we just hot-swapped the old plugin with the new!
“Wait a minute — your chat bot creates SVG logos?!”
Nope! Humans do. Hermes just knows how to ask politely. He creates a new logo repo with Khaos, then spins up a Nightmare instance based on Metalsmith plugins, navigates to 99designs Tasks, and posts a job. When the job is finished, he resizes the logos with our logo component creation CLI.
Automating these sorts of jobs, for which there was not yet a public API, required us to mimic and automate pointing and clicking in a web browser, and that’s what Nightmare does. While there were plenty of tools to do this, like PhantomJS, webdriver APIs imposed the burden of a convoluted interface and lots of mental overhead. So we wrote a library that puts all those headaches under the covers, and lets you automate browsers the way you browse the web:
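Here's the flavor of it; a Nightmare script reads like a description of what a person would do in the browser (the site, selectors, and text are made up):

```js
var Nightmare = require('nightmare');

new Nightmare()
  .goto('https://tasks.example.com/new')
  .type('input[name="title"]', 'Resize this SVG logo')
  .click('button[type="submit"]')
  .wait()
  .run(function (err, nightmare) {
    if (err) console.error('automation failed', err);
  });
```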
Browser automation is nothing new, but we tried to give Nightmare a cleaner API and a plugin interface so people could more easily compose automations. The goal is to make it really simple to automate tasks on the web and create APIs where a public one doesn’t yet exist.
We try to break things into small, reusable pieces and hate to solve the same problem twice, pushing for simple solutions that build off of each other and are flexible enough for multiple use cases.
This level of commitment to “building building blocks” and sharing them with the community is what drew me in so hypnotically to Segment in the first place, and why I feel so immensely fortunate to be here now. As a success engineer, part of my job is to build and maintain internal tooling that enables us to better serve our customers. I’m empowered to apply the same principles and rigor used by our product team and core engineers to those projects, and my development and product direction skills have improved faster than I ever thought possible as a result.
If you think of any cool use cases for any of these tools at your company, we’d love to hear more about them. Tweet us @segment with your ideas or fork away on GitHub! We always appreciate new plugins and contributions. And if any of this particularly resonates with you, we’re hiring!
TJ Holowaychuk on February 21st 2014
One of the most popular logging libraries for Node.js is Winston. Winston is a great library that lets you easily send your logs to any number of services directly from your application. For lots of cases Winston is all you need; however, there are some problems with this technique when you're dealing with mission-critical nodes in a distributed system. To solve them we wrote a simple solution called Yet-Another-Logger.
The typical multicast logging setup looks something like this:
The biggest issue with this technique for us was that many of these plugins are only enabled in production, or cause problems that are only visible under heavy load. For example it’s not uncommon for such libraries to use CPU-intensive methods of retrieving stack traces, or cause memory leaks, or even worse uncaught exceptions!
Another major drawback is that if your application is network-bound like ours is, then sending millions of log requests out to multiple services can quickly take its toll on the network, slowing down everything else.
Finally, the use of logging intermediaries allows you to add or remove services at will, without re-deploying most of your cluster or making code changes to the applications themselves.
Our solution was to build a simple client/server system of nodes to isolate any problems to a set of servers whose sole job is to fan out the logs. We call it Yet-Another-Logger, or YAL.
The Yet-Another-Logger client is pretty much what you would expect from a standard logging client. It has some log-level methods and accepts message arguments—standard stuff. The only difference is that you instantiate the client with an array of YAL Server addresses, which it uses to round-robin:
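Something along these lines; the server addresses and log fields are made up, and the exact method names may differ from the current yal README:

```js
var Client = require('yal');

var log = new Client([
  'tcp://logs-1.internal:5000',
  'tcp://logs-2.internal:5000'
]);

log.info('user signup', { userId: '12345', plan: 'startup' });
log.error('charge failed', { userId: '12345', code: 'card_declined' });
```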
YAL is backed by the Axon library, a zeromq-inspired messaging library. The great thing about this is that when a node goes down, messages will be routed to stable nodes, and then resume when the node comes back online.
The YAL server is also extremely simple. It accepts log events from the clients and distributes them to any number of configured services, taking the load off of mission-critical applications.
At the time of writing YAL Server is a library, and does not provide an executable, however in the near future an executable may be provided too. Until then a typical setup would include writing a little executable specific to your system.
Server plugins are simply functions that accept a
server instance, and listen on the
'message' event. That makes writing YAL plugins really simple. It’s also trivial to re-use an existing Winston setup by just plunking your Winston code right into YAL Server.
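So a plugin ends up being little more than this, where the body forwards each event to whatever service the node is responsible for:

```js
module.exports = function (server) {
  server.on('message', function (msg) {
    // forward to Winston, Elasticsearch, S3, or anything else
    console.log('received log event', msg);
  });
};
```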
I’d recommend always running at least 3 YAL Servers in a cluster for redundancy, so you can be sure not to lose any data.
That’s all for now! The two pieces themselves are very simple, but combined they give your distributed system a nice layer of added protection against logging-related outages.
Coming up soon I'll be blogging about some Elasticsearch tooling that we've built exclusively for digging through all of those logs we're sending through YAL!
Anthony Short on February 19th 2014
Every month we’re going to do a round-up of all the projects we’ve open-sourced on Github. We have hundreds of projects available for anyone to use, ranging from CSS libraries and UI components to static-site generators and server tools. Not to mention that Segment all started from analytics.js.
Myth is a preprocessor that lets you write pure CSS without having to worry about slow browser support, or even slow spec approval. It's like a CSS polyfill.
Diff two versions of a node module.
Yet-Another-Logger that pushes logs to log servers with axon/tcp to delegate network overhead.
Adds some concurrency to a transform stream so that multiple items may be transformed at once.
A FIFO queue for co.
If you want to see more of the awesome code we’re releasing, follow us on Github or follow any of our team members. We’re all open-source fanatics.
Peter Reinhardt on November 27th 2013
When we analyze usage and customers at Segment, we constantly need to join queries across Mongo and Redis. Why? Because our account information is in Mongo and our API usage is in Redis. Today we're open sourcing Hydros. It's a quick cheat that lets us run SQL queries for analysis, while using NoSQL in production.
What we’ve noticed is that every business question boils down to a simple join across account info and usage. Here are some examples:
Enterprise integrations: find the integrations used by projects (Mongo) that send over 100 million API calls per month (Redis).
Mobile projects: get the names of projects (Mongo) that use our iOS or Android SDKs (Redis).
Power users: get the emails of users (Mongo) who have 20 or more active projects (Redis).
Before Hydros, I’d cobble together a bunch of 50-line node scripts that would connect to both databases. All the join and relational logic was in code. It was horrible. Just a huge, messy folder of code that I never wanted to touch again. Check out cohort.js for a taste of what should have been a simple SQL query.
For an engineer turned business guy, this is pretty frustrating. I wanted something maintainable, that we could build on as the company grows.
This was such an annoying problem for us that we even went so far as to sync our entire database to Google Spreadsheets so that we could sort, filter and join the databases there. Ilya made some magic happen there, but Google Docs is just really slow and clunky.
Finally! One night at Happy Data Hour, Josh from Mode yammered my ear off about how Yammer's internal analytics system worked. I was a couple beers in, but what I understood, I liked :) I definitely walked away with a bastardized version of what they've accomplished over there, but…
It was akin to “data marts”… you sync your databases to SQL tables idempotently and transactionally, and then run the SQL queries there. Simple.
Here we were, getting all fancy with NoSQL, but the answer was right there all along. Good old SQL.
If we had a good syncing abstraction, all we’d need to do is:
Write idempotent transformations from production databases to SQL tables.
Run our queries against the SQL tables.
So that weekend I got really excited, and started building a similar system for ourselves. After a couple fresh starts and a rewrite, Hydros was born.
Hydros is a node module that lets you easily pull any data source into a MySQL table. You define the SQL table name, columns, and two functions:
The list function generates a list of all rows that should be in the Hydros table. For a user table this would be an array of all the user IDs. For a project table this would be an array of all the project IDs. Drop-dead simple, that's the point :)
The get function is responsible for filling in a single row. The function is passed a single row ID, and returns all the column values for that row. For a project table, this might mean looking up project metadata in Mongo, or looking up API usage in Redis.
Between those two functions, you get a full sync:
list all the rows, then
get the columns for each row.
Hydros handles table creation and manages the timing of
get for you automatically.
The goal is to have many simple tables in MySQL, and then have many simple Hydros instances syncing the data into them. We have a half-dozen tables already, and it’s growing quickly.
For example, a Hydros implementation of a "Project API Usage" table might work like this:
list project IDs from Mongo
get each project’s API usage by pulling counters from Redis
The Hydros table gets a list of rows it should have by polling the
list function. Then, at a higher frequency, Hydros polls the
get function to fill out the columns for each row. You control the refresh time.
Here’s an incomplete implementation of that example:
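A sketch of the shape it takes; the constructor signature here is inferred from the description above rather than copied from the repo, and the Mongo and Redis clients are assumed to be configured elsewhere:

```js
var Hydros = require('hydros');

// `projects` is a Mongo collection and `redis` a Redis client, both set up elsewhere
module.exports = Hydros('project_api_usage', ['id', 'name', 'calls'], {
  // list: every row the table should contain, here every project ID in Mongo
  list: function (callback) {
    projects.find({}, { _id: 1 }).toArray(function (err, docs) {
      if (err) return callback(err);
      callback(null, docs.map(function (doc) { return doc._id.toString(); }));
    });
  },

  // get: fill in one row, with metadata from Mongo and usage counters from Redis
  get: function (id, callback) {
    projects.findOne({ _id: id }, function (err, project) {
      if (err) return callback(err);
      redis.get('api:calls:' + id, function (err, calls) {
        callback(err, { id: id, name: project.name, calls: parseInt(calls, 10) || 0 });
      });
    });
  }
});
```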
At Segment we use Hydros to answer a ton of business questions. Combined with Chartio, even the nontechnical people on our team can run queries and dashboard the results.
We have seven tables so far:
Project API Usage (list: Mongo, get: Redis)
Project Integrations (list: Mongo, get: Mongo)
Project Channel Usage (list: Mongo, get: Redis)
Project Library Usage (list: Mongo, get: Redis)
Project Metadata (list: Mongo, get: Mongo)
User Metadata (list: Mongo, get: Mongo)
User Projects (list: Mongo, get: Mongo)
And from that we create 25 charts and tables. Here are some examples:
A table of client libraries, sorted by popularity.
A table of integrations sorted by popularity. Juicy competitor data!
A graph of monthly project cohorts, colored by payment tier.
A graph of monthly project cohorts, colored by client library.
The charts tell us which client libraries and integrations we should focus on, help us estimate future revenue, and even let us prioritize enterprise leads to contact.
Hydros keeps the underlying tables in sync with production Mongo and Redis, completing a sync every 8 hours. Chartio keeps the charts up to date within 30 minutes. This is plenty fast for product analytics!
With Hydros in place, we’re saving a ton of time. Instead of writing piles of janky query code for every business question, we just run SQL queries. Best of all, we get to use business intelligence tools like Chartio right out of the box. We’re actually able to build on our analysis instead of treading water.
If you want to take Hydros for a spin, we open-sourced it. Check it out on Github: https://github.com/segmentio/hydros.
Calvin French-Owen on May 9th 2013
Five months ago, we released a small library called Analytics.js by submitting it to Hacker News. A couple hours in it hit the #1 spot, and over the course of the day it grew from just 20 stars to over 1,000. Since then we’ve learned a ton about managing an open-source library, so I wanted to share some of those tips.
At the very beginning, we knew absolutely nothing about managing an open-source library. I don’t think any of us had even been on the merging side of a pull request before. So we had to learn fast.
Since Analytics.js has over 2,000 stars now, lots of people are making amazing contributions from the open-source community. Along the way, we’ve learned a lot about what we can do to keep pull requests top-quality, and how to streamline the development process for contributors.
New contributors will look at your existing codebase to learn how to add functionality to your library. And that’s exactly what they should be doing. Every developer wants to match the structure of a library they contribute to, but don’t own. Your job as a maintainer is to make that as easy as possible.
The trouble starts when your library leaves ambiguity in its source. If you do the same thing two different ways in two different places, how are contributors going to know which way is recommended? Answer: they won’t.
In the worst case they might even decide that because you aren’t consistent, they don’t have to be either!
Solving this takes a lot of discipline and consistency. As a rule, you shouldn’t experiment with different styles inside a single open source repository. If you want to change styles, do it quickly and globally. Otherwise, newcomers won’t be able to differentiate new conventions from ones you abandoned months ago.
We started off being very poorly equipped to handle this. All of our code lived in a single file, and the functions weren’t organized at all. (And if you check out the commits, that was after a cleanup!) We hadn’t taken the time to set a consistent style - the library was a jumble of different conventions.
As the pull requests started coming in, each one conflicted with the others. Everyone was modifying the same parts of the same files and adding their own utility functions wherever they felt like it.
Which leads into my next point…
The initial way we structured our code was leading to loads of problems with merging pull requests: namely we had no structure! One of the big changes we made to fix our structure was moving over to Component.
We love Component because it eliminates the magic from our code and reduces our library’s scope. It lets us use CommonJS, so we can just
require the modules we need, right from where we need them. Everything is explicit, which means our code is much easier for newcomers to follow. It's a maintainer's dream.
While making the switch, we wrote a bunch of our own components to replace all the utility functions we had been attaching to our global
analytics object. And now, since components are easy to include and use everywhere else in the library, pull requesters just use them by default!
As soon as we established the right way to do things and made it clear, pull request quality went up dramatically.
As far as keeping a consistent style goes, you have to be militant when it comes to new code. You cannot be afraid of commenting on pull requests even if it seems like a minor style correction, or refusing requests which needlessly clutter your API.
And remember, that goes for your own code as well! If you get lazy while adding new features, why shouldn’t contributors? The more clean code in your repository, the more good examples you have for newcomers to learn from.
Speaking of not getting lazy…
Having great test coverage is easily the best way to speed up development. We push changes all the time, so we can't afford to spend time worrying about breaking existing functionality. We write lots of tests, and get lots of benefits: much faster development, more confidence in our own code, more trust from outsiders, and…
It also leads to much higher-quality pull requests!
When developers copy the library coding style, that extends to tests as well. We don’t let contributors add their own code without adding corresponding tests.
The good part is we don’t have to enforce this too much. Thanks to Travis-CI, nearly all of our pull requests come complete with tests patterned from our existing tests. And since we’ve made sure our existing tests are high quality, the copied tests naturally start off at a higher bar.
Travis is so well integrated with GitHub, that it will make merging your pull requests significantly easier. Both you and your contributors get notifications when a pull request doesn’t pass all your tests, so they’ll be incentivized to fix a breaking change.
Having a passing Travis badge also inspires trust that your library actually does work correctly and is still maintained. Not to mention peer pressures you into keeping your tests in good condition (which we all know is the hardest part).
Versioning properly, just like testing, takes discipline and is completely worth it. Most developers, when they're having issues, will immediately check to see if they are running the latest version of a library. If not, they'll update and pray the bug is fixed. Without versioning, you'll get issues reported about bugs you've already fixed.
When we started, we had no idea how to manage versions at all; our repository was basically just a pile of commits. If you push frequent updates, this will needlessly hassle the developers using your library.
From the start, there are three things your repo needs for versioning:
Readme.md describing what your library does.
Version numbers both in the source and in git tags.
History.md containing versioned and dated descriptions of your changes.
Having a changelog is essential for letting developers track down issues. Any time a developer finds a potential bug, the changelog is the first place they'll go to see if an upgrade will fix it.
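An entry per release with the date and notable changes is all it takes; for example (this entry is made up):

```
1.3.2 - May 9, 2013
-------------------
* fix: don't throw when traits are missing from identify
* add: Quantcast integration
```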
Tagging our repository also turned out to be immensely useful. Each version has a well defined point where the code has stabilized. If you’re using a frontend registry like Component or Bower, you also get the advantage of automatic packaging.
Don’t forget to put the version somewhere in your source too, where the developer using the library can access it. Because when you’re helping someone remotely debug you’ll want to have a quick way to check what version they’re running so you don’t waste time needlessly.
Oh, and use semver. It’s the standard for open-source projects.
I’ll let you in on a little secret: we have close to no documentation for people wanting to contribute to the library. (That’s another problem we need to fix.) But I’m amazed at how many pull requests we receive even without any building or testing instructions whatsoever.
Just from looking at our repository, how do new developers know how to build the script?
Considering that most of them have never seen a Component-based repository, my best guess is that they learn through our Makefile. Running
make forms a natural entry point to see how a new project works. It’s become the de facto instruction manual for editing code.
We use long flags in our scripts, and give each command a descriptive name. By using Component to structure and build our library, we can essentially defer to their documentation before writing any of our own. Once you understand how Component works, you know how any Component-based repo is built.
We've done a lot to clean up our Makefile through the different iterations of our code. Remember, your build and test processes are part of your code as well. They should be clean and readable, as they are the starting point for newcomers.
Notice how all the tips I’ve mentioned are about streamlining your process? That’s because maintaining a popular repository is all about staying above water. You’ll be making lots of little changes throughout the day as new issues are filed, and if you don’t optimize your development process all of your free time will evaporate.
Not only that, but the quality of your library will suffer. Without a good build system, automated testing, and a clean codebase, fixing small bugs becomes a chore, so issues start taking longer and longer to resolve. No one wants that.
We’ve learned a lot when it comes to managing a repo of our size, but we still have a long ways to go in terms of managing a really big project. Managing a large open source project takes a lot of work. As the codebase grows, it will be harder and harder to make major overhauls of the code. More and more people will start depending on it, and the number of pull requests will start growing.
We still have a few more TODOs that will hopefully make maintaining Analytics.js even easier:
We want to split up our tests even more to make them as manageable as possible. Right now the file sizes are getting pretty out of control, which means it’s hard for newcomers to keep everything in their head at once.
Add better contributor documentation. It’s kind of ridiculous that we haven’t done this yet. We’re surprised we’ve even gotten any pull requests at all without it, so this is very high on our list.
Start pull requesting every change we make to the repository ourselves as well. This way we can always peer review each other’s changes, and other contributors can get involved with discussions.
Lots of these tactics come from the Node.js source, which has great guidelines for new contributors.
Every Node.js commit is first pull requested, and reviewed by a core contributor before it is merged. New features are discussed first as issues or pull requests, so multiple opinions are considered. Node commit logs are clean, yet detailed. They have an extensive guide for new contributors, and a linter to serve as a rough style guide.
No project is perfect, but learning these lessons firsthand has helped us adopt better practices across the rest of our libraries as well. Hopefully, they’ll help you too!
Calvin French-Owen on February 4th 2013
It’s been said that “constraints drive creativity.” If that’s true, then PHP is a language which is ripe for creative solutions. I just spent the past week building our PHP library for Segment, and discovered a variety of approaches used to get good performance making server-side requests.
When designing client libraries to send data to our API, one of our top priorities is to make sure that none of our code affects the performance of your core application. That is tricky when you have a single-threaded, “shared-nothing” language like PHP.
To make matters more complicated, hosted PHP installations come in many flavors. If you're lucky, your hosting provider will let you fork processes, write files, and install your own extensions. If you're not, you're sharing an installation with some noisy neighbors and can only upload PHP files.
Ideally, we like to keep the setup process minimal and address a wide variety of use cases. As long as it runs with PHP (and possibly a common script or two), you should be ready to dive right in.
We ended up experimenting with three main approaches to make requests in PHP. Here’s what we learned.
The top search results for PHP async requests all use the same method: write to a socket and then close it before waiting for a response.
The idea here is that you open a connection to the server and then write to it as soon as it is ready. Since the socket write is fast and you don’t need the response at all, you close the connection immediately post-write. This saves you from waiting on a single round-trip time.
But as you can see from some of the comments on StackOverflow, there’s some debate about what’s actually going on here. It left me wondering: “How asynchronous is the socket approach?”
Here’s what our own code using sockets looks like:
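Condensed, the socket path looked roughly like this (error handling trimmed, and the payload is assumed to be built elsewhere):

```php
<?php
// open an SSL socket to the API; this call blocks until the connection is established
$socket = fsockopen('ssl://api.segment.io', 443, $errno, $errstr, 0.5);

$body = json_encode($payload);
$request = "POST /v1/track HTTP/1.1\r\n"
         . "Host: api.segment.io\r\n"
         . "Content-Type: application/json\r\n"
         . "Content-Length: " . strlen($body) . "\r\n\r\n"
         . $body;

// write the request and close immediately, without waiting for a response
fwrite($socket, $request);
fclose($socket);
```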
The initial results weren’t promising. A single
fsockopen call was taking upwards of 300 milliseconds, and occasionally much longer.
As it turns out,
fsockopen is blocking - not very asynchronous at all! To see what’s really going on here, you have to dive into the internals of how
fsockopen actually works.
As a refresher, the basic protocol on which the internet runs is called TCP. It ensures that messages between computers are transmitted reliably and get ordered properly. Since nearly all HTTP runs over TCP, we use it for our API to make writing custom clients simple.
Here’s the gist of how a TCP socket gets started:
The client sends a
syn message to the server.
The server responds with an acknowledgement (syn-ack).
The client sends a final
ack message and starts sending data.
For those of you counting, that’s a full roundtrip before we can send data to the server, and before
fsockopen will even return. Once the connection is open, we can write our data to the socket. Typically this can take anywhere from
30-100ms to establish a connection to our servers.
While TCP connections are relatively fast, the chief culprit here is the extra handshake required for SSL.
The SSL implementation works on top of TCP. After the TCP handshake happens, the client then begins a TLS handshake.
It ends up being 3 round trips to establish an SSL connection, not to mention the time required to set up the public key encryption.
SSL connections in the browser can avoid some of these round-trips by reusing a shared secret which has been agreed upon by client and server. Since normal sockets aren’t shared between PHP executions, we have to use a fresh socket each time and can’t re-use the secret!
It's possible to use socket_set_nonblock to create a "non-blocking" socket. This won't block on the open call, but you'll still have to wait before writing to it. Unless you're able to schedule time-intensive work in between opening the socket and writing data, your page load will still be slowed by the connection setup.
A better approach is to open a persistent socket using
pfsockopen. This will re-use earlier socket connections made by the PHP process, which doesn't require a TCP handshake each time. Though the initial latency is higher the first time a request is made, I was able to send over 1000 events/sec from my development machine. Additionally, we can decide to read from the response buffer when debugging, or choose to ignore it in production.
To sum it up:
Sockets can still be used when the daemon running PHP has limited privileges.
fsockopen is blocking and even non-blocking sockets must wait before writing.
Using SSL creates significant slowdown due to extra round-trips and crypto setup.
Opening a new connection sets every page request back by the full handshake time.
pfsockopen will block the first time, but can re-use earlier connections without a handshake.
Sockets are great if you don’t have access to other parts of the machine, but an approach which will give you better performance is to log all of the events to a file. This log file can then be processed “out of band” by a worker process or a cron job.
The file-based approach has the advantage of minimizing outbound requests to the API. Instead of making a request whenever we call
identify from our PHP code, our worker process can make requests for
100 events at a time.
Another advantage is that a PHP process can log to a file relatively quickly, processing a write in only a few milliseconds. Once PHP has opened the file handle, appending to it with
fwrite is a simple task. The log file essentially acts as the “shared memory queue” which is difficult to achieve in pure PHP.
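The logging side is just an append (the path and message are illustrative):

```php
<?php
$message = array('userId' => '123', 'event' => 'Signed Up');

// append one JSON-encoded message per line; the out-of-band uploader batches them later
$handle = fopen('/var/log/analytics.log', 'a');
fwrite($handle, json_encode($message) . "\n");
fclose($handle);
```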
To read from the analytics log file, we wrote a simple Python uploader which uses our batching analytics-python library. To ensure that the log files don't get too large, the uploading script renames the file atomically. PHP processes that are actively writing can still write to their existing file handles, and new requests create a new log file where the old one used to be.
There's not too much magic to this approach. It does require a bit more work on the side of the developer to set up the cron job and separately install our Python library through PyPI. The key takeaways are:
Writing to a file is fast and takes few system resources.
Logging requires some drive space, and the daemon must have capabilities to write to the file.
You must run a worker process to process the logged messages out of band.
As a last alternative, your server can run
exec to make requests using a forked
curl process. The
curl request can complete as part of a separate process, allowing your PHP code to render without blocking on the socket connection.
In terms of performance, forking a process sits between the two of our earlier approaches. It is much quicker than opening a socket, but more resource intensive than opening a handle to a file.
To execute the forked curl process, our condensed code looks like this:
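Something like the following sketch, shown for a single track call (authentication omitted for brevity):

```php
<?php
$payload = json_encode(array('userId' => '123', 'event' => 'Signed Up'));
$body = escapeshellarg($payload);

// "> /dev/null 2>&1 &" backgrounds the curl process, so exec() returns immediately
$cmd = "curl -X POST -H 'Content-Type: application/json' "
     . "-d {$body} 'https://api.segment.io/v1/track' > /dev/null 2>&1 &";

exec($cmd);
```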
If we’re running in production mode, we want to make sure that we aren’t waiting on the forked process for output. That’s why we add the
"> /dev/null 2>&1 &" to our command, to ensure the process gets properly forked and doesn’t log anywhere.
The equivalent shell command looks like this:
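With a literal payload substituted in:

```sh
curl -X POST -H 'Content-Type: application/json' \
  -d '{"userId":"123","event":"Signed Up"}' \
  'https://api.segment.io/v1/track' > /dev/null 2>&1 &
```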
It takes a little over
1ms to fork the process, which then uses around 4k of resident memory. While the
curl process takes the standard SSL
300ms to make the request, the
exec call can return to the PHP script right away! This lets us serve up the page to our clients much more quickly.
On my moderately sized machine, I can fork around
curl requests per second without them stacking up in memory. Without SSL, it can do significantly more:
Forking a process without waiting for the output is fast.
curl takes the same time to make a request as socket, but it is processed out of band.
Forking curl requires only normal unix primitives.
Forking sets a single request back only a few milliseconds, but many concurrent forks will start to slow your servers.
While not an approach to making async requests, we found that destructor functions help us batch API requests.
To reduce the number of requests we make to our API, we want to queue these requests in memory and then batch them to the API. Without using runtime extensions, this can only happen on a single script execution of PHP.
To do this we create a queue on initialization. When the script ends its execution, we send all the queued requests in batch:
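A condensed sketch of that pattern (the class name is hypothetical and the flush internals are elided):

```php
<?php
class Segment_Client {
  private $queue = array();

  public function track($event) {
    // queue in memory instead of making a request per call
    $this->queue[] = $event;
  }

  public function __destruct() {
    // runs as the script finishes rendering: flush everything in one batch
    $this->flush();
  }

  private function flush() {
    if (empty($this->queue)) {
      return;
    }
    // POST the whole batch to the API here
  }
}
```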
We establish the queue when the object is created, and then flush the queue when the object is ready to be destroyed. This guarantees that our queue is only flushed once per request.
Additionally, we can create the socket itself in a non-blocking way in the constructor, then attempt to write to it in the destructor. This gives the connection more time to be established while the PHP interpreter is busy trying to render the page - but we will still have to wait before actually writing to the socket.
Our holy grail is a pure-PHP implementation which doesn’t interface with other processes, yet still is conservative when it comes to making requests. We’d like to make developer setup as easy as possible without requiring a dedicated queue on a separate host.
In practice, this is extremely hard to achieve. Each one of our methods has caveats and restrictions depending on how much traffic you're dealing with and what your system allows you to do. Since no single approach can cover every use case, we built different adapters to support different users with different needs.
Originally we used the
curl forking approach as our default. Forking a process doesn’t cause a significant performance hit for page load, and is still able to scale out to many requests per second per host. However, this is limited to the configuration of the host, and can have scary consequences if your PHP program starts forking too many processes at once.
After switching to persistent sockets, we decided to make the socket approach our default. Without the TCP handshake per every request, the sockets can deal with thousands of requests per second. This approach also has significantly better portability than the curl forking approach.
For really high traffic clients who have a bit more control over their own hardware, we still support the log file system. If the process which actually serves the PHP can't re-use socket connections, then this is the best option from a performance perspective.
Ultimately, it comes to knowing a little bit about the limitations of your system and its load profile. It’s all about determining which trade-offs you’re comfortable making.
Edit 2/6/13: Originally I had stated that we used the
curl forking approach for our default and had ignored persistent sockets altogether. After switching to persistent sockets, the performance of the socket approach increased enough to make it our default approach. It also has better portability across PHP installations.
PS. If you’re running WordPress, we also released a WordPress plugin that handles everything for you!