When we analyze usage and customers and Segment, we constantly need to join queries across Mongo and Redis. Why? Because our account information is in Mongo and our API usage is in Redis. Today we’re open sourcing Hydros. It’s a quick cheat to let us run SQL queries for analysis, while using NoSQL in production.
What we’ve noticed is that every business question boils down to a simple join across account info and usage. Here are some examples:
Enterprise integrations: find the integrations used by projects (Mongo) that send over 100 million API calls per month (Redis).
Mobile projects: get the names of projects (Mongo) that use our iOS or Android SDKs (Redis).
Power users: get the emails of users (Mongo) who have 20 or more active projects (Redis).
Before Hydros, I’d cobble together a bunch of 50-line node scripts that would connect to both databases. All the join and relational logic was in code. It was horrible. Just a huge, messy folder of code that I never wanted to touch again. Check out cohort.js for a taste of what should have been a simple SQL query.
For an engineer turned business guy, this is pretty frustrating. I wanted something maintainable, that we could build on as the company grows.
This was such an annoying problem for us that we even went to so far as to sync our entire database to Google Spreadsheets so that we could sort, filter and join the databases there. Ilya made some magic happen there, but Google Docs is just really slow and clunky.
Finally! one night at Happy Data Hour, Josh from Mode yammered my ear off about how Yammer’s internal analytics system worked. I was a couple beers in, but what I understood, I liked :) I definitely walked away with a bastardized version of what they’ve accomplished over there, but…
Yammer’s system was simple and badass.
It was akin to “data marts”… you sync your databases to SQL tables idempotently and transactionally, and then run the SQL queries there. Simple.
Here we were, getting all fancy with NoSQL, but the answer was right there all along. Good old SQL.
If we had a good syncing abstraction, all we’d need to do is:
Write idempotent transformations from production databases to SQL tables.
Run our queries against the SQL tables.
So that weekend I got really excited, and started building a similar system for ourselves. After a couple fresh starts and a rewrite, Hydros was born.
Hydros is a node module that lets you easily pull any data source into a MySQL table. You define the SQL table name, columns, and two functions:
list function is generates a list of all rows that should be in the Hydros table. For a user table this would be an array of all the user IDs. For a project table this would be an array of all the project IDs. Dropdead simple, that’s the point :)
get function is responsible for filling in a single row. The function is passed a single row id, and returns all the column values for that row. For a project table, this might mean looking up project metadata in Mongo, or looking up API usage in Redis.
Between those two functions, you get a full sync:
list all the rows, then
get the columns for each row.
Hydros handles table creation and manages the timing of
get for you automatically.
The goal is to have many simple tables in MySQL, and then have many simple Hydros instances syncing the data into them. We have a half-dozen tables already, and it’s growing quickly.
For example, a hydros implementation of an “Project API Usage” table might work like this:
listproject IDs from Mongo
geteach project’s API usage by pulling counters from Redis
The Hydros table gets a list of rows it should have by polling the
list function. Then, at a higher frequency, Hydros polls the
get function to fill out the columns for each row. You control the refresh time.
Here’s an incomplete implementation of that example:
How we use it.
At Segment we use Hydros to answer a ton of business questions. Combined with Chartio, even the nontechnical people on our team can run queries and dashboard the results.
We have seven tables so far:
Project API Usage (list: Mongo, get: Redis)
Project Integrations (list: Mongo, get: Mongo)
Project Channel Usage (list: Mongo, get: Redis)
Project Library Usage (list: Mongo, get: Redis)
Project Metadata (list: Mongo, get: Mongo)
User Metadata (list: Mongo, get: Mongo)
User Projects (list: Mongo, get: Mongo)
And from that we create 25 charts and tables. Here are some examples:
A table of client libraries, sorted by popularity.
A table of integrations sorted by popularity. Juicy competitor data!
A graph of monthly project cohorts, colored by payment tier.
A graph of monthly project cohorts, colored by client library.
The charts tell us which client libraries and integrations we should focus on, help us estimate future revenue, and even let us prioritize enterprise leads to contact.
Hydros keeps the underlying tables in sync with production Mongo and Redis, completing a sync every 8 hours. Chartio keeps the charts up to date within 30 minutes. This is plenty fast for product analytics!
With Hydros in place, we’re saving a ton of time. Instead of writing piles of janky query code for every business question, we just run SQL queries. Best of all, we get to use business intelligence tools like Chartio right out of the box. We’re actually able to build on our analysis instead of treading water.
If you want to take Hydros for a spin, we open-sourced it. Check it out on Github: https://github.com/segmentio/hydros.