Solon Aguiar, Brian Lai on January 20th 2022
Anthony Short on February 19th 2014
Every month we’re going to do a round-up of all the projects we’ve open-sourced on GitHub. We have hundreds of projects available for anyone to use, ranging from CSS libraries and UI components to static-site generators and server tools. Not to mention that Segment all started from analytics.js.
Myth is a preprocessor that lets you write pure CSS without having to worry about slow browser support, or even slow spec approval. It’s like a CSS polyfill.
Diff two versions of a node module.
Yet-Another-Logger that pushes logs to log servers with axon/tcp to delegate network overhead.
Adds some concurrency to a transform stream so that multiple items may be transformed at once.
A FIFO queue for co.
If you want to see more of the awesome code we’re releasing, follow us on GitHub or follow any of our team members. We’re all open-source fanatics.
Peter Reinhardt on November 27th 2013
When we analyze usage and customers at Segment, we constantly need to join queries across Mongo and Redis. Why? Because our account information is in Mongo and our API usage is in Redis. Today we’re open sourcing Hydros. It’s a quick cheat to let us run SQL queries for analysis, while using NoSQL in production.
What we’ve noticed is that every business question boils down to a simple join across account info and usage. Here are some examples:
Enterprise integrations: find the integrations used by projects (Mongo) that send over 100 million API calls per month (Redis).
Mobile projects: get the names of projects (Mongo) that use our iOS or Android SDKs (Redis).
Power users: get the emails of users (Mongo) who have 20 or more active projects (Redis).
Before Hydros, I’d cobble together a bunch of 50-line node scripts that would connect to both databases. All the join and relational logic was in code. It was horrible. Just a huge, messy folder of code that I never wanted to touch again. Check out cohort.js for a taste of what should have been a simple SQL query.
For an engineer turned business guy, this is pretty frustrating. I wanted something maintainable, that we could build on as the company grows.
This was such an annoying problem for us that we even went so far as to sync our entire database to Google Spreadsheets so that we could sort, filter and join the databases there. Ilya made some magic happen there, but Google Docs is just really slow and clunky.
Finally, one night at Happy Data Hour, Josh from Mode yammered my ear off about how Yammer’s internal analytics system worked. I was a couple beers in, but what I understood, I liked :) I definitely walked away with a bastardized version of what they’ve accomplished over there, but…
It was akin to “data marts”… you sync your databases to SQL tables idempotently and transactionally, and then run the SQL queries there. Simple.
Here we were, getting all fancy with NoSQL, but the answer was right there all along. Good old SQL.
If we had a good syncing abstraction, all we’d need to do is:
Write idempotent transformations from production databases to SQL tables.
Run our queries against the SQL tables.
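As a rough illustration of those two steps, here is a minimal Python sketch that uses the standard-library sqlite3 module as a stand-in for a SQL database. The table names, columns, and sample rows are all made up for the example. The point is only that re-running the same transformation leaves the tables unchanged, and that the "power users" question then becomes a plain SQL query.

```python
import sqlite3

def sync_rows(db, table, rows):
    """Idempotently write rows keyed by id: re-running leaves the same state."""
    for row in rows:
        columns = ", ".join(row)
        placeholders = ", ".join("?" for _ in row)
        # INSERT OR REPLACE overwrites any existing row with the same primary key.
        db.execute(
            f"INSERT OR REPLACE INTO {table} ({columns}) VALUES ({placeholders})",
            tuple(row.values()),
        )
    db.commit()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id TEXT PRIMARY KEY, email TEXT)")
db.execute("CREATE TABLE projects (id TEXT PRIMARY KEY, user_id TEXT, active INTEGER)")

# Hypothetical sample data standing in for the production databases.
users = [{"id": "u1", "email": "a@example.com"}, {"id": "u2", "email": "b@example.com"}]
projects = [{"id": f"p{i}", "user_id": "u1", "active": 1} for i in range(20)]
projects.append({"id": "p99", "user_id": "u2", "active": 1})

for _ in range(2):  # syncing twice must not duplicate anything
    sync_rows(db, "users", users)
    sync_rows(db, "projects", projects)

# Power users: emails of users with 20 or more active projects.
power_users = db.execute(
    """
    SELECT u.email
    FROM users u JOIN projects p ON p.user_id = u.id
    WHERE p.active = 1
    GROUP BY u.id
    HAVING COUNT(*) >= 20
    """
).fetchall()
```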
So that weekend I got really excited, and started building a similar system for ourselves. After a couple fresh starts and a rewrite, Hydros was born.
Hydros is a node module that lets you easily pull any data source into a MySQL table. You define the SQL table name, columns, and two functions:
The list function generates a list of all rows that should be in the Hydros table. For a user table this would be an array of all the user IDs. For a project table this would be an array of all the project IDs. Dead simple, that’s the point :)
The get function is responsible for filling in a single row. The function is passed a single row id, and returns all the column values for that row. For a project table, this might mean looking up project metadata in Mongo, or looking up API usage in Redis.
Between those two functions, you get a full sync:
list all the rows, then
get the columns for each row.
Hydros handles table creation and manages the timing of get for you automatically.
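The list-then-get pattern can be sketched in a few lines of Python. This is not Hydros's actual API (Hydros is a node module that targets MySQL). The function names, the in-memory stand-ins for Mongo and Redis, and the use of sqlite3 are all assumptions made so the example is self-contained. It also runs get immediately after list, rather than polling the two on separate schedules.

```python
import sqlite3

def sync_table(db, table, columns, list_ids, get_row):
    """One full sync: list every row id, then fill in the columns for each."""
    column_defs = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    db.execute(f"CREATE TABLE IF NOT EXISTS {table} (id TEXT PRIMARY KEY, {column_defs})")
    names = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    for row_id in list_ids():
        row = get_row(row_id)
        # Keyed on id, so repeating the sync overwrites rows instead of duplicating them.
        db.execute(
            f"INSERT OR REPLACE INTO {table} (id, {names}) VALUES (?, {marks})",
            (row_id, *(row[name] for name in columns)),
        )
    db.commit()

# Hypothetical stand-ins for the production stores.
mongo_projects = {"p1": {"name": "Website"}, "p2": {"name": "iOS app"}}
redis_counters = {"p1": 1200, "p2": 45}

def list_project_ids():
    return list(mongo_projects)

def get_project_usage(project_id):
    return {
        "name": mongo_projects[project_id]["name"],
        "api_calls": redis_counters.get(project_id, 0),
    }

db = sqlite3.connect(":memory:")
sync_table(
    db,
    "project_api_usage",
    {"name": "TEXT", "api_calls": "INTEGER"},
    list_project_ids,
    get_project_usage,
)
rows = db.execute("SELECT id, name, api_calls FROM project_api_usage ORDER BY id").fetchall()
```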
The goal is to have many simple tables in MySQL, and then have many simple Hydros instances syncing the data into them. We have a half-dozen tables already, and it’s growing quickly.
For example, a Hydros implementation of a “Project API Usage” table might work like this:
list project IDs from Mongo
get each project’s API usage by pulling counters from Redis
The Hydros table gets a list of rows it should have by polling the list function. Then, at a higher frequency, Hydros polls the get function to fill out the columns for each row. You control the refresh time.
Here’s an incomplete implementation of that example:
At Segment we use Hydros to answer a ton of business questions. Combined with Chartio, even the nontechnical people on our team can run queries and dashboard the results.
We have seven tables so far:
Project API Usage (list: Mongo, get: Redis)
Project Integrations (list: Mongo, get: Mongo)
Project Channel Usage (list: Mongo, get: Redis)
Project Library Usage (list: Mongo, get: Redis)
Project Metadata (list: Mongo, get: Mongo)
User Metadata (list: Mongo, get: Mongo)
User Projects (list: Mongo, get: Mongo)
And from that we create 25 charts and tables. Here are some examples:
A table of client libraries, sorted by popularity.
A table of integrations sorted by popularity. Juicy competitor data!
A graph of monthly project cohorts, colored by payment tier.
A graph of monthly project cohorts, colored by client library.
The charts tell us which client libraries and integrations we should focus on, help us estimate future revenue, and even let us prioritize enterprise leads to contact.
Hydros keeps the underlying tables in sync with production Mongo and Redis, completing a sync every 8 hours. Chartio keeps the charts up to date within 30 minutes. This is plenty fast for product analytics!
With Hydros in place, we’re saving a ton of time. Instead of writing piles of janky query code for every business question, we just run SQL queries. Best of all, we get to use business intelligence tools like Chartio right out of the box. We’re actually able to build on our analysis instead of treading water.
If you want to take Hydros for a spin, we open-sourced it. Check it out on GitHub: https://github.com/segmentio/hydros.
Calvin French-Owen on May 9th 2013
Five months ago, we released a small library called Analytics.js by submitting it to Hacker News. A couple hours in it hit the #1 spot, and over the course of the day it grew from just 20 stars to over 1,000. Since then we’ve learned a ton about managing an open-source library, so I wanted to share some of those tips.
At the very beginning, we knew absolutely nothing about managing an open-source library. I don’t think any of us had even been on the merging side of a pull request before. So we had to learn fast.
Since Analytics.js has over 2,000 stars now, lots of people are making amazing contributions from the open-source community. Along the way, we’ve learned a lot about what we can do to keep pull requests top-quality, and how to streamline the development process for contributors.
New contributors will look at your existing codebase to learn how to add functionality to your library. And that’s exactly what they should be doing. Every developer wants to match the structure of a library they contribute to, but don’t own. Your job as a maintainer is to make that as easy as possible.
The trouble starts when your library leaves ambiguity in its source. If you do the same thing two different ways in two different places, how are contributors going to know which way is recommended? Answer: they won’t.
In the worst case they might even decide that because you aren’t consistent, they don’t have to be either!
Solving this takes a lot of discipline and consistency. As a rule, you shouldn’t experiment with different styles inside a single open source repository. If you want to change styles, do it quickly and globally. Otherwise, newcomers won’t be able to differentiate new conventions from ones you abandoned months ago.
We started off being very poorly equipped to handle this. All of our code lived in a single file, and the functions weren’t organized at all. (And if you check out the commits, that was after a cleanup!) We hadn’t taken the time to set a consistent style - the library was a jumble of different conventions.
As the pull requests started coming in, each one conflicted with the others. Everyone was modifying the same parts of the same files and adding their own utility functions wherever they felt like it.
Which leads into my next point…
The initial way we structured our code was leading to loads of problems with merging pull requests: namely we had no structure! One of the big changes we made to fix our structure was moving over to Component.
We love Component because it eliminates the magic from our code and reduces our library’s scope. It lets us use CommonJS, so we can just require the modules we need, right from where we need them. Everything is explicit, which means our code is much easier for newcomers to follow. It’s a maintainer’s dream.
While making the switch, we wrote a bunch of our own components to replace all the utility functions we had been attaching to our global analytics object. And now, since components are easy to include and use everywhere else in the library, pull requesters just use them by default!
As soon as we released the right way and made it clear, pull request quality went up dramatically.
As far as keeping a consistent style goes, you have to be militant when it comes to new code. You cannot be afraid of commenting on pull requests even if it seems like a minor style correction, or refusing requests which needlessly clutter your API.
And remember, that goes for your own code as well! If you get lazy while adding new features, why shouldn’t contributors? The more clean code in your repository, the more good examples you have for newcomers to learn from.
Speaking of not getting lazy…
Having great test coverage is easily the best way to speed up development. We push changes all the time, so we can’t afford to spend time worrying about breaking existing functionality. We write lots of tests, and get lots of benefits: much faster development, more confidence in our own code, more trust from outsiders, and…
It also leads to much higher-quality pull requests!
When developers copy the library coding style, that extends to tests as well. We don’t let contributors add their own code without adding corresponding tests.
The good part is we don’t have to enforce this too much. Thanks to Travis-CI, nearly all of our pull requests come complete with tests patterned from our existing tests. And since we’ve made sure our existing tests are high quality, the copied tests naturally start off at a higher bar.
Travis is so well integrated with GitHub that it will make merging your pull requests significantly easier. Both you and your contributors get notifications when a pull request doesn’t pass all your tests, so they’ll be incentivized to fix a breaking change.
Having a passing Travis badge also inspires trust that your library actually does work correctly and is still maintained. Not to mention that it peer-pressures you into keeping your tests in good condition (which we all know is the hardest part).
Versioning properly, just like testing, takes discipline and is completely worth it. Most developers, when they are having issues, will immediately check to see if they are running the latest version of a library. If not, they’ll update and pray the bug is fixed. Without versioning, you’ll get issues being reported about bugs you’ve already fixed.
When we started, we had no idea how to manage versions at all; our repository was basically just a pile of commits. If you push frequent updates, this will needlessly hassle the developers using your library.
From the start, there are three things your repo needs for versioning:
Readme.md describing what your library does.
Version numbers both in the source and in git tags.
History.md containing versioned and dated descriptions of your changes.
Having a changelog is essential for letting developers track down issues. Any time a developer finds a potential bug, the changelog is the first place they’ll go to see if an upgrade will fix it.
Tagging our repository also turned out to be immensely useful. Each version has a well defined point where the code has stabilized. If you’re using a frontend registry like Component or Bower, you also get the advantage of automatic packaging.
Don’t forget to put the version somewhere in your source too, where the developer using the library can access it. Because when you’re helping someone remotely debug you’ll want to have a quick way to check what version they’re running so you don’t waste time needlessly.
Oh, and use semver. It’s the standard for open-source projects.
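To make the version check concrete, here is a small Python sketch of comparing two semver strings. It is deliberately simplified: it assumes plain MAJOR.MINOR.PATCH versions and ignores the pre-release and build-metadata suffixes that the full semver spec allows. The function names are made up for the example.

```python
def parse_version(version):
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of ints."""
    major, minor, patch = version.split(".")
    return int(major), int(minor), int(patch)

def is_outdated(running, latest):
    """True if the running version is older than the latest release."""
    # Compare numerically: as plain strings, "0.9.10" would sort after "0.10.2".
    return parse_version(running) < parse_version(latest)

def is_breaking_upgrade(running, latest):
    """Under semver, a major version bump signals incompatible API changes."""
    return parse_version(latest)[0] > parse_version(running)[0]
```

When someone reports a bug, a check like is_outdated(their_version, latest_version) is the first question to answer, which is why the version needs to be readable from the source at runtime.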
I’ll let you in on a little secret: we have close to no documentation for people wanting to contribute to the library. (That’s another problem we need to fix.) But I’m amazed at how many pull requests we receive even without any building or testing instructions whatsoever.
Just from looking at our repository, how do new developers know how to build the script?
Considering that most of them have never seen a Component-based repository, my best guess is that they learn through our Makefile. Running make forms a natural entry point to see how a new project works. It’s become the de facto instruction manual for editing code.
We use long flags in our scripts, and give each command a descriptive name. By using Component to structure and build our library, we can essentially defer to their documentation before writing any of our own. Once you understand how Component works, you know how any Component-based repo is built.
We’ve done a lot to clean up our Makefile through the different iterations of our code. Remember, your build and test processes are part of your code as well. They should be clean and readable, as they are the starting point for newcomers.
Notice how all the tips I’ve mentioned are about streamlining your process? That’s because maintaining a popular repository is all about staying above water. You’ll be making lots of little changes throughout the day as new issues are filed, and if you don’t optimize your development process all of your free time will evaporate.
Not only that, but the quality of your library will suffer. Without a good build system, automated testing, and a clean codebase, fixing small bugs becomes a chore, so issues start taking longer and longer to resolve. No one wants that.
We’ve learned a lot when it comes to managing a repo of our size, but we still have a long ways to go in terms of managing a really big project. Managing a large open source project takes a lot of work. As the codebase grows, it will be harder and harder to make major overhauls of the code. More and more people will start depending on it, and the number of pull requests will start growing.
We still have a few more TODOs that will hopefully make maintaining Analytics.js even easier:
We want to split up our tests even more to make them as manageable as possible. Right now the file sizes are getting pretty out of control, which means it’s hard for newcomers to keep everything in their head at once.
Add better contributor documentation. It’s kind of ridiculous that we haven’t done this yet. We’re surprised we’ve even gotten any pull requests at all without it, so this is very high on our list.
Start pull requesting every change we make to the repository ourselves as well. This way we can always peer review each other’s changes, and other contributors can get involved with discussions.
Lots of these tactics come from the Node.js source, which has great guidelines for new contributors.
Every Node.js commit is first pull requested, and reviewed by a core contributor before it is merged. New features are discussed first as issues or pull requests, so multiple opinions are considered. Node commit logs are clean, yet detailed. They have an extensive guide for new contributors, and a linter to serve as a rough style guide.
No project is perfect, but learning these lessons firsthand has helped us adopt better practices across the rest of our libraries as well. Hopefully, they’ll help you too!
Calvin French-Owen on February 4th 2013
It’s been said that “constraints drive creativity.” If that’s true, then PHP is a language which is ripe for creative solutions. I just spent the past week building our PHP library for Segment, and discovered a variety of approaches used to get good performance making server-side requests.
When designing client libraries to send data to our API, one of our top priorities is to make sure that none of our code affects the performance of your core application. That is tricky when you have a single-threaded, “shared-nothing” language like PHP.
To make matters more complicated, hosted PHP installations come in many flavors. If you’re lucky, your hosting provider will let you fork processes, write files, and install your own extensions. If you’re not, you’re sharing an installation with some noisy neighbors and can only upload
Ideally, we like to keep the setup process minimal and address a wide variety of use cases. As long as it runs with PHP (and possibly a common script or two), you should be ready to dive right in.
We ended up experimenting with three main approaches to make requests in PHP. Here’s what we learned.
The top search results for PHP async requests all use the same method: write to a socket and then close it before waiting for a response.
The idea here is that you open a connection to the server and then write to it as soon as it is ready. Since the socket write is fast and you don’t need the response at all, you close the connection immediately post-write. This saves you from waiting on a single round-trip time.
But as you can see from some of the comments on StackOverflow, there’s some debate about what’s actually going on here. It left me wondering: “How asynchronous is the socket approach?”
Here’s what our own code using sockets looks like:
The initial results weren’t promising. A single fsockopen call was taking upwards of 300 milliseconds, and occasionally much longer.
As it turns out, fsockopen is blocking - not very asynchronous at all! To see what’s really going on here, you have to dive into the internals of how fsockopen actually works.
As a refresher, the basic protocol on which the internet runs is called TCP. It ensures that messages between computers are transmitted reliably and get ordered properly. Since nearly all HTTP runs over TCP, we use it for our API to make writing custom clients simple.
Here’s the gist of how a TCP socket gets started:
The client sends a syn message to the server.
The server responds with an
The client sends a final ack message and starts sending data.
For those of you counting, that’s a full roundtrip before we can send data to the server, and before fsockopen will even return. Only once the connection is open can we write our data to the socket. Typically it takes anywhere from 30-100ms to establish a connection to our servers.
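To see the blocking behaviour for yourself, you can time a plain TCP connect. The sketch below is in Python rather than PHP, and it connects to a throwaway local listener so that it runs without network access. The helper name is made up. Against a remote host, the measured time should be roughly one network round trip, since the call only returns once the handshake has completed.

```python
import socket
import time

def time_connect(host, port, timeout=5.0):
    """Return (socket, seconds spent blocked in connect)."""
    start = time.perf_counter()
    # create_connection returns only after the TCP handshake has completed.
    sock = socket.create_connection((host, port), timeout=timeout)
    return sock, time.perf_counter() - start

# A throwaway local listener so the sketch runs offline.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
host, port = listener.getsockname()

sock, elapsed = time_connect(host, port)
print(f"connect blocked for {elapsed * 1000:.2f} ms")
sock.close()
listener.close()
```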
While TCP connections are relatively fast, the chief culprit here is the extra handshake required for SSL.
The SSL implementation works on top of TCP. After the TCP handshake happens, the client then begins a TLS handshake.
It ends up being 3 round trips to establish an SSL connection, not to mention the time required to set up the public key encryption.
SSL connections in the browser can avoid some of these round-trips by reusing a shared secret which has been agreed upon by client and server. Since normal sockets aren’t shared between PHP executions, we have to use a fresh socket each time and can’t re-use the secret!
It’s possible to use socket_set_nonblock to create a “non-blocking” socket. This won’t block on the open call but you’ll still have to wait before writing to it. Unless you’re able to schedule time-intensive work in between opening the socket and writing data, your page load will still be slowed by
A better approach is to open a persistent socket using pfsockopen. This re-uses earlier socket connections made by the PHP process, so it doesn’t require a TCP handshake each time. Though latency is higher the first time a request is made, I was able to send over 1000 events/sec from my development machine. Additionally, we can decide to read from the response buffer when debugging, or choose to ignore it in production.
To sum it up:
Sockets can still be used when the daemon running PHP has limited privileges.
fsockopen is blocking and even non-blocking sockets must wait before writing.
Using SSL creates significant slowdown due to extra round-trips and crypto setup.
Opening a connection sets every page request back
pfsockopen will block the first time, but can re-use earlier connections without a handshake.
Sockets are great if you don’t have access to other parts of the machine, but an approach which will give you better performance is to log all of the events to a file. This log file can then be processed “out of band” by a worker process or a cron job.
The file-based approach has the advantage of minimizing outbound requests to the API. Instead of making a request whenever we call identify from our PHP code, our worker process can make requests for 100 events at a time.
Another advantage is that a PHP process can log to a file relatively quickly, processing a write in only a few milliseconds. Once PHP has opened the file handle, appending to it with fwrite is a simple task. The log file essentially acts as the “shared memory queue” which is difficult to achieve in pure PHP.
To read from the analytics log file, we wrote a simple python uploader which uses our batching analytics-python library. To ensure that the log files don’t get too large, the uploading script renames the file atomically. PHP processes that are actively writing can still write to their existing file handles in memory, and new requests create a new log file where the old one used to be.
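Here is a minimal Python sketch of both halves of that flow: appending events to a log file, and a worker that claims the file with an atomic rename before reading it back in batches of 100. It is not the actual uploader. The file location, the function names, and the JSON-lines format are assumptions for illustration, and a real worker would send each batch to the API instead of just yielding it.

```python
import json
import os
import tempfile

LOG_PATH = os.path.join(tempfile.mkdtemp(), "analytics.log")  # hypothetical location
BATCH_SIZE = 100

def log_event(event, path=LOG_PATH):
    """Request side: append one JSON line; no network call is made."""
    with open(path, "a") as log:
        log.write(json.dumps(event) + "\n")

def drain_batches(path=LOG_PATH, batch_size=BATCH_SIZE):
    """Worker side: atomically move the log aside, then yield it in batches."""
    claimed = path + ".uploading"
    try:
        os.rename(path, claimed)  # atomic on POSIX within one filesystem
    except FileNotFoundError:
        return  # nothing has been logged since the last run
    with open(claimed) as log:
        events = [json.loads(line) for line in log]
    os.remove(claimed)
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]

for i in range(250):
    log_event({"action": "identify", "user_id": i})

batch_sizes = [len(batch) for batch in drain_batches()]
```

After the rename, the next call to log_event simply creates a fresh file at the original path, which is what keeps any single log file from growing without bound.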
There’s not too much magic to this approach. It does require a bit more work on the side of the developer to set up the cron job and separately install our python library through PyPI. The key takeaways are:
Writing to a file is fast and takes few system resources.
Logging requires some drive space, and the daemon must have capabilities to write to the file.
You must run a worker process to process the logged messages out of band.
As a last alternative, your server can run exec to make requests using a forked curl process. The curl request can complete as part of a separate process, allowing your PHP code to render without blocking on the socket connection.
In terms of performance, forking a process sits between our two earlier approaches. It is much quicker than opening a socket, but more resource intensive than opening a handle to a file.
To execute the forked curl process, our condensed code looks like this:
If we’re running in production mode, we want to make sure that we aren’t waiting on the forked process for output. That’s why we add "> /dev/null 2>&1 &" to our command, to ensure the process gets properly forked and doesn’t log anywhere.
The equivalent shell command looks like this:
It takes a little over 1ms to fork the process, which then uses around 4k of resident memory. While the curl process takes the standard SSL 300ms to make the request, the exec call can return to the PHP script right away! This lets us serve up the page to our clients much more quickly.
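The same fire-and-forget idea, sketched in Python instead of PHP: build a curl command, start it with its output discarded, and return without waiting for it to finish. The endpoint URL and payload shape are placeholders rather than the real API, and the helper names are made up. The spawn call is left commented out so the sketch runs on machines without curl.

```python
import json
import shlex
import subprocess

def build_curl_command(url, payload):
    """Build the curl argv for a JSON POST (URL and payload shape are placeholders)."""
    return [
        "curl", "--silent", "--request", "POST",
        "--header", "Content-Type: application/json",
        "--data", json.dumps(payload),
        url,
    ]

def fire_and_forget(argv):
    """Start the process with output discarded and return without waiting."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,  # like "> /dev/null"
        stderr=subprocess.DEVNULL,  # stderr is discarded too, like "2>&1"
        start_new_session=True,     # detach from the parent's session (POSIX)
    )

argv = build_curl_command(
    "https://api.example.com/v1/import",  # hypothetical endpoint
    {"batch": [{"action": "track", "event": "Signed Up"}]},
)
print(shlex.join(argv))  # the equivalent shell command, safely quoted
# fire_and_forget(argv)  # uncomment on a machine with curl installed
```

Passing the arguments as a list avoids shell quoting problems with the JSON body, which is the main thing to watch for when building the command as a single string.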
On my moderately sized machine, I can fork around
curl requests per second without them stacking up in memory. Without SSL, it can do significantly more:
Forking a process without waiting for the output is fast.
curl takes the same time to make a request as a socket, but it is processed out of band.
Forking curl requires only normal unix primitives.
Forking sets a single request back only a few milliseconds, but many concurrent forks will start to slow your servers.
While not an approach to making async requests, we found that destructor functions help us batch API requests.
To reduce the number of requests we make to our API, we want to queue these requests in memory and then batch them to the API. Without using runtime extensions, this can only happen on a single script execution of PHP.
To do this we create a queue on initialization. When the script ends its execution, we send all the queued requests in batch:
We establish the queue when the object is created, and then flush the queue when the object is ready to be destroyed. This guarantees that our queue is only flushed once per request.
Additionally, we can create the socket itself in a non-blocking way in the constructor, then attempt to write to it in the destructor. This gives the connection more time to be established while the PHP interpreter is busy trying to render the page - but we will still have to wait before actually writing to the socket.
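The queue-then-flush pattern looks roughly like this in Python, where `__del__` is the closest analogue to a PHP destructor. The class and method names are illustrative, not our library's API, and the transport is injected as a plain callable so the sketch stays self-contained.

```python
class BatchingClient:
    """Queue calls in memory and send them as one batch when the object goes away."""

    def __init__(self, send_batch):
        self._send_batch = send_batch  # callable that performs the actual request
        self._queue = []

    def track(self, user_id, event):
        # Nothing is sent here; the call is only recorded.
        self._queue.append({"action": "track", "user_id": user_id, "event": event})

    def flush(self):
        if self._queue:
            batch, self._queue = self._queue, []
            self._send_batch(batch)

    def __del__(self):
        # Runs when the object is destroyed, so the queue is flushed exactly once.
        self.flush()

sent = []  # stands in for the API: each entry is one batched request
client = BatchingClient(sent.append)
client.track("u1", "Signed Up")
client.track("u1", "Created Project")
client.track("u2", "Signed Up")
del client  # in CPython, dropping the last reference runs __del__ immediately
```

Three calls result in a single batched request. Exposing flush separately is a design choice worth keeping: it lets long-running scripts send their queue early instead of waiting for teardown.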
Our holy grail is a pure-PHP implementation which doesn’t interface with other processes, yet still is conservative when it comes to making requests. We’d like to make developer setup as easy as possible without requiring a dedicated queue on a separate host.
In practice, this is extremely hard to achieve. Each one of our methods has caveats and restrictions depending on how much traffic you’re dealing with and what your system allows you to do. Since no single approach can cover every use case, we built different adapters to support different users with different needs.
Originally we used the curl forking approach as our default. Forking a process doesn’t cause a significant performance hit for page load, and is still able to scale out to many requests per second per host. However, this is limited to the configuration of the host, and can have scary consequences if your PHP program starts forking too many processes at once.
After switching to persistent sockets, we decided to make the socket approach our default. Without a TCP handshake on every request, the sockets can deal with thousands of requests per second. This approach also has significantly better portability than the
For really high traffic clients who have a bit more control over their own hardware, we still support the log file system. If the process which actually serves the PHP can’t re-use socket connections, then this is the best option from a performance perspective.
Ultimately, it comes down to knowing a little bit about the limitations of your system and its load profile. It’s all about determining which trade-offs you’re comfortable making.
Edit 2/6/13: Originally I had stated that we used the curl forking approach for our default and had ignored persistent sockets altogether. After switching to persistent sockets, the performance of the socket approach increased enough to make it our default approach. It also has better portability across PHP installations.
PS. If you’re running WordPress, we also released a WordPress plugin that handles everything for you!