Segment loads billions of rows of arbitrary events into our customers’ data warehouses every single day. How do we test a change that can corrupt only one field in millions, across thousands of warehouses? How can we verify the output when we don’t even control the input?
We recently migrated all our customers to a new data pipeline without any major customer impact. Through this process we had to solidify our testing story, to the point where we were able to compare billions of entries for row-for-row parity. In this post, I’d like to share a few of the techniques we used to make that process both fast and efficient.
Before going into our testing strategy, it’s worth understanding a little about the data we process. Segment has a single API for sending data that will be loaded into a customer’s data warehouse. Each API call generates what we call an ‘event’. We transform these events into rows in a database.
Below, you can see an example of how an event from a website source gets transformed into a database row.
Note that the information in the event depends on the customer. We accept any number of properties and any type of value: number, string, timestamp, etc.
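To make the transformation concrete, here is a minimal sketch in Go of flattening a nested event into a single-level row. The column naming (nested keys joined with an underscore) and the `flatten` helper are assumptions for illustration, not the pipeline's actual mapping rules.

```go
package main

import (
	"fmt"
	"sort"
)

// flatten copies every leaf value of a nested event into row.
// Nested keys are joined with "_" (an assumed naming convention).
func flatten(prefix string, event map[string]interface{}, row map[string]interface{}) {
	for key, value := range event {
		column := key
		if prefix != "" {
			column = prefix + "_" + key
		}
		if nested, ok := value.(map[string]interface{}); ok {
			flatten(column, nested, row)
			continue
		}
		row[column] = value
	}
}

func main() {
	// A made-up event; customers can send any properties of any type.
	event := map[string]interface{}{
		"type":   "track",
		"event":  "Order Completed",
		"userId": "user-123",
		"properties": map[string]interface{}{
			"total":    42.5,
			"currency": "USD",
		},
	}

	row := map[string]interface{}{}
	flatten("", event, row)

	// Print columns in a stable order.
	columns := make([]string, 0, len(row))
	for column := range row {
		columns = append(columns, column)
	}
	sort.Strings(columns)
	for _, column := range columns {
		fmt.Printf("%s = %v\n", column, row[column])
	}
}
```

Because the set of properties is open-ended, the set of columns is only known once the data arrives, which is a large part of why the output is hard to test exhaustively.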
Our first iteration of the warehouse pipeline (v1) was just a scheduler and a task to process the data. This worked well until our customer base increased in both number and volume. At that point the simplicity caught up with us and we had to make a change.
We came up with a new architecture (v2) that gave us greater visibility into our pipeline and let us scale at more granular levels. Though this was the right way forward, it was a completely new pipeline, and we needed a migration plan.
Our goal was to switch customers to the new pipeline without them noticing. All data should be written to the database exactly the same way in the v1 pipeline as in the v2 pipeline. If we found a bug in how we processed data in v1, we kept that bug in v2. Customers expect to receive our data in a certain way and have built tooling around it. We can’t just change course and expect all our customers to change their tooling accordingly.
We also want to minimize any potential for new bugs. When we have a bug in how we load data, deploying a fix only fixes future data. Previous data has already been loaded into a customer’s database, which we don’t control. To fix the bad rows, we have to re-run the pipeline over the old events that were incorrectly loaded.
In other parts of the Segment infrastructure, our normal testing strategies consist of some combination of code reviews and tests.
Code reviews are very useful but don’t guarantee that something won’t slip through the cracks. Unit and integration tests are great for testing basic functionality and edge cases. However, the data customers send us varies too much for us to exhaustively test, or even know about, every case.
For our first pass at testing the new pipeline, we resorted to manually sending events and looking for any data anomalies. But this quickly fell short. As we started processing more and more data through our pipeline, we realized we needed a new solution.
Large Scale Testing
We needed a testing solution that gave us confidence the v2 pipeline would load real customer data exactly as v1 did.
To solve this problem, our approach was straightforward: build a system to do what we were trying to do manually. In essence, we wanted to run the same dataset through both the v1 and v2 warehouse pipelines and make sure the result from both pipelines is exactly the same. We call this service Warehouse QA.
When a pull request is ready for testing, we post a request on it. That triggers a webhook, which begins the QA run.
Let’s walk through a concrete example of how this works.
Step 1: Send a Request from GitHub
We trigger a QA request by adding a comment on a pull request in a format the service understands. We can request a specific dataset to run through the pipelines to verify if a bug is fixed or test the proposed changes under a certain type of load. If a dataset is not specified, we use one of our defaults.
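As an illustration of what parsing such a comment might involve, here is a small Go sketch. The `/qa [dataset]` comment format, the `parseQAComment` function, and the `default` dataset name are all hypothetical; the real format the service understands is not shown in this post.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultDataset is a hypothetical name for the fallback dataset.
const defaultDataset = "default"

// parseQAComment recognizes comments of the assumed form "/qa [dataset]".
// It returns the dataset to run and whether the comment was a QA request.
func parseQAComment(comment string) (dataset string, ok bool) {
	fields := strings.Fields(comment)
	if len(fields) == 0 || fields[0] != "/qa" {
		return "", false // an ordinary review comment, not a QA request
	}
	if len(fields) > 1 {
		return fields[1], true // a specific dataset was requested
	}
	return defaultDataset, true
}

func main() {
	for _, comment := range []string{"/qa pre-epoch-timestamps", "/qa", "Looks good to me"} {
		dataset, ok := parseQAComment(comment)
		fmt.Printf("%q -> dataset=%q request=%t\n", comment, dataset, ok)
	}
}
```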
Step 2: Process the Request
Once the service receives a new request, it starts runs of both the v1 and v2 pipelines over the chosen dataset. Each pipeline writes data under a unique schema in our test warehouse so we can easily query the results.
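One simple way to keep the two runs separate is to derive the schema name from the run and the pipeline version. The sketch below assumes a hypothetical `qa_<run>_<pipeline>` naming scheme and a hypothetical `schemaName` helper; it only shows the idea of giving every run its own namespace in the test warehouse.

```go
package main

import (
	"fmt"
	"strings"
)

// schemaName builds a per-run, per-pipeline schema name such as
// "qa_pr_123_v1". Characters other than a-z, 0-9, and "_" are replaced
// with "_" so the result is safe to use as an unquoted SQL identifier.
// The naming scheme itself is an assumption for illustration.
func schemaName(runID, pipeline string) string {
	sanitize := func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z', r >= '0' && r <= '9', r == '_':
			return r
		default:
			return '_'
		}
	}
	return "qa_" + strings.Map(sanitize, strings.ToLower(runID+"_"+pipeline))
}

func main() {
	for _, pipeline := range []string{"v1", "v2"} {
		fmt.Println(schemaName("PR-123", pipeline))
	}
}
```

With both outputs in the same warehouse under different schemas, the audit step can compare them with ordinary queries.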
Step 3: Audit Results
The most important step is the validation or audit. We check every table, column, and value that was loaded from the v2 pipeline and make sure it matches what was loaded from the v1 pipeline.
Here is the struct that represents the results for a given run. We first check that the exact same set of tables were created by both pipelines.
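As a rough sketch of the table-set check described here, the following Go code compares the table names produced by each run. The `TableSetDiff` type, its fields, and `compareTableSets` are hypothetical names chosen for illustration and are not the service's actual struct.

```go
package main

import (
	"fmt"
	"sort"
)

// TableSetDiff is a hypothetical result type for the table-level check.
type TableSetDiff struct {
	MissingInV2 []string // tables v1 created that v2 did not
	ExtraInV2   []string // tables v2 created that v1 did not
}

// Identical reports whether both runs created exactly the same tables.
func (d TableSetDiff) Identical() bool {
	return len(d.MissingInV2) == 0 && len(d.ExtraInV2) == 0
}

// compareTableSets diffs the table names produced by the two pipelines.
func compareTableSets(v1Tables, v2Tables []string) TableSetDiff {
	inV1 := make(map[string]bool, len(v1Tables))
	for _, table := range v1Tables {
		inV1[table] = true
	}
	inV2 := make(map[string]bool, len(v2Tables))
	for _, table := range v2Tables {
		inV2[table] = true
	}

	var diff TableSetDiff
	for table := range inV1 {
		if !inV2[table] {
			diff.MissingInV2 = append(diff.MissingInV2, table)
		}
	}
	for table := range inV2 {
		if !inV1[table] {
			diff.ExtraInV2 = append(diff.ExtraInV2, table)
		}
	}
	// Sort so reports are stable from run to run.
	sort.Strings(diff.MissingInV2)
	sort.Strings(diff.ExtraInV2)
	return diff
}

func main() {
	v1 := []string{"pages", "tracks", "users"}
	v2 := []string{"pages", "tracks", "order_completed"}

	diff := compareTableSets(v1, v2)
	fmt.Println("identical:", diff.Identical())
	fmt.Println("missing in v2:", diff.MissingInV2)
	fmt.Println("extra in v2:", diff.ExtraInV2)
}
```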
For each table, we then dive deeper to populate the fields in the table struct below. This struct compares a given table from the v1 run to the table created by the v2 run.
Note that we are checking more than just the counts of the two runs. Counts can give false positives if there are both extra rows and missing rows. Counts also don’t find differences in the field values themselves. This piece is critical to check since we have to be sure we aren’t processing data differently in the new pipeline.
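The following Go sketch shows why counts alone are not enough. It assumes rows are keyed by an id and values are compared as strings; the `Row`, `ValueDiff`, and `TableDiff` types and the `compareTable` function are hypothetical, and a real audit would run as queries against the warehouse rather than in memory.

```go
package main

import (
	"fmt"
	"sort"
)

// Row maps a column name to its loaded value, rendered as a string.
type Row map[string]string

// ValueDiff records one column whose value differs between the runs.
// An absent column is reported as an empty string.
type ValueDiff struct {
	ID, Column, V1, V2 string
}

// TableDiff is a hypothetical per-table audit result.
type TableDiff struct {
	MissingInV2 []string    // row ids only v1 loaded
	ExtraInV2   []string    // row ids only v2 loaded
	Differing   []ValueDiff // same id, different value
}

// compareTable diffs two loads of the same table, keyed by row id.
func compareTable(v1, v2 map[string]Row) TableDiff {
	var diff TableDiff
	for id, r1 := range v1 {
		r2, ok := v2[id]
		if !ok {
			diff.MissingInV2 = append(diff.MissingInV2, id)
			continue
		}
		// Check the union of columns so a dropped column is also caught.
		columns := map[string]bool{}
		for column := range r1 {
			columns[column] = true
		}
		for column := range r2 {
			columns[column] = true
		}
		for column := range columns {
			a, inV1 := r1[column]
			b, inV2 := r2[column]
			if inV1 != inV2 || a != b {
				diff.Differing = append(diff.Differing, ValueDiff{ID: id, Column: column, V1: a, V2: b})
			}
		}
	}
	for id := range v2 {
		if _, ok := v1[id]; !ok {
			diff.ExtraInV2 = append(diff.ExtraInV2, id)
		}
	}
	sort.Strings(diff.MissingInV2)
	sort.Strings(diff.ExtraInV2)
	sort.Slice(diff.Differing, func(i, j int) bool {
		a, b := diff.Differing[i], diff.Differing[j]
		if a.ID != b.ID {
			return a.ID < b.ID
		}
		return a.Column < b.Column
	})
	return diff
}

func main() {
	v1 := map[string]Row{"a": {"received_at": "2017-01-01"}, "b": {"received_at": "2017-01-02"}}
	v2 := map[string]Row{"a": {"received_at": "2017-01-03"}, "c": {"received_at": "2017-01-02"}}

	diff := compareTable(v1, v2)
	fmt.Println("row counts match:", len(v1) == len(v2))
	fmt.Printf("missing=%v extra=%v differing=%v\n", diff.MissingInV2, diff.ExtraInV2, diff.Differing)
}
```

In this example both runs load two rows, so a count check passes, yet one row is missing, one is extra, and one has a different value.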
Step 4: Reporting
Now that we’ve compared what we loaded from the v1 run with the v2 run, we need to report the results succinctly. When the run request is complete, we post the following overview on the pull request:
If we open the results file under detailed results above, this is what we see for each table that was identical under v1 and v2:
If the pipelines produced different results, it looks like this:
Step 5: Debug Differing Results
At this point, we have all the information we need to figure out why the two pipelines wrote different values.
Below is a comparison of data that differed between the v1 and v2 pipelines, taken from the results of a QA run. Each row corresponds to a row in our test warehouse. The fields are the id of the row in the table, the column that differed in the two pipelines, and the differing values, respectively. The red value is the value from the v1 pipeline and the green from v2.
Why are timestamps getting dropped in v1 here? We knew that not all timestamps were getting dropped, and confirmed this with tests. It turns out that dates before the epoch (January 1st, 1970) were getting dropped: we converted timestamps to integers using Go’s .Unix() method, which returns a negative number for any time before the epoch, and then dropped values <= 0.
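The behavior is easy to reproduce. The Go sketch below mimics the conversion as described; the `convertV1` function and the sample timestamps are hypothetical, and the only library behavior relied on is that `time.Time.Unix()` returns seconds since January 1st, 1970 UTC.

```go
package main

import (
	"fmt"
	"time"
)

// convertV1 parses an RFC 3339 timestamp and mimics the v1 behavior
// described above: the Unix value is kept only if it is greater than 0.
func convertV1(raw string) (secs int64, kept bool, err error) {
	t, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return 0, false, err
	}
	secs = t.Unix() // negative for any time before the epoch
	return secs, secs > 0, nil
}

func main() {
	samples := []string{
		"1969-07-20T20:17:00Z", // before the epoch
		"1970-01-01T00:00:00Z", // exactly the epoch
		"2017-01-01T00:00:00Z", // after the epoch
	}
	for _, raw := range samples {
		secs, kept, err := convertV1(raw)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s -> unix=%d kept=%t\n", raw, secs, kept)
	}
}
```

A check like this treats every valid pre-epoch date, such as a birthday in the 1960s, the same as a missing value.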
Now that we have found the root cause, we can alter v2 accordingly. We then run QA again with the fix, and see that it passes.
We found this tool so valuable that we still use it today, even though the migration to the v2 pipeline is complete. The QA system now compares the code running in production to the code on a pull request. We’ve even integrated with GitHub status checks. A QA run for each warehouse type we support is automatically run on every pull request.
Warehouse QA was crucial to our successful migration. It allowed us to test the unknown at scale and find any and all differences without any manual intervention. It continues to allow us to deploy quickly and confidently with a small team by thoroughly testing hundreds of thousands of events.