Have you ever had that feeling of dread when making big changes to that mission-critical service sitting in the corner which no one wants to touch? At Segment, ours was the authZ (authorization) service, and it had not received much love in recent years. History has proven time and again that changes to the authZ service often led to engineering incidents. Worse yet, some of these changes resulted in security vulnerabilities! The stress level was too much to bear.
What if I were to tell you there was a way to apply the changes safely? A way to make sweeping changes, deploy to production, and avoid high stress levels. Sounds too good to be true? Not for the teams here at Segment! I will share our experiences and help you address the root cause of that fear and make future changes as stress-free and secure as possible.
The impetus for change
Segment utilizes role-based access control (RBAC) to determine who can access what resources at any given time. Our original implementation of RBAC was a service written in the Go programming language. The performant Go runtime coupled with RBAC’s flexibility served us well for years. But there was a problem: developers had been avoiding the service for years. Few updates had been made, and large changes were frowned upon.
At some point, the team that created the service moved on and the ownership fell to our shoulders. We were happy to inherit such an important service. But our engineering organization primarily dealt with TypeScript and Node.js servers. Lacking expertise in the Go stack made us shy away from making any big changes. Not to mention any mistake could lead to severe security vulnerabilities. Worst of all, there were very few tests! We had no way of knowing if our change broke anything. Eventually everyone adopted the “if it ain’t broke, don’t fix it” mentality. The authZ service had become that service sitting in the corner.
Then came the Organizations project, which introduced a top-level management entity that allows customers to group Segment workspaces, users, and subresources together. The project required fundamental changes to our authorization service. To move forward, we had to make significant updates and tackle the following challenges head on:
Developers were unwilling to make changes unless absolutely necessary
Changes often caused authorization issues which led to security vulnerabilities
Debugging those issues was difficult due to a lack of observability and instrumentation
Tackling the challenges
We determined that delivering the Organizations project required significant changes to the authZ service. This allowed us to prioritize something we had wanted to do for a long time: rewrite it in the TypeScript/Node.js stack. It would bring the codebase into our comfort zone and encourage developers to make changes instead of avoiding them. This would open the door to increasing test coverage, improving observability, and making other enhancements to the service.
After some discussions on whether to take the plunge, we finalized our plan to tackle the challenges:
The existing service would be rewritten in TypeScript and served on Node.js servers
A test harness with all possible permutations of permission checks would be created and deployed to staging and production. Scheduled jobs would execute these tests, ensuring no access control rules were violated
Logging would be enhanced so that each action performed was recorded with the data needed to debug permission-related issues
The safe and secure way
Broken Access Control (BAC) is the number one security risk in the OWASP Top Ten today. Migrating our authorization service while keeping Segment’s application (app.segment.com) running could create plenty of opportunities to introduce BAC-related bugs. We needed to avoid introducing security vulnerabilities to maintain customer trust.
Luckily, we have the perfect tool for the job here at Segment.
Our strategy was to break the migration into three distinct phases. Phase one takes a page out of the book Working Effectively with Legacy Code by focusing on building a test harness and implementing the new service. Phase two involves side-by-side execution of both versions of the service and comparison of the results. The last phase utilizes the side-by-side setup from phase two to safely shift traffic to the new service. Here are the detailed steps for each phase:
Build a test harness against the legacy service.
Implement the new service.
Run the tests against the new service, analyze test results and fix all the issues.
Repeat the previous step until the test harness yields identical results for both services.
Run both the legacy and the new service side by side.
Put a service facade in place to intercept requests and send them to both services. For every request, the facade would still return the response from the legacy service.
Compare responses from both services. For requests resulting in differences, collect metrics and log the parameters. All of this would be done in the background without impacting the response time of the permission check.
Analyze the differences and fix all the issues.
Repeat the previous step until we have 100% parity on all the responses.
Switch to return the response from the new service.
Keep the legacy service running and still compare the results for just a little bit longer.
After a period of time, if there are no surprises, shut down the legacy service and celebrate!
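The facade at the heart of phases two and three can be sketched roughly as follows. This is a minimal illustration, not Segment's actual code: the `PermissionCheck` shape, the service interface, and the logging calls are all hypothetical.

```typescript
// Hypothetical request shape and service interface for illustration.
interface PermissionCheck {
  subject: string;
  role: string;
  resource: string;
  action: string;
}

interface AuthZService {
  isAllowed(check: PermissionCheck): Promise<boolean>;
}

class AuthZFacade {
  constructor(
    private legacy: AuthZService,
    private next: AuthZService,
  ) {}

  async isAllowed(check: PermissionCheck): Promise<boolean> {
    const legacyResult = await this.legacy.isAllowed(check);

    // Compare in the background so the caller's latency is unaffected,
    // and never let a failure in the new service break the request.
    void this.next
      .isAllowed(check)
      .then((nextResult) => {
        if (nextResult !== legacyResult) {
          console.warn('authz.diff', { ...check, legacyResult, nextResult });
        }
      })
      .catch((err) => console.error('authz.compare_failed', err));

    // Until the final switch, the legacy result remains authoritative.
    return legacyResult;
  }
}
```

The key property is that the comparison is fire-and-forget: the new service is exercised on real traffic, but a wrong or slow answer from it can never reach a customer.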
Details of our migration
The test harness and the rewrite
With this strategy as our guide, we started with the steps outlined in phase one. We broke up the tasks into two parallel workstreams:
Developing a set of integration tests covering all possible permutations of the permission checks, built against the legacy Go service.
Rewriting the service in TypeScript.
Due to the large number of test cases we had to cover, the implementation of the harness was further divided into two distinct sets. The first set covered the core permission check modules: combinations of different roles, resources, actions, and subjects were enumerated as test cases. The second set consisted of tests on endpoints requiring permission checks. Once completed, these would guard against any access control violations going forward.
During the rewrite, we ran the tests against the new AuthZ service whenever possible. This way, we were able to detect and fix issues as early as we could. The process was repeated until all test cases had passed.
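Enumerating every permutation sounds daunting, but the generation itself is just a cartesian product. Here is a sketch under assumed role, resource, and action values; the real dimensions and expected outcomes would come from the actual RBAC model.

```typescript
// Illustrative dimensions only -- not Segment's real RBAC vocabulary.
const roles = ['owner', 'admin', 'member'] as const;
const resources = ['workspace', 'source', 'destination'] as const;
const actions = ['read', 'write', 'delete'] as const;

interface TestCase {
  role: string;
  resource: string;
  action: string;
}

// Enumerate the full cartesian product of roles x resources x actions.
function enumerateCases(): TestCase[] {
  const cases: TestCase[] = [];
  for (const role of roles) {
    for (const resource of resources) {
      for (const action of actions) {
        cases.push({ role, resource, action });
      }
    }
  }
  return cases;
}

// Each generated case would then be run against the service under test,
// comparing its answer to the expected outcome recorded from the legacy
// service, e.g.:
//   for (const c of enumerateCases()) {
//     expect(await authz.isAllowed(c)).toBe(expectedOutcome(c));
//   }
```

Because the cases are generated rather than hand-written, adding a new role or resource automatically extends the harness to every combination it participates in.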
Safely conducting a trial by fire
Once the new authZ service was ready, we ran it side-by-side with the legacy version. Below is a detailed diagram outlining our setup:
With this setup, both services handled the permission checks and the results could be compared immediately. Differences were logged along with the data required for analysis. For example, in each entry, we would log the subject, the role, the resource, the action, and the outcome indicating permission denied or granted. Last but not least, we continued to rely on the results from the legacy service, thus preventing any incorrect responses from the new version from causing security vulnerabilities.
Since we were also migrating the database, we needed a way to keep the data in sync while both versions were running. Luckily, both databases were based on Amazon RDS, and the AWS Database Migration Service (DMS) was the perfect solution to synchronize live data. We implemented a one-directional data flow from the legacy database to the new one. Keep in mind we were still in the testing phase, so the policy data in the legacy database remained the source of truth and would override the copy in the new one.
Both services and the data sync job ran for a month. During that time, we had monitors in Datadog looking for differences in permission checks. Upon detection, we were alerted immediately and then went through a round of analysis and fixes. This cycle was repeated until all permission checks yielded identical results.
Flipping the switch
At this point, we had reached 100% parity between the legacy Go service and the new TypeScript incarnation. We felt confident to complete the switch. The authorization service facade was updated to respond with the results from the new service and the data sync task in AWS DMS was turned off.
However, the legacy service was still running, and we were still monitoring the results for differences. The whole setup remained in this state for another month. This gave us an escape hatch in case we encountered any serious BAC-related vulnerability.
At last, we had reached our destination. While there were a couple of performance-related sevs along the journey, we did not encounter any security incidents. We achieved a perfect safety record by following the steps meticulously. At the one-month mark, the legacy service was sunsetted and we headed out for a celebration!
Lessons we’ve learned during the migration
The temptation to refactor while rewriting the service
Many of us saw this migration as an opportunity to refactor and improve the legacy authZ service. There were some structural issues and bugs we were itching to address. At the same time, we were also tasked to deliver within a certain timeframe.
After much discussion, the team leaned toward minimal changes, because we knew any differences could potentially lead to issues needing investigation. A direct rewrite from Go to TypeScript was the way to go. This maximized the time available for stabilizing the new service and reducing possible vulnerabilities. The decision paid off big time: we were able to finish the new service implementation quickly, leaving us plenty of time to debug differences between the new and old systems. Similar code flows in both versions of the authZ service also helped us quickly locate the root cause of bugs. The end result was a successful migration with no serious security incidents!
Perplexing intermittent permission diffs that are not reproducible!
Replication lag between primary and secondary database instances
During the side-by-side comparison phase, while the legacy service was still serving the responses, we started noticing failing tests and monitors whenever policies were updated. To make matters worse, we were not able to reproduce the differences when rerunning the permission check with the exact same set of parameters!
It took us a long time to figure out that database replication lag was the culprit. Prior to the migration, the legacy authZ service ran on a single database instance, so policy changes were reflected instantaneously in permission check results. The new service used a cluster of database instances to improve load balancing. The replication lag from the read/write primary to the read-only instances meant the policies were out of sync for a short period of time. If any permission checks were conducted during that window, the results would differ.
The problem presented a unique challenge: we had to find a balance between consistency and load balancing. So we ran a benchmark to gain visibility into the load of serving permission checks from the primary only. To our pleasure, we found it resulted in only a small single-digit percentage increase in CPU usage. The fix was then easy: always use the primary database instance for permission checks. This solution gave us immediate consistency without impacting overall performance. Problem solved, and everyone was happy!
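The routing decision boils down to classifying reads by their consistency needs. A minimal sketch, with hypothetical connection and store types, might look like this:

```typescript
// Hypothetical connection interface for illustration.
interface DbConnection {
  query(sql: string, params: unknown[]): Promise<unknown[]>;
}

class PolicyStore {
  constructor(
    private primary: DbConnection,
    private replicas: DbConnection[],
  ) {}

  // Permission checks always hit the primary so freshly written
  // policies are visible immediately (strong consistency).
  readForPermissionCheck(sql: string, params: unknown[]): Promise<unknown[]> {
    return this.primary.query(sql, params);
  }

  // Traffic that tolerates eventual consistency (dashboards, listings)
  // can still be load-balanced across the read replicas.
  readEventuallyConsistent(sql: string, params: unknown[]): Promise<unknown[]> {
    const replica =
      this.replicas[Math.floor(Math.random() * this.replicas.length)];
    return (replica ?? this.primary).query(sql, params);
  }
}
```

The trade-off is explicit in the API: callers choose a consistency level, and only the latency-and-correctness-critical permission path pays the cost of reading from the primary.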
Data sync delay caused by the AWS Database Migration Service (DMS)
After switching to reading from the primary, we saw a reduction in diffs. But diffs were still happening! Once again, we noticed the permission check diffs only occurred when updating policies. The cause this time lay with AWS DMS. Because we relied on the legacy service to update policies, the data needed time to sync between the two databases. If a permission check was done before the synchronization completed, the results would differ between the two services.
To mitigate this delay, the team came up with an ingenious solution: a delayed retry!
We calculated the average time needed for synchronization to complete. Whenever a diff was encountered, the code would wait a bit longer than that average before checking again. If the second check had the same result as the old service, we knew the diff was caused by data replication delay, and it was considered a non-issue.
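The delayed retry can be sketched as a small classifier. The delay constant and function names here are hypothetical; in practice the delay would be derived from the measured average DMS sync time.

```typescript
// Hypothetical: a bit above the measured average DMS sync time.
const SYNC_DELAY_MS = 2000;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Re-run the check on the new service after waiting out the expected
// sync window. If the retried result now matches the legacy service,
// the original diff was just replication delay, not a real bug.
async function classifyDiff(
  recheck: () => Promise<boolean>,
  legacyResult: boolean,
  delayMs: number = SYNC_DELAY_MS,
): Promise<'replication-lag' | 'real-diff'> {
  await sleep(delayMs);
  const retried = await recheck();
  return retried === legacyResult ? 'replication-lag' : 'real-diff';
}
```

Diffs classified as `replication-lag` can be dropped automatically, so only genuine mismatches reach a human.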
With this additional logic, we eliminated a whole class of errors needing to be investigated. This saved a huge amount of time!
Test harness = peace of mind
Tests, tests, tests! Get them to cover as many scenarios as possible, and run them as early and as often as possible in the CI/CD pipeline. These tests catch violations early and give you peace of mind when your changes reach production. We used them as the litmus test for reaching feature parity with the legacy service. Many months after finalizing the migration, the test harness we created during the first phase continues to guard against broken access control vulnerabilities.
Did we hit our goals?
With our authorization service fully migrated, we looked back and reviewed each of the goals we set out to achieve:
The new service is now sitting in an ecosystem in which the majority of developers are comfortable making changes.
Automated tests in all environments ensure access control integrity is checked continuously. Alarms will sound if any violation is detected.
We now have the details of each permission check in the logs, allowing for a quick response in the case of any access control issues.
To sum it up: mission accomplished!
We laid a solid foundation to enable future changes to the authorization service and we completed a critical milestone which allowed the Organizations project to become a reality.
Last but not least, the migration would not have been possible without the countless colleagues contributing to the effort. Their brilliant ideas and thoughtful planning are the reason for our success. It was a long journey, and a big shoutout goes to everyone on the team for carrying the migration to completion.