If you’re an engineer responding to an incident, you might need to access production servers for troubleshooting and to identify what’s wrong.
At Segment, to SSH to a server running one of your services, you used to go through an SSH bastion host. A bastion host is a server used by an organization to provide access to a private network from an external network. Because bastion hosts are exposed to potential attacks, they must be hardened to minimize the chances of being compromised. Our bastion host exposed port 22 (SSH) to the internet, so engineers could reach it from anywhere.
To access a server using a bastion host, an engineer needed the following:
They had to be part of the Segment internal Okta group that grants SSH Access.
They had to create and upload a public key for their development machine to a private service. This key was then replicated to servers in our infrastructure.
They had to complete an MFA push on their phone when they connected to the bastion host.
With all this, we ensured only authorized employees accessed our backend servers.
Even though our authentication system met the requirement of preventing unauthorized access to our infrastructure, it wasn’t perfect. Several issues arose with this approach to granting infrastructure access.
When using bastion hosts, a single Okta group granted SSH access. Just by being in that Okta group, you got access to our infrastructure, regardless of the AWS account. This was a major issue: with Segment creating AWS accounts at a faster rate than ever, we needed to follow our principle of least privilege and scope infrastructure access to only what each engineer needed.
Managing Okta groups
Our IT team manages the Okta group that grants SSH access to engineers. Every time an update to the group has to be made, we need to create an IT ticket. As the organization changes, keeping this group up to date is hard. This manual process slows down engineering and adds load to the IT team.
For each new AWS account we create, we have to provision bastion hosts to access the new account’s internal network. Without them, engineers wouldn’t be able to access instances they spin up. This step means a higher load on the Tooling team, which has to provision and maintain these servers.
Bastion hosts need to be patched very frequently, since they are reachable from anywhere, even though they only expose the SSH service. This adds to the already long list of servers that need regular updates.
In the past, it’s been difficult to rotate SSH keys, run regular audits of Okta groups, and perform other security maintenance tasks, even with only a single network entry point to our infrastructure exposed to the internet. Granting any user permanent access to production infrastructure is bad security practice, especially considering that we already limit access duration on AWS accounts using Access Service.
To be successful, the new system needed the following:
Backwards-compatibility with legacy workflows.
Automated provisioning of SSM access when setting up new AWS accounts, with no manual infrastructure setup required.
Easy to understand, while providing a good developer experience.
Our systems are tightly coupled with AWS accounts, so we needed to find a solution that was compatible with AWS IAM. We had already been using Access Service to administer least privilege, time-bound access to Okta groups. Since IAM Roles are also accessed through Okta groups, we wanted to use Access Service to grant time-based access to IAM Roles in our AWS Accounts.
We decided to replace our SSH bastion hosts with AWS Systems Manager Session Manager.
AWS SSM gives us the functionality we need to replace the SSH Bastion hosts:
Terminal access, with session logging for the Security team.
Port-forwarding, so we can use our existing SSH stack. SSH still has a lot of flexibility that developers need for developing, testing and troubleshooting. If we only granted our engineers Terminal access, it would generate a lot of friction.
The AWS Systems Manager Session Manager service requires an ssm-agent running on each instance it grants access to. That meant we had to add the ssm-agent to every instance in our infrastructure, and update the IAM Role attached to the instances so they could call the AWS APIs the SSM Agent requires. This was a massive team undertaking, because we had to update the launch templates of all our Auto Scaling Groups and rotate all the instances in all our AWS accounts.
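As a sketch of what that rollout involved, the user-data addition to a launch template might look like the following (distro and service names are assumptions; Amazon Linux 2, for example, ships the agent preinstalled, while other distros need a package install first):

```shell
#!/bin/bash
# Hypothetical launch-template user-data snippet: ensure the SSM Agent
# is enabled and running so Session Manager can reach the instance.
set -euo pipefail

systemctl enable amazon-ssm-agent
systemctl start amazon-ssm-agent
```

The instance profile also needs permissions for the SSM APIs the agent calls; attaching the AWS-managed AmazonSSMManagedInstanceCore policy (or an equivalent scoped-down policy) to the instance role is the usual way to grant them.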
With this solution, engineers can use specific AWS APIs to access the instances, which means we use AWS IAM as a gateway to our infrastructure.
Engineers who have access to a specific IAM Role can get terminal access to instances, and, for more complex workflows, they can use SSM to create a port-forwarding session to connect to the SSH server running on the instance.
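For the port-forwarding path, one common pattern is to route SSH through SSM with an SSH ProxyCommand, using the AWS-managed AWS-StartSSHSession document. The host aliases below are illustrative, not Segment's actual configuration:

```
# ~/.ssh/config — connect to instances by instance ID, tunneling SSH
# over an SSM session instead of a bastion host.
Host i-* mi-*
    ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
```

With this in place, `ssh ec2-user@i-0123456789abcdef0` transparently opens an SSM session to carry the SSH connection, so the existing SSH tooling keeps working.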
Using IAM to scope down access
Each AWS account now has a new IAM role that grants engineers access to EC2 instances on that account. This role is different from the IAM roles that engineers use in their day-to-day work, so it’s only requested by engineers who require access to EC2 instances.
Engineers request access to IAM Roles in AWS Accounts through Access Service. We map these Okta groups to IAM Roles, so accessing an IAM Role in an AWS account is only one Okta tile away.
With this simple architecture, we ensure that Segment engineers only have access to instances in a specific AWS account, while no longer relying on the IT team to provision access.
As we move away from our monolithic AWS account into smaller and better-scoped ones, we automatically make SSH/Terminal access to these accounts follow the principle of least privilege more closely.
AWS SSM Session logging
One of the cool features of AWS SSM Session Manager is session logging. When you use SSM with terminal access, you can specify an S3 bucket as a destination for a screen recording of the terminal session.
All these configurations are kept in objects called SSM Documents. Since these configurations define whether sessions are logged, the destination S3 bucket, and other capabilities, we must make sure only an approved set of SSM Documents is used. To enforce this, we generate all SSM Documents in one AWS account and share them with the rest of the AWS accounts in the organization. We also limit the IAM Roles to reference only SSM Documents from the AWS account that created them.
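As an illustration (the bucket and prefix names here are made up), a Session-type SSM Document that enables S3 session logging looks roughly like this:

```json
{
  "schemaVersion": "1.0",
  "description": "Example shell-access session document with S3 logging (illustrative values)",
  "sessionType": "Standard_Stream",
  "inputs": {
    "s3BucketName": "example-ssm-session-logs",
    "s3KeyPrefix": "sessions/",
    "s3EncryptionEnabled": true
  }
}
```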
The code below is an example of the IAM Policies we attached to the IAM Roles that grant access to AWS SSM, restricting them to the SSM Documents we create.
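The original policy isn’t reproduced here, but a minimal sketch (the account ID and document name are placeholders) would authorize `ssm:StartSession` only against instances in the account and the approved document:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "StartSessionWithApprovedDocumentOnly",
      "Effect": "Allow",
      "Action": "ssm:StartSession",
      "Resource": [
        "arn:aws:ec2:us-east-1:123456789012:instance/*",
        "arn:aws:ssm:us-east-1:123456789012:document/Example-Document-Shell-Access"
      ]
    }
  ]
}
```

Because `ssm:StartSession` is evaluated against both the instance ARN and the document ARN, a session request that names a document from any other account is denied.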
With the CLI command below, engineers can access an instance. Only the SSM documents from the specific AWS account will be valid:
aws ssm start-session --region us-east-1 --target i-XXXXXXXXXXXX --document-name "arn:aws:ssm:us-east-1:XXXXXXXXXXX:document/Example-Document-Shell-Access"
This project took roughly 6 months to complete, and was a joint effort between the Cloud Security, SRE, and Tooling teams.
To summarize, these are the main wins from this project:
Engineers get access to infrastructure only through IAM Roles. They now request access to these roles via Access Service and assume them with aws-okta from the command line, inheriting our Okta authentication standards (username/password + MFA).
We were able to reduce the cost, complexity, and maintenance burden of our infrastructure, and eliminate the need to distribute SSH keys.
Last but not least, we reduced Segment’s attack surface by not having any SSH port open to the world.