
Engineering


All Engineering articles

Geoffrey Keating on June 21st 2021

This guide explains what data management is and how it lets organizations capture their data’s upside while removing the downsides of unmanaged data.

Geoffrey Keating on June 9th 2021

A customer data hub is the primary collection point for all your customer information. It connects to all the channels, platforms, and products your customers use.

Geoffrey Keating on June 8th 2021

Data Modeling 101: What Are Data Models?

Through data models, developers, data architects, business analysts, and other stakeholders can agree on the data they’ll capture and for what purposes before building databases and warehouses.

A data model specifies what information to capture, how the different pieces of data relate to one another, and how to store them, establishing data standards for your entire organization. For example, a model for an eCommerce website might specify the customer data you’ll capture, how to label that data, and how it relates to product information and the sales process.

Like a blueprint for a house, a data model defines what to build and how, before construction starts and changes become much more complicated to make. This approach helps prevent database design and development errors, the capture of unnecessary data, and the duplication of data across multiple locations.

In this article, we’ll cover these basics of data modeling:

  • Understanding different types of data models

  • Why data models are necessary for building a data infrastructure

  • Top three data modeling techniques

Understanding different types of data models

Data models fall into three categories: conceptual, logical, and physical models. They help align stakeholders around the why, how, and what of your data project. Each type of model serves a different purpose and audience in the data modeling process.

Conceptual data models

Conceptual data models visualize the concepts and rules that govern the business processes you’re modeling without going into technical details. You use this visualization to align business stakeholders, system architects, and developers on the project and business requirements: what information the data system will contain, how elements should relate to each other, and their dependencies.

Typically, a conceptual model shows a high-level view of the system’s content, organization, and relevant business rules. For example, a data model for an eCommerce business will contain vendors, products, customers, and sales. A business rule could be that each vendor needs to supply at least one product.

There’s no standard format for conceptual models. What matters is that the model helps both technical and non-technical stakeholders align and agree on the purpose, scope, and design of their data project.

Logical data models

A logical data model is based on the conceptual model and defines the project’s data elements and relationships. You’ll see the names of specific entities in the database, as well as their attributes. To stay with the eCommerce example: A logical model shows products are identified through a “product ID,” with properties like a description, category, and unit price.
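
To make this concrete for developers, the sketch below shows the eCommerce entities and attributes as Go structs. The names and fields are hypothetical and only illustrate the kind of detail a logical model pins down:

```go
package model

// Hypothetical logical entities for the eCommerce example.
// Names and attributes are illustrative, not a prescribed schema.

// Product is identified by a product ID and carries descriptive attributes.
type Product struct {
	ProductID   string // primary identifier
	Description string
	Category    string
	UnitPrice   float64
}

// Customer groups the attributes captured about a buyer.
type Customer struct {
	CustomerID string
	Name       string
	Email      string
}

// Sale relates a customer to the products they purchased.
type Sale struct {
	SaleID     string
	CustomerID string   // references Customer.CustomerID
	ProductIDs []string // references Product.ProductID
}
```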

Data architects and business analysts use the logical data model to plan the implementation of a database management system—software that stores, retrieves, defines, and manages data in a database.

Physical data models

The physical data model gets technical. Database analysts and developers use it for the design of the database. The model specifies the types of data you’ll store along with technical requirements.

An example of a data type specification is whether a piece of data will be an integer (a whole number without a decimal point) or a float (a number with a decimal point). Technical requirements include details on storage needs, access speed, and data redundancy, that is, storing a piece of data in multiple locations to increase durability and improve query performance.

In practice, only very large projects, say modeling a container shipping business, move from conceptual to logical to physical models. Most other projects skip the conceptual phase and spend most of their time in logical modeling. Some teams even cover elements from the physical phase simultaneously because the people working on the logical model also do the technical implementation.

Why data models are necessary for building a data infrastructure

Data models turn abstract ideas (“we want to track our global container shipments in real time”) into a technical implementation plan (“we will store an attribute called ‘container GPS location’ in a table called ‘Containers’ as an integer”). They help avoid costly demolition and reconstruction of your data infrastructure because you need to think about the data you’ll need, its relations, the database framework, and even whether your project is viable before creating databases and warehouses.

Data models also help with data governance and legal compliance. They allow you to set standards from the start of the project, so teams don’t end up with conflicting data formats that need cleaning before the data can be used or, worse, that make it unusable altogether.

Data models and standardization help avoid situations like a sign-up field labeled in nearly a dozen different ways across the organization.

You can also identify sensitive information—social security numbers, passwords, credit card numbers—while you’re modeling so you can involve security and legal experts before you start building.

With safe, accurate, and high-quality data, all teams benefit. Product teams can iterate faster and build immersive user experiences. Analytics teams can create queries without heavy workarounds. And marketing teams can improve advertising efforts by personalizing messaging according to user behaviors and traits.

Customer Data Platforms (CDPs) like Segment can do much of the heavy-lifting during data modeling projects. Segment’s Connections feature makes it easy to capture, organize, and visualize every customer-facing interaction with your business, whether digital or offline. Protocols lets you define your data standards and enforce them at the point of collection. Using real-time data validation and automatic enforcement controls, you can diagnose issues before they pollute your marketing and analytics tools or data warehouse.

Top three data modeling techniques

There are many different techniques to design and structure a database. You should explore these techniques and decide on the most suitable one for your project at the end of the conceptual phase. These data modeling methodologies define how the database gets structured and closely relate to the type of formatting or technology you can use to manage your data project.

For example, many people now default to graph modeling because it’s new and popular, even when a simple relational model would suffice. Understanding the most popular techniques helps you avoid such mistakes.

1. Relational data modeling

In a relational data model, data is stored in tables, and specific elements in one table link to information in other tables. Entities can have a one-to-one, one-to-many, or many-to-many relationship.

Relational databases often use SQL (Structured Query Language), a programming language, for accessing and managing data. They’re frequently used in point-of-sale systems, as well as for other types of transaction processing.
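
As a small, hypothetical illustration of a one-to-many relationship (each vendor supplies many products), here’s how application code might query such tables with Go’s standard database/sql package. The connection string, driver choice, and table names are assumptions, not part of any real schema:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // hypothetical choice of Postgres driver
)

func main() {
	// Placeholder connection string.
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/shop?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One-to-many: each vendor supplies many products, joined on a foreign key.
	rows, err := db.Query(`
		SELECT v.name, p.description
		FROM vendors v
		JOIN products p ON p.vendor_id = v.id`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var vendor, product string
		if err := rows.Scan(&vendor, &product); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s supplies %s\n", vendor, product)
	}
}
```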

The Entity-Relationship Model—sometimes referred to as ER model—is similar to the relational model. It visualizes the relationships between different elements in a system but without going into technical details. You can use the ER model during the conceptual phase to align technical and non-technical stakeholders.

2. Dimensional data modeling

To understand dimensional data models, picture a cube. Each side of the cube represents an aspect of the data you’re trying to capture.

For example, suppose your business sells multiple products to different customer segments, and you want to evaluate sales performance over time. You can visualize this as a data cube, with dimensions for time, products, and customer segments. By traveling up, down, left, and right on the axes of the cube, you can make comparisons across all those dimensions. You’ll see how the sales of each of these products compare to each other and different customer segments at any point in time.

You use the cube model during the conceptual phase. One of the most frequent manifestations of such a cube in the logical stage is the “star schema.” At first glance it might look like a relational model, but the star schema is different: it has a central fact table that connects to many surrounding dimension tables.
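
To picture the star shape, here’s a hypothetical sketch of a central fact table surrounded by dimension tables for time, products, and customer segments. All names are illustrative:

```go
package warehouse

// Dimension tables describe the "sides of the cube".
type DimTime struct {
	TimeKey int // e.g. 20210608 for June 8th, 2021
	Month   string
	Year    int
}

type DimProduct struct {
	ProductKey int
	Name       string
	Category   string
}

type DimCustomerSegment struct {
	SegmentKey int
	Name       string // e.g. "enterprise", "self-serve"
}

// FactSale is the central fact table; each row references one member of
// every dimension, which is what gives the schema its star shape.
type FactSale struct {
	TimeKey    int
	ProductKey int
	SegmentKey int
	Units      int
	Revenue    float64
}
```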

3. Graph data modeling

During the conceptual phase, most people sketch a data model on a whiteboard. Such a sketch resembles the graph model. It consists of “nodes” and “edges”: a node represents an entity where data is stored, and an edge represents the relationship between two nodes. That resemblance is also the main advantage of this approach: what you sketch on the whiteboard is what you store in the database.

Other techniques require you to translate the output from the conceptual phase into a different format for the logical and physical implementation—for example, going from an ER to a relational model or from a cube model to a star schema. Not so with graph models. You can implement them straight away using technology like Neo4j, a native graph database platform.
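
To illustrate the idea without tying it to any particular database, here’s a minimal sketch of nodes and edges in Go. It is not the Neo4j API, and every name in it is hypothetical:

```go
package graph

// Node holds the data, much like a box sketched on the whiteboard.
type Node struct {
	ID         string
	Labels     []string          // e.g. "Customer", "Product"
	Properties map[string]string // arbitrary attributes
}

// Edge is the line between two boxes: a typed relationship between nodes.
type Edge struct {
	From string // ID of the source node
	To   string // ID of the target node
	Type string // e.g. "PURCHASED"
}

// Graph stores what was sketched, essentially unchanged.
type Graph struct {
	Nodes map[string]Node
	Edges []Edge
}

// Neighbors returns the IDs of nodes directly connected to id by outgoing edges.
func (g Graph) Neighbors(id string) []string {
	var out []string
	for _, e := range g.Edges {
		if e.From == id {
			out = append(out, e.To)
		}
	}
	return out
}
```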

Data models don't have to be difficult

When you understand the purpose of data models and the process to follow, they’re not challenging to create, especially if you also collect, organize, and standardize your data with Segment. You’ll align all stakeholders before starting technical implementation and avoid costly mistakes or rebuilds. You’ll know what expertise you need on the team to execute your plan and have your data governance defined, too.

Andy Li on June 1st 2021

Access is always changing. When you start at a new company, you are usually given access to a set of apps on day one, based on your team and role. Even on day one, there can be a difference between the access you are granted and the access you need to do your job. This mismatch results in one of two outcomes: underprovisioned or overprovisioned access.

For the IT and security teams who manage cloud infrastructure accounts, securing access to them can be difficult and scary; the systems are complex, and the stakes are high. If you grant too much access, you might allow bad actors access to your tools and infrastructure, which at best results in a breach notification; at worst, it results in a company-ending, game-over scenario. If you grant too little access, you put roadblocks between your colleagues and the work they need to do, meaning you are decreasing your company’s productivity.

Overprovisioned access

A common approach taken by startups and small companies is to grant access permissively. In these companies, early productivity can be critical to the success of the business. An employee locked out of a system because of missing access means lost productivity and lost income for the business. 

If you give employees permanent admin access to every system, you optimize for velocity, but at the expense of increased risks from compromised employee accounts and insider threats. This results in an increased attack surface. As your company grows, it becomes more important to secure access to critical resources, and this requires a different approach.

Underprovisioned access

If you give employees too little access, it forces them to request access more often. Although new employees are initially given access based on their team and role, new duties and new projects can quickly increase the scope of the access they need. Depending on your company’s process for providing access, this can be cumbersome for the requester, for the approver, or oftentimes, for both. 

Here at Segment, we have production environments across Amazon Web Services (AWS) and Google Cloud Platform (GCP). We need to secure access to these accounts thoughtfully so that our engineers can continue to build fast and safely. At many companies, you might rely on a centralized team to manage internal access. While this is a simple approach, it does not scale – team members have a limited amount of context surrounding requests, and might accidentally over-provision the requester’s access. At Segment, we approached the problem of managing least-privilege cloud access by building Access Service: a tool that enables time-based, peer-reviewed access.

Setting the stage: access at Segment

At Segment, we have hundreds of roles across dozens of SaaS apps and cloud providers representing different levels of access. In the past, we had to log in to each app or system individually to grant a user access. Our IT team then “federated” our cloud access, using Okta as our Identity Provider. This gave us a single place to manage which users have access to which roles and applications. The rest of this blog post builds on this federated access system.

If your organization hasn’t built something similar, the following resources can help you build and set up your own federated cloud access system.

Blog posts:

Docs:

Mapping Okta apps to AWS roles

By mapping Okta applications to cloud provider roles, engineers are one click away from authenticating to a cloud provider via single sign-on (SSO) with the appropriate permissions.

Each Okta app is mapped to a “Cloud Account Role” (or “Cloud Project Role” for GCP). For example, in AWS, we have a Staging account with a Read role that provides read access to specific resources. In Okta, we have a corresponding app named “Staging Read - AWS Role” that allows engineers to authenticate to the AWS Staging Account and assume the Read role.

This requires configuring an Okta app for each “Cloud Account Role” combination, which at the time of writing is 150+ Okta apps.

Configuring GCP with Okta is slightly different, and technical details for how to do this are at the bottom of this blog.

Mapping Okta groups to SaaS app groups

In addition to authentication, Identity Providers can also help with authorization. Users get understandably frustrated when they get access to an application, but don’t have the correct permissions to do their job. 

Identity Providers have agreed upon a common set of REST APIs, called SCIM (the System for Cross-domain Identity Management), for managing provisioning, deprovisioning, and group mapping.

If an application supports SCIM, you can create groups within your Identity Provider (e.g. Okta), which will map user membership into the application. With this setup, adding users to the Okta group will automatically add them to the corresponding group in the application. Similarly, when a user is unassigned from the application in Okta, their membership in the application group will also be lost. 

SCIM allows us to provide granular, application-level access, all while using our Identity Provider as the source of truth.
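
Under the hood, this kind of group mapping is just a REST call. Here’s a simplified sketch in Go of the SCIM 2.0 PatchOp (RFC 7644) an Identity Provider sends to a downstream app’s SCIM endpoint to add a user to a group; the base URL, path, token, and IDs are placeholders. In practice, Okta issues these calls for you once SCIM provisioning is enabled; the sketch only shows what the exchange looks like:

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

// addMemberToGroup issues a SCIM 2.0 PatchOp (RFC 7644) that adds a user to a group.
// The base URL, token, and IDs are placeholders for whatever the downstream app exposes.
func addMemberToGroup(baseURL, token, groupID, userID string) error {
	body := []byte(fmt.Sprintf(`{
	  "schemas": ["urn:ietf:params:scim:api:messages:2.0:PatchOp"],
	  "Operations": [
	    {"op": "add", "path": "members", "value": [{"value": %q}]}
	  ]
	}`, userID))

	req, err := http.NewRequest(http.MethodPatch, baseURL+"/scim/v2/Groups/"+groupID, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/scim+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("SCIM patch failed: %s", resp.Status)
	}
	return nil
}

func main() {
	if err := addMemberToGroup("https://app.example.com", "example-token", "group-id", "user-id"); err != nil {
		log.Fatal(err)
	}
}
```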

With a single place to manage access for all of our cloud providers, the problem should be solved, right? Not quite… 

While the underlying Okta apps and groups system worked great, we quickly ran into more human problems.

Pitfalls of centralized access management

Even with our awesome new Okta+AWS system, we still needed a process for a centralized team to provision access through Okta. At many companies, this team would be IT. At Segment, this was a single person named Boggs. Requests would go into his inbox, and he would manually review the request reason and decide whether there was a more suitable level of access for the task. Finally, he would go to the Okta admin panel and provision the appropriate app to the user. Although this system worked for a time, it was not scalable and had major drawbacks.

Permanent access

Once an app was provisioned to a user, they would have access until they left Segment. But while the access was permanent, the need for it usually wasn’t. Unfortunately, our manual provisioning process had no scalable way to ensure access was removed after it was no longer needed. People granted access for one-off tasks ended up with permanent access that hung around long after they actually needed it.

Difficulty scaling due to limited context 

As an engineering manager, Boggs had a strong sense of available IAM roles and their access levels. This allowed him to reduce unnecessary access by identifying opportunities to use less sensitive roles. This context was difficult to replicate and was a big reason why we could not simply expand this responsibility to our larger IT team. 

Most centralized IT teams don’t work closely with all of the apps that they provision, and this makes it difficult for them to evaluate requests. Enforcing the principle of least privilege can require intimate knowledge of access boundaries within a specific app. Without this knowledge, you’ll have a hard time deciding if a requester really needs “admin”, or if they could still do the work with “write” permissions, or even just with “read” access.

Kyle from the Data Engineering team is requesting access to the Radar Admin role to “debug”. What do they actually need access for? Would a Read only role work? And wait… who is Kyle?! Did they start last week? They say that they need this access to do their job and I still need to do mine… APPROVE. 

It was slow

Despite being better equipped than most people to handle access requests, Boggs was a busy engineering manager. Provisioning access was an infrequent task at first, but as the company grew, it began to take up valuable chunks of time, and understanding the context of each request became increasingly difficult.

We considered involving extra team members from our IT team, but this would still take time, as they would need to contact the owners of each system to confirm that access should be granted. Ultimately, having a limited pool of centralized approvers working through a shared queue of requests made response times less than ideal.

Breaking Boggs 

Boggs tried automating parts of the problem away using complex scripted rules based on roles and teams, but there were still situations that broke the system. How would he handle reorgs where teams got renamed, switched, merged, or split? What happens when a user switches teams? What happens when a team has a legitimate business need for short-term access to a tool they didn’t already have? Under that system, any access Boggs provisioned lasted forever, unless somebody went in and manually audited Okta apps for unused access.

Ultimately, we found ourselves in a situation where we had a lot of over-provisioned users with access to sensitive roles and permissions. To make sure we understood how bad the problem actually was, we measured the access utilization of our privileged roles. We looked at how many privileged roles each employee had access to, and compared them to how many privileged roles had actually been used in the last 30 days.

The results were astonishing: 60% of access was not being used.

Managing long-lived access simply did not scale. We needed to find a way to turn our centralized access management system into a distributed one. 

Access Service

In the real world of access, we shouldn’t see a user's access footprint as static, but instead view it as amorphous and ever-changing.

When we adopted this perspective, it allowed us to build Access Service, an internal app that allows users to get the access they really need, and avoid the failure modes of provisioning too little or too much access.

Access Service allows engineers to request access to a single role for a set amount of time, and have their peers approve the request. The approvers come from a predefined list, which makes the access request process similar to GitHub pull requests with designated approvers.

As soon as the request is approved, Access Service provisions the user with the appropriate Okta app or group for the role. A daily cron job checks if a request has expired, and de-provisions the user if it has. 
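
A heavily simplified sketch of what that daily expiry sweep could look like follows; the Request type, deprovision call, and sample data are hypothetical stand-ins for Access Service’s internals:

```go
package main

import (
	"log"
	"time"
)

// Request is a hypothetical stand-in for an approved access request.
type Request struct {
	ID        string
	UserEmail string
	OktaAppID string
	ExpiresAt time.Time
}

// deprovision is a placeholder for the call that removes the Okta app or group.
func deprovision(r Request) error {
	log.Printf("removing %s from %s", r.UserEmail, r.OktaAppID)
	return nil
}

// sweepExpired is the body of the daily cron job: deprovision anything past its expiry.
func sweepExpired(requests []Request, now time.Time) {
	for _, r := range requests {
		if now.After(r.ExpiresAt) {
			if err := deprovision(r); err != nil {
				log.Printf("failed to deprovision request %s: %v", r.ID, err)
			}
		}
	}
}

func main() {
	sweepExpired([]Request{
		{ID: "req-1", UserEmail: "user@example.com", OktaAppID: "Staging Read - AWS Role", ExpiresAt: time.Now().Add(-24 * time.Hour)},
	}, time.Now())
}
```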

At a high level this is a simple web app, but let’s look closer at some specific features and what they unlock.

Temporary access

The magic of Access Service is the shift from long-lived access to temporary access. Usually, an engineer only needs access temporarily to accomplish a defined task. 

Once that task is done, they have access they no longer need, which violates the principle of least privilege. Fixing this using the old process would mean manually deprovisioning Okta apps – adding yet another task to a workflow that was already painfully manual.

With Access Service, users specify a duration with their access request. Approvers can refuse to approve the request if they think the duration is unnecessarily long for the task. This duration is also used to automatically deprovision their access once the request expires.

Access Service offers two types of durations: “time-based access” and “activity-based access”. 

Time-based access is a specific time period, such as one day, one week, two weeks, or four weeks. This is ideal for unusual tasks such as: 

  • fixing a bug that requires a role you don’t usually need

  • performing data migrations

  • helping customers troubleshoot on production instances you don’t usually access

Activity-based access is a dynamic duration that extends the access expiration each time you use the app or role you were granted. This is ideal for access that you need for daily job functions – nobody wants to make a handful of new access requests every month. However, we don’t offer this type of access for our more sensitive roles: broad-access roles, or roles that touch sensitive data, require periodic approvals to maintain access. Activity-based access provides a practical balance between friction and access, aligning with our goal of enabling our engineers to build quickly and safely.
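
To make the difference concrete, here is a small, hypothetical sketch of how the two expiration styles could be computed. The function names and windows are illustrative, not Access Service’s actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

// timeBasedExpiry: a fixed window from the moment the request is approved.
func timeBasedExpiry(approvedAt time.Time, window time.Duration) time.Time {
	return approvedAt.Add(window)
}

// activityBasedExpiry: each use of the role pushes the expiration out again,
// so access only lapses after a period of inactivity.
func activityBasedExpiry(lastUsedAt time.Time, inactivityWindow time.Duration) time.Time {
	return lastUsedAt.Add(inactivityWindow)
}

func main() {
	approved := time.Date(2021, 6, 1, 9, 0, 0, 0, time.UTC)
	fmt.Println("time-based, one week:   ", timeBasedExpiry(approved, 7*24*time.Hour))

	lastUsed := time.Date(2021, 6, 10, 14, 0, 0, 0, time.UTC)
	fmt.Println("activity-based, 30 days:", activityBasedExpiry(lastUsed, 30*24*time.Hour))
}
```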

Designated approvers

One of the biggest limitations with our previous process was that one person had to approve everything. In Access Service, each app has a vetted list of approvers who work closely with the system. By delegating decision making to experts, we ensure that access is approved by the people who know who should have it. 

To start out with, you can’t approve your own access requests. (Sorry red team.) Each app has a “system owner” who is responsible for maintaining its list of approvers. When a user creates an access request, they select one or more approvers to review their request. Because the approvers list contains only people who work closely with the system, the approvers have better context and understanding of the system than a central IT team.

This makes it easier for approvers to reject unreasonable or too-permissive access requests, and encourages users to request a lower tier of access (for example, telling them to request a read-only role instead of a read/write role). Since incoming requests are “load balanced” between approvers, users also see a much faster response time to their requests. 

Provisioning access always requires two people, much like a GitHub pull request. Users cannot select themselves as an approver, even if they are a system owner. Access Service also supports an “emergency access” mechanism with different approval requirements. This prevents Access Service from blocking an on-call or site reliability engineer if they need access in the middle of the night. 

With system owners appointed for each app, our distributed pool of approvers continues to scale as we introduce new tools with new access roles and levels. This is what the security community calls “pushing left.”

When you “push left”, you introduce security considerations earlier in the development lifecycle, instead of trying to retrofit a system after it is in use. In the software engineering space, “pushing left” resulted in engineers learning more about security. This means that the people most familiar with the systems are the most knowledgeable people to implement security fixes. Since the engineers are the ones who designed and now maintain the software, they have much more context than the central security team. Similarly, Access Service unburdens the central IT team, and empowers system owners to make decisions about who should have access to their systems, and at what level. This significantly reduces the amount of time the IT team spends provisioning access, and frees them up to do more meaningful work.

How it works

Access Service, like many of our internal apps, is accessible to the open internet, but protected behind Okta.

The basic unit of Access Service is a “request”. A user who wants access creates a request that includes four pieces of information: 

  • the application they want access to

  • the duration they want access for

  • a description for why they need access

  • the approver(s) they want to review the request

When they click “Request Access”, Access Service sends the selected approvers a Slack notification. Segment, like many modern companies, has a high degree of Slack presence. Using this platform makes Access Service a more natural, less disruptive part of people’s workflows. Even if the user requesting access is an approver for the particular app, they must receive approval from a different approver – every request must involve two people.
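
For illustration, a request could be modeled roughly like the sketch below, which also captures the rule that a requester can never approve their own request. The types, field names, and email addresses are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// AccessRequest is a hypothetical shape for the four pieces of information a request carries.
type AccessRequest struct {
	Application string        // the app or role being requested
	Duration    time.Duration // how long access is needed
	Reason      string        // why access is needed
	Approvers   []string      // who should review the request
	Requester   string
}

// Approve enforces the two-person rule: a requester can never approve their own request.
func (r AccessRequest) Approve(approver string) error {
	if approver == r.Requester {
		return errors.New("requests must be approved by someone other than the requester")
	}
	fmt.Printf("%s approved %s's access to %s for %s\n", approver, r.Requester, r.Application, r.Duration)
	return nil
}

func main() {
	req := AccessRequest{
		Application: "Staging Read - AWS Role",
		Duration:    7 * 24 * time.Hour,
		Reason:      "debugging a customer-reported issue",
		Approvers:   []string{"owner@example.com"},
		Requester:   "engineer@example.com",
	}
	if err := req.Approve("owner@example.com"); err != nil {
		fmt.Println(err)
	}
}
```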

The access request is tracked in a web app, so you can see what requests you have open, and what roles you currently have access to.

The requester is notified via Slack when their request has been approved, so they know they can now get back to the task they needed access for in the first place.

The results

After we migrated our access process to Access Service, the result was zero long-lived access to any of our privileged cloud roles in AWS and GCP. All access granted to these roles expires if it is not actively used. 

In the graph below, “Access Points” refers to the number of users with access to each admin role. After moving to Access Service, we reduced the number of people who had privileged access by 90%. 

In the next graph below, “Active” refers to the number of people who used an app within the last 30 days. Because this number is higher than the number of Access Points, it shows that more access was used in the last 30 days than was currently provisioned.

That seems strange – how could admin apps have been used by more people than the total amount of people provisioned access? That’s because expired access had already been automatically deprovisioned, reducing the number of Access Points by the end of the 30 day window!

Conclusion

By acknowledging that access needs are constantly changing, we were able to create a more practical way to manage access control.

Access Service allows us to streamline the access approval process. By routing requests directly to designated approvers, we are able to get fast approvals from people with rich context. The time-based component of access requests allows the service to regularly remove unneeded access, preventing our access attack surface from growing too large. Finally, integrating Slack into the system makes approvals faster, ensures that you know immediately when your request has been approved, and reminds you when the request is expiring so you don’t run into unexpected access loss when just trying to do your job.

While it can be daunting to try to reinvent an existing, well-established process, the results can be incredibly rewarding. Start by writing down your goals, thinking about what you don’t like and what is painful about the current state, and reevaluate your core assumptions. Companies are always changing, and your processes have to keep up; the circumstances that led to the previous system may no longer be applicable today. Most importantly, remember to build with the user’s workflow in mind, because security depends on participation of the whole company.

Future development

Policies

Apps in Access Service are currently individually customizable. However, this can lead to scalability issues if we want to make changes across multiple, similar apps. For example, if we decide to limit access durations for several AWS accounts to no more than one week, we currently have to edit the allowed durations for each individual role. With the introduction of policies, we would be able to map several roles to a single policy, allowing us to apply a change like that in one place.

Dynamic Roles

Currently, Access Service grants users access to predefined AWS roles. These roles are typically made to be general-purpose, but there may be use-cases not fully captured by an existing role. Instead of configuring a new role for one-off needs, or using an overly permissive role, Access Service could allow users to create a dynamic role. When making a request, users would check boxes corresponding to what permissions they wanted (e.g. “S3 Read”, “CloudWatch Full Access”, etc) to create a custom, dynamic role.


Special thanks to David Scrobonia for creating Access Service and setting up the foundation for this blog. Thank you to John Boggs, Rob McQueen, Anastassia Bobokalonova, Leif Dreizler, Eric Ellett, Pablo Vidal, Arta Razavi, and Laura Rubin, all of whom either built, designed, inspired, or contributed to Access Service along the way.


References

Configuring GCP roles in Okta

Connecting a GCP role to Okta is harder than with AWS, and after struggling to figure it out for a while, we thought it would be worth sharing. To connect a GCP role to our Okta instance, we had to use Google Groups in GSuite. 

First, we created a single GSuite Group for each of our Project-Role pairs. In GCP, a Google Group is a member (principal) that can be assigned a role, and all users added to the group are also assigned that role. 

We then assigned each GCP role to its corresponding Google Group. Next, we needed to connect the Google Groups to Okta. 

You can do this by using Okta Push Groups, which link an Okta “group” to a Google Group. Adding a user to an Okta Push Group automatically adds the correct GSuite user to the Google group. We created an Okta Group for each of the roles and configured it as a Push Group to its corresponding Google Group.

To summarize, the flow looked like this: 

  1. Add Okta User david@segment.com to Okta Group “Staging Read - GCP Role” 

  2. Okta Push Groups adds the GSuite user david@segment.com to the “Staging Read” Google Group

  3. Because he is a member of the “Staging Read” Google Group, david@segment.com is assigned the “Read” IAM role for the “Staging” project.

A BeyondCorp approach to internal apps

All of our internal apps use an OpenID Connect (OIDC) enabled Application Load Balancer (ALB) to connect to Okta. This provides a BeyondCorp approach to access for our internal apps: all are publicly-routable, but are behind Okta. 

This is also nice from a tooling developer standpoint, because not only is authentication taken care of, but we can use the signed JSON web token (JWT) that Okta returns to the server through the ALB to get the identity of the user interacting with Access Service. This allows us to use Okta as a coarse authorization layer and manage which users have access to different internal apps.
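
As a simplified illustration (not our production code), a Go service behind such an ALB could read the user’s claims from the x-amzn-oidc-data header that the ALB forwards. Note that a real implementation must also verify the token’s signature against the ALB’s public key before trusting it:

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"strings"
)

// identityFromALB decodes the claims from the JWT an OIDC-enabled ALB forwards
// in the x-amzn-oidc-data header. For brevity this sketch skips signature
// verification, which production code must perform.
func identityFromALB(r *http.Request) (map[string]interface{}, error) {
	token := r.Header.Get("x-amzn-oidc-data")
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return nil, fmt.Errorf("unexpected token format")
	}
	payload, err := base64.RawURLEncoding.DecodeString(strings.TrimRight(parts[1], "="))
	if err != nil {
		return nil, err
	}
	claims := map[string]interface{}{}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return nil, err
	}
	return claims, nil
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		claims, err := identityFromALB(r)
		if err != nil {
			http.Error(w, "unable to read identity", http.StatusUnauthorized)
			return
		}
		fmt.Fprintf(w, "hello, %v\n", claims["email"])
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```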

Stephanie Evans, Kevin Niparko, Gerhard Esterhuizen on May 31st 2021

A guiding framework for how EMs + PMs should partner together.

Sonia Sidhpura, Tasha Alfano, Pooya Jaferian, Ivayr Dieb Farah Netto on May 25th 2021

Analytics.js 2.0 is now Generally Available. Analytics.js 2.0 is a fully backward-compatible and lighter-weight implementation of Analytics.js. Simply put, the Analytics.js you know and love just got even faster with a more extensible architecture.

Maggie Yu on May 14th 2021

At Segment, we rely heavily on custom-built, internal tools for running ad-hoc tasks and debugging problems in our infrastructure. These are run not just by the engineers doing product development but also by our support teams, who use them to efficiently investigate and resolve customer-reported issues.

This internal tooling has been especially critical for maintaining the Segment Data Lakes product, which, as described in our previous blog post, loads data into a customer’s S3 bucket and Glue Catalog via EMR jobs. 

Because the latter infrastructure is all in a customer’s AWS account and not a Segment-owned one, configuring, operating, and debugging issues in this product is a bit trickier than it otherwise would be for something running completely in Segment infrastructure.

One of the most common problems customers have when trying to set up Segment Data Lakes is missing permissions in their IAM policies. So, one of our tools tests if Segment’s role is allowed to access the target S3 bucket, Glue Catalog, and EMR cluster in the customer account. Another common tooling task is to submit EMR jobs to replay archived data into Segment Data Lakes, so customers can bootstrap the latter from historical data.

Since the people running these internal tools can access a customer’s resources, security is extremely important. Hence, all the security best practices that apply for sensitive production systems need to apply for our Data Lakes tooling as well. Among other examples, the access to the tools should be limited to a small group of people, and all the operations performed using the tools need to be tracked and auditable.

We evaluated multiple approaches to creating and securing these tools, but ultimately settled on using AWS’s API Gateway product. In this blog post, we’ll explain how we integrated the latter with the Data Lakes backend, and how we used IAM authorization and API Gateway resource policies to tighten up access control. We will also provide some examples of setting up the associated infrastructure in Terraform.

Design Evolution

Data Lakes tooling has two components – a CLI client and an RPC service. Each CLI command issues a request to the RPC service, which can then access the associated customer resources.

When Segment Data Lakes was launched, only the Data Lakes engineers could access and use the CLI tool. This was fine for the initial launch, but in order to scale the product, we realized that we needed to allow wider access inside Segment. We thus set out to design some improvements that would allow us to open up this access in a safe, controlled manner.

Our initial design depended on SAML to authenticate users, which is how several other, internal tools at Segment address this problem. Segment uses the Okta-AWS SAML integration to manage and assign access to AWS IAM roles, and then we use aws-okta to authenticate with AWS and run tooling commands with the resulting credentials.

The SAML approach would allow us to verify that the operator of our tools is an authenticated and authorized Segment employee. However, this would add a lot of complexity because we’d need to create and maintain the code for both the CLI client and the associated SAML auth flow. 

Just as we were about to implement the SAML solution, AWS released several new features for their AWS API Gateway product, including improved support for IAM-role-based API authentication. Since Segment was already using IAM roles to delegate access to Segment employees, it seemed fitting to also control access to the Data Lakes APIs using IAM-based policies.

With this solution, we could offload responsibility for authentication and authorization to an external service and spend more time focusing on just the Data Lakes aspects of our tooling. 

After working on a POC to validate the setup would work, we officially decided to lean into the AWS API Gateway with IAM authorization instead of supporting a SAML flow in our code.

Before getting into the details of how we integrated AWS API Gateway into our design, let’s review how this product works and the various features it offers.

What is AWS API Gateway?

AWS API Gateway sits between API clients and the services that back these APIs. It supports integrations with various backend types including AWS Lambda and containerized services running in ECS or EKS. The product also provides a number of features to help maintain and deploy APIs, for instance managing and routing traffic, monitoring API calls, and controlling access from a security standpoint.

API Gateway supports two types of APIs – HTTP APIs and REST APIs. Many features are available for both, but some are exclusive to just one or the other. Before deciding on which type of API Gateway best fits your use case, it’s important to understand the differences between the two. Check out this guide in the AWS documentation library for a good summary.

Important API Gateway Concepts

Here are some API Gateway concepts that will be mentioned in this blog post and that you may come across when you set up your own AWS API Gateway.

Resource policies

IAM policies are attached to IAM users, groups or roles whereas API Gateway resource policies are attached to resources. You can use IAM policies and/or resource policies to define who can access the resource and what actions can be performed.

Proxy integration

A proxy integration allows requests to be proxied to a backing service without intermediate body transformations. If a proxy integration is used, only the headers, path parameters, and query string parameters on each request can be modified. If a proxy integration is not used, mapping templates can also be defined to transform request bodies.

Proxy resource

A proxy resource allows bundling access to multiple resources via a greedy path parameter, {proxy+}. With a proxy resource, you don’t need to define a separate resource for each path. This is especially useful if all the resources are backed by the same backing service. For more details, refer to this AWS doc.

Stage

A Stage in API Gateway is a named reference to a deployment. You can create and manage multiple stages for each API, e.g. alpha, beta and production. Each stage can be integrated with different backend services.

Why we chose a REST API over an HTTP one

In our case, we really wanted to use resource policies to define access control, and this feature is only available in REST APIs. HTTP APIs are cheaper than REST APIs, but since we expected low request volumes from our tooling, total monthly cost wouldn’t be a concern. Therefore, we decided to go with a REST API.

If you need to integrate a REST API with an ALB, or if you want to use API Gateway to handle API access control, you may experience some challenges. In the following sections, we share how we addressed these issues.

Setting up a REST API Gateway

ALB Integration

Our backend tooling service resides in a private VPC and sits behind an ALB. HTTP APIs support integrating with an ALB directly, but REST APIs do not (see comparisons between HTTP APIs and REST APIs). To integrate a REST API with an ALB, you need to place an NLB between the two. 

The setup here involves two steps:

  1. NLB → ALB Integration

  2. REST API Gateway → NLB Integration

NLB → ALB Integration

The first step is to create an NLB which can forward traffic to the ALB:

This AWS blog explains how to set up this integration using a Lambda function, and it includes the Lambda function code and a CloudFormation template that you can use for the setup.

REST API Gateway → NLB Integration

The next step is to link the REST API to the NLB as shown below:

This can be achieved by first creating a VPC link to the NLB, and then using this VPC link in the Gateway API method integration. This guide from AWS provides the instructions for integrating an API with a VPC link to an NLB.

We set up our API using both a proxy integration and a proxy resource. With this setup, we don’t need to update the API Gateway when a new API is introduced or our API schema evolves.

Terraform example

We use Terraform to organize and manage our AWS infrastructure at Segment. The following is an example of integrating a REST API with an NLB, and creating a proxy integration with a proxy resource:

Authorization & access control

To ensure only authorized users can access our tooling service, we use a combination of IAM authorization and API Gateway resource policies. In the following sections, we go into more detail about what these are and how we configured them for our tool.

IAM Authorization

API Gateway supports enabling IAM authorization for the APIs. Once this is enabled, the API requests sent need to be signed with AWS security credentials (AWS doc). If a request is not signed (i.e. anonymous identity), then the request will fail. Otherwise, API Gateway will allow or deny the API requests based on the IAM policies and/or resource policies defined. This AWS guide explains how API Gateway decides whether to allow or deny based on the combination of an IAM policy and resource policy.

Resource Policy

An IAM policy and/or resource policy can be used to specify who can access a resource and what actions can be performed, but we use only the latter for our use case. 

At Segment, we have multiple roles in the same AWS account, e.g. a read role with a ReadOnlyAccess policy, an admin role with an AdministratorAccess policy, etc. Some policies grant execute-api permission, which is required for invoking APIs in API Gateway, and some do not. For instance, the AWS managed policy AdministratorAccess allows execute-api for all resources. 

If you want to restrict execute-api permission to a specific role api-invoke-role, you need to attach an IAM policy that denies execute-api to all of the other roles that have this permission, or explicitly deny those roles in the resource policy. 

For the former approach, you would create a policy to deny invoking execute-api against your API Gateway and then attach this policy to the roles that should not be allowed to invoke the APIs. Here’s an example policy:

For the latter approach, you would list all the roles that should not be allowed to invoke the APIs in the resource policy. The following is an example:

If there are many roles in the account and new roles are created over time, it can be very tedious to create these policies and keep them all up-to-date. This is even worse if there are multiple APIs in the same account and each has a separate set of allowed IAM roles.

As an alternative to this, you can create a resource policy that explicitly allows access to a subset of roles while blocking access to everything else by default.

This is the approach that we decided to go with:

Terraform example

The following is an example of implementing the above approach in Terraform:

Access logging

Another important security requirement is to keep an audit trail of the requests made to the API Gateway. Each record should contain the caller identity and the details of the request that was made. 

API Gateway has a built-in integration with CloudWatch that makes it easy to generate these access logs in the latter. You can select a log format (CLF, JSON, XML or CSV) and specify what information should be logged.

In addition to using CloudWatch, we also created a Segment source for the audit logs. Our backend service issues a track call for every request, and all the track events are synced into our warehouse. With this, we didn’t need to implement a connector to export the data from CloudWatch and load it into our warehouse.
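
For illustration, emitting one of these track events with Segment’s analytics-go library looks roughly like the sketch below; the write key, user ID, event name, and properties are placeholders:

```go
package main

import (
	"log"

	analytics "gopkg.in/segmentio/analytics-go.v3"
)

func main() {
	// Write key, event name, and properties are placeholders for illustration.
	client := analytics.New("YOUR_WRITE_KEY")
	defer client.Close()

	// One track call per API request gives an audit record that syncs to the warehouse.
	err := client.Enqueue(analytics.Track{
		UserId: "arn:aws:iam::123456789012:role/example-caller", // identity of the caller
		Event:  "Tooling API Request",
		Properties: analytics.NewProperties().
			Set("path", "/emr/cluster/status").
			Set("method", "GET"),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```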

Terraform example

The following shows how to configure CloudWatch audit logs for an API gateway. Note that you need to create an IAM role that has the appropriate read and write permissions on the CloudWatch side:

Signing requests

With IAM authorization enabled, all API requests need to be signed with an AWS access key so AWS can identify the caller. There are multiple versions of the signing process, with v4 being the latest (and recommended) one. The various AWS SDKs provide helper functions that handle the low-level details of this process.

Here’s an example using the AWS SDK for Go:
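
The sketch below uses the v1 SDK’s signer/v4 package; the API endpoint, request path, and body are placeholders rather than our real API:

```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"time"

	"github.com/aws/aws-sdk-go/aws/session"
	v4 "github.com/aws/aws-sdk-go/aws/signer/v4"
)

func main() {
	// Credentials come from the environment or an assumed role, as usual.
	sess := session.Must(session.NewSession())
	signer := v4.NewSigner(sess.Config.Credentials)

	// Placeholder endpoint, path, and body.
	body := bytes.NewReader([]byte(`{"cluster_id": "example"}`))
	req, err := http.NewRequest(http.MethodPost, "https://abc123.execute-api.us-west-2.amazonaws.com/prod/emr/status", body)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")

	// Sign the request for the execute-api service with Signature Version 4.
	if _, err := signer.Sign(req, body, "execute-api", "us-west-2", time.Now()); err != nil {
		log.Fatal(err)
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	out, _ := ioutil.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}
```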

In our use case, the CLI client automatically wraps the input, signs the request using the user’s AWS credentials, and sends the request to our API Gateway, making it really easy for users to use. 

For example, we expose the following command to check the status of an EMR cluster. The CLI client maps the command and flags to the request path and body in our API:

Deployment

One thing to keep in mind when using an API Gateway REST API is that it does not support automatic deployments (HTTP APIs do). So, if you make changes to API resources or stages, you need to manually trigger a deployment.

Terraform example

This example demonstrates how to trigger a deployment with Terraform 0.11. The trigger hash could be a hash of the Terraform file(s) containing resources associated with the API Gateway. If there are any changes in these files, the hash will change and thus a deployment will be triggered.

Note that Terraform 0.12 supports an explicit triggers argument, which lets you provide the trigger hash directly instead of passing it via the deployment variables.

Conclusion

We have been using this API-Gateway-based setup for several months, and it has been working really well for the Data Lakes tooling use case. Before this setup became available, the tooling operations were executed solely by our engineering team to resolve customers’ issues. After the solution was deployed, we opened up the tooling service to our support team, reducing the dependency on the engineering team and the operational costs.

This was our first time using the API Gateway product, so we had to spend some time figuring out various solutions and workarounds (e.g. integrating with an ALB, triggering deployments with Terraform, etc.), but overall we had a pleasant experience once we worked through these issues.

There are many other features of API Gateway that we didn’t use for our tooling and thus didn’t discuss in this blog post, but that may still be helpful for your use case. For example, API Gateway allows you to set throttling limits and set up a cache, which helps prevent your system from being overwhelmed by a high volume of requests. You can also use API Gateway to manage and run multiple versions of an API simultaneously by configuring each version with a different backend integration.

Dustin Bailey on May 5th 2021

Here's how you can reasonably determine a third party’s risk using only a handful of questions and select documents…no lengthy questionnaires or time-consuming audits required.

Tim French on April 29th 2021

Say goodbye to long customer support wait times and lengthy back-and-forth with Segment Personas and Twilio Flex.
