Solon Aguiar, Brian Lai on January 20th 2022
Kelly Kirwan on August 13th 2021
Geoffrey Keating on June 9th 2021
Geoffrey Keating on June 8th 2021
By using data models, various stakeholders such as developers, data architects, and business analysts can agree on the data they’ll capture and how they want to use it before building databases and warehouses.
Like a blueprint for a house, a data model defines what to build and how, before starting construction, when things become much more complicated to change. This approach prevents database design and development errors, capturing unnecessary data, and duplicating data in multiple locations.
In this article, we’ll cover these basics of data modeling:
What is data modeling?
Understanding different types of data models
Why data models are necessary for building a data infrastructure
Top three data modeling techniques
Data modeling is the process of conceptualizing and visualizing how data will be captured, stored, and used by an organization. The ultimate aim of data modeling is to establish clear data standards for your entire organization.
For example, a model for an eCommerce website might specify the customer data you’ll capture. It will define how to label that data and its relation to product information and the sales process.
Data models get divided into three categories: abstract, conceptual, and physical models. They help align stakeholders around the why, how, and what of your data project. Each type of model has a different use case and audience in the data modeling process.
Conceptual data models visualize the concepts and rules that govern the business processes you’re modeling without going into technical details. You use this visualization to align business stakeholders, system architects, and developers on the project and business requirements: what information the data system will contain, how elements should relate to each other, and their dependencies.
Typically, a conceptual model shows a high-level view of the system’s content, organization, and relevant business rules. For example, a data model for an eCommerce business will contain vendors, products, customers, and sales. A business rule could be that each vendor needs to supply at least one product.
There’s no standard format for conceptual models. What matters is that it helps both technical and non-technical stakeholders align and agree on the purpose, scope, and design of their data project. All of the below images could be examples of conceptual data models.
A logical data model is based on the conceptual model and defines the project’s data elements and relationships. You’ll see the names and data attributes of specific entities in the database. To stay with the eCommerce example: A logical model shows products are identified through a “product ID,” with properties like a description, category, and unit price.
Data architects and business analysts use the logical data model to plan the implementation of a database management system—software that stores, retrieves, defines, and manages data in a database.
The physical data model gets technical. Database analysts and developers use it for the design of the database and relevant data structures. The model specifies the types of data you’ll store along with technical data requirements.
An example of data type specifications is whether a piece of data will be an integer—a number without a decimal point—or a float—a number with a decimal place. Technical requirements include details on storage needs, access speed, and data redundancy—storing a piece of data in multiple locations to increase durability and improve query performance.
In practice, only very large projects, say modeling a container shipping business, move from conceptual to logical to physical models. Most other projects skip the conceptual phase and spend most of their time in logical modeling. Some teams even cover elements from the physical phase simultaneously because the people working on the logical model also do the technical implementation.
Data models are a visual representation that turns abstract ideas (“we want to track our global container shipments in real time”) into a technical implementation plan (“we will store an attribute called ‘container GPS location’ in a table called ‘Containers’ as an integer”). They help avoid costly demolition and reconstruction of your data infrastructure because data modelers need to think about the data they'll need, its relations, the data architecture, and even whether your project is viable before creating databases and warehouses.
Data models also help with data governance and legal compliance, as well as ensuring data integrity. They allow you to set standards from the start of the project so teams don’t end up with conflicting data sets that need cleaning up before they can use it or, worse, can’t use at all.
Data models and standardization help avoid situations like a sign-up field labeled in nearly a dozen different ways across the organization.
You can also identify sensitive information—social security numbers, passwords, credit card numbers—while you’re modeling so you can involve security and legal experts before you start building.
Safe, accurate, and high-quality data, confers a range of real-world benefits for various teams in your organization. Product teams can iterate faster and build immersive user experiences. Analytics and business intelligence teams can create queries without heavy workarounds. And marketing teams can improve advertising efforts by personalizing messaging according to user behaviors and traits.
Customer Data Platforms (CDPs) like Segment can do much of the heavy-lifting during data modeling projects by simplifying and systematizing data storage and organization. Segment’s Connections feature makes it easy to capture, organize, and visualize every customer-facing interaction with your business, whether digital or offline. Protocols lets you define your data standards and enforce them at the point of collection. Functionality, like real-time data validation and automatic enforcement controls, allows you to diagnose issues before they pollute your marketing and data analytics tools or data warehouse.
There are many different techniques to design and structure a database, as well as a variety of data modeling tools that you can use. You should explore these techniques and decide on the most suitable one for your project at the end of the conceptual phase. These data modeling methodologies define how the database gets structured and closely relate to the type of formatting or technology you can use to manage your data sources.
For example, many people now default to graph modeling because it’s new and popular, even when a simple relational model would suffice. Understanding the most popular techniques for the process of data modeling helps you avoid such mistakes.
In a relational data model, data gets stored in tables, of which specific elements link to information in other tables. Entities can have a one-to-one, one-to-many, or many-to-many relationship.
Relational databases often use SQL (Structured Query Language), a programming language, for accessing and managing data. They’re frequently used in point-of-sale systems, as well as for other types of transaction processing.
The Entity-Relationship Model—sometimes referred to as ER model—is similar to the relational model. It is a relationship diagram that visualizes the different elements in a system but without going into technical details. You can use the ER model during the conceptual phase to align technical and non-technical stakeholders.
To understand dimensional models, picture a cube. Each side of the cube represents an aspect of the data you’re trying to capture.
For example, suppose your business sells multiple products to different customer segments, and you want to evaluate sales performance over time. You can visualize this as a data cube, with dimensions for time, products, and customer segments. By traveling up, down, left, and right on the axes of the cube, you can make comparisons across all those dimensions. You’ll see how the sales of each of these products compare to each other and different customer segments at any point in time.
You use the cube model during the conceptual phase. One of the most frequent manifestations of such a cube in the logical stage is the “star schema,” like the one below. At first, it might look like a relational model. Still, the star schema is different because it has a central node that connects to many others.
During the conceptual phase, most people sketch a data model on a whiteboard. Such a sketch resembles the graph model. It consists of “nodes” and edges—a node represents where the data is stored, the edge the relation between nodes. It’s also the main advantage of this approach: “what you sketch on the whiteboard is what you store in the database.”
Other techniques require you to translate the output from the conceptual phase into a different format for the logical and physical implementation—for example, going from an ER to a relational model or from a cube model to a star schema. Not so with graph models. You can implement them straight away using technology like Neo4j, a native graph database platform.
When you understand the purpose of data models and the process to follow, they’re not challenging to create, especially if you also collect, organize, and standardize your data with Segment. You’ll align all stakeholders before starting technical implementation and avoid costly mistakes or rebuilds. You’ll know what expertise you need on the team to execute your plan and have your data governance defined, too.
Andy Li on June 1st 2021
Access is always changing. When you start at a new company, you usually are given access to a set of apps provisioned to you on day one, based on your team and role. Even on day one, there can be a difference between the access you are granted and the access you need to do your job. This results in two outcomes: underprovisioned or overprovisioned access.
For the IT and security teams who manage cloud infrastructure accounts, securing access to them can be difficult and scary; the systems are complex, and the stakes are high. If you grant too much access, you might allow bad actors access to your tools and infrastructure, which at best results in a breach notification; at worst, it results in a company-ending, game-over scenario. If you grant too little access, you put roadblocks between your colleagues and the work they need to do, meaning you are decreasing your company’s productivity.
A common approach taken by startups and small companies is to grant access permissively. In these companies, early productivity can be critical to the success of the business. An employee locked out of a system because of missing access means lost productivity and lost income for the business.
If you give employees permanent admin access to every system, you optimize for velocity, but at the expense of increased risks from compromised employee accounts and insider threats. This results in an increased attack surface. As your company grows, it becomes more important to secure access to critical resources, and this requires a different approach.
If you give employees too little access, it forces them to request access more often. Although new employees are initially given access based on their team and role, new duties and new projects can quickly increase the scope of the access they need. Depending on your company’s process for providing access, this can be cumbersome for the requester, for the approver, or oftentimes, for both.
Here at Segment, we have production environments across Amazon Web Services (AWS) and Google Cloud Platform (GCP). We need to secure access to these accounts thoughtfully so that our engineers can continue to build fast and safely. At many companies, you might rely on a centralized team to manage internal access. While this is a simple approach, it does not scale – team members have a limited amount of context surrounding requests, and might accidentally over-provision the requester’s access. At Segment, we approached the problem of managing least-privilege cloud access by building Access Service: a tool that enables time-based, peer-reviewed access.
At Segment, we have hundreds of roles across dozens of SaaS apps and cloud providers representing different levels of access. In the past, we used to have to log in to each app or system individually to grant a user access. Our IT team managed to “federate” our cloud access and use Okta as our Identity Provider. This gave us a single place to manage which users have access to which roles and applications. The rest of this blog post builds on this federated access system.
If your organization hasn’t built something similar, the following resources that can help you build and set up your own federated cloud access system.
By configuring Okta applications to cloud provider roles, engineers are one click away from authenticating to a cloud provider with single sign-on (SSO) with appropriate permissions.
Each Okta app is mapped to a “Cloud Account Role” (or “Cloud Project Role” for GCP). For example, in AWS, we have a Staging account with a Read role that provides read access to specific resources. In Okta, we have a corresponding app named “Staging Read - AWS Role” that allows engineers to authenticate to the AWS Staging Account and assume the Read role.
This requires configuring an Okta app for each “Cloud Account Role” combination, which at the time of writing is 150+ Okta apps.
Configuring GCP with Okta is slightly different, and technical details for how to do this are at the bottom of this blog.
In addition to authentication, Identity Providers can also help with authorization. Users get understandably frustrated when they get access to an application, but don’t have the correct permissions to do their job.
Identity Providers have agreed upon a common set of REST APIs for managing provisioning, deprovisioning, and group mapping called SCIM (the System for Cross-domain Identity Management).
If an application supports SCIM, you can create groups within your Identity Provider (e.g. Okta), which will map user membership into the application. With this setup, adding users to the Okta group will automatically add them to the corresponding group in the application. Similarly, when a user is unassigned from the application in Okta, their membership in the application group will also be lost.
SCIM allows us to provide granular, application-level access, all while using our Identity Provider as the source of truth.
With a single place to manage access for all of our cloud providers, the problem should be solved, right? Not quite…
While the underlying Okta apps and groups system worked great, we quickly ran into more human problems.
Even with our awesome new Okta+AWS system, we still needed a process for a centralized team to provision access through Okta. At many companies, this team would be IT. At Segment, this was a single person named Boggs. Requests would go into his inbox, and he would manually review the request reason, and decide if there was a more suitable level of access for the task. Finally, he would go to the Okta admin panel and provision the appropriate app to the user. Although this system worked for a time, it was not scalable and had major drawbacks.
Once an app was provisioned to a user, they would have access until they left Segment. Despite having permanent access, they might not need permanent access. Unfortunately, our manual provisioning process did not have a similar scalable way to ensure access was removed after it was no longer needed. People granted access for one-off tasks now had permanent access that hung around long after they actually needed it.
As an engineering manager, Boggs had a strong sense of available IAM roles and their access levels. This allowed him to reduce unnecessary access by identifying opportunities to use less sensitive roles. This context was difficult to replicate and was a big reason why we could not simply expand this responsibility to our larger IT team.
Most centralized IT teams don’t work closely with all of the apps that they provision, and this makes it difficult for them to evaluate requests. Enforcing the principle of least privilege can require intimate knowledge of access boundaries within a specific app. Without this knowledge, you’ll have a hard time deciding if a requester really needs “admin”, or if they could still do the work with “write” permissions, or even just with “read” access.
Kyle from the Data Engineering team is requesting access to the Radar Admin role to “debug”. What do they actually need access for? Would a Read only role work? And wait… who is Kyle?! Did they start last week? They say that they need this access to do their job and I still need to do mine… APPROVE.
Despite being better equipped than most people to handle access requests, Boggs was a busy engineering manager. Although at first provisioning access was an infrequent task, as the company grew, it began to take up valuable chunks of time and became increasingly difficult to understand the context of each request.
We considered involving extra team members from our IT team, but this would still take time, as they would need to contact the owners of each system to confirm that access should be granted. Ultimately, having a limited pool of centralized approvers working through a shared queue of requests made response times less than ideal.
Boggs tried automating parts of the problem away using complex scripted rules based on roles and teams, but there were still situations that broke the system. How would he handle reorgs where teams got renamed, switched, merged, or split? What happens when a user switches teams? What happens when a team had a legitimate business need for short-term access to a tool they didn’t already have? Using that current system, any access Boggs provisioned lasted forever - unless somebody went in and manually audited Okta apps for unused access.
Ultimately, we found ourselves in a situation where we had a lot of over-provisioned users with access to sensitive roles and permissions. To make sure we understood how bad the problem actually was, we measured the access utilization of our privileged roles. We looked at how many privileged roles each employee had access to, and compared them to how many privileged roles had actually been used in the last 30 days.
The results were astonishing: 60% of access was not being used.
Managing long-lived access simply did not scale. We needed to find a way to turn our centralized access management system into a distributed one.
In the real world of access, we shouldn’t see a user's access footprint as static, but instead view it as amorphous and ever-changing.
When we adopted this perspective, it allowed us to build Access Service, an internal app that allows users to get the access they really need, and avoid the failure modes of provisioning too little or too much access.
Access Service allows engineers to request access to a single role for a set amount of time, and have their peers approve the request. The approvers come from a predefined list, which makes the access request process similar to GitHub pull requests with designated approvers.
As soon as the request is approved, Access Service provisions the user with the appropriate Okta app or group for the role. A daily cron job checks if a request has expired, and de-provisions the user if it has.
At a high level this is a simple web app, but let’s look closer at some specific features and what they unlock.
The magic of Access Service is the shift from long-lived access to temporary access. Usually, an engineer only needs access temporarily to accomplish a defined task.
Once that task is done, they have access they no longer need, which violates the principle of least privilege. Fixing this using the old process would mean manually deprovisioning Okta apps – adding yet another task to a workflow that was already painfully manual.
With Access Service, users specify a duration with their access request. Approvers can refuse to approve the request if they think the duration is unnecessarily long for the task. This duration is also used to automatically deprovision their access once the request expires.
Access Service offers two types of durations: “time-based access” and “activity-based access”.
Time-based access is a specific time period, such as one day, one week, two weeks, or four weeks. This is ideal for unusual tasks such as:
fixing a bug that requires a role you don’t usually need
performing data migrations
helping customers troubleshoot on production instances you don’t usually access
Activity-based access is a dynamic duration that extends the access expiration each time you use the app or role you were granted. This is ideal for access that you need for daily job functions – nobody wants to make a handful of new access requests every month. However, we don’t offer this type of access for our more sensitive roles. Broad-access roles, or roles that have access to sensitive data require periodic approvals to maintain access. Activity-based access provides a more practical balance between friction and access, aligning with our goals of enabling our engineers to build quickly and safely
One of the biggest limitations with our previous process was that one person had to approve everything. In Access Service, each app has a vetted list of approvers who work closely with the system. By delegating decision making to experts, we ensure that access is approved by the people who know who should have it.
To start out with, you can’t approve your own access requests. (Sorry red team.) Each app has a “system owner” who is responsible for maintaining its list of approvers. When a user creates an access request, they select one or more approvers to review their request. Because the approvers list contains only people who work closely with the system, the approvers have better context and understanding of the system than a central IT team.
This makes it easier for approvers to reject unreasonable or too-permissive access requests, and encourages users to request a lower tier of access (for example, telling them to request a read-only role instead of a read/write role). Since incoming requests are “load balanced” between approvers, users also see a much faster response time to their requests.
Provisioning access always requires two people, much like a GitHub pull request. Users cannot select themselves as an approver, even if they are a system owner. Access Service also supports an “emergency access” mechanism with different approval requirements. This prevents Access Service from blocking an on-call or site reliability engineer if they need access in the middle of the night.
With system owners appointed for each app, our distributed pool of approvers continues to scale as we introduce new tools with new access roles and levels. This is what the security community calls “pushing left”.
When you “push left”, you introduce security considerations earlier in the development lifecycle, instead of trying to retrofit a system after it is in use. In the software engineering space, “pushing left” resulted in engineers learning more about security. This means that the people most familiar with the systems are the most knowledgeable people to implement security fixes. Since the engineers are the ones who designed and now maintain the software, they have much more context than the central security team. Similarly, Access Service unburdens the central IT team, and empowers system owners to make decisions about who should have access to their systems, and at what level. This significantly reduces the amount of time the IT team spends provisioning access, and frees them up to do more meaningful work.
Access Service, like many of our internal apps, is accessible to the open internet, but protected behind Okta.
The basic unit of Access Service is a “request”. A user who wants access creates a request that includes four pieces of information:
the application they want access to
the duration they want access for
a description for why they need access
the approver(s) they want to review the request
When they click “Request Access”, Access Services sends the selected approvers a Slack notification. Segment, like many modern companies, has a high degree of Slack presence. Using this platform makes Access Service a more natural, less disruptive part of people’s workflows. Even if the user requesting access is an approver for the particular app, they must receive approval from a different approver – every request must involve two people.
The access request is tracked in a web app, so you can see what requests you have open, and what roles you currently have access to.
The requester is notified via Slack when their request has been approved, so they know they can now get back to the task they needed access for in the first place.
After we migrated our access process to Access Service, the result was zero long-lived access to any of our privileged cloud roles in AWS and GCP. All access granted to these roles expires if it is not actively used.
In the graph below, “Access Points” refers to the number of users with access to each admin role. After moving to Access Service, we reduced the number of people who had privileged access by 90%.
In the next graph below, “Active” refers to the number of people who used an app within the last 30 days. Because this number is higher than the number of Access Points, this shows that more access was used in the last 30 days than was currently provisioned.
That seems strange – how could admin apps have been used by more people than the total amount of people provisioned access? That’s because expired access had already been automatically deprovisioned, reducing the number of Access Points by the end of the 30 day window!
By acknowledging that access needs are constantly changing, we were able to create a more practical way to manage access control.
Access Service allows us to streamline the access approval process. By routing requests directly to designated approvers, we are able to get fast approvals from people with rich context. The time-based component of access requests allows the service to regularly remove unneeded access, preventing our access attack surface from growing too large. Finally, integrating Slack into the system makes approvals faster, ensures that you know immediately when your request has been approved, and reminds you when the request is expiring so you don’t run into unexpected access loss when just trying to do your job.
While it can be daunting to try to reinvent an existing, well-established process, the results can be incredibly rewarding. Start by writing down your goals, thinking about what you don’t like and what is painful about the current state, and reevaluate your core assumptions. Companies are always changing, and your processes have to keep up; the circumstances that led to the previous system may no longer be applicable today. Most importantly, remember to build with the user’s workflow in mind, because security depends on participation of the whole company.
Apps in Access Service are currently individually customizable. However, this can lead to issues with scalability if we want to make changes across multiple, similar apps. For example, if we decide that we want to limit access to several AWS accounts to no more than one week, we would currently have to edit the allowed durations for each individual role. With the introduction of policies, we would be able to map several roles to a single policy, allowing us to easily apply the change from the previous example.
Currently, Access Service grants users access to predefined AWS roles. These roles are typically made to be general-purpose, but there may be use-cases not fully captured by an existing role. Instead of configuring a new role for one-off needs, or using an overly permissive role, Access Service could allow users to create a dynamic role. When making a request, users would check boxes corresponding to what permissions they wanted (e.g. “S3 Read”, “CloudWatch Full Access”, etc) to create a custom, dynamic role.
Special thanks to David Scrobonia for creating Access Service and setting up the foundation for this blog. Thank you to John Boggs, Rob McQueen, Anastassia Bobokalonova, Leif Dreizler, Eric Ellett, Pablo Vidal, Arta Razavi, and Laura Rubin, all of whom either built, designed, inspired, or contributed to Access Service along the way.
Connecting a GCP role to Okta is harder than with AWS, and after struggling to figure it out for a while, we thought it would be worth sharing. To connect a GCP role to our Okta instance, we had to use Google Groups in GSuite.
First, we created a single GSuite Group for each of our Project-Role pairs. In GCP, a Google Group is a member (principal) that can be assigned a role, and all users added to the group are also assigned that role.
We then assigned each GCP role to its corresponding Google Group. Next, we needed to connect the Google Groups to Okta.
You can do this by using Okta Push Groups, which link an Okta “group” to a Google Group. Adding a user to an Okta Push Group automatically adds the correct GSuite user to the Google group. We created an Okta Group for each of the roles and configured it as a Push Group to its corresponding Google Group.
To summarize, the flow looked like this:
Add Okta User firstname.lastname@example.org to Okta Group “Staging Read - GCP Role”
Okta Push Groups adds the GSuite user email@example.com to the “Staging Read” Google Group
Because he is a member of the “Staging Read” Google Group , firstname.lastname@example.org is assigned the “Read” IAM role for the “Staging” project.
All of our internal apps use an OpenID Connect (OIDC) enabled Application Load Balancer (ALB) to connect to Okta. This provides a BeyondCorp approach to access for our internal apps: all are publicly-routable, but are behind Okta.
This is also nice from a tooling developer standpoint, because not only is authentication taken care of, but we can use the signed JSON web token (JWT) that Okta returns to the server through the ALB to get the identity of the user interacting with Access Service. This allows us to use Okta as a coarse authorization layer and manage which users have access to different internal apps.
Sonia Sidhpura, Tasha Alfano, Pooya Jaferian, Ivayr Dieb Farah Netto on May 25th 2021