Data Lakes (Beta)
Segment Data Lakes lets you send Segment data to a cloud data store (for example, AWS S3) in a format optimized to reduce processing for data analytics and data science workloads. Segment data is great for building machine learning models for personalization and recommendations, and for other large-scale advanced analytics. However, without Data Lakes, you might need to do a lot of processing to get real value out of your data.
Segment Data Lakes is available to Business tier customers only.
To learn more, read our blog post on Cultivating your Data Lake.
How Segment Data Lakes Work
Data Lakes store Segment data in S3 in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also builds in logical data partitions and event tables, and integrates metadata with existing schema management tools, such as the AWS Glue Data Catalog. The resulting data set is optimized for use with systems like Spark, Athena, EMR, or Machine Learning vendors like DataBricks or DataRobot.
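For instance, once the Glue tables are in place, an Athena query can prune partitions instead of scanning every file. This is a hedged sketch: the database name my_source (a source slug), the table order_completed (an event type), and the day/hr partition columns are assumptions based on the default partitioning described below.

```sql
-- Hypothetical: count events for a single hour of data, letting Athena
-- prune partitions via the day/hr columns rather than scanning all files.
SELECT COUNT(*) AS events
FROM my_source.order_completed
WHERE day = '2020-07-29'
  AND hr = '10';
```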
Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapReduce) cluster within your AWS account using an assumed role. Customers using Data Lakes own and pay AWS directly for these AWS services.
Using a data lake with a data warehouse
Data Lakes provides a flexible blob storage solution to data teams as they scale.
When you use Data Lakes, you can either use Data Lakes as your only source of data and query all of your data directly from S3, or you can use Data Lakes in addition to a data warehouse.
The Data Lakes and Warehouses products are compatible using a mapping, but do not maintain exact parity with each other. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related. You can read more about the differences between Data Lakes and Warehouses.
Set up Segment Data Lakes
Detailed set up instructions can be found in the Data Lakes catalog page.
Data Lakes Schema
S3 Partition Structure
Segment partitions the data in S3 by the Segment source, event type, then the day and hour an event was received by Segment, to ensure that the data is actionable and accessible.
The resulting file paths follow the partition structure described below.
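As a purely hypothetical illustration (the bucket name, prefix, and IDs below are placeholders; the exact layout is shown during set up), a path using the default partitioning might resemble:

```
s3://<your-bucket>/<prefix>/<source-id>/<event-type>/day=2020-07-29/hr=10/<file>.parquet
```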
By default, the date partition structure is day=<YYYY-MM-DD>/hr=<HH> to give you granular access to the S3 data. You can change the partition structure during the set up process by choosing from the following options:
- Day/Hour [YYYY-MM-DD/HH] (Default)
- Year/Month/Day/Hour [YYYY/MM/DD/HH]
- Year/Month/Day [YYYY/MM/DD]
- Day [YYYY-MM-DD]
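The partition options above can be sketched as a small helper. The function name and option keys here are ours, not Segment's, and only the default day/hour format is documented exactly; the others are taken from the bracket notation and may differ in practice.

```python
from datetime import datetime

def partition_suffix(received_at: datetime, style: str = "day-hour") -> str:
    """Build the date portion of the S3 path for an event's receive time.

    A sketch of the options listed above; only "day-hour" (the stated
    default, day=<YYYY-MM-DD>/hr=<HH>) is documented exactly -- the key
    names for the other options follow the bracket notation and may
    differ in practice.
    """
    formats = {
        "day-hour": "day=%Y-%m-%d/hr=%H",
        "year-month-day-hour": "%Y/%m/%d/%H",
        "year-month-day": "%Y/%m/%d",
        "day": "%Y-%m-%d",
    }
    return received_at.strftime(formats[style])

# An event received 2020-07-29 at 10:15 UTC lands in:
print(partition_suffix(datetime(2020, 7, 29, 10, 15)))  # day=2020-07-29/hr=10
```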
Glue Data Catalog
Data Lakes stores the inferred schema and associated metadata of the S3 data in AWS Glue Data Catalog. This metadata includes the location of the S3 file, the conversion of the data into Parquet format, the column names inferred from the Segment event, nested properties and traits (which are flattened), and the inferred data type for each column.
New columns are appended to the end of the table in the Glue Data Catalog as they are detected.
The Glue database stores the schema inferred by Segment. Segment stores the schema for each source in its own Glue database, named using the source slug by default, to organize the data and make it easier to find and query. The database name can be modified from the Data Lakes settings.
The recommended IAM role permissions grant Segment access to create the Glue databases on your behalf. If you do not grant Segment these permissions, you must manually create the Glue databases for Segment to write to.
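If you do create them manually, one option is the AWS CLI. This is a sketch, assuming a source slug of my_source; adjust the database name, AWS credentials, and region to match your setup.

```
aws glue create-database --database-input '{"Name": "my_source"}'
```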
Data Lakes infers the data type for each event it receives by looking at the group of events received every hour.
The data types supported in Glue are bigint, decimal(38,6), string, boolean, and timestamp.
Data Lakes does not change the data type for a column in the Glue tables once it has been set. If incoming data does not match the data type in an existing Glue table, Data Lakes tries to cast the column to the target data type.
If the data type in Glue is wider than the data type for a column in an on-going sync (for example, a decimal vs integer, or string vs integer), then the column is cast to the wider type in the Glue table. If the column is narrower (for example, integer in the table versus decimal in the data), the data might be dropped if it cannot be cast at all, or in the case of numbers, some data might lose precision. The original data in Segment remains in its original format, so you can fix the types and replay to ensure no data is lost. Learn more about type casting here.
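The casting behavior above can be illustrated with a small sketch. This is our own illustration of the rule, not Segment's implementation; the function name and the use of Python's float as a stand-in for decimal(38,6) are assumptions.

```python
def cast_to_glue_type(value, glue_type):
    """Best-effort cast of an incoming value to an existing Glue column type.

    Illustrates the rules above: values are cast to the column's (wider)
    type when possible; values that cannot be cast at all are dropped
    (returned as None); narrowing numeric casts may lose precision.
    """
    casters = {
        "bigint": int,
        "decimal(38,6)": float,   # stand-in for a fixed-precision decimal
        "string": str,
        "boolean": bool,
    }
    try:
        return casters[glue_type](value)
    except (ValueError, TypeError):
        return None  # uncastable data is dropped

print(cast_to_glue_type(42, "decimal(38,6)"))  # 42.0 -- integer widened to decimal
print(cast_to_glue_type("7", "bigint"))        # 7 -- numeric string cast to bigint
print(cast_to_glue_type(3.99, "bigint"))       # 3 -- narrowing cast loses precision
print(cast_to_glue_type("abc", "bigint"))      # None -- cannot cast; dropped
```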
Changing data types
If Data Lakes sees a bad data type, for example text in place of a number or an incorrectly formatted date, it attempts a best effort conversion to cast the field to the target data type. Fields that cannot be cast may be dropped. You can also correct the data type in the schema to the desired type and Replay to ensure no data is lost. Contact Segment Support if you find a data type needs to be corrected.
AWS EMR
Data Lakes uses an EMR cluster to run jobs that load events from all sources into Data Lakes. The set up instructions have you deploy an EMR cluster with an m5.xlarge node type. Data Lakes currently keeps the cluster running at all times, but the cluster auto-scales so it is not always running at full capacity. Check the Terraform module documentation for the EMR specifications.
AWS IAM Role
Data Lakes uses an IAM role to grant Segment secure access to your AWS account. The required inputs are:
- external_ids: External IDs are the part of the IAM role that Segment uses to assume the role, providing access to your AWS account. Define the external ID in the IAM role as the ID of the Segment source you want to connect to Data Lakes. The Segment source ID can be retrieved from the Segment app.
- s3_bucket: Name of the S3 bucket used by the Data Lake.
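As a rough sketch of how these inputs fit together, the IAM role's trust policy restricts sts:AssumeRole using the external ID. The principal ARN and source ID below are placeholders; the setup instructions define the exact values.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Principal": { "AWS": "<segment-principal-arn>" },
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<your-segment-source-id>" }
      }
    }
  ]
}
```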
How often is data synced to Data Lakes?
Data Lakes currently offers 12 syncs in a 24 hour period. Data Lakes does not currently offer a custom sync schedule, or allow you to use Selective Sync to manage what data is sent.
What should I expect in terms of duplicates in Data Lakes?
Segment’s overall guarantee for duplicate data also applies to data in Data Lakes: 99% guarantee of no duplicates for data within a 24 hour look-back window.
If you have advanced requirements for de-duplication, you can add de-duplication steps downstream to reduce duplicates outside this look back window.
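One downstream de-duplication step is to key on the Segment messageId, which uniquely identifies an event. A minimal Python sketch (the function name and sample events are ours):

```python
def deduplicate(events):
    """Drop events whose messageId has already been seen.

    Segment's messageId uniquely identifies an event, so it is a natural
    de-duplication key for rows loaded from a data lake.
    """
    seen = set()
    unique = []
    for event in events:
        if event["messageId"] not in seen:
            seen.add(event["messageId"])
            unique.append(event)
    return unique

events = [
    {"messageId": "m-1", "event": "Order Completed"},
    {"messageId": "m-2", "event": "Page Viewed"},
    {"messageId": "m-1", "event": "Order Completed"},  # duplicate delivery
]
print(len(deduplicate(events)))  # 2
```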
Can I send all of my Segment data into Data Lakes?
Data Lakes currently supports data from all event sources, including website libraries, mobile, server and event cloud sources.
Data Lakes does not currently support loading object cloud source data, or the users and accounts tables from event cloud sources.
Are user deletions and suppression supported?
User deletions are currently not supported in Data Lakes, however user suppression is supported.
How does Data Lakes handle schema evolution?
Any new columns detected will be appended to the end of the table in the Glue Data Catalog.
How does Data Lakes work with Protocols?
Data Lakes does not currently have a direct integration with Protocols.
Today, any changes to events at the source level made with Protocols also change the data for all downstream destinations, including Data Lakes.
Mutated events - If Protocols mutates an event due to a rule set in the Tracking Plan, that mutation appears in Segment’s internal archives and is reflected in your data lake. For example, if you used Protocols to mutate an event name to productID, the event appears in both Data Lakes and Warehouses as productID.
Blocked events - If a Protocols Tracking Plan blocks an event, the event is not forwarded to any downstream Segment destinations, including Data Lakes. However, events that are only marked with a violation are still passed to Data Lakes.
Data types and labels available in Protocols are not currently supported by Data Lakes.
- Data Types - Data Lakes infers the data type for each event using its own schema inference systems, instead of using a data type set for an event in Protocols. This might lead to the data type set in a data lake being different from the data type in the Tracking Plan. For example, if you set product_id to be an integer in the Protocols Tracking Plan, but the event is sent into Segment as a string, then Data Lakes may infer the data type as a string in the Glue Data Catalog.
- Labels - Labels set in Protocols are not sent to Data Lakes.
What is the cost to use AWS Glue?
You can find details on Amazon’s pricing for Glue page. For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
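Since Glue pricing depends partly on object counts, it can help to estimate partition growth. A quick back-of-the-envelope sketch, where the number of event types is a hypothetical input:

```python
# Rough partition-count estimate for one source, based on the rates above:
# 1 table per event type, 1 new partition per event table per hour.
event_types = 50            # hypothetical number of distinct event types
hours_per_month = 24 * 30   # approximate hours in a month

partitions_per_month = event_types * hours_per_month
print(partitions_per_month)  # 36000
```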
What limits does AWS Glue have?
AWS Glue has limits across various factors, such as number of databases per account, tables per account, and so on. See the full list of Glue limits for more information.
The most common limits to keep in mind are:
- Databases per account: 10,000
- Tables per database: 200,000
- Characters in a column name: 250
Segment stops creating new tables for events after you exceed these limits. However, you can contact your AWS account representative to increase them.
You should also read the additional considerations when using AWS Glue Data Catalog.
This page was last modified: 29 Jul 2020
Questions? Problems? Need more info? Contact us, and we can help!