Data Lakes

Segment Data Lakes provides a turnkey customer data lake in your own AWS infrastructure, using EMR clusters, to power data analytics and data science workloads.

How Segment Data Lakes works

Segment Data Lakes combines the experience of Segment’s existing S3 destination with that of a warehouse destination.

Data Lakes stores Segment data in a cloud object store (Amazon S3 to start), where the files are published in Parquet, a columnar binary file format, and partitioned by event type and date. This storage layer is integrated with the AWS Glue Data Catalog, which makes the data easy to discover and to use with tools like Spark and Athena, or with machine learning vendors like Databricks and DataRobot.
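
To make the partitioning concrete, here is a minimal sketch of how a layout partitioned by event type and date might look, and how a query could limit its scan to a date range. It assumes a Hive-style `key=value` directory convention. The partition keys `event_type` and `day`, the bucket name, the source ID, and the database and table names are all hypothetical placeholders, not the names Segment actually writes; check your own bucket and Glue tables for the real ones.

```python
from datetime import date


def partition_prefix(bucket: str, source_id: str, event_type: str, day: date) -> str:
    """Build a Hive-style S3 prefix for one event type and one day.

    The key names `event_type` and `day` are assumptions for illustration.
    """
    return (
        f"s3://{bucket}/{source_id}/"
        f"event_type={event_type}/day={day.isoformat()}/"
    )


def date_range_query(database: str, table: str, start: date, end: date) -> str:
    """Build a SQL string that filters on the (assumed) `day` partition column.

    Filtering on a partition column lets engines such as Athena or Spark
    skip the partitions that fall outside the requested range.
    """
    return (
        f'SELECT * FROM "{database}"."{table}" '
        f"WHERE day BETWEEN '{start.isoformat()}' AND '{end.isoformat()}'"
    )


if __name__ == "__main__":
    # All names below are placeholders.
    print(partition_prefix("my-data-lake", "abc123", "order_completed", date(2024, 1, 15)))
    print(date_range_query("segment_lake", "order_completed", date(2024, 1, 1), date(2024, 1, 31)))
```

Because the date is part of the path, a query that filters on it only needs to read the matching directories rather than the whole data set.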

[Figure: Segment Data Lakes Architecture]

Getting started with Segment Data Lakes

To set up Segment Data Lakes, first configure the necessary AWS resources, such as EMR, IAM, and Glue, using either the open source Terraform module or the manual setup instructions. Then enable the Data Lakes destination by entering the credentials of those AWS resources on the Data Lakes settings page. Once the destination is enabled, the first sync begins within 2 hours. The Data Lakes setup guide has detailed instructions and answers to common questions, such as how to identify sync errors, query sync reports, and add new sources.

Integrate Data Lakes with Segment

Segment makes it easy to set up Data Lakes.