How Segment Data Lakes works
Segment Data Lakes blends the experience of using Segment’s existing S3 destination with that of a warehouse destination.
Data Lakes stores Segment data in a cloud object store (Amazon S3 to start), where files are written in Parquet, a binary columnar encoding format, and partitioned by event type and date. This storage layer is integrated with the AWS Glue Data Catalog to provide easy data discoverability and integration with tools like Spark and Athena, or machine learning vendors like Databricks and DataRobot.
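As an illustrative sketch only (the exact key scheme is an assumption here, not Segment's documented layout), an event-type- and date-partitioned Parquet layout in S3 uses Hive-style `key=value` path segments, which is what lets Glue, Athena, and Spark discover partitions and prune scans:

```python
from datetime import date

def partition_key(source_id: str, event_type: str, day: date, filename: str) -> str:
    """Build a hypothetical S3 object key for a Parquet file partitioned by
    event type and date. The segment names (`segment_type`, `day`) are
    illustrative placeholders, not Segment's actual scheme."""
    return (
        f"data/{source_id}/"
        f"segment_type={event_type}/"
        f"day={day.isoformat()}/"
        f"{filename}"
    )

# Example: one file of "page_viewed" events from May 1, 2023
key = partition_key("src_123", "page_viewed", date(2023, 5, 1), "part-0001.parquet")
print(key)
# data/src_123/segment_type=page_viewed/day=2023-05-01/part-0001.parquet
```

Because the partition values are encoded in the object key itself, a query filtered on event type or date only has to read the matching prefixes rather than the whole bucket.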
Getting started with Segment Data Lakes
To set up Segment Data Lakes, configure the necessary AWS resources such as EMR, IAM, and Glue using the open source Terraform module or the manual setup instructions. Then enable the Data Lakes destination by entering the credentials for those AWS resources on the Data Lakes settings page. Once the destination is enabled, the first sync begins within two hours. Detailed instructions and answers to common questions, such as how to identify sync errors, query sync reports, and add new sources, can be found in the Data Lakes setup guide.
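The Terraform route might look like the following sketch. The module source address, bucket name, and input variable names below are illustrative placeholders, not the module's actual interface; consult the setup guide for the real inputs.

```hcl
# Hypothetical invocation of the open source Data Lakes Terraform module.
# The source address and variable names here are assumptions for illustration.
module "segment_data_lakes" {
  source = "<data-lakes-terraform-module>"

  # S3 bucket that will hold the Parquet files
  s3_bucket = "my-segment-data-lake"

  # Glue database in which the catalog tables are created
  glue_database = "segment_data_lake"
}
```

Running `terraform apply` with a configuration like this provisions the EMR, IAM, and Glue resources in one step, after which the resulting credentials can be entered on the Data Lakes settings page.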