Comparing Data Lakes and Warehouses
As Segment builds new data storage products, each product evolves from those before it to better support customer needs. Segment Data Lakes is an evolution of the Warehouses product, built to meet those changing needs.
Data Lakes and Warehouses are not identical, but are compatible with a configurable mapping. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related.
Data Lakes and Warehouses offer different sync frequencies:
- Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and selectively sync collections and properties within a source to Warehouses.
- Data Lakes offers 12 syncs in a 24-hour period, and does not offer custom sync schedules or selective sync.
Segment’s overall guarantee for duplicate data also applies to data in Data Lakes: a 99% guarantee of no duplicates for data within a 24-hour look-back window. The guarantee is the same for Warehouses.
Both Data Lakes and Warehouses (and all Segment destinations) rely on a de-duplication process at event ingest to ensure:
- The 24-hour look-back window duplicate guarantee is met
- Processing costs for customers are managed appropriately
Warehouses also have a secondary de-duplication system built in to further reduce the volume of duplicates. If you have stricter requirements for duplicates in Data Lakes, you can add de-duplication steps downstream to remove duplicates that fall outside this look-back window, as in the sketch below.
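For example, a downstream job can drop rows that share an event’s unique message ID. Here is a minimal PySpark sketch, assuming the Data Lakes output is Parquet in S3 and that each row carries `message_id` and `received_at` columns; the bucket paths are illustrative:

```python
# Minimal downstream de-duplication sketch (PySpark).
# Assumptions: events live as Parquet under an illustrative S3 path,
# and each row has a unique `message_id` plus a `received_at` timestamp.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedupe-events").getOrCreate()

events = spark.read.parquet("s3://your-bucket/segment-data/button_clicked/")

# Keep the earliest-received copy of each message_id and drop the rest.
w = Window.partitionBy("message_id").orderBy(F.col("received_at").asc())
deduped = (
    events
    .withColumn("_rank", F.row_number().over(w))
    .filter(F.col("_rank") == 1)
    .drop("_rank")
)

deduped.write.mode("overwrite").parquet(
    "s3://your-bucket/segment-data-deduped/button_clicked/"
)
```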
Object vs Event Data
Warehouses support both event and object data, while Data Lakes supports only event data.
See the table below for information about the source types supported by Warehouses and Data Lakes.
| Source type | Warehouses | Data Lakes |
|---|---|---|
| Object Cloud Sources | ✅ | ⬜️ |
| Event Cloud Sources | ✅ | ✅ |
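For context, event data is the stream of calls such as `track` and `page`, while object data is a snapshot of a record (for example, a CRM account). An illustrative `track` event, shown as a Python dict with made-up values:

```python
# An illustrative Segment track event (event data). All values are
# made up; object data, by contrast, is a point-in-time snapshot of
# a record such as a Salesforce account.
track_event = {
    "type": "track",
    "event": "Button Clicked",
    "userId": "user-123",
    "messageId": "msg-abc-001",  # unique ID used for de-duplication
    "timestamp": "2020-09-06T12:00:00Z",
    "properties": {"button_name": "signup"},
}
```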
Warehouses and Data Lakes both infer data types for the events each receives. Because Warehouses receive events one at a time, they look at the first event received each hour to infer the data type for subsequent events. Data Lakes uses a similar approach, but because it receives events in batches, it can look at a group of events to infer the data type.
This approach leads to a few scenarios where the data type for an event may be different between Warehouses and Data Lakes. Those scenarios are:
- Schema evolution - Events reach Warehouses and Data Lakes at different times because of their differing sync schedules, so a field whose data type changes over time can end up typed differently in each.
- Different data type inferred based on sample size - Warehouses and Data Lakes use a different number of events to infer the schema. Warehouses receive one event at a time and use the first received event to infer the data type. Data Lakes receives events in batches and uses a larger number of events to infer the data type more accurately.
Variance in data types between Warehouses and Data Lakes doesn’t happen often for booleans, strings, and timestamps, but it can occur for decimals and integers.
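To make the sample-size difference concrete, here is a simplified Python sketch of the two inference strategies; the type names and the widening rule are illustrative stand-ins, not Segment’s actual logic:

```python
# Illustrative sketch of single-event vs. batch type inference.
# The rules here are simplified stand-ins, not Segment's real logic.
def infer_type(value):
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "decimal"
    return "string"

def infer_from_batch(values):
    types = {infer_type(v) for v in values}
    if types == {"integer", "decimal"}:
        return "decimal"  # widen mixed numeric fields to decimal
    return types.pop() if len(types) == 1 else "string"

prices = [10, 11, 10.5]
print(infer_type(prices[0]))     # "integer"  (first-event inference)
print(infer_from_batch(prices))  # "decimal"  (batch inference)
```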
If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best-effort conversion to cast the field to the target data type. Fields that cannot be cast may be dropped. Contact us if you want to correct data types in the schema and perform a replay to ensure no data is lost.
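The casting behavior can be pictured the same way; this sketch of a best-effort cast is again a simplification, not the exact conversion logic:

```python
# Simplified sketch of best-effort casting to a target data type.
# Returning None stands in for the case where an uncastable field
# may be dropped.
def best_effort_cast(value, target):
    try:
        if target == "integer":
            return int(value)
        if target == "decimal":
            return float(value)
        return str(value)
    except (TypeError, ValueError):
        return None

print(best_effort_cast("42", "integer"))    # 42
print(best_effort_cast("oops", "decimal"))  # None
```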
Tables in Warehouses and Data Lakes are the same, except in the following cases:
- `tracks` - Warehouses provide one table per specific event (for example, `track_button_clicked`) in addition to a summary `tracks` table listing all `track` method calls. Data Lakes also creates one table per specific event, but does not provide a summary table.
- `users` - Both Warehouses and Data Lakes create an `identifies` table, however Warehouses also create a `users` table just for user data. Data Lakes does not create this table, since it does not support object data. The `users` table is a materialized view of the users in a source, constructed from data inferred about users from `identify` calls.
- `accounts` - Group calls generate the `accounts` table in Warehouses. However, because Data Lakes does not support object data (groups are objects, not events), there is no `accounts` table in Data Lakes.
- (Redshift only) Table names that begin with numbers - Table names are not allowed to begin with a number in the Redshift Warehouse, so such tables are automatically given an underscore (`_`) prefix. The Glue Data Catalog does not have this restriction, so Data Lakes doesn’t assign the prefix. For example, a table named `_101_account_update` in Redshift would be named `101_account_update` in Data Lakes. While this nuance is specific to Redshift, other warehouses may show similar behavior for other reserved words.
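A small illustration of the Redshift naming rule above; the helper function is hypothetical and not part of either product:

```python
# Hypothetical helper showing the Redshift table-name rule:
# names that begin with a digit get an underscore prefix.
def redshift_table_name(name: str) -> str:
    return f"_{name}" if name and name[0].isdigit() else name

print(redshift_table_name("101_account_update"))  # "_101_account_update"
print(redshift_table_name("button_clicked"))      # "button_clicked"
```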
Similar to tables, columns in Warehouses and Data Lakes are the same, except in a few specific scenarios:
- `event_text` - Each property within an event has its own column, however the naming convention for these columns differs between Warehouses and Data Lakes. Warehouses snake_case the original payload value and preserve the original text in the `event_text` column. Data Lakes use the original payload value as-is for the column name, and so do not need an `event_text` column (see the naming sketch after this list).
- `version` - These columns hold Segment-internal data and are found in Data Lakes but not in Warehouses. Warehouses are intentionally selective in their transformation logic and do not include them; Data Lakes includes them because of its more straightforward approach of flattening the whole event.
- (Redshift only) `uuid_ts` - Redshift customers will see `uuid_ts` columns, which are used for de-duplication in Redshift; other warehouses may have similar columns. These aren’t relevant for Data Lakes, so the columns won’t appear there.
- `sent_at` - Warehouses compute the `sent_at` value from timestamps found in the original event, to account for clock skew and timestamps set in the future. This was needed when the Segment pipeline didn’t correct for this on its own; the pipeline now performs that calculation at ingestion, so Data Lakes does no additional computation and passes the value through as-is (see the clock-skew illustration after this list).
- `integrations` - Warehouses do not include the integrations object; Data Lakes flattens and includes it.
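To make two of the differences above concrete: the first sketch below shows the kind of snake_casing Warehouses apply to column names (a simplified stand-in, not Segment’s exact rules), and the second illustrates the clock-skew adjustment idea behind `sent_at`.

```python
import re

# Simplified stand-in for the snake_casing Warehouses apply to
# property names; Data Lakes would keep "ButtonClicked" as-is.
def snake_case(name: str) -> str:
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    s = re.sub(r"[\s\-]+", "_", s)
    return s.lower()

print(snake_case("ButtonClicked"))   # "button_clicked"
print(snake_case("Button Clicked"))  # "button_clicked"

# Clock-skew adjustment, shown only to illustrate the concept:
# shift the client's timestamp by the gap between when the client
# says it sent the event and when the server received it.
def skew_corrected(original_ts: float, sent_at: float, received_at: float) -> float:
    return received_at - (sent_at - original_ts)
```

Both helpers are illustrative only; the Segment pipeline performs these steps internally.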