

Solon Aguiar, Brian Lai on January 20th 2022

With the launch of Twilio Engage, the team needed to update the data delivery system to support the latest platform features. This blog covers the process, from the questions the team asked along the way to the lessons learned in reaching the end result.


Pooya Jaferian, Tasha Alfano on November 18th 2022

With the Segment Edge SDK, built on Cloudflare Workers, developers can collect high-quality first-party data and access real-time user profiles and state to deliver personalized app experiences without managing a ton of infrastructure.

Segment on November 4th 2022

The terms “data lake” and “data warehouse” are often used interchangeably, but they don’t actually refer to the same thing.

Data lakes and data warehouses are built for different types of data and are intended to be used for different purposes. Understanding their differences and specific use cases will guide you when choosing a big data storage system for your business.

What is a data lake?

A data lake is a cloud infrastructure that rapidly collects and stores massive amounts of any and all types of data with their original attributes. Just like a lake collects water from several sources in their different natural states, a data lake holds unstructured and structured data from different sources until your business needs them.

How do data lakes work?

New data enters the data lake through the data ingestion tier. Upon ingestion, the data lake breaks the new data into segments and organizes these segments in metadata catalogs. These catalogs specify the source, date of acquisition, and other attributes of every piece of data.
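To make that concrete, here is a minimal sketch of what one catalog entry might record. The function and field names are hypothetical, not part of any particular data lake product.

```python
import datetime
import json

# Hypothetical helper: build the metadata a lake's catalog might keep
# about one ingested object so it can be found and trusted later.
def catalog_entry(source, payload, data_format):
    return {
        "source": source,                                       # where the data came from
        "ingested_at": datetime.datetime.utcnow().isoformat(),  # date of acquisition
        "format": data_format,                                  # e.g. "json", "csv", "parquet"
        "size_bytes": len(payload),
    }

raw_event = json.dumps({"user_id": "u123", "action": "page_view"}).encode()
print(catalog_entry("web-app", raw_event, "json"))
```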

The data lake’s architecture follows strict data governance to maintain data quality. Good data governance – like pruning outdated data and assigning roles and permissions for controlled access – keeps the data lake organized, regardless of the amount of data in it. Without these, the data lake will become a data swamp. This means that all the data will be mixed up and disorganized, making it nearly impossible for anyone to find, trust, and use it.
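As a rough illustration, governance rules like these can be expressed as simple checks. This is a toy sketch with made-up retention and role values, not a real access-control system.

```python
import datetime

RETENTION_DAYS = 365                                           # made-up retention policy
READERS = {"data_engineer", "data_analyst", "data_scientist"}  # made-up roles

def is_stale(ingested_at, now):
    # candidates for pruning: objects older than the retention window
    return (now - ingested_at).days > RETENTION_DAYS

def can_read(role):
    # controlled access: only approved roles may query the lake
    return role in READERS

now = datetime.datetime(2022, 11, 4)
print(is_stale(datetime.datetime(2021, 1, 1), now))  # True: prune it
print(can_read("marketer"))                          # False: access denied
```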

What is a data warehouse?

A data warehouse is a central repository of structured data intended for analytical purposes. Structured data has been processed and organized to allow humans and computer programs to access and interpret it seamlessly.

For example, let’s say a company gathered data about its customers. In a data warehouse, this information would be organized according to demographics, like age or geographical location. So anyone could access the data warehouse and look at information for customers based on any of these parameters.

How do data warehouses work?

A data warehouse architecture has three core components:

  1. A relational database system with multiple tables. Each table is like a grid of boxes – each row is a complete entry, and the columns are chunks of similar data like names or addresses.

  2. Online Analytical Processing (OLAP) servers that map and act on multidimensional data processes. In other words, these servers allow you to extract and query data across multiple tables in your database concurrently.

  3. The front-end client interface that displays meaningful insights derived from the data.

When the data warehouse ingests structured data, it stores the data in tables defined by a schema. Think of a schema as a logical description of what each table contains and how it relates to other tables in the data warehouse. Query tools use the schema to determine which data tables to access and analyze.
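The sketch below shows the idea with SQLite standing in for the warehouse’s relational database. The tables, columns, and data are invented for illustration; a real warehouse schema would be far larger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two related tables; the schema records how they connect (customer_id).
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        region      TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'EU')")
conn.execute("INSERT INTO orders VALUES (100, 1, 42.50)")

# A query tool uses the schema to know it can join orders to customers.
for row in conn.execute("""
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
"""):
    print(row)  # ('EU', 42.5)
```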

The major differences between data lakes and data warehouses

Amount of data they’re built to hold

A data lake can hold an unlimited amount of data in multiple formats until you need to analyze it. This is why data lakes are a great option for enterprise businesses that generate lots of data but don’t have an immediate use for it.

A data warehouse can hold hundreds of gigabytes of data, depending on its configuration. It has a limit beyond which adding more data will slow down queries; if you want timely results, you have to limit the amount of data in your data warehouse.

How the data is processed

Data lakes use the ELT (Extract Load Transform) process, while data warehouses use a traditional ETL (Extract Transform Load) process.

ELT means you extract data from its source in its original form and load it into the data lake directly without processing the data first. ELT simplifies the data lake’s architecture, allowing it to ingest all forms of data in real time.

ETL is the more traditional method: the data is transformed into a relational format before it’s loaded into the data warehouse. After extraction, the data goes into a staging area where it’s processed and structured according to predefined transformation rules, and then ingested into the data warehouse for analysis.
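The difference is easiest to see side by side. The sketch below uses plain Python stand-ins for the source, lake, and warehouse; none of these names correspond to a real system.

```python
def extract(source):
    return source  # pretend this pulls raw records from an API or log

def transform(records):
    # shape raw records into a fixed, structured format
    return [{"user": r["u"], "amount": float(r["amt"])} for r in records]

raw = extract([{"u": "u1", "amt": "9.99"}, {"u": "u2", "amt": "4.50"}])

# ETL: transform first, then load; only structured rows reach the warehouse.
warehouse = transform(raw)

# ELT: load first, untouched; transform later, inside the lake, when needed.
lake = list(raw)
report = transform(lake)  # deferred until an analysis actually asks for it

print(warehouse == report)  # True; only the ordering of the steps differs
```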

Cost of data storage

The cost of data storage largely depends on the amount of data in your data warehouse or data lake. On average, expect to spend more on data storage in a data warehouse than in a data lake.

The main reason for this is the data warehouse’s complex architecture, which is expensive to maintain and difficult to scale. Since a data warehouse isn’t built to hold unlimited amounts of data, you have to keep investing in additional storage as your business generates more data.

Number of users who can access and use the data

Accessing and using data in a data lake requires advanced knowledge of how data systems work. This is why, in many organizations, only professionals like data engineers, analysts, and scientists are authorized to access and retrieve data from the data lake.

The data in a data warehouse is primarily meant for business analysts and decision makers. Since the data is already structured, they can easily access and analyze it for business insights without requiring any deep technical knowledge. For example, let’s say your marketing team wants to know how many people from a specific location shop at your store. They can use demographic and behavioral data stored in the data warehouse to segment your customers.

Type of data analysis they’re suited for

Data lakes are generally used for AI and machine learning purposes because AI and machine learning algorithms rely on raw data to process information in new ways. For example, generative AI applies learning algorithms to raw data to create new outputs like images, text, and videos.

A data warehouse is used for historical data analysis. Historical data is data that shows past occurrences, like purchase frequency, and it’s useful for predicting future trends and behaviors. Analyzing this type of data typically involves batch reporting, business intelligence (BI) tools, and visualizations.

Which should my business use: A data lake or a data warehouse?

Whether you choose a data lake or a data warehouse for your business depends on two factors: your business needs and the data structure. For example, enterprise businesses that rely on historical data for business insights need data warehouses instead of data lakes.

When to use a data lake

Use a data lake if:

  1. You’re looking for a cost-effective way to store large volumes of data in multiple formats. 

  2. You need to store internet of things data for real-time analysis. 

  3. Your business relies on raw, unstructured data to generate output – machine learning companies are a prime example.

Let's take a look at some industry verticals that will benefit from having data lakes. 

Medical research 

Data lakes are useful in medical research because they allow researchers to re-query raw data indefinitely. In practice, this means medical researchers can reanalyze data sets in their original form to discover new insights, even when those data sets have been used in previous research.

Streaming services 

Streaming services can use data lakes to improve content recommendations and create better content for their customers. 

As customers stream content, the data lake ingests raw data, like their content selections, and stores it. Later, the streaming service can retrieve that data from the data lake and analyze it to learn what customers stream most often. This insight guides the service on the types of content to acquire or produce to better align with its customers’ preferences.

E-commerce businesses

E-commerce businesses pool large amounts of data to help them understand consumers’ purchasing behaviors and ever-changing market trends. Investing in a data lake means they can store these large amounts of data indefinitely. When the need arises, they can retrieve a subset of the relevant data and analyze it.

When to use a data warehouse

Use a data warehouse if: 

  1. You need to visualize data and extract insights from structured data quickly.

  2. You’re using data for decision making, not just collecting large amounts of data for analysis.

  3. Your original data source is not suitable for querying, and you need to separate your analytical data from your transactional data.

Let's take a look at some industry verticals that will benefit from having data warehouses.

Financial services

Data warehouses help financial institutions like banks simplify and standardize how they store historical data like KYC (Know Your Customer) information.

Instead of storing bits of KYC information in different data silos – like having proof of identification in one silo and home addresses in another – financial institutions can centralize all of this structured data in a data warehouse. This makes it easier for them to track and analyze historical data.

Education

A data warehouse provides a 360-degree view of a school’s data – from students’ demographic information to performance records and administrative information. Having all of this structured data in one place means that educators can easily access and analyze data to inform decision making. Teachers can analyze performance data to identify trends and patterns and come up with ideas to improve students’ grades.

Data storage is just the first step: Segment’s CDP puts your data into action

Data lakes and data warehouses each have their own use cases. But to truly unlock the power of your data, you need a customer data platform like Twilio Segment that will consolidate and organize it so you can segment audiences, analyze customer journeys, and create personalized experiences across all channels.

Michael Saah on November 2nd 2022

This blog looks at basic and advanced HPA configuration, including a case where proper tuning doubled a service’s efficiency and saved over $100,000 in monthly cloud compute costs.

Jordan Kohl on September 28th 2022

In this blog, we share what went right – and where we had to pivot – in our plan to upgrade the API we use for SMS 2FA, a change affecting thousands of people.

Marín Alcaraz, Kevin Wang on September 20th 2022

Both an intern and their mentor reflect on their learnings from a summer software engineering internship, including tips and best practices.

Alan Charles on September 19th 2022

Analytics React Native 2.0 makes it easier than ever to customize your tracking implementation to meet the bespoke requirements of your app.

Ben Link on August 30th 2022

The question of whether it’s best to buy or build your own solution is not new. But in this economic downturn, you must consider the bottom line. How can you estimate the true cost of building, and what’s the financial benefit of buying?

Kelly Kirwan on August 17th 2022

Data comes in all different forms. It can be structured or unstructured, or have variations in its format (e.g. a date being written as month/day/year versus day/month/year). 

As a result, businesses need a process for data integration – which is where ETL and ELT come in. These are two different ways to integrate data from multiple sources and make it ready for analysis. But how do these methods differ, and which approach is best suited for your business? 

To help answer these questions, this article will explain:

  • What is ETL and how does it work?

  • What is ELT and how does it work?

  • ETL vs. ELT: how they’re different

  • ETL vs. ELT: which is best?

  • ETL & ELT tools

  • How Segment integrates with ETL/ELT tools

What is ETL and how does it work?

ETL means “extract, transform, load.” With this approach, you would first extract data from different sources like websites, email systems, spreadsheets, and CRMs. This data would then be copied and temporarily stored in a “staging area,” like a database on a secondary processing server or in the cloud. This is where the transformation takes place, which includes cleaning and validating data, removing/masking personally identifiable information (PII), or even doing calculations. 

At this stage, you can also reformat data to make it compatible with the system it’s being sent to (e.g. your data source uses strings for a certain field but you’re sending it to a system that uses integers). 

Finally, you load the transformed, structured data into your desired system, whether it’s a relational database, data warehouse, or analytics platform that can accept a direct data feed. From there, analysts, developers, and users can access and analyze the data using business intelligence tools like Looker or Tableau.
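Here is a hedged sketch of what a transform step in the staging area might look like. The field names, the mask_pii helper, and the in-memory staging list are all illustrative stand-ins, not features of any particular ETL tool.

```python
import hashlib

def mask_pii(value):
    # replace an identifier with a one-way hash so raw PII never leaves staging
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def transform(record):
    return {
        "user": mask_pii(record["email"]),             # mask PII
        "order_total": int(record["order_total"]),     # string -> integer for the target
        "country": record["country"].strip().upper(),  # clean and validate
    }

staging_area = [{"email": "ada@example.com", "order_total": "42", "country": " us "}]
ready_to_load = [transform(r) for r in staging_area]   # next stop: the warehouse
print(ready_to_load)
```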

What is ELT and how does it work?

ELT means “extract, load, transform.” In this approach, you extract raw, unstructured data from its source and load it into a cloud-based data warehouse or data lake, where it can be queried and re-queried indefinitely.

When you need to use the data in a semi-structured or structured format, you transform it right in the data warehouse or data lake.

Say you’re in e-commerce, and a marketer wants to analyze customer attributes and browsing habits to create new audience segments. You would only transform the data needed (i.e. attributes and browsing behavior) to ensure it’s in a format that’s compatible with their customer segmentation tool. You can then query and transform the same data fields again at another time for a different tool or use case.

Transformation in ELT is much the same as in ETL: you cleanse and validate the data, convert it to match the destination’s data format, and perform calculations. You can also encrypt and anonymize the data to comply with privacy regulations.
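The sketch below illustrates the load-first, transform-later ordering, with SQLite standing in for a cloud warehouse. It assumes a SQLite build with the JSON1 functions (standard in modern Python); the table and event fields are invented for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (payload TEXT)")

events = [
    {"user_id": "u1", "page": "/pricing", "browser": "firefox"},
    {"user_id": "u2", "page": "/docs", "browser": "chrome"},
]

# Load: raw JSON lands in the store untouched, no reshaping up front.
conn.executemany(
    "INSERT INTO raw_events VALUES (?)",
    [(json.dumps(e),) for e in events],
)

# Transform: later, in the store itself, and only the fields this use case needs.
for row in conn.execute("""
    SELECT json_extract(payload, '$.user_id'),
           json_extract(payload, '$.page')
    FROM raw_events
"""):
    print(row)
```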

ETL vs. ELT: how they’re different

To determine whether ETL or ELT is best for your business, consider your priorities and current capabilities. For example, is access to real-time data critical to your operations? If so, then you need the fast loading speed of ELT. Is the data you collect highly sensitive and subject to stringent, industry-specific privacy laws? If yes, ETL lets you set up tighter privacy guardrails.

The sections below explore the key differences between ETL and ELT.

Transformation process

ETL requires you to know in advance what types of data you need, what you’ll use them for, and what data formats your target systems accept. That’s because the staging area typically doesn’t store the entire raw data set you extracted. If later on you identify a new analysis case, you would need to extract the data from the source again – if it’s still there.

That means planning is paramount before setting up an ETL pipeline. At Segment, we encourage our customers to create a tracking plan that clarifies:

  • The types of data to track

  • Business justifications for tracking this information

  • Tracking methods

  • Data naming conventions

If you’re going the ETL way, check out our guide on how to create a successful data tracking plan. It comes with templates for SaaS, eCommerce, mobile, and video use cases.
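For illustration, one entry of such a tracking plan might be expressed as plain data. The field names below follow the bullets above; they are hypothetical and don’t come from any official Segment schema.

```python
# One hypothetical tracking-plan entry: what to track, why, how, and
# which naming conventions the properties follow.
tracking_plan_entry = {
    "event": "Order Completed",                      # type of data to track
    "justification": "Measure checkout conversion",  # business reason
    "method": "server-side track call",              # tracking method
    "properties": {                                  # naming convention: snake_case
        "order_id": "string",
        "revenue": "float",
    },
}
print(tracking_plan_entry["event"])
```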

ELT gives you more flexibility as it only asks you to transform data when you need to. You’ll still benefit from a tracking plan, especially for transformations you want to automate. If you need to add more data types to your analysis or implement a new analytics tool, you can go back to your data warehouse at any time to get what you need.

Privacy & compliance

ETL lets you comply with privacy rules as it removes personal identifiers from data before making it accessible. Aside from territory-based laws like the GDPR and the California Consumer Privacy Act (CCPA), you may also need to comply with industry-specific regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) and the Family Educational Rights and Privacy Act (FERPA).

If you’re using ELT and you collect personal data generated in the EU, you may need to use a data warehouse or data lake that’s hosted within the territory unless you have applied privacy safeguards approved by the European Commission. Consult a data privacy specialist to make sure your data pipeline complies with relevant laws.

ETL vs. ELT: which is best?

Your purpose for collecting and analyzing data will help you determine whether ETL or ELT is best for your organization.

When to use ETL

Choose ETL if you:

  • Deal with smaller datasets and complex transformations

  • Need to comply with privacy regulations like HIPAA

  • Want non-technical users to analyze data using business intelligence tools

Companies that use a customer data platform (CDP) will find ETL compatible with their process. A CDP uses APIs and native connectors to ingest data from different sources. Once in the CDP, the data is cleaned, transformed, and unified to create a single customer view.

The CDP also pushes data to downstream tools like email marketing, messaging, and A/B testing. That means you don’t need to repeat the ETL process every time you want to send clean data to a different tool.
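As a minimal sketch of that ingestion step, here is a single event sent with Segment’s analytics-python library (the module-level API of the 1.x releases; newer releases import segment.analytics instead). The write key, user ID, and event properties are placeholders.

```python
import analytics  # Segment's analytics-python (1.x module-level API)

analytics.write_key = "YOUR_WRITE_KEY"  # placeholder, not a real key

# One track call: Segment ingests the event once and can fan it out to
# downstream tools, so you don't re-run an ETL job per destination.
analytics.track("user_123", "Order Completed", {
    "order_id": "ord_1",
    "revenue": 42.5,
})

analytics.flush()  # send any queued events before the script exits
```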

When to use ELT

Choose ELT if you:

  • Collect large volumes of unstructured data and need to load them into your data repository quickly

  • Need to re-query datasets multiple times or apply several different transformations

ELT is especially useful for massive streams of data that must be loaded in real time – think stock-market data or information from IoT sensors. If data-intensive technologies like AI, robotics, and IoT are crucial to your business operations, an ELT pipeline will give you the scale, speed, and flexibility you need. With ELT, you can send data to different downstream tools that have different requirements using either a reverse ETL tool or a CDP.

ETL & ELT tools

ETL and ELT software automate the data integration process. They use SQL queries and API connectors to extract data, and they work with the most popular sources as well as with cloud data warehouses like Amazon Redshift, Google BigQuery, and Snowflake.

Before implementing an ETL/ELT tool, check whether it integrates with your existing data systems and infrastructure, such as whether or not it works with your CDP.

Stitch ETL

Stitch is a no-code ETL tool that can extract data from any source using integrations built on Singer, an open-source standard for ETL. It’s best used for sales, marketing, and product intelligence. Pricing starts at $100/month.

Serenytics ETL

Serenytics is a low-code business intelligence platform for business users and software publishers. It lets you configure ETL as an automated process within the platform. Plans for medium-sized companies start at $490/month, and plans for enterprises start at $1,520/month.

Blendo ELT

Blendo is an ELT solution best used for integrating sales, marketing, and financial data. It automatically infers the schema of the data source and prepares the necessary tables in your warehouse, reducing the time you spend preparing and transforming data. Blendo doesn’t disclose pricing on its website.

How Segment integrates with ETL & ELT tools

Segment can serve as the source of customer data for your ETL and ELT tools. By adding Segment as a data source, you don’t need to manage multiple ETL/ELT connectors for your sources of first-party customer data. The data is cleaned and transformed in Segment and enriched through the creation of unified customer profiles and customer segments. You can then analyze customer data alongside other types of business data, such as financial and supply chain information.

Find out how Segment integrates with Stitch, Serenytics, and Blendo.

Ilya Galperin on July 18th 2022

The Endeavor Digital Engineering team used Segment to rapidly onboard additional data sources and expose those sources to the data science team for analysis.
