Data lakes: what they are & why companies use them

Geoffrey Keating on February 16th 2021

A data lake is a key component of a modern data management strategy. Data lakes gather and store raw data in its original form.

Segment Data Lakes helps you unlock the full potential of your data by providing ready-to-use data architecture. Unlike traditional data lake solutions, Segment takes care of designing, building, and maintaining the data lake architecture, so you don’t have to. Segment Data Lakes loads data automatically and reduces the amount of processing required to derive insights, while keeping data storage costs low and saving you valuable engineering hours.

What are data lakes?

Data lakes are central repositories used to store any and all raw data. A data lake has no predefined schema, so it retains all original attributes of the data collected, making it best suited for storing data that doesn’t have an intended use case yet.

James Dixon, founder at Pentaho, who coined the term “data lake,” explains the concept like this: “If you think of a datamart as a store of bottled water — cleansed and packaged and structured for easy consumption — the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

A data lake allows for easy, flexible storage because data doesn’t have to be processed on the way in. It’s important, however, to have good data-quality and data-governance practices in place. Otherwise, you can end up with a data swamp, where it’s hard to access data and get real value out of it.

What are the differences between data lakes and data warehouses?

While a data lake stores unfiltered and unprocessed data in its native format, a traditional data warehouse stores data that has already been filtered and processed. The data in a data warehouse is stripped of every attribute except those needed to run predefined queries against your data sets.

Data warehouses are best suited to structured and semi-structured data. Data lakes, on the other hand, can hold any data type, including unstructured data (think images, audio files, PDFs, etc.), at a low cost.

While a data lake is optimal for storing archival data, a data warehouse aggregates and organizes all stored data to make it easy to analyze. A data warehouse’s organizational schema allows you to efficiently run queries and visualize your data to aid in decision-making.

This makes for quick analysis, but because the data in a data warehouse has already been processed for a specific use case, you can’t get answers to questions that the data hasn’t been prepared for. A data lake is a valuable asset to retain data attributes for questions that may come up in the future.
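To make the trade-off concrete, here is a minimal sketch in Python. The event, its field names, and the warehouse column list are all hypothetical, chosen only for illustration. It shows a raw event kept whole in the lake next to the trimmed row a revenue-focused warehouse table might keep.

```python
import json

# Hypothetical raw event, as it might land in a data lake (field names are illustrative).
raw_event = json.dumps({
    "user_id": "u1",
    "event": "Order Completed",
    "revenue": 42.5,
    "device": "ios",
    "coupon": "SPRING",
})

# Assumed schema of a warehouse table designed only for revenue reporting.
WAREHOUSE_COLUMNS = ("user_id", "event", "revenue")


def to_warehouse_row(raw: str) -> dict:
    """Project a raw event onto the predefined warehouse schema."""
    record = json.loads(raw)
    return {column: record.get(column) for column in WAREHOUSE_COLUMNS}


warehouse_row = to_warehouse_row(raw_event)
lake_record = json.loads(raw_event)

# A later question, "which coupon drove this order?", has an answer only in the lake copy.
print(warehouse_row.get("coupon"))  # None
print(lake_record.get("coupon"))    # SPRING
```

The warehouse row is smaller and faster to query for revenue, but the coupon and device attributes survive only in the raw record.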

Why do companies use data lakes?

Data lakes are able to store a large amount of data at a relatively low cost, making them an ideal solution to house all of your company’s historical data. A data lake offers companies more cost-effective storage options than other systems because of the simplicity and scalability of its function. For companies storing vast amounts—sometimes petabytes—of data, using a data lake results in significant cost savings for data storage.

Because data lakes keep all data in its native form, you can send the data through ETL (extract, transform, load) pipelines later, when you know what queries you want to run, without prematurely stripping away vital information.
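As a rough illustration of running ETL after the fact, the sketch below uses only Python's standard library. The raw lines, field names, and the orders table are assumptions made up for this example, and an in-memory SQLite database stands in for whatever system you would actually load into.

```python
import json
import sqlite3

# Hypothetical raw events kept in the lake as JSON lines; fields vary by event.
RAW_LINES = [
    '{"user_id": "u1", "event": "Order Completed", "revenue": 30.0, "plan": "pro"}',
    '{"user_id": "u2", "event": "Page Viewed", "path": "/pricing"}',
    '{"user_id": "u1", "event": "Order Completed", "revenue": 12.5, "plan": "pro"}',
]


def extract(lines):
    """Extract: parse each raw line without assuming a schema."""
    return [json.loads(line) for line in lines]


def transform(records):
    """Transform: keep only what the revenue query needs."""
    return [
        (record["user_id"], record["revenue"])
        for record in records
        if record.get("event") == "Order Completed"
    ]


def load(rows):
    """Load: write the shaped rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (user_id TEXT, revenue REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


conn = load(transform(extract(RAW_LINES)))
revenue_by_user = conn.execute(
    "SELECT user_id, SUM(revenue) FROM orders GROUP BY user_id"
).fetchall()
print(revenue_by_user)  # [('u1', 42.5)]
```

Because the raw lines are never modified, a different question next quarter (say, about the plan or path fields) only needs a new transform step, not a new round of data collection.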

A data lake gives you a central repository for your data, making data available across the organization. When you store data in individual databases, you create data silos. Data lakes remove those silos and give access to historical data analysis so every department can understand customers more deeply with the same data.

By combining all your data into a data lake, you can power a wide range of functions, including business intelligence, big-data analytics, data archiving, machine learning, and data science.

Why Segment Data Lakes is better than a traditional data lake

Traditional data lakes, like those built on Hadoop, require engineers to build and maintain the data lake and its pipelines, and they can take anywhere from three months to a year to deploy. But the demand for relevant and personalized customer experiences, which require well-governed data, won’t wait. Companies need a data lake solution that can be implemented right now to gain deeper insights into their customers from their historical data.

Segment Data Lakes is a turnkey customer data lake solution built on top of AWS services that provides companies with a data-engineering foundation for data science and advanced analytics use cases. It automatically fills your data lake with all your customer data without additional engineering effort on your part. It’s optimized for speed, performance, and efficiency. Unlike traditional data lakes, with Segment Data Lakes, companies can unlock scaled analytics, machine learning, and AI insights with a well-architected data lake that can be deployed in just minutes.

Additionally, Segment Data Lakes makes data discovery easy. Data scientists and analysts can query the data with engines like Amazon Athena, or load it directly into a Jupyter notebook with no additional setup. Segment Data Lakes also converts raw data from JSON into compressed Apache Parquet for quicker and cheaper queries.
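Parquet is a columnar format, which is a large part of why it suits analytical queries. The sketch below is not Parquet and not Segment's conversion pipeline. It is a plain-Python illustration, with hypothetical events and a hypothetical to_columns helper, of the underlying idea: once records are pivoted into columns, a query that needs one field reads one column.

```python
import json

# Hypothetical row-oriented events, one JSON object per line.
JSON_LINES = [
    '{"user_id": "u1", "event": "Page Viewed", "revenue": null}',
    '{"user_id": "u2", "event": "Order Completed", "revenue": 20.0}',
    '{"user_id": "u3", "event": "Order Completed", "revenue": 5.0}',
]


def to_columns(lines):
    """Pivot row-oriented records into one list per field."""
    columns = {}
    for index, line in enumerate(lines):
        record = json.loads(line)
        for key, value in record.items():
            # Pad fields that first appear late so all columns stay aligned.
            columns.setdefault(key, [None] * index).append(value)
        for key in columns:
            # Fill a gap for any known field this record does not carry.
            if key not in record:
                columns[key].append(None)
    return columns


columns = to_columns(JSON_LINES)

# Summing revenue touches a single column instead of parsing every full record.
total_revenue = sum(value for value in columns["revenue"] if value is not None)
print(total_revenue)  # 25.0
```

Real columnar formats add typed encoding and per-column compression on top of this layout, which this sketch leaves out.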

When Rokfin implemented Segment Data Lakes, the company was able to decrease data storage costs by 60%. Furthermore, Rokfin unlocked richer customer insights by leveraging the complete dataset without extra engineering effort. These richer insights provided content creators at Rokfin with valuable information about factors that led to higher acquisition and retention rates and helped them increase dashboard engagement by 20%.

Segment Data Lakes provides foundational data architecture to enable companies to create cutting-edge customer experiences using raw customer data.

Discover the untapped power of your data lake with a customer data platform

While data lakes are essential for storing archival data, you also need to be able to put that data to use. By pairing your data lake with a customer data platform (CDP), like Segment’s, you can combine your historical data with real-time data to power and optimize your marketing and product teams with actionable customer insights based on a complete customer profile.

Segment’s CDP improves data accessibility across the business. It automatically cleans and standardizes your data before sending it on to third-party systems such as your analytics, marketing, and customer service tools, customer engagement platforms, and more. IT and engineering teams can then use the data for broader insights that inform long-term strategy. At the same time, nontechnical users, such as marketing and product teams, can draw actionable insights and supercharge personalized engagement strategies with historical and real-time data.

With a customer data platform, you can make even more informed decisions with a comprehensive, single customer view. Through identity resolution, Segment’s CDP gathers data points from your data lake and other data sources and merges each customer's history into a single profile. With identity resolution, you can glean actionable insights, power your customer interactions, and create relevant, personalized experiences with data.
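For a feel of what identity resolution involves, here is a simplified sketch. It is not Segment's algorithm. It assumes a simple rule (identifiers that appear on the same event belong to the same person) and uses a small union-find structure to merge them. The events, the identifier fields, and the resolve_profiles helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical events; each carries whichever identifiers were known at the time.
EVENTS = [
    {"anonymous_id": "a1", "event": "Page Viewed"},
    {"anonymous_id": "a1", "email": "kim@example.com", "event": "Signed Up"},
    {"email": "kim@example.com", "user_id": "u42", "event": "Order Completed"},
    {"anonymous_id": "a9", "event": "Page Viewed"},
]
ID_FIELDS = ("anonymous_id", "email", "user_id")


def resolve_profiles(events):
    """Group event names into profiles by linking identifiers seen together."""
    parent = {}

    def find(node):
        # Return the representative identifier for this node's group.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def identifiers(event):
        return [(field, event[field]) for field in ID_FIELDS if field in event]

    # Assumed rule: identifiers on the same event belong to the same person.
    for event in events:
        ids = identifiers(event)
        for other in ids[1:]:
            parent[find(other)] = find(ids[0])

    # Collect each event under the representative of its first identifier.
    profiles = defaultdict(list)
    for event in events:
        ids = identifiers(event)
        if ids:
            profiles[find(ids[0])].append(event["event"])
    return list(profiles.values())


profiles = resolve_profiles(EVENTS)
print(profiles)
# [['Page Viewed', 'Signed Up', 'Order Completed'], ['Page Viewed']]
```

The first three events end up in one profile because the anonymous ID, email, and user ID are chained together across events, while the unrelated visitor stays separate.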

Together, Segment Data Lakes and Segment’s CDP activate all the historical data you have on a customer alongside newly collected data, for accurate insights and meaningful customer interactions.


Segment Data Lakes is available to all Segment business-tier customers as part of their current plan. Get started today by checking out our technical documentation and setup guide.

New to Segment? Sign up for a demo to learn how Segment can help you better understand your customers and engage with them effectively.
