COURSE 3 • Lesson 5

How to choose the right data warehouse

Choosing the right data warehouse for your business is no easy task. Not only are there quite a few options on the market, but each has a slightly different set of features to consider. Continue your journey through Analytics Academy to see which one is right for you!

Choosing the right data warehouse for your business is no easy task. Not only are there quite a few options on the market, but each has a slightly different set of features to consider. And, to top it all off, setting up a data warehouse without a tool like Segment can be extremely difficult and take a lot of engineering hours to get right. So it's not the type of thing you'd want to change your mind about after implementation. (You might get some looks from the engineers on your team.)

What is a cloud data warehouse?

For those just getting up to speed, here's the scoop. You can think of a data warehouse as a home for all of your data. Companies use a data warehouse to aggregate data from a number of different data sources so it's easy to analyze.

A cloud data warehouse, as you may have guessed, is one that lives entirely online. Unlike on-premise data warehouses, cloud-based data warehouses don't require any physical hardware, are much easier to implement and scale, and, are typically less expensive than on-premise data warehouses.

 

Diagram showing the data flow from various sources through a data warehouse

 

When and why should you consider a data warehouse?

With a data warehouse, you have ultimate flexibility for how you store and later query your data. It helps you answer those tough analytical questions that your board may be asking about that aren't possible to do with your standard analytics tool.

For example, Google Analytics can give you a good sense of what actions customers are taking on your website or app. However, you're limited to asking questions that can be answered with the number of variables, properties, and types of charts that it provides.

The reports and analyses you run in data warehouses can include elements from each and every one of the data sources you combine in it. Pretty powerful stuff! You can analyze data from your website and app, as well as other platforms you may use like Salesforce, Zendesk, Stripe, and others. When you have all of your data in one place, you can easily run queries directly in your warehouse or through a BI tool like Tableau, Looker, or Mode (check out the next section to learn more about those tools).

You should consider a data warehouse if you want to do the following:

  • Centrally store all of your business-critical data
  • Analyze your web, mobile, CRM, and other applications together in a single place
  • Dive deeper than traditional analytics tools by querying raw data with SQL
  • Provide multiple people access to the same data set simultaneously

Considerations for choosing a data warehouse

Once you've decided that a data warehouse is necessary for your team's needs, there are a number of important factors to consider when making a selection:

  • Data types: what type of data you want your warehouse to store
  • Scale: the amount of data you plan to store
  • Performance: how quickly you need your data when you query it
  • Maintenance: how much engineering effort you're willing and able to dedicate to your warehouse
  • Cost: how much you are willing to spend on your data warehouse
  • Community: how connected your warehouse is to other critical tools and services

Keep in mind that many of the factors listed will directly influence one another, and tradeoffs may be necessary. For example, opting for less scale may decrease performance but will typically be more cost-effective. Throughout the selection guide, we'll highlight use cases that will help you optimize for each factor.

Data Types

The first step to understand your data warehouse needs is to determine what "type" of data you will store in it. For a warehouse, there are two main "types" of data: structured and unstructured.

  • relational database works well with structured data or data that_ fits nicely into rows and columns. If your data could be organized into one extra large spreadsheet, then a relational data warehouse would be a good fit for your company.
  • non-relational database excels with extremely large amounts of semi-structured data. Classic examples of semi-structured data are emails, books, social media posts, audio/visual data, and geographical data. You should consider a data lake over a classic data warehouse if you are working with purely unstructured data.

 

Table for criteria for different data warehouses

 

Scale

The next consideration is how much data you're accessing and the scale of data that your warehouse needs to support. Relational cloud-data warehouses are all typically able to store massive amounts of data without much overhead cost. You probably won't need more than what they offer, especially if analytics is the primary use case.

However, in cases where extreme scale is needed (greater than 2 terabytes of data), a non-relational warehouse will typically be a better fit because it won't impose restraints on incoming data, allowing you to write faster.

You'll also want to consider how a particular warehouse scales during times of demand. For example, Redshift can support massive amounts of data but will require you to manually add more nodes (for added storage and compute power). Snowflake, on the other hand, offers an auto-scale function which spins up and down clusters dynamically, as needed.

Performance

The next thing to consider is how quickly you'll need your data. This comes down to how fast your queries can run and how you maintain that speed in times of high demand. As you can imagine, performance and scale are closely connected. Performance will increase as you scale up the size of your warehouse or manually add additional nodes (for example, Amazon Redshift).

While real-time analytics is critical for some use cases, most analyses don't require real-time data or immediate insights. When you're answering questions like "what is causing users to churn?" or "how are people moving from our app to our website?" accessing your data with a slight lag is fine. Your data doesn't change that much minute by minute and your ability to follow bigger trends won't be impacted.

Maintenance

The smaller your overall team, the more likely it is that you'll need your engineers focusing on building products rather than worrying about ETL pipelines and day-to-day management of your warehouse. For data warehouses that aren't self-optimizing, you'll need to have someone spend time vacuuming, resizing, and monitoring the cluster to ensure performance remains strong.

However, maintaining a warehouse manually allows you to optimize it precisely for your company's needs. More time spent manually tuning and scaling your data warehouse will mean you have greater control over the performance and cost. To an experienced warehouse admin, "more maintenance" means more flexibility and control.

Cost

Cost will be one of the most important considerations for your data warehouse. It's also one of the most volatile, based on your company's specific use case for a warehouse. Typically, with a data warehouse, you'll be required to pay for some combination of storage, size of the warehouse, run time, or queries.

If you are constantly running queries on the data, you'll want to look for a solution that has a lower compute cost. If you have a lot of data but only have one team using it, you'll want to look for a solution that has low storage costs. A benefit of all cloud-based, relational data warehouses is that storage costs are typically very low and there's not a huge upfront cost to purchase, install, and configure the solution.

 

Cost structure table for different warehouses

 

If you're a startup or a company with limited budget, Postgres could be a good choice for a business. It does much of what the other warehouses listed above do, but you don't have a to pay to use it. Also, when it's time to upgrade your solution, it's fairly straightforward to switch, especially with a solution like Segment. (Sorry, we couldn't help it!).

Community

Lastly, you'll want to look into how easy your warehouse will be to set up. If you're already using a number of tools in your organization (which some of you may be), you'll want to make sure whatever you choose plays nicely with your existing technology stack. Not only will it make implementation easier, but it will also save your team from having to develop multiple custom ETL pipelines to get your data where it needs to be. (You still may need to write a custom ETL to get data into your warehouse).

Vendors to take a look at

Based on the categories listed above, here's how some of the leading data warehouse providers shake out.

 

Table describing differing data types across 5 companies

 

Making the final call

As with many of the technologies listed in this lesson of the Analytics Academy, there are many things to consider before making a final choice. Establishing clear use cases, even with data warehousing will the most useful thing you can do before starting your evaluation.

As mentioned above, questions like how often do you plan on querying the data and how much data will be stored in the warehouse are important to nail down before kicking off your evaluation. Doing so will ensure that your data warehouse has the capabilities you need and isn't more tool than you really need (not to mention, doesn't break the bank).

For more information on the data warehouses that Segment offers, check out our catalog of integrations here.


Side note: Are you a data warehousing tool that's interested in snagging a spot on this list? Get more information on Segment's partner program here.

icon-choosing-the-right-stackChoosing the right stack

Next lesson

Loading

Get every lesson delivered to your inbox

Enter your email below and we’ll send lessons directly to you so you can learn at your own pace.