Redshift Warehouse Destination
This guide explains the process to provision a Redshift cluster and allow the Segment warehouse connector to write to it.
Complete the following steps to provision your Redshift cluster, and connect Segment to it:
- Choose the best instance for your needs
- Provision a new Redshift Cluster
- Create a database user
- Connect Redshift to Segment
Choose the best instance for your needs
While the number of events (database records) are important, the storage capacity usage of your cluster depends primarily on the number of unique tables and columns created in the cluster. Keep in mind that each unique
.track() event creates a new table, and each property sent creates a new column in that table. To avoid storing unnecessary data, start with a detailed tracking plan before you install Segment libraries to ensure that only the necessary events are passed to Segment.
Redshift gives the option of three cluster types:
- Dense Compute: These clusters are designed to maximize query speed and performance at the expense of storage capacity. This is done by using fast CPUs, large amounts of RAM and solid-state storage. While there are no hard and fast rules for sizing a cluster, customers with fewer than 20 million monthly events should start with a single DC1 node cluster and add nodes as needed. A single node cluster includes 200GB, with a max size of 2.56TB.
- Dense Storage These clusters are designed to maximize the amount of storage capacity for customers who have 100s of millions of events and prefer to save money on Redshift hosting costs. This is done by using slower CPUs, less RAM, and disk-based storage. A single DS2 node cluster includes 2TB of space, with a max size of 16TB.
- RA3: These clusters provide managed storage to help optimize your data warehouse splitting the cost of compute and storage.
Provision a new Redshift Cluster
Follow the steps below to create a new Redshift cluster. If you have a cluster already provisioned, skip this step.
From the Redshift dashboard, click Create Cluster.
Name your new cluster, and select the type and size of the nodes within the cluster.
Configure the database connection details and admin user for the cluster.
Now that you’ve provisioned your Redshift cluster, you’ll need to configure your Redshift cluster to allow Segment to access it.
Create a Database User
The username and password you’ve already created for your cluster is your admin password, which you should keep for your own use. For Segment, and any other 3rd-parties, it is best to create distinct users. This allows you to isolate queries from one another using WLM and perform audits easier.
To create a new user, log into the Redshift cluster directly (using the credentials you defined in Step 3 above) and run the following SQL commands:
-- create a user named "segment" that Segment will use when connecting to your Redshift cluster. CREATE USER segment PASSWORD '<enter password here>'; -- allows the "segment" user to create new schemas on the specified database. (this is the name you chose when provisioning your cluster) GRANT CREATE ON DATABASE "<enter database name here>" TO "segment";
When you configure your warehouse in Segment, use the username/password you’ve created here instead of your admin account.
Connect Redshift to Segment
After creating a Redshift warehouse, the next step is to connect Segment:
- In the Segment App, navigate to the Connections tab and click Add Destination
- Search for and select
- Add the necessary connection details, add your Redshift credentials
- Select which sources and collections/properties will sync to this Warehouse
VPCs keep servers inaccessible to traffic from the internet. With VPC, you’re able to designate specific web servers access to your servers. In this case, you will be whitelisting the Segment IPs to write to your data warehouse.
Always require SSL/TLS and make sure your data warehouse accepts only secure connections. Segment only connects to your data warehouse using SSL/TLS.
Redshift clusters are created in a VPC subnet. To configure:
Navigate to the Redshift cluster you created previously. Click Edit.
Expand the Network and security section and click Open tab to access the Network and security settings.
Click the VPC security group to access its settings. The Security group opens in a new tab.
Click the Security group in the list to access its settings.
On the Inbound tab, add or edit a rule to enable Segment to write to your Redshift port from
On the Outbound tab, ensure Redshift can make outbound requests to the Segment S3 bucket. The default behavior is to allow all outbound traffic, but security groups can limit outbound behavior.
Navigate back to the cluster’s settings, and click Edit publicly accessible to allow access to the cluster from outside of the VPC.
Electing to encrypt data
You can encrypt data in the Redshift console. Encryption does not affect Segment’s ability to read or write.
id column is the common distribution key used across all tables. When you execute a query, the Redshift query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in selecting a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.
Redshift limits the use of reserved words in schema, table, and column names. Additionally, avoid naming traits or properties that conflict with top level Segment fields (for example,
messageId). These traits and properties that conflict with Redshift or Segment fields are
_-prefixed when Segment creates columns for them in your schema.
Redshift limits the use of integers at the start of a schema or table name. Segment prepends an underscore
_ to any schema, table or column name that starts with an integer. A source named
3doctors is loaded into a Redshift schema named
In a usual workload Redshift around 20-40% of CPU. Segment takes advantage of the COPY command to make full use of your cluster to load your data as efficiently as possible.
How do I improve Query Speed?
The speed of your queries depends on the capabilities of the hardware you have chosen as well as the size of the dataset. The amount of data use in the cluster will also impact query speed. For Redshift clusters, if you’re above 75% capacity, you will likely experience degradation in query speed. Here’s a guide on how to improve your query speeds.
How do I sync data in and out between Redshift and Segment?
It’s often the case that customers want to combine 1st-party transactional and operational data with Segment data to generate a full view of the customer. The challenge is that those data sets are often stored in separate data warehouses.
If you’re interested in importing data into a Redshift cluster, it’s important that you follow these guidelines.
Additionally, there a number of tools which provide syncing services between databases (mySQL, SQL Server, Oracle, PostgreSQL). Here is a list of some we’ve seen used by customers.
You can also unload data to a s3 bucket and then load the data into another Redshift instance manually.
Can I customize my sync schedule?
Your data will be available in Warehouses between 24 and 48 hours from your first sync. Your warehouse then syncs once or twice a day depending on your Segment Plan.
Segment allows Business Tier (BT) customers to schedule the time and frequency of warehouse data syncs.
If you are on a BT plan, you can schedule warehouse syncs by going to Warehouse > Settings > Sync Schedule in the Segment web app. You can schedule up to the number of syncs allowed on your billing plan.
Can I use an SSH tunnel to connect to my Redshift instance?
Segment does not currently support SSH tunneling to Redshift. You can usually allow Segment’s ETL to write to Redshift without leaving the cluster available to other connections by using IP level restrictions.
This page was last modified: 22 Oct 2021
Questions? Problems? Need more info? Contact Segment Support for assistance!