Redshift Destination

Getting Started

This guide will explain how to provision a Redshift cluster and allow the Segment warehouse connector to write to it.

The Segment warehouse connector runs a periodic ETL (Extract, Transform, Load) process to pull raw event data stored in S3 and programmatically insert the structured event data into your Redshift cluster.

To do so, our connector first establishes an encrypted connection with the cluster over TLS. The connector then pulls the unstructured raw events from the S3 bucket associated with the Segment source. The connector processes those raw events into a structured format, at which point a COPY command is executed to transfer the data from our bucket to your Redshift cluster.
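
As a rough sketch, that final load step looks something like the following COPY statement (the schema, table, bucket path, and IAM role below are hypothetical placeholders, not the actual values Segment uses):

-- load staged event files from S3 into a destination table
COPY my_schema.order_completed
FROM 's3://example-segment-staging/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftLoadRole'
JSON 'auto'
REGION 'us-west-2';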

Pick the best instance for your needs

While the number of events (database records) is important, the storage capacity utilization of your cluster depends primarily on the number of unique tables and columns created in the cluster. Keep in mind that each unique .track() event creates a new table, and each property sent creates a new column in that table. For this reason, we highly recommend starting with a detailed tracking plan before implementing Segment libraries to ensure that only necessary events are being passed to Segment in a consistent way.
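
For example, a hypothetical analytics.track('Order Completed', {revenue: 99.9, currency: 'USD'}) call would produce an order_completed table with revenue and currency columns, which you could query directly (the source schema name here is illustrative):

-- each .track() event name becomes a table; each property becomes a column
SELECT revenue, currency
FROM my_source.order_completed
LIMIT 10;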

There are two kinds of Redshift clusters: Dense Compute and Dense Storage.

Dense Compute clusters are designed to maximize query speed and performance at the expense of storage capacity. This is done by using fast CPUs, large amounts of RAM and solid-state storage. While there are no hard and fast rules for sizing a cluster, we recommend that customers with fewer than 20 million monthly events start with a single DC1 node cluster and add nodes as needed. A single node cluster includes 200GB, with a max size of 2.56TB.

Dense Storage clusters are designed to maximize the amount of storage capacity for customers who have 100s of millions of events and prefer to save money on Redshift hosting costs. This is done by using slower CPUs, less RAM, and disk-based storage. A single DS2 node cluster includes 2TB of space, with a max size of 16TB.

Provision a new Redshift Cluster

You can skip this step if you already have a Redshift cluster.

  1. Open the Redshift Console

  2. Click on “Launch Cluster”

  3. Fill out the cluster details (make sure to select a secure password!)

  4. Choose your cluster size

  5. Set up your cluster Security Group or VPC and proceed to review (see below for instructions on setting up a VPC group)

Permissioning Segment to Redshift

Now that you’ve provisioned your Redshift cluster, you’ll need to configure it so that Segment can access it.

Database User

The username and password you’ve already created for your cluster are your admin credentials, which you should keep for your own use. For Segment, and any other third parties, it is best to create distinct users. This allows you to isolate queries from one another using WLM and makes audits easier.

To create a new user, you’ll need to log into the Redshift database directly and run the following SQL commands:

-- create a user named "segment" that Segment will use when connecting to your Redshift cluster
-- (note: in Redshift, the password must be a single-quoted string literal)
CREATE USER segment PASSWORD '<enter password here>';

-- allow the "segment" user to create new schemas on the specified database
-- (this is the name you chose when provisioning your cluster)
GRANT CREATE ON DATABASE "<enter database name here>" TO segment;

When setting up your warehouse in Segment, use the username/password you’ve created here instead of your admin account.

Networking

Redshift clusters can be in either an EC2-Classic subnet or a VPC subnet.

If your cluster has a field called Cluster Security Groups, proceed to EC2-Classic.

If your cluster has a field called VPC Security Groups, proceed to EC2-VPC.

EC2-Classic

  1. Navigate to your Redshift Cluster settings: Redshift Dashboard > Clusters > Select Your Cluster

  2. Click on the Cluster Security Groups

  3. Open the Cluster Security Group

  4. Click on “Add Connection Type”

  5. Choose Connection Type CIDR/IP and authorize Segment to write into your Redshift Port using 52.25.130.38/32

EC2-VPC

  1. Navigate to your Redshift Dashboard > Clusters > Select Your Cluster

  2. Click on the VPC Security Groups

  3. Select the “Inbound” tab and then “Edit”

  4. Allow Segment to write into your Redshift Port using 52.25.130.38/32

    You can find more information on that here.

  5. Navigate back to your Redshift Cluster Settings: Redshift Dashboard > Clusters > Select Your Cluster

  6. Select the “Cluster” button and then “Modify”

  7. Make sure the “Publicly Accessible” option is set to “Yes”

Electing to encrypt your data

You can elect to encrypt your data in your Redshift console and it will not affect Segment’s ability to read or write.

Syncing data in and out of Segment Warehouse

It’s often the case that our customers want to combine 1st-party transactional and operational data with their Segment data to generate a 360-degree view of the customer. The challenge is that those data sets are often stored in separate data warehouses.

If you’re interested in importing data into a Redshift cluster, it’s important that you follow these guidelines.

Additionally, there are a number of tools which provide syncing services between databases (MySQL, SQL Server, Oracle, PostgreSQL). Here is a list of some we’ve seen used by customers.

You can also unload data to an S3 bucket and then load the data into another Redshift instance manually.
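
As a sketch, that manual move is an UNLOAD on the source cluster followed by a COPY on the destination cluster (the bucket, IAM roles, and table names below are placeholders):

-- on the source cluster: unload a table to S3 as delimited, compressed files
UNLOAD ('SELECT * FROM my_schema.my_table')
TO 's3://example-bucket/my_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleUnloadRole'
DELIMITER '|' GZIP;

-- on the destination cluster: load those files into a matching table
COPY my_schema.my_table
FROM 's3://example-bucket/my_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleCopyRole'
DELIMITER '|' GZIP;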

Sync schedule

Your data will be available in Warehouses within 24-48 hours after your first sync. Your warehouse will then be on a sync schedule based on your Warehouse Plan.

Segment allows you to schedule the time and frequency of loading data into your data warehouse.

You can schedule your warehouse syncs by going to Warehouse > Settings > Sync Schedule. You can schedule up to the number of syncs allowed on your billing plan.

Distribution Key

The id column is the common distribution key used across all tables. When you execute a query, the Redshift query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in selecting a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.
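
To check how your tables are currently distributed, you can query Redshift’s SVV_TABLE_INFO system view:

-- show each table's distribution style (e.g. KEY(id), EVEN, ALL)
SELECT "schema", "table", diststyle
FROM svv_table_info
ORDER BY "schema", "table";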

Reserved Words

Redshift limits the use of reserved words in schema, table, and column names. Additionally, you should avoid naming traits or properties that conflict with top level Segment fields (e.g. userId, receivedAt, messageId, etc.). These traits and properties that conflict with Redshift or Segment fields will be _-prefixed when we create columns for them in your schema, but keeping track of which is which (Segment-reserved vs. custom property columns) can be tricky!

Redshift limits the use of integers at the start of a schema or table name. We will automatically prepend a _ to any schema, table or column name that starts with an integer. So a source named ‘3doctors’ will be loaded into a Redshift schema named _3doctors.
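
If you do need to query a column whose name collides with a reserved word, wrap the identifier in double quotes. For example, timestamp is reserved in Redshift (the schema and table below are illustrative):

-- double-quoted identifiers let you reference reserved words safely
SELECT "timestamp", anonymous_id
FROM my_source.pages
LIMIT 10;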

Query Speed

The speed of your queries depends on the capabilities of the hardware you have chosen as well as the size of the dataset. The amount of disk space used in the cluster also impacts query speed: if your Redshift cluster is above 75% disk utilization, you will likely experience degradation in query speed. Here’s a guide on how to improve your query speeds.
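
You can check your cluster’s current disk utilization with a query against the STV_PARTITIONS system table:

-- approximate percentage of disk capacity currently in use
SELECT SUM(used)::numeric / SUM(capacity) * 100 AS pct_disk_used
FROM stv_partitions;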

Security

VPCs keep servers inaccessible to traffic from the public internet. With a VPC, you can grant specific sources access to your servers; in this case, you will be whitelisting the Segment IP address so it can write to your data warehouse.

CPU

In a typical workload, we have seen Redshift use around 20-40% of CPU; we take advantage of the COPY command to make full use of your cluster and load your data as quickly as we can.
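
If you want to see what is running on your cluster while a load is in flight (including COPY statements), you can query the STV_RECENTS system table:

-- show queries currently running on the cluster
SELECT user_name, starttime, query
FROM stv_recents
WHERE status = 'Running';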


If you have any questions or see anywhere we can improve our documentation, please let us know or kick off a conversation in the Segment Community!