BigQuery Destination

Overview

Segment’s BigQuery connector makes it easy to load web, mobile, and third-party source data like Salesforce, Zendesk, and Google AdWords into a BigQuery data warehouse. This guide will explain how to set up BigQuery and start loading data into it.

The Segment warehouse connector runs a periodic ETL (Extract - Transform - Load) process to pull raw events and objects and programmatically insert the structured data into your BigQuery project.

To do so, our connector first establishes an encrypted connection to BigQuery over TLS. The connector then pulls the unstructured raw events from the S3 bucket associated with the Segment source, processes those raw events into a structured format, and finally runs a load job to transfer the data from our bucket into your BigQuery project.

Using BigQuery through Segment means you’ll get a fully managed data pipeline loaded into one of the most powerful and cost-effective data warehouses today.

Setup

First, you’ll want to set up your BigQuery project. Once the project is configured with the proper permissions, you’ll add its project ID to Segment and we’ll begin the first sync.

  1. Navigate to the Google Developers Console
  2. Configure Cloud Platform:
     1. Go to IAM and select your project
     2. Add connector@segment-1119.iam.gserviceaccount.com as a BigQuery User
  3. In Segment, go to Workspace > Warehouses > Add Warehouse
  4. Select BigQuery
  5. Add your Project ID from the BigQuery Console
  6. Click Connect. Your data will begin loading!
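
Once the first sync completes, a quick way to confirm data is arriving is to run a simple count against one of your new tables from the BigQuery console. This is a minimal sketch with placeholder names; substitute your own project, source, and collection:

select count(*) as rows_synced
from <project-id>.<source-name>.<collection-name>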

Schema

BigQuery datasets are broken down into tables and views.

Partitioned Tables

The Segment connector takes advantage of partitioned tables. Partitioned tables allow you to query a subset of data, thus increasing query performance and decreasing costs.

To query a full table, you can query like this…

select *
from <project-id>.<source-name>.<collection-name>

To query a specific partitioned table, you can query like this…

select *
from <project-id>.<source-name>.<collection-name>$20160809
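
If you are using the standard SQL dialect, the table decorator above isn’t available inside a query. A hedged alternative, assuming the table is day-partitioned on ingestion time (which the decorator implies), is to filter on the _PARTITIONTIME pseudo-column:

select *
from <project-id>.<source-name>.<collection-name>
where _PARTITIONTIME = timestamp('2016-08-09')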

Views

A view is a virtual table defined by a SQL query. We use views in our de-duplication process to ensure that the events you query are unique, and that you see the latest version of objects from third-party data. All our views are set up to show information from the last 60 days. Whenever possible, we recommend that you query from these views.

View names are appended with _view, which you can query like this…

select *
from <project-id>.<source-name>.<collection-name>_view

Best Practices

There are a few best practices you should keep in mind as you query BigQuery.

  1. BigQuery charges based on the amount of data scanned by your queries. Views are derived from your tables and are what we use to de-duplicate events, so we recommend you query a specific view whenever possible to avoid duplicate events and historical objects. It’s important to note that BigQuery views are not cached.

BigQuery’s views are logical views, not materialized views, which means that the query that defines the view is re-executed every time the view is queried. Queries are billed according to the total amount of data in all table fields referenced directly or indirectly by the top-level query.

  2. To save money, you can query the view and set a destination table, and then query the destination table.
  3. If you typically start exploratory data analysis with SELECT *, consider specifying the fields you need to reduce costs (see the sketch after this list).
  4. You can connect to BigQuery using a BI tool like Mode or Looker, or query directly from the BigQuery console.
  5. BigQuery now supports standard SQL, which you can enable via their query UI. This does not work with views, or with a query that utilizes table range functions.
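
As an illustration of these tips, the sketch below queries a view but names only the fields it needs instead of SELECT *; the column names (anonymous_id, context_page_path, timestamp) are placeholders for whatever columns exist in your collection. To avoid re-running the view’s de-duplication on every exploration, you can set a destination table for this query in your query tool’s options and point follow-up queries at that table instead.

select
  anonymous_id,
  context_page_path,
  timestamp
from <project-id>.<source-name>.<collection-name>_view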

Pricing

BigQuery offers a scalable, pay-as-you-go pricing plan based on the amount of data scanned, as well as a flat-rate monthly option. You can learn more about BigQuery pricing here.

BigQuery allows you to set up Cost Controls and Alerts to help control and monitor costs. If you want to learn more about what BigQuery will cost you, they’ve provided this calculator to estimate your costs.

Streaming Inserts

Segment’s connector does not support streaming inserts at this time. If you need to stream data into BigQuery, please contact us.

Sync schedule

Your data will be available in Warehouses within 24-48 hours after your first sync. Your warehouse will then be on a sync schedule based on your Warehouse Plan.

Segment allows you to schedule the time and frequency of loading data into your data warehouse.

You can schedule your warehouse syncs by going to Warehouse > Settings > Sync Schedule. You can schedule up to the number of syncs allowed on your billing plan.

(Image: sync schedule settings)


If you have any questions or see anywhere we can improve our documentation, please let us know or kick off a conversation in the Segment Community!