Segment’s BigQuery connector makes it easy to load web, mobile, and third-party source data like Salesforce, Zendesk, and Google AdWords into a BigQuery data warehouse. This guide will explain how to set up BigQuery and start loading data into it.
The Segment warehouse connector runs a periodic ETL (Extract - Transform - Load) process to pull raw events and objects and load them into your BigQuery cluster.
Using BigQuery through Segment means you’ll get a fully managed data pipeline loaded into one of the most powerful and cost-effective data warehouses today.
To store your Segment data in BigQuery, complete the following steps:
- Enable BigQuery for your Google Cloud project
- Create a GCP service account for Segment to assume
- Create a warehouse in the Segment app
Create a Project and Enable BigQuery
- Navigate to the Google Developers Console
- Configure Cloud Platform:
- If you don’t have a project already, create one.
- If you have an existing project, you will need to enable the BigQuery API. Once you’ve done so, you should see BigQuery in the “Resources” section of Cloud Platform.
- Note: make sure billing is enabled on your project, or Segment will not be able to write into the cluster.
- Copy the project ID. You will need it when you create a warehouse source in the Segment app.
Create a Service Account for Segment
Refer to Google Cloud’s documentation about service accounts for more information.
- From the Navigation panel on the left, go to IAM & admin > Service accounts
- Click Create Service Account along the top
- Enter a name for the service account (for example: “segment-warehouses”) and click Create
- When assigning permissions, make sure to grant the following roles:
BigQuery Data Owner
BigQuery Job User
- Create a JSON key. The downloaded file will be used to create your warehouse in the next section.
Create the Warehouse in Segment
- In Segment, go to Workspace > Add destination > Search for “BigQuery”
- Select BigQuery
- Add a name for the destination to the Name your destination field
- Enter your project ID in the Project field
- Copy the contents of the credentials (the JSON key) into the Credentials field
Optional: Enter a region code in the Location field (the default will be “US”)
- Click Connect
- If Segment can connect with the provided Project ID and Credentials, a warehouse will be created and your first sync should begin shortly
BigQuery datasets are broken down into tables and views. Tables contain duplicate data, views do not.
The Segment connector takes advantage of partitioned tables. Partitioned tables allow you to query a subset of data, thus increasing query performance and decreasing costs.
To query a full table, you can query like this:
select * from <project-id>.<source-name>.<collection-name>
To query a specific partitioned table, you can query like this:
select * from <project-id>.<source-name>.<collection-name>$20160809
A view is a virtual table defined by a SQL query. Segment uses views in the de-duplication process to ensure that events that you are querying unique events, and the latest objects from third-party data. All Segment views are set up to show information from the last 60 days. Whenever possible, query from these views.
Views are appended with
_view , which you can query like this:
select * from <project-id>.<source-name>.<collection-name>_view
For early customers using BigQuery with Segment, rather than providing Segment
with credentials, access was granted to a shared Service Account
firstname.lastname@example.org). While convenient for early
adopters, this presented potential security risks that Segment would prefer to address
As of March 2019, Segment requires BigQuery customers to create their own Service Accounts and provide the app with those credentials instead. In addition, any attempts to update warehouse connection settings will also require these credentials. This effectively deprecates the shared Service Account.
To stay ahead of this change, migrate your warehouse by following the instructions in the “Create a Service Account for Segment” section above. Then, head to your warehouse’s connection settings and update with the Credentials you created.
BigQuery charges based on the amount of data scanned by your queries. Views are a derived view over your tables that Segment uses for de-duplication of events. Therefore, Segment recommends you query a specific view whenever possible to avoid duplicate events and historical objects. It’s important to note that BigQuery views are not cached.
BigQuery’s views are logical views, not materialized views, which means that the query that defines the view is re-executed every time the view is queried. Queries are billed according to the total amount of data in all table fields referenced directly or indirectly by the top-level query.
To save more money, you can query the view and set a destination table, and then query the destination table.
If you typically start exploratory data analysis with
SELECT * consider
specifying the fields to reduce costs.
See the section on partitioned tables for details on querying sub-sets of tables.
I need more than 60 days of data in my views. Can I change the view definition?
Absolutely! You will just need to modify one of the references to 60 in the view definition to the number of days of your choosing.
Segment chose 60 days as it suits the needs of most customers. However, you’re welcome to update the definition of the view as long as the name stays the same.
Here is the base query Segment uses when first setting up your views. Included in the base query are the placeholders (
%s.%s.%s) that you would want to include the project,
dataset and table (in that order).
SELECT * EXCEPT (ROW_NUMBER) FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY loaded_at DESC) ROW_NUMBER FROM ` + "`%s.%s.%s`" + ` WHERE _PARTITIONTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_MICROS(UNIX_MICROS(CURRENT_TIMESTAMP()) - 60 * 60 * 60 * 24 * 1000000), DAY, 'UTC') AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'UTC') ) WHERE ROW_NUMBER = 1
How does BigQuery pricing work?
BigQuery offers both a scalable, pay-as-you-go pricing plan based on the amount of data scanned, or a flat-rate monthly cost. You can learn more about BigQuery pricing here.
BigQuery allows you to set up Cost Controls and Alerts to help control and monitor costs. If you want to learn more about what BigQuery will cost you, they’ve provided this calculator to estimate your costs.
How do I query my data in BigQuery?
You can connect to BigQuery using a BI tool like Mode or Looker, or query directly from the BigQuery console.
BigQuery now supports standard SQL, which you can enable using their query UI. This does not work with views, or with a query that uses table range functions.
Does Segment support streaming inserts?
Segment’s connector does not support streaming inserts at this time. If you have a need for streaming data into BigQuery, contact Segment support.
Can I customize my sync schedule?
Your data will be available in Warehouses between 24 and 48 hours from your first sync. Your warehouse then syncs once or twice a day depending on your Segment Plan.
Segment allows Business Tier (BT) customers to schedule the time and frequency of warehouse data syncs.
If you are on a BT plan, you can schedule warehouse syncs by going to Warehouse > Settings > Sync Schedule in the Segment web app. You can schedule up to the number of syncs allowed on your billing plan.
I’m seeing duplicates in my tables.
This behavior is expected. Segment only de-duplicates data in your views. See the section on views for more details.
This page was last modified: 06 Dec 2021
Questions? Problems? Need more info? Contact Segment Support for assistance!