Segment’s BigQuery connector makes it easy to load web, mobile, and third-party source data like Salesforce, Zendesk, and Google AdWords into a BigQuery data warehouse. When you integrate BigQuery with Segment you get a fully managed data pipeline loaded into a powerful and cost-effective data warehouse.
The Segment warehouse connector runs a periodic ETL (Extract - Transform - Load) process to pull raw events and objects from your sources and load them into your BigQuery cluster. For more information about the ETL process, including how it works and common ETL use cases, refer to Google Cloud’s ETL documentation.
To store your Segment data in BigQuery, complete the following steps:
- Create a project and enable BigQuery
- Create a service account for Segment
- Create the warehouse in Segment
Create a Project and Enable BigQuery
To create a project and enable BigQuery:
- Navigate to the Google Developers Console.
- Configure the Google Cloud Platform:
- Copy the project ID. You’ll need it when you create a warehouse source in the Segment app.
When you create your project, you must enable billing so Segment can write into the cluster.
Create a service account for Segment
To create a service account for Segment:
- From the Navigation panel on the left, select IAM & admin > Service accounts.
- Click Create Service Account.
- Enter a name for the service account (for example,
segment-warehouses) and click Create.
- Assign the service account the following roles:
BigQuery Data Owner
BigQuery Job User
- Create a JSON key. The downloaded file will be used to create your warehouse in the Segment app.
If you have trouble creating a new service account, refer to Google Cloud’s documentation about service accounts for more information.
Create the Warehouse in Segment
To create the warehouse in Segment:
- From the homepage of the Segment app, select Connections > Add Destination and search for BigQuery.
- Click BigQuery.
- Select the source(s) you’d like to sync with the BigQuery destination, and click Next.
- Enter a name for your destination in the Name your destination field.
- Enter your Project ID in the Project ID field.
Optional: Enter a region code in the Location field (the default is US.)
- Copy the contents of the JSON key that you created for the Segment service account into the Credentials field.
- Click Connect.
If Segment is able to connect to your project, a warehouse will be created and your first sync will begin shortly.
The Segment connector uses partitioned tables. Partitioned tables allow you to query a subset of data, which increases query performance and decreases costs.
To query a full table, use the following command:
select * from <project-id>.<source-name>.<collection-name>
To query a specific partitioned table, use the following command:
select * from <project-id>.<source-name>.<collection-name>$20160809
A view is a virtual table defined by a SQL query. Segment uses views in the de-duplication process to ensure that events that you are querying are unique events and contain the latest objects from third-party data. All Segment views are set up to show information from the last 60 days. Segment recommends querying from these views when possible to avoid duplicate events and historical objects.
Views are appended with
_view , which you can query using this format:
select * from <project-id>.<source-name>.<collection-name>_view
For early customers using BigQuery with Segment, rather than providing Segment
with credentials, access was granted to a shared Service Account
email@example.com). While convenient for early
adopters, this presented potential security risks.
Segment now requires BigQuery customers to create their own Service Accounts and provide the app with those credentials instead. In addition, any attempts to update warehouse connection settings will also require these credentials. This effectively deprecates the shared Service Account.
Migrate your warehouse from a shared Service Account to a dedicated Service Account by creating a new Service Account using the Create a Service Account for Segment section. Then, head to your warehouse’s connection settings and update with the credentials you created. Once you’ve verified that data is loading properly to your warehouse, remove access to the shared Service Account.
Remove access to the shared Service Account
You can remove access to the shared Service Account
firstname.lastname@example.org) using the following instructions:
To remove access to the shared Service Account:
- Create a new Service Account for Segment using the linked instructions.
- Verify that the data is loading into your warehouse.
- Sign in to the Google Developers Console.
- Open the IAM & Admin product, and select IAM.
- From the list of projects, select the project that has BigQuery enabled.
- On the project’s page, select the Permissions tab, and then click view by PRINCIPALS.
- Select the checkbox for the
email@example.com and then click Remove to remove access to this shared Service Account.
For more information about managing IAM access, refer to Google’s documentation, Manage access to projects, folders, and organization.
BigQuery charges based on the amount of data scanned by your queries. Views are a derived view over your tables that Segment uses for de-duplication of events. Therefore, Segment recommends you query a specific view whenever possible to avoid duplicate events and historical objects. It’s important to note that BigQuery views aren’t cached.
Understanding BigQuery views
BigQuery’s views are logical views, not materialized views, which means that the query that defines the view is re-executed every time the view is queried. Queries are billed according to the total amount of data in all table fields referenced directly or indirectly by the top-level query.
To save money, you can query the view and set a destination table, and then query the destination table.
If you start exploratory data analysis with
SELECT *, consider
specifying the fields to reduce costs.
Refer to the section on partitioned tables for details on querying sub-sets of tables.
I need more than 60 days of data in my views. Can I change the view definition?
Yes! You just need to modify one of the references to
60 in the view
definition to the number of days of your choosing. You can update the definition of the view as long as the name stays
Here is the base query Segment uses when first setting up your views. Included in the base query are the placeholders (
%s.%s.%s) that you would want to include the project,
dataset and table (in that order).
SELECT * EXCEPT (ROW_NUMBER) FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY loaded_at DESC) ROW_NUMBER FROM ` + "`%s.%s.%s`" + ` WHERE _PARTITIONTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_MICROS(UNIX_MICROS(CURRENT_TIMESTAMP()) - 60 * 60 * 60 * 24 * 1000000), DAY, 'UTC') AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY, 'UTC') ) WHERE ROW_NUMBER = 1
How does BigQuery pricing work?
BigQuery offers both a scalable, pay-as-you-go pricing plan based on the amount of data scanned, or a flat-rate monthly cost. You can learn more about BigQuery pricing on Google Cloud’s BigQuery pricing page.
BigQuery allows you to set up Cost Controls and Alerts to help control and monitor costs. If you want to learn more about the costs associated with BigQuery, Google Cloud provides a calculator to estimate your costs.
How do I query my data in BigQuery?
You can connect a BI tool like Mode or Looker to BigQuery, or query directly from the BigQuery console.
BigQuery supports standard SQL, which you can enable using Google Cloud’s query UI. This doesn’t work with views, or with a query that uses table range functions.
Does Segment support streaming inserts?
Segment’s connector doesn’t support streaming inserts at this time. If you have a need for streaming data into BigQuery, contact Segment support.
Can I customize my sync schedule?
Your data will be available in Warehouses between 24 and 48 hours from your first sync. Your warehouse then syncs once or twice a day depending on your Segment Plan.
Segment allows Business Tier (BT) customers to schedule the time and frequency of warehouse data syncs.
If you are on a BT plan, you can schedule warehouse syncs by going to Warehouse > Settings > Sync Schedule in the Segment web app. You can schedule up to the number of syncs allowed on your billing plan.
I see duplicates in my tables.
This behavior is expected. Segment only de-duplicates data in your views. Refer to the schema section for more details.
This page was last modified: 02 Jun 2022
Questions? Problems? Need more info? Contact Segment Support for assistance!