How Twilio Segment Handled the Facebook Outage of October 2021
When Facebook went down, our customers lost valuable data. Segment's archival tool came to the rescue.
On the morning of Monday, October 4, 2021, we started to hear rumblings around the “office” (read: our Slack channels) asking if Facebook was down for everyone. The chatter grew, which was both evidence of how ubiquitous Facebook is in our lives and a signal for our team to investigate further. We learned that there was a widespread outage, and our Facebook destinations were beginning to fail as a result.
The delivery success rate dropped dramatically for Facebook Conversions API
For context, Segment integrates with the following Facebook products server-side:
Facebook App Events
Facebook Conversions API
Facebook Offline Conversions
These destinations are valuable to our customers because they track key conversion events, such as online, in-store, and mobile purchases. Businesses use conversion data downstream to understand how their Facebook Ad campaigns are performing. Without this information, our customers cannot tie ad engagement to revenue, understand which ads are the most effective, or make informed decisions about how to allocate ad spend moving forward.
How Segment works with Facebook
Our system has a 4-hour retry window for failed event delivery to partner tools. While we hoped the retry window would capture the failed events, we had no idea how long the outage would last, so we started preparing tooling to recover the data in case the outage ran longer than 4 hours (spoiler alert: it lasted 6-7 hours).
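To make the retry-window arithmetic concrete, here is a minimal sketch. It assumes a simplified model in which each event is retried for exactly 4 hours after its first failure; the function names and the example timestamps are hypothetical and are not Segment's actual implementation or the real outage times.

```python
from datetime import datetime, timedelta

RETRY_WINDOW = timedelta(hours=4)  # the retry window described in this post


def exhausts_retries(first_failed_at, outage_end, retry_window=RETRY_WINDOW):
    """True if the event's retry window closes before the partner recovers."""
    return first_failed_at + retry_window < outage_end


def unrecoverable_span(outage_start, outage_end, retry_window=RETRY_WINDOW):
    """Length of the period whose events cannot be saved by retries alone."""
    return max(outage_end - outage_start - retry_window, timedelta(0))


# Illustrative timestamps only: a 6-hour outage starting at 09:00.
outage_start = datetime(2021, 10, 4, 9, 0)
outage_end = outage_start + timedelta(hours=6)

early_event = outage_start + timedelta(hours=1)  # retries end before recovery
late_event = outage_start + timedelta(hours=3)   # retries outlast the outage
```

Under this model, a 6-hour outage leaves roughly the first 2 hours of events beyond the reach of retries, which is why a separate recovery path is needed.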
There was some good news though! With Segment, the data was not technically lost. When events fail to send to a destination, we store those discarded events so that they are available for future use in Segment Replays. Replays enable us to send historical data to our server-side integrations. The benefits of Replays include:
Customers can evaluate new tools with their own data almost immediately.
We (Segment) can recover lost data when other tools go down.
Customers can avoid data lock-in with existing vendors.
When we plan to replay data, we also consider a few things: (1) whether the partner tool accepts timestamps, (2) how old timestamped data can be, and (3) whether the partner tool has deduplication mechanisms.
Does the tool accept historical data?
Not all partner tools can accept timestamped, historical data. Tools that don’t accept timestamped data will treat the replayed data as though it is brand new, causing events to be processed in the wrong chronological order, which could create problems in reporting or in understanding the state of a user.
In this case, Facebook Offline Conversions does not accept backfilled data, whereas Facebook App Events and Facebook Conversions API do. This meant full recovery could only be achieved in those two tools.
Are there any limitations to how far back the historical data can be?
Even if a tool accepts historical data, it may have constraints on how old the data can be. Data sent to Facebook Conversions API can go back 7 days; if the timestamp on an event is greater than 7 days in the past, Facebook will return an error and not process the event. This meant our team needed to act fast to replay data in a timely manner.
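The 7-day constraint can be expressed as a simple check. This is a sketch only: the helper names are hypothetical, and it models just the rule stated above (an event older than 7 days is rejected), not the Conversions API itself.

```python
from datetime import datetime, timedelta, timezone

MAX_EVENT_AGE = timedelta(days=7)  # limit stated above for Facebook Conversions API


def within_backfill_limit(event_ts, now, max_age=MAX_EVENT_AGE):
    """True if the event is still recent enough to be accepted on replay."""
    return now - event_ts <= max_age


def replay_deadline(event_ts, max_age=MAX_EVENT_AGE):
    """Latest moment at which this event can still be replayed."""
    return event_ts + max_age


# Illustrative event timestamp on the day of the outage (UTC, hypothetical).
event_ts = datetime(2021, 10, 4, 16, 0, tzinfo=timezone.utc)
deadline = replay_deadline(event_ts)
```

Computing the deadline per event makes the urgency explicit: every failed event from the outage had to be replayed within a week of its original timestamp.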
How does the tool handle duplicate data?
Duplicates are a concern to our customers because they can lead to misleading results downstream. What happens if we send Joe Shopper’s $250 shopping spree to Facebook twice and it is not deduplicated? It may appear that an ad campaign led to $500 in purchases when in reality it was one $250 purchase.
When a partner tool goes down, we are able to avoid duplicate data with internal tooling that allows us to replay data from discards. Our Discards feature stores all failed events (after retries are exhausted) so we are able to precisely replay only those events that failed. This reduces any risk of duplicates in cases where some events may have still succeeded. With the Facebook outage, we saw the delivery rate drop to zero over the course of 30 seconds; during that time, some events failed and some succeeded. Using our Discards feature, we were able to narrow our replay to only those events that failed during the outage window.
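The difference between replaying only discards and replaying everything in the outage window can be sketched as follows. The record type, field names, and sample data are hypothetical stand-ins for illustration; they do not reflect Segment's internal tooling or schema, and the sketch assumes a partner that performs no deduplication.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class DeliveryRecord:
    """Hypothetical record of one event's delivery outcome."""
    event_id: str
    attempted_at: datetime
    delivered: bool  # False means retries were exhausted and the event was discarded
    value: float


def select_discards(records, window_start, window_end):
    """Keep only events that failed inside the outage window."""
    return [
        r for r in records
        if not r.delivered and window_start <= r.attempted_at <= window_end
    ]


def select_everything(records, window_start, window_end):
    """Naive alternative: replay every event in the window, delivered or not."""
    return [r for r in records if window_start <= r.attempted_at <= window_end]


def revenue_seen_by_partner(records, replayed):
    """Total the partner would report if it performs no deduplication."""
    already_delivered = sum(r.value for r in records if r.delivered)
    return already_delivered + sum(r.value for r in replayed)


# Illustrative data: Joe's $250 purchase got through while delivery was
# degrading; a second $100 purchase did not.
start = datetime(2021, 10, 4, 8, 0)
end = datetime(2021, 10, 4, 10, 0)
records = [
    DeliveryRecord("joe-250", datetime(2021, 10, 4, 8, 0, 10), True, 250.0),
    DeliveryRecord("other-100", datetime(2021, 10, 4, 8, 0, 20), False, 100.0),
]

naive_total = revenue_seen_by_partner(records, select_everything(records, start, end))
precise_total = revenue_seen_by_partner(records, select_discards(records, start, end))
```

In this toy example the true revenue is $350. Replaying the whole window counts Joe's purchase twice and reports $600, while replaying only the discarded event reports the correct $350.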
Our team worked to gather answers to these questions during the outage so that once Facebook was restored, we could focus all our efforts on replaying the data.
Facebook was restored on the afternoon of October 4 (PT), and while many events were successfully retried within our 4-hour retry window, there were approximately two hours in the morning in which data delivery failed beyond the retry window. Our team set out to use the Replay system to recover the lost data.
The delivery success rate to Facebook Conversions API skyrockets as real-time and retried messages are sent.
As of this blog post, setting up each Replay is a manual process that our team must trigger. In practice, that means a lot of engineering time spent scheduling and monitoring Replays. But we knew that proactively replaying data was the right thing to do to make sure our customers did not feel any pain from data loss.
Several engineers on our Integrations team spent October 5-6 replaying the data and as issues surfaced, other teams leaned in to support. For example, at one point we noticed a number of scheduled Replays were failing to execute. The team that owns the Replay system jumped in and made a minor (and quick!) update to improve scheduling reliability, allowing our team to move even faster.
By the time Friday rolled around, all Replays had been successfully completed and data was restored. In total, we:
Ran thousands of Replays
Processed 130 million additional events
More than 130M events were processed as a result of these Replays
System outages are never fun, but they are inevitable and can happen to anyone at any time (us included!). At Segment, we continuously strive to improve our systems, tooling, and alerting so that when an outage does occur, we are prepared to dive in and find a resolution quickly.
A couple of actionable items came out of the Facebook outage:
We improved our monitoring to be notified more quickly about partner outages
We added a roadmap item to automate data recovery in scenarios like this (stay tuned for a follow-up blog post!)
At Segment, we help our customers send their data wherever they need it to go. We can’t always control if a partner tool will consume the data, but we can safely archive data and help our customers recover from outages like this one, making all parties successful.