How Segment Handled the Facebook Outage of October 2021

Kiara Daswani on January 5th 2022

“Is Facebook down for you?”

On the morning of Monday, October 4, 2021, we started to hear rumblings around the “office” (read: our Slack channels) asking if Facebook was down for everyone. The chatter increased, which was both evidence of how ubiquitous Facebook is in our lives and a signal for our team to investigate further. We learned that there was a widespread outage, and our Facebook destinations were beginning to fail as a result.

The delivery success rate dropped dramatically for Facebook Conversions API

For context, Segment integrates with the following Facebook products server-side:

  • Facebook App Events

  • Facebook Conversions API

  • Facebook Offline Conversions

These destinations are valuable to our customers because they track key conversion events, such as online, in-store, and mobile purchases. Businesses use conversion data downstream to understand how their Facebook Ad campaigns are performing. Without this information, our customers cannot tie ad engagement to revenue, understand which ads are the most effective, or make informed decisions on how to allocate ad spend moving forward.

How Segment works with Facebook

Our system has a 4-hour retry window for failed event delivery to partner tools. While we hoped the retry window would capture the failed events, we had no idea how long the outage would last, so we started to prepare tooling to recover the data in case the outage ran longer than 4 hours (spoiler alert: the outage lasted 6-7 hours).
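
To see why a 4-hour retry window cannot cover a longer outage, here is a minimal sketch of the arithmetic. It assumes, for simplicity, that an event is recovered whenever its retry window is still open when the partner comes back. The function name and the clock times are hypothetical; this is not Segment's actual retry implementation, and the times are not the exact outage times.

```python
from datetime import datetime, timedelta

RETRY_WINDOW = timedelta(hours=4)  # retry window stated in this post


def unrecovered_span(outage_start, outage_end, retry_window=RETRY_WINDOW):
    """Return (start, end) of first-failure times that retries cannot save.

    Simplifying assumption: an event that first fails at time t is retried
    until t + retry_window, and succeeds if the outage ends before then.
    Returns None when the retry window covers the entire outage.
    """
    cutoff = outage_end - retry_window
    if cutoff <= outage_start:
        return None
    return outage_start, cutoff


# Illustrative times only: a 6-hour outage starting at 08:30.
start = datetime(2021, 10, 4, 8, 30)
span = unrecovered_span(start, start + timedelta(hours=6))
print(span[1] - span[0])  # 2:00:00, the span that would need a Replay
```

Under these assumptions, a 6-hour outage leaves the first two hours of failures outside the retry window, while any outage of 4 hours or less is fully absorbed by retries.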

Recovering “lost” data

There was some good news though! With Segment, the data was not technically lost. When events fail to send to a destination, we store those discarded events so that they are available for future use in Segment Replays. Replays enable us to send historical data to our server-side integrations. The benefits of Replays include:

  • Customers can evaluate new tools with their own data almost immediately.

  • We (Segment) can recover lost data when other tools go down.

  • Customers can avoid data lock-in with existing vendors.

When we plan to replay data, we also consider a few things: (1) whether the partner tool accepts timestamps, (2) how old timestamped data can be, and (3) whether the partner tool has deduplication mechanisms.

Does the tool accept historical data?

Not all partner tools can accept timestamped, historical data. Tools that don’t accept timestamped data will treat the replayed data as though it is brand new, causing events to be processed in the wrong chronological order, which could create problems in reporting or in understanding the state of a user.

In this case, Facebook Offline Conversions does not accept backfilled data, whereas Facebook App Events and Facebook Conversions API do. We understood that this meant full recovery could only be achieved in those two tools.
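
The decision described above can be written down as a small lookup. This is only an illustrative sketch: the capability values come from this post, while the `replay_plan` function and the wording of its decisions are hypothetical and not part of any Segment tooling.

```python
# Whether each destination accepts backfilled (historical) data, per this post.
ACCEPTS_HISTORICAL = {
    "Facebook App Events": True,
    "Facebook Conversions API": True,
    "Facebook Offline Conversions": False,
}


def replay_plan(capabilities):
    """Map each destination to a (hypothetical) replay decision."""
    plan = {}
    for name, accepts_historical in capabilities.items():
        if accepts_historical:
            plan[name] = "replay with original timestamps"
        else:
            plan[name] = "no full recovery: replayed events would look new"
    return plan


for destination, decision in replay_plan(ACCEPTS_HISTORICAL).items():
    print(f"{destination}: {decision}")
```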

Are there any limitations to how far back the historical data can be?

Even if a tool accepts historical data, it may have constraints on how old the data can be. Data sent to Facebook Conversions API can go back 7 days; if the timestamp on an event is more than 7 days in the past, Facebook will return an error and not process the event. This meant our team needed to act fast to replay the data in time.
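
A 7-day limit turns into a concrete deadline for each event. The sketch below shows one way to check it; the helper names and the sample timestamps are hypothetical, and treating an event that is exactly 7 days old as still acceptable is an assumption rather than documented behavior.

```python
from datetime import datetime, timedelta, timezone

MAX_EVENT_AGE = timedelta(days=7)  # Conversions API limit described in this post


def is_replayable(event_time, now, max_age=MAX_EVENT_AGE):
    """True if the event's timestamp is still within the accepted age limit."""
    return now - event_time <= max_age


def replay_deadline(event_time, max_age=MAX_EVENT_AGE):
    """Latest moment at which an event with this timestamp is still accepted."""
    return event_time + max_age


# Illustrative event timestamp during the outage (UTC).
event_time = datetime(2021, 10, 4, 16, 0, tzinfo=timezone.utc)

print(replay_deadline(event_time))  # 2021-10-11 16:00:00+00:00
print(is_replayable(event_time, datetime(2021, 10, 6, 12, 0, tzinfo=timezone.utc)))  # True
print(is_replayable(event_time, datetime(2021, 10, 12, 0, 0, tzinfo=timezone.utc)))  # False
```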

How does the tool handle duplicate data?

Duplicates are a concern to our customers because they can lead to misleading results downstream. What happens if we send Joe Shopper’s $250 shopping spree to Facebook twice and it is not deduplicated? It may appear that an ad campaign led to $500 in purchases when in reality it was one $250 purchase.
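
The Joe Shopper example can be reproduced in a few lines. This is a minimal sketch of deduplication by event identifier in general; the `id` field and its value are hypothetical, and the code does not describe how Facebook deduplicates events.

```python
def attributed_revenue(events, dedupe):
    """Sum purchase values, optionally counting each event id only once."""
    seen_ids = set()
    total = 0
    for event in events:
        if dedupe and event["id"] in seen_ids:
            continue  # duplicate delivery of an event already counted
        seen_ids.add(event["id"])
        total += event["value"]
    return total


# Joe Shopper's single $250 purchase, delivered twice by mistake.
purchase = {"id": "order-1001", "value": 250}  # hypothetical id
deliveries = [purchase, purchase]

print(attributed_revenue(deliveries, dedupe=False))  # 500
print(attributed_revenue(deliveries, dedupe=True))   # 250
```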

When a partner tool goes down, we are able to avoid duplicate data with internal tooling that allows us to replay data from discards. Our Discards feature stores all failed events (after retries are exhausted) so we are able to precisely replay only those events that failed. This reduces any risk of duplicates in cases where some events may have still succeeded. With the Facebook outage, we saw the delivery rate drop to zero over the course of 30 seconds; during that time, some events failed and some succeeded. Using our Discards feature, we were able to narrow our replay to only those events that failed during the outage window.
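
The difference between replaying an entire time window and replaying only discarded events can be sketched as follows. The record layout (`id`, `attempted_at`, `delivered`) and the sample values are hypothetical and do not reflect Segment's internal Discards schema.

```python
def in_window(event, start, end):
    """True if the delivery attempt happened inside [start, end]."""
    return start <= event["attempted_at"] <= end


def replay_whole_window(events, start, end):
    """Naive approach: resend everything attempted during the window."""
    return [e["id"] for e in events if in_window(e, start, end)]


def replay_discards_only(events, start, end):
    """Resend only events in the window that were recorded as failed."""
    return [e["id"] for e in events
            if in_window(e, start, end) and not e["delivered"]]


# Seconds since the delivery rate started to drop (illustrative values).
events = [
    {"id": "a", "attempted_at": 5, "delivered": True},
    {"id": "b", "attempted_at": 12, "delivered": False},
    {"id": "c", "attempted_at": 20, "delivered": True},
    {"id": "d", "attempted_at": 28, "delivered": False},
]

print(replay_whole_window(events, 0, 30))   # ['a', 'b', 'c', 'd']
print(replay_discards_only(events, 0, 30))  # ['b', 'd']
```

In this toy data, replaying the whole window would resend events `a` and `c`, which had already been delivered, whereas replaying from discards resends only the two failures.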

Our team worked to gather answers to these questions during the outage so that once Facebook was restored, we could focus all our efforts on replaying the data.

Obstacles our team ran into

Facebook was restored on the afternoon of October 4 (PT), and while many events were successfully retried within our 4-hour retry window, there were approximately two hours in the morning during which data delivery failed beyond the retry window. Our team set out to use the Replay system to recover the lost data.

The delivery success rate to Facebook Conversions API skyrockets as real-time and retried messages are sent.

As of this blog post, setting up each Replay is a manual process that our team must trigger. In practice, that means a lot of engineering time spent scheduling and monitoring Replays. But we knew proactively replaying data was the right thing to do to make sure our customers did not feel any pain from data loss.

Several engineers on our Integrations team spent October 5-6 replaying the data, and as issues surfaced, other teams leaned in to support. For example, at one point we noticed that a number of scheduled Replays were failing to execute. The team that owns the Replay system jumped in and made a minor (and quick!) update to improve scheduling reliability, allowing our team to move even faster.

The final results

By the time Friday rolled around, all Replays had been successfully completed and data was restored. In total, we:

  • Ran thousands of Replays

  • Processed 130 million additional events

More than 130M events were processed as a result of these Replays

Takeaways

System outages are never fun, but they are inevitable and can happen to anyone at any time (us included!). At Segment, we continuously strive to improve our systems, tooling, and alerting so that when an outage does occur, we are prepared to dive in and find a resolution quickly.

A couple of actionable items came out of the Facebook outage:

  • We improved our monitoring to be notified more quickly about partner outages

  • We added a roadmap item to automate data recovery in scenarios like this (stay tuned for a follow-up blog post!)

At Segment, we help our customers send their data wherever they need it to go. We can’t always control if a partner tool will consume the data, but we can safely archive data and help our customers recover from outages like this one, making all parties successful.
