Jes Kirkwood on November 15th 2021
All I want for Christmas is a customer data platform that makes it through the holidays.
Not your average wish-list headliner, except among infrastructure teams dealing with peak requests that are at least five times greater than the average load.
An outage during any of the peak shopping season events (Thanksgiving, Black Friday, 11/11 in China, Christmas) can easily cost you your festive mood and your company hundreds of thousands to millions of dollars.
At least 48 prominent brands experienced technical problems on 2020's Black Friday. Luckily, you don't have to put your fate in the hands of Santa or anyone else to make your wish come true. You can turn the holidays—and other peaks—into business as usual with proper and timely preparation.
We've put together five peak-season best practices for your data infrastructure that get you ready for some peaceful time off while the business keeps humming along.
Start preparing by revisiting the data and lessons learned from last year's holidays or another peak event. While historical performance isn't a mirror of the future, it's a helpful starting point for your preparations.
When making your projections, consider industry data like the above chart for the eCommerce sector, which shows and predicts that overall online sales will keep increasing.
These are common characteristics of peak events you should expect to find in your historical data, according to the Google Cloud blog:
Traffic increases of 5 to 20 times (or even greater).
Higher conversion rates and a more considerable burden on back-end systems—like payment processing—than on the front end.
Rapidly increasing traffic in a short period as the event starts.
A trailing decline to normal levels that's much slower than the acceleration to the peak level.
Not every business runs into a peak over the holidays—another good reason to check your historical data before acting.
Looking at Segment's data, for example, on average across all our customers, we saw a decrease of 30% in traffic during Thanksgiving. But for those in online retail, we observed surges of up to 1,000% for several hours during peak time.
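As a rough illustration of turning historical data into a planning range, the sketch below multiplies a normal-day baseline by the 5x to 20x surge range mentioned above. The function name, the optional growth factor, and the baseline of 200 requests per second are hypothetical placeholders, not figures from Segment; substitute your own measurements.

```python
def project_peak_rps(baseline_rps: float, surge_multiplier: float,
                     growth_factor: float = 1.0) -> float:
    """Estimate peak requests per second from a normal-day baseline.

    surge_multiplier: how many times higher peak traffic is than the baseline.
    growth_factor: assumed year-over-year growth (1.0 means no growth).
    """
    if baseline_rps < 0 or surge_multiplier <= 0 or growth_factor <= 0:
        raise ValueError("baseline must be >= 0 and multipliers must be > 0")
    return baseline_rps * surge_multiplier * growth_factor


# Hypothetical baseline of 200 requests per second on a normal day.
baseline = 200
low_estimate = project_peak_rps(baseline, 5)    # lower end of the 5x-20x range
high_estimate = project_peak_rps(baseline, 20)  # upper end of the 5x-20x range
print(f"Plan for roughly {low_estimate:.0f} to {high_estimate:.0f} requests per second")
```

Planning against the whole range rather than a single number makes it easier to discuss reservations with vendors later.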
Once you've collected your historical data, you want to consider what major changes have happened since then. Ask yourself:
What systems have we rolled out or refactored significantly?
What major features or services have we launched?
What new sources of data have we plugged into our data stack?
What types of customers, partners, and other vendors have we added or changed?
Note any such changes, especially if you haven't tested them under significant amounts of load. You'll want to pay special attention to these in the subsequent steps of your preparation.
You'll also want to reach out to colleagues in other departments to get their projections for the upcoming peaks. Forecasts for sales revenue, inventory, and shipping, as well as marketing initiatives, will all give you insights into what kind of volume to plan for.
Most people and businesses make plans for the holiday season. Whether you work directly with consumers (B2C) or other businesses (B2B), reach out and learn their plans so you can further improve yours. Also, talk to partners you rely on—like us, Segment, or cloud providers like AWS or Google—so they can make the necessary preparations, too.
In a B2C business, you want to understand how consumer behavior might change compared to last year. Maybe a new type of device is more popular now or a different social media channel. Such changes can affect loads on different areas of your system or require you to collect new data events. You also want to check which product launches, special offers, and other marketing campaigns look like upcoming holiday hits. Short of coordinating research interviews directly with customers—never a bad idea—heading over to the folks in marketing, growth, or research should get you plenty of insights.
For a B2B-focused company, you want to understand what the businesses you serve are planning for their customers. Your customer should be running through a preparation exercise similar to what we're outlining here. Figure out where you can best support them in this process and the kind of numbers they're expecting, and agree on how you'll cooperate over the holidays.
Some of Segment's B2B customers, all of whom we proactively contact to offer support during peak seasons.
Make sure to talk to partners you rely on once you've collected this information from your customers. Albert Strasheim, VP, Segment core engineering, says the cloud isn't infinite. Your vendors face the same peak events as you do, and their other customers compete with you for the same cloud resources.
Reach out early to:
Check how your partners can support you. They might have documentation on handling peaks, template configurations for their products or services, and dedicated support teams to take on some of the preparation tasks.
Make reservations for the cloud resources you'll need. (More on this later under Create headroom and other buffers, as most but not all cloud resources can scale up automatically.)
Establish how and with whom you'll communicate leading up to and during the peak event.
Don't forget to check the existing service-level agreements (SLAs) you have in place with vendors and whether the agreed response times suffice.
You can only determine whether your team, partners, and infrastructure are peak-season-ready through tests and games that simulate actual events as closely as possible.
Since you can't measure, test, or prepare for the unknown, you first need to establish which metrics reflect system health and whether you're capturing such data for monitoring.
If you had to pick just four such data points, you can't go wrong with Google's Four Golden Signals:
Latency: The time it takes the system to service a request, like loading a web page.
Traffic: The total demand on the system, usually measured in requests per second.
Errors: The rate of failing requests, either as an absolute number or a proportion of all requests.
Saturation: One or more metrics reflecting the utilization of critical system resources, say what percentage of memory the servers are using or the amount of database storage you have available.
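To make the four signals concrete, here is a minimal sketch that derives them from a batch of request records. The record shape, the choice of the 95th percentile for latency, and memory use as the saturation metric are all assumptions for illustration; a real setup would usually read these values from a monitoring system.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_ms: float  # time taken to service the request
    ok: bool            # False if the request failed


def percentile(values, pct: int):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def golden_signals(records, window_seconds, memory_used, memory_total):
    """Summarize latency, traffic, errors, and saturation for one time window."""
    if not records:
        raise ValueError("need at least one request record")
    durations = [r.duration_ms for r in records]
    failures = sum(1 for r in records if not r.ok)
    return {
        "latency_p95_ms": percentile(durations, 95),
        "traffic_rps": len(records) / window_seconds,
        "error_rate": failures / len(records),
        "saturation": memory_used / memory_total,
    }
```

Calling `golden_signals` once per time window gives a small time series per signal that you can chart or alert on.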
You can complement technical health metrics like the ones above by monitoring critical business numbers such as revenue, website traffic, and orders placed. These indicators can also signal potential problems when they show irregularities.
With the numbers to monitor identified, you'll want to see how different aspects of your data infrastructure hold up under a large volume of simulated requests—load testing.
You want to test the entire customer journey—not just individual elements—under such pressure, so you'll see what the customer experience will be like. Make sure to try several load mixes, like mobile versus desktop, various transaction types, or traffic coming from different regions. These test variations can all reveal particular weaknesses in your system.
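One way to plan such load mixes is to split a test's total request budget across traffic types. The sketch below is a hypothetical helper, not part of any load-testing tool; the mix names and percentages are made up for illustration.

```python
def plan_load_mix(total_requests: int, mix_percent: dict[str, int]) -> dict[str, int]:
    """Split a request budget across traffic types by integer percentage.

    Any rounding remainder goes to the largest share, so the counts
    always add up to total_requests.
    """
    if sum(mix_percent.values()) != 100:
        raise ValueError("percentages must add up to 100")
    plan = {name: total_requests * pct // 100 for name, pct in mix_percent.items()}
    remainder = total_requests - sum(plan.values())
    largest = max(mix_percent, key=mix_percent.get)
    plan[largest] += remainder
    return plan


# Hypothetical mixes to run as separate load tests.
mobile_heavy = plan_load_mix(1000, {"mobile": 60, "desktop": 30, "tablet": 10})
desktop_heavy = plan_load_mix(1000, {"mobile": 30, "desktop": 60, "tablet": 10})
```

Running the same total volume with different mixes helps isolate whether a weakness comes from the volume itself or from a particular kind of traffic.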
You'll also want to check that spikes in traffic don't set off automated defensive mechanisms in your system that mistake the holiday surge for an attack or other security breach.
While load tests reveal how a system responds to peak traffic, game days show how your teams respond. You'll want to think through potential failures during the holidays and then stage those situations to see where your operational procedures or knowledge are insufficient. Depending on the stakes and your team's size, you might want to run several of these and even include outside vendors.
You could, for example, simulate that one of your primary methods of payment stops working because of an outage at your payment processor. Such a simulation will reveal which teams need to get involved—did anyone think of the added load on customer support such an outage will cause?—and whether you’ve established effective lines of communication with a third party like your payment processor.
You'll want to create buffers in your system because your earlier plans and test results are estimates—reality can always turn out differently.
At Segment, we look at vital infrastructure components like Kafka, a data event streaming platform, and DynamoDB, a NoSQL database service, to ensure they have headroom—additional space above the peak traffic we expect.
You need to do such checks on all critical pieces of your tech stack. As much as possible, you want to create headroom and other buffers in your system through auto-scaling: the automatic expanding and shrinking of storage, memory, and other resources as traffic goes up or down.
You can configure many modern cloud platforms, like AWS and Google Cloud, for such auto-scaling. Yet these processes might not keep up with rapid, five- to tenfold traffic increases. Under such circumstances, manual scaling might still be the most reliable solution. Discuss your plans and projections with your partners to determine which parts of your system can auto-scale and which elements need reservations and manual operation.
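A headroom check can be as simple as comparing provisioned capacity with the expected peak. The sketch below assumes a 30% minimum buffer, which is an illustrative threshold rather than a Segment recommendation, and the capacity figures are hypothetical.

```python
def headroom_ratio(capacity: float, expected_peak: float) -> float:
    """Spare capacity above the expected peak, as a fraction of that peak."""
    if expected_peak <= 0:
        raise ValueError("expected_peak must be positive")
    return (capacity - expected_peak) / expected_peak


def needs_scaling(capacity: float, expected_peak: float,
                  min_headroom: float = 0.3) -> bool:
    """True when the buffer above the expected peak is smaller than min_headroom."""
    return headroom_ratio(capacity, expected_peak) < min_headroom


# Hypothetical component: 12,000 requests/s provisioned, 10,000 requests/s expected.
print(needs_scaling(12_000, 10_000))  # 20% headroom is below the 30% target
```

Running the same check for every critical component gives you a short list of what to scale manually or reserve ahead of time.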
Amazon Kinesis is an advanced data product that helps customers analyze their video and data streams in real time. (Screenshot from the Kinesis explainer video.)
A typical issue we see with Segment customers is scaling Amazon Kinesis, which provides real-time insights on video and data streams. Configuring auto-scaling for Kinesis is possible but not straightforward, and it's often overlooked during peak season preparations.
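For a sense of the arithmetic involved, the sketch below estimates how many shards a Kinesis data stream needs for writes. It assumes provisioned capacity mode and the commonly documented per-shard write limits of 1,000 records per second and 1 MiB per second; check the current AWS quotas before relying on these numbers. The function name and the traffic figures are hypothetical.

```python
# Assumed per-shard write limits for Kinesis Data Streams in provisioned mode.
RECORDS_PER_SHARD_PER_SECOND = 1_000
BYTES_PER_SHARD_PER_SECOND = 1024 * 1024  # 1 MiB


def shards_for_writes(records_per_second: int, bytes_per_second: int) -> int:
    """Minimum shard count that satisfies both write limits."""
    # -(-a // b) is integer ceiling division.
    by_records = -(-records_per_second // RECORDS_PER_SHARD_PER_SECOND)
    by_bytes = -(-bytes_per_second // BYTES_PER_SHARD_PER_SECOND)
    return max(by_records, by_bytes, 1)


# Hypothetical peak: 4,500 records/s totalling about 3 MB/s.
peak_shards = shards_for_writes(4_500, 3_000_000)
```

Feeding your projected peak traffic, plus headroom, into a calculation like this before the event shows how far the current shard count is from what the peak requires.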
A final, important buffer you want to create is a change freeze across your infrastructure leading up to and during peak events. Changes in one part of a system can trigger unexpected events elsewhere and render all of your preparations useless. At a minimum, such a change freeze should apply to releasing new or updated features. Typically, you want to extend it to the scope of marketing activities and third-party services integrated with or connected to your system.
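A change freeze is easier to enforce when a deployment pipeline can check it automatically. The following is a minimal sketch of such a check; the freeze dates are hypothetical, and how you wire the function into your release tooling is left open.

```python
from datetime import date

# Hypothetical freeze window around Thanksgiving and Black Friday 2021.
FREEZE_START = date(2021, 11, 19)
FREEZE_END = date(2021, 11, 30)


def in_change_freeze(day: date, start: date = FREEZE_START,
                     end: date = FREEZE_END) -> bool:
    """True when the given day falls inside the freeze window (inclusive)."""
    return start <= day <= end


def can_deploy(day: date) -> bool:
    """A release gate could call this before shipping a change."""
    return not in_change_freeze(day)
```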
When you've done all the preparations we've run through, a peak day can unfold much like a regular one. But some folks will have to work or be on call on days they'd rather be home, no matter how much prep work you put in, and they'd better know what to do in case something does go wrong.
You want to establish well in advance the exact responsibilities of each team and which individuals take on which shifts. Do such planning before the festive season starts and spread the burden of working on events such as Thanksgiving, Christmas, and perhaps New Year's evenly across team members.
Make sure to carefully review your standard on-call and incident procedures and see what changes you need to make for peak season by asking questions such as:
Are alerts set up for every critical metric you identified earlier?
To whom does the first alert go? Who is the fallback if that alert goes unanswered or unnoticed? What are the required response times for different types of alerts?
What's the first action someone should take for a specific alert? What happens when the initial procedure, template, or checklist doesn't solve the problem?
What does escalation look like? Under what circumstances and how can someone reach the engineering management team or even executives?
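Writing the answers to these questions down as data makes them easy to review and test. The sketch below models an escalation policy as an ordered list of contacts with acknowledgement timeouts. The roles and timings are placeholders, and real paging tools have their own configuration formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    contact: str
    wait_minutes: int  # time allowed for an acknowledgement before escalating


# Hypothetical policy for a critical alert during peak season.
POLICY = [
    EscalationStep("primary on-call", 5),
    EscalationStep("secondary on-call", 10),
    EscalationStep("engineering manager", 15),
]


def current_contact(policy, minutes_unacknowledged: int) -> str:
    """Return who should be paged after an alert has gone unanswered this long."""
    elapsed = 0
    for step in policy:
        elapsed += step.wait_minutes
        if minutes_unacknowledged < elapsed:
            return step.contact
    # Past the end of the policy: keep paging the last contact.
    return policy[-1].contact
```

For example, `current_contact(POLICY, 7)` returns the secondary on-call, because the primary's five-minute window has passed.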
Don't just think about internal procedures and contacts when answering such questions. Consider which incidents need third-party involvement and include their contact persons and details in your plans. Segment's support team, for example, have coverage plans in place during holidays, and have team members standing by should a significant issue occur.
Most incidents are manageable if they get detected and handled within minutes to an hour or two. Albert Strasheim, VP, Segment core engineering, sees customers run into trouble when there's no automated alerting and it takes more than a few hours to notice that something is wrong.
You can avoid such problems by having virtual or in-office dashboards with your critical metrics and alerts displayed in real time. You might want to create a war room—physical or virtual—for larger operations, where people from all relevant teams work together synchronously and communication is instant.
That's the best part of the holiday season: you can be pretty sure the same events will be on the calendar again next year. All the effort you put in now increases the chances of your wishes coming true this holiday season and the one after.