$0.6 Million/Year Savings by Using S3 for Change Data Capture on a DynamoDB Table

By implementing semantic partitioning on a Kafka topic, together with a service that buffers changelog messages from that topic and flushes them as files to an S3 bucket, we removed our dependence on BigTable as the changelog for tracking DynamoDB item changes. This significantly reduced our costs while still handling high-volume change data in production.

By Akash Kashyap

At Twilio Segment, the objects pipeline processes hundreds of thousands of messages per second and stores the data state in a DynamoDB table. This data is used by the warehouse integrations to keep customer warehouses up to date. The system originally consisted of a Producer service, DynamoDB, and BigTable, and it ran in this configuration for a long time, with DynamoDB and BigTable as the key components powering our batch pipeline.

We recently revamped our platform, migrating off BigTable to offset growing costs, consolidate our infrastructure on AWS, and reduce the number of components.

In this blog post, we will take a closer look at the objects pipeline, specifically focusing on how the changelog sub-system which powers warehouse products has evolved over time. We will also share how we reduced the operational footprint and achieved significant cost savings by migrating to S3 as our changelog data store.

Why did we need a changelog store?

The primary purpose of the changelog in our pipeline was to let downstream systems (in our case, warehouse integrations) query newly created or modified DynamoDB items. Ideally, we could have achieved this with a DynamoDB Global Secondary Index (GSI) that would minimally contain the following (a sketch of such an index follows the list):

  • An ID field which uniquely identifies a DynamoDB Item

  • A TimeStamp field for sorting and filtering
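
For illustration, a keys-only index like that could be created along these lines with the AWS SDK for Go. This is only a sketch: the table, index, and attribute names are hypothetical, and it assumes an on-demand table (a provisioned table would also need ProvisionedThroughput on the index).

```go
// A sketch of the keys-only GSI described above (AWS SDK for Go v1).
// The table, index, and attribute names are hypothetical.
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
	svc := dynamodb.New(session.Must(session.NewSession()))

	_, err := svc.UpdateTable(&dynamodb.UpdateTableInput{
		TableName: aws.String("objects"),
		AttributeDefinitions: []*dynamodb.AttributeDefinition{
			{AttributeName: aws.String("ID"), AttributeType: aws.String("S")},
			{AttributeName: aws.String("Timestamp"), AttributeType: aws.String("N")},
		},
		GlobalSecondaryIndexUpdates: []*dynamodb.GlobalSecondaryIndexUpdate{{
			Create: &dynamodb.CreateGlobalSecondaryIndexAction{
				IndexName: aws.String("changes-by-time"),
				KeySchema: []*dynamodb.KeySchemaElement{
					{AttributeName: aws.String("ID"), KeyType: aws.String("HASH")},
					{AttributeName: aws.String("Timestamp"), KeyType: aws.String("RANGE")},
				},
				// KEYS_ONLY keeps the index as small as possible: only the index
				// keys and the table's primary key are projected.
				Projection: &dynamodb.Projection{ProjectionType: aws.String("KEYS_ONLY")},
				// A provisioned-capacity table would also need ProvisionedThroughput here.
			},
		}},
	})
	if err != nil {
		panic(err)
	}
}
```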

But due to the very large size of our table, creating a GSI was not cost-efficient. Here are some numbers for our table:

  • Average item size: 900 bytes

  • Monthly storage cost: $0.25/GB

  • Total table size today: ~1 petabyte

At the time, the table held 399 billion items; today it holds 958 billion and is still growing.
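
As a rough sanity check of those figures (using decimal units, 1 GB = 1e9 bytes), the arithmetic works out like this. The dollar figure is just the base table's storage line item at the quoted rate, shown for scale:

```go
// Back-of-envelope check of the numbers above (decimal units: 1 GB = 1e9 bytes).
package main

import "fmt"

func main() {
	const (
		items        = 958e9 // total items today
		avgItemBytes = 900.0 // average item size in bytes
		usdPerGBMo   = 0.25  // monthly DynamoDB storage price per GB
	)
	totalGB := items * avgItemBytes / 1e9
	fmt.Printf("table size ≈ %.0f TB (≈ 1 PB)\n", totalGB/1e3)                // ≈ 862 TB
	fmt.Printf("base-table storage ≈ $%.0fK/month\n", totalGB*usdPerGBMo/1e3) // ≈ $216K/month
}
```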

Initial System, V1

In the V1 system, we used BigTable as our changelog because it suited our requirements well: it provided low-latency reads and writes, which made it a good fit for real-time use, and it was designed to scale horizontally, so it could handle large volumes of data and traffic. Like any real-world system, though, it was not without trade-offs.

Some of our pain points included:

  • The responsibility for detecting whether a DynamoDB item had changed fell on the Producer service

  • The responsibility for hydrating BigTable also fell on the Producer service

  • Infrastructure had to be managed across different public clouds

But there was one large advantage:

  • BigTable costs stood at ~$30K/month, which was much less than the alternative: a one-time upfront cost of $86K to create a GSI on the existing 376 TB DynamoDB table, plus ongoing costs of $24K/month with reserved capacity.

V2 System - Initial Re-Design Considered

Over time, BigTable's operating costs increased to $60K/month as the DynamoDB table grew to 848 TB. To alleviate some of the pain points inherited from V1, as well as the growing costs, we leveraged DynamoDB's Change Data Capture (CDC) feature, which simplified the business logic on the Producer service side.
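
Conceptually, a CDC consumer of this kind is a Lambda attached to the table's stream, along the lines of the sketch below. This is a simplified illustration rather than our production code; batching, error handling, and the actual forwarding step are elided.

```go
// A simplified sketch of a V2-style CDC consumer: a Lambda triggered by the
// DynamoDB stream that extracts each change. Not production code; the
// buffering/forwarding into the archiver path is elided.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

func handler(ctx context.Context, e events.DynamoDBEvent) error {
	for _, record := range e.Records {
		// Keys of the changed item plus the event type (INSERT/MODIFY/REMOVE).
		log.Printf("event=%s keys=%v", record.EventName, record.Change.Keys)
		// ...append a changelog entry and hand it to the archiver to flush to S3.
	}
	return nil
}

func main() {
	lambda.Start(handler)
}
```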

We also introduced new components such as the archiver, which buffers changelog entries into a file and uploads it to a partitioned path in the S3 bucket. The partitioned path makes it easier for downstream systems to query newly created or changed DynamoDB entries (a sketch of the archiver's flush step follows the example below).

  • Example S3 partition path: <Prefix>/<ProjectID>/<Date>/<Hour_part_of_day>
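
The archiver's flush step is conceptually just "derive the partitioned key, then upload the buffered file." The sketch below shows that idea with the AWS SDK for Go; the bucket name, prefix, and file name are hypothetical.

```go
// Sketch of the archiver's flush step: derive the partitioned key from the
// entry's project ID and timestamp, then upload the buffered file to S3.
// The bucket, prefix, and file name are hypothetical.
package archiver

import (
	"bytes"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// partitionKey builds "<Prefix>/<ProjectID>/<Date>/<Hour_part_of_day>/<file>".
func partitionKey(prefix, projectID string, ts time.Time, file string) string {
	return fmt.Sprintf("%s/%s/%s/%02d/%s",
		prefix, projectID, ts.UTC().Format("2006-01-02"), ts.UTC().Hour(), file)
}

// flush uploads one buffered changelog file to its partitioned path.
func flush(projectID string, ts time.Time, buf *bytes.Buffer) error {
	uploader := s3manager.NewUploader(session.Must(session.NewSession()))
	_, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("objects-changelog"), // hypothetical bucket
		Key:    aws.String(partitionKey("changes", projectID, ts, "part-0000.json.gz")),
		Body:   bytes.NewReader(buf.Bytes()),
	})
	return err
}
```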

We ran into some pain points, including:

  • More moving parts, more points of failure

  • An increase in operational footprint

  • The introduction of Lambdas to consume CDC warranted additional components such as a DLQ and triggers on SQS 🤦

  • Data migration required additional tooling and effort

  • Human intervention was required during Lambda outages

But we also saw some advantages:

  • S3 as a changelog store initially cost roughly one-sixth as much as BigTable

  • Simplified producer logic

The system worked well at lower traffic but proved sub-optimal at higher traffic. Our efforts to address this by tuning the Lambda configuration with a higher batch size and a larger parallelization factor didn't yield the expected outcome; instead, they proved more expensive in both cost and operations.

V3 System

We took our learnings from the earlier implementations, and the V3 system uses the best of both. It consolidates all the components into a single public cloud (we use AWS) while keeping costs low: the total cost of the V3 system is less than $10K/month, compared to $60K/month with BigTable.

The Producer service writes to the object-changes topic using semantic partitioning, which ensures that changelog messages belonging to the same project ID are aggregated into the same files. By writing directly to the object-changes Kafka topic instead of depending on DynamoDB CDC, we got rid of components like the DLQ, Lambdas, the Lambda processor, and CDC, thereby reducing our overall operational footprint.
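
In practice, semantic partitioning simply means keying each changelog message by its project ID, so that the hash partitioner routes all of a project's changes to the same partition and, downstream, into the same files. The sketch below illustrates the idea with the sarama client; the topic name, payload fields, and client choice are assumptions rather than a description of our exact implementation.

```go
// Sketch of the producer's semantic partitioning: keying each changelog
// message by project ID so that a project's changes all land on one
// partition. The topic and payload fields are hypothetical.
package producer

import (
	"encoding/json"
	"time"

	"github.com/IBM/sarama"
)

type Change struct {
	ProjectID string    `json:"project_id"`
	ItemID    string    `json:"item_id"`
	Timestamp time.Time `json:"timestamp"` // used downstream to pick the S3 partition
}

func publish(p sarama.SyncProducer, c Change) error {
	payload, err := json.Marshal(c)
	if err != nil {
		return err
	}
	_, _, err = p.SendMessage(&sarama.ProducerMessage{
		Topic: "object-changes",
		// Keying by project ID is the "semantic" part: sarama's default hash
		// partitioner maps equal keys to the same partition.
		Key:   sarama.StringEncoder(c.ProjectID),
		Value: sarama.ByteEncoder(payload),
	})
	return err
}
```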

This design also minimized the effort of migrating data from BigTable to S3 without disrupting live traffic: the migration tool simply reads changelog data from BigTable and republishes it to the object-changes Kafka topic, and the changelog files land in the correct S3 partitioned paths based on the timestamp field in the Kafka message payload.
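
The migration tool is conceptually a simple copy loop: scan the changelog rows out of BigTable and republish them to the topic, letting the normal archiver path place them under the right S3 partition. The sketch below uses the Go BigTable client; the table name, topic, and key choice are hypothetical.

```go
// Sketch of the backfill: scan changelog rows out of BigTable and republish
// them to the object-changes topic, letting the normal archiver path place
// them under the right S3 partition using the payload timestamp.
// The table, topic, and key choice are hypothetical.
package migrate

import (
	"context"

	"cloud.google.com/go/bigtable"
	"github.com/IBM/sarama"
)

func backfill(ctx context.Context, bt *bigtable.Client, producer sarama.SyncProducer) error {
	tbl := bt.Open("changelog") // hypothetical BigTable table name
	var pubErr error
	// Stream every row; each cell carries an original changelog payload,
	// including the timestamp the archiver later uses to pick the S3 partition.
	err := tbl.ReadRows(ctx, bigtable.InfiniteRange(""), func(row bigtable.Row) bool {
		for _, items := range row {
			for _, cell := range items {
				_, _, pubErr = producer.SendMessage(&sarama.ProducerMessage{
					Topic: "object-changes",
					Key:   sarama.StringEncoder(row.Key()), // hypothetical key choice
					Value: sarama.ByteEncoder(cell.Value),
				})
				if pubErr != nil {
					return false // stop the scan on publish failure
				}
			}
		}
		return true
	})
	if err != nil {
		return err
	}
	return pubErr
}
```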

We had one main pain point:

  • The responsibility for detecting whether a DynamoDB item has changed fell back on the Producer service

And a lot more advantages:

  • Fewer components than the V2 system, and fewer points of failure

  • Reduced on-call fatigue and operational footprint

  • Costs less than $10K/month

  • Simplified data migration efforts

Conclusion

By optimizing the objects pipeline and transitioning from BigTable to S3 as the changelog data store, we greatly reduced costs and improved operational efficiency. The current system, V3, has led to annual cost savings of over $0.6 million.

Acknowledgements

Huge thanks to the entire team that made this possible: Akash Kashyap, Y Nguyen, Dominic Barnes, Sowjanya Paladugu, Anthony Vylushchak, Annie Zhao, Emily Jia.
