Engineers & Developers

Event-driven architecture, Kafka and CDPs: Joining internal infrastructure with your tech stack

Event streaming platforms such as Apache Kafka are gaining in importance across all industries. In this article, we'll discuss the benefits Apache Kafka implementations can gain from being paired with a CDP.

Sep 8, 2021

By Yanis Bousdira


It’s no secret that the amount of data collected from users across websites, mobile apps, and other digital platforms is on the rise. Data interactions went up by 5,000% between 2010 and 2020. This explosion in data volume has also driven a rise in event-driven architecture, a design pattern that uses “events” to communicate between applications. Event-driven architecture is extremely flexible and fast, and it is the pattern of choice for many use-cases, including putting customer data to work for marketing, advertising, product, and analytics. Integrations into the larger SaaS stack can, however, be complicated and costly. It doesn’t have to be this way. In this article, we’ll take a look at event streaming platforms, specifically the integration of Apache Kafka with your wider tech stack, and at how a CDP such as Segment can significantly streamline this process.

Integrating event-driven architecture and your wider tech stack is complicated

Various factors inform how complicated your event-driven architecture is going to be, but let’s walk through an example based on the event-streaming model. A customer completes an order on an e-commerce website. This action generates an important engagement event (e.g., Order Completed) that’s captured and processed by your event-streaming platform. But what if you also need to share this event with other SaaS tools across analytics, marketing, product, and advertising? Using point-to-point integrations with these destinations could look something like this:

Traditional Event streaming platform
This is a simplified diagram of a traditional event stream processing setup that uses point-to-point integrations to send data to eight tools. Four consumers read events off these topics, translate the original message into up to eight different API specs each, and then make the requests to each tool.
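To make the left-hand side concrete, here is a minimal, hypothetical sketch of one such consumer in Python, assuming the kafka-python and requests packages; the topic name, payload fields, and destination endpoints are all made up for illustration:

```python
import json

import requests
from kafka import KafkaConsumer  # assumes the kafka-python package

consumer = KafkaConsumer(
    "orders",                               # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# One bespoke mapping per downstream tool; field names and endpoints are invented.
def to_analytics_payload(event):
    return {"event_name": event["type"], "user": event["userId"], "value": event["total"]}

def to_email_payload(event):
    return {"contact_id": event["userId"], "trigger": event["type"]}

for message in consumer:
    event = message.value  # e.g. {"type": "Order Completed", "userId": "u123", "total": 42.5}
    # Every destination needs its own translation and its own request,
    # and this logic is repeated for every event type and every new tool.
    requests.post("https://analytics.example.com/events", json=to_analytics_payload(event))
    requests.post("https://email.example.com/api/triggers", json=to_email_payload(event))
```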

The problem here is that, for just one event, you’d need to develop and maintain bespoke point-to-point integration logic for each downstream tool. Now multiply this by hundreds of events and tens of different destinations. Depending on the size of your tech stack and the number of events processed, you could be looking at hundreds if not thousands of different mutations, which are costly to develop and maintain.

There is a better way to manage, and even simplify, the complexity of event-driven architecture. While the complexity and cost will differ from company to company, in Forrester’s Total Economic Impact assessment of Segment’s CDP, customer interviews pointed to a potential 90% reduction in the development time required for integrations. Before we get into how a CDP’s single API can help with this, let’s level-set on some definitions.

Event-driven architecture

As we established earlier, event-driven architecture is a technology design pattern in which applications use events to communicate with each other. These events can capture the state of an object, such as a user profile, or they can capture point-in-time actions. Events can be user-generated (a customer placing an e-commerce order or submitting their contact details) or system-generated (a triggered subscription payment reminder).

In its simplest form, event-driven architecture consists of a “producer” that captures or generates the event, a “broker” that organizes the data, and a “consumer” that listens for or receives the event. Producers and consumers are decoupled from each other, which provides a whole host of benefits: both can scale and fail independently, and both can be developed and audited with greater agility.

Producer → Broker → Consumer

Compared to the more traditional “simple messaging” or “publisher-subscriber” model, event streaming is widely seen as the superior technology for event-driven architecture. It is regarded as more flexible and reliable because it decouples not only the producer from the consumer, but the data itself too. Consumers can “read” the persisted data whenever they need it, rather than being sent the data at one fixed point in time. Here is a quick comparison between the two:

  • With the publisher-subscriber model, all subscribers to an event will receive it from the producer when the event gets created or “published”. Think of the data being pushed downstream. Data is not persisted and is “lost” once pushed to the consumer.

  • Event streaming allows events to be stored in a log (event stream). Producers write to the log, brokers organize the data, and consumers “read” from the log on their own independent schedule and position. Think of data being pulled. This ensures that data is persisted and can be “replayed” if errors occur. The sketch below shows both sides of this pattern.
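Here is a minimal sketch of the producer/broker/consumer pattern with Kafka, again assuming the kafka-python client; the topic name, payload, and consumer group are illustrative:

```python
import json

from kafka import KafkaProducer, KafkaConsumer  # assumes the kafka-python package

# Producer: writes an event to the "orders" log (topic) and moves on.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"type": "Order Completed", "userId": "u123", "total": 42.5})
producer.flush()

# Consumer: reads from the log at its own pace. Because events are persisted,
# a new consumer (or one recovering from an error) can replay from the beginning.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="analytics-loader",       # hypothetical consumer group
    auto_offset_reset="earliest",      # start from the earliest retained event
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.offset, message.value)   # each consumer tracks its own position in the log
```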

Within event streaming, one technology has firmly established itself as the component of choice for companies such as Zalando, Pinterest, Adidas, Airbnb, and Spotify, to name just a few. This technology is called Apache Kafka.

Apache Kafka

Apache Kafka is the world’s most prominent event streaming platform. Since being created at LinkedIn and open-sourced in 2011, Kafka has been adopted by thousands of companies, including more than 80% of Fortune 100 businesses (according to Apache Kafka). It’s highly flexible and can be deployed on-premise as well as in cloud environments. Kafka is most commonly used to power event-driven architecture by providing event streaming pipelines that help companies capture, process, and deliver data at unprecedented speed, scale, and cost efficiency.

“Here at Segment, we use Kafka too. Our infrastructure uses multiple Kafka clusters, serving more than 8 million messages per second on average, allowing us to provide customers a service that simplifies collecting, processing, and delivering events to all tools that rely on first-party customer data. We’ve spent 10 years optimizing our infrastructure to send trillions of events per month exactly once to over 400 different out-of-the-box destinations, guaranteeing high-quality data and creating 360-degree views of customers through identity resolution. All of this relies heavily on Kafka.” - Julien Fabre, Principal Site Reliability Engineer

Customer data platforms

But a service like Kafka really comes into its own when paired with a customer data platform like Segment. A Customer Data Platform (CDP) gives organizations the ability to decouple data collection from data delivery. With one standardized tracking API, events get tracked once, regardless of where the data comes from and where it is headed, and, in Segment’s case, delivered through a simplified UI to over 350 destinations. All of that without ever having to touch a tool’s code.
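As a rough sketch of what “track once” looks like with Segment’s server-side Python library (the write key and property names are placeholders, and the exact import path can vary by library version):

```python
import analytics  # Segment's analytics-python library

analytics.write_key = "YOUR_WRITE_KEY"  # placeholder

# One call, one spec. Which downstream tools receive this event is configured
# in the Segment UI rather than in code.
analytics.track("u123", "Order Completed", {
    "orderId": "o-456",
    "total": 42.5,
    "currency": "USD",
})
analytics.flush()  # deliver any queued messages before the process exits
```

Beyond this single API, a CDP such as Segment adds several other capabilities: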

  • Not all of your customer data sources will be server-side, and neither will all of your destination systems. Segment’s data collection libraries standardize data capture beyond Kafka to web and mobile, and can easily bundle and translate events to third parties directly from the client when needed.

  • Event validation raises alerts for malformed or corrupted events, transformations allow you to fix said events, and schema controls provide blocking capabilities. This means that the data quality of events processed by Kafka can be enforced before they ever make it downstream.

  • A customizable identity graph allows companies to automatically organize their events into user profiles, without having to rely on costly database joins or other forms of analysis to create a single view of the customer. Think of a user logging into your e-commerce storefront but completing the order on mobile (see the sketch after this list).

  • Rarely do we find that all customer intelligence can be self-contained in the events themselves, so Segment provides trait computation capabilities. With computed traits, you can look across events and attach further customer intelligence to individual users, or to companies for B2B use-cases. Our customers are often interested in aggregations such as “Lifetime Value”, “Number of logins in the last 14 days”, or “Most frequently viewed product category”.

  • In addition to calculating across events, a lot of customer intelligence sits in data warehouses. Segment’s SQL traits are a powerful way to query the warehouse directly and enrich customer profiles with the additional insights and computations managed there.

  • Once events are layered with advanced intelligence, many organizations look to cluster or segment users into addressable audiences. Segment provides both the ability to execute this segmentation and the subsequent integration of those audiences into the wider tech stack. Said integration can be done on a per-audience basis, or Journeys can be designed to orchestrate and iterate on end-to-end customer experiences across your entire tech stack with real-time engagement logic.
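To illustrate the identity resolution point from the list above, here is a hedged sketch using the same Segment Python library; the IDs, event names, traits, and keyword arguments shown are hypothetical and may vary by library version:

```python
import analytics  # Segment's analytics-python library, as above

analytics.write_key = "YOUR_WRITE_KEY"  # placeholder

# Anonymous browsing on the web storefront: only an anonymous ID is known.
analytics.track(anonymous_id="anon-789", event="Product Viewed",
                properties={"category": "Shoes"})

# The same person later logs in on mobile; identify ties the anonymous activity
# to a known user so the identity graph can merge both into a single profile.
analytics.identify(user_id="u123", anonymous_id="anon-789",
                   traits={"email": "jane@example.com"})

analytics.track(user_id="u123", event="Order Completed",
                properties={"total": 42.5, "currency": "USD"})
analytics.flush()
```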

Now that we have established what Apache Kafka is and what capabilities a CDP brings to the table, let’s take a look at how a CDP can help reduce the complexity of integrating your Apache Kafka stream with your wider tech stack.

Going from point-to-point with Apache Kafka to a centralized approach with a CDP

Let’s circle back to the example from earlier where a customer places an order on an e-commerce website.

The "order completed" event gets produced into one of our topics. Let’s say that we have four consumers reading from this topic and processing the event. Through the use of a CDP, we no longer have to develop consumers that translate the original event into the many connected tooling specs. We can leverage the CDPs single tracking API instead. Once collected by the CDP, the data is simply sent downstream to destinations that can be set up through a point-and-click UI rather than through code. You would effectively be going from left to right:

Comparison event streaming without and with CDP

On the right, a CDP is used to simplify the integration. Each consumer only needs to translate the original message into the CDP’s API spec once, and connectivity to the eight downstream tools is entirely managed by the CDP. This simplification helps organizations reduce the cost of complex consumer-to-destination configurations and ensures that event streams flow seamlessly from producer to target without major engineering dependencies.
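In code, the consumer from the earlier point-to-point sketch collapses to a single translation into the CDP’s spec. A hypothetical sketch, reusing the same made-up topic and field names:

```python
import json

import analytics                   # Segment's analytics-python library
from kafka import KafkaConsumer    # assumes the kafka-python package

analytics.write_key = "YOUR_WRITE_KEY"  # placeholder

consumer = KafkaConsumer(
    "orders",                          # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="segment-forwarder",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# The consumer translates each event into Segment's spec exactly once; fan-out to
# analytics, marketing, and advertising tools is configured in the CDP, not here.
for message in consumer:
    event = message.value
    analytics.track(
        user_id=event["userId"],
        event=event["type"],           # e.g. "Order Completed"
        properties={"total": event["total"], "currency": event.get("currency", "USD")},
    )
```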

CDPs as a replacement for Apache Kafka

As we have established above, a CDP can significantly simplify the activation of data in your Kafka cluster. Kafka is not the only data source for a CDP, though: most CDPs have their own proprietary collection libraries to capture data on web, mobile, and server-side. When using these CDP libraries, you capture data directly from the source rather than “subscribing” to data in event streaming platforms such as Kafka. The data in Kafka usually originates from a source that a CDP can plug into directly as well, so you might be wondering: if a CDP can do the capture too, can it replace Kafka completely? The short answer is: it depends. Kafka has broad use cases beyond behavioral customer data, and it is not uncommon for mature organizations to choose to manage their own data pipeline on Kafka for full flexibility. In this case, many of our customers adopt a hybrid approach where they continue to manage their own data collection pipelines, but then plug Segment in at the end as an activation layer for their customer data. Let’s take a look at an example reference architecture for this hybrid approach.

Reference architecture - Event streaming

Starting on the left, Segment’s JavaScript and mobile libraries are used to stream client-side data to Segment. In parallel, server-side events are captured and processed by Kafka, and Segment is used to connect various tools in the customer’s tech stack to said events, while also loading the data into various other downstream destinations. Segment is seen as a means to increase the value gained from the data flowing through Kafka, as it empowers various stakeholders in the business to self-serve their data more seamlessly. If managing your own data pipeline on Kafka isn’t important to you, then you could most likely get the value you are looking for directly from a CDP.

Wrapping things up

Event-streaming platforms are a very popular way for organizations to transfer data from a source system to a destination. When using Apache Kafka to integrate event streams with the wider tech stack, organizations still face the task of maintaining complex point-to-point integrations, and customer data platforms are positioned to help reduce these complexities. They provide a single API that consumers can integrate with, and the distribution of the data is managed within the CDP rather than within the consumer itself. If you have any questions about integrating Apache Kafka, or want to learn how Segment can help you gain a deeper understanding of your customers, please get in touch.
