
Benjamin Yolken on October 26th 2021

At Segment, we use Apache Kafka extensively to store customer events and connect the key pieces of our data processing pipeline. Last year we open-sourced topicctl, a tool that we developed for safer and easier management of the topics in our Kafka clusters; see our previous blog post for more details.

Since the initial release of topicctl, we’ve been working on several enhancements to the tool, with a particular focus on removing its dependencies on the Apache ZooKeeper APIs; as described more below, this is needed for a future world in which Kafka runs without ZooKeeper. We’ve also added authentication on broker API calls and fixed a number of user-reported bugs.

After months of internal testing, we’re pleased to announce that the new version, which we’re referring to as “v1”, is now ready for general use! See the repo README for more details on installing and using the latest version.

In the remainder of this post, we’d like to go into more detail on these changes and explain some of the technical challenges we faced in the process.

Kafka, ZooKeeper, and topicctl

A Kafka cluster consists of one or more brokers (i.e. nodes), which expose a set of APIs that allow clients to read and write data, among other use cases. The brokers coordinate to ensure that each has the latest version of the configuration metadata, that there is a single, agreed-upon leader for each partition, that messages are replicated to the right locations, and so forth.

Original architecture

Historically, the coordination described above has been done via a distributed key-value store called Apache ZooKeeper. The latter system stores shared metadata about everything in the cluster (brokers, topics, partitions, etc.) and has primitives to support coordination activities like leader election.

ZooKeeper was not just used internally by Kafka, but also externally by clients as an interface for interacting with cluster metadata. To fetch all topics in the cluster, for instance, a client would hit the ZooKeeper API and read the keys and values in a particular place in the ZooKeeper data hierarchy. Similarly, updates to metadata, e.g. changing the brokers assigned to each partition in a topic, were done by writing JSON blobs into ZooKeeper with the expected format in the expected place.
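
For illustration, here's a minimal Go sketch of that access pattern using the go-zookeeper client (this is not topicctl code; the paths follow Kafka's standard ZooKeeper layout):

```go
package main

import (
	"fmt"
	"time"

	"github.com/go-zookeeper/zk"
)

func main() {
	// Talk to ZooKeeper directly rather than to the Kafka brokers.
	conn, _, err := zk.Connect([]string{"localhost:2181"}, 10*time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Topics appear as child nodes under /brokers/topics.
	topics, _, err := conn.Children("/brokers/topics")
	if err != nil {
		panic(err)
	}

	for _, topic := range topics {
		// Each topic node holds a JSON blob describing its partition assignments.
		data, _, err := conn.Get("/brokers/topics/" + topic)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s: %s\n", topic, data)
	}
}
```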

Some of these operations could also be done through the broker APIs, but many could only be done via ZooKeeper.

Given these conditions, we decided to use ZooKeeper APIs extensively in the original version of topicctl. Although it might have been possible to provide some subset of functionality without going through ZooKeeper, this “mixed access” mode would have made the code significantly more complex and made troubleshooting connection issues harder because different operations would be talking to different systems.

Towards a ZooKeeper-less World

In 2019, a proposal was made to remove the ZooKeeper dependency from Kafka. This would require handling all coordination activities internally within the cluster (involving some significant architectural changes) and also adding new APIs so that clients would no longer need to hit ZooKeeper for any administrative operations.

The motivation behind this proposal was pretty straightforward — ZooKeeper is a robust system and generally works well for the coordination use cases of Kafka, but can be complex to set up and manage. Removing it would significantly simplify the Kafka architecture and improve its scalability.

This proposal was on our radar when we originally created topicctl, but the implementation was so far off in the future that we weren't worried about it interfering with our initial release. Recently, however, the first Kafka version that can run without ZooKeeper landed. We realized that we needed to embrace this new world so the tool would continue to work with newer Kafka versions.

At the same time, we got feedback both internally and externally that depending on ZooKeeper APIs for the tool would make security significantly harder. ZooKeeper does have its own ACL system, but managing this in parallel with the Kafka one is a pain, so many companies just block ZooKeeper API access completely for everything except the Kafka brokers. Many users would be reluctant to open this access up (rightfully so!) and thus the ZooKeeper requirement was blocking the adoption of the tool in many environments.

Given these multiple factors, removing the ZooKeeper requirement from topicctl became a high priority.

Removing the ZooKeeper requirement

In the original code for topicctl, all cluster admin access went through a single struct type, the admin client, which then used a private ZooKeeper client for fetching configs, updating topics, etc. This struct exposed methods that could be called by other parts of the tool; the golang code for triggering a leader election in the cluster, for instance, looked like the following (some details omitted for simplicity):
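
What follows is a simplified reconstruction rather than the verbatim topicctl source; the type names, fields, and JSON handling are approximations, and the ZooKeeper client shown is the go-zookeeper library:

```go
import (
	"context"
	"encoding/json"

	"github.com/go-zookeeper/zk"
)

// Client was a single concrete struct wrapping a ZooKeeper connection.
type Client struct {
	zkConn *zk.Conn
}

func (c *Client) RunLeaderElection(
	ctx context.Context,
	topic string,
	partitions []int,
) error {
	// Build the JSON payload that the Kafka controller expects to find in
	// ZooKeeper when a preferred-replica election is requested.
	type electionPartition struct {
		Topic     string `json:"topic"`
		Partition int    `json:"partition"`
	}
	election := struct {
		Version    int                 `json:"version"`
		Partitions []electionPartition `json:"partitions"`
	}{Version: 1}

	for _, partition := range partitions {
		election.Partitions = append(
			election.Partitions,
			electionPartition{Topic: topic, Partition: partition},
		)
	}

	data, err := json.Marshal(election)
	if err != nil {
		return err
	}

	// Write the blob to the well-known path; the controller watches this node
	// and kicks off the election when it appears.
	_, err = c.zkConn.Create(
		"/admin/preferred_replica_election",
		data,
		0,
		zk.WorldACL(zk.PermAll),
	)
	return err
}
```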

Note that the client in this case isn’t actually communicating with the Kafka brokers or using any Kafka APIs. It’s just writing some JSON into /admin/preferred_replica_election, which, by convention, is where the Kafka brokers will look to start the process of running a leader election.

Our first step was to convert the APIs exposed by this struct into a golang interface with two implementations: one that depended on ZooKeeper, i.e. using the code from our original version of the admin client, and a second that used only Kafka broker APIs.

So, the Client above became:
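
Roughly sketched (the real interface exposes many more admin methods than the one shown here):

```go
// Client is now an interface rather than a concrete struct, with one
// implementation backed by ZooKeeper and one backed only by broker APIs.
type Client interface {
	// RunLeaderElection triggers a preferred-replica election for the given
	// partitions of a topic.
	RunLeaderElection(ctx context.Context, topic string, partitions []int) error

	// ... plus the other admin methods (fetching topics, applying configs, etc.)
}
```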

with the RunLeaderElection implementations becoming the following for the ZooKeeper and ZooKeeper-less versions, respectively:
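
Again with approximate names; writeElection below is a hypothetical stand-in for the ZooKeeper write shown earlier, and the broker-based body starts out as a stub:

```go
// ZKAdminClient keeps the original, ZooKeeper-backed behavior.
type ZKAdminClient struct {
	zkConn *zk.Conn
}

func (c *ZKAdminClient) RunLeaderElection(
	ctx context.Context,
	topic string,
	partitions []int,
) error {
	// Same logic as the original client: marshal the election request and
	// write it into /admin/preferred_replica_election.
	// (writeElection is a hypothetical helper wrapping that code.)
	return c.writeElection(ctx, topic, partitions)
}

// BrokerAdminClient talks only to the Kafka broker APIs via kafka-go.
type BrokerAdminClient struct {
	client *kafka.Client
}

func (c *BrokerAdminClient) RunLeaderElection(
	ctx context.Context,
	topic string,
	partitions []int,
) error {
	// Not yet possible at this point; kafka-go didn't support the
	// ElectLeaders API when we started.
	return errors.New("leader election via broker APIs not yet implemented")
}
```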

The next step was to fill out the details of the broker-based admin client so that it actually worked. topicctl was already using the excellent kafka-go library for its functionality that depended on broker APIs (e.g., tailing topics), so we wanted to use that here as well. Unfortunately, however, this library was designed primarily for reading and writing data, as opposed to metadata, so it only supported a subset of the admin-related Kafka API.

After doing an inventory of our client’s requirements, we determined that there were six API calls we needed that were not yet supported by kafka-go.

Our next step was to update kafka-go to support these! At first, it looked easy: this library already had a nice interface for adding new Kafka APIs; all you had to do was create Go structs to match the API message specs, and then add some helper functions to do the calls.

But, as often happens, we ran into a wrinkle: a new variant of the Kafka protocol had been recently introduced (described here) to make API messages more space-efficient. Although most of the APIs we needed had versions predating the update, a few only supported the new protocol format. To add all of the APIs we needed, we’d have to update kafka-go to support the new format.

Thus, we first went through all of the protocol code in kafka-go, updating it to support both the old and new formats. The proposal linked above didn’t have 100% of the details we needed, so in several cases, we also had to consult the Kafka code to fully understand how newer messages were formatted. After much trial and error, we eventually got this code working and merged.

Once that was done, we were unblocked from adding the additional APIs, which we did in this change. Finally, we could go back to the topicctl code and fill out the implementation of the broker-based admin client.

Returning to the RunLeaderElection example from above, we now had something like:
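
Approximately like this, with the request fields following kafka-go's admin client API (details simplified):

```go
func (c *BrokerAdminClient) RunLeaderElection(
	ctx context.Context,
	topic string,
	partitions []int,
) error {
	// kafka-go now exposes the ElectLeaders broker API directly, so there are
	// no ZooKeeper writes involved at all.
	_, err := c.client.ElectLeaders(ctx, &kafka.ElectLeadersRequest{
		Topic:      topic,
		Partitions: partitions,
		Timeout:    30 * time.Second,
	})
	return err
}
```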

The end result is that we were able to get topicctl working end-to-end with either the ZooKeeper-based implementation (required for older clusters) or the ZooKeeper-less one (for newer clusters), with only minimal changes in the other parts of the code.

Security updates

In addition to removing the ZooKeeper requirement from topicctl, we also got several requests to support secure communication between the tool and the brokers in a cluster. We didn't include these features in the original version because we don't (yet) depend on them internally at Segment, but they're becoming increasingly important, particularly as users adopt externally hosted Kafka solutions like AWS MSK and Confluent Cloud.

We went ahead and fixed this, at least for the most common security mechanisms that Kafka supports. First, and most significantly, topicctl can now use TLS (called “SSL” for historical reasons in the Kafka documentation) to encrypt all communication between the tool and the brokers. 

In addition to TLS, we also added support for SASL authentication on these links. This provides a secure way for a client to present a username and password to the API; the permissions for each authenticated user can then be controlled in a fine-grained way via Kafka’s authorization settings.
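
To give a sense of what this looks like at the connection level, here's a minimal sketch using kafka-go's dialer with TLS and SASL/SCRAM; topicctl wires up the equivalent settings from its cluster configs, and the README documents the exact options (the broker address and credentials below are placeholders):

```go
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"time"

	"github.com/segmentio/kafka-go"
	"github.com/segmentio/kafka-go/sasl/scram"
)

func main() {
	// SASL/SCRAM mechanism for authenticating the client to the brokers.
	mechanism, err := scram.Mechanism(scram.SHA512, "my-username", "my-password")
	if err != nil {
		panic(err)
	}

	dialer := &kafka.Dialer{
		Timeout: 10 * time.Second,
		// TLS encrypts all traffic between the tool and the brokers.
		TLS: &tls.Config{},
		// SASL carries the credentials; broker-side ACLs then control what the
		// authenticated user is allowed to do.
		SASLMechanism: mechanism,
	}

	// All broker connections then go through the secured dialer.
	conn, err := dialer.DialContext(context.Background(), "tcp", "broker-1.example.com:9093")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	fmt.Println("connected with TLS + SASL")
}
```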

Testing and release

As we updated the internals of topicctl, we extended our unit tests to run through the core flows like applying a topic change under multiple conditions, e.g. using ZooKeeper vs. only using Kafka APIs. We also used docker-compose to create local clusters with different combinations of Kafka versions, security settings, and client settings to ensure that the tool worked as expected in all cases. 

Once this initial testing was done, we updated the internal tooling that Segment engineers use to run topicctl to use either the old version or the new one, depending on the cluster. In this way, we could roll out to newer, lower-risk clusters first, then eventually work up to the bigger, riskier ones. 

After several months of usage, we felt confident enough to use v1 for all of our clusters and deprecate the old version for both internal and external users of the tool.

Conclusion

topicctl v1 is ready for general use! You might find it a useful addition to your Kafka toolkit for understanding the data and metadata in your clusters, and for making config changes. Also, feel free to create issues in our GitHub repository to report problems or request features for future versions.


Pete Walker on October 12th 2021

At Segment, we work with a wide range of customers, in terms of both industry and scale. Some customers have millions of users and a lot of experience integrating our tracking libraries or similar systems; others are comparatively smaller and are just starting to work with such systems. Especially with these smaller customers, a common question is: how do we actually integrate Segment into our unique SPA (Single Page App) architecture?

In fact, we’ve seen this concern about using tracking libraries inside a SPA become a barrier to using tracking libraries at all! But not tracking anything means you don’t have visibility into how your customers are interacting with your brand. Therefore, you can’t objectively answer questions like: Which advertising or engagement channels are the most profitable for our business? What content or products should we be focusing our attention on improving? What behavior is correlated with people staying with our brand, or worse, churning?

Segment helps companies answer these questions by collecting customer behavior data in a way that is completely agnostic to the tools your teams rely on for insight and action. Once your customer data has been collected by Segment, we take care of the rest by making the data accessible to the teams that need it, empowering informed decision-making for improving the customer experience.

How Segment Works

Segment works by allowing engineering teams to collect data once, saving the time it would otherwise take to integrate each analytics or engagement tool individually. This also allows end users of the data (i.e. analytics, marketing, and product teams) to add the customer data tools that they need in a matter of minutes rather than months.

Getting clean tracking in place is crucial, especially for small businesses just starting out. At some point, your business will outgrow any ad hoc tracking implementation, making data-driven initiatives come to a screeching halt. Customer data should be an investment, not a liability.

I’d like to outline exactly how to accomplish an initial Segment integration, and what considerations need to be made for an effective implementation.  

What is a Single Page App?

A SPA, or Single Page Application, is a front-end architecture pattern where the entire application sits in a single HTML file in the browser, and the user interface loads dynamically via JavaScript as the user interacts with it. SPAs are very commonly implemented using libraries like React.js, Angular.js, or Vue.js; well-known examples include Gmail and Twitter. This architecture contrasts with the more traditional Multi-Page Application, which has links to multiple HTML pages that are each served to the client from a server as the user visits them.

In terms of integrating Segment with a SPA, the hidden challenge is that there is technically only one HTML page to track, and therefore no distinction between pages. This is an important consideration for the third type of data you will want to collect about your customers, the "where" in "the who, the what, and the where", since there is only one "where" with a SPA!

Segment’s Javascript Library

When integrating Segment on a Web Source, you'll first inject the tracking snippet on your site, which will look something like this:
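
The snippet below is abbreviated; the real one is a longer minified block that you copy from your Segment workspace, with your own write key in place of YOUR_WRITE_KEY:

```html
<script>
  !function(){var analytics=window.analytics=window.analytics||[];
    /* ...the rest of the minified snippet from your workspace is trimmed here... */
    analytics.load("YOUR_WRITE_KEY");
    analytics.page();
  }();
</script>
```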

You’ll notice that the second-to-last line in the script is "analytics.page()". This JavaScript records a page view and sends it to your Segment workspace every time that HTML page is loaded. The trouble with SPA integrations is that the HTML document only gets loaded once! Therefore the Analytics.js library won’t record a page view automatically for each "page" in our app, even if we are using a routing library like React Router to give our SPA the appearance of having multiple pages.

So, in order for us to effectively record where users are interacting with our app, we need to take manual control over the “analytics.page()” call.

The Implementation

As a demonstration, we’ve created this simple site from this code to illustrate how to accomplish an effective SPA implementation. We’re using React.js and React Router to give our app the appearance of having multiple pages and routes.

First, you’ll want to create a JavaScript source and place the Segment SDK into the only HTML document. You’ll then want to remove the "analytics.page()" call from the snippet. Once this is done, you are free to call "analytics.page()" for whatever makes sense as a "page" in the context of the app.

In our case, we have three different React components in a React Router switch: "Home", "Experiment", and "About".
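
Here's roughly what that routing looks like in our demo app (sketched in React Router v5 style; the file paths and component names are assumptions based on the pages above):

```javascript
import React from "react";
import { BrowserRouter, Switch, Route } from "react-router-dom";

import Home from "./Home";
import Experiment from "./Experiment";
import About from "./About";

// Each route renders a different component, but the browser never loads a new
// HTML document, so no page view is recorded automatically on navigation.
function App() {
  return (
    <BrowserRouter>
      <Switch>
        <Route exact path="/" component={Home} />
        <Route path="/experiment" component={Experiment} />
        <Route path="/about" component={About} />
      </Switch>
    </BrowserRouter>
  );
}

export default App;
```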

In a sense, each one of these components shows an entirely different page, and so we'll want to attach a page call to each one of these React components. This lets us answer questions like how many times a specific user has used the Experiment page.

In each of these components, we’ll call “analytics.page()” every time that specific component is loaded. In React, the corresponding lifecycle method for doing something when a component is loaded is “componentDidMount”, which is where we will track our page view:
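
Here's what that looks like for the Experiment component (a sketch modeled on the demo app; the page name you pass to the call is up to you):

```javascript
import React from "react";

class Experiment extends React.Component {
  componentDidMount() {
    // Runs each time this "page" component is mounted, so every visit to
    // /experiment records a page view in Segment.
    window.analytics.page("Experiment");
  }

  render() {
    return <h1>Experiment page</h1>;
  }
}

export default Experiment;
```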

As a side note, if you use hooks rather than class-based components in React, you can implement page views from within the "useEffect()" hook.
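
For example, a hooks-based version of one of our page components might look like this (again a sketch, not the exact demo code):

```javascript
import React, { useEffect } from "react";

function About() {
  useEffect(() => {
    // The empty dependency array means this runs once, when the component
    // mounts, which is the hooks equivalent of componentDidMount.
    window.analytics.page("About");
  }, []);

  return <h1>About page</h1>;
}

export default About;
```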

After doing the same for each one of our components that we consider to be a page, we can then start to track user actions and profile traits via the same analytics.js library, using the “analytics.track()” and “analytics.identify()” methods respectively.
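
For instance (the event name, traits, and properties below are purely illustrative):

```javascript
// Tie activity to a known user; traits become part of their profile.
window.analytics.identify("user_12345", {
  email: "jane@example.com",
  plan: "free",
});

// Record a discrete action the user took, with optional properties.
window.analytics.track("Experiment Started", {
  variant: "blue-button",
});
```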

Wrapping up

And there you have it! For any other user behaviors besides page views, we embed inline tracking code in much the same way we did for our page calls, and we are well on our way to a full implementation. For those of you wondering what exactly you should be tracking besides page views, there is an excellent article in our Analytics Academy that covers just that.

There is always a little bit of subtlety to every implementation. But because of the flexibility of Segment’s tracking libraries, it's easy to customize an implementation to virtually any architecture.
