At this point it's well accepted that analytics data is the beating heart of a great customer experience. It's what allows us to understand our customers' journey through the product and pinpoint opportunities for improvement.
This all sounds great, but reality is a messy place. As a team becomes an organization, the structure used for recording this data can easily diverge. One part of the product might use userId and another user_id. There might be both a CartCheckout and a CheckoutCart event that mean the same thing. Given thousands of call sites and hundreds of kinds of records, this sort of thing is an inevitability.
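As a rough illustration of how quickly these variants pile up, the sketch below groups property names that differ only in casing or separators. The helper names (normalizeName, findSpellingVariants) and the sample names are hypothetical, and the approach is deliberately naive: it would catch userId vs. user_id, but not reordered names such as CartCheckout vs. CheckoutCart.

```typescript
// Hypothetical helper: strip separators and casing so that
// "userId", "user_id", and "UserID" all collapse to "userid".
function normalizeName(name: string): string {
  return name.replace(/[^a-zA-Z0-9]/g, "").toLowerCase();
}

// Group names that differ only in casing or separators.
function findSpellingVariants(names: string[]): string[][] {
  // Object.create(null) avoids collisions with inherited keys like "constructor".
  const groups: { [key: string]: string[] } = Object.create(null);
  for (const name of names) {
    const key = normalizeName(name);
    const existing = groups[key] || [];
    if (existing.indexOf(name) === -1) existing.push(name);
    groups[key] = existing;
  }
  // Keep only groups with more than one distinct spelling.
  return Object.keys(groups)
    .map((key) => groups[key])
    .filter((group) => group.length > 1);
}

// Sample property names as they might be observed across call sites (made up).
const observed = ["userId", "user_id", "plan", "Plan", "cartTotal"];
const variants = findSpellingVariants(observed);
// variants: [["userId", "user_id"], ["plan", "Plan"]]
```

A check like this only detects the problem after the fact, which is part of why an enforceable shared vocabulary matters.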
Without an enforceable shared vocabulary for describing this data, an organization's ability to use this data in any meaningful way becomes crippled.
Downstream tools for analyzing data begin to lose value as redundant or unexpected data trickles in as a result of implementation errors at the source. Fixing these issues after they’ve made it downstream turns out to be a very expensive proposition: by some estimates, as much as 60% of a data scientist’s time is spent cleaning and organizing data.
At Segment, we’ve put a considerable amount of engineering effort into scaling our data pipeline to ensure low-latency, high-throughput, and reliable delivery of event data to power the customer data infrastructure for over 15 thousand companies.
We also recently launched Protocols to help our customers ensure high quality data at scale.
In this post, I want to explore some approaches we’re taking to tackle that dimension of scalability from a developer perspective, allowing organizations to scale the domain of their customer data with a shared, consistent representation of it.
To ensure a successful implementation of Segment, we’ll typically recommend that customers maintain something known as a “Tracking Plan.”
An example of a Tracking Plan that we would use internally for our own product
This spreadsheet gives structure and meaning to the events and fields that are present in every customer data payload.
A Tracking Plan becomes a critical tool in any scenario involving multiple engineers working across products and platforms. If there are no standards around how to represent a “Signed Up” event or what metadata is important to capture, you’ll eventually find every permutation of it when it comes time to make use of that business-critical data, rendering it “worse than useless.”
This Tracking Plan serves as a living document for Central Analytics, Product, Engineering and other teams to agree on what’s important to measure, how those measures are represented and how to name them (the three hard problems in analytics).
Where it breaks down, and how to fix it
As a Tracking Plan evolves, the code that implements it often does not change accordingly. Tickets may get assigned, but feature work is often prioritized over maintaining the tracking code, leading to a divergence between the Tracking Plan and the implementation. On top of this, ordinary human error is still a very real factor that can lead to an incorrect implementation.
This error is a pretty natural result of not having a system to provide validation to an implementor (both at implementation-time, and on an ongoing basis).
Validating that an implementation is correct relative to some idealized target sounds exactly like something machines can help us with, and indeed they have: from compilers that enforce certain invariants of programs to test frameworks that let us author scenarios in which to run our code and assert expected behaviors.
As alluded to above, an ideal system will provide feedback and validation at three critical places in the product development lifecycle:
At development time: “What should I implement?”
At build time: “Is it right?”
At CI time: “Has it stayed right?”
As a developer-focused company, we found aligning a great developer experience with the process improvements of a centralized Tracking Plan to be a compelling problem to solve.
That’s why we built, and are now open sourcing, Typewriter: a tool that lets developers “bring a tracking plan into their editor” by generating a strongly typed client library across a variety of languages from a centrally defined spec.
The developer experience of using a Typewriter-generated library in TypeScript
Typewriter delivers better developer ergonomics than our more general-purpose analytics libraries by providing a strongly typed API that speaks to a customer’s data domain. The events, their properties, types, and all associated documentation are present to inform product engineers who need to implement them perfectly to spec, all without leaving the comfort of their development environment.
Validation is performed at compile time and at runtime to ensure that tracking events are dispatched with the correct fields and types, giving immediate feedback on whether an implementation is correct.
This answers the questions of “What should I implement?” and “Is it right?” mentioned earlier. The remaining question of “Has it stayed right?” can be answered by integrating Typewriter as a task in your CI system.
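To make the idea concrete, here is a minimal hand-written sketch of the kind of typed wrapper such a library could expose. It is not Typewriter’s actual generated output: the SignedUp interface and its fields, the Transport type, and the signedUp and validateSignedUp functions are all hypothetical names chosen for illustration.

```typescript
// Hypothetical event shape, as it might be derived from a Tracking Plan entry.
interface SignedUp {
  /** Unique ID of the user who signed up. */
  userId: string;
  /** Pricing plan selected at signup. */
  plan: "free" | "team" | "business";
}

// Hypothetical stand-in for an underlying analytics client's track call.
type Transport = (event: string, properties: object) => void;

// Runtime check for callers that bypass the compiler (plain JS, `any` casts).
function validateSignedUp(props: SignedUp): string[] {
  const errors: string[] = [];
  if (typeof props.userId !== "string" || props.userId.length === 0) {
    errors.push("userId must be a non-empty string");
  }
  if (["free", "team", "business"].indexOf(props.plan) === -1) {
    errors.push("plan must be one of: free, team, business");
  }
  return errors;
}

// Typed entry point: the event name is fixed and the properties are checked.
function signedUp(send: Transport, props: SignedUp): void {
  const errors = validateSignedUp(props);
  if (errors.length > 0) {
    throw new Error("Invalid 'Signed Up' event: " + errors.join("; "));
  }
  send("Signed Up", props);
}

// Usage: record what was sent instead of calling a real client.
const sent: Array<{ event: string; properties: object }> = [];
const recordingTransport: Transport = (event, properties) => {
  sent.push({ event: event, properties: properties });
};

signedUp(recordingTransport, { userId: "u_123", plan: "team" });
// signedUp(recordingTransport, { user_id: "u_123" }); // rejected by the compiler
```

In this sketch the compiler catches misspelled or missing properties at the call site, while the runtime check covers callers that sidestep the type system.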
How it works
Typewriter uses what amounts to a machine-readable Tracking Plan: a rich language built on JSON Schema for defining and validating events, their properties, and associated types. That spec can be compiled into a standalone library. For the languages we target that have static type systems, Typewriter uses the excellent quicktype library to generate types.
This spec can be managed within your codebase, ensuring that any changes to it result in a regenerated client library and always up-to-date tracking code.
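As a rough sketch of what “a Tracking Plan as data” can look like, the example below describes one event using a small subset of JSON Schema keywords (type, properties, required) and checks payloads against it. The event, its properties, and the validate function are assumptions made for illustration; this is not Typewriter’s spec format or its validator, and a real project would rely on a full JSON Schema implementation. Rejecting unknown properties is a choice made here, similar in spirit to setting additionalProperties to false.

```typescript
// A small subset of JSON Schema: enough to describe flat event properties.
interface PropertySchema {
  type: "string" | "number" | "boolean";
  description?: string;
}

interface EventSchema {
  name: string;
  description?: string;
  properties: { [key: string]: PropertySchema };
  required: string[];
}

// Hypothetical Tracking Plan entry, written as data rather than a spreadsheet row.
const signedUpSchema: EventSchema = {
  name: "Signed Up",
  description: "Fired when a user completes account creation.",
  properties: {
    userId: { type: "string", description: "Unique ID of the user." },
    seats: { type: "number", description: "Number of seats purchased." },
  },
  required: ["userId"],
};

function hasOwn(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Minimal validator: checks required fields, unknown fields, and primitive types.
function validate(schema: EventSchema, payload: { [key: string]: unknown }): string[] {
  const errors: string[] = [];
  for (const key of schema.required) {
    if (!hasOwn(payload, key)) errors.push("missing required property: " + key);
  }
  for (const key of Object.keys(payload)) {
    if (!hasOwn(schema.properties, key)) {
      errors.push("unexpected property: " + key);
    } else if (typeof payload[key] !== schema.properties[key].type) {
      errors.push(key + " should be of type " + schema.properties[key].type);
    }
  }
  return errors;
}

const ok = validate(signedUpSchema, { userId: "u_123", seats: 5 });
// ok: []
const bad = validate(signedUpSchema, { user_id: "u_123", seats: "5" });
// bad: ["missing required property: userId",
//       "unexpected property: user_id",
//       "seats should be of type number"]
```

Because the definition is plain data, it can live in version control next to the code that depends on it, and the same definition can drive both type generation and runtime checks.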
What comes next
Being avid Segment users ourselves, we’ve been migrating our mountains of handwritten tracking code to Typewriter-generated libraries and have been excited to realize the productivity gains of offloading that work to the tooling.
Typewriter will continue to evolve to support the needs of all Segment customers too. We’re continuing to expand it and are open to community PRs!
We’d love for you to give Typewriter a shot in your own projects. Feel free to open issues, submit PRs, or reach us on Twitter @segment.