Data Cleaning: What It Is And How It Works

Kelly Kirwan on November 17th 2021

Two analysts walk into the boss’s office. Online purchases for their product have suddenly soared. This launches a series of events: the boss tells the manufacturer to ramp up production, the manufacturer orders more raw supplies, a fully loaded ship sets sail, and investors get bullish.

In a living room somewhere, a baby is playing with a tablet, happily punching the “buy now” button. The product? An encyclopedia.

Maybe you’ve seen this story, a video ad that Adobe produced in 2013. It paints the perils of poor data insights.

While the ad is hyperbolic, its premise resonates in real life: you can track data—but if it’s incomplete and inaccurate, it won’t help you make good decisions.

To avoid having dirty data, organizations must invest time and resources in data cleaning.

To show you how to begin the data cleansing process, we’ll discuss:

  • What is data cleaning?

  • 3 common data quality problems and what causes them

  • Clean data starts with a standardized collection process

  • How to clean data in 5 steps

  • Ensure clean data at the source with Protocols

What is data cleaning?

Data cleaning is the process of identifying and modifying or removing incorrect, duplicate, incomplete, invalid, or irrelevant data within a dataset. It helps ensure that data is correct, usable, and ready for data analysis.

As such, data cleaning is a crucial part of data management. Data scientists may also call it by other names, such as data cleansing, data scrubbing, and data wrangling.

3 common data quality problems and what causes them

There are three basic types of data quality problems that require data cleaning. And organizations with massive amounts of data and poor hygiene usually have all three.

Invalid data

Invalid data is data that is generated incorrectly. These mistakes can happen when your customer enters data on a form, or they can stem from your end, particularly the way you name data events and properties.

Sample scenarios:

  • Asking the wrong question: Labeling a field on a form as “kg” when you’re actually trying to measure pounds.

  • Not defining what makes data invalid: Not specifying types of data (ex: numeric values, string), a range of acceptable data (ex: only 19 years of age and above), or permitted values (ex: Gmail address not allowed if you want to get a company email ad).

  • Incorrectly labeling data: Placing a product in the wrong category (ex: bags under “apparel” instead of “accessories,” which invalidates your data if you’re tracking user actions related to product categories).

  • Using nearly identical names for different data events: If marketing uses the event name FormSubmitted for an ebook download while sales uses formSubmitted for a free demo request, it will be easy to confuse the two during analysis.

Duplicate data

Duplicate data is data that is collected more than once, either through one channel or across various channels. This can often happen when you have not unified data in a single customer view.

For instance, a person who clicked your Facebook ad might also visit your site’s promo page and buy a product in your brick-and-mortar store. Without a single customer view, you will not recognize that this person is one and the same.

Sample scenarios:

  • Customer is repeatedly asked to provide the same information: Customer downloads an ebook and needs to fill in their email address, company name, and company role. Next time they download another asset on the site, they’re asked for the same info because the site does not recognize them.

  • Data comes in varied formats from different sources: The same person might use their nickname on one channel and their full name on another. Without unifying the data, you’ll end up with two customer profiles that share some of the same details.

Incomplete data

Data is incomplete when it has missing values. Sometimes, this isn’t within your control; the customer might neglect or decline to provide certain information, or you might experience a technical glitch like data import issues. But you can inadvertently cause incomplete data collection by not clearly defining the details you need.

Sample scenarios:

  • Setting vague parameters for a form: You can wind up with incomplete data if you have a single field labeled "address," as opposed to discrete fields with specified data types or permitted values for the street name, house/unit number, city, state, country, and zip code.

  • Not identifying which data is required: You might not mark the zip code as a required property for completing checkout, and the customer might forget to provide that detail.

  • Not defining what data you need to meet your business goals: Like the analysts in the Adobe ad, you might fail to collect certain user traits or contextual information for data events.

Clean data starts with a standardized collection process

As we’ve seen, data quality problems tend to stem from the way you gather information in the first place. Here’s how you can improve your data collection process to enhance the quality of your customer information:

Establish a clear data tracking plan

A data tracking plan lays out your organization's standards for tracking data. For the plan to be useful, all teams in your organization need to refer to it as a common source of truth.

Your data tracking plan can come in the form of a spreadsheet that identifies:

  • Events (i.e., user actions) that you will track, along with a description of each event and your reason for tracking it

  • Properties to track under each event, including which ones are required and which are optional

  • Data types and permitted values for properties

  • Where and when each event is tracked

  • Where events should be located in your code base

Zoom with margin

Basic components of a data tracking plan. Source.

Once your tracking plan is complete, you can upload it to your customer data platform (CDP) and apply it to your sources.

If you don’t want to create a plan from scratch or need inspiration on how to build one, check out this tracking plan template, which suggests plans for various verticals.

Standardize naming conventions

We recommend these best practices for naming your data events and properties:

  • Choose standards for capitalization: Ex: Title Case for event names and snake_case for property names.

  • Pick an event name structure: We recommend using an Object + Action structure. Ex: Product Added.

  • Don’t create event names dynamically: Avoid including dynamic values, such as dates, in event names. Ex: Product Added (11-12-2021).

  • Don’t create events to track properties: Check if a value you’re adding to an event name could be a property or property value instead. Ex: Instead of the event name Under Armour Shoes Added, you could use properties, such as "category":"Shoes" and "brand":"Under Armour".

  • Don’t create property keys dynamically: Vague property names make it difficult to pinpoint which actions and values you’re tracking. For example, in the key-value pairs "category_1:"true" and "category_2":"false" , you wouldn’t be able to tell at a glance exactly which product categories they refer to.

Here’s how these data naming best practices would look like in a script:

Zoom with margin

Source

Block bad data at the source

Examine all customer interactions where you collect data. Spot and fix errors in fields that:

  • Don’t adhere to the formats specified in your tracking plan

  • Don’t set parameters for valid data inputs, if such are needed

  • Ask the wrong questions

  • Ask vague questions

How to clean data in 5 steps

To clean the raw data you collect—and keep it clean—start with these five steps:

1. Build a QA process to automatically validate data and diagnose errors

Automation is key for scaling your data cleaning process—otherwise, you’d have to manually scan every single piece of data you collect!

If you’ve connected your tracking plan to a data source via your CDP, your CDP will be able to identify non-conforming data. Protocols, a premium feature on the Segment CDP, runs automatic validation based on the plan. It flags violations when:

  • A required property is missing.

  • A property value’s data type is invalid.

  • A property value does not pass applied conditional filtering.

Next, set controls that tell your CDP how to handle non-conforming data. You might ask your CDP to alert you of violations or to block non-conforming events. The data won’t be sent to any tool downstream until your team manually approves it.

In the example below, the checkout ID, order ID, and shipping details are required to complete an order on a website. The CDP flags missing data as well as properties that don’t conform to your specified format.

Zoom with margin

Source

Pro tip: Use APIs to connect your data cleaning tools and make your QA automation more powerful. You can connect your CDP to a messaging tool like Slack, so you receive alerts of violations. You can also link to a business intelligence or data visualization tool that lets you create custom dashboards for analyzing violation trends, identifying outliers, and determining the causes of common violations.

2. Identify and remove duplicate data points

Through identity resolution, your CDP can create customer profiles that tie different actions across multiple channels to one and the same person. You can create identity rules that guide your CDP’s algorithm. It will decide whether to merge the information into a single profile or to remove one or more pieces of duplicate data.

3. Fix structural errors

Use data transformation to fix data with structural errors, such as typos, incorrect labels, or inconsistent formatting. For example, you might change an event name from completed_order to Order Completed.

Protocols’ Transformations tool lets you modify data to conform with either your tracking plan or your destination tool’s specs. With one click, you can apply the transformation to a source or destination, so you don’t have to scour your code and manually change the data value in question.

4. Identify and remove irrelevant data

Audit every piece of data and ensure it has a use. This ties back to the business goals laid out in your tracking plan. If a user trait, event, or property isn't included in the plan, it might be because tracking it does not help you achieve your objectives.

You can ask your CDP to alert you to irrelevant or unplanned data, so you can decide on an appropriate action.

5. Keep your tracking plan up-to-date

While your tracking plan serves as your data collection framework, keep in mind that it’s a living document. As your company grows and changes, so should your tracking plan.

Audit your plan to ensure it remains relevant to your business goals and customer journey. You can do this on a schedule—like every six months—or every time you implement a major business change, such as launching a product or targeting a new consumer segment.

Say you used to offer a freemium version of your SaaS product. However, you’re now eliminating the free plan and starting with a basic plan instead. You no longer have to track conversion rates from free to paid plans—and your tracking plan needs to reflect this.

Ensure clean data at the source with Protocols

Data cleaning may be time-consuming, but it’s a process that delivers real business results. For example, when Typeform used Protocols to implement clear naming conventions, they were able to identify which form and survey templates their customers preferred. They placed those templates at the top of their gallery and increased user engagement as a result.

Learn more about how Protocols can help you protect your data integrity.

The State of Personalization

Our annual look at how attitudes, preferences, and experiences with personalization have evolved over the past year.

Become a data expert.

Get the latest articles on all things data, product, and growth delivered straight to your inbox.