Josephine Liu, Sherry Huang on June 9th 2021

Our latest feature, Journeys, empowers teams to unify touchpoints across the end-to-end customer journey.


Calvin French-Owen, Kevin Garcia on August 26th 2019

For the past five years, nearly every company on the market has tried to become ‘data-driven’. You don’t want intuition and gut feelings driving company strategy. You want hard facts.

The unfortunate reality is that being ‘data-driven’ is incredibly hard to achieve in practice.

You need the relevant data from hundreds of different systems at your fingertips. You need to know where it lives, how it’s generated, and how to analyze it. And if you’re missing a key insight, you might be taking your company in the wrong strategic direction, costing you months of wasted effort and millions in spend. 

Today, we’d like to share the story of one of the most data-informed companies we’ve seen amongst our thousands of customers: Houseparty.

In this post, Jeff Needles from Houseparty shares how they’ve killed 40% of their experiments, increased signups across the board, and continuously focused on metrics and engagement. It’s a masterclass in using data to power your business.


Houseparty makes it incredibly easy for groups of up to 8 friends to get together and hang out in a video chat room. Over the years, the app has grown to host tens of millions of users, who, on average, spend more than an hour per day chatting with their friends. In total, their users are generating more than seven billion data points every single month.

In order to make an impact, Jeff has scaled both the way they collect data and the way they experiment, so that he is able to provide the value of a much bigger team without doing extra manual work. These efforts have helped them reliably collect billions of events, run over one hundred experiments in the last year, and improve iOS invites-sent by 64%.

Collecting terabyte-scale data

Jeff knew Houseparty needed to invest in data collection from the start—but he had one problem. 

Houseparty operates at a massive scale. Finding tools which would scale to billions of incoming events was one of the top requirements.

The search wasn’t easy. After testing tools to find the perfect combination of cost-effective options that provided the deep insights he needed, he decided on the following architecture:

  • Collect user events via Segment

  • Send those events into Taplytics and Amazon Redshift

  • Analyze the Amazon Redshift data in Periscope

With this simple setup, Jeff can quickly analyze the terabytes of data Houseparty is storing. His entire team can use Periscope to put together ad hoc reports and explore their data, while Jeff can query Amazon Redshift directly to get deep answers to his questions.
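To make the first step of that architecture concrete, here is a minimal sketch of collecting an event with Segment’s Python library (analytics-python). The write key, event name, and properties are placeholders for illustration and are not Houseparty’s actual instrumentation.

```python
# Minimal sketch: record an event with Segment's Python library.
# The write key, event name, and properties below are placeholders.
import analytics

analytics.write_key = "YOUR_WRITE_KEY"  # hypothetical key, set per workspace

# One call records the event; Segment fans it out to the configured
# destinations (e.g., an experimentation tool and a Redshift warehouse).
analytics.track(
    user_id="user-123",
    event="Invite Sent",
    properties={"platform": "ios", "friends_invited": 3},
)

analytics.flush()  # push any queued events before the process exits
```

From there, tools like Periscope simply query the warehouse copy of these events.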

Not all experiments are good

With his data infrastructure in place, Jeff focused on helping his product team iterate fast and easily understand their results.

Experimentation is part of the Houseparty DNA. At their weekly team meeting, anyone is able to propose new ideas to test—ranging from app features to ideas for push notifications. The product team then decides to take some of these ideas and turn them into real experiments—fast.

Every experiment they ship is paired with a specific hypothesis to test. The team states which metrics they hope to move and how much they hope to move them. Then, they actually roll out the change to a cohort of users to see the results.

In the last year, Houseparty has run more than 100 different experiments, each backed by its own SQL query. Every experiment is analyzed with a consistent set of SQL queries that run against the warehouse, based on metrics that Houseparty team members select through a simple UI.

Because the data is standardized in the warehouse using Segment’s schema, looking for statistical significance in a wide array of metrics is easy. 

Every experiment has a default set of global metrics (retention, online-ness, and time in conversation) as well as additional local metrics unique to the experiment. 

Results are calculated and stored every few hours for all active experiments.

While the results are continuously being calculated, the team reviews them at their weekly meeting to enforce solid practices. The team only looks at results for experiments that are statistically significant, have run for long enough, and have a full cohort of users. Once all of those pieces are in place, the team makes a decision: ship it or kill it.
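As a rough illustration of the kind of significance check described above, here is a small two-proportion z-test in Python. The cohort numbers are invented, and Houseparty’s actual SQL and metrics are not shown in this post.

```python
# Illustrative two-proportion z-test for an experiment's conversion metric.
# The cohort sizes and conversion counts below are made up.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for two proportions."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical control vs. treatment results pulled from the warehouse.
z, p = two_proportion_z_test(conv_a=1200, n_a=10000, conv_b=1320, n_b=10000)
print(f"z = {z:.2f}, p = {p:.4f}")  # decide only once p is small and the cohort is full
```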

This weekly cadence helps the team stay laser-focused on user engagement. If an experiment is actively hurting their metrics, they kill it. No questions asked.

Last year, Jeff and his team ran more than 100 experiments and decided to fully roll out 60% of them.

In the above graph, his team modeled their growth rate if they had shipped 100% of all experiments (the red line) vs. just the set of experiments they decided to keep (the green line).

Shipping all experiments would have created a 40% drag on the business!

This shocked us. We can only wonder how many areas of other businesses are being actively harmed, simply because the teams behind them aren’t measuring the impact.

Different users call for different experiences

One of the most exciting aspects of constant experimentation is that it allows you to discover really powerful insights that don’t seem obvious at first. One of those insights for Jeff is that Android users behave very differently than their iOS counterparts.

Regarding a redesign the team undertook last year, Jeff noted…

We tested everything. No matter what experiments we launch to unify the iOS and Android app experiences, they always perform worse.

The bright side is that these same insights have helped them optimize the user experience for each set of users. Depending on the operating system, users see a very different UI: Android users get a dark, translucent theme, while iOS users get a light, flat theme.

The differences don’t stop there. One of the biggest drivers of new Houseparty users comes from invites from friends. One friend will text another asking if they’d like to join them on the app. This text invitation is the single most important factor driving the company’s growth. 

The team experimented with 16 different variations of the invite text, ranging from the mundane to the bizarre.

The winning entries? Well… they were different. iOS and Android each got their own text message. On iOS, the text read “let’s (house)party”; on Android, “we need to talk ”.

This slight tweak increased the number of invites sent by 64% on iOS and 27% on Android! A massive improvement for overall viral growth. 

When data informs 

After talking with Jeff, it’s pretty clear that using data has revolutionized Houseparty’s business. Even as a small team, they’ve been able to use their metrics to achieve:

  • a 64% increase in invites sent on iOS

  • a 27% increase in invites sent on Android

  • certainty in the 60% of experiments which drive growth

  • avoiding the 40% of experiments which hurt engagement

  • separate, tailored experiences for iOS and Android

These tools are just one tiny example of how Houseparty has used great data to power experimentation, which powers personalization, which in turn has inflected their growth curve. 

With a product team that can test any idea, and get to statistical significance within the shortest possible window, the sky's the limit for where Houseparty can go.

Amy Li, Maya Spivak on August 14th 2019

Imagine you commute through San Francisco via the I-80E every day. Over the last several months, you might have noticed a billboard that did not seem quite right. And even worse, once you realized what was wrong…things only seemed to get messier.

Good morning…wait, what city are we in again?

One month later, still not populating correctly.

Acknowledging that we needed a moment to get the good data infrastructure in place.

It might have made you scratch your head, laugh, or even do a double-take. How could Segment get the city wrong? Why would we print “Good morning” on a static billboard that could be seen 24 hours a day? (We had one job!)

Well, there’s more. It didn’t just happen in San Francisco. These mixups found their way into billboards across Austin, New York, Los Angeles, Chicago, and five other major cities.

Actual location: Austin

Actual location: Los Angeles

Actual location: New York

In each case, we planned the mixup, and had a great time matching “rival” cities to each location. But what happens when a data mess of this caliber is unplanned? What happens when a business relies on bad data and ends up with a bunch of billboards greeting the wrong city?

It might not be as public as a billboard, but this sort of thing happens to businesses every single day. It happens because somewhere along the way the data they’re relying on has let them down. And using that bad data to influence real business decisions can lead to consequences much bigger than our amusing mixup. 

It makes you wonder: what good is bad data?

Bad data is everywhere

It doesn’t take long to find. Check your email inbox right now and it’s almost a sure bet that you’ll find emails that call you by the wrong name, recommend finishing the purchase on something you literally just bought, or that clearly were meant for someone else. It’s (hopefully) not something the company directed your way on purpose. They simply relied on bad data…and things did not go as intended.

Bad data can exist in many different forms—it can be stale, inaccessible, untrustworthy—but it always leads to bad outcomes. There are serious consequences to trusting data gone bad. IBM estimates that bad data costs the US $3 trillion each year. Bad data caused Delta to cancel hundreds of flights last year and cost the company $150 million.

The consequences aren’t strictly financial—they also impact your brand. Customers increasingly want to do business with brands that know their name and their preferences, and that can replicate great experiences across the web, in person, over the phone, and via email. If you’re not proactively addressing bad data, you’re at risk of becoming the next punchline on social media:

The irony of receiving an email about the GDPR that has bad data in it.

So, how do you avoid bad data?

Like everything in the universe, data tends toward chaos over time. If one product team at your company tracks “user_signup” and the other tracks “user_Signedup,” it won’t kill your business at first. But what happens when you have dozens of products and need to figure out how user signups are going overall? That simple question can take weeks to answer. And there’s a chance that your answer will still be wrong.

You need to prevent data from going bad starting on Day 1—and use technology to minimize the effort it takes to keep data in order as your business grows and changes.

Good data doesn’t just happen, it’s the result of investing the appropriate time and resources into maintaining it. Whether that means empowering a data wrangler or a team of data champions, you need someone at your company to proactively enforce the practices that lead to good data.

These practices include everyone working from the same data dictionary so that everything you track is standardized and easy to understand. It also means replacing the convoluted web of integrations between your marketing and analytics tools with something that makes it easy to track everything once and share the same data across all tools. It even means practices that help you know what data every tool has access to so you can filter and control which data goes where.
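As a sketch of what working from a shared data dictionary can look like in code, here is a small, hypothetical validator that flags events that are not in the agreed-upon dictionary or that are missing required properties. The event names and rules are illustrative only.

```python
# Hypothetical data dictionary: the only events (and required properties)
# the team has agreed to track. Everything else gets flagged.
APPROVED_EVENTS = {
    "User Signed Up": {"plan", "referrer"},
    "Order Completed": {"order_id", "revenue"},
}

def validate_event(name: str, properties: dict) -> list:
    """Return a list of problems with a proposed tracking call."""
    problems = []
    if name not in APPROVED_EVENTS:
        problems.append(f"'{name}' is not in the data dictionary")
    else:
        missing = APPROVED_EVENTS[name] - set(properties)
        if missing:
            problems.append(f"'{name}' is missing required properties: {sorted(missing)}")
    return problems

# A drifting name like "user_Signedup" is caught before it pollutes the warehouse.
print(validate_event("user_Signedup", {"plan": "free"}))
```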

Good data is Segment data

Segment is the easiest way to empower your business to follow the practices that lead to good data. Our technology helps you collect all of your data through one API where it can be cleaned and standardized. It helps you enforce a common data dictionary and the standards that prevent bad data no matter how complex your business or organization gets. It helps you trust that you are informing your decisions and tools with the absolute best data you can.

Whether you’re a Fortune 500 or a fast-growing startup, you have a choice to make when it comes to what kind of data powers your business. You can have data that powers revenue-growing decisions. You can have data that powers marketing campaigns and personalization that works. You can have data that helps build delightful, five-star customer experiences.

Or not. You can have bad data that leads to wasted money, bad decisions, and embarrassing billboards. The choice is yours.

Cindy Berman, Madelyn Mullen on August 12th 2019

Knowledge is power. Now it’s even easier for you to manage the customer data flowing into Segment and unlock the most value for your teams. Over the last few weeks, we’ve launched new features, including a centralized location to easily configure and manage your workspace notifications, plus detailed insights into the amount and types of data you’re sending to Segment. It’s also easier to set up and customize your workspace through the new workspace home setup flow, and you can quickly configure and label source environments as “Dev” or “Prod” to control who has access to each.

Here are the latest features, product updates, and integrations launched at Segment.

Connections 

Have you ever wanted to manage all of your workspace notifications for sources, destinations, and warehouses in one place? Now you can! Through a new UI, you can easily see and subscribe to all your notifications in a consolidated place. With more granular options, you can subscribe to the workspace updates that matter most to you. For example, you may just want to be notified when a source is created or deleted, instead of receiving all source notifications.

If you’re an existing customer, you can review your notification preferences by navigating to your workspace settings in the app and selecting “Notification Settings” under “User Preferences.” 

Customers now have increased visibility into which data types are flowing into their Segment workspace. Customers on a business plan can see their throughput count in the app, broken out by the number of API calls and objects they send to Segment (per MTU). With this visibility, you can now monitor how you’re trending against your account’s limits.

Sample throughput count

Here are a few more Connections updates:

  • Workspace home setup: To help you start getting value from Segment even faster, we released a new workspace home setup. It provides step-by-step guidance for all the tasks you’ll need to accomplish to get your data flowing into Segment and have your workspace up and running. It’s now even easier for you to add new sources, as well as send your data to your favorite tools for email, marketing, analytics, and more. The new workspace home is currently available to customers on our Team plan. Business plan customers can request access by emailing beta@segment.com.

  • Separate user permissions based on environment: As a best practice, we recommend customers separate their sources by environment (Dev, Prod, etc). But, we’ve also heard from customers that there isn’t an easy way to do this in the app beyond naming the source (e.g., “iOS - Dev” and “iOS - Prod”). Instead of manually naming your sources, you can now apply a label to each source based on the environment. This not only makes it more clear which sources include prod data and which sources include dev data, but it also enables you to set granular permissions based on environment. For example, you could give your QA team access to only sources with the “Dev” label. You can also filter your workspace view by environment to quickly view only your prod sources. To get started with environment labels, visit the source settings tab and select the relevant environment from the dropdown. 

We’re always adding new integration partners to the Segment Catalog. Here are a few of our latest integrations:

  • PixelMe - PixelMe's acquisition tools let you measure your attribution across channels, track ROI, and get a clear view of all your marketing efforts.

  • Lazy Lantern - Autonomous anomaly detection for product metrics. Connect your Segment account with just a few clicks and get notified when something doesn't look right — no coding involved. 

  • Airship - Send meaningful messages at every stage of the customer lifecycle based on event triggers and user attributes from your Segment data. 

  • WalkMe - A Digital Adoption Platform (DAP) which accelerates digital transformation across an entire software stack using analytics insights, in-app guidance, user segmented engagement, and automation tools.

  • Podsights - Measure the effectiveness of your podcast advertising by connecting downloads with on-site activity through household matching, cross-device graphs, custom URLs, and discount codes.

  • FactorsAI - Advanced and intuitive analytics for marketers and product managers to help drive growth. Connect your Segment data to capture different metrics, trends, and funnels and create intuitive and shareable dashboards.

  • FunnelFox - Connecting FunnelFox to Segment enables you to include website and product events into your sales funnel. In just a few clicks your customer information gets automatically updated, new leads flow into your CRM, and your pipeline can be updated based on events like form submits or product signups.

Looking for an integration that’s not in our Segment Catalog? Add your request now to Segment’s Integration Wishlist or upvote existing requests, all directly from the app.

Want to get early access to Segment features?

We’re always working on new beta products and features for customers. Become a beta participant to receive early access to these features.

Here are a few of the betas we’re actively recruiting for:

Reach out to beta@segment.com to request access to a beta today!

Want to learn about the latest Protocols updates? Check out this blog to review the latest updates, including collaboration, versioning, and libraries. 

We’re always making updates to Segment! To see everything for yourself, log in to your workspace.

New to Segment and want to learn more? Request a personalized demo.

Coleen Coolidge on August 8th 2019

Our approach

Respect for our customers, and the end users they seek to understand, has always been central to Segment’s values. We recognize that protecting privacy requires a holistic security program. We want you to have confidence in the way we practice security at Segment, and we want to earn your trust as the infrastructure for your customer data. That is why we built and continue to invest in holistic security and trust programs to protect your customer data.

We have an ISO 27001-based security program, which means we are continuously evaluating, refining, and augmenting our security offerings. We are able to do this by running multiple security programs at once, each with its own team, each focused on maturing a different area of security.

We use Amazon Web Services for our datacenter, which means our customers benefit from AWS’s comprehensive security practices and compliance certifications. We also recognize that AWS can’t do everything for us, and protecting our customers’ data requires a great internal security program as well.

Any experienced security practitioner can tell you that technology and processes are just two key components of an effective security program. People are the third component. At Segment, security is everyone’s responsibility.

Support from Segment leadership

I chose to come work at Segment because of the strong executive and board commitment to building, and then growing, a world-class security program. Having this kind of support makes a big difference in our daily work as well as our long-term success.

Segment’s leaders ensure we have the resources and headcount to provide the kind of security our customers deserve. The Security Organization is also boosted by consistent buy-in from every level of the company.

Our security culture

Ultimately, we have fostered a genuine interest in security throughout the company, and have created security champions on different teams to help us move the programs forward.

When we roll out security programs at Segment, we focus on enabling our people. We take time to demonstrate why security practices are important and how they intersect with their jobs. Sometimes, it’s about keeping people safe through good physical security. Sometimes, it’s about teaching our engineers how to think like an attacker, so they develop better code and take ownership of the security of their application or service.

Going forward

When I think about what our security teams are doing in the near future, we are creating systems and processes that are both easy and secure by default. We want to create an environment where doing something the wrong way is a lot of extra work and time, not even worth the effort. This is how the overall security community can make security viable and sustainable into the future.

For more information about our security and privacy practices, see our Security page. The team is growing, and if you’re considering joining, I encourage you to have a look at our careers page to see whether there’s a possible fit. Let’s chat.

Coleen Coolidge, CISO

Sasha Blumenfeld, Sudheendra Chilappagari on July 19th 2019

In the last month, the Segment partner ecosystem has been busy building several new integrations for you and your team. From machine learning tools to automated QA platforms, we’ve added 11 new integrations, allowing you to activate your first-party data in even more of the tools that you use every day. 

Check out the list of our latest partner-built integrations now available on the Segment catalog below.

Asayer | Digital Experience Optimization | Destination

  • Who is it for: SaaS Engineering teams

  • What is it: Asayer is a session replay tool that captures the complete picture of each user session on your website. It makes it easy to reproduce bugs, uncover key frustrations, improve performance and automate your tests—all in one place.

  • How does it work with Segment: Events you collect through Segment are passed to Asayer, which you can then use to search and view your user sessions.

Custify | Customer Success | Destination

  • Who is it for: SaaS companies 

  • What is it: Custify is a customer success platform that gives you a single view of your customer across different tools and touchpoints. Manual customer success workflows are automated, tasks and reminders are surfaced when they’re actionable, and each customer can be contacted proactively to avoid churn while increasing upsells and signups.

  • How does it work with Segment: Custify will receive your customer data and their interactions with your product via Segment. You will be able to start segmenting your customer base, create playbooks and monitor customer lifecycles within a few minutes after integrating with Segment. No developer time is needed. 

Hydra | Machine Learning | Destination |

  • Who is it for: B2B Marketing, Sales and Customer Success teams

  • What is it: Hydra is a predictive analytics solution that allows you to build your own A.I. models in a few clicks to find valuable insights from your data. Hydra relays these insights to your marketing, sales, and customer success platforms to support a range of use cases including automatic customer segmentation, prioritization, campaign personalization, and hyper-targeting.

  • How does it work with Segment: Using Segment not only simplifies the set-up process but guarantees access to consistent and clean data to feed your models. 

Klenty | Sales Engagement | Source

  • Who is it for: Inside Sales team

  • What is it: Klenty is an email automation solution focused on sales follow-up. Klenty helps automate top-of-the-sales-funnel activities like sending targeted email campaigns, following up, and tracking engagement metrics, so that your sales teams can focus on building relationships and closing more deals.

  • How does it work with Segment: Send your email engagement data from Klenty to your favorite analytical tools and build extensive reports and analytical dashboards. 

mabl | End-to-End Automated Testing | Destination

  • Who is it for: Software Quality Engineers and Developers

  • What is it: mabl provides a cloud-based platform for end-to-end web UI testing, including workflows with email and PDF components. Users can create scriptless, auto-healing tests using mabl's Chrome extension and integrate them into their CI/CD pipeline. mabl monitors your test results for JavaScript errors, visual anomalies, and performance degradation.

  • How does it work with Segment: Integrating Segment into mabl allows you to view your test coverage alongside actual user behavior. Combine your Segment data with test results and other measures of web page importance and complexity to help you focus your ongoing testing efforts.

Mammoth | Data Preparation | Destination

  • Who is it for: Data Analysts

  • What is it: A self-service data management platform that provides powerful tools and automation capabilities to transform, analyze, and derive insights on complex data without any coding.

  • How does it work with Segment: Send your web, mobile, CRM, email data, and more directly to Mammoth Analytics where you can prototype data workflows for easy data exploration and analysis.

Moesif | Analytics | Source

  • Who is it for: B2B and API companies

  • What is it: Moesif provides API analytics to better understand how your customers and partners use your APIs. Understand and visualize the entire journey of how customers and partners adopt and use your APIs. Discover issues in your integration funnel that stall growth while understanding which API features your most valuable customers are using the most.

  • How does it work with Segment: Connect Moesif API Analytics to Segment to sync API insights to your favorite BI, CRM, and marketing automation tools. See which customers have integrated your APIs and how much they are using them in your favorite CRM. Send targeted emails and trigger marketing automation playbooks depending on which API features your customers are using.

NorthStar | Growth Hacking Management | Destination

  • Who is it for: Growth Hackers, CRO

  • What is it: NorthStar helps growth teams manage the entire growth process, from ideation and prioritization through analysis and learning.

  • How does it work with Segment: By integrating NorthStar with Segment, growth teams are able to automatically populate the results of their tests into a card, making it easier to gather data and analyze its results.

ProveSource | Conversion Rate Optimization | Source 

  • Who is it for: Marketers and Conversion Rate Optimizers

  • What is it: ProveSource is a social proof marketing platform that streams recent customer behaviors on your website to build trust and increase conversions. You can show who recently purchased or signed up on your website as well as positive reviews from other platforms. Display the number of live visitors, page visitors, or show generic notifications. You can fully customize all aspects of the notifications, from design and position to timing rules and behavior.

  • How does it work with Segment: ProveSource sends all of the visitors' interaction data like views, hovers, clicks, and goal completions to Segment, which lets you analyze and measure the effect social proof has on your website.

Serenytics | Dashboards | Destination

  • Who is it for: Marketing teams

  • What is it: Serenytics is a self-service BI platform that allows you to create dashboards and share and schedule reports easily. Serenytics also provides a built-in Amazon Redshift data warehouse where your data will be stored and an ETL to aggregate and filter it. For custom needs, you can run your Python scripts (with credentials to access your data) within the platform.

  • How does it work with Segment: The Serenytics Destination lets you easily create dashboards to explore your first-party customer data or merge it with your other data sources for more granular analysis. 

Trackier | Performance Marketing Software | Source

  • Who is it for: Affiliate networks, ad agencies, and ad networks

  • What is it: Trackier software helps marketers determine the ROI of their online marketing activities by tracking digital marketing spend, impressions, clicks, conversions, installs, and media usage. Trackier provides detailed reports and visualizations, such as date-range reports, daily activities, and top-performing partners, to give you full visibility into the performance of all of your campaigns in one place.

  • How does it work with Segment: Segment allows you to get even more granular with your reporting in Trackier to determine which campaigns influenced high-intent customer behavior.


Don’t see the integration you’re looking for? Let us know directly within Segment.

Try our in-app wishlists to upvote the integrations you’d like to see built next.

Mallory Mooney - Technical Content Writer @ Datadog on July 17th 2019

This post is a guest submission by our partners at Datadog, highlighting their new integration with Segment. Thanks to Datadog for building a Segment integration that gives you full visibility into your data pipelines—allowing you to build alerts and dashboards off of your Segment delivery data so you can quickly assess the health and performance of your pipelines.

Datadog is a monitoring platform that provides deep visibility into your applications and infrastructure, so you can seamlessly monitor metrics, distributed request traces, and logs across your stack. We are pleased to announce that our new integration with Segment enables you to track the health and performance of your event delivery pipelines for all of your cloud mode Destinations, and ensures that you're able to effectively troubleshoot anomalies in real-time.    

A spike in event delivery latency, for example, could indicate a critical problem with your application servers. By having visibility into your event pipelines with Datadog, you are quickly alerted so you can investigate issues and explore these inconsistencies—before they start affecting your customers.

Deep visibility into your Segment pipelines

With Datadog's out-of-the-box dashboard, you can visualize Segment event delivery metrics for all of your cloud mode Destinations. You can get an overview of successful, discarded, and rejected deliveries as well as delivery latency so that you can quickly assess the health and performance of your Segment pipelines. Datadog tags all of your Segment metrics by workspace, Source, and Destination, so you can use the template variables to filter your view of the dashboard to get more precise insights into the health of your pipelines.

While this template dashboard provides a great starting point for monitoring your Segment pipelines, you can clone and customize it by adding other key metrics that are important to your team.

Identify trends in your pipelines

To visualize trends in the health and performance of your Segment pipelines, you can apply Datadog's machine learning–powered forecasting or anomaly detection algorithms to any event delivery metric.

As seen in the example above, you can use anomaly detection to identify any abnormal fluctuations in event delivery latency, based on historical trends.

Datadog's Segment integration helps you investigate anomalies in your Segment pipelines by correlating them with monitoring data from the other technologies in your stack. For example, this anomalous spike in latency of event deliveries to a specific Destination could indicate an issue with your servers or a network outage. With Datadog, you can correlate these types of incidents with the health of the server and network to find the root cause. Datadog also provides integrations for Segment Destinations such as Amazon Redshift and S3, so you can build comprehensive dashboards for your infrastructure.

Automatically get notified about rejected deliveries

In addition to using dashboards to track the health of your Segment pipelines, you can create custom alerts to automatically notify your team of critical issues with event deliveries to one or more of your Destinations. 

For instance, you can set up an alert to notify you of a spike in the total number of events that were retried for delivery to a Destination (`segment.event_delivery.retried`), which could point to issues with rate limiting or a configuration error for a specific Destination such as Pingdom.

Since Segment retries delivering events to Destinations up to nine times, you may want to set up an alert to detect when retries reach this threshold. If retries for any Destination hit the threshold, you will receive a notification with more details about the Source and Destination that triggered the alert so you can troubleshoot the issue more efficiently.
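As an illustration, here is a hedged sketch of creating that kind of monitor with the datadogpy client. The query window, threshold, tag names, and keys are placeholders; check Datadog’s documentation for the exact options available for Segment metrics.

```python
# Sketch: create a Datadog metric monitor on retried Segment deliveries.
# API/app keys, the query window, threshold, and tag names are placeholders.
from datadog import initialize, api

initialize(api_key="DD_API_KEY", app_key="DD_APP_KEY")

api.Monitor.create(
    type="metric alert",
    query="sum(last_1h):sum:segment.event_delivery.retried{*} by {destination} > 1000",
    name="Spike in retried Segment event deliveries",
    message="Event delivery retries are spiking; check rate limits and "
            "destination configuration. @slack-data-eng",
    tags=["integration:segment"],
)
```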

Enable the Datadog integration

To start visualizing and alerting on your Segment pipelines with this integration, you'll need to grant Datadog read-only access to each workspace you want to monitor. If you're not yet using Datadog, you can get started with a free trial. In your Datadog account, navigate to the Segment integration tile and click on the "Add Workspace" link. This will redirect you to log into your Segment account and authorize access to your desired workspace.

Within minutes, you should be able to see data from your Segment pipelines flowing into your out-of-the-box dashboard. You can filter this data based on workspace, Source, and Destination, track these metrics in custom dashboards, and set up alerts to proactively monitor the performance of your event deliveries.

Integrate Segment with Datadog

With Datadog's new Segment integration, you can get real-time visibility into the health of your Segment pipelines alongside other technologies in your stack. If you already have a Datadog account, you can learn more about the Segment integration and collected metrics by checking out our documentation. And if you’re not yet using Datadog, you can start monitoring your events alongside the rest of your infrastructure and applications by signing up for a 14-day trial.

Tido Carriero on July 9th 2019

About six months into building a new product at Segment, an alarming question began to loom over our team: “Have we spent too much time on this with not enough to show for it?”


When we first launched Segment, the answer to this question was much clearer. The beautiful thing about a startup’s journey to find product-market fit for the first time is that the company has a very well-defined amount of runway. When you have $X in your bank account and $Y burn per month, you can figure out exactly when your cash runs out. This kind of pressure regularly creates diamonds.

In established companies that already have found product-market fit, like Segment today, the runway calculation for a new product is much more ambiguous. There is no go-to amount to invest in R&D for a particular product, and it needs to hit a much higher revenue goal to make a meaningful contribution.

Luckily, the story I started with has a happy ending. This product turned into Personas, which now has strong product-market fit. But, admittedly, the approach we took when building this product had some flaws. If we had ruthlessly prioritized our customer problems and solutions, instead of embarking on an infrastructure project too soon, we never would have let the runway question creep up on us.

In this article, I’ll share learnings for product managers building new products within post-product-market fit startups. I’ll discuss what we learned building Personas and how we applied that knowledge to our next new product, Protocols. Together, just one year after launching Personas and six months after releasing Protocols, these new products total over $10M in ARR with a healthy mix between the two!

Don’t forget the M in MVP (Minimum Viable Product)

When I joined Segment three and a half years ago, our product helped companies route data from where it was generated to where it would be useful for their business. We provided one API to collect data from websites, mobile apps, and servers. With that single integration, we enabled companies to quickly turn on hundreds of new tools for growth marketing and analytics. 

We knew that we were in an excellent position to help our customers do more with their first-party data flowing through Segment. We believed we could help them tie together their data into a complete picture of each user that they could then use to personalize their product, marketing, and support. This change would transform our value proposition from saving engineering time to generating new revenue for businesses. In 2017, we decided to fund a project to tackle this vision. 

We started with research and found that customers had a ton of different ideas for what they wanted to do with the stitched-together user profiles. Some customers wanted to use profiles to build a “sidebar” for a customer support platform where an agent could see everything a user had done in order to answer his or her questions faster. Others wanted to build advanced personalization on their websites by calling the user profile programmatically. Still, others wanted to build “functions” for engineering teams to create on-the-fly computations over customer profiles for machine learning purposes.

With a product vision and general validation from customers, we started building an infrastructure that would support a broad range of use cases where we found user interest. We built a way to stitch together disparate IDs, host complete user profiles, and enable computations on top of the data. We were excited about the infrastructure we built, but about six months in, we started to worry about our runway. 

Because we hadn’t focused on a single “job to be done” or created mocks for a specific solution we expected to ship, we didn’t have any way to show customers the infrastructure work and gather feedback. We underestimated just how much engineering work was required for the infrastructure we imagined, and we lacked specificity on the exact problem we wanted to solve first for our customers. We realized we needed to make a significant change to our approach.

We did more research and forced ourselves to prioritize a single use case. From this second round of research, we discovered that the top use case for profile technology was personalization across all channels. The audience was growth marketers, and they needed a powerful user interface for centralizing audiences based on the most granular data possible to sync out to end channels. While I am proud of the product we built, I believe that if we had started with this level of focus, we could have gotten the product to market much faster. 

Here is a screen capture of Personas today.

From this point forward, we decided as a product team that we’d be much tighter on scoping truly minimal MVPs for all future projects.

Mocks before code: the most efficient path

Having learned from our experience building Personas, we took a very different approach to finding product-market fit for our next major product initiative focused on data governance. 

We noticed that there were a lot of inbound requests for additions to a feature we called “Schema.” Schema provides an overview of the data passing through Segment and basic controls for filtering out some of the data. While we could have continued to build the incremental improvements that customers requested, we didn’t expect to see a significant business impact from that approach. Based on the volume of these requests, and the purchasing power of the customers requesting them, we decided to take a step back and see how we could provide a more impactful solution to the underlying challenges our customers were facing.

Before we built anything, we assigned a team of one designer and one product manager to research the problem. We set up interviews with more than 25 customers who had submitted requests under “data governance,” which seemed like the overarching theme of their challenges. These interviews dug into the problems they were facing, how they’d prioritize solving these problems, the impact the issues had on their businesses, and their current solutions. 

We quickly noticed a pattern: customers were highly incentivized to avoid sending poor quality data to Segment. Messy data caused them to make bad product decisions, misfire marketing campaigns, and waste countless hours tracking down and cleansing bad data. Once we identified this problem, the product manager and designer pulled in a larger group, including engineering, marketing, and sales to brainstorm different approaches for how we could help our customers solve this. We came up with three potential directions that we took back to customers for feedback:

  • Anomaly detection — When an event stops firing, automatically notify a customer.

  • Transformations — When an event is sent incorrectly into Segment, allow the customer to fix it using the Segment UI.

  • Automatic data verification — Provide a workflow to define a data schema and a set of tools to help verify and enforce that the events match the planned schema. (This later became Protocols.)

Our goal for the quarter was to get three screenshots of either emails or Slack messages with customers that showed real excitement around one of these features. Our stretch goal was to get a customer to tell us they’d be willing to pay for this feature once fully built out. The name of the game was prioritization. We wanted to identify which solution had the most customer pull, so we could build toward that job and that job only. 

Original low-fidelity mockups

Final product solution. Notice how similar they are!

We ended up prototyping janky versions of all three ideas. I initially thought anomaly detection was going to win out, but after running it for weeks, we still hadn’t found a situation where we saved a customer from ruin. It seemed like a “nice-to-have” feature. Transformations might have solved some real problems, but it effectively had no value for the type of data we could easily apply it to, and the cost to build transformations into all of our data types far exceeded the runway we had set for ourselves. 

We had one incredibly memorable customer visit that helped push us over the edge to focus on automatic data verification. Our designer, tech lead, and product manager flew to Los Angeles for an all-day onsite with one of our biggest enterprise customers. We watched their process for shipping a new release. 

They had a full team of more than five QA engineers that needed to test the data tracking across dozens of different mobile devices before they could release the change—almost a day’s worth of work. We realized how much better our original prototype could be if it automatically notified and enforced their preferred data spec, allowing them to quarantine or block bad data before it hit any of their production analytics or marketing services. 

This visit—and several additional interviews to test our new mocks with other customers—gave us great conviction that we were onto something. We continued to work hand-in-hand with customers to finalize the first version of the product and continue to refine it with customers today.

This story has another happy ending: Protocols had the fastest revenue ramp of any new product we have launched. Currently, it has a more natural go-to-market motion than past add-on products given we are selling to the same departments who buy our core product, Connections.

Principles for finding product-market fit again and again

As we continue our journey to find new product-market fit every year, we’ve embraced a few principles from our experience launching Personas and Protocols.

Ideas come from everywhere.

One of the keys to Protocols’ success was having engineering, product, and design collaborate from the very beginning. Design had some of the breakthrough ideas about the cheapest prototypes we could build. After hearing the pain directly from one customer, an engineer realized that to help customers uphold data quality, we had an opportunity to add a type-checking system in the code editor to prevent the errors from ever being introduced in the first place. This nugget turned into Typewriter, which is a favorite feature for engineers. In fact, the PM (traditionally the “finder” of new product-market fit) has described their role as “fading into the background, while the team generates all the great ideas.”
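To illustrate the idea only (this is not Typewriter’s actual generated output), here is a hand-written Python sketch of what a strongly typed tracking wrapper buys you: a malformed event fails when the code is written or type-checked, rather than after it reaches your analytics tools.

```python
# Purely illustrative sketch of a typed tracking wrapper; the names and
# event shape here are invented, not generated by Typewriter.
from dataclasses import dataclass
import analytics

@dataclass
class OrderCompleted:
    order_id: str
    revenue: float

def track_order_completed(user_id: str, props: OrderCompleted) -> None:
    # A misspelled field name raises immediately, and a wrong type is
    # flagged by a type checker before the event is ever sent.
    analytics.track(user_id, "Order Completed", vars(props))

track_order_completed("user-123", OrderCompleted(order_id="ord_1", revenue=42.0))
```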

As we took the product to market, we also had our product and engineering teams work in incredibly close partnership with our go-to-market team. We received many of our later ideas about how to make the product more sellable from our partners in product marketing, sales ops, and sales engineering.

Prototype, iterate, and try not to code. 

While this is common advice for companies finding product-market fit the first time, we learned this lesson again—the hard way. The main reason that Protocols went to market so much faster than Personas was an incredible cycle and speed of prototypes and iterations. We were able to explore potential solutions with much more breadth, so by the time we took the product to General Availability, we had a tremendous amount of confidence that it would be a hit.

Choose your audience carefully.

One of the most important lessons about finding product-market fit again is that you need to deeply think about how your new product addition aligns with your existing customer base and sales motion. With Personas, we started to involve a new buyer (in our case, the Marketing team), which both expanded the market and also created go-to-market challenges in training the sales team and adapting our message. With Protocols, it was a much easier path since the product solved a huge pain-point for our core buyer. That said, there was more friction in packaging Protocols separately from Connections due to the close association. 

Understand your goals and runway.

With Personas, we had a vague idea of our runway, and we hadn’t stated the business goals clearly beforehand. Without agreeing on our intended investment upfront, we set off on an over-scoped MVP and then found ourselves doubting our progress about halfway through the project. A long infrastructure investment can be the right choice for carefully vetted and scoped projects, but we hadn’t been honest with ourselves about just how much runway we needed to pull this off. We also didn’t set any milestones along the way to prove customer value.

With Protocols, we fixed these mistakes. Having a very clear understanding of goals and runway helped us pick between the prototypes. One of them was going to take too long; the other wasn’t going to drive enough impact, even though it was a “nice-to-have.” Knowing we had six months to generate a new add-on product was a useful north star.

As a SaaS business, private and public investors measure us by our ability to increase net recurring revenue (NRR). This means we need to build new products to sell to our customer base and increase the value of our existing products. Therefore, finding product-market fit again and again is essential for us (and likely you) to get right. 

I encourage you to learn from our mistakes as you build out your own new products. These lessons make a lot of sense in theory, but they are often hard to implement in practice. It’s tempting to follow your gut, make tangible engineering progress as fast as possible (without customer validation), and avoid the tough questions that might lead you to kill a project.  

We’re not done learning yet. This year we have some exciting new privacy products launching. We’re running a similar process to how we ran Protocols to find product-market fit faster, and we will continue to improve on our process and share what we’ve learned along the way. 

Want to iterate together with us? We’re hiring.

Madelyn Mullen, Andy Schumeister, Doug Roberge, Sasha Blumenfeld on June 24th 2019

Visibility is crucial for driving growth. We want businesses that manage their customer data with Segment to have deep insight into how their data is flowing—and help them quickly solve any problems that arise.

Over the last few weeks, we’ve launched features that give you more visibility into what’s driving your usage metrics, send you alerts when data is not being delivered to tools the way you need, and help you adapt your tracking plan to prevent issues in the first place. 

Here are the latest features, product updates, and integrations launched at Segment.


Connections

Ever wondered where most of your user volume is coming from? We just launched usage insights to help you break down user volumes by source. You can use this to maximize visibility into what’s driving your user growth as your volumes increase.

If you’re an existing customer, check out your own Usage Insights by navigating to your workspace settings in the app today!

We’ve also recently launched a new feature to improve your visibility when it comes to syncing errors. Segment helps sync events from your data sources to more than 200 tools, and we want you to be prepared when errors happen—whether it’s a network connectivity issue or a partner tool outage.

Now you can subscribe to custom delivery alerts to have Segment let you know when your connection delivery rates dip below a threshold of your choosing. Never miss a beat when it comes to the connections that matter most to you and your business.

Interested in setting up a custom delivery alert? Navigate to the “Event Delivery” section for your favorite destinations and opt in to receive alerts.

Here are a few more Connections updates:

  • New audit trail: To help you monitor and audit activity in your workspace, we shipped a new audit trail. The audit trail provides a log of every change or update, who made the change, and when it happened. For a more detailed analysis, you can also export your audit trail or route audit events to a Segment source. To get started, go to workspace settings in the app, and select “Audit Trail.”  

  • New destination filters: Filter the data flowing to your destinations by event, event properties, and more. This gives you more control over data costs and helps you send the right data across your tools. Learn more use cases here →

  • New React Native source: Segment’s React Native library source makes it easy to send your iOS and Android mobile app data to any analytics or marketing tool without having to learn, test, implement, and maintain a new API every time your team tries a new tool across supported mobile platforms. See the docs →

  • New MTU alerting: In order to offer customers more visibility into their MTU usage, we recently launched MTU alerting. We’ll now proactively reach out to you when you’ve crossed over 85% of your MTU limit to ensure you’re in control. 

  • Updated Stripe source docs: Strong Customer Authentication (SCA) is a new European regulatory requirement to reduce fraud and make online payments more secure. We’ve updated our Stripe source docs to add Payment Methods, Intents Collection, and Invoice Lines Collection to support our customers as they prepare for SCA. 

Here are a few of our latest integrations:

  • Hydra -  Predictive analytics tool that helps marketing, sales operations, and customer success teams find revenue signals by scanning data.

  • Mabl - End-to-end automated testing tool for scriptless cross-browser testing, auto-healing tests, and visual testing, all in one place.

  • Mammoth Analytics - Self-service BI platform with tools to transform, analyze, and get insights on complex data without coding.

  • Moesif - Sync Moesif API insights and user profiles to Segment and send them to your favorite business intelligence, CRM, and marketing tools.

  • Trackier - Performance marketing tool that helps businesses create, automate, measure, and optimize marketing campaigns.

  • ProveSource - Sync ProveSource insights and user events to Segment to use the data across your reporting, analytics, and marketing tools.

  • Serenytics - Full data stack that makes it simple to set up a complete data processing chain, from collecting your events to sharing your KPIs in dashboards.


Want to get early access to Segment features?

Here are a few of the betas we’re actively recruiting for:

  • Set up Segment event tracking without any coding or engineering work

  • Send Segment data anywhere via Google Cloud Functions or Azure Functions

  • Use the new Object API to ETL virtually any object data into Segment

  • Build a custom source or destination for your workspace

  • Connect Segment data to your data lake

Reach out to beta@segment.com to join a beta today! 


Protocols

We’ve heard from customers that one of the biggest challenges they face is managing analytics rollouts across multiple mobile app versions. If version 1.2.1 of your app collects a property called productId and version 1.2.2 refers to the property as product_id, it’s nearly impossible to validate both properties against the same tracking plan or spec.

Protocols now supports event versioning so you can dynamically validate your event stream against the relevant schema definition. That way, a single tracking plan can be used to validate multiple versions of your app.

Check out the docs to learn how you can start using event versioning. 
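To make the idea concrete, here is an illustrative sketch of version-aware validation using the jsonschema library. The schemas and lookup here are invented for illustration; Protocols derives this from your tracking plan so you don’t have to write it yourself.

```python
# Illustrative only: pick a schema based on the app version attached to
# the event, then validate the event's properties against it.
from jsonschema import ValidationError, validate

SCHEMAS_BY_APP_VERSION = {
    "1.2.1": {"type": "object",
              "properties": {"productId": {"type": "string"}},
              "required": ["productId"]},
    "1.2.2": {"type": "object",
              "properties": {"product_id": {"type": "string"}},
              "required": ["product_id"]},
}

def event_is_valid(event: dict) -> bool:
    schema = SCHEMAS_BY_APP_VERSION[event["context"]["app"]["version"]]
    try:
        validate(instance=event["properties"], schema=schema)
        return True
    except ValidationError:
        return False

event = {"context": {"app": {"version": "1.2.2"}},
         "properties": {"product_id": "prod_42"}}
print(event_is_valid(event))  # True: validated against the 1.2.2 schema
```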

Here are a few more Protocols updates:

  • Added property groups for scaled edits: While the properties typically vary from event to event, there are also many properties that are required for every event. Instead of manually adding these properties to each event, you can now create property groups and quickly attach them to every event or specific events in bulk. 

  • New changelog to monitor tracking plan updates: Each tracking plan now includes its own changelog so you can quickly audit when your teammates make changes to it. To view the changelog:

    • Open your tracking plan 

    • Select “View Changelog” in the ellipsis menu in the right-hand corner

  • New editing workflow to support auto-save and merging edits: With the new “draft mode,” you can collaboratively update your tracking plan with your teammates. Before your changes go into effect, you’ll be able to review all edits and specify which changes should be merged. That way, only approved changes will be deployed to your tracking plan. 

These new features are available to all customers using Protocols. If you’d like to see how Protocols can help you protect the integrity of your data, request a demo today.


Personas

It’s not uncommon for Segment customers to use multiple instances of the same tool. For example, you might have teams working in different countries or on different product lines using multiple instances of AdWords. However, each tool often relies on the same (or a very similar) set of customer data.

With a recent update to Personas, you can now deliver your computed traits and audiences to as many different instances of the same destination as your team needs.


We’re always making updates to Segment! To see everything for yourself, log in to your workspace.

New to Segment and want to learn more? Request a personalized demo.

Lauren Reeder on June 18th 2019

All too often, we hear that businesses want to do more with their customer data. They want to be data-informed, they want to provide better customer experiences, and—most of all—they just want to understand their customers. 

Getting there isn’t easy. Not only do you need to collect and store the data, you also need to identify the useful pieces and act on the insights. 

At Segment, we’ve helped thousands of businesses walk the path toward becoming more data-informed. One successful technique we’ve seen time and time again is establishing a working data lake.

A data lake is a centralized repository that holds both structured and unstructured data, allowing you to store massive amounts of data in a flexible, cost-effective storage layer. Data lakes have become increasingly popular both because businesses have more data than ever before, and because it’s never been cheaper and easier to collect and store it all.

In this post, we’ll dive into the different layers to consider when working with a data lake. 

  • We’ll start with an object store, such as S3 or Google Cloud Storage, as a cheap and reliable storage layer.

  • Next is the query layer, such as Athena or BigQuery, which will allow you to explore the data in your data lake through a simple SQL interface.

  • A central piece is a metadata store, such as the AWS Glue Catalog, which connects all the metadata (its format, location, etc.) with your tools.

  • Finally, you can take advantage of a transformation layer on top, such as EMR, to run aggregations, write to new tables, or otherwise transform your data.

As heavy users of all of these tools in AWS, we’ll share some examples, tips, and recommendations for customer data in the AWS ecosystem. These same concepts also apply to other clouds and beyond.

Storage Layer: S3

If you take one idea away from this blog post, let it be this: store a raw copy of your data in S3.

It’s cheap, scalable, incredibly reliable, and plays nicely with the other tools in the AWS ecosystem. It’s very likely your entire storage bill for S3 will cost you less than a hundred dollars per month. If we look across our entire customer base, less than 1% of our customers have an S3 bill over $100/month for data collected by Segment. 

That said, the simplicity of S3 can be a double-edged sword. While S3 is a great place to keep all of your data, it often requires a lot of work to collect the data, load it, and actually get to the insights you’re looking for. 

There are three important factors to keep in mind when collecting and storing data on S3:

  • encoding – data files can be encoded any number of ways (CSV, JSON, Parquet, ORC), and each one can have big performance implications.

  • batch size – file size has important ramifications, both for your uploading strategy (and data freshness) and for your query times.

  • partition scheme – the partition scheme refers to the ‘hierarchy’ for your data, and the way your data is partitioned or structured can impact search performance.

Structuring data within your data lake

We’ll discuss each of these in a bit more depth, but first it’s worth understanding how data first enters your data lake. 

There are a number of ways to get data into S3, such as uploading via the S3 UI or CLI. But if you’re talking customer data, it’s easy to start delivering your data to S3 via the Segment platform. The Segment platform provides the infrastructure to collect, clean, and control your first party customer data and send exactly what you need to all the tools you need it in.
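
As a quick illustration (not something you need if you’re using Segment), here’s a minimal boto3 sketch for dropping a file into S3; the bucket and key names are hypothetical:

import boto3

# upload a local file of events to a hypothetical bucket and prefix
s3 = boto3.client("s3")
s3.upload_file(
    Filename="events-2019-06-18.json",
    Bucket="your-s3-bucket",
    Key="segment-logs/sourceid=source1/events-2019-06-18.json",
)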

Encoding

The encoding of your files has a significant impact on the performance of your queries and data analysis. For large workloads, you’ll want to use a binary format like Parquet or ORC (we’re beginning to support these natively. If you’d like beta access, please get in touch!).  

To understand why, consider what a machine has to do to read JSON vs Parquet. 

When looking at JSON, the data looks something like this:

{ "userId": "user-1", "name": "Lauren", "company": "Segment" }
{ "userId": "user-2", "name": "Parsa", "company": "Segment" }
{ "userId": "user-3", "company": "Microsoft", "name": "Satya" }
{ "userId": "user-4", "name": "Elon", "company": "Tesla" }

Here, we must parse not only the whole message, but each key individually, and each value. Because each JSON object might have a different schema (and is totally unordered), we have to do roughly the same work for each row.

Additionally, even if we are just picking out companies, or names, we have to parse all of our data. There’s no ‘shortcut’ where we can jump to the middle of a given row.  

Contrast that with Parquet, and we see a very different layout. In Parquet, we’ve pre-defined the schema ahead of time, and we end up storing columns of data together: all of the userId values from the previous JSON documents, for example, end up next to each other in the same column.
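
Here’s a minimal pyarrow sketch (not part of the setup described in this post) that writes those same records as Parquet and reads back a single column to show the idea:

import pyarrow as pa
import pyarrow.parquet as pq

# the same four records as above, laid out column-by-column
table = pa.Table.from_pydict({
    "userId":  ["user-1", "user-2", "user-3", "user-4"],
    "name":    ["Lauren", "Parsa", "Satya", "Elon"],
    "company": ["Segment", "Segment", "Microsoft", "Tesla"],
})

pq.write_table(table, "users.parquet")

# a reader can pull just the columns it needs, without parsing whole rows
print(pq.read_table("users.parquet", columns=["userId"]))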


A reader doesn’t have to parse out and keep a complicated in-memory representation of the object, nor does it have to read entire lines to pick out one field. Instead it can quickly jump to the section of the files it needs and parse out the relevant columns.

Instead of just taking my word for it, below are a few concrete benchmarks which query both JSON and Parquet.

In each of the four scenarios, we can see big gains from using Parquet.

As you can see, the data we need to query in each instance is limited for Parquet. For JSON, we need to query the full body of each JSON event every time.

Batch Size

Batch size, or the amount of data in each file, is tricky to tune. Having too large of a batch means that you will have to re-upload or re-process a lot of data in the case of a hiccup or machine failure. Having a bunch of files which are too small means that your query times will likely be much longer.

Batch size is also tied to encoding, which we discussed above. Certain formats like Parquet and ORC are ‘splittable’, where files can be split and re-combined at runtime. JSON and CSV can be splittable under certain conditions, but typically cannot be split for faster processing.

Generally, we try to target files with sizes ranging from 256 MB to 1 GB. We find this gives the best overall performance mix. 
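
If you’re using Spark for your processing, one way to land files in that range is to repartition before writing. Below is a hedged sketch: it assumes an existing SparkSession named spark, the bucket paths are hypothetical, and the right partition count depends on your data volume.

# read raw JSON events and rewrite them as a smaller number of larger Parquet files
df = spark.read.json("s3://your-s3-bucket/segment-logs/")

# pick the partition count so each output file lands roughly in the 256 MB to 1 GB range
df.repartition(64).write.mode("overwrite").parquet("s3://your-s3-bucket/segment-logs-parquet/")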

Partitioning

When you start to have more than 1GB of data in each batch, it’s important to think about how a data set is split, or partitioned. Each partition contains only a subset of the data. This increases performance by reducing the amount of data that must be scanned when querying with a tool like Athena or processing data with EMR. For example, a common way to partition data is by date.
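
For example, a date-partitioned layout might use one prefix per day (the paths below are hypothetical), so a query that filters on a date range only has to scan the matching prefixes:

s3://your-s3-bucket/segment-logs/day=2019-06-16/part-0000.gz
s3://your-s3-bucket/segment-logs/day=2019-06-17/part-0000.gz
s3://your-s3-bucket/segment-logs/day=2019-06-18/part-0000.gz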

Querying

Finally, it’s worth understanding that just having your data in S3 doesn’t directly help you do any of the things we talked about at the beginning of the post. It’s like having a hard drive, but no CPU. 

There are many ways to examine this data — you could download it all, write some code, or try loading it into some other database. 

But the easiest is just to write SQL. That’s where Athena comes in.

Query Layer: Athena  🔎

Once you’ve got your data into S3, the best way to start exploring what you’ve collected is through Athena. 

Athena is a query engine managed by AWS that allows you to use SQL to query any data you have in S3, and works with most of the common file formats for structured data such as Parquet, JSON, CSV, etc. 

In order to get started with Athena, you just need to provide the location of the data, its format, and the specific pieces you care about. Segment events in particular have a specific format, which we can leverage when creating tables for easier analysis. 

Setup

Below is an example to set up a table schema in Athena, which we’ll use to look at how many messages we’ve received by type:

CREATE EXTERNAL TABLE IF NOT EXISTS segment_logs.eventlogs (
  anonymousid   string,              -- pick columns you care about!
  context       map<string,string>,  -- using a map for nested JSON
  messageid     string,
  `timestamp`   Timestamp,
  type          string,
  userid        string,
  traits        map<string,string>,
  event         string
)
PARTITIONED BY (sourceid string)  -- partition by the axes you expect to query often; sourceid here is associated with each source of data
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://your-s3-bucket/segment-logs'  -- location of your data in S3

In addition to creating the table, you will need to add the specific partitions:

ALTER TABLE eventlogs ADD
  PARTITION (sourceid='source1') LOCATION 's3://your-s3-bucket/segment-logs/sourceid=source1/'  -- sourceid here is associated with each source of data
  PARTITION (sourceid='source2') LOCATION 's3://your-s3-bucket/segment-logs/sourceid=source2/'
  PARTITION (sourceid='source3') LOCATION 's3://your-s3-bucket/segment-logs/sourceid=source3/'
  PARTITION (sourceid='source4') LOCATION 's3://your-s3-bucket/segment-logs/sourceid=source4/'

There are many ways to partition your data. Here, we’ve partitioned by source for each customer. This works for us when we’re looking at a specific customer, but if you’re looking across all customers over time, you may want to partition by date instead. 

Query time!

Let’s answer a simple question from the table above. Let’s say we want to know how many messages of each type we saw for a given data source in the past day - we can simply run some SQL to find out from the table we just created in Athena:

  select  type, count(messageid)
    from  eventlogs
   where  sourceid='source1'
     and  date_trunc('day', timestamp) = current_date
group by  1
order by  2 desc

For all queries, the cost of Athena is driven by how much data is scanned ($5 per TB), which in turn depends on how your data is partitioned and what format it’s in.

When scanning JSON, you will be scanning the entire record every time due to how it’s structured (see above for an example). Alternatively, you can set up Parquet for a subset of your data containing only the columns you care about, which is great for limiting your table scans and therefore limiting cost. This is also why Parquet can be so much faster—it has direct access to specific columns without scanning the full JSON.

Metadata: AWS Glue 🗺

Staying current

One challenge with Athena is keeping your tables up to date as you add new data to S3. Athena doesn’t know where your new data is stored, so you need to either update existing tables or create new ones (similar to the statements above) in order to point Athena in the right direction. Luckily, there are tools to help manage your schema and keep the tables up to date.

The AWS Glue Catalog is a central location in which to store and populate table metadata across all your tools in AWS, including Athena. You can populate the catalog either by using its out-of-the-box crawlers to scan your data, or by writing to it directly via the Glue API or Hive. You can see how these all fit together in the diagram below.

Once this is populated with your metadata, Athena and EMR can reference the Glue Catalog for the location, type, and more when querying or otherwise accessing data in S3.

Diagram from the AWS documentation: https://docs.aws.amazon.com/athena/latest/ug/glue-athena.html
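
If you’d rather not click around the console, here’s a hedged boto3 sketch that creates and runs a Glue crawler over the Segment logs prefix; the crawler name, database, IAM role, and bucket are all hypothetical, and the role needs permission to read the bucket:

import boto3

glue = boto3.client("glue")

# create a crawler that scans the S3 prefix and writes table metadata to the Glue Catalog
glue.create_crawler(
    Name="segment-logs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical IAM role
    DatabaseName="segment_logs",
    Targets={"S3Targets": [{"Path": "s3://your-s3-bucket/segment-logs/"}]},
)

# run the crawler; Athena and EMR can then pick up the resulting tables from the catalog
glue.start_crawler(Name="segment-logs-crawler")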

Compute Layer: EMR

Moving beyond one-off queries and exploratory analysis, if you want to modify or transform your data, a tool like EMR (Elastic MapReduce) gives you the power to not only read data but also transform it and write it into new tables. You may need to write if you want to transform the format of your data from JSON to Parquet, or if you want to aggregate the percentage of users who completed the signup flow over the past month and write it to another table for future use. 

Operating EMR

EMR provides managed Hadoop on top of EC2 (AWS’s standard compute instances). Some code and config is required - internally we use Spark and Hive heavily on top of EMR. Hive provides a SQL interface over your data and Spark is a data processing framework that supports many different languages such as Python, Scala, and Java. We’ll walk through an example and more in-depth explanation of each below.

Pattern-wise, managing data with EMR is similar to how Athena operates. You need to tell it where your data is and its format. You can do this each time you need to run a job or take advantage of a central metastore like the AWS Glue Catalog mentioned earlier.

Building on our earlier example, let’s use EMR to find out how many messages of each type we received not only over the past day, but for every day over the past year. This requires going through way more data than we did with Athena, which means we should make a few optimizations to help speed this up. 

Data Pre-processing

The first optimization we should make is to transform our data from JSON to Parquet. This will allow us to significantly cut down on the amount of data we need to scan for the final query, as shown previously! 

For this JSON to Parquet file format transformation, we’ll want to use Hive, then turn to Spark for the aggregation steps.

Hive is a data warehousing system with a SQL interface for processing large amounts of data and has been around since 2010. Hive really shines when you need to do heavy reads and writes on a ton of data at once, which is exactly what we need when converting all our historical data from JSON into Parquet. 

Below is an example of how we would execute this JSON to Parquet transformation.

First, we create the destination table with the final Parquet format we want, which we can do via Hive.

CREATE EXTERNAL TABLE `test_parquet` (
  anonymousid   string,
  context       map<string,string>,
  messageid     string,
  `timestamp`   Timestamp,
  type          string,
  userid        string,
  traits        map<string,string>,
  event         string
)
PARTITIONED BY (dt string)  -- dt will be the prefix on your output files, i.e. s3://your-data-lake/parquet/dt=1432432423/object1.gz
STORED AS PARQUET           -- specify the format you want here
LOCATION 's3://your-data-lake/parquet/';

Then we simply need to read from the original JSON table and insert into the newly created Parquet table:

-- dynamic partition inserts generally require these settings
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO test_parquet PARTITION (dt)
SELECT anonymousid, context, messageid, `timestamp`, `type`, userid, traits, event, dt  -- the dynamic partition column (dt) must come last; this assumes test_json exposes a dt column
FROM test_json;

To actually run this step, we will need to create an EMR job to put some compute behind it. You can do this by submitting a job to EMR via the UI: 

Or, by submitting a job via the CLI:

# EMR CLI example job, with lots of config!
aws emr add-steps \
  --cluster-id j-xxxxx \
  --steps Type=Spark,Name=SparkWordCountApp,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/]

Aggregations

Now that we have our data in Parquet format, we can take advantage of Spark to sum how many messages of each type we received and write the results into a final table for future reference.

Spark is useful for running computations or aggregations over your data. It supports languages beyond SQL, such as Python, R, Scala, and Java, which give you access to more complex logic and a wider set of libraries. It also has in-memory caching, so intermediate data doesn’t have to be written to disk.

Below is a Python example of a Spark job to do this aggregation of messageid by type.

from datetime import datetime, timezone, timedelta
from pyspark.sql.functions import col, when, count, desc

# S3 buckets for reading Segment logs and writing aggregation output
read_bucket_prefix = 's3://bizops-data-lake/segment-logs/source1'
write_bucket_prefix = "s3://bizops-data-lake-development/tmp/segment-logs-source1"

# get the datestamp for one year ago, aligned to the start of that week
today = datetime.now()
last_year_partition = datetime.strftime(today - timedelta(days=today.weekday(), weeks=52), '%Y-%m-%d')
last_year_ds = datetime.strptime(last_year_partition, '%Y-%m-%d')

"""
obtain all logs partitions of the year
sample filenames:
[
  's3://bizops-data-lake/segment-logs/source1/1558915200000/',
  's3://bizops-data-lake/segment-logs/source1/1559001600000/',
  's3://bizops-data-lake/segment-logs/source1/1559088000000/',
  ...
]
"""
read_year_partitions = []
for day in range(365):
    next_ds = last_year_ds + timedelta(days=day)
    ts_partition = int(1000 * next_ds.replace(tzinfo=timezone.utc).timestamp())
    read_year_partitions.append("{}/{}/".format(read_bucket_prefix, ts_partition))

# bucket partition for aggregation output
# sample: 's3://bizops-data-lake-development/tmp/segment-logs-source1/week_start_ds=2019-05-27/'
write_year_ds = "{}/week_start_ds={}/".format(write_bucket_prefix, last_year_partition)

# read logs of the last year, produced by the pre-processing step. Faster with Parquet!
df = spark.read.parquet(*read_year_partitions)

# aggregate message counts by type
agg_df = df.select("type", "messageid").groupBy("type").agg(
    count("messageid").alias("message_count"),
)

# write the Spark output DataFrame to the final S3 bucket in Parquet format
agg_df.write.parquet(path=write_year_ds, compression='snappy', mode='overwrite')

It is this last step, agg_df.write.parquet, that takes the aggregations stored in an intermediate format (a Spark DataFrame) and writes them to a new bucket in Parquet format.

Conclusion

All in, there is a robust ecosystem of tools available for getting value out of the large amounts of data that can be amassed in a data lake. 

It all starts with getting your data into S3. This gives you an incredibly cheap, reliable place to store all your data.

From S3, it’s then easy to query your data with Athena. Athena is perfect for exploratory analysis, with a simple UI that allows you to write SQL queries against any of the data you have in S3. Parquet can help cut down on the amount of data you need to query and save on costs!

AWS Glue makes querying your S3 data even easier, as it serves as the central metastore for what data is where. It is already integrated with both Athena and EMR, and has convenient crawlers that can help map out your data types and locations.

Finally, EMR helps take your data lake to the next level, with the ability to transform, aggregate, and create new rollups of your data with the flexibility of Spark, Hive, and more. It can be more complex to manage, but its data manipulation capabilities are second to none.

At Segment, we help enable seamless integration with these same systems. Our S3 destination enables customers to have a fresh copy of all their customer and event data in their own AWS accounts. 

We’re working on making this even easier by expanding the file format options and integrating with the AWS Glue metastore, so you always have an up-to-date schema for your latest data. Drop us a line if you want to be part of the beta! 

Special thanks to Parsa Shabani, Calvin French-Owen, and Udit Mehta

Become a data expert.

Get the latest articles on all things data, product, and growth delivered straight to your inbox.