LLM-as-Judge: Evaluating and Improving Language Model Performance in Production

Harness the power of Large Language Models (LLMs) as judges with Twilio Segment to improve audience building, achieving over 90% alignment with human evaluation for ASTs and paving the way for future advancements in AI-driven solutions.

By James Zhu, Alfredo Lainez Rodrigo, Ankit Awasthi, Salman Ahmed, Kevin Niparko

Late last year, we had an LLM a-ha moment.

Every day, thousands of marketers flock to Segment to build new, advanced audiences & customer journeys to power their campaigns –  but doing so requires marketers to navigate an advanced UI and understand their data assets. 

We asked ourselves: How can we harness LLM advancements to create sophisticated audiences and customer journeys for campaign generation, all while eliminating the need for marketers to grapple with complex UIs and data subtleties? And what if we could streamline all these audience generation tasks into a simple prompt? Might such an advanced way of building audiences be indistinguishable from Magic?

Fast-forward to today: CustomerAI audiences are in the wild. We’ve been blown away by the results – customers are experiencing a 3x improvement in median time-to-audience creation, and a whopping 95% feature retention rate when audience generation works on the first attempt.

In this blog post, we want to share the behind-the-scenes story of how we went from proof-of-concept to widely adopted generative AI capabilities by using LLMs to generate, evaluate, and decide. Specifically, we’ll walk through the machinery that makes our evals work – an emerging development approach referred to as LLM-as-Judge.

The Challenge

Let’s take a look at the problem we were trying to tackle. 

To build an audience in Segment, end users can use the Audience Builder – a sophisticated UI that enables the expression of complex query logic without code. For example, marketers can build an audience of “all users who have added a product to cart but not checked out in the last 7 days”. Once the audience is saved, the output (e.g. a list of users that meet the criteria) gets federated into downstream tools like advertising and email marketing platforms, so every marketing & sales system operates off the same audience definition.

Behind the scenes, the inputs to the audience UI get compiled into an AST (abstract syntax tree). For example, here are the potential ASTs for the audience “Customers who have purchased at least 1 time”:
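The sketch below is a simplified, hypothetical rendering in Python-dict form; the field names are illustrative only, and the production AST schema is richer.

```python
# Two simplified, equivalent ASTs for the audience
# "Customers who have purchased at least 1 time".
# Field names are illustrative, not Segment's production schema.
ast_at_least_once = {
    "type": "event-condition",
    "event": "Order Completed",
    "operator": "at_least",   # purchased at least ...
    "value": 1,               # ... 1 time
}

ast_more_than_zero = {
    "type": "event-condition",
    "event": "Order Completed",
    "operator": "greater_than",  # purchased more than ...
    "value": 0,                  # ... 0 times
}
```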

For complex audiences, the AST can get intricate, and there are many ways to express an audience that have the same meaning and produce an identical result. For example, the same audience above could be defined as “Customers who have purchased more than 0 times”.

So as we set out to simplify audience-building via LLMs & natural language, we faced the challenge: how do we evaluate a generative AI system, when there can be an unbounded set of “right answers”?

Solving this evaluation problem would provide the “scoreboard” we would use to iterate on our architecture:

  • Model Selection: Which model should we use? Anthropic’s Claude or OpenAI’s GPT models?

  • Prompt Optimization: What prompt should we use in the system? And when we improve the prompt, how do we make sure the new one is better than the old one?

  • RAG & Persistent Memory: If there are other components in the system, such as a vector database, how do we ensure end-user quality?

  • Single v. Multi-Stage: Should we use a multi-stage LLM approach or a single-stage LLM approach?

All of these practical questions depended on first figuring out our evals.

Give the LLM a Gavel 🧑‍⚖️

The general approach we landed on is LLM-as-Judge. Rather than defining a strict set of heuristics, LLM-as-Judge asks a model (the “judge”) to evaluate, compare, and score {prompt:output} pairs against a “ground truth” to assess the quality and accuracy of model inferences.

Several recent papers have explored the use of LLMs as judges for various tasks, including JudgeLM, Prometheus, Generative Judge for Evaluating Alignment, and Calibrating LLM-Based Evaluator. The LLM-SQL-Solver paper was particularly relevant for our use case, as it focuses on determining SQL equivalence, which is similar to our goal of evaluating ASTs.

Ground Truth & Synthetic Eval Generation

There was one sub-problem we had to solve before our judge could start judging. Our judge needs to compare the “ground truth” (i.e. the correct AST input by a customer via the UI) to {prompt:output} eval pairs. And while we have a large dataset of ASTs, those ASTs were generated by our UI, not via a prompt. 

To solve for the missing prompts, we built an LLM Question Generator Agent that takes a “ground truth” AST and generates a prompt. This may seem counterintuitive, as we normally think of prompts as the input into models. But to construct the necessary eval set, we needed to extract prompts from the AST. We then take the synthetic prompts as the input into the AST Generator.
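As a rough illustration, a Question Generator prompt might look something like the sketch below; the wording and function name are hypothetical, not our production prompt.

```python
import json

# Hypothetical sketch of the Question Generator step: given a ground-truth AST,
# ask an LLM to write the natural-language request a marketer might have typed.
def build_question_generator_prompt(ground_truth_ast: dict) -> str:
    return (
        "You are helping build an evaluation set for an audience builder.\n"
        "Given the audience definition below, expressed as a JSON AST, write the\n"
        "short, natural-language request a marketer might have typed to produce it.\n\n"
        f"AST:\n{json.dumps(ground_truth_ast, indent=2)}\n\n"
        "Respond with the marketer's request only."
    )
```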

The LLM Judge then compares the generated AST against the ground truth AST.

Our current architecture consists of several components:

1. Real World AST Input from End users: These are the ASTs provided by customers or labellers, which serve as the ground truth for evaluation.

2. LLM Question Generator Agent: This agent generates potential user input prompts based on the Real World AST Input.

3. LLM AST Generator Agent: This agent takes the generated prompts and produces ASTs using LLMs. In production, this is the same agent that generates an AST from a customer’s input.

4. Generated AST: The AST produced by the LLM AST Generator Agent.

5. LLM Judge Agent: This agent evaluates the Generated AST and provides a score based on its alignment with the Real World AST Input.

6. LLM Agent: The underlying LLM used for generating and evaluating ASTs.

7. AST: The AST structure being evaluated.

8. User's Prompt: The original prompt provided by the user.

The resulting score helps us determine the quality of the generated ASTs and make improvements as needed.
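Put together, the evaluation loop looks roughly like the sketch below. The three agent functions are passed in as callables because they stand in for LLM calls – this is an illustrative outline, not our production implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    synthetic_prompt: str   # produced by the LLM Question Generator Agent
    generated_ast: dict     # produced by the LLM AST Generator Agent
    score: int              # 1-5, assigned by the LLM Judge Agent
    reasoning: str          # the judge's explanation for the score

def run_eval(
    ground_truth_asts: list[dict],
    generate_question: Callable[[dict], str],
    generate_ast: Callable[[str], dict],
    judge: Callable[[dict, dict], tuple[int, str]],
) -> list[EvalResult]:
    results = []
    for truth in ground_truth_asts:
        prompt = generate_question(truth)           # ground-truth AST -> synthetic prompt
        candidate = generate_ast(prompt)            # synthetic prompt -> generated AST
        score, reasoning = judge(truth, candidate)  # compare generated AST to ground truth
        results.append(EvalResult(prompt, candidate, score, reasoning))
    return results
```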

Results

Overall, we’ve been impressed with the results. Our LLM Judge evaluation system has achieved over 90% alignment with human evaluation for ASTs.

Our experimentation with various models for the LLM AST Generator Agent has yielded promising results: the most capable Claude model scored 4.02, while the gpt-4-32k-0613 model achieved the highest score of 4.55 out of a maximum of 5.0.

These outcomes highlight the efficacy of our LLM Judge system in assessing and enhancing generated code. 

  • The final scores for the 8K and 32K context-window versions are remarkably similar, which suggests the method is stable.

  • The GPT-4 series performs slightly better than Claude 3.

Further, these baseline scores give us a benchmark against which to compare future iterations and optimizations. For example, as we explore adding persistent memory via RAG, adopting a new model, or changing our prompting, we can compare scores to determine the impact of each change.

Privacy

At Twilio, we believe products built with AI must rest on three core principles: Transparent, Responsible, Accountable. You can refer to the Generative Audiences Nutrition Facts Label for more details on how data is used for this feature.

Next Steps

While we’re excited about the impact so far, there are additional optimizations we plan to introduce for our evals:

  1. Improving the correlation between LLM Judge and human scores to ensure that our evaluation process aligns well with human judgment.

  2. Orchestrating different agents using frameworks such as AutoGen, which will allow for better coordination and efficiency in the evaluation process.

  3. Applying LLM Judge to different use cases for CustomerAI to explore its potential in various domains and applications.

Conclusion

Our mission at Twilio and Segment is to make magic happen by harnessing the power of LLMs to revolutionize the way marketers build audiences and customer journeys. As we look to the future, we are excited to delve deeper into the potential applications and optimizations of our LLM Judge system, with a focus on concrete advancements and collaborations across the industry.

We encourage those interested in learning more about our work or exploring potential partnerships to reach out to us. In this era of LLMs, we believe that sharing knowledge and learning from one another is crucial to driving innovation and unlocking the full potential of AI-driven solutions.

As we continue to refine our approach and explore new use cases, we remain committed to fostering transparent, responsible, and accountable AI-driven solutions that empower businesses and individuals alike. We hope that by sharing our experiences and learnings, we can inspire others to embark on their own LLM journeys and contribute to the collective growth of the AI community. Together, we can shape the future of audience building and create a world where the magic of LLMs is accessible to all.

Appendix

To provide a more in-depth understanding of the generative audience evaluation process, let's look at the steps involved.

1. A user provides a prompt, such as "Customers who have purchased at least 1 time."

2. The prompt is converted into an AST, which is a tree-like data structure representing the code structure. In our case, the AST is similar to a JSON object.

3. We need to evaluate if the generated AST is robust enough to be shipped to production.

To accomplish this, we use the "ground truth" ASTs provided by customers as a reference for evaluation. However, there can be multiple valid ways to represent the same code structure, making it difficult to determine the best AST representation. This is where LLMs can play the role of a judge, helping us evaluate and improve the quality of generated code.

However, there are some challenges.

1. LLMs struggle with continuous scores. For example, when asked for a score from 0 to 100, an LLM tends to output only a handful of values, such as 0 and 100. To address this, we ask the model to provide scores on a discrete 1-5 scale, with 1 being "very bad" and 5 being "perfect." This helps us obtain more interpretable results.

2. The LLM Judge also needs a Chain of Thought (CoT) to provide reasoning for its scores, which improves the model's capability and facilitates human review. Implementing CoT allows the model to explain its decisions, making it easier for engineers to understand and trust the evaluation process. By using CoT, alignment with human evaluation increased from roughly 89% to 92%. (A simplified sketch of such a judge prompt follows this list.)

3. We adopted OpenAI’s GPT-4 model as the LLM Judge. Surprisingly, we found that other strong models such as Claude 3 Opus achieve similar scores to GPT-4, which suggests that different judge models align both with each other and with human evaluation.
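For illustration, a judge prompt combining the discrete 1-5 scale with chain-of-thought reasoning might look like the sketch below; the wording is hypothetical, not our production prompt.

```python
import json

# Hypothetical sketch of an LLM Judge prompt: ask for step-by-step reasoning
# first, then a discrete score on the 1-5 scale described above.
def build_judge_prompt(ground_truth_ast: dict, generated_ast: dict) -> str:
    return (
        "You are judging whether two audience ASTs select the same set of users.\n"
        "First, reason step by step about any differences between them.\n"
        "Then give a single score from 1 to 5, where 1 means 'very bad'\n"
        "(clearly different audiences) and 5 means 'perfect'\n"
        "(semantically equivalent, even if structured differently).\n\n"
        f"Ground-truth AST:\n{json.dumps(ground_truth_ast, indent=2)}\n\n"
        f"Generated AST:\n{json.dumps(generated_ast, indent=2)}\n\n"
        'Respond as JSON: {"reasoning": "...", "score": <1-5>}'
    )
```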

LLMs have proven to be a powerful tool for evaluating and improving code generation in the context of Segment and Twilio engineering. By using LLMs as judges, we can enhance the quality of our generated code and provide better recommendations to our users. As we continue to refine our approach and explore new use cases, we expect LLMs to play an increasingly important role in our engineering efforts, leading to more efficient and effective solutions.

