Artificial intelligence (AI) is increasingly being asked not just to generate answers but also to evaluate them. Whether it’s deciding which chatbot response is better, assessing scientific claims, or even grading essays, AI systems that use Large Language Models (LLMs) are being leveraged as “digital judges”. The approach, often referred to as LLM-as-a-Judge, promises both speed and scale while also raising thorny questions about fairness, bias, and alignment with human values.
In the Notre Dame-IBM Tech Ethics Lab, researchers from academia and industry are confronting these questions head-on. Two lab-funded collaborative projects are tackling the challenges of LLM-as-a-Judge from different angles: one by designing tools that help people define better LLM evaluation criteria, and the other by systematically probing how, when, and why bias creeps into LLM judgments. Together, their work offers both a practical and nuanced view of how we might make AI judges more transparent, accountable, and value-aligned, which has important implications for AI researchers and developers, as well as AI practitioners and industry leaders implementing such systems in day-to-day business operations.
________
Teaching AI to Make More Human-Centric Judgments
The first project, led by Toby Li and Diego Gomez-Zara (University of Notre Dame) and Zahra Ashktorab and Werner Geyer (IBM Research), focuses on improving AI evaluation through human-centered, context-sensitive frameworks and methods. While LLMs are technically capable of evaluating complex outputs, humans often struggle to articulate how they want the model to judge or exactly what is being judged. “I think people make a lot of assumptions when it comes to evaluating these outputs,” said Ashktorab. “One assumption is that evaluation is easy to define, static, and remains consistent over time.” But this is not what she and her colleagues are seeing in their user studies. The reality is that people's criteria are context-dependent, subjective, and change over time. Ashktorab continues, “Users’ criteria also evolves as they see more outputs,” and it is this contextual richness, dynamism, and exposure-driven experience that traditional evaluation templates, metrics, or benchmarks can't capture easily, if at all. Given the increasing use of LLMs to classify and generate both data and content, there is a need not only to systematically evaluate model outputs but also to understand how humans evaluate these outputs, especially given diverse expertise, familiarity, and experiences.
The researchers focused on two core questions:
- How do technologists establish criteria for assessing LLM outputs?
- How can we best support them as they iterate and refine their criteria with LLMs?
To address these, they designed, developed, and assessed a new system, called MetricMate, that enables collaborative evaluation of LLM-generated outputs between human users and an AI judge. As Gomez-Zara explains, “The goal was to create a tool to help users craft, iterate, and fine-tune evaluation criteria as they work with large language models.” As an interactive interface, MetricMate helps users define high-level criteria, break these criteria down into smaller testable assertions, and then adjust them on the fly as needed. It also supports iterative refinement through a variety of means, including visualizations, contextual recommendations, examples of success and failure, and optional tools for grouping related examples to highlight implicit preferences. “One of the best parts of the interactive tool,” Geyer explained, “is that it doesn’t require ground truth and is designed to be easy-to-use and customizable,” offering portability to a variety of use cases. Li continued, “One of our key goals is to lower the barriers to evaluating LLMs so that even users without deep technical expertise can assess how well a model’s outputs align with their values and preferences. This democratization of LLM evaluation not only empowers individuals and organizations to make more informed and responsible decisions about whether, when, and how to adopt AI in their contexts, but also creates a feedback loop—where insights from these evaluations can help guide and steer models toward better alignment with human intent.”
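The workflow described above, turning a high-level criterion into smaller, testable assertions and scoring outputs against them, can be sketched in a few lines of Python. This is an illustrative pattern only, not MetricMate's actual interface: the criterion name, assertion texts, and the keyword-matching `judge` stub are all assumptions standing in for a real LLM judge.

```python
def judge(assertion: str, output: str) -> bool:
    """Stand-in for an LLM judge answering a yes/no assertion.
    Here it is a toy keyword heuristic for demonstration only:
    it checks whether the word after 'mentions' appears in the output."""
    keyword = assertion.split("mentions ")[-1].rstrip(".")
    return keyword in output.lower()

# A high-level criterion ("the reply is actionable") broken down
# into smaller assertions a judge can check one at a time.
criterion = {
    "name": "actionable",
    "assertions": [
        "The response mentions steps.",
        "The response mentions example.",
    ],
}

def evaluate(output: str, criterion: dict) -> float:
    """Return the fraction of assertions the output satisfies."""
    results = [judge(a, output) for a in criterion["assertions"]]
    return sum(results) / len(results)

score = evaluate("Follow these steps, e.g. this example...", criterion)
print(score)  # both toy assertions pass, so 1.0
```

Because each assertion is checked independently, a user can add, remove, or reword assertions on the fly as their criteria drift, without needing ground-truth labels.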
Their recent paper, presented at the 4th Annual Symposium on Human-Computer Interaction for Work (CHIWORK 2025) and the Joint Proceedings of the ACM IUI Workshops 2025, details how this system was developed with software engineers and AI practitioners and includes lessons learned in the process, particularly around just how challenging and personal the act of judgment can be. From “criteria drift” to competing values (like empathy vs. efficiency), their findings show that evaluation is as much about human-AI improvisation and adaptation as it is about precision. As Ashktorab put it: evaluation isn’t just a moment, it’s a process. Humans struggle with model evaluation at baseline, even without the help of AI, so “designing better tools can really unlock better outcomes in AI evaluation overall,” she said. Geyer agreed, adding that these kinds of tools address a large gap in the field. “On the one hand, we have benchmarks, but they aren’t use case-specific, and on the other hand, we have human evaluation but as we all know, that doesn’t really scale,” he said. “Using LLMs-as-judges as tools, like what we’ve developed in this project, it really addresses this problem. It doesn’t mean we don’t have humans anymore, but we can make their work easier.”
For the remainder of the year, the collaborators are looking to expand their research on at least two fronts. First, they aim to make MetricMate even more collaborative so that multiple users can co-create and co-refine LLM evaluation criteria simultaneously. “We know that depending on the characteristics of a team, people will have different ways of evaluating models, so it’s important that we facilitate these processes,” said Gomez-Zara. Second, given the multi-sector, multi-industry boom of interest in agentic AI, where AI systems can be trained to autonomously perform tasks, the team hopes to develop a similar interactive tool that helps people iteratively evaluate textual inputs and outputs from AI agents.
________
Catching a Biased AI Judge
While the first research project focuses on building better judgment criteria to improve LLM outputs and applications, the second takes a more diagnostic and adversarial approach to improving LLM evaluations. What if the AI system or LLM-as-a-Judge is making flawed or biased decisions even when the evaluation criteria seem reasonable? How can these discrepancies be detected systematically? This project, led by Xiangliang Zhang (University of Notre Dame) and Pin-Yu Chen and Tian Gao (IBM Research), is focused on making LLMs safer and more trustworthy by developing better frameworks to assess their robustness, fairness, and overall transparency, while streamlining the identification and quantification of biases in these models.
Their research, presented at the International Conference on Learning Representations (ICLR 2025) and published online, introduces a framework for the Comprehensive Assessment of Language Model Judge Biases, or CALM. This approach outlines 12 distinct types of biases that can distort LLM-as-a-Judge evaluations, including authority bias (trusting responses with fake citations), sentiment bias, and even a tendency to prefer answers generated by the same model (i.e., self-enhancement bias). Using an “attack-and-detect” strategy, the team introduced subtle modifications, like adding emotional tone or fabricated sources, to test whether the AI system’s judgments were thrown off. The results were striking: even the most advanced LLMs (i.e., those with otherwise superior benchmark performance scores) exhibited surprising vulnerabilities, sometimes favoring style over substance or being swayed by irrelevant cues. Explains Zhang, “We had some interesting findings. For example, some models preferred long answers over short answers, giving short answers a lower score even if they were the same quality.” The researchers called this “verbosity bias”, describing a model’s tendency to favor longer responses irrespective of content. The researchers didn’t just catalogue these issues; they also quantified them across multiple models using various robustness and consistency metrics.
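The attack-and-detect idea can be illustrated with a minimal verbosity-bias probe: perturb one answer in a way that should not change its quality (padding it with filler) and check whether the judge's verdict flips. This sketch is not the CALM implementation; the `toy_judge` below is a deliberately length-biased stand-in for a real model, included only so the probe has a bias to detect.

```python
FILLER = " To elaborate further on this point in considerably more words,"

def toy_judge(answer_a: str, answer_b: str) -> str:
    """Toy pairwise judge that (wrongly) prefers the longer answer,
    mimicking verbosity bias in a real LLM judge."""
    return "A" if len(answer_a) >= len(answer_b) else "B"

def verbosity_probe(judge, answer_a: str, answer_b: str) -> bool:
    """Return True if padding the losing answer with meaningless filler
    flips the verdict, i.e. the judge is swayed by length, not content."""
    before = judge(answer_a, answer_b)
    loser = answer_b if before == "A" else answer_a
    padded = loser + FILLER * 5  # content unchanged, length inflated
    if before == "A":
        after = judge(answer_a, padded)
    else:
        after = judge(padded, answer_b)
    return after != before

flipped = verbosity_probe(toy_judge, "Short but correct.", "Also fine.")
print(flipped)  # True: the length-biased toy judge changes its verdict
```

Running the same probe across many answer pairs and counting the flip rate gives a simple consistency metric of the kind the paper quantifies across models.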
Importantly, though many kinds of biases have previously been reported in models, the kinds of differential performances seen by the team were less obvious, or as Chen put it, issues that “users may not be aware of.” Chen also said that it’s important to be more aware of these subtle biases, particularly as people rely more on generative AI tools. “A lot of times, humans don’t have ground truth, so it’s very hard to verify the output, and we basically defer these decisions to a model that we think can do this, i.e., LLM-as-a-judge. But in the process, we are overlooking potential biases in the knowledge function”, some of which may persist in unforeseen ways. “You should use LLM-as-a-judge to improve your judgement, not replace your judgement,” said Chen. At the end of the day, no LLM is immune to bias, but fragile or susceptible AI judgments are especially concerning when they are harder to detect and the stakes are high. The team encourages users to keep using LLMs but with caution, taking their outputs with “a grain of salt”, especially when working in domains outside of their professional expertise or personal experiences. Their work offers guidelines for developers, including tips for better structuring prompts, avoiding using the same model for generating and judging, and implementing technical and social bias safeguards if and where possible.
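One widely used safeguard in the spirit of these developer guidelines is a position-swap check: run a pairwise judge twice with the answer order reversed, and only accept a verdict when both orderings agree. The sketch below assumes `judge` is any pairwise judge callable returning "A" or "B"; it is a generic mitigation pattern for positional bias, not code from the CALM framework itself.

```python
def consistent_verdict(judge, answer_a: str, answer_b: str):
    """Run the judge in both presentation orders; return 'A', 'B', or
    None when the verdict depended on order (a positional-bias flag)."""
    first = judge(answer_a, answer_b)   # A shown first
    second = judge(answer_b, answer_a)  # order swapped
    # Map the swapped verdict back to the original labels.
    second_mapped = "A" if second == "B" else "B"
    return first if first == second_mapped else None

# A toy judge that always picks whichever answer is shown first
# is caught as inconsistent and returns None:
position_biased = lambda a, b: "A"
print(consistent_verdict(position_biased, "x", "y"))  # None
```

Inconclusive (`None`) cases can then be escalated to a human reviewer, which keeps people in the loop exactly where the model's judgment is least reliable.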
This paper is but one of many outputs from the project’s larger research scope around assessing LLM robustness, which also includes studies investigating how safe and trustworthy LLMs are when generating recommendations for lab operations and autonomous scientific experiments. “If an LLM makes a wrong prediction or wrong suggestion, it may cause severe accidents,” said Zhang, making it imperative that researchers not only understand model capabilities in specific contexts but also develop better evaluation frameworks and benchmarks to catch issues ahead of time. To this end, the team has also released the LabSafetyBench to assess these kinds of risks in LLMs, as well as created a platform to dynamically evaluate such issues.
________
The Importance of Cross-Sector Collaboration for the Future of Responsible and Ethical Tech
These collaborations represent a subset of eleven joint research projects between Notre Dame and IBM, envisioned as one mechanism through which ethical thinking and human-centered design strategies can become embedded in practice, benefiting the research community and affecting real world change. Cross-sector collaborations like these are essential for creating responsible technology systems because they bring together diverse perspectives, expertise, and priorities that no single group holds alone. By uniting academic rigor with industry practicality, such partnerships work towards technological innovation that is not only cutting-edge but also trustworthy, inclusive, and socially grounded.
This work also enriches each group’s ongoing institutional projects and priorities. For example, the insights gleaned from better understanding how technologists evaluate LLM outputs have helped Geyer’s IBM team with their industry research on evaluative metrics for model trustworthiness and explainability, generating ideas for how to better create synthetic data that can be used to refine tooling like this. “That’s what excites me about collaborating,” said Geyer, “It helps us accelerate our research, and it complements our research.” Chen had a similar experience. “We make a great team!”, he said. “At IBM, we care about inventing AI technology and we see a lot of excitement about generative AI tools. But at the same time, we not only want to use those tools but also make sure they are trustworthy, sustainable, and don’t create harms. We were excited to partner with Xiangliang’s team to come up with a comprehensive evaluation pipeline and define what trust even means in the context of genAI tools.” Chen sees their project as helping not only IBM and industry research but also the larger scientific, developer, and technologist communities, particularly when thinking more holistically about LLM outcomes. “It’s nice working with a trustworthy partner on trustworthy AI”, he said, highlighting how it’s all too common to run into the mindset that we don’t need to care about safety or that we should just move fast and break things. Chen feels fortunate to find kindred spirits in their collaboration with Zhang’s team.
From an academic perspective, Gomez-Zara commented on how their project has been extremely important for his PhD students, giving them more research experience and industry visibility, which has also led to opportunities like fellowship awards and grants. Zhang agreed, mentioning the importance of having her students play integral roles in these projects, including learning how to lead and implement experiments, receiving and incorporating feedback into papers, and being part of weekly discussions that steered the direction of the research. All these moments provide students with opportunities to grow and practice the foundational skills needed to be successful in their future careers, whether in academia, industry, or elsewhere.
________
What Can We Learn from These Projects?
Taken together, these two projects paint a complex picture of the ways in which systems that leverage LLM-as-a-Judge capabilities can both help and harm, depending on their underlying data and inputs, existing safeguards, context of use, and the level of human involvement. On the one hand, LLM-as-a-Judge can help automate, iterate, and improve upon labor-intensive or complicated evaluations, acting more like a “collaborator” or “assistant” as opposed to a final “decision-maker”. On the other hand, LLM-based judges are vulnerable to systemic biases and do not easily account for the highly contextual and ambiguous nature of human evaluations in the real world, suggesting that entrusting them to make decisions in the absence of human intervention is not always a good idea, particularly in high-risk scenarios.
- For the technical community, resources like MetricMate and CALM offer new ways to operationalize human-AI collaboration and human-driven AI value alignment, improve upon interpretability and robustness of AI outputs, and practically stress test AI systems and associated evaluations.
- For ethics and policy experts, these results further point to the need for clearer accountability metrics and better mechanisms for people to interrogate both model outcomes and automated decision systems. They also show how LLM-based evaluations are always socially constrained, underscoring why having humans in the loop remains an important, if not essential, design choice in many instances.
- And for the business community, while it may be tempting to incorporate LLM-as-a-Judge capabilities into existing enterprise or industry processes because of their ability to scale and automate, caution should be taken to avoid over-reliance and to ensure proper governance and maintenance mechanisms are in place, including but not limited to longitudinal bias audits and associated updates. While LLM-as-a-Judge offers many opportunities for beneficial use, it should be seen as but one of many tools and methods to aid people in faster and/or more responsible decision-making.
________
Conclusions – So Can We Trust AI to Judge?
In the end, we can only trust machines to judge well if we first remember what it means to judge wisely as humans. If we want AI to become a trusted part of human decision-making, it’s not enough to train models to align with our values or avoid our biases. We also must learn to (re)trust ourselves – to embrace the messiness of our shared humanity and find new ways to blend the speed and scale of LLMs with the social intuition, lived experiences, and situated wisdom that only people can bring to the table.
As AI systems take on increasingly evaluative and decision-making roles, collaborative research like this shows why thoughtful design, careful testing, and meaningful partnerships are more important than ever. Whether in academic labs or industry settings, building ethical and human-centered technologies is a shared responsibility and one best tackled together.
________
Learn More!
Curious about how these and other projects are shaping the future of responsible AI and applied tech ethics?
- Visit our Collaborative Projects page to see more examples of cross-sector research
- Look at what’s happening with our Ethical Cities Projects
- Read about the projects in our annual Call for Proposals
Related work by these researchers:
- Limitations of LLM-as-a-Judge Approach (ND)
- EvalAssist: LLM-as-a-Judge Simplified (IBM)
- On the Trustworthiness of Generative Foundation Models - Guidelines, Assessment, and Perspective (ND+IBM+others)