Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge


Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, Xiangliang Zhang
ICLR, 2025


LLM-as-a-Judge has been widely adopted as an evaluation method in benchmarks and as a source of supervision rewards in model training. However, despite its strong performance in many domains, its potential issues remain under-explored, which undermines its reliability and limits the scope of its utility. This paper therefore identifies 12 key potential biases and proposes CALM, an automated bias quantification framework that systematically quantifies and analyzes each type of bias in LLM-as-a-Judge through automated, principle-guided modifications. Experiments covering several popular language models show that, while advanced models achieve commendable overall performance, significant biases persist on certain tasks. These empirical results suggest there remains room for improvement in the reliability of LLM-as-a-Judge. The paper also discusses the explicit and implicit influences of these biases and offers suggestions for the reliable application of LLM-as-a-Judge. This work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
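
To make the perturb-and-compare idea concrete, below is a minimal sketch of how one such bias, position bias, might be quantified: apply a controlled modification (swapping the order of the two candidate answers) and measure how often the judge's verdict flips. The names `judge_fn`, `position_bias_rate`, and the toy data are illustrative assumptions for this sketch, not the paper's CALM implementation.

```python
# Hypothetical sketch of quantifying position bias in a pairwise LLM judge.
# `judge_fn` stands in for any judge that sees (question, answer shown
# first, answer shown second) and returns "A" (first) or "B" (second).

from typing import Callable, List, Tuple

def position_bias_rate(
    judge_fn: Callable[[str, str, str], str],
    samples: List[Tuple[str, str, str]],
) -> float:
    """Fraction of samples whose verdict flips when the two answers are
    swapped; a position-consistent judge scores 0.0."""
    flips = 0
    for question, answer_a, answer_b in samples:
        original = judge_fn(question, answer_a, answer_b)  # A shown first
        swapped = judge_fn(question, answer_b, answer_a)   # B shown first
        # Map the swapped verdict back to the original labeling.
        swapped_unswapped = "B" if swapped == "A" else "A"
        if original != swapped_unswapped:
            flips += 1
    return flips / len(samples)

if __name__ == "__main__":
    # Toy judge that always prefers whichever answer is shown first,
    # i.e., a maximally position-biased judge, so the rate is 1.0.
    biased_judge = lambda q, a, b: "A"
    data = [
        ("Is the sky blue?", "Yes.", "Sometimes."),
        ("What is 2 + 2?", "4", "5"),
    ]
    print(position_bias_rate(biased_judge, data))  # -> 1.0
```

The same perturb-and-compare pattern presumably extends to the other biases the paper studies, with the principle-guided modification tailored to each bias type (e.g., padding an answer with filler to probe verbosity bias) rather than swapping answer order.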