NOW LET US – AI RAG SaaS Studio TP.HCM

RoPoLL: Robust Panel of LLM Judges


Researchers have introduced RoPoLL, a framework that aggregates LLM judge outputs with a robust mean estimator. By replacing the panel's simple consensus score with the geometric median, a 3-judge committee at 38B parameters beats the 675B Mistral-Large-3 by 1.31x on HelpSteer-2 under 30% bimodal-random corruption.


Abstract: The LLM Jury, a Panel of LLM Evaluators (PoLL) reporting consensus scores, has become a practical alternative to single-judge LLM evaluation, yet its statistical behavior remains poorly understood. We formalize the LLM Jury under the Huber contamination model and show that PoLL incurs unbounded bias under any positive contamination, regardless of jury size, whenever a single judge fails in a biased, LLM-typical way (mode collapse, sycophancy, safety refusal). Framing jury consensus as classical robust mean estimation, we propose RoPoLL (Robust Panel of LLM-as-Judge), which preserves the PoLL panel but replaces the aggregation function with a robust mean estimator, instantiated with the geometric median (GM): tuning-free, with the optimal finite-sample breakdown point 1/2. A finite-sample error bound and a matching information-theoretic minimax lower bound agree on the parametric rate sigma*sqrt(d/N) and differ on the breakdown floor by a factor of sqrt(d), a statistical-computational gap that polynomial-time RoPoLL pays relative to the intractable Tukey halfspace median. Across 13 open-weight judges (4B-675B), three reward-model benchmarks, and four corruption regimes at rates up to 50%, RoPoLL dominates PoLL on every biased corruption type: by about 19% on cross-dimensional attacks at matched compute, and by orders of magnitude on heavy-tailed Byzantine adversaries. A 3-judge RoPoLL committee at 38B beats Mistral-Large-3 (675B) by 1.31x on HelpSteer-2 under 30% bimodal-random corruption, an 18x parameter advantage at better accuracy; a Noisy-GT control confirms the premium is paid against biased contamination, not benign imprecision.
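To make the aggregation swap described in the abstract concrete, here is a minimal sketch in Python. It is not the authors' code. It assumes the PoLL consensus is a coordinate-wise mean of per-judge score vectors, and it computes the geometric median with the standard Weiszfeld iteration, which the abstract does not name as the solver. The function names, the two-dimensional scores, and the single corrupted judge are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_aggregate(scores: Sequence[Sequence[float]]) -> Vector:
    """Coordinate-wise mean of the judges' score vectors (assumed PoLL-style baseline)."""
    n = len(scores)
    d = len(scores[0])
    return [sum(s[j] for s in scores) / n for j in range(d)]


def geometric_median(
    scores: Sequence[Sequence[float]],
    max_iter: int = 200,
    tol: float = 1e-9,
    eps: float = 1e-12,
) -> Vector:
    """Approximate argmin_z sum_i ||x_i - z||_2 with Weiszfeld's iteration."""
    z = mean_aggregate(scores)  # start from the mean
    d = len(z)
    for _ in range(max_iter):
        # Each judge is weighted by the inverse of its distance to the current
        # estimate; eps guards against division by zero when z hits a data point.
        weights = [1.0 / max(math.dist(s, z), eps) for s in scores]
        total = sum(weights)
        new_z = [
            sum(w * s[j] for w, s in zip(weights, scores)) / total
            for j in range(d)
        ]
        if math.dist(new_z, z) < tol:
            return new_z
        z = new_z
    return z


# Hypothetical panel: four well-behaved judges and one judge that fails in a
# biased way by emitting an extreme score on both dimensions.
honest = [[3.0, 3.2], [3.1, 2.9], [2.9, 3.0], [3.0, 3.1]]
panel = honest + [[40.0, 40.0]]

print("mean aggregate:  ", mean_aggregate(panel))
print("geometric median:", geometric_median(panel))
```

In this toy panel the single extreme judge drags the mean far from the honest cluster, and pushing that judge's score further out moves the mean without limit. The geometric median stays close to the four agreeing judges, which is the behavior a breakdown point of 1/2 describes.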


© 2026 Now Let Us. All rights reserved.

Source: arXiv cs.AI Recent

