AGENTIC-SYSTEMS | April 7, 2026 | 1 min read

VERT: Reliable LLM Judges for Radiology Report Evaluation

Researchers introduce VERT, an LLM-based metric for radiology report evaluation that improves correlation with radiologist judgments by up to 11.7% relative to GREEN across multiple modalities and anatomies.

Current literature on radiology report evaluation has focused primarily on designing LLM-based metrics and fine-tuning small models for chest X-rays. However, it remains unclear whether these approaches are robust when applied to reports from other modalities and anatomies. Which model and prompt configurations are best suited to serve as LLM judges for radiology evaluation? We conduct a thorough correlation analysis between expert and LLM-based ratings. We compare three existing LLM-as-a-judge metrics (RadFact, GREEN, and FineRadScore) alongside VERT, our proposed LLM-based metric, using open- and closed-source models (reasoning and non-reasoning) of different sizes across two expert-annotated datasets, RadEval and RaTE-Eval, spanning multiple modalities and anatomies. We further evaluate few-shot approaches, ensembling, and parameter-efficient fine-tuning using RaTE-Eval. To better understand metric behavior, we perform a systematic error detection and categorization study to assess the alignment of these metrics with expert judgments and to identify areas of lower and higher agreement. Our results show that VERT improves correlation with radiologist judgments by up to 11.7% relative to GREEN. Furthermore, fine-tuning Qwen3 30B yields gains of up to 25% using only 1,300 training samples. The fine-tuned model also reduces inference time by up to 37.2 times. These findings highlight the effectiveness of LLM-based judges and demonstrate that reliable evaluation can be achieved with lightweight adaptation.
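To make the correlation analysis concrete, here is a minimal Python sketch of how agreement between expert ratings and an LLM judge's scores might be measured, and how a relative gain such as "11.7% relative to GREEN" could be computed. The abstract does not say which correlation coefficient the authors use; Spearman rank correlation is an assumption here, and the function names and all scores are hypothetical illustrations, not data from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline` (0.10 means +10%)."""
    return (new - baseline) / baseline


# Hypothetical scores for six reports (assumption: higher = more errors).
expert_scores = [0, 1, 2, 3, 4, 5]
baseline_judge = [1, 0, 3, 2, 5, 4]  # made-up scores from a baseline metric
proposed_judge = [1, 0, 2, 3, 5, 4]  # made-up scores from a proposed metric

rho_baseline = spearman(expert_scores, baseline_judge)
rho_proposed = spearman(expert_scores, proposed_judge)
gain = relative_gain(rho_proposed, rho_baseline)
print(f"baseline={rho_baseline:.3f} proposed={rho_proposed:.3f} gain={gain:.1%}")
```

Rank correlation is a natural fit when expert ratings and judge scores live on different scales, since only the ordering of reports matters; a different coefficient (for example Kendall's tau) could be swapped in without changing the relative-gain calculation.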

Source: arXiv cs.AI Recent
