AGENTIC-SYSTEMS | March 20, 2026 | 1 min read | 26 views

BenchBrowser -- Collecting Evidence for Evaluating Benchmark Validity

BenchBrowser is a retriever that surfaces evaluation items across 20+ benchmark suites to help practitioners diagnose gaps between benchmark intent and actual test content.

Abstract: Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a "poetry" benchmark may never test for haikus, while "instruction-following" benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases over 20 benchmark suites. Validated by a human study confirming high retrieval precision, BenchBrowser generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability's facets) and low convergent validity (lack of stable rankings when measuring the same capability). BenchBrowser, thus, helps quantify a critical gap between practitioner intent and what benchmarks actually test.
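To make the convergent-validity idea concrete, here is a minimal sketch of how ranking stability between two benchmarks could be checked. It is not BenchBrowser's method: the abstract does not say how ranking stability is measured, so the use of Kendall's tau is an assumption, and the benchmark names, model names, and scores below are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a over the models present in both score dicts.

    Returns a value in [-1, 1]: 1 means both benchmarks rank the models
    identically, -1 means the rankings are fully reversed. Tied pairs
    count as neither concordant nor discordant.
    """
    models = sorted(set(scores_a) & set(scores_b))
    n_pairs = len(models) * (len(models) - 1) // 2
    if n_pairs == 0:
        raise ValueError("need at least two models scored on both benchmarks")

    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # Sign of the product tells us whether both benchmarks agree
        # on which of the two models is better.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / n_pairs


# Hypothetical accuracy scores on two benchmarks that both claim to
# measure the same capability (values are made up for illustration).
bench_a = {"model_x": 0.81, "model_y": 0.74, "model_z": 0.62}
bench_b = {"model_x": 0.55, "model_y": 0.70, "model_z": 0.48}

tau = kendall_tau(bench_a, bench_b)
print(f"Kendall tau between the two rankings: {tau:.3f}")
```

Under this assumed measure, a tau near 1 would suggest the two benchmarks agree on model ordering, while a low or negative tau is the kind of evidence of low convergent validity the abstract describes.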

Source: arXiv cs.AI Recent
