Which Pairs to Compare for LLM Post-Training?

A new study addresses the cost of preference labeling in LLM post-training by asking which comparison pairs are most informative to label. By formulating comparison curation as a sampling-design problem, the researchers report that their designs consistently improve the sample efficiency of DPO over common pair-selection heuristics under the same labeling budget.

Abstract: Preference-based post-training has become a central paradigm for aligning language models. A common data-collection strategy is to generate a small set of completions for each prompt and label the resulting comparison pairs. However, human preference labels are often much more expensive than generating additional completions, suggesting a different use of the same labeling budget: generate a larger pool of completions, but label only the most informative comparison pairs. This paper studies which pairs should be compared in preference-based post-training. We formulate comparison curation as a sampling-design problem and evaluate designs by the quality of the final policy under the preference-based post-training objective. We instantiate this framework for Direct Preference Optimization (DPO), analyzing how the choice of labeled pairs propagates through DPO training to downstream policy performance. Our main results provide matching upper and lower bounds on the post-training optimality gap of the DPO-trained policy. The bounds show that comparison selection affects downstream performance through a single design-dependent information matrix, which links label allocation to parameter estimation error and policy suboptimality. This yields an explicit optimization criterion for budgeted comparison curation and motivates practical sampling designs for selecting informative pairs from large generated completion pools. Experiments on synthetic settings and language-model post-training benchmarks show that the proposed designs consistently improve sample efficiency over common comparison-selection heuristics.
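The abstract does not state the optimization criterion or the exact form of the information matrix, so the following is only a rough sketch of what budgeted comparison curation could look like, not the paper's method. It assumes a linear reward model over hypothetical completion feature vectors, an unweighted information matrix built from feature differences of the labeled pairs, and a D-optimal (log-determinant) criterion chosen purely for illustration. The function name `greedy_d_optimal_pairs` and the `ridge` parameter are likewise assumptions.

```python
import math
from itertools import combinations


def greedy_d_optimal_pairs(features, budget, ridge=1e-3):
    """Greedily choose up to `budget` pairs (i, j) from one completion pool.

    Illustrative assumptions (not taken from the paper):
      * features[i] is a d-dimensional feature vector for completion i;
      * the information matrix is ridge * I + sum of x x^T over chosen pairs,
        where x = features[i] - features[j];
      * the design criterion is the log-determinant of that matrix.
    """
    d = len(features[0])
    # Inverse of the current information matrix, starting from (ridge * I)^-1.
    inv = [[(1.0 / ridge if r == c else 0.0) for c in range(d)] for r in range(d)]
    candidates = list(combinations(range(len(features)), 2))
    chosen = []

    for _ in range(min(budget, len(candidates))):
        best = None
        for i, j in candidates:
            x = [a - b for a, b in zip(features[i], features[j])]
            inv_x = [sum(inv[r][c] * x[c] for c in range(d)) for r in range(d)]
            quad = sum(x[r] * inv_x[r] for r in range(d))
            # Matrix determinant lemma: adding x x^T raises log det by log(1 + x^T A^-1 x).
            gain = math.log1p(quad)
            if best is None or gain > best[0]:
                best = (gain, (i, j), inv_x, quad)

        _, pair, inv_x, quad = best
        chosen.append(pair)
        candidates.remove(pair)  # each pair is labeled at most once in this sketch

        # Sherman-Morrison rank-one update of the inverse (inv is symmetric).
        for r in range(d):
            for c in range(d):
                inv[r][c] -= inv_x[r] * inv_x[c] / (1.0 + quad)

    return chosen


if __name__ == "__main__":
    # Four hypothetical completions for one prompt, embedded in 2-D.
    pool = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.01]]
    print(greedy_d_optimal_pairs(pool, budget=2))
```

Under these assumptions, the greedy rule favors pairs whose feature difference points in directions the already-labeled pairs cover poorly, which is the intuition behind spending a fixed labeling budget on a larger completion pool. A criterion derived from the paper's bounds may weight pairs differently, for example by the current policy's predicted preference margin.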

Source: arXiv cs.AI Recent
