AGENTIC-SYSTEMS | June 4, 2026 | 1 min read

Can Generalist Agents Automate Data Curation?

A new study introduces Curation-Bench, a benchmark for testing whether generalist coding agents can automate the labor-intensive data-curation loop. It finds that a scaffolded agent can autonomously compose a data-selection policy that outperforms strong published baselines at one-tenth of their data budget.

Abstract: Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce Curation-Bench, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent execution-research gap: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes, without human design input, a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
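To make the loop described in the abstract concrete, here is a minimal toy sketch of "propose a policy, submit it to a fixed pipeline, revise". It is not the Curation-Bench code or API: every name in it (Sample, make_threshold_policy, fixed_pipeline, curation_loop) is hypothetical, and the "pipeline" is a stand-in that returns the mean of a made-up per-sample quality score instead of training and evaluating a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    sample_id: int
    quality: float  # hypothetical per-sample score (assumption, not from the paper)


# A policy maps (data pool, budget) to the subset of samples to train on.
Policy = Callable[[Sequence[Sample], int], List[Sample]]


def make_threshold_policy(min_quality: float) -> Policy:
    """Keep samples at or above a quality threshold, best first, up to the budget."""
    def policy(pool: Sequence[Sample], budget: int) -> List[Sample]:
        kept = [s for s in pool if s.quality >= min_quality]
        kept.sort(key=lambda s: s.quality, reverse=True)
        return kept[:budget]
    return policy


def fixed_pipeline(selected: Sequence[Sample]) -> float:
    """Toy stand-in for a fixed train/eval pipeline: mean quality of the selection."""
    if not selected:
        return 0.0
    return sum(s.quality for s in selected) / len(selected)


def curation_loop(
    pool: Sequence[Sample], budget: int, thresholds: Sequence[float]
) -> Tuple[float, float, List[Tuple[float, float]]]:
    """Try one policy variant per iteration and keep the best-scoring one."""
    best_threshold, best_score = thresholds[0], float("-inf")
    history = []
    for threshold in thresholds:
        selected = make_threshold_policy(threshold)(pool, budget)
        score = fixed_pipeline(selected)  # model, recipe, and eval stay fixed
        history.append((threshold, score))
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score, history


pool = [Sample(i, quality=i / 10) for i in range(10)]
best_threshold, best_score, history = curation_loop(
    pool, budget=3, thresholds=[0.0, 0.5, 0.8]
)
```

Sweeping a single threshold like this is an instance of what the abstract calls tuning local policy variants. Exploring a new policy family would mean replacing make_threshold_policy with a different selection method altogether, which is the behavior the paper's scaffolds are meant to encourage.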

Source: arXiv cs.AI Recent

More in this category

agentic-systems

Align AI to Dynamic Human-AI Workflows

Current AI alignment methods fall short by relying on static representations of human preferences. A new paper proposes a paradigm shift toward "interactive and complementary alignment," allowing AI and humans to co-evolve within dynamic workflows.

agentic-systems

MemoHarness: Agent Harnesses That Learn from Experience

Researchers introduce MemoHarness, an adaptive harness optimization framework that enables AI agents to learn from their own executions. By analyzing performance and storing insights in a dual-layer experience bank, it dynamically adapts to new tasks, outperforming static configurations.

agentic-systems

How Artificial Intelligence LLM Engines Shape the Global Conflict Information Environment

A new study reveals how AI-powered answer engines are highly vulnerable to hallucination and manipulation when reporting on global conflicts. The lack of robust digital records allows malicious actors to exploit these systems through Generative Engine Optimization (GEO), posing a new threat of AI-driven information warfare.

agentic-systems

AI Agents Do Not Fail Alone: The Context Fails First

A new study validates context-engineering quality as an independent leading indicator of AI agent reliability, introducing an open-source evaluation infrastructure called ProofAgent-Harness to predict behavioral outcomes before deployment.

agentic-systems

Interpretable Language Model for Closed-Loop Type 1 Diabetes Control

Researchers have developed LLM-T1D, a novel approach combining reinforcement learning and large language models to automate insulin delivery for Type 1 Diabetes. The system not only outperforms traditional methods but also explains its decisions in plain language, addressing the 'black-box' challenge in medical AI.

agentic-systems

RegNetAgents: A Multi-Agent Framework for Cross-Network Regulatory Driver Identification in Cancer Genomics

Researchers have introduced RegNetAgents, an AI-oriented multi-agent framework designed to identify regulatory driver candidates across heterogeneous gene networks in cancer genomics. By integrating bulk tumor and single-cell data, the system offers a powerful tool for discovering therapeutic targets in breast and colorectal cancers.