Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning

Researchers introduce IRTS-ToolBench, a new benchmark featuring 1,700 questions across 10 task types and 13 domains to evaluate LLMs and AI agents on irregular time series data, addressing a critical gap in real-world data science.

Computer Science > Artificial Intelligence

Title:Towards Verifiable Agentic Data Science: Solving Irregular TSQA Via Tool-Grounded Reasoning

View PDF HTML (experimental)Abstract:Time series data in real-world deployments is overwhelmingly irregular. Observations are asynchronous, missing values are informative rather than random, and sampling frequencies vary across sensors and operational windows. However, existing Time Series Question Answering (TSQA) benchmarks mostly assume regularly sampled inputs, leaving a fundamental gap in understanding how large language models (LLMs) and AI agents perform under irregular conditions. To bridge this gap, we introduce IRTS-ToolBench, a benchmark of 1,700 questions spanning 10 task types across 13 domains. IRTS-ToolBench is designed to be used independently by any researcher working on LLM-based irregular time series analysis, providing standardized inputs and a reproducible evaluation protocol. Code can be found in this https URL.

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.

Source: arXiv cs.AI Recent

More in this category

agentic-systems

Metric Match: A Subset Selection Approach to Evaluating LLM Judge Reliability

Researchers have developed Metric Match, a novel method to estimate the reliability of LLM judges using limited human annotations. By selecting an optimal subset of samples, it reduces annotation needs by 32.5% and significantly cuts down evaluation costs.

agentic-systems

Semantics-Enhanced Retrieval-Augmented Time Series Forecasting

Researchers have introduced SERAF, a novel multimodal time series forecasting framework that addresses the limitations of traditional methods under non-stationarity by leveraging dual retrieval over both numerical data and self-generated textual descriptions.

agentic-systems

AI Engram: In Search of Memory Traces in Artificial Intelligence

Researchers introduce 'AI Engram', a geometric framework to identify and isolate individual memory traces in deep neural networks. This biologically-inspired approach enables surgical manipulation of learned knowledge through simple linear arithmetic without iterative optimization.

agentic-systems

Risk-Aware LLM Agents for Geospatial Data Retrieval: Design and Preliminary Adversarial Evaluation

Researchers have presented an LLM-driven framework that simplifies the retrieval of remote sensing data from cloud-based geospatial catalogs using natural language. The system integrates three specialized AI agents to optimize performance and mitigate adversarial API manipulation risks.

agentic-systems

Attribute Inference from Interactive Targeted Ads

A new study models how interactive targeted advertising can act as a channel for attribute inference, allowing advertisers to deduce sensitive user data. The researchers propose defense mechanisms like aggregate reporting and randomized disclosure to mitigate these privacy risks.

agentic-systems

Fusion is not one-size-fits-all: Cross-Modal Representation Alignment for Time-to-Event Modeling

Researchers introduce a foundation model-driven framework for cross-modal representation alignment between CT imaging and longitudinal EHR data to improve time-to-event prediction. The study demonstrates that task-aware multimodal alignment is essential for robust generalization and scalable clinical deployment.

EXPLORE TOPICS