NOW LET US – AI RAG SaaS Studio TP.HCM

Estimating Uncertainty in Classifier Performance with Applications to Large Language Models and Nested Data


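This paper evaluates confidence interval methods for text classification performance metrics. It finds that default methods such as the Wald interval and the basic percentile bootstrap are the least accurate, and it recommends alternatives: the Agresti-Coull, Wilson, and Clopper-Pearson intervals, a novel pseudo-count regularized bootstrap, and adjustments for texts nested within individuals.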


Abstract: Researchers increasingly use text classification--supervised models or large language models--to measure constructs from natural language, providing metrics such as recall and precision as evidence of their validity. Yet, though these metrics are point estimates subject to sampling variation, measures of uncertainty are inconsistently reported alongside them. Further, when they are reported, they are often estimated with methods that are not appropriate when relevant labelled datasets are small or performance is high. To increase and improve confidence interval reporting in the field, this paper evaluates confidence interval methods for performance metrics under conditions typical of social science text classification: small to moderate sample sizes, infrequent constructs, and texts nested within individuals. Across simulations, default methods such as the Wald interval and the basic percentile bootstrap are the least accurate, with coverage sometimes far below the nominal 95% level. Accuracy is improved with the use of Agresti-Coull, Wilson, Clopper-Pearson, and a novel pseudo-count regularized bootstrap (which is particularly relevant to the calculation of F1). When texts are nested within individuals, we demonstrate that adjustment for both effective N and the appropriate degrees of freedom is necessary for producing accurate analytic intervals. Among bootstrap intervals, the hierarchical bootstrap is more accurate than the cluster bootstrap when individuals produce a moderate number of texts but overly conservative when individuals produce only a few. By providing guidance to the field on appropriate interval estimation, we aim to improve the transparency of machine learning applications, and to encourage greater attention to the validation sample size at the design stage.
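
To make the contrast between the Wald interval and two of the recommended alternatives concrete, here is a minimal sketch using the textbook formulas for a binomial proportion such as recall or precision. It is not the paper's code: the function names (`wald_interval`, `wilson_interval`, `agresti_coull_interval`) and the example counts are hypothetical, and the sketch assumes independent texts, so it does not cover the nested-data adjustments or the pseudo-count regularized bootstrap described in the abstract.

```python
from math import sqrt
from statistics import NormalDist


def _z(level):
    # Two-sided standard normal critical value, e.g. about 1.96 for 95%.
    return NormalDist().inv_cdf(0.5 + level / 2)


def wald_interval(successes, n, level=0.95):
    # Textbook Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n).
    z = _z(level)
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def wilson_interval(successes, n, level=0.95):
    # Textbook Wilson score interval.
    z = _z(level)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def agresti_coull_interval(successes, n, level=0.95):
    # Textbook Agresti-Coull interval: add z^2 / 2 pseudo-successes and
    # z^2 / 2 pseudo-failures, then apply the Wald formula.
    z = _z(level)
    n_adj = n + z * z
    p_adj = (successes + z * z / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


if __name__ == "__main__":
    # Hypothetical validation set: the classifier recovers all 20 positives.
    tp, positives = 20, 20
    for name, fn in [("Wald", wald_interval),
                     ("Wilson", wilson_interval),
                     ("Agresti-Coull", agresti_coull_interval)]:
        lo, hi = fn(tp, positives)
        print(f"{name:>14}: [{lo:.3f}, {hi:.3f}]")
```

With a perfect observed recall on a small sample, the Wald interval collapses to zero width at 1.0, which illustrates why it can undercover when performance is high. The Wilson and Agresti-Coull intervals keep a lower bound well below 1 (roughly 0.84 and 0.81 here), reflecting the uncertainty that 20 texts cannot rule out.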


© 2026 Now Let Us. All rights reserved.

Source: arXiv cs.AI Recent

