SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

Researchers have introduced SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks. The benchmark aims to address fragmentation in clinical speech AI research by standardizing how models for speech-based health assessment are evaluated.

Abstract: Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.
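The cross-dataset idea in the abstract (evaluating the same health condition on more than one dataset) can be illustrated with a small sketch. This is not the SpeechDx protocol or code: the abstract does not specify the probe, the metric, or the data format. Everything below is an assumption made for illustration, including the synthetic 2-D "embeddings", the nearest-centroid probe, the `shift` parameter that imitates a dataset-specific offset, and all function names.

```python
import random
from statistics import mean


def make_dataset(n, shift, seed):
    """Synthetic stand-in for frozen-encoder embeddings (hypothetical data).

    Each item is (embedding, label); label 1 means the condition is present.
    `shift` moves the first dimension to imitate a dataset-specific offset.
    """
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        center = 2.0 if label else -2.0
        embedding = [rng.gauss(center + shift, 1.0), rng.gauss(center, 1.0)]
        data.append((embedding, label))
    return data


def fit_centroids(data):
    """Fit a nearest-centroid probe: one mean vector per class."""
    centroids = {}
    for label in {y for _, y in data}:
        points = [x for x, y in data if y == label]
        centroids[label] = [mean(dim) for dim in zip(*points)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sqdist(centroids[label]))


def accuracy(centroids, data):
    """Fraction of items whose predicted label matches the true label."""
    return mean(1.0 if predict(centroids, x) == y else 0.0 for x, y in data)


# Dataset A is used for fitting and for an in-domain test split.
train_a = make_dataset(200, shift=0.0, seed=0)
test_a = make_dataset(200, shift=0.0, seed=1)
# Dataset B has the same condition labels but a shifted embedding distribution.
test_b = make_dataset(200, shift=1.5, seed=2)

probe = fit_centroids(train_a)
in_domain = accuracy(probe, test_a)
cross_dataset = accuracy(probe, test_b)
print(f"in-domain accuracy:     {in_domain:.3f}")
print(f"cross-dataset accuracy: {cross_dataset:.3f}")
```

Comparing the two numbers is the point of the exercise: a probe that scores well in-domain but degrades on the second dataset may be relying on dataset-specific cues rather than on the condition itself. In a real setting the embeddings would come from an audio encoder and the datasets from separate recording campaigns.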

Source: arXiv cs.AI Recent
