Agreement Between Large Language Models, Human Reviewers, and Authors in Evaluating STROBE Checklists for Observational Studies in Rheumatology

A new study compares STROBE checklist assessments made by two LLMs (ChatGPT and Gemini), a panel of human reviewers, and the original authors for observational rheumatology research. The models agreed fully with human reviewers on standard formatting items, but their agreement dropped on complex methodological items, where expert human judgment remains essential.
Computer Science > Digital Libraries
Abstract:
Introduction: Evaluating compliance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement can be time-consuming and subjective. This study compares STROBE assessments from large language models (LLMs), a human reviewer panel, and the original manuscript authors in observational rheumatology research.
Methods: Guided by the GRRAS and DEAL Pathway B frameworks, 17 rheumatology articles were independently assessed. Evaluations used the 22-item STROBE checklist, completed by the authors, a five-person human panel (ranging from junior to senior professionals), and two LLMs (ChatGPT-5.2, Gemini 3 Pro). Items were grouped into Methodological Rigor and Presentation and Context domains. Inter-rater reliability was calculated using Gwet's Agreement Coefficient (AC1).
Results: Overall agreement across all reviewers was 85.0% (AC1=0.826). Domain stratification showed almost perfect agreement for Presentation and Context (AC1=0.841) and substantial agreement for Methodological Rigor (AC1=0.803). Although LLMs achieved complete agreement (AC1=1.000) with all human reviewers on standard formatting elements, their agreement with human reviewers and authors declined on complex items. For example, regarding the item on loss to follow-up, the agreement between Gemini 3 Pro and the senior reviewer was AC1=-0.252, while the agreement with the authors was only fair. Additionally, ChatGPT-5.2 generally demonstrated higher agreement with human reviewers than Gemini 3 Pro on specific methodological items.
Conclusion: While LLMs show potential for basic STROBE screening, their lower agreement with human experts on complex methodological items likely reflects a reliance on surface-level information. Currently, these models appear more reliable for standardizing straightforward checks than for replacing expert human judgment in evaluating observational research.
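The abstract reports Gwet's AC1 values without stating the formula. The sketch below is illustrative only: it assumes the standard two-rater form of AC1, in which observed agreement is corrected by a chance term built from the average of the two raters' marginal proportions. The function name `gwet_ac1`, the `"yes"`/`"no"` category labels, and the example ratings are hypothetical; the study's own software, rating scale, and multi-rater computation are not described here.

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b, categories=("yes", "no")):
    """Two-rater Gwet's AC1 for categorical ratings (illustrative sketch).

    AC1 = (pa - pe) / (1 - pe), where
      pa = observed proportion of items on which the raters agree
      pe = (1 / (Q - 1)) * sum_q pi_q * (1 - pi_q)
      pi_q = average of the two raters' marginal proportions for category q
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    if len(categories) < 2:
        raise ValueError("at least two categories are required")

    n = len(ratings_a)
    q = len(categories)

    # Observed agreement: share of items with identical ratings.
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Pooled category counts across both raters (2 * n ratings in total).
    pooled = Counter(ratings_a) + Counter(ratings_b)

    # Chance agreement under Gwet's model.
    pe = sum(
        (pooled[c] / (2 * n)) * (1 - pooled[c] / (2 * n)) for c in categories
    ) / (q - 1)

    return (pa - pe) / (1 - pe)


# Hypothetical checklist ratings for four articles on one STROBE item.
llm = ["yes", "yes", "yes", "yes"]
reviewer = ["yes", "yes", "yes", "no"]
print(round(gwet_ac1(llm, reviewer), 3))
```

Under this definition, identical rating lists give AC1 = 1, and values below zero (such as the AC1=-0.252 reported for the loss to follow-up item) indicate observed agreement lower than the chance term.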
Source: arXiv cs.AI Recent










