NOW LET US – AI RAG SaaS Studio TP.HCM

CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions


The newly introduced CrowdMath dataset tests how well AI models understand collaborative mathematical problem-solving. Frontier LLMs can follow the local flow of a research discussion, but they struggle to identify the functional role each contribution plays in it.


Abstract: Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving: a setting in which participants propose partial arguments, identify gaps or errors in prior steps, repair flawed reasoning, and gradually synthesize incremental contributions into a proof. We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES–Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Posts are labeled by their functional roles in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification. We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions, with the best model achieving only 0.42 macro-F1 on post-role classification. CrowdMath exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.
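For readers unfamiliar with the metric, macro-F1 is the unweighted mean of the per-class F1 scores, so a rarely occurring role counts as much as a common one. The sketch below is a minimal illustration of that property, not the paper's evaluation code. The role names come from the abstract (which lists them with "including", so the full label set may be larger); the toy labels, the class imbalance, and the helper names `macro_f1` and `accuracy` are assumptions made up for this example.

```python
# Role names as listed in the abstract; the dataset's full label set may differ.
ROLES = [
    "partial_progress",
    "proof_completion",
    "erroneous_reasoning",
    "error_identification",
]


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 scores (F1 = 2TP / (2TP + FP + FN))."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A label that is never predicted and never present scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(gold, pred):
    """Fraction of posts whose predicted role matches the gold role."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Invented toy data: ten posts, most of them partial progress.
gold = ["partial_progress"] * 7 + [
    "proof_completion",
    "erroneous_reasoning",
    "error_identification",
]
# A hypothetical classifier that labels every post as partial progress.
pred = ["partial_progress"] * 10

print(f"accuracy: {accuracy(gold, pred):.2f}")
print(f"macro-F1: {macro_f1(gold, pred, ROLES):.2f}")
```

On this toy data the always-"partial progress" classifier labels 70% of posts correctly, yet its macro-F1 is only about 0.21, because three of the four roles score zero. This is why macro-F1 is a demanding measure when the roles that matter, such as error identification, are rare.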


© 2026 Now Let Us. All rights reserved.

Source: arXiv cs.AI Recent

