NOW LET US – AI RAG SaaS Studio TP.HCM

Introducing LifeSciBench

Agentic AI systems are becoming increasingly capable of performing scientific tasks. LifeSciBench is a new benchmark designed to evaluate how well these models handle the complexity of real-world life science research.

Agentic AI systems are becoming increasingly capable of performing scientific tasks. However, their usefulness to life science researchers depends on how well they handle the complexity of real research. That work rarely looks like a single fact-recall question or a clean prediction problem. Researchers interpret incomplete evidence, reconcile conflicting results, design difficult experiments, troubleshoot assays, evaluate translational risk, and decide what to do next under uncertainty.

Current benchmarks do not fully capture these capabilities. Many life science evaluations focus on narrow domains or isolated skills, using structured question formats and clean reference answers. While valuable, they often fail to assess whether a model can contribute across the broader span of research-level work.

We designed LifeSciBench to help close this gap. Every task is grounded in the judgment of practicing life scientists with Ph.D.-level training and direct experience advancing drug discovery programs in biotech and pharmaceutical settings.

LifeSciBench includes 750 expert-authored tasks spanning seven workflows and seven biological domains.

  • 1,062 Task artifacts
  • 173 Scientist contributors
  • 19,020 Rubric criteria
  • 453 Expert reviewers

What LifeSciBench measures

LifeSciBench measures whether AI systems can support realistic life science research tasks, not just answer biology questions. To define the benchmark taxonomy, we surveyed practicing life scientists about the workflows they use most often in applied research settings. Then, we grouped their responses into seven recurring categories: evidence handling, analysis, design and optimization, scientific reasoning, validation and operations, translation, and scientific communication.

Each task is structured like a request a scientist might give to a knowledgeable collaborator: a scientific prompt, any relevant context or artifacts, and a free-response answer. Expert-written rubrics evaluate whether a model can produce the right answer for a specific problem, with the level of detail, justification, caveats, and formatting that a scientist would expect.
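To make the task structure concrete, here is a minimal sketch of how one task record could be represented. The workflow names come from the taxonomy above; the class names, field names, and the example file name are hypothetical and are not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

# The seven workflow categories named in the article.
WORKFLOWS = (
    "evidence handling",
    "analysis",
    "design and optimization",
    "scientific reasoning",
    "validation and operations",
    "translation",
    "scientific communication",
)


@dataclass
class Artifact:
    """An attached file or reference (hypothetical fields)."""
    name: str
    kind: str  # e.g. "figure", "pdf", "table", "sequence", "structure", "web"


@dataclass
class Task:
    """One benchmark task: a prompt plus optional context artifacts (assumed layout)."""
    workflow: str
    prompt: str
    artifacts: list[Artifact] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject workflows outside the seven categories listed above.
        if self.workflow not in WORKFLOWS:
            raise ValueError(f"unknown workflow: {self.workflow!r}")

    def needs_artifacts(self) -> bool:
        # True when the task ships with at least one attached artifact.
        return bool(self.artifacts)


# Illustrative record only; the file name is invented for this example.
example = Task(
    workflow="translation",
    prompt="Critique whether this package supports accelerated approval.",
    artifacts=[Artifact(name="western_blot_summary.csv", kind="table")],
)
print(example.needs_artifacts())
```

The model's free-response answer is deliberately left out of the record, since it is produced at evaluation time rather than stored with the task.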

Dataset construction

LifeSciBench evaluates scientific reasoning alongside the less well-defined, practical skills necessary for real-world scientific use. Its tasks ask models to work through realistic research problems: interpreting evidence, making domain-grounded judgments, and communicating conclusions that would be useful to expert reviewers. Many tasks also require models to handle uncertainty and reason over supporting data files rather than relying on prompt text alone.

The benchmark is designed to reflect the complexity of life science work. Overall, 79% of tasks require multiple reasoning or decision-making steps, with an average of four steps per task. LifeSciBench includes 1,062 attached artifacts spanning figures, PDFs, tables, sequence files, structure or chemical files, and web references. More than half of tasks (53%) require models to interpret or synthesize information from at least one artifact.

Tasks were created by 173 expert scientists across different life science disciplines. Each scientist had Ph.D.-level training and biotechnology or pharmaceutical industry experience. Tasks could undergo as many revision cycles as needed before acceptance, with no fixed cap on the number of rounds; accepted tasks averaged six self-directed automated review cycles and completed at least two rounds of expert review. Reviews were anchored in either a verifiable correct answer or strong expert consensus, with at least 90% agreement among reviewers in the relevant domain. This process helped ensure that accepted tasks were scientifically grounded, clear enough to grade, and representative of applied research.
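The 90% agreement threshold can be illustrated with a short sketch. The article does not say how agreement is computed, so this assumes it is the share of reviewers who give the most common verdict; the function names and verdict labels are hypothetical.

```python
from collections import Counter


def consensus_fraction(verdicts: list[str]) -> float:
    """Share of reviewers who gave the most common verdict (assumed definition)."""
    if not verdicts:
        raise ValueError("at least one reviewer verdict is required")
    top_count = Counter(verdicts).most_common(1)[0][1]
    return top_count / len(verdicts)


def meets_consensus(verdicts: list[str], threshold: float = 0.90) -> bool:
    """True when agreement reaches the threshold (90% per the article)."""
    return consensus_fraction(verdicts) >= threshold


# Nine of ten hypothetical reviewers agree, so the task clears the bar.
strong = ["accept"] * 9 + ["revise"]
# Eight of ten agree, so it does not.
weak = ["accept"] * 8 + ["revise"] * 2

print(meets_consensus(strong), meets_consensus(weak))
```

Under this assumed definition, a task with ten reviewers needs at least nine matching verdicts to pass on consensus alone.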

Grading and rubric breakdown

LifeSciBench tasks are graded with a detailed, task-specific rubric that breaks down the expected response into specific scientific claims, calculations, decisions, and justifications. Across the benchmark, expert-developed rubrics include 19,020 criteria (an average of about 25 per task across the 750 tasks) to assess both scientific correctness and usefulness for research decisions.

This design reflects how scientific work is evaluated in practice: many life science tasks cannot be graded by checking the final answer alone. A response may reach the correct high-level conclusion but still be judged incomplete if, for example, it overlooks a key assay limitation or fails to proactively bring up a highly consequential biological nuance. Conversely, a partial response may contain high-quality reasoning even if it does not fully solve the task.

The granular rubrics capture this nuance. LifeSciBench evaluates not only final-answer accuracy, but also whether a model reaches its answer in a scientifically valid and operationally useful way.
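A small sketch shows how criterion-level grading can separate a correct conclusion from a complete answer. The article does not publish a scoring formula, so the weighted fraction below is an assumption, and the criteria are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric line item (hypothetical structure)."""
    description: str
    weight: float = 1.0  # equal weighting is an assumption, not a stated rule


def rubric_score(criteria: list[Criterion], met: list[bool]) -> float:
    """Weighted fraction of rubric criteria that a response satisfies."""
    if len(criteria) != len(met):
        raise ValueError("need exactly one met/unmet judgment per criterion")
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric must carry positive total weight")
    earned = sum(c.weight for c, ok in zip(criteria, met) if ok)
    return earned / total


# Illustrative rubric: the response gets the conclusion right but
# overlooks a key assay limitation, so it is scored as incomplete.
criteria = [
    Criterion("States the correct high-level conclusion"),
    Criterion("Flags the key assay limitation"),
    Criterion("Justifies the conclusion with the supplied evidence"),
    Criterion("Proposes concrete follow-up data or analyses"),
]
score = rubric_score(criteria, [True, False, True, True])
print(f"{score:.2f}")  # prints 0.75
```

A single pass/fail check on the final answer would score this response as fully correct; the criterion-level breakdown records the missing limitation instead.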

Eval Example

We’re preparing for a Type B FDA meeting on AAV9-microDys-X, an AAV9-based micro-dystrophin gene therapy for Duchenne muscular dystrophy that expresses a 138 kDa construct from an MCK promoter, and we want a hard-nosed critique of whether our current package really supports accelerated approval on micro-dystrophin expression as a surrogate endpoint reasonably likely to predict clinical benefit.

Study context: open-label Phase 1b/2 in 12 ambulatory boys age 4–7 with confirmed DMD and out-of-frame rod-domain deletions. The package is:

  • Pre-treatment vastus lateralis biopsies: 0–3% of healthy-control dystrophin by quantitative Western blot using MANEX1A against the N-terminal actin-binding domain.
  • 12-week post-treatment contralateral vastus lateralis biopsies: mean micro-dystrophin 38% of healthy control (range 18–61%) by the same Western blot, normalized to total protein by Coomassie staining.
  • Post-treatment immunofluorescence: sarcolemmal signal in 75–95% of fibers using a polyclonal anti-dystrophin C-terminal antibody.
  • 48-week function: mean NSAA change +1.4 points from baseline versus −0.6 in an external published natural-history registry cohort (p = 0.03 by unpaired t-test).
  • Safety: transient transaminitis in 8/12 patients managed with steroid taper; one resolved myocarditis; no deaths.
  • Biodistribution/persistence: AAV9 vector genomes detectable in muscle at 12 weeks at a mean of 2.3 vector genomes per nucleus.
  • Eligibility: no baseline anti-AAV9 neutralizing antibodies (titer <1:400) and no exon-44 deletions.

Please pressure-test this package item by item: where would FDA or a skeptical reviewer say the evidence, as presented, fails to support our conclusion, and what additional data, analyses, or design changes would be needed to close those gaps if possible? For anything not specified, assume what is known from other clinical-grade 138 kDa micro-dystrophin programs and relevant clinical trials through December 31, 2025 also applies.

Candidate Response

Bottom line: As presented, this package is not strong enough to support accelerated approval on micro-dystrophin expression as a surrogate endpoint reasonably likely to predict clinical benefit. A skeptical FDA reviewer would likely view the biomarker, functional, durability, safety, and generalizability evidence as materially under-supported.

| Package Item | Key Failure Mode | What Would Be Needed |
|---|---|---|
| Western blot quantification | MANEX1A binds an N-terminal epitope shared by endogenous full-length dystrophin and the micro-dystrophin transgene, so the assay does not cleanly distinguish transgene from residual/revertant dystrophin. Quantifying a 138 kDa micro-dystrophin against a healthy full-length dystrophin standard is also invalid. | Use a recombinant micro-dystrophin standard and an orthogonal method that distinguishes transgene from endogenous dystrophin, such as targeted mass spectrometry or a transgene-specific/epitope-specific assay. |

© 2026 Now Let Us. All rights reserved.

Source: OpenAI News
