NOW LET US – AI RAG SaaS Studio TP.HCM

ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration


Introducing ScarfBench, an open benchmark designed to evaluate how reliably AI agents can migrate enterprise Java applications across Spring, Jakarta EE, and Quarkus ecosystems.

Recent advances in coding agents have sparked excitement around AI-assisted modernization. But an important question remains:

Can AI agents reliably modernize real-world enterprise applications?

Existing software engineering benchmarks have demonstrated impressive progress in bug fixing and code generation, but framework migration presents a fundamentally different challenge. Success requires not only translating code, but also preserving behavior, adapting build systems, and navigating runtime dependencies.

To address this gap, we introduce ScarfBench (Self-Contained Application Refactoring Benchmark), an open benchmark for evaluating AI agents on cross-framework migration tasks in Enterprise Java.

ScarfBench focuses on migrations across three major Java ecosystems:

  • Spring
  • Jakarta EE
  • Quarkus

Unlike traditional benchmarks that compare generated code against reference implementations, ScarfBench evaluates whether migrated applications actually build, deploy, and preserve behavior.

Framework migration is much more than replacing annotations.

A simple repository migration can require changes across dependency injection, persistence configuration, queries, and framework descriptors. Small mistakes in any of these pieces can prevent successful deployment.

Framework migration requires translating framework semantics, not just source code.

ScarfBench provides a systematic way to evaluate AI agents on enterprise Java framework migration tasks.

Applications are required to:

  • Build successfully.
  • Deploy correctly.
  • Pass behavioral validation.

Requiring all three gives a more realistic measure of modernization quality than comparing generated code against a reference implementation.
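A minimal sketch of how such a staged check could be scored, assuming each stage can be reduced to a pass/fail probe. The class, enum, and method names here are hypothetical and are not taken from ScarfBench's evaluation infrastructure; the stand-in lambdas take the place of a real build, deployment, and behavioral test run.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical names throughout: a sketch of staged (gated) migration scoring. */
final class GatedEvaluation {

    /** The first stage that failed, or PASSED if every stage succeeded. */
    enum Outcome { BUILD_FAILED, DEPLOY_FAILED, BEHAVIOR_FAILED, PASSED }

    /**
     * Runs the three checks in order and stops at the first failure,
     * so a later check never runs against an application that did not
     * get through the earlier one.
     */
    static Outcome evaluate(BooleanSupplier build,
                            BooleanSupplier deploy,
                            BooleanSupplier behavior) {
        if (!build.getAsBoolean()) return Outcome.BUILD_FAILED;
        if (!deploy.getAsBoolean()) return Outcome.DEPLOY_FAILED;
        if (!behavior.getAsBoolean()) return Outcome.BEHAVIOR_FAILED;
        return Outcome.PASSED;
    }

    public static void main(String[] args) {
        // Stand-in checks: a real harness would invoke the build tool,
        // start the application, and run behavioral tests against it.
        Outcome outcome = evaluate(() -> true, () -> true, () -> false);
        System.out.println(outcome); // prints BEHAVIOR_FAILED
    }
}
```

Because each stage gates the next, an application that builds but fails to deploy is reported as a deployment failure rather than a partial success.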

Benchmark at a Glance

ScarfBench includes both focused migration tasks and whole-application migrations.

Starting from a taxonomy of enterprise Java features based on JSRs (Java Specification Requests), expert-written migrations provide verified implementations in Spring, Jakarta EE, and Quarkus.

We evaluated several state-of-the-art coding agents on ScarfBench.

Despite strong performance on traditional software engineering benchmarks, framework migration remains difficult. Success rates vary considerably across framework pairs, and whole-application migrations remain particularly challenging.

Build success consistently exceeds deploy success, which in turn exceeds behavioral success. Build success alone therefore significantly overestimates migration quality.

Migration difficulty depends strongly on the target framework, with Jakarta EE proving particularly challenging.

Beyond measuring success rates, ScarfBench helps us understand how agents behave during modernization.

A migrated application is only useful if it actually builds and runs.

We therefore compared agent-reported outcomes against independent build verification.

Claude Code reported successful builds for 29 out of 30 whole applications.

Only 22 of those applications actually built successfully.

Meanwhile, the single application classified as failed by the agent ultimately built correctly.

This suggests that agent self-assessment should not be treated as a reliable signal of migration completion.

Independent build and test validation remains essential.
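Worked through, the figures above mean that 23 applications actually built (22 + 1), that 7 of the 29 success reports were wrong, and that the agent's report matched independent verification for 22 of 30 applications, or about 73%. The sketch below reproduces that arithmetic. The class and the labels "agreement" and "precision" are illustrative assumptions, not metrics defined by ScarfBench.

```java
import java.util.Locale;

/** Sketch: compares agent-reported build outcomes with independent verification. */
final class SelfReportCheck {
    final int reportedOkBuilt;    // agent said "built", verification agreed
    final int reportedOkFailed;   // agent said "built", verification disagreed
    final int reportedFailBuilt;  // agent said "failed", but the build succeeded
    final int reportedFailFailed; // agent said "failed", verification agreed

    SelfReportCheck(int okBuilt, int okFailed, int failBuilt, int failFailed) {
        this.reportedOkBuilt = okBuilt;
        this.reportedOkFailed = okFailed;
        this.reportedFailBuilt = failBuilt;
        this.reportedFailFailed = failFailed;
    }

    int total() {
        return reportedOkBuilt + reportedOkFailed + reportedFailBuilt + reportedFailFailed;
    }

    /** Applications that built, regardless of what the agent reported. */
    int actuallyBuilt() {
        return reportedOkBuilt + reportedFailBuilt;
    }

    /** Share of applications where report and verification agree. */
    double agreement() {
        return (double) (reportedOkBuilt + reportedFailFailed) / total();
    }

    /** Share of "built" reports that verification confirmed. */
    double reportedSuccessPrecision() {
        return (double) reportedOkBuilt / (reportedOkBuilt + reportedOkFailed);
    }

    public static void main(String[] args) {
        // Counts from the text: 29 of 30 reported as built, 22 of those
        // verified, and the single reported failure built correctly.
        SelfReportCheck check = new SelfReportCheck(22, 29 - 22, 1, 0);
        System.out.println("actually built: " + check.actuallyBuilt() + " of " + check.total());
        System.out.printf(Locale.ROOT, "agreement: %.1f%%%n", 100 * check.agreement());
        System.out.printf(Locale.ROOT, "precision of success reports: %.1f%%%n",
                100 * check.reportedSuccessPrecision());
    }
}
```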

Framework migrations rarely affect a single file or layer.

Changes in configuration, services, databases, and web components often cascade across the application.

The most frequently visited layers were:

  • Configuration
  • Web
  • Database
  • Service

Common transitions included:

  • Configuration ↔ Web
  • Service ↔ Database

This suggests that migration is an iterative dependency-resolution process rather than a simple source-to-source transformation.

We used layer revisit frequency as a proxy for migration effort. Layers that required repeated visits typically involved debugging, dependency resolution, or framework adaptation.

Rather than proceeding linearly, agents repeatedly returned to configuration-related artifacts while resolving framework differences and dependency issues.
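One way to compute such measures from an agent trace is sketched below. It assumes the trace has already been reduced to an ordered list of layer labels, that a "visit" is a run of consecutive steps in the same layer, and that transitions are counted without direction. These definitions and all names are assumptions made for illustration; the source does not specify how the analysis was implemented.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: visit and transition counts over a sequence of touched layers. */
final class LayerTrace {

    enum Layer { CONFIGURATION, WEB, DATABASE, SERVICE }

    /**
     * Counts separate visits per layer. Consecutive steps in the same
     * layer count as one visit, so revisits = visits - 1.
     */
    static Map<Layer, Integer> visits(List<Layer> trace) {
        Map<Layer, Integer> counts = new EnumMap<>(Layer.class);
        Layer previous = null;
        for (Layer current : trace) {
            if (current != previous) {
                counts.merge(current, 1, Integer::sum);
            }
            previous = current;
        }
        return counts;
    }

    /** Counts undirected transitions, keyed like "CONFIGURATION<->WEB". */
    static Map<String, Integer> transitions(List<Layer> trace) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 1; i < trace.size(); i++) {
            Layer a = trace.get(i - 1);
            Layer b = trace.get(i);
            if (a == b) continue; // staying in a layer is not a transition
            // Order the pair by enum position so A->B and B->A share a key.
            Layer first = a.compareTo(b) < 0 ? a : b;
            Layer second = a.compareTo(b) < 0 ? b : a;
            counts.merge(first + "<->" + second, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A made-up trace, not data from the benchmark.
        List<Layer> trace = List.of(
                Layer.CONFIGURATION, Layer.WEB, Layer.CONFIGURATION,
                Layer.SERVICE, Layer.DATABASE, Layer.SERVICE);
        System.out.println(visits(trace));
        System.out.println(transitions(trace));
    }
}
```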

Not every migration issue originates from source code.

Agents frequently struggled with environmental issues, including:

  • Docker cache inconsistencies
  • Port connectivity problems
  • Maven wrapper and build tooling issues

These operational concerns often delayed validation even when the source-code migration itself was largely complete.
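Port connectivity problems of this kind are often handled by waiting until the deployed application accepts connections before behavioral tests start. The following is a minimal sketch of such a probe using only the Java standard library. The class name, the one-second connect timeout, and the retry parameters are arbitrary choices for illustration, and nothing here is taken from the ScarfBench harness.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Sketch: checks whether a deployed application is accepting TCP connections. */
final class PortProbe {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }

    /** Polls until the port accepts connections or the attempts run out. */
    static boolean waitForPort(String host, int port, int attempts, long pauseMillis)
            throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            if (isReachable(host, port, 1000)) {
                return true;
            }
            Thread.sleep(pauseMillis);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical target: an application expected on localhost:8080.
        boolean ready = waitForPort("127.0.0.1", 8080, 5, 500);
        System.out.println(ready ? "port is open" : "port never opened");
    }
}
```

A probe like this separates "the application never started" from "the tests ran too early", which keeps environmental failures from being misread as migration failures.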

Modernization failures span build systems, deployment environments, dependency injection, databases, endpoints, assertions, and infrastructure.

The biggest challenge in framework modernization is not translating Java code.

It is managing the web of dependencies across configuration, infrastructure, and runtime environments.

While frontier agents can automate substantial portions of the migration process, reliable validation and architectural reasoning remain critical for achieving successful outcomes.

ScarfBench helps expose these challenges and provides a standardized way to measure progress toward truly autonomous application modernization.

ScarfBench is designed as an open resource for researchers and practitioners.

Resources include:

  • Benchmark dataset
  • Evaluation infrastructure
  • Public leaderboard
  • Documentation
  • Open-source code

Researchers can compare agent architectures and techniques. Practitioners can use ScarfBench to evaluate modernization solutions before deploying them in production environments.

Framework migration remains one of the largest unsolved problems in AI-assisted software engineering. We hope ScarfBench helps the community measure progress and accelerate the next generation of AI-assisted application modernization.

We invite researchers, practitioners, and framework communities to evaluate their agents, contribute new migration scenarios, and help advance the state of the art.

© 2026 Now Let Us. All rights reserved.

Source: Hugging Face Blog
