For decades, enterprise software has been built on Java. The language remains a staple of global industry, but the frameworks and runtimes around it (Jakarta EE, Spring, and a range of legacy application servers) change quickly. Keeping these systems current is a massive, labor-intensive task, and the technical debt that accumulates when upgrades are deferred can cost organizations millions. Enter ScarfBench, a new benchmarking suite developed by IBM Research to determine whether AI agents are finally ready to handle the heavy lifting of enterprise-scale code migration.
Although Large Language Models (LLMs) continue to demonstrate proficiency in writing boilerplate code, their performance in complex, multi-file refactoring scenarios remains inconsistent. ScarfBench aims to quantify this performance by focusing specifically on the nuanced, context-heavy requirements of migrating Java applications between frameworks. By shifting the focus from simple code completion to architectural transformation, IBM is pushing the industry toward more reliable, autonomous software engineering.
ScarfBench is not merely a collection of snippets. It is a comprehensive evaluation framework designed to mirror the challenges faced by software architects. The benchmark evaluates AI agents across several critical dimensions:
- Multi-file Context Awareness: Enterprise migrations rarely involve changing a single class. ScarfBench tests whether an agent can maintain consistency across dozens of files, including XML configuration files, dependency declarations, and business logic classes.
- Framework-Specific Proficiency: The benchmark measures the model’s ability to map legacy patterns (e.g., older EJB configurations) to their equivalents in modern targets such as Quarkus or Jakarta EE 10.
- Syntactic and Semantic Accuracy: Beyond just producing code that compiles, ScarfBench verifies that the transformed code maintains the intended behavior of the original application, minimizing the risk of regression.
- Dependency Management: The suite evaluates how agents handle the complex web of Maven or Gradle dependencies that characterize enterprise Java environments.
By focusing on these specific technical requirements, ScarfBench provides a more representative measure of migration ability than general-purpose coding benchmarks like HumanEval or MBPP, whose short, self-contained problems do not exercise the long-range, cross-file dependencies inherent in enterprise systems.
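The distinction between code that merely compiles and code that preserves behavior can be made concrete with a small grading sketch. The example below is illustrative only and is not ScarfBench's actual harness or scoring scheme: the class `MigrationGrader`, the `Outcome` levels, the `MigrationResult` record, and the application names are all hypothetical, and the rule that an application with no tests cannot count as behavior-preserving is an assumption made for the example.

```java
import java.util.List;

public class MigrationGrader {

    /** Outcome levels, ordered from worst to best. Names are hypothetical. */
    enum Outcome { BUILD_FAILED, BEHAVIOR_REGRESSED, BEHAVIOR_PRESERVED }

    /** One migrated application: did it build, and which behavioral tests passed? */
    record MigrationResult(String app, boolean compiles, List<Boolean> testResults) {}

    /** Compiling is necessary but not sufficient: every behavioral test must also pass. */
    static Outcome grade(MigrationResult r) {
        if (!r.compiles()) {
            return Outcome.BUILD_FAILED;
        }
        // Assumption: with no tests there is no evidence that behavior was preserved.
        boolean allPass = !r.testResults().isEmpty()
                && r.testResults().stream().allMatch(Boolean::booleanValue);
        return allPass ? Outcome.BEHAVIOR_PRESERVED : Outcome.BEHAVIOR_REGRESSED;
    }

    public static void main(String[] args) {
        // Made-up results for three hypothetical applications.
        List<MigrationResult> results = List.of(
            new MigrationResult("orders-service", false, List.of()),
            new MigrationResult("billing-service", true, List.of(true, false, true)),
            new MigrationResult("catalog-service", true, List.of(true, true)));
        for (MigrationResult r : results) {
            System.out.println(r.app() + " -> " + grade(r));
        }
    }
}
```

Keeping the build check and the behavioral check as separate outcome levels makes it visible when an agent produces plausible-looking code that compiles but silently changes what the application does.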
Modernizing legacy code is often described as “changing the engines on a plane while it is flying.” Organizations hesitate to migrate critical applications because of the high risk of downtime and the scarcity of developers who understand both the legacy framework and the modern replacement.
AI agents offer a potential solution: they can read through millions of lines of code, identify deprecated patterns, and propose automated refactoring paths. Until now, however, the lack of standardized testing has left enterprises with no way of knowing whether a specific AI agent can actually perform these tasks without introducing critical bugs. ScarfBench addresses this transparency gap, giving CTOs and lead engineers a rigorous, shared yardstick against which to evaluate AI tools.
ScarfBench represents a significant step forward in the quest for autonomous software engineering. By standardizing the way we measure AI performance in legacy migration, IBM Research is enabling a more scientific approach to AI adoption in the enterprise.
As models improve, the goal is to move from "AI-assisted" migration to "AI-driven" migration, where an agent can propose a complete migration plan, execute the code changes, and verify the results by running automated test suites. ScarfBench provides the necessary feedback loop to make this possible: researchers can identify where models struggle, whether in understanding complex dependency hierarchies or in generating correct configuration annotations, and refine their architectures accordingly.
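The propose, execute, and verify cycle described here can be sketched as a simple retry loop. This is a minimal illustration under stated assumptions, not how ScarfBench or any particular agent is implemented: `MigrationLoop`, `Verdict`, and `migrate` are hypothetical names, the agent and verifier are stand-in lambdas rather than a real model or build system, and the `@EJB` to `@Inject` change is used only as a toy example of a fix made in response to feedback.

```java
import java.util.Optional;
import java.util.function.Function;

public class MigrationLoop {

    /** Result of verifying one candidate: pass/fail plus feedback for the next attempt. */
    record Verdict(boolean passed, String feedback) {}

    /**
     * Ask the agent for a candidate, verify it, and feed any failure back,
     * up to maxAttempts times. Returns the first candidate that passes, if any.
     */
    static Optional<String> migrate(Function<String, String> agent,
                                    Function<String, Verdict> verifier,
                                    int maxAttempts) {
        String feedback = ""; // the first attempt has no prior feedback
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            String candidate = agent.apply(feedback);
            Verdict verdict = verifier.apply(candidate);
            if (verdict.passed()) {
                return Optional.of(candidate);
            }
            feedback = verdict.feedback();
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stub agent: replaces the legacy annotation only after being told about it.
        Function<String, String> agent = fb ->
            fb.contains("@EJB") ? "@Inject OrderService orders;" : "@EJB OrderService orders;";
        // Stub verifier: stands in for a real build plus automated test run.
        Function<String, Verdict> verifier = code ->
            code.contains("@EJB")
                ? new Verdict(false, "unresolved legacy annotation @EJB")
                : new Verdict(true, "");
        System.out.println(migrate(agent, verifier, 3).orElse("migration failed"));
    }
}
```

In a real setting the verifier would wrap a build and a test run, and the feedback string would carry compiler errors or failing test names; the attempt cap keeps an agent that never converges from looping indefinitely.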
Widespread adoption of benchmarks like ScarfBench could fundamentally alter the economics of software development. If they show that AI agents can reliably handle the migration of complex Java frameworks, the technical debt that currently cripples many large organizations could be systematically addressed. This would free up human talent to focus on innovation and high-level architectural design rather than the tedious process of framework updates.
IBM’s initiative serves as a reminder that the future of generative AI lies in specialization. While general-purpose models are impressive, the real value for the enterprise will be found in domain-specific benchmarks that ensure reliability, security, and precision in mission-critical environments.



