Nov 24 2025
Scaling Code-Repair Agents with Reinforcement Learning: Extending OpenHands for Real-World Repositories
By: Harsh Gupta, Nishit Neema, Shivashish Naithani, Srinjoy Mukherjee, Sapan Shah, David Bick, Gokul Ramakrishnan, Ganesh Venkatesh
Introduction
Training effective code-repair agents requires more than just large language models—it demands a robust infrastructure that can efficiently interact with real-world codebases at scale. After building a comprehensive dataset of real-world issue-resolution instances packaged as Docker images, we faced a critical challenge: existing agentic frameworks like OpenHands were tightly coupled to benchmark-specific assumptions that prevented their use in large-scale reinforcement learning (RL) training loops.
This post details our technical journey in transforming OpenHands from a SWE-Bench evaluation tool into a general-purpose RL training platform capable of handling thousands of diverse Python repositories.
The Challenge: Breaking Free from Benchmark Constraints
OpenHands provides a sophisticated agentic scaffold through its CodeActAgent, but its evaluation pipeline was architected specifically for SWE-Bench Verified. This tight coupling manifested in several ways:
- Hard-coded repository assumptions: Fixed directory structures and predefined metadata schemas
- Static environment expectations: Dependency on SWE-Bench harness utilities
- Inflexible evaluation flow: No support for arbitrary Docker images or custom test configurations
Our dataset, spanning thousands of Python repositories scraped from PyPI and GitHub, presented fundamentally different requirements. Each repository brought its own:
- Unique project structure and build systems
- Custom test frameworks and configurations
- Varying dependency management approaches
- Repository-specific metadata formats
Architecture Redesign: Generalized Evaluation Pipeline
Dynamic Metadata Loading
We completely refactored the evaluation workflow to eliminate dependency on SWE-Bench Verified specific assumptions. The new architecture:
1. Dynamically loads arbitrary Docker images representing problem instances
2. Parses metadata at runtime, including:
   - Base commit SHA
   - Failing test identifiers
   - Passing test suite specifications
   - Test runner configuration (pytest, etc.)
   - Repository-specific environment variables
3. Executes tests in containerized environments with full isolation
4. Reports structured pass/fail outcomes compatible with RL reward signals
This transformation effectively converts OpenHands into a general-purpose evaluation engine for code-repair tasks, capable of operating on any repository with minimal configuration.
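To make the runtime metadata parsing concrete, here is a minimal sketch of the kind of structure the pipeline loads per instance. The field names and file layout are illustrative assumptions, not our exact schema:

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstanceMetadata:
    """Per-instance metadata parsed at rollout start (schema is illustrative)."""
    instance_id: str
    base_commit: str                 # commit SHA the agent starts from
    fail_to_pass: List[str]          # failing tests the fix must turn green
    pass_to_pass: List[str]          # tests that must keep passing
    test_command: str = "pytest"     # repository-specific test runner
    env: Dict[str, str] = field(default_factory=dict)  # extra env vars


def load_metadata(path: str) -> InstanceMetadata:
    """Load metadata shipped alongside the problem-instance image (e.g. a JSON file)."""
    with open(path) as f:
        raw = json.load(f)
    return InstanceMetadata(**raw)
```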
Optimization: Unified Rollout Architecture
The Container Duplication Problem
The original OpenHands design separated patch generation and evaluation into independent workflows. Each step would:
- Load the Docker image from disk (~8GB typical size)
- Initialize a new container runtime
- Set up the repository environment
- Execute the operation
- Tear down the container
For moderately-sized Docker images, this duplication introduced approximately 3 minutes of overhead per instance—completely untenable for large-scale RL training where we need to generate thousands of rollouts per training iteration.
Single-Context Rollout System
Our solution implements a unified workflow that maintains a persistent runtime context throughout the entire agent trajectory.
Key architectural changes (a simplified sketch follows the list):
- Single image load: Docker image loaded once at rollout initialization
- Persistent container: Single container maintained across all agent actions
- Stateful environment: Repository state preserved between edits and evaluations
- Multi-turn interaction support: the agent can iteratively
  - Generate code edits
  - Apply patches incrementally
  - Run tests to validate changes
  - Receive feedback without environment reinitialization
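The sketch below illustrates the single-load, single-container pattern using the Docker CLI directly. The class name, paths, and commands are illustrative assumptions and not the actual OpenHands runtime API:

```python
import subprocess
import uuid


class PersistentRollout:
    """One container per rollout: load the image once, reuse it for every
    agent action (edits, test runs), and tear it down only at the end."""

    def __init__(self, image: str, workdir: str = "/workspace"):
        self.image = image
        self.workdir = workdir
        self.name = f"rollout-{uuid.uuid4().hex[:8]}"

    def __enter__(self):
        # Start a long-lived container that idles until we exec into it.
        subprocess.run(
            ["docker", "run", "-d", "--name", self.name, "-w", self.workdir,
             self.image, "sleep", "infinity"],
            check=True,
        )
        return self

    def exec(self, command: str) -> subprocess.CompletedProcess:
        # Every agent action (apply patch, run tests) reuses the same container,
        # so repository state persists between turns.
        return subprocess.run(
            ["docker", "exec", "-w", self.workdir, self.name, "bash", "-lc", command],
            capture_output=True, text=True,
        )

    def __exit__(self, *exc):
        subprocess.run(["docker", "rm", "-f", self.name], check=False)


# Usage: multiple edit/test turns without re-initializing the environment.
# with PersistentRollout("my-instance-image:latest") as rt:
#     rt.exec("git apply /tmp/patch.diff")
#     result = rt.exec("pytest tests/test_bug.py -x")
#     print(result.returncode)  # feeds the reward signal
```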
Performance Impact
This architectural change delivers dramatic efficiency improvements:
- Latency reduction: container startup and environment setup become a one-time cost per rollout
- Scalability and resource efficiency: enables parallel rollout generation across the compute cluster (see the sketch below)
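As a simple illustration of that parallelism, independent rollouts can be fanned out across workers, each owning its own persistent container. This reuses the hypothetical PersistentRollout sketch above; the image tags and worker count are placeholders:

```python
from concurrent.futures import ProcessPoolExecutor


def run_rollout(instance_image):
    """Run one agent trajectory against its instance image and return a
    binary reward (placeholder body; reuses PersistentRollout from above)."""
    with PersistentRollout(instance_image) as rt:
        # ... agent turns: generate edit, apply patch, run tests ...
        result = rt.exec("pytest -x")
        return float(result.returncode == 0)


if __name__ == "__main__":
    # Fan rollouts out over local workers; a cluster scheduler plays the
    # same role when scaling beyond a single machine.
    images = ["instance-a:latest", "instance-b:latest"]  # hypothetical tags
    with ProcessPoolExecutor(max_workers=8) as pool:
        rewards = list(pool.map(run_rollout, images))
```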
Enabling Multi-Step Reinforcement Learning
Our extended OpenHands fork now provides all essential components for training code-repair agents via RL:
1. Multi-Turn Interaction Support
Agents can engage in extended debugging sessions, making multiple edits and receiving feedback at each step—critical for learning complex repair strategies.
2. Fast Rollout Generation
Quicker rollout times enable rapid iteration during training, supporting techniques like:
- GRPO
- CORPO
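For reference, GRPO computes advantages relative to a group of rollouts sampled for the same issue. A minimal sketch of that group-relative normalization over binary rewards (the standard formulation, not our training code):

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize each rollout's reward by the mean and
    standard deviation of its group (all rollouts for the same issue)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Eight rollouts on one issue, binary pass/fail rewards:
print(group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 1]))
```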
3. Sparse Reward Evaluation
Binary pass/fail signals from test suites provide clear learning objectives while avoiding reward shaping biases.
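Concretely, the reward for a rollout can be computed by re-running the instance's test sets inside the persistent container. This sketch reuses the hypothetical metadata and runtime classes from the earlier examples:

```python
def compute_reward(rt, meta):
    """Binary reward: 1.0 only if all previously failing tests now pass and the
    previously passing suite still passes; 0.0 otherwise.

    `rt` is a rollout runtime exposing exec() (see PersistentRollout above);
    `meta` is the InstanceMetadata from the metadata-loading sketch.
    """
    fixed = rt.exec(f"{meta.test_command} {' '.join(meta.fail_to_pass)}")
    if fixed.returncode != 0:
        return 0.0
    regressions = rt.exec(f"{meta.test_command} {' '.join(meta.pass_to_pass)}")
    return 1.0 if regressions.returncode == 0 else 0.0
```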
4. Comprehensive Trajectory Logging
Detailed execution traces capture:
- Agent reasoning steps
- Code edits and their impacts
- Test execution results
- Error messages and stack traces
This rich data supports offline analysis, curriculum learning, and reward model training.
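A minimal sketch of what one logged trajectory step might look like; the record fields and JSONL layout are illustrative assumptions rather than our exact logging format:

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TrajectoryStep:
    """One agent turn in a rollout, serialized to JSONL for offline analysis."""
    step: int
    reasoning: str                 # the agent's stated rationale for this action
    edit_diff: str                 # unified diff applied in this turn
    test_output: str               # stdout/stderr from the test run, if any
    tests_passed: Optional[bool]   # None if no tests were run this turn


def append_step(path: str, record: TrajectoryStep) -> None:
    """Append a single step record to a JSONL trajectory log."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```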
Results and Future Directions
By combining our curated dataset of real-world issues with this extended OpenHands infrastructure, we've established a complete pipeline for training code-repair agents at scale. Current experiments focus on:
- Post-training Qwen3-Coder-30B via RL on diverse Python repositories
- Curriculum learning strategies that gradually increase problem difficulty
- Multi-repository generalization testing transfer across codebases
- Leaderboard evaluation on established benchmarks like SWE-Bench
Early results indicate that RL-trained models significantly outperform supervised fine-tuning baselines on multi-step repair tasks, particularly on issues requiring iterative debugging and test-driven refinement.
Technical Contributions Summary
Our work delivers three key technical contributions:
- Generalized evaluation engine: Removes benchmark-specific constraints, enabling arbitrary repository support
- Unified rollout architecture: Eliminates container duplication, reducing overhead by ~95%
- RL-ready infrastructure: Provides clean APIs for integration with modern RL frameworks
These advances transform OpenHands from a specialized benchmark tool into a scalable platform for training next-generation code-repair agents on real-world software engineering tasks.
Acknowledgments
This work builds on the excellent foundation provided by the OpenHands team. Our modifications are designed to complement and extend their framework while maintaining compatibility with the original codebase.
Interested in the technical details? Reach out to discuss collaboration opportunities in agentic code generation and reinforcement learning for software engineering.