Nov 24 2025
Scaling Code-Repair Agents with Reinforcement Learning: Extending OpenHands for Real-World Repositories
By: Harsh Gupta, Nishit Neema, Shivashish Naithani, Srinjoy Mukherjee, Sapan Shah, David Bick, Gokul Ramakrishnan, Ganesh Venkatesh
Introduction
Training effective code-repair agents requires more than just large language models—it demands a robust infrastructure that can efficiently interact with real-world codebases at scale. After building a comprehensive dataset of real-world issue-resolution instances packaged as Docker images, we faced a critical challenge: existing agentic frameworks like OpenHands were tightly coupled to benchmark-specific assumptions that prevented their use in large-scale reinforcement learning (RL) training loops.
This post details our technical journey in transforming OpenHands from a SWE-Bench evaluation tool into a general-purpose RL training platform capable of handling thousands of diverse Python repositories.
The Challenge: Breaking Free from Benchmark Constraints
OpenHands provides a sophisticated agentic scaffold through its CodeActAgent, but its evaluation pipeline was architected specifically for SWE-Bench Verified. This tight coupling manifested in several ways:
- Hard-coded repository assumptions: Fixed directory structures and predefined metadata schemas
- Static environment expectations: Dependency on SWE-Bench harness utilities
- Inflexible evaluation flow: No support for arbitrary Docker images or custom test configurations
Our dataset, spanning thousands of Python repositories scraped from PyPI and GitHub, presented fundamentally different requirements. Each repository brought its own:
- Unique project structure and build systems
- Custom test frameworks and configurations
- Varying dependency management approaches
- Repository-specific metadata formats
Architecture Redesign: Generalized Evaluation Pipeline
Dynamic Metadata Loading
We completely refactored the evaluation workflow to eliminate dependency on SWE-Bench Verified specific assumptions. The new architecture:
1. Dynamically loads arbitrary Docker images representing problem instances
2. Parses metadata at runtime, including:
   - Base commit SHA
   - Failing test identifiers
   - Passing test suite specifications
   - Test runner configuration (pytest, etc.)
   - Repository-specific environment variables
3. Executes tests in containerized environments with full isolation
4. Reports structured pass/fail outcomes compatible with RL reward signals
This transformation effectively converts OpenHands into a general-purpose evaluation engine for code-repair tasks, capable of operating on any repository with minimal configuration.
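To make the runtime metadata parsing concrete, here is a minimal sketch of the kind of structure the pipeline loads per instance. The field names and file layout are illustrative assumptions, not our exact schema:

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstanceMetadata:
    """Per-instance metadata parsed at rollout start (schema is illustrative)."""
    instance_id: str
    base_commit: str                 # commit SHA the agent starts from
    fail_to_pass: List[str]          # failing tests the fix must turn green
    pass_to_pass: List[str]          # tests that must keep passing
    test_command: str = "pytest"     # repository-specific test runner
    env: Dict[str, str] = field(default_factory=dict)  # extra env vars


def load_metadata(path: str) -> InstanceMetadata:
    """Load metadata shipped alongside the problem-instance image (e.g. a JSON file)."""
    with open(path) as f:
        raw = json.load(f)
    return InstanceMetadata(**raw)
```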
Optimization: Unified Rollout Architecture
The Container Duplication Problem
The original OpenHands design separated patch generation and evaluation into independent workflows. Each step would:
- Load the Docker image from disk (~8GB typical size)
- Initialize a new container runtime
- Set up the repository environment
- Execute the operation
- Tear down the container
For moderately-sized Docker images, this duplication introduced approximately 3 minutes of overhead per instance—completely untenable for large-scale RL training where we need to generate thousands of rollouts per training iteration.
Single-Context Rollout System
Our solution implements a unified workflow that maintains a persistent runtime context throughout the entire agent trajectory.
Key architectural changes (a simplified sketch follows the list):
- Single image load: Docker image loaded once at rollout initialization
- Persistent container: Single container maintained across all agent actions
- Stateful environment: Repository state preserved between edits and evaluations
- Multi-turn interaction support: the agent can iteratively
  - Generate code edits
  - Apply patches incrementally
  - Run tests to validate changes
  - Receive feedback without environment reinitialization
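The sketch below illustrates the single-load, single-container pattern using the Docker CLI directly. The class name, paths, and commands are illustrative assumptions and not the actual OpenHands runtime API:

```python
import subprocess
import uuid


class PersistentRollout:
    """One container per rollout: load the image once, reuse it for every
    agent action (edits, test runs), and tear it down only at the end."""

    def __init__(self, image: str, workdir: str = "/workspace"):
        self.image = image
        self.workdir = workdir
        self.name = f"rollout-{uuid.uuid4().hex[:8]}"

    def __enter__(self):
        # Start a long-lived container that idles until we exec into it.
        subprocess.run(
            ["docker", "run", "-d", "--name", self.name, "-w", self.workdir,
             self.image, "sleep", "infinity"],
            check=True,
        )
        return self

    def exec(self, command: str) -> subprocess.CompletedProcess:
        # Every agent action (apply patch, run tests) reuses the same container,
        # so repository state persists between turns.
        return subprocess.run(
            ["docker", "exec", "-w", self.workdir, self.name, "bash", "-lc", command],
            capture_output=True, text=True,
        )

    def __exit__(self, *exc):
        subprocess.run(["docker", "rm", "-f", self.name], check=False)


# Usage: multiple edit/test turns without re-initializing the environment.
# with PersistentRollout("my-instance-image:latest") as rt:
#     rt.exec("git apply /tmp/patch.diff")
#     result = rt.exec("pytest tests/test_bug.py -x")
#     print(result.returncode)  # feeds the reward signal
```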
Performance Impact
This architectural change delivers dramatic efficiency improvements:
- Latency reduction: container startup and environment setup become a one-time cost per rollout
- Scalability and resource efficiency: enables parallel rollout generation across the compute cluster (see the sketch below)
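As a simple illustration of that parallelism, independent rollouts can be fanned out across workers, each owning its own persistent container. This reuses the hypothetical PersistentRollout sketch above; the image tags and worker count are placeholders:

```python
from concurrent.futures import ProcessPoolExecutor


def run_rollout(instance_image):
    """Run one agent trajectory against its instance image and return a
    binary reward (placeholder body; reuses PersistentRollout from above)."""
    with PersistentRollout(instance_image) as rt:
        # ... agent turns: generate edit, apply patch, run tests ...
        result = rt.exec("pytest -x")
        return float(result.returncode == 0)


if __name__ == "__main__":
    # Fan rollouts out over local workers; a cluster scheduler plays the
    # same role when scaling beyond a single machine.
    images = ["instance-a:latest", "instance-b:latest"]  # hypothetical tags
    with ProcessPoolExecutor(max_workers=8) as pool:
        rewards = list(pool.map(run_rollout, images))
```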
Enabling Multi-Step Reinforcement Learning
Our extended OpenHands fork now provides all essential components for training code-repair agents via RL:
1. Multi-Turn Interaction Support
Agents can engage in extended debugging sessions, making multiple edits and receiving feedback at each step—critical for learning complex repair strategies.
2. Fast Rollout Generation
Quicker rollout times enable rapid iteration during training, supporting techniques like:
- GRPO
- CORPO
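For reference, GRPO computes advantages relative to a group of rollouts sampled for the same issue. A minimal sketch of that group-relative normalization over binary rewards (the standard formulation, not our training code):

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize each rollout's reward by the mean and
    standard deviation of its group (all rollouts for the same issue)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Eight rollouts on one issue, binary pass/fail rewards:
print(group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 1]))
```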
3. Sparse Reward Evaluation
Binary pass/fail signals from test suites provide clear learning objectives while avoiding reward shaping biases.
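Concretely, the reward for a rollout can be computed by re-running the instance's test sets inside the persistent container. This sketch reuses the hypothetical metadata and runtime classes from the earlier examples:

```python
def compute_reward(rt, meta):
    """Binary reward: 1.0 only if all previously failing tests now pass and the
    previously passing suite still passes; 0.0 otherwise.

    `rt` is a rollout runtime exposing exec() (see PersistentRollout above);
    `meta` is the InstanceMetadata from the metadata-loading sketch.
    """
    fixed = rt.exec(f"{meta.test_command} {' '.join(meta.fail_to_pass)}")
    if fixed.returncode != 0:
        return 0.0
    regressions = rt.exec(f"{meta.test_command} {' '.join(meta.pass_to_pass)}")
    return 1.0 if regressions.returncode == 0 else 0.0
```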
4. Comprehensive Trajectory Logging
Detailed execution traces capture:
- Agent reasoning steps
- Code edits and their impacts
- Test execution results
- Error messages and stack traces
This rich data supports offline analysis, curriculum learning, and reward model training.
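A minimal sketch of what one logged trajectory step might look like; the record fields and JSONL layout are illustrative assumptions rather than our exact logging format:

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TrajectoryStep:
    """One agent turn in a rollout, serialized to JSONL for offline analysis."""
    step: int
    reasoning: str                 # the agent's stated rationale for this action
    edit_diff: str                 # unified diff applied in this turn
    test_output: str               # stdout/stderr from the test run, if any
    tests_passed: Optional[bool]   # None if no tests were run this turn


def append_step(path: str, record: TrajectoryStep) -> None:
    """Append a single step record to a JSONL trajectory log."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```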
Results and Future Directions
By combining our curated dataset of real-world issues with this extended OpenHands infrastructure, we've established a complete pipeline for training code-repair agents at scale. Current experiments focus on:
- Post-training Qwen3-Coder-30B via RL on diverse Python repositories
- Curriculum learning strategies that gradually increase problem difficulty
- Multi-repository generalization testing transfer across codebases
- Leaderboard evaluation on established benchmarks like SWE-Bench
Early results indicate that RL-trained models significantly outperform supervised fine-tuning baselines on multi-step repair tasks, particularly on issues requiring iterative debugging and test-driven refinement.
Technical Contributions Summary
Our work delivers three key technical contributions:
- Generalized evaluation engine: Removes benchmark-specific constraints, enabling arbitrary repository support
- Unified rollout architecture: Eliminates container duplication, reducing overhead by ~95%
- RL-ready infrastructure: Provides clean APIs for integration with modern RL frameworks
These advances transform OpenHands from a specialized benchmark tool into a scalable platform for training next-generation code-repair agents on real-world software engineering tasks.
Acknowledgments
This work builds on the excellent foundation provided by the OpenHands team. Our modifications are designed to complement and extend their framework while maintaining compatibility with the original codebase.
Interested in the technical details? Reach out to discuss collaboration opportunities in agentic code generation and reinforcement learning for software engineering.