

Nov 24 2025

Scaling SWE Agent Data Collection with Dockerized Environments for Execution

By: Gune S, Sahil Lathiya, Apoorv Pandey, Mritunjai Chandra, Vijay Srinivas, Ganesh Venkatesh

Introduction

We are focused on building a high-quality platform for agentic flow training, including support for Reinforcement Learning (RL), datasets, and ML recipes. We announced our RL platform a few weeks ago, and this project represents our dataset infrastructure initiative.

Why This Matters

Challenge: Training effective AI agents for software engineering requires:

  • High-Quality Datasets: A diverse corpus of high-quality data consisting of PR titles/descriptions, issue titles/descriptions, base commits, patches, unit test files, and other metadata provided by GitHub
  • Reproducible Environments: Consistent execution environments across different repositories
  • High-Quality Signals: Clear pass/fail signals for learning (FAIL_TO_PASS tests)
  • Scale: Thousands of diverse, real-world software engineering tasks
  • Validation: Verified test outcomes with proper testing

Our Solution: Our data curation pipeline, paired with the repository sandbox generator, provides:

  • Data Curation Pipeline: Built-in support for querying repositories to extract their structure, dependencies, and the various fields required for training SWE agents
  • Automated Code Execution Infrastructure: Handles complex dependency resolution and environment setup (a sketch of the environment build step follows this list)
  • Rich Metadata: Embedded test classifications for training signals
  • Portability: Self-contained Docker images that work anywhere
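
To make the environment build step concrete, here is a minimal sketch assuming a simple clone-and-install flow; the helper name, base image, and install commands are illustrative, not the production generator:

    # Minimal sketch of emitting a self-contained Dockerfile for one repository.
    # Helper name, base image, and install commands are illustrative assumptions.

    def render_dockerfile(repo_url: str, base_commit: str, python_version: str = "3.11") -> str:
        """Return a Dockerfile that clones the repo at a fixed commit and installs it."""
        return "\n".join([
            f"FROM python:{python_version}-slim",
            "RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*",
            "WORKDIR /workspace",
            f"RUN git clone {repo_url} repo",
            "WORKDIR /workspace/repo",
            f"RUN git checkout {base_commit}",
            # Editable install plus a test runner; the real generator resolves
            # per-repo dependencies and applies LLM-based fixes on build failures.
            "RUN pip install -e . pytest",
        ])

    # Example usage (hypothetical repository and commit):
    # print(render_dockerfile("https://github.com/example/project", "abc1234"))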

Impact on Agentic AI Development

This infrastructure enables:

  1. Supervised Learning: Train models on verified bug fixes with clear before/after states
  2. Reinforcement Learning: Provide executable environments with reward signals from test outcomes (see the sketch after this list)
  3. Evaluation: Standardized benchmarks for measuring agent performance
  4. Research: Reproducible experiments across different approaches
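
As one common way to turn test outcomes into a scalar reward, here is a minimal sketch; the function shape and the all-or-nothing scoring are assumptions, not necessarily what our RL platform uses:

    # Sketch of a binary reward derived from test outcomes after applying an
    # agent's patch. This is one common choice, not necessarily the platform's.

    def reward(results: dict, fail_to_pass: list, pass_to_pass: list) -> float:
        """results maps test id -> "PASS" or "FAIL" from a run inside the Docker image."""
        fixed = all(results.get(t) == "PASS" for t in fail_to_pass)
        no_regressions = all(results.get(t) == "PASS" for t in pass_to_pass)
        return 1.0 if fixed and no_regressions else 0.0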

Scalability Vision

Our methodology is designed to scale by orders of magnitude:

  • Current Capability: Process 1000+ instances with full automation

We are building the foundation for the largest high-quality dataset for software engineering AI agents.

Dataset Collection Methodology

To construct a high-quality, diverse corpus of real-world software projects, we sourced repositories from three complementary pipelines: PyPI-linked projects, high-impact GitHub repositories, and multimodal codebases. Each pipeline was designed with strict quality and activity thresholds to ensure that only mature, collaboratively developed projects were included. A sketch of these filters follows the list below.

  1. PyPI-Based Repository Instances. We began by identifying repositories associated with PyPI packages, focusing on projects with substantial community engagement. Repositories were required to be predominantly Python (≥60% of their codebase) and show meaningful development activity, with minimum thresholds of 100 pull requests or issues and 100 forks. After applying these filters and a subsequent decontamination pass, this stream contributed approximately 107,000 curated repository instances.
  2. GitHub Star-Based Repository Instances. To capture widely adopted and influential projects, we separately collected repositories on the basis of GitHub popularity. Only repositories with at least 10,000 stars and clear evidence of active maintenance (at least 100 pull requests or issues and 100 forks) were considered. Applying the same ≥60% Python requirement and decontamination produced an additional 50,000 high-quality instances.
  3. Multimodal Repository Instances. Finally, to broaden the dataset beyond pure Python projects, we incorporated multimodal repositories that combine Python with other languages or data modalities. Using the same activity criteria (≥100 PRs/issues and ≥100 forks) and a ≥60% Python threshold, we extracted 3,500 instances from a curated set of 700 multimodal repositories.
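
A minimal sketch of these filters, assuming a flat metadata record per repository; the field names are illustrative, while the thresholds mirror the criteria above:

    # Sketch of the repository filters described above. The metadata field names
    # are assumptions; the thresholds (>=60% Python, >=100 PRs or issues,
    # >=100 forks, >=10,000 stars for the star-based stream) follow the text.

    def passes_activity_filters(meta: dict, require_stars: bool = False) -> bool:
        python_fraction = meta.get("python_fraction", 0.0)  # share of the codebase in Python
        prs = meta.get("pull_requests", 0)
        issues = meta.get("issues", 0)
        forks = meta.get("forks", 0)
        stars = meta.get("stars", 0)

        if python_fraction < 0.60:                 # >=60% Python requirement
            return False
        if max(prs, issues) < 100 or forks < 100:  # >=100 PRs or issues, >=100 forks
            return False
        if require_stars and stars < 10_000:       # star-based stream only
            return False
        return True

    # Example: the star-based stream applies the same check with require_stars=True.
    # passes_activity_filters({"python_fraction": 0.8, "pull_requests": 250,
    #                          "issues": 40, "forks": 300, "stars": 12_000},
    #                         require_stars=True)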

Together, these pipelines ensure broad coverage across package ecosystems, popular open-source projects, and rich multimodal codebases, resulting in a balanced and rigorously filtered dataset suitable for downstream analysis and model training.

Dataset Statistics

Below is a quick reference to the key fields captured for each dataset instance.
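
As an illustrative stand-in, an instance can be thought of as a record like the sketch below; the field names are assumptions based on the metadata described in this post, not the exact schema:

    # Illustrative sketch of one dataset instance. Field names are assumptions
    # drawn from the metadata described in this post, not the exact schema.
    from dataclasses import dataclass, field

    @dataclass
    class SWEInstance:
        repo: str                     # e.g. "owner/name"
        base_commit: str              # commit the sandbox is pinned to
        pr_title: str
        pr_description: str
        issue_title: str
        issue_description: str
        patch: str                    # gold code patch from the merged PR
        test_patch: str               # unit test files added or changed by the PR
        fail_to_pass: list[str] = field(default_factory=list)  # tests fixed by the patch
        pass_to_pass: list[str] = field(default_factory=list)  # tests passing before and after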

Dataset Comparison & Benchmarks

Architecture of Code Execution Environment

Code Execution Platform Overview

This code execution platform automates end-to-end differential testing of Python repositories by setting up an isolated workspace, analyzing repository structure and dependencies, generating an optimized Dockerfile, and building the environment with automatic LLM-based fixes for build failures. It then performs robust test discovery with iterative auto-correction, executes baseline and patched tests in isolated Docker runs, and compares the results to classify tests as FAIL_TO_PASS or PASS_TO_PASS. A reversal phase rechecks classifications by undoing the main patch, ensuring accuracy. Finally, the system produces comprehensive artifacts, including logs, Dockerfiles, summaries, and detailed JSON results capturing the full build, test, and differential-analysis lifecycle.
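
A minimal sketch of the differential-analysis step, assuming per-test outcomes are collected as simple PASS/FAIL maps; the function and result names are illustrative, while the labels follow the description above:

    # Sketch of the differential-analysis step: compare per-test outcomes from the
    # baseline run (before the main patch) and the patched run, then label each test.
    # The result-dict shape and names are assumptions; the labels follow the post.

    def classify_tests(baseline: dict, patched: dict) -> dict:
        """baseline and patched map test id -> "PASS" or "FAIL"."""
        fail_to_pass, pass_to_pass = [], []
        for test_id, after in patched.items():
            before = baseline.get(test_id, "FAIL")  # tests absent at baseline count as failing
            if before == "FAIL" and after == "PASS":
                fail_to_pass.append(test_id)
            elif before == "PASS" and after == "PASS":
                pass_to_pass.append(test_id)
        return {"FAIL_TO_PASS": fail_to_pass, "PASS_TO_PASS": pass_to_pass}

    # The reversal phase re-runs the suite with the main patch undone and keeps only
    # tests whose labels are consistent with that re-check.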

Interested in the technical details? Reach out to discuss collaboration opportunities in agentic code generation and sandboxed execution for software engineering.