Add SWE-ZERO 12M dataset #258

Open
neubig wants to merge 1 commit into main from openhands/add-swe-zero-12m

Conversation

@neubig (Contributor) commented Jun 3, 2026

Summary

Closes #257.

Adds the AlienKevin/SWE-ZERO-12M-trajectories dataset to ADP following the existing mini-swe-agent dataset patterns.

Dataset source

Files added

  • datasets/AlienKevin_SWE-ZERO-12M-trajectories/README.md
  • datasets/AlienKevin_SWE-ZERO-12M-trajectories/extract_raw.py
  • datasets/AlienKevin_SWE-ZERO-12M-trajectories/schema_raw.py
  • datasets/AlienKevin_SWE-ZERO-12M-trajectories/raw_to_standardized.py
  • datasets/AlienKevin_SWE-ZERO-12M-trajectories/metadata.json
  • datasets/AlienKevin_SWE-ZERO-12M-trajectories/sample_raw.json
  • datasets/AlienKevin_SWE-ZERO-12M-trajectories/sample_std.json
  • datasets/AlienKevin_SWE-ZERO-12M-trajectories/sample_sft/openhands_v0.json

Schema mapping summary

  • Skip raw system messages because they define mini-swe-agent formatting and execution-free constraints.
  • Convert initial raw user task messages to TextObservation(source="user").
  • Convert raw user messages beginning with Observation: to TextObservation(source="environment") with the prefix stripped.
  • Convert raw assistant messages with fenced bash blocks to CodeAction(language="bash"), preserving pre-command reasoning as the action description.
  • Preserve assistant messages without bash blocks as MessageAction entries.
  • Store instance_id, repo, trajectory_format, exit_status, and duration_sec in trajectory details.
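
The mapping above can be sketched roughly as follows. This is an illustration only, not the PR's raw_to_standardized.py: the dataclasses are simplified stand-ins for the ADP schema classes, and the `role`/`content` message layout, the regular expression, the whitespace handling after the prefix, and the name `convert_messages` are all assumptions.

```python
import re
from dataclasses import dataclass

# Simplified stand-ins for the ADP schema classes (assumed shapes).
@dataclass
class TextObservation:
    content: str
    source: str

@dataclass
class CodeAction:
    language: str
    content: str
    description: str = ""

@dataclass
class MessageAction:
    content: str

# Built indirectly so the example never contains a literal triple backtick.
FENCE = "`" * 3
BASH_BLOCK = re.compile(FENCE + r"bash\s*\n(.*?)" + FENCE, re.DOTALL)
OBS_PREFIX = "Observation:"

def convert_messages(messages):
    """Map raw chat messages to standardized steps (hypothetical helper)."""
    steps = []
    for msg in messages:
        role, content = msg["role"], msg["content"]
        if role == "system":
            continue  # system prompts only describe the agent format
        if role == "user":
            if content.startswith(OBS_PREFIX):
                body = content[len(OBS_PREFIX):]
                # Assumption: drop at most one space after the prefix.
                if body.startswith(" "):
                    body = body[1:]
                steps.append(TextObservation(body, "environment"))
            else:
                steps.append(TextObservation(content, "user"))
        elif role == "assistant":
            matches = list(BASH_BLOCK.finditer(content))
            if not matches:
                steps.append(MessageAction(content))
                continue
            match = matches[-1]
            # Reasoning before the command becomes the description.
            description = content[:match.start()].strip()
            description = re.sub(r"^THOUGHT:\s*", "", description)
            steps.append(CodeAction("bash", match.group(1).strip(), description))
    return steps
```

Returning a flat list keeps the turn order of the raw trajectory, so observations stay adjacent to the actions that produced them.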

Design decisions

  • Ambiguity: The source dataset has 100 independent rollouts per PR and repeats instance_id across rows.

    • Chosen approach: Derive ADP IDs from instance_id plus a deterministic SHA-1 content hash.
    • Example: rsteube__carapace-849 becomes IDs such as rsteube__carapace-849-f3b732c7f08f.
    • Alternatives rejected: Using only instance_id would create duplicate sample IDs; adding a synthetic counter would depend on extraction position and be less stable.
  • Ambiguity: The dataset card says most trajectories are incomplete and explicitly frames the corpus as mid-training data rather than verified SFT data.

    • Chosen approach: Preserve all non-empty trajectories rather than filtering to Submitted only.
    • Example: Sample trajectories with exit_status: incomplete are standardized and converted to OpenHands v0 SFT.
    • Alternatives rejected: Filtering to successful submissions would discard the bulk of the dataset and conflict with the dataset card's intended use.
  • Ambiguity: Raw observations are encoded as user messages prefixed with Observation:.

    • Chosen approach: Treat these as environment observations and strip only the prefix.
    • Example: Observation: ./example/cmd/_test/xonsh.py becomes an environment TextObservation containing ./example/cmd/_test/xonsh.py.
    • Alternatives rejected: Leaving observations as user messages would lose tool-result structure; stripping more text could remove meaningful command output.
  • Ambiguity: Some assistant turns may not contain a valid bash block even though the prompt requests one.

    • Chosen approach: Convert assistant turns without a bash block to MessageAction so malformed or terminal natural-language turns are preserved.
    • Example: A plain assistant explanation remains a message action rather than being dropped.
    • Alternatives rejected: Dropping these turns would alter trajectory semantics; inventing a command would introduce unsupported behavior.
  • Ambiguity: Assistant messages may contain reasoning before a bash command.

    • Chosen approach: Preserve reasoning as CodeAction.description after removing a leading THOUGHT: label.
    • Example: THOUGHT: I need to inspect files... becomes the code action description.
    • Alternatives rejected: Keeping THOUGHT: in descriptions adds format noise; discarding the reasoning loses useful supervision.
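
The ID scheme from the first design decision could look something like the sketch below. The function name `make_sample_id`, the choice to hash a canonical JSON dump of the whole row, and the 12-character truncation (picked to match the length of the example suffix above) are assumptions, not a description of the PR's actual code.

```python
import hashlib
import json

def make_sample_id(instance_id: str, row: dict, digest_chars: int = 12) -> str:
    """Build a position-independent ID from instance_id plus a content hash."""
    # Canonical JSON (sorted keys) so identical content always hashes the same.
    payload = json.dumps(row, sort_keys=True, ensure_ascii=False).encode("utf-8")
    digest = hashlib.sha1(payload).hexdigest()[:digest_chars]
    return f"{instance_id}-{digest}"
```

Because the suffix depends only on row content, re-running extraction in a different order yields the same IDs, which a synthetic counter would not guarantee.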

Known limitations

  • The source trajectories are execution-free and are not verified against tests.
  • Many trajectories are incomplete or truncated at the source dataset's 15-turn cap.
  • Samples are intentionally small and generated from the beginning of the training stream.

Tests run

  • python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -q
  • PATH=/home/openhands/.local/bin:$PATH python -m pytest tests/ -q

This PR was created by an AI agent (OpenHands) on behalf of the user.


Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig marked this pull request as ready for review June 3, 2026 15:17

@github-actions (bot) left a comment

🟡 Acceptable overall — pipeline is clean, CI passes, and the schema mapping is well-documented. Two issues need to be addressed before merge.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW — adds a new dataset directory only; no shared schema or converter changes.

Workflow run: https://github.com/neulab/agent-data-protocol/actions/runs/26894417354

This review was generated by an AI agent (OpenHands) on behalf of the reviewer.

@@ -0,0 +1,413 @@
[
{

🟠 Important: Non-representative samples — MessageAction path never exercised.

All 3 samples come from the same instance_id (rsteube__carapace-849) with exit_status: incomplete. As a result:

  1. The MessageAction branch in convert_assistant_message (assistant turns with no bash block) is never exercised in the committed samples, even though it is a documented converter feature.
  2. The samples don't illustrate diversity across the dataset's 122,908 unique PRs and 3,222 repositories.

Please regenerate with at least one trajectory from a different instance_id/repo, and ideally include one Submitted outcome and one plain-text assistant turn (no bash block) to cover the MessageAction path. You can skip ahead in the stream to find more varied samples:

python datasets/AlienKevin_SWE-ZERO-12M-trajectories/extract_raw.py \
  | python -c "
import sys, json

# Collect the first five raw rows (one JSON object per line) with a Submitted outcome.
buf = []
for line in sys.stdin:
    item = json.loads(line)
    if item.get('exit_status') == 'Submitted':
        buf.append(line.strip())
        if len(buf) == 5:
            break

# Emit the selected rows as a single JSON array.
print('[' + ','.join(buf) + ']')
"
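
If it helps, the three coverage goals above (distinct instance_id, a Submitted outcome, a plain-text assistant turn) could be checked in one pass with something like the sketch below. It assumes each raw row carries `instance_id`, `exit_status`, and a `messages` list of `role`/`content` dicts; the helper names are hypothetical.

```python
import re

# Built indirectly so the example never contains a literal triple backtick.
FENCE = "`" * 3
BASH_BLOCK = re.compile(FENCE + r"bash\s*\n.*?" + FENCE, re.DOTALL)

def has_plain_assistant_turn(row):
    """True if any assistant message lacks a fenced bash block."""
    return any(
        m["role"] == "assistant" and not BASH_BLOCK.search(m["content"])
        for m in row["messages"]
    )

def pick_diverse_samples(rows, limit=3):
    """Greedily pick rows with distinct instance_ids that cover both goals."""
    picked = []
    seen_ids = set()
    unmet = {"submitted", "plain_turn"}
    for row in rows:
        if row["instance_id"] in seen_ids:
            continue
        covers = set()
        if row.get("exit_status") == "Submitted":
            covers.add("submitted")
        if has_plain_assistant_turn(row):
            covers.add("plain_turn")
        # Keep one slot in reserve for each goal that is still unmet.
        free_slots = limit - len(picked) - len(unmet)
        if covers & unmet or free_slots > 0:
            picked.append(row)
            seen_ids.add(row["instance_id"])
            unmet -= covers
        if len(picked) == limit:
            break
    return picked
```

If the stream ends before both goals are met, the function simply returns fewer than `limit` rows, which is a signal to scan further.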

if not bash_matches:
    return MessageAction(content=content)

match = bash_matches[-1]

🟡 Suggestion: Undocumented multi-bash-block handling.

bash_matches[-1] selects the last bash block and silently drops any earlier code blocks when a message contains more than one. While the system prompt mandates exactly one, malformed rollouts may violate this. This design choice has no entry in the PR description's decision catalog.

Please add a design-decision entry documenting why the last block is preferred (e.g., later blocks represent the final intent) and consider emitting a warning via sys.stderr when len(bash_matches) > 1.
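
One possible shape for that warning, as a sketch only: the regular expression, the helper name `select_bash_block`, and the `sample_id` parameter are assumptions and may not match the converter's actual code.

```python
import re
import sys

# Built indirectly so the example never contains a literal triple backtick.
FENCE = "`" * 3
BASH_BLOCK = re.compile(FENCE + r"bash\s*\n(.*?)" + FENCE, re.DOTALL)

def select_bash_block(content, sample_id="<unknown>"):
    """Return the last bash block's command, or None if there is no block."""
    bash_matches = list(BASH_BLOCK.finditer(content))
    if not bash_matches:
        return None
    if len(bash_matches) > 1:
        # Surface malformed rollouts instead of dropping blocks silently.
        print(
            f"warning: {sample_id}: {len(bash_matches)} bash blocks found; "
            "using the last one",
            file=sys.stderr,
        )
    return bash_matches[-1].group(1).strip()
```

Writing to stderr keeps the warning out of any JSON emitted on stdout, so existing pipelines that parse the converter's output are unaffected.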


Development

Successfully merging this pull request may close these issues.

Add 12M swe-zero dataset
