Add SWE-ZERO 12M dataset by neubig · Pull Request #258 · neulab/agent-data-protocol

neubig · 2026-06-03T14:35:39Z

Summary

Closes #257.

Adds the AlienKevin/SWE-ZERO-12M-trajectories dataset to ADP following the existing mini-swe-agent dataset patterns.

Dataset source

Source: https://huggingface.co/datasets/AlienKevin/SWE-ZERO-12M-trajectories
License: Apache-2.0
Split used: train
Size: 12,290,800 rollouts over 122,908 unique PRs, 3,222 repositories, 16 programming languages, and approximately 112B tokens according to the dataset card.

Files added

datasets/AlienKevin_SWE-ZERO-12M-trajectories/README.md
datasets/AlienKevin_SWE-ZERO-12M-trajectories/extract_raw.py
datasets/AlienKevin_SWE-ZERO-12M-trajectories/schema_raw.py
datasets/AlienKevin_SWE-ZERO-12M-trajectories/raw_to_standardized.py
datasets/AlienKevin_SWE-ZERO-12M-trajectories/metadata.json
datasets/AlienKevin_SWE-ZERO-12M-trajectories/sample_raw.json
datasets/AlienKevin_SWE-ZERO-12M-trajectories/sample_std.json
datasets/AlienKevin_SWE-ZERO-12M-trajectories/sample_sft/openhands_v0.json

Schema mapping summary

Skip raw system messages, since they specify mini-swe-agent formatting and execution-free constraints rather than task content.
Convert initial raw user task messages to TextObservation(source="user").
Convert raw user messages beginning with Observation: to TextObservation(source="environment") with the prefix stripped.
Convert raw assistant messages with fenced bash blocks to CodeAction(language="bash"), preserving pre-command reasoning as the action description.
Preserve assistant messages without bash blocks as MessageAction entries.
Store instance_id, repo, trajectory_format, exit_status, and duration_sec in trajectory details.
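The mapping rules above can be sketched as a single dispatch function. This is a minimal illustration, not the code in raw_to_standardized.py: the dataclasses are simplified stand-ins for the ADP schema classes, and the function name convert_message, the regular expression, and the whitespace handling after the Observation: prefix are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional, Union

FENCE = "`" * 3  # built programmatically so this example contains no literal fence
BASH_BLOCK = re.compile(FENCE + r"bash[ \t]*\n(.*?)" + FENCE, re.DOTALL)
OBS_PREFIX = "Observation:"
THOUGHT_PREFIX = "THOUGHT:"


# Simplified stand-ins for the ADP schema classes (hypothetical field sets).
@dataclass
class TextObservation:
    content: str
    source: str


@dataclass
class CodeAction:
    language: str
    content: str
    description: str


@dataclass
class MessageAction:
    content: str


Step = Union[TextObservation, CodeAction, MessageAction]


def convert_message(role: str, content: str) -> Optional[Step]:
    """Map one raw chat message to a standardized step, or None to skip it."""
    if role == "system":
        return None  # system prompts are skipped
    if role == "user":
        if content.startswith(OBS_PREFIX):
            # Strip the prefix; dropping the following spaces is an assumption.
            return TextObservation(content[len(OBS_PREFIX):].lstrip(" "), "environment")
        return TextObservation(content, "user")
    # Assistant turn: a bash block becomes a CodeAction, anything else a message.
    matches = list(BASH_BLOCK.finditer(content))
    if not matches:
        return MessageAction(content)
    match = matches[-1]
    description = content[:match.start()].strip()
    if description.startswith(THOUGHT_PREFIX):
        description = description[len(THOUGHT_PREFIX):].strip()
    return CodeAction("bash", match.group(1).strip(), description)
```

Trajectory-level fields such as instance_id and exit_status are not handled here; in this sketch they would be attached by whatever code iterates over the messages.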

Design decisions

Ambiguity: The source dataset has 100 independent rollouts per PR and repeats instance_id across rows.
- Chosen approach: Derive ADP IDs from instance_id plus a deterministic SHA-1 content hash.
- Example: rsteube__carapace-849 becomes IDs such as rsteube__carapace-849-f3b732c7f08f.
- Alternatives rejected: Using only instance_id would create duplicate sample IDs; adding a synthetic counter would depend on extraction position and be less stable.
Ambiguity: The dataset card says most trajectories are incomplete and explicitly frames the corpus as mid-training data rather than verified SFT data.
- Chosen approach: Preserve all non-empty trajectories rather than keeping only those with exit_status: Submitted.
- Example: Sample trajectories with exit_status: incomplete are standardized and converted to OpenHands v0 SFT.
- Alternatives rejected: Filtering to successful submissions would discard the bulk of the dataset and conflict with the dataset card's intended use.
Ambiguity: Raw observations are encoded as user messages prefixed with Observation:.
- Chosen approach: Treat these as environment observations and strip only the prefix.
- Example: Observation: ./example/cmd/_test/xonsh.py becomes an environment TextObservation containing ./example/cmd/_test/xonsh.py.
- Alternatives rejected: Leaving observations as user messages would lose tool-result structure; stripping more text could remove meaningful command output.
Ambiguity: Some assistant turns may not contain a valid bash block even though the prompt requests one.
- Chosen approach: Convert assistant turns without a bash block to MessageAction so malformed or terminal natural-language turns are preserved.
- Example: A plain assistant explanation remains a message action rather than being dropped.
- Alternatives rejected: Dropping these turns would alter trajectory semantics; inventing a command would introduce unsupported behavior.
Ambiguity: Assistant messages may contain reasoning before a bash command.
- Chosen approach: Preserve reasoning as CodeAction.description after removing a leading THOUGHT: label.
- Example: THOUGHT: I need to inspect files... becomes the code action description.
- Alternatives rejected: Keeping THOUGHT: in descriptions adds format noise; discarding the reasoning loses useful supervision.
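The ID scheme from the first design decision can be illustrated as follows. This is a sketch under stated assumptions: the description above only says the ID combines instance_id with a deterministic SHA-1 content hash, so the choice to hash a canonical JSON dump of the messages, the function name make_sample_id, and the 12-character truncation (inferred from the example ID) are all hypothetical.

```python
import hashlib
import json


def make_sample_id(instance_id: str, messages: list, digest_chars: int = 12) -> str:
    """Build a position-independent ID from instance_id plus a content hash."""
    # Canonical serialization: sorted keys and fixed separators keep the hash
    # stable across runs. Which fields are hashed is an assumption.
    canonical = json.dumps(
        messages, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    return f"{instance_id}-{digest[:digest_chars]}"
```

Because the suffix depends only on content, extracting rows in a different order yields the same IDs, which is the stability a synthetic counter would not give. Two byte-identical rollouts would still collide under this sketch.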

Known limitations

The source trajectories are execution-free and are not verified against tests.
Many trajectories are incomplete or truncated at the source dataset's 15-turn cap.
Samples are intentionally small and generated from the beginning of the training stream.

Tests run

python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -q
PATH=/home/openhands/.local/bin:$PATH python -m pytest tests/ -q

This PR was created by an AI agent (OpenHands) on behalf of the user.


Co-authored-by: openhands <openhands@all-hands.dev>

github-actions

🟡 Acceptable overall — pipeline is clean, CI passes, and the schema mapping is well-documented. Two issues need to be addressed before merge.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW — adds a new dataset directory only; no shared schema or converter changes.

Workflow run: https://github.com/neulab/agent-data-protocol/actions/runs/26894417354

This review was generated by an AI agent (OpenHands) on behalf of the reviewer.

github-actions · 2026-06-03T15:21:38Z

@@ -0,0 +1,413 @@
+[
+  {


🟠 Important: Non-representative samples — MessageAction path never exercised.

All 3 samples come from the same instance_id (rsteube__carapace-849) with exit_status: incomplete. As a result:

The MessageAction branch in convert_assistant_message (assistant turns with no bash block) is never exercised in the committed samples, even though it is a documented converter feature.

The samples don't illustrate diversity across the dataset's 122,908 unique PRs and 3,222 repositories.

Please regenerate with at least one trajectory from a different instance_id/repo, and ideally include one Submitted outcome and one plain-text assistant turn (no bash block) to cover the MessageAction path. You can skip ahead in the stream to find more varied samples:

python datasets/AlienKevin_SWE-ZERO-12M-trajectories/extract_raw.py \
  | python -c "
import sys, json
buf = []
for line in sys.stdin:
    item = json.loads(line)
    if item.get('exit_status') == 'Submitted':
        buf.append(line)
    if len(buf) == 5:
        break
print('[' + ','.join(l.strip() for l in buf) + ']')
"
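The one-liner above filters on Submitted only. A slightly fuller selection that also covers the other two requests (distinct instance_id values and a plain-text assistant turn) might look like the sketch below. It assumes each extracted row is a JSON object with instance_id, exit_status, and a messages list of role/content dicts; those field names and the helper names are assumptions about the raw format, not taken from extract_raw.py.

```python
FENCE = "`" * 3  # built programmatically so this example contains no literal fence


def has_plain_assistant_turn(row: dict) -> bool:
    """True if any assistant message lacks a fenced bash block."""
    return any(
        m.get("role") == "assistant" and (FENCE + "bash") not in m.get("content", "")
        for m in row.get("messages", [])
    )


def pick_diverse(rows, limit: int = 5) -> list:
    """Pick up to `limit` rows with distinct instance_ids, reserving slots for
    one Submitted row and one row with a plain-text assistant turn."""
    picked, seen_ids = [], set()
    need_submitted, need_plain = True, True
    for row in rows:
        iid = row.get("instance_id")
        if iid in seen_ids:
            continue
        submitted = row.get("exit_status") == "Submitted"
        plain = has_plain_assistant_turn(row)
        slots_left = limit - len(picked)
        reserved = int(need_submitted) + int(need_plain)
        covers_need = (need_submitted and submitted) or (need_plain and plain)
        # Take filler rows only while slots remain for the still-missing cases.
        if covers_need or slots_left > reserved:
            picked.append(row)
            seen_ids.add(iid)
            need_submitted = need_submitted and not submitted
            need_plain = need_plain and not plain
        if len(picked) == limit:
            break
    return picked


# Possible usage with JSON lines on stdin (assumed output format of extract_raw.py):
#   rows = (json.loads(line) for line in sys.stdin if line.strip())
#   json.dump(pick_diverse(rows), sys.stdout, indent=2)
```

If the stream contains no row for one of the reserved cases, the loop reads to the end of the input and returns fewer than limit rows, so a cap on rows scanned may be worth adding for a 12M-row stream.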

github-actions · 2026-06-03T15:21:39Z

+    if not bash_matches:
+        return MessageAction(content=content)
+
+    match = bash_matches[-1]


🟡 Suggestion: Undocumented multi-bash-block handling.

bash_matches[-1] selects the last bash block and silently drops any earlier code blocks when a message contains more than one. While the system prompt mandates exactly one, malformed rollouts may violate this. This design choice has no entry in the PR description's decision catalog.

Please add a design-decision entry documenting why the last block is preferred (e.g., later blocks represent the final intent) and consider emitting a warning via sys.stderr when len(bash_matches) > 1.
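As a rough illustration of the suggested warning, the last-block selection could be wrapped as below. The function name select_bash_block, its sample_id parameter, and the regular expression are hypothetical and not the converter's actual code.

```python
import re
import sys

FENCE = "`" * 3  # built programmatically so this example contains no literal fence
BASH_BLOCK = re.compile(FENCE + r"bash[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def select_bash_block(content: str, sample_id: str = "<unknown>"):
    """Return the last fenced bash block in content, or None if there is none.

    Emits a warning on stderr when earlier blocks are being dropped.
    """
    blocks = [m.group(1).strip() for m in BASH_BLOCK.finditer(content)]
    if not blocks:
        return None
    if len(blocks) > 1:
        print(
            f"warning: {sample_id}: {len(blocks)} bash blocks found; keeping the last",
            file=sys.stderr,
        )
    return blocks[-1]
```

Writing to stderr keeps the warning out of any JSON that the conversion script prints to stdout.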

Add SWE-ZERO 12M dataset

1862a2b

Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai Bot mentioned this pull request Jun 3, 2026

Add 12M swe-zero dataset #257

Open

neubig marked this pull request as ready for review June 3, 2026 15:17

github-actions Bot requested changes Jun 3, 2026

neubig wants to merge 1 commit into main from openhands/add-swe-zero-12m
