Add SWE-ZERO 12M dataset #258
Co-authored-by: openhands <openhands@all-hands.dev>
🟡 Acceptable overall — pipeline is clean, CI passes, and the schema mapping is well-documented. Two issues need to be addressed before merge.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW — adds a new dataset directory only; no shared schema or converter changes.
Was this automated review useful? React with 👍 or 👎 to this review to help us measure review quality.
Workflow run: https://github.com/neulab/agent-data-protocol/actions/runs/26894417354
This review was generated by an AI agent (OpenHands) on behalf of the reviewer.
@@ -0,0 +1,413 @@
[
{
🟠 Important: Non-representative samples — MessageAction path never exercised.
All 3 samples come from the same instance_id (rsteube__carapace-849) with exit_status: incomplete. As a result:
- The `MessageAction` branch in `convert_assistant_message` (assistant turns with no bash block) is never exercised in the committed samples, even though it is a documented converter feature.
- The samples don't illustrate diversity across the dataset's 122,908 unique PRs and 3,222 repositories.
Please regenerate with at least one trajectory from a different instance_id/repo, and ideally include one Submitted outcome and one plain-text assistant turn (no bash block) to cover the MessageAction path. You can skip ahead in the stream to find more varied samples:
python datasets/AlienKevin_SWE-ZERO-12M-trajectories/extract_raw.py \
  | python -c "
import sys, json
buf = []
for line in sys.stdin:
    item = json.loads(line)
    if item.get('exit_status') == 'Submitted':
        buf.append(line)
        if len(buf) == 5: break
print('[' + ','.join(l.strip() for l in buf) + ']')
"

    if not bash_matches:
        return MessageAction(content=content)

    match = bash_matches[-1]
🟡 Suggestion: Undocumented multi-bash-block handling.
bash_matches[-1] selects the last bash block and silently drops any earlier code blocks when a message contains more than one. While the system prompt mandates exactly one, malformed rollouts may violate this. This design choice has no entry in the PR description's decision catalog.
Please add a design-decision entry documenting why the last block is preferred (e.g., later blocks represent the final intent) and consider emitting a warning via sys.stderr when len(bash_matches) > 1.
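One way the suggested warning could look is sketched below. This is a minimal illustration, not the converter's actual code: the function name `select_bash_command`, the fence regex, and the warning text are all assumptions, since the review only shows the `bash_matches[-1]` line. The fence string is built with `"`" * 3` purely so the example can sit inside a fenced block.

```python
import re
import sys
from typing import Optional

FENCE = "`" * 3  # three backticks, built this way to avoid nesting fences

# Assumed pattern: the converter's real regex is not shown in the review.
BASH_BLOCK = re.compile(FENCE + r"bash\s*\n(.*?)" + FENCE, re.DOTALL)


def select_bash_command(content: str) -> Optional[str]:
    """Return the last bash block in an assistant message, or None if absent."""
    bash_matches = BASH_BLOCK.findall(content)
    if not bash_matches:
        # The caller would fall back to a MessageAction here.
        return None
    if len(bash_matches) > 1:
        # Make the dropped earlier blocks visible instead of discarding silently.
        print(
            f"warning: {len(bash_matches)} bash blocks found; using the last one",
            file=sys.stderr,
        )
    return bash_matches[-1].strip()
```

Writing the warning to `sys.stderr` keeps standard output free for any JSON the conversion step emits.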
Summary
Closes #257.
Adds the `AlienKevin/SWE-ZERO-12M-trajectories` dataset to ADP following the existing mini-swe-agent dataset patterns.

Dataset source
- `train`

Files added
- `datasets/AlienKevin_SWE-ZERO-12M-trajectories/README.md`
- `datasets/AlienKevin_SWE-ZERO-12M-trajectories/extract_raw.py`
- `datasets/AlienKevin_SWE-ZERO-12M-trajectories/schema_raw.py`
- `datasets/AlienKevin_SWE-ZERO-12M-trajectories/raw_to_standardized.py`
- `datasets/AlienKevin_SWE-ZERO-12M-trajectories/metadata.json`
- `datasets/AlienKevin_SWE-ZERO-12M-trajectories/sample_raw.json`
- `datasets/AlienKevin_SWE-ZERO-12M-trajectories/sample_std.json`
- `datasets/AlienKevin_SWE-ZERO-12M-trajectories/sample_sft/openhands_v0.json`

Schema mapping summary
- Keeps `system` messages because they define mini-swe-agent formatting and execution-free constraints.
- Maps `user` task messages to `TextObservation(source="user")`.
- Maps `user` messages beginning with `Observation:` to `TextObservation(source="environment")` with the prefix stripped.
- Maps assistant `bash` blocks to `CodeAction(language="bash")`, preserving pre-command reasoning as the action description.
- Maps assistant turns without a bash block to `MessageAction` entries.
- Stores `instance_id`, `repo`, `trajectory_format`, `exit_status`, and `duration_sec` in trajectory details.

Design decisions
Ambiguity: The source dataset has 100 independent rollouts per PR and repeats `instance_id` across rows.
- Sample IDs combine `instance_id` plus a deterministic SHA-1 content hash.
- `rsteube__carapace-849` becomes IDs such as `rsteube__carapace-849-f3b732c7f08f`.
- Using only `instance_id` would create duplicate sample IDs; adding a synthetic counter would depend on extraction position and be less stable.

Ambiguity: The dataset card says most trajectories are incomplete and explicitly frames the corpus as mid-training data rather than verified SFT data.
- Trajectories are not filtered to `Submitted` only.
- Trajectories with `exit_status: incomplete` are standardized and converted to OpenHands v0 SFT.

Ambiguity: Raw observations are encoded as `user` messages prefixed with `Observation:`.
- `Observation: ./example/cmd/_test/xonsh.py` becomes an environment `TextObservation` containing `./example/cmd/_test/xonsh.py`.

Ambiguity: Some assistant turns may not contain a valid bash block even though the prompt requests one.
- Such turns become `MessageAction` so malformed or terminal natural-language turns are preserved.

Ambiguity: Assistant messages may contain reasoning before a bash command.
- The reasoning is stored in `CodeAction.description` after removing a leading `THOUGHT:` label.
- `THOUGHT: I need to inspect files...` becomes the code action description.
- Keeping `THOUGHT:` in descriptions adds format noise; discarding the reasoning loses useful supervision.

Known limitations
Tests run
- `python -m pytest tests/test_dataset_structure.py tests/test_raw_schemas.py tests/test_standardized_schemas.py tests/test_std_to_sft_conversion.py -q`
- `PATH=/home/openhands/.local/bin:$PATH python -m pytest tests/ -q`

This PR was created by an AI agent (OpenHands) on behalf of the user.
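For readers who want the `Observation:` mapping from the schema summary in concrete form, here is a rough sketch. It does not use the real ADP classes: `TextObservationSketch` is a hypothetical stand-in dataclass, and `convert_user_message` is an illustrative name, so the actual converter in `raw_to_standardized.py` may differ.

```python
from dataclasses import dataclass

OBSERVATION_PREFIX = "Observation:"


@dataclass
class TextObservationSketch:
    """Hypothetical stand-in for ADP's TextObservation (source and content only)."""

    source: str
    content: str


def convert_user_message(content: str) -> TextObservationSketch:
    """Route a raw `user` message to an environment or user observation."""
    if content.startswith(OBSERVATION_PREFIX):
        # Tool output: strip the prefix and the whitespace that follows it.
        stripped = content[len(OBSERVATION_PREFIX):].lstrip()
        return TextObservationSketch(source="environment", content=stripped)
    # Anything else is treated as the task statement from the user.
    return TextObservationSketch(source="user", content=content)
```

With this sketch, `convert_user_message("Observation: ./example/cmd/_test/xonsh.py")` yields an environment observation whose content is `./example/cmd/_test/xonsh.py`, matching the example in the design decisions.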
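The sample ID scheme described in the design decisions (`instance_id` plus a deterministic SHA-1 content hash) might be implemented along these lines. What exactly gets hashed is not stated in the PR, so hashing the serialized message list is an assumption, as are the name `make_sample_id` and the 12-character truncation, which is inferred only from the example suffix `f3b732c7f08f`.

```python
import hashlib
import json


def make_sample_id(instance_id: str, messages: list) -> str:
    """Build an ID that stays stable across runs and differs per rollout."""
    # Serialize deterministically so identical content always hashes the same.
    payload = json.dumps(messages, sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha1(payload.encode("utf-8")).hexdigest()
    # Assumed truncation: 12 hex characters, as in the example ID.
    return f"{instance_id}-{digest[:12]}"
```

Because the suffix depends only on content, reordering or resuming the extraction stream would not change any ID, which is the stability property the PR cites for rejecting a positional counter.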