neulab · neubig · Jun 14, 2026 · Jun 3, 2026
diff --git a/.agents/skills/custom-codereview-guide.md b/.agents/skills/custom-codereview-guide.md
@@ -13,18 +13,18 @@ When reviewing this repository, be strict about dataset correctness and reproduc
 
 For every dataset addition or dataset-format change, verify that the PR follows all applicable guidelines in `AGENTS.md`. In particular, check that:
 
-- Required files are present: `README.md`, `extract_raw.py`, `raw_to_standardized.py`, `schema_raw.py`, `sample_raw.json`, `sample_std.json`, and `sample_sft/openhands_v0.json`. `api.py` is additionally required whenever the dataset emits any `ApiAction`.
+- Required files are present: `README.md`, `extract_raw.py`, `raw_to_atif.py`, `atif_to_std.py`, `schema_raw.py`, `sample_raw.json`, `sample_std.json`, and `sample_sft/openhands_v0.json`.
 - Top-level dataset JSON files are limited to `sample_raw.json`, `sample_std.json`, and `generated_thoughts.json`. No root-level `sample_sft.json`, `full_*.json`, temporary chunks, downloaded corpora, scratch JSON, or alternate sample files such as `sample_fixed.json`.
 - The `sample_sft/` subdirectory contains agent-specific samples named `{agent_name}.json` (e.g. `openhands_v0.json`, `sweagent.json`). These must be regenerable from `sample_std.json` via the corresponding agent's `std_to_sft.py` and must cover the same trajectories/IDs as the standardized sample.
-- Sample files are generated by committed scripts and are not hand-patched fixtures. Mentally (or actually) re-run the pipeline: `sample_raw.json` → `raw_to_standardized.py` → `sample_std.json` → `agents/<agent>/std_to_sft.py` → `sample_sft/<agent_name>.json` should reproduce the committed JSON. If sample JSON changed but the corresponding generator bug was not fixed, flag it.
-- `sample_raw.json`, `sample_std.json`, and each `sample_sft/<agent_name>.json` represent **the same records in the same order**, with matching IDs between standardized and SFT stages. This is a hard requirement, not a soft preference.
+- Sample files are generated by committed scripts and are not hand-patched fixtures. Mentally (or actually) re-run the pipeline: `sample_raw.json` → `raw_to_atif.py` → `sample_atif.json` → `atif_to_std.py` → `sample_std.json` → `agents/<agent>/std_to_sft.py` → `sample_sft/<agent_name>.json` should reproduce the committed JSON. If sample JSON changed but the corresponding generator bug was not fixed, flag it.
+- `sample_raw.json`, `sample_atif.json`, `sample_std.json`, and each `sample_sft/<agent_name>.json` represent **the same records in the same order**, with matching IDs between standardized and SFT stages. This is a hard requirement, not a soft preference.
 - Sample size is small but representative — normally 3–5 trajectories — and covers important edge cases (tool calls, command output, final answers, dataset-specific action types, failures/rewards/terminal states where applicable).
 - Extraction, standardization, and SFT conversion are deterministic so future contributors can reproduce the samples (no unseeded `random.*`, time-dependent behavior, or nondeterministic dict ordering in outputs).
-- `schema_raw.py` validates `sample_raw.json` and standardized trajectories validate against the ADP schema.
-- Every `ApiAction.function` exists in the dataset's `api.py`, and every `kwargs` object satisfies that function's Python signature (including required parameters such as the `message` argument for `finish`). If the dataset emits `ApiAction` without an `api.py`, flag it.
-- If standardized trajectories include top-level `available_apis`, verify the dataset has `api.py`, the source data explicitly specifies per-instance tool/API availability, the list is not merely copied wholesale from `api.py` or inferred from used actions, every listed API exists in `api.py`, and every `ApiAction.function` in that trajectory appears in the list.
+- `schema_raw.py` validates `sample_raw.json` and standardized trajectories validate against the ATIF schema.
+- Every custom `ToolCall.function_name` used in `sample_std.json` is declared in `metadata.json`, and every `ToolCall.arguments` object satisfies that tool schema.
+- If standardized trajectories include per-instance tool availability metadata, verify the source data explicitly specifies it; do not infer availability merely from the tools used in the trajectory.
 - SFT messages containing `<function=`, `<function_calls>`, or `<invoke name=` use `"from": "function_call"` (not `gpt`, `human`, `assistant`, etc.).
-- `TextObservation.source` uses only schema-supported values: `user`, `agent`, or `environment`. Reject invented values like `system`, `os`, or `assistant`.
+- ATIF `Step.source` uses only schema-supported values: `system`, `user`, or `agent`.
 - Raw trajectory semantics are preserved: repeated actions, consecutive tool calls, observations, failures, rewards, and terminal states are not silently dropped. Any filtering must be implemented in code AND explained/justified in the PR description.
 - Dataset-local `std_to_sft.py`, duplicate API definitions, or schema changes are clearly justified. Prefer shared agent converters in `agents/` whenever possible.
 - Large corpora, full generated files (`full_raw.json`, `full_std.json`, `full_sft.json`), temporary chunks, caches, screenshots, and scratch JSON are not committed.
@@ -67,7 +67,7 @@ Dataset PR descriptions must include **all** of the following. If any is missing
 - **License** of the source data.
 - **Size and split** used (e.g. number of trajectories, which split(s)).
 - **Files added or changed** in this PR.
-- **Schema mapping summary** — how raw roles/actions/observations map to ADP types (`MessageAction`, `CodeAction`, `ApiAction`, `TextObservation`, `WebObservation`).
+- **Schema mapping summary** — how raw roles/actions/observations map to ATIF steps, tool calls, and observation results.
 - **Tests run** — which validation tests were executed and their results, or which equivalent CI checks passed.
 - **Known limitations** of the dataset or conversion.
 - **Design-decision catalog** for unclear implementation choices (see below).
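
Reviewer note: the "same records in the same order" and `Step.source` rules in the guide above lend themselves to a quick mechanical check. The following is a minimal sketch, not part of this PR. It assumes each sample file is a JSON list of dicts, that records carry an `id` key, and that standardized records keep their steps under a `steps` key with a `source` field. Those key names and every helper name here are hypothetical, so adjust them to the actual ATIF schema.

```python
import json
from pathlib import Path

# Allowed values quoted from the guide's `Step.source` rule.
ALLOWED_STEP_SOURCES = {"system", "user", "agent"}


def load_records(path):
    """Read a sample file, assumed here to be a JSON list of dicts."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def id_order_problems(std_records, sft_records, id_key="id"):
    """Compare IDs position by position between the std and SFT samples."""
    std_ids = [rec.get(id_key) for rec in std_records]
    sft_ids = [rec.get(id_key) for rec in sft_records]
    problems = []
    if len(std_ids) != len(sft_ids):
        problems.append(
            f"record count differs: std={len(std_ids)} sft={len(sft_ids)}"
        )
    for index, (std_id, sft_id) in enumerate(zip(std_ids, sft_ids)):
        if std_id != sft_id:
            problems.append(
                f"position {index}: std id {std_id!r} != sft id {sft_id!r}"
            )
    return problems


def source_problems(records, steps_key="steps", source_key="source"):
    """Flag steps whose source falls outside ALLOWED_STEP_SOURCES."""
    problems = []
    for rec_index, rec in enumerate(records):
        for step_index, step in enumerate(rec.get(steps_key, [])):
            source = step.get(source_key)
            if source not in ALLOWED_STEP_SOURCES:
                problems.append(
                    f"record {rec_index}, step {step_index}: "
                    f"unsupported source {source!r}"
                )
    return problems
```

A reviewer could call `id_order_problems(load_records("sample_std.json"), load_records("sample_sft/openhands_v0.json"))` from a dataset directory and treat any non-empty result as a reason to ask for regenerated samples.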

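Reviewer note: the determinism rule in the guide above (no unseeded `random.*`, no nondeterministic ordering in outputs) can be illustrated with a small sketch. This is an assumption about how a conversion script might be written, not code from this repository; `pick_sample_ids` and `dump_stable` are hypothetical names.

```python
import json
import random


def pick_sample_ids(ids, k, seed=0):
    """Choose k IDs reproducibly: sort first, then use a seeded generator."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(sorted(ids), k)


def dump_stable(records):
    """Serialize with sorted keys so key insertion order cannot change output."""
    return json.dumps(records, sort_keys=True, indent=2, ensure_ascii=False)
```

Sorting the input before sampling matters as much as the seed, because a set or an unordered directory listing would otherwise feed the generator a different sequence on each run.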
diff --git a/.github/workflows/check_api_docstrings.yml b/.github/workflows/check_api_docstrings.yml
@@ -26,6 +26,7 @@ jobs:
         python -m pip install --upgrade pip
         pip install pytest
         if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
+        pip install -e .
 
     - name: Check dataset metadata
       run: |

diff --git a/.github/workflows/pre-commit.yml b/.github/workflows/pre-commit.yml
@@ -24,6 +24,7 @@ jobs:
         run: |
           python -m pip install --upgrade pip
           pip install -r requirements.txt
+          pip install -e .
 
       - name: Run pre-commit
         uses: pre-commit/action@v3.0.0

diff --git a/.github/workflows/pytest.yml b/.github/workflows/pytest.yml
@@ -26,21 +26,7 @@ jobs:
         python -m pip install --upgrade pip
         pip install pytest
         if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
-    - name: Check schema version bump
-      run: |
-        if [ "${{ github.event_name }}" = "pull_request" ]; then
-          base_ref="origin/${{ github.base_ref }}"
-        else
-          base_ref="${{ github.event.before }}"
-        fi
-
-        if echo "$base_ref" | grep -Eq '^0+$'; then
-          echo "Skipping schema version bump check because no base ref is available."
-        else
-          python scripts/check_schema_version_bump.py \
-            --base-ref "$base_ref" \
-            --head-ref HEAD
-        fi
+        pip install -e .
     - name: Run pytest
       run: |
         pytest tests/test_*.py
diff --git a/.github/workflows/schema-release.yml b/.github/workflows/schema-release.yml
diff --git a/.gitignore b/.gitignore
@@ -17,10 +17,12 @@ full_std_chunks/
 full_sft_chunks/
 
 full_raw.json
+full_atif.json
 full_std.json
 full_sft.json
 
 full_raw.jsonl
+full_atif.jsonl
 full_std.jsonl
 full_sft.jsonl
 
@@ -30,6 +32,7 @@ full_sft.jsonl
 /tags-opts
 
 .cache
+*.egg-info/
 
 /datasets/androidcontrol/android_env_utils/.eggs/
 /datasets/androidcontrol/android_env_utils/android_env/