Benchmark harness for running code-repair tasks across sandbox providers.
The project currently compares Vercel, Modal, and Daytona on TerminalBench and SWE-Smith style tasks. It records sandbox lifecycle timings, solver/verifier status, output tails, and provider cost estimates so warm and cold runs can be compared with the same task set.
- data/: bundled TerminalBench and SWE-Smith smoke datasets in parquet and JSONL form.
- py/: Python runner and provider adapters.
- ts/: Bun/TypeScript runner, matrix runner, prewarm helper, and report generator.
- results/: ignored local benchmark artifacts plus checked-in metadata.
- reports/: curated markdown analysis split into cross-vendor, per-task, and failure-mode views.
- docs/providers/: provider configuration notes for Vercel, Modal, and Daytona.
- scripts/: dataset extraction and OpenRouter solver helpers.
Start with reports/terminalbench_provider_report.md.
The current apples-to-apples runnability comparison covers all 100 tasks in data/swesmith_v4_smoke100.jsonl. Vercel, Modal, and Daytona each have 100/100 passing cold-gold evidence.
The timing rollups are stitched from full and focused reruns, so use them for provider head-to-head shape rather than strict synchronized wall-clock claims. Details are split across:
- reports/cross-vendor-comparison.md
- reports/per-task-comparison.md
- reports/per-task-failure-audit.md
- reports/per-provider-report.md
- reports/failure-modes-tradeoffs.md
The runner normalizes task layout before solving:
| env type | workdir | provider runtime mapping |
|---|---|---|
| terminalbench | /workspace | configured runtime. |
| harbor_swesmith | /testbed | Modal and Daytona use the task Docker image or Dockerfile-derived setup; Vercel and local reconstruct the environment from per-repo manifests in data/swesmith_env_manifests.json (exact Python via uv, mirror clone, SWE-Smith profile install commands). |
SWE-Smith rows include tests/test.sh, solution/*, and an environment/Dockerfile inside the task archive. Vercel cannot consume those per-task Docker images directly in this harness, so the runner rebuilds each environment from the same SWE-Smith profile recipe the image was built from (see data/README.md). The prepare step also rewrites solution/solve.sh into a deterministic idempotent form, and the verifier runs as a non-root agent user to match task-image semantics.
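The layout rules above can be sketched as a small lookup. This is an illustrative sketch, not the code in ts/src/bench.ts: the type names, the `SetupStrategy` labels, and `normalizeTask` are hypothetical, and only the env types, workdirs, and the provider split come from the table.

```typescript
// Hypothetical names throughout; only the env types, workdirs, and the
// Modal/Daytona vs. Vercel/local split are taken from the table above.
type EnvType = "terminalbench" | "harbor_swesmith";
type Provider = "vercel" | "modal" | "daytona" | "local";
type SetupStrategy = "configured-runtime" | "task-docker-image" | "manifest-rebuild";

const WORKDIRS: Record<EnvType, string> = {
  terminalbench: "/workspace",
  harbor_swesmith: "/testbed",
};

function setupStrategy(provider: Provider, env: EnvType): SetupStrategy {
  if (env === "terminalbench") return "configured-runtime";
  // SWE-Smith tasks: Modal and Daytona consume the task image,
  // while Vercel and local rebuild the environment from manifests.
  return provider === "modal" || provider === "daytona"
    ? "task-docker-image"
    : "manifest-rebuild";
}

function normalizeTask(provider: Provider, env: EnvType) {
  return { workdir: WORKDIRS[env], setup: setupStrategy(provider, env) };
}
```

For example, `normalizeTask("vercel", "harbor_swesmith")` yields the /testbed workdir with the manifest-based rebuild, while the same task on Modal keeps /testbed but uses the task image.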
Install the TypeScript runner:

    (cd ts && npm install)

Run one local task:

    (cd ts && bun run bench --provider local --task-index 0 --output ../results/ts-local-one.json)

Run a small Vercel/Modal/Daytona matrix:

    bun --env-file=.env ts/src/matrix.ts --providers all --modes cold,warm --task-index all --task-limit 20 --concurrency 2 --run-concurrency 6 --timeout-seconds 900 --solve-timeout-seconds 300 --solve-command-file scripts/openrouter_solver.sh --output results/solve-price-matrix-task20.json

For solver-enabled remote runs, set provider credentials and OpenRouter variables in .env. Use .env.example as the template when present.
Each run JSON records:
- provider, mode, runtime, dataset, and task environment counts
- pass count and estimated provider cost
- per-task elapsed seconds and phase timings
- verifier return code plus stdout/stderr tails
- solver return code and output tails when a solver is enabled
Matrix JSON files summarize a group of provider/mode run artifacts.
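To make the recorded fields concrete, here is a rough TypeScript shape for a run artifact plus a small summary helper. All field names are assumptions for illustration; the actual JSON keys written by the runner may differ. The sketch also assumes a verifier return code of 0 means the task passed.

```typescript
// Hypothetical field names; check a real run JSON for the actual keys.
interface TaskResult {
  taskId: string;
  elapsedSeconds: number;
  phaseTimings: Record<string, number>;
  verifierReturnCode: number;
  verifierStdoutTail: string;
  verifierStderrTail: string;
  // Present only when a solver is enabled.
  solverReturnCode?: number;
  solverOutputTail?: string;
}

interface RunRecord {
  provider: string;
  mode: "cold" | "warm";
  runtime: string;
  dataset: string;
  estimatedCost: number;
  tasks: TaskResult[];
}

// Assumption: verifier return code 0 counts as a pass.
function summarizeRun(run: RunRecord) {
  const passed = run.tasks.filter((t) => t.verifierReturnCode === 0).length;
  const totalElapsedSeconds = run.tasks.reduce((sum, t) => sum + t.elapsedSeconds, 0);
  return {
    provider: run.provider,
    mode: run.mode,
    passed,
    total: run.tasks.length,
    totalElapsedSeconds,
  };
}
```

A matrix-level summary could then be built by mapping `summarizeRun` over each provider/mode artifact in the group.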
Curated reports live in reports/. To generate a fresh raw provider report from the newest matching artifacts:

    cd ts
    bun run report --results-dir ../results --output ../reports/generated-provider-report.md

The generated report is intentionally separate from the curated report files.
The curated reports in reports/ were produced in three steps:
- Cold-gold provider runs. ts/src/bench.ts ran solver-independent gold-patch checks for Vercel, Modal, and Daytona across data/swesmith_v4_smoke100.jsonl, with focused reruns for repaired failure clusters.
- Evidence aggregation. The regenerated reports scan local ignored results/ts-<provider>-cold-gold*.json files and select the newest passing result for each provider/task. If no passing result exists, they select the newest cold-gold result.
- Curated analysis. The cross-vendor, per-task, and failure-mode documents summarize the full 100-task comparable set and call out that the timing view is stitched from full and focused reruns.
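The evidence-aggregation rule (newest passing result per provider/task, otherwise the newest cold-gold result) can be sketched as follows. This is a minimal illustration, not the report generator's code: the `ColdGoldResult` shape, its field names, and `selectEvidence` are hypothetical, and "newest" is assumed to be a numeric timestamp.

```typescript
// Hypothetical record shape; finishedAt is assumed to be a comparable timestamp.
interface ColdGoldResult {
  provider: string;
  taskId: string;
  passed: boolean;
  finishedAt: number;
  file: string;
}

function selectEvidence(results: ColdGoldResult[]): Map<string, ColdGoldResult> {
  const chosen = new Map<string, ColdGoldResult>();
  for (const candidate of results) {
    const key = `${candidate.provider}/${candidate.taskId}`;
    const current = chosen.get(key);
    if (!current) {
      chosen.set(key, candidate);
      continue;
    }
    // A passing result always beats a failing one; otherwise the newer one wins.
    const better =
      candidate.passed !== current.passed
        ? candidate.passed
        : candidate.finishedAt > current.finishedAt;
    if (better) chosen.set(key, candidate);
  }
  return chosen;
}
```

Because a pass outranks recency, an older passing rerun is preferred over a newer failing one, which is one reason the timing view ends up stitched from different runs.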
The Updated: date in each curated report reflects when the analysis was last revised, not when the benchmark runs executed.
Provider-specific setup details live in docs/providers/.
- Vercel uses @vercel/sandbox. Configure VERCEL_API_KEY, VERCEL_ACCESS_TOKEN, or VERCEL_TOKEN, plus VERCEL_TEAM_ID and VERCEL_PROJECT_ID unless OIDC credentials are available.
- Modal uses the Modal SDK credentials supported by modal.
- Daytona uses DAYTONA_API_KEY and, when needed, DAYTONA_API_URL and DAYTONA_TARGET.
- Cost estimates are harness estimates from measured wall-clock time and configured provider rates. They exclude OpenRouter model spend.
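As a rough illustration of how a cost estimate can be derived from wall-clock time and a configured rate, consider the sketch below. The per-hour rates are placeholders invented for the example and are not real provider prices; the harness's actual rate configuration and billing granularity are not shown here, and `estimateCost` is a hypothetical name.

```typescript
// Placeholder rates for illustration only; these are NOT real provider prices.
const RATE_PER_HOUR: Record<string, number> = {
  vercel: 2.0,
  modal: 3.0,
  daytona: 1.0,
};

// Estimated cost = total measured wall-clock hours * configured hourly rate.
// OpenRouter model spend is deliberately not part of this number.
function estimateCost(provider: string, elapsedSeconds: number[]): number {
  const rate = RATE_PER_HOUR[provider];
  if (rate === undefined) throw new Error(`no configured rate for ${provider}`);
  const totalSeconds = elapsedSeconds.reduce((sum, s) => sum + s, 0);
  return (totalSeconds / 3600) * rate;
}
```

With these placeholder rates, two 900-second tasks on the 2.0-per-hour entry come to half an hour of wall-clock time, so the estimate is 1.0.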
Auth credentials live in .env (see .env.example). Warm-run state — the snapshot/image identifiers reused to skip cold setup — is not stored in .env. Instead, ts/src/prewarm.ts creates the artifact and emits its identifier as an env field in the prewarm result JSON under results/:
| provider | identifier | emitted to | reused via |
|---|---|---|---|
| Vercel | VERCEL_SNAPSHOT_ID | results/prewarm-vercel-*.json | --vercel-snapshot-id or the VERCEL_SNAPSHOT_ID env var |
| Modal | MODAL_IMAGE_ID | results/prewarm-modal-*.json | --modal-image-id or the MODAL_IMAGE_ID env var |
| Daytona | DAYTONA_SNAPSHOT | results/prewarm-daytona-*.json | --daytona-snapshot or the DAYTONA_SNAPSHOT env var |
To run warm, copy the identifier from the prewarm result JSON into the corresponding flag or env var on the next bench.ts/matrix.ts run. For TerminalBench (non-Docker) tasks, Daytona instead uses a cached profile via --prewarm-profile (default terminalbench-smoke) rather than a named snapshot.
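The copy step can be automated with a small helper like the one below. It is a sketch under one assumption: that the prewarm result JSON exposes the identifier as a string under a top-level `env` object keyed by the variable name. The flag and variable names come from the table above; `WARM_IDS` and `warmArgs` are hypothetical.

```typescript
// Flag and env var names are from the table above; the helper itself is hypothetical.
const WARM_IDS = {
  vercel: { envVar: "VERCEL_SNAPSHOT_ID", flag: "--vercel-snapshot-id" },
  modal: { envVar: "MODAL_IMAGE_ID", flag: "--modal-image-id" },
  daytona: { envVar: "DAYTONA_SNAPSHOT", flag: "--daytona-snapshot" },
} as const;

type WarmProvider = keyof typeof WARM_IDS;

// Assumption: the prewarm JSON looks like { "env": { "<ENV_VAR>": "<identifier>" }, ... }.
function warmArgs(provider: WarmProvider, prewarmJson: string): string[] {
  const parsed = JSON.parse(prewarmJson) as { env?: Record<string, string> };
  const { envVar, flag } = WARM_IDS[provider];
  const id = parsed.env?.[envVar];
  if (!id) throw new Error(`no ${envVar} found in prewarm result`);
  // Returned as CLI arguments to append to a bench.ts or matrix.ts invocation.
  return [flag, id];
}
```

The returned pair can be appended to the argument list of the next warm run, or the same value can be exported under the matching env var instead.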
Note: the Vercel fallback's repo-specific dependency repair for SWE-Smith tasks is not configured through environment variables — it is in-code setup in ts/src/bench.ts. See reports/failure-modes-tradeoffs.md for the rationale.