Skip to content

feat(0.15.0): cross-worker sandbox-reconnect durability#23

Merged
drewstone merged 3 commits into
mainfrom
feat/sandbox-reconnect
May 22, 2026
Merged

feat(0.15.0): cross-worker sandbox-reconnect durability#23
drewstone merged 3 commits into
mainfrom
feat/sandbox-reconnect

Conversation

@drewstone
Copy link
Copy Markdown
Contributor

Summary

A 15-minute agentic sandbox turn must survive the Cloudflare worker isolate dying mid-turn (deploy roll, CPU limit, OOM). runDurableTurn already replays a completed turn, but an interrupted turn re-runs from the top — the producer's streamPrompt generator died with the isolate.

The Tangle sandbox container is orchestrator-managed and outlives the worker. This PR adds runReconnectableTurn: it checkpoints a RunHandle at turn start so a fresh worker re-attaches to the in-flight sandbox run instead of re-prompting.

  • RunHandle{ kind: 'sandbox' | 'tcloud', sandboxId?, sessionId?, runId?, status, cursor? }. A pointer to a substrate run that outlives the isolate.
  • runReconnectableTurn — three resolution paths on a retry: replayed (turn already finished — cached text replays), reconnected (a running handle survived — calls the product's reconnect(handle) callback), rerun/fresh (no reconnectable handle — produces live).
  • reconnect(handle) is product-supplied substrate glue. Sandbox products wire the SDK's event-replay endpoint (GET {runtimeUrl}/agents/run/{runId}/events?lastEventId={cursor}); tcloud products omit it and fall through to a clean re-run.
  • Storage: the handle is checkpointed as a completed step at index 0; the turn runs at index 1. Reuses the existing completeStep JSON-result path with zero schema change — a completed step is the only shape startOrResume returns to a retry, and the handle must be readable while the turn step is still running. A new durable_steps column would force a migration across all three stores plus a new store method.

This is a thin handle registry, not a second durable-execution framework — the sandbox runtime is the durable engine; agent-runtime just remembers the pointer.

Spike findings (@tangle-network/[email protected])

Cross-worker attach is feasible. streamPrompt's reconnect uses executionId (run id, carried on the execution.started SSE frame's data) + lastEventId (the SSE id: cursor). The runtime exposes GET {runtimeUrl}/agents/run/{executionId}/events?lastEventId={cursor}&format=sse, reachable from any process via the public SandboxConnection.runtimeUrl + authToken. The SDK does not expose a one-call resumeRun(executionId) — its reconnect loop is closure-local — so the raw replay fetch is product-owned, which is exactly why reconnect is a product-supplied callback.

Test plan

  • pnpm typecheck passes
  • pnpm test — 231/231 pass (18 new in run-handle.test.ts)
  • New tests run across the InMemory / FileSystem / D1-over-sqlite store matrix
  • Covered: fresh turn registers a handle; retry with a running handle calls reconnect not produce; completed handle replays; running handle with no reconnect falls through to re-run; reconnect-stream failure fails the run (error not swallowed); register advances the persisted cursor
  • Confirmed the FileSystem-store concurrent-write race fix (handle writes drained before the turn-step write) is stable across 5 consecutive runs

drewstone added 3 commits May 20, 2026 20:41
A 15-minute agentic sandbox turn must survive the Cloudflare worker
isolate dying mid-turn. `runDurableTurn` already replays a *completed*
turn, but an *interrupted* one re-runs from the top — the producer's
`streamPrompt` generator died with the isolate.

The sandbox container is orchestrator-managed and outlives the worker.
`runReconnectableTurn` checkpoints a `RunHandle` — `{ kind, sandboxId,
sessionId, runId, status, cursor }` — at turn start. On a retry that
finds a `running` handle, a fresh worker calls a product-supplied
`reconnect(handle)` callback (which wires the sandbox SDK's event-replay
endpoint) instead of re-prompting. tcloud products omit `reconnect` and
fall through to a clean re-run.

The handle is checkpointed as a completed step at index 0; the turn runs
at index 1. This reuses the existing `completeStep` JSON-result path
with zero schema change — a completed step is the only shape
`startOrResume` returns to a retry, and the handle must be readable
while the turn step is still `running`.

Tests cover fresh / reconnected / replayed / rerun / reconnect-failure
across the InMemory / FileSystem / D1 store matrix.
biome flagged three errors — a `let` that should be `const`
(run-handle.test.ts), an unsorted export block (durable/index.ts), and
a formatting nit (run-handle.ts). All mechanical (biome --write), no
behaviour change. A pre-existing unused-import *warning* in
tests/agent.test.ts is left untouched — warnings do not fail CI and it
is outside this PR's scope.
@drewstone drewstone force-pushed the feat/sandbox-reconnect branch from af9fa2e to 252bcb0 Compare May 22, 2026 19:39
Copy link
Copy Markdown
Contributor

@tangletools tangletools left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified after bringing the branch current with main (merged 9 commits incl. the model-resolution primitive, no conflicts). runReconnectableTurn + RunHandle: checkpoints a run pointer at turn index 0, a fresh worker re-attaches to the in-flight sandbox run (replayed/reconnected/rerun/fresh) instead of re-prompting — reuses the existing completeStep path, zero schema change. Full gate on the merged tree: typecheck 0, 254 tests green (18 sandbox-reconnect across the InMemory/FileSystem/D1 store matrix), build success, biome clean. CI green.

@drewstone drewstone merged commit 6f0a16a into main May 22, 2026
1 check passed
@drewstone drewstone deleted the feat/sandbox-reconnect branch May 22, 2026 19:40
drewstone added a commit that referenced this pull request May 22, 2026
runReconnectableTurn (#23) recovered an interrupted turn only on a retry
re-invocation, left an unattended window between worker death and that
retry, depended on the sandbox runtime buffering events, and made the
correctness-critical reconnect a per-product callback. It checkpointed
the run handle as "a completed step at index 0" — an admitted
migration-dodge.

This relocates the durability boundary off the ephemeral worker onto an
always-attached supervisor that owns the run.

Substrate (platform-agnostic, tested in Node):
- DurableRunStore gains an ordered, replayable stream-event log —
  appendStreamEvent / readStreamEvents, idempotent on eventId so a
  reconnecting adapter that re-yields a boundary event cannot double-log.
  RunHandle is real run-row state via setRunHandle, not a step hack.
  Schema v2 (durable_stream_events table + durable_runs.handle_json),
  implemented across the in-memory / file-system / D1 stores.
- runSupervisedTurn — drains a run's events into the stream log as they
  flow, persists the reconnect pointer the instant the substrate yields
  it, heartbeats the lease. A fresh supervisor reads the log for its
  cursor and resumes via the adapter — fresh / resumed / replayed.
- SandboxReconnectAdapter — one typed, conformance-tested contract. The
  dangerous reconnect glue lives once per substrate, never per product.

Cloudflare host (thin):
- SessionSupervisorDO — a Durable Object that hosts runSupervisedTurn;
  alarm() re-attaches a run a dropped response stream abandoned. CF
  types are structural (no @cloudflare/workers-types dep).

runReconnectableTurn / run-handle.ts are removed — superseded; no
product consumed them. RunHandle + the four-mode resolution carry
forward. 15 new tests incl. the cross-worker chaos keystone (kill
mid-stream, resume, no gap, no duplicate); suite 251 green; typecheck +
biome + build clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants