An AI Operating System built on hexagonal architecture.
A single evidence-gated agent loop for bounded work — and a cooperative+adversarial agent harness
that designs and hardens whole systems. Hybrid inference: local models first, a
claude -p frontier path for the hard parts.
Architecture · ADR ledger · Benchmarks
Status (2026-06, alpha). hex is a working substrate with a real, measured execution model: an evidence-gated agent loop for bounded work, and a cooperative+adversarial harness that designs and hardens whole systems — exercised on three real systems built from one-line specs. The full capability runs on frontier inference; strictly-local on commodity hardware has a measured ceiling. What follows is what hex does today — every claim here is checkable against the source, the ADR ledger, or
docs/benchmarks/.
hex is a microkernel-based AI Operating System (AIOS) built on hexagonal architecture
(Ports & Adapters). It installs into a target project to orchestrate AI-driven development:
agents are the users, developers are the sysadmins. Hooks, skills, agents, and settings are
instantiated into the target project; examples/ holds sample targets.
The current design is one strong agent loop fed by tools, code-graph context, and memory —
not a simulated organization of many agents (that earlier org-sim epoch was
retired; see ARCHITECTURE.md). The differentiator is the quality of
context assembled for that single loop — code-graph relevance plus ranked lessons — and a
structure that turns frontier models into a disciplined, gated, architecturally-aware pipeline.
The canonical path is hex do → an evidence-gated ReAct loop (reason → act → observe →
repeat) over a curated, guarded toolset (read/verify tools + a terminal propose_edit; no
arbitrary shell). The whole loop lives in the hex-exec crate.
task + graph context + ranked lessons + windowed file
→ reason → read/verify tools → propose_edit → run evidence command
→ commit IFF it exits 0 (else revert; the gate is the sole authority on what commits)
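The commit rule in the flow above can be sketched in a few lines of Rust. This is a minimal illustration, not the code in `hex-exec`: the names `GateOutcome`, `evidence_passes`, and `gated_edit`, and the `apply` / `commit` / `revert` callbacks, are all hypothetical, and it assumes the evidence command is run through `sh -c`. Vacuous-pass detection is not modelled here.

```rust
use std::process::Command;

/// Outcome of one gated edit attempt.
#[derive(Debug, PartialEq)]
pub enum GateOutcome {
    Committed,
    Reverted,
}

/// Runs `evidence_cmd` through `sh -c` and reports whether it exited 0.
/// A command that cannot be spawned counts as a failure.
pub fn evidence_passes(evidence_cmd: &str) -> bool {
    Command::new("sh")
        .arg("-c")
        .arg(evidence_cmd)
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Applies an edit, runs the evidence command, then commits or reverts.
/// `apply`, `commit`, and `revert` are hypothetical callbacks standing in
/// for whatever the real loop does to the working tree.
pub fn gated_edit(
    evidence_cmd: &str,
    apply: impl FnOnce(),
    commit: impl FnOnce(),
    revert: impl FnOnce(),
) -> GateOutcome {
    apply();
    if evidence_passes(evidence_cmd) {
        commit();
        GateOutcome::Committed
    } else {
        revert();
        GateOutcome::Reverted
    }
}
```

The point of the shape is that nothing other than the exit status reaches the commit decision: the model's own claim of success is never consulted.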
What's actually wired (all shipped + validated — see the ADR ledger and the cited source):
- The evidence gate is the only thing that authorizes a commit: vacuous passes are detected and rejected (`hex-exec/src/direct_exec.rs::evidence_is_vacuous`), and failed edits are reverted atomically. No "it compiles" theater. (ADR-2026-05-19-0720, ADR-2026-06-04-1740.)
- Per-run worktree isolation: autonomous runs execute in a dedicated `hex/auto/<id>` worktree off the operator's branch under a distinct `hex-factory` identity, hard-guarded against the operator's tree, and merged back via `hex worktree merge` (`hex-exec/src/direct_workspace.rs`, ADR-2606071323).
- Evidence-gated best-of-N across complementary models: `hex do` iterates an ordered candidate list (`.hex/project.json → inference.react_models`, default `[devstral-small-2:24b, claude-code]`) and commits the first to pass the gate. The gate, not a classifier, picks the winner, so a mis-route only costs latency (`react_execute_best_of_n`, ADR-2606072044).
- `claude -p` frontier fallback: when local models fail, a `claude-code` candidate delegates the whole task to the operator's logged-in Claude CLI (`claude_execute` in `direct_react.rs`): no API key, no VRAM ceiling. Local runs are free and fast; Claude recovers the hard ones.
- Benchmark-driven model choice: `hex bench agentic` runs fixtures through the real loop in isolated worktrees and scores per-model pass-rates (`docs/benchmarks/`, ADR-2606071734).
- Memory-aware resource governor: `hex-exec/src/resource_governor.rs` gates admission by available memory so best-of-N and the swarm don't oversubscribe the box.
- Self-deploy: `hex dev deploy` builds, installs, and restarts in one command (ADR-2606071702).
- Hex-native frontier swarm: `hex swarm run` fans a task list out to parallel `claude -p` workers under a semaphore-bounded supervisor. hex orchestrates its own agents.
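The best-of-N selection described in the list above reduces to "try candidates in order, keep the first that passes the gate". A minimal sketch, assuming a hypothetical `attempt` callback that runs one model's loop plus the evidence command and returns `true` only on a gate pass (the real entry point is `react_execute_best_of_n`, whose signature is not shown in this document):

```rust
/// Tries candidates in order and returns the first whose attempt passes the gate.
/// `attempt` is a hypothetical stand-in for "run this model's loop, then run
/// the evidence command"; it returns true only when the gate passed.
pub fn best_of_n<'a>(
    candidates: &[&'a str],
    mut attempt: impl FnMut(&str) -> bool,
) -> Option<&'a str> {
    for &model in candidates {
        if attempt(model) {
            // First gate pass wins; later candidates never run.
            return Some(model);
        }
        // A failed attempt is assumed to have been reverted already.
    }
    // Nothing passed the gate, so nothing is committed.
    None
}
```

Because the gate decides, ordering the list wrongly changes only how many attempts run before a winner is found, which matches the "a mis-route only costs latency" claim.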
With CLAUDE_SESSION_ID unset, nexus drives the loop itself via an Ollama/OpenAI-compatible adapter —
no Claude CLI needed (ADR-2026-04-11-2000). hex doctor composition diagnoses the active variant.
The single loop above is for bounded work. For whole systems, hex has a cooperative+adversarial
harness — multiple claude -p agents that disagree, attack each other's work, and resolve against
a ground-truth gate (hex-exec/src/adversarial.rs). Two composable verbs:
- `hex swarm build '<challenge>' --target <dir> --gate '<test>'`: cooperative design. N agents propose divergent designs (durability-first, concurrency-first, …) → each is red-teamed → a lead synthesizes one spec → a build agent implements until the gate passes.
- `hex swarm review <path> --gate '<test>'`: adversarial hardening. Parallel reviewers hunt bugs by failure-class lens → each finding is skeptically verified (default-refute) → confirmed bugs are fixed under the gate.
- `--review` chains them: `hex swarm build … --review` runs the full design → harden pipeline.
What keeps it disciplined: a ground-truth test gate is the only authority, the verifier defaults
to refuting findings (so plausible-but-wrong bugs die before any edit), and every artifact is
independently re-verified (cargo test / tsc) — not taken on the agents' word.
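One way to read "default-refute" is that a finding starts out refuted and has to earn confirmation. The sketch below illustrates that reading only. It is not the logic in `hex-exec/src/adversarial.rs`: the `Finding` shape, the `repro_cmd` field, and the `repro_fails` callback are all assumptions made for illustration.

```rust
/// Verdict on a reviewer's finding.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Refuted,
    Confirmed,
}

/// A reviewer's claim, optionally backed by a reproduction command.
pub struct Finding {
    pub description: String,
    /// Hypothetical: a command expected to fail while the bug exists.
    pub repro_cmd: Option<String>,
}

/// Default-refute: a finding is confirmed only when it ships a reproduction
/// and that reproduction actually fails. `repro_fails` is a hypothetical
/// callback that runs the command and returns true on a non-zero exit.
pub fn verify(finding: &Finding, repro_fails: impl Fn(&str) -> bool) -> Verdict {
    match &finding.repro_cmd {
        Some(cmd) if repro_fails(cmd.as_str()) => Verdict::Confirmed,
        // No reproduction, or a reproduction that passes: the claim dies here.
        _ => Verdict::Refuted,
    }
}
```

Under this shape a plausible-sounding finding with no failing reproduction never reaches the fix step, so no edit is made on its behalf.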
Exercised — from one-line challenges, the harness built three real systems (now under examples/),
and the adversarial pass found bugs the builds' own passing tests missed. The bug counts are recorded
in ADR-2606081916:
| System (built from a one-line spec) | LOC | Adversarial review found |
|---|---|---|
| Concurrent durable job queue (WAL, crash-recovery): `examples/jobqueue-clean` | ~2900 | 6 real bugs (incl. silent WAL data-loss) |
| Thread-safe LRU + TTL cache: `examples/lru-clean` | ~1300 | 1 real bug (exception-safety) |
| Token-bucket rate limiter: `examples/ratelimiter-clean` | ~550 | 0 (clean by design) |
The 6 / 1 / 0 spread is the signature of a real tool — it finds bugs when they're there and reports
none when they're not. hex supplies the structure that makes it work: the divergent-design pipeline,
the skeptical-verify gate, the fix-loop, and the evidence anchor — orchestrating claude -p agents
into a disciplined build-and-harden pipeline you can point at a one-line spec and get tested,
architecturally-clean code back. (The counts come from those build sessions, recorded in the ADR
above; the systems themselves are reproducible from their gates.)
Concretely, with the receipts:
What works:
- The hexagonal architecture is real and self-enforced: `hex analyze .` grades the workspace A+ / 100 / 0 boundary violations over 712 source files (hex passes its own analyzer; the grade reflects the boundary rules the `hex-analysis` engine enforces). `hex analyze hex-nexus` is also A+; nexus went from F (30/100) before the crate split to A+ after (ADR-2606071340).
- The evidence-gated loop genuinely produces real, tested, committed code, and the gate holds under failure (a wandering model commits nothing).
- Best-of-N + the `claude -p` fallback let the loop recover across models automatically, validated live (a local model failed a task; Claude took over and committed).
- The cooperative+adversarial harness builds and hardens whole systems from one-line specs: the three above, each gated by its own tests, with the adversarial pass catching real bugs the build missed. hex even used it to find a bug in its own output.
The honest envelope:
- The full capability above runs on a frontier API or a logged-in `claude` CLI. Strictly local on commodity hardware has a ceiling (see the next section): there, the local loop is a reliable implementer of bounded work, and the frontier path takes the whole-system design and the hardest tasks. hex routes between them by measured fit, not by guessing.
- The benchmark corpus is small; treat any single number as directional, not gospel.
hex is model-agnostic (Ollama, vLLM, OpenAI-compatible, Claude). But the agentic loop — multi-turn tool use, not single-shot codegen — is demanding, and we measured it:
- It's a RAM problem, not just a VRAM one. Top open models are large MoEs (e.g. Qwen3-Coder-Next is ~51 GB of weights); on a 16 GB-GPU / 30 GB-RAM box they don't fit, even with offload. The reachable set is ~≤13 GB-resident models.
- No single local model dominates. A benchmark across the reachable models reordered the "best" model on every fixture: devstral leads on string tasks, qwen on algorithmic ones, and the top-of-the-leaderboard local model (`gpt-oss:20b`) scored last on our grid. Leaderboard scores do not predict agentic-loop performance.
- The language matters as much as the model. We ran the same CSV-parser task in Rust, TS, and Go (react, per-model pass rate; data in `docs/benchmarks/fixtures/t25-csv-parse*.json`):

  | Model | Rust | TS | Go |
  |---|---|---|---|
  | qwen2.5-coder:14b | 0/5 | 2/3 | 0/3 |
  | gpt-oss:20b | 0/5 | 1/3 | 0/3 |
  | devstral-small-2:24b | 5/5 | 3/3 | 2/3 |
  | gemma3:12b | 4/5 | 2/3 | 1/3 |

  The lesson isn't "static typing is hard". It's that TypeScript is uniquely forgiving, while Rust and Go are strict and hard for weaker local models. The two models that recover in TS (qwen, gpt-oss) crash right back to 0/3 in Go: its strictness (unused imports/vars are compile errors, byte-vs-rune) punishes them almost like Rust's borrow checker. So where the local ceiling sits, and therefore how load-bearing the `claude -p` fallback is, depends heavily on your language: the fallback matters least for TS/JS and most for Rust and Go.
- So hex doesn't bet on one model. It runs best-of-N across a complementary pair and falls back to `claude -p` for tasks locals can't finish. That's the honest path to reliability on this hardware: it matters most for Rust and is needed less for TS.
If you have a frontier API or a logged-in claude CLI, hex is strong. If you're strictly local on
commodity hardware, hex works but inherits the local models' ceiling — and the benchmark tells you
exactly where that is.
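The RAM arithmetic behind the local ceiling can be checked directly. The function below is a deliberately crude sketch with a hypothetical name: it tests only a necessary condition (weights must fit in VRAM plus system RAM combined) and ignores KV cache, activations, and OS overhead, so real headroom is smaller than it suggests.

```rust
/// Necessary (not sufficient) condition for loading a model with CPU offload:
/// the weights must fit in VRAM plus system RAM combined.
/// All sizes are in whole gigabytes.
pub fn could_fit_with_offload(weights_gb: u32, vram_gb: u32, ram_gb: u32) -> bool {
    weights_gb <= vram_gb + ram_gb
}
```

For the box described above, `could_fit_with_offload(51, 16, 30)` is false: 16 + 30 = 46 GB of total memory is already less than ~51 GB of weights, before any runtime overhead is counted.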
Full detail in ARCHITECTURE.md (the living map; always describes HEAD). The Rust workspace decomposes nexus behind ports (ADR-2606071340); the reusable core crates:
| Crate | Role |
|---|---|
| hex-core | Domain types + all port traits; the gravity center every crate depends on (no intra-workspace deps) |
| hex-exec | The agent engine: single-agent ReAct loop, best-of-N, claude -p delegate, the adversarial harness, the resource governor, guarded tools |
| hex-graph | Code-knowledge-graph engine → graph-out/graph.json (context_for, rank_lessons) |
| hex-analysis | Tree-sitter boundary checking; powers hex analyze |
| hex-git / hex-state | git plumbing (libgit2) · SpacetimeDB state adapter |
| hex-nexus | Composition root + daemon (axum :5555, dashboard, DI) — the only place adapters are wired |
| hex-cli | The canonical hex entry point |
Support crates round out the workspace: hex-agent (architecture-enforcement runtime),
hex-parser (parsing), hex-desktop (Tauri dashboard wrapper). SpacetimeDB (required) is the
coordination/state core — WASM modules live in spacetime-modules/; because WASM can't touch
FS/spawn/network, hex-nexus is the FS-bridge daemon.
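The crate layout above follows the usual Ports & Adapters shape: the core owns the trait, adapters implement it, and one composition root wires them together. A minimal sketch of that shape follows. Every name in it (`InferencePort`, `EchoAdapter`, `AgentLoop`, `compose`) is hypothetical and chosen for illustration; the real port traits live in `hex-core` and are not reproduced in this document.

```rust
/// Port: a trait owned by the core. (Hypothetical name.)
pub trait InferencePort {
    fn complete(&self, prompt: &str) -> String;
}

/// Adapter: one implementation of the port. A real adapter would call
/// Ollama, an OpenAI-compatible server, or the Claude CLI instead.
pub struct EchoAdapter;

impl InferencePort for EchoAdapter {
    fn complete(&self, prompt: &str) -> String {
        format!("echo: {}", prompt)
    }
}

/// Core logic depends only on the port, never on a concrete adapter.
pub struct AgentLoop<'a> {
    inference: &'a dyn InferencePort,
}

impl<'a> AgentLoop<'a> {
    pub fn new(inference: &'a dyn InferencePort) -> Self {
        Self { inference }
    }

    /// One step of the loop, reduced here to a single inference call.
    pub fn step(&self, task: &str) -> String {
        self.inference.complete(task)
    }
}

/// Composition root: the single place where an adapter is chosen and wired.
pub fn compose(adapter: &dyn InferencePort) -> AgentLoop<'_> {
    AgentLoop::new(adapter)
}
```

Swapping `EchoAdapter` for another implementation changes nothing in `AgentLoop`, which is the property a boundary checker can enforce mechanically: core code never names an adapter type.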
hex bootstrap # prerequisites, SpacetimeDB, Ollama (if present), config
hex nexus start # the daemon (dashboard at :5555)
hex do run --file <f> --evidence "<cmd that must exit 0>" "<what to do>"
hex bench agentic --filter <fixture> # measure a model through the real loop
hex swarm build "<challenge>" --target <dir> --gate "<test>" --review # design + harden
hex dev deploy # rebuild + install + restart, one command
hex analyze . # architecture grade + boundary violations

- ADRs are an append-only ledger: decisions are never edited or deleted; a changed decision gets a new ADR that supersedes the old one. Lifecycle: `Proposed → Accepted → Completed`, or `Rejected | Abandoned | Superseded | Deprecated`. Status changes only via `hex adr accept|complete|supersede`.
- Epochs group ADRs by design era (`foundation` → `org-sim` (retired) → `single-agent` → `hybrid-inference` (current, ADR-2606072243)). `hex adr reindex` regenerates the INDEX.
- ARCHITECTURE.md is the living map; the ADR ledger is its history. If code or older docs contradict the map, the map wins, and the contradiction is worth an ADR.
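The ADR lifecycle above is a small state machine, and can be sketched as one. Only the `Proposed → Accepted → Completed` path and the status names come from this document; which states may move to `Rejected`, `Abandoned`, `Superseded`, or `Deprecated` is an assumption made here for illustration, and `AdrStatus` / `can_transition` are hypothetical names.

```rust
/// ADR statuses named in the lifecycle above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AdrStatus {
    Proposed,
    Accepted,
    Completed,
    Rejected,
    Abandoned,
    Superseded,
    Deprecated,
}

/// Returns true when moving `from` → `to` is allowed.
/// Assumption: the off-path transitions listed here are illustrative guesses,
/// not the rules the `hex adr` commands actually enforce.
pub fn can_transition(from: AdrStatus, to: AdrStatus) -> bool {
    use AdrStatus::*;
    matches!(
        (from, to),
        (Proposed, Accepted)
            | (Accepted, Completed)
            | (Proposed, Rejected)
            | (Proposed, Abandoned)
            | (Accepted, Abandoned)
            | (Accepted, Superseded)
            | (Completed, Superseded)
            | (Completed, Deprecated)
    )
}
```

Whatever the exact table, encoding it as a closed match keeps the ledger append-only in spirit: there is no transition back to `Proposed`, so changing a decision means writing a new ADR.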
The hex-graph code-knowledge engine is graphify-influenced — a GraphRAG-style code graph
(typed nodes + edges, community detection, EXTRACTED/INFERRED/AMBIGUOUS confidence levels)
reimplemented natively in Rust. The single-agent execution model (one gateway-mediated ReAct loop
fed by code-graph context + memory, ADR-2606061359) converges with ideas from OpenClaw and
Hermes Agent (Nous Research). These shaped the design; the implementation is hex's own.
Operational rules (how to drive hex day-to-day) live in CLAUDE.md. The architecture map is ARCHITECTURE.md.