GitHub - gaberger/hex: The operating system for AI agents. Manage processes. Enforce architecture. Coordinate swarms. Route inference.

An AI Operating System built on hexagonal architecture.
A single evidence-gated agent loop for bounded work — and a cooperative+adversarial agent harness that designs and hardens whole systems. Hybrid inference: local models first, a claude -p frontier path for the hard parts.

Architecture · ADR ledger · Benchmarks

Status (2026-06, alpha). hex is a working substrate with a real, measured execution model: an evidence-gated agent loop for bounded work, and a cooperative+adversarial harness that designs and hardens whole systems — exercised on three real systems built from one-line specs. The full capability runs on frontier inference; strictly-local on commodity hardware has a measured ceiling. What follows is what hex does today — every claim here is checkable against the source, the ADR ledger, or docs/benchmarks/.

What hex is

hex is a microkernel-based AI Operating System (AIOS) built on hexagonal architecture (Ports & Adapters). It installs into a target project to orchestrate AI-driven development: agents are the users, developers are the sysadmins. Hooks, skills, agents, and settings are instantiated into the target project; examples/ holds sample targets.

The current design is one strong agent loop fed by tools, code-graph context, and memory — not a simulated organization of many agents (that earlier org-sim epoch was retired; see ARCHITECTURE.md). The differentiator is the quality of context assembled for that single loop — code-graph relevance plus ranked lessons — and a structure that turns frontier models into a disciplined, gated, architecturally-aware pipeline.

The execution model

The canonical path is hex do → an evidence-gated ReAct loop (reason → act → observe → repeat) over a curated, guarded toolset (read/verify tools + a terminal propose_edit; no arbitrary shell). The whole loop lives in the hex-exec crate.

task + graph context + ranked lessons + windowed file
  → reason → read/verify tools → propose_edit → run evidence command
  → commit IFF it exits 0   (else revert; the gate is the sole authority on what commits)
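
The commit-or-revert rule in the flow above can be sketched in a few lines. This is a minimal illustration, not the code in hex-exec: `gated_edit` and `run_evidence` are hypothetical names, the "file" is an in-memory string, and the evidence command is assumed to run through `sh -c`.

```rust
use std::process::Command;

/// Runs an evidence command through `sh -c` and reports whether it exited 0.
/// A command that cannot be spawned counts as a failure.
fn run_evidence(cmd: &str) -> bool {
    Command::new("sh")
        .arg("-c")
        .arg(cmd)
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Applies `edit` to `file`, then keeps the result only if `evidence` passes.
/// On failure the snapshot taken before the edit is restored.
/// Returns true when the edit was kept ("committed").
fn gated_edit<E, G>(file: &mut String, edit: E, evidence: G) -> bool
where
    E: FnOnce(&mut String),
    G: FnOnce(&str) -> bool,
{
    let snapshot = file.clone();
    edit(file);
    if evidence(file.as_str()) {
        true
    } else {
        *file = snapshot; // revert: the gate is the only authority
        false
    }
}
```

In a real run the evidence closure would shell out, for example `|_| run_evidence("cargo test")`, so the decision rests on an exit code rather than on the model's own report.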

What's actually wired (all shipped + validated — see the ADR ledger and the cited source):

The evidence gate is the only thing that authorizes a commit — vacuous passes are detected and rejected (hex-exec/src/direct_exec.rs::evidence_is_vacuous), failed edits reverted atomically. No "it compiles" theater. (ADR-2026-05-19-0720, ADR-2026-06-04-1740.)
Per-run worktree isolation — autonomous runs execute in a dedicated hex/auto/<id> worktree off the operator's branch under a distinct hex-factory identity, hard-guarded against the operator's tree, merged back via hex worktree merge (hex-exec/src/direct_workspace.rs, ADR-2606071323).
Evidence-gated best-of-N across complementary models — hex do iterates an ordered candidate list (.hex/project.json → inference.react_models, default [devstral-small-2:24b, claude-code]) and commits the first to pass the gate. The gate, not a classifier, picks the winner — a mis-route only costs latency (react_execute_best_of_n, ADR-2606072044).
claude -p frontier fallback — when local models fail, a claude-code candidate delegates the whole task to the operator's logged-in Claude CLI (claude_execute in direct_react.rs): no API key, no VRAM ceiling. Local runs free/fast; Claude recovers the hard ones.
Benchmark-driven model choice — hex bench agentic runs fixtures through the real loop in isolated worktrees and scores per-model pass-rates (docs/benchmarks/, ADR-2606071734).
Memory-aware resource governor — hex-exec/src/resource_governor.rs gates admission by available memory so best-of-N and the swarm don't oversubscribe the box.
Self-deploy — hex dev deploy builds, installs, and restarts in one command (ADR-2606071702).
Hex-native frontier swarm — hex swarm run fans a task list out to parallel claude -p workers under a semaphore-bounded supervisor. hex orchestrates its own agents.
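
The best-of-N item in the list above reduces to "walk an ordered list, stop at the first gated pass". A minimal sketch of that selection rule follows; `best_of_n` and the `attempt` callback are illustrative names, and `attempt` stands in for running the loop with one model and then the evidence command.

```rust
/// Default candidate order quoted in the list above.
const DEFAULT_REACT_MODELS: [&str; 2] = ["devstral-small-2:24b", "claude-code"];

/// Tries each candidate in order and returns the first one whose attempt
/// passes the gate. Later candidates are never run once one has passed.
fn best_of_n<'a, F>(candidates: &[&'a str], mut attempt: F) -> Option<&'a str>
where
    F: FnMut(&str) -> bool,
{
    candidates.iter().copied().find(|model| attempt(*model))
}
```

Because the gate decides, ordering only affects cost: putting the local model first means the frontier candidate runs only when the local attempt fails.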

With CLAUDE_SESSION_ID unset, nexus drives the loop itself via an Ollama/OpenAI-compatible adapter — no Claude CLI needed (ADR-2026-04-11-2000). hex doctor composition diagnoses the active variant.

The agentic harness

The single loop above is for bounded work. For whole systems, hex has a cooperative+adversarial harness — multiple claude -p agents that disagree, attack each other's work, and resolve against a ground-truth gate (hex-exec/src/adversarial.rs). Two composable verbs:

hex swarm build '<challenge>' --target <dir> --gate '<test>' — cooperative design: N agents propose divergent designs (durability-first, concurrency-first, …) → each is red-teamed → a lead synthesizes one spec → a build agent implements until the gate passes.
hex swarm review <path> --gate '<test>' — adversarial hardening: parallel reviewers hunt bugs by failure-class lens → each finding is skeptically verified (default-refute) → confirmed bugs are fixed under the gate.
--review chains them: hex swarm build … --review runs the full design → harden pipeline.

What keeps it disciplined: a ground-truth test gate is the only authority, the verifier defaults to refuting findings (so plausible-but-wrong bugs die before any edit), and every artifact is independently re-verified (cargo test / tsc) — not taken on the agents' word.
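
The default-refute rule can be illustrated as a verifier that starts from "refuted" and flips only on positive evidence. The sketch below is an assumption about shape, not the logic in adversarial.rs: `Finding`, `Verdict`, and the idea of a named reproduction checked against the gate are all hypothetical.

```rust
/// A reviewer's claim; the fields are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct Finding {
    summary: String,
    /// Name of a reproduction, if the verifier managed to build one.
    repro: Option<String>,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Confirmed,
    Refuted,
}

/// Default-refute: a finding is confirmed only when a reproduction exists
/// AND the ground-truth gate fails on it. Everything else is refuted.
fn verify<G>(finding: &Finding, gate_fails_on: G) -> Verdict
where
    G: Fn(&str) -> bool,
{
    match &finding.repro {
        Some(repro) if gate_fails_on(repro.as_str()) => Verdict::Confirmed,
        _ => Verdict::Refuted,
    }
}

/// Keeps only the findings that survive verification; only these
/// would go on to the fix loop.
fn confirmed<'a, G>(findings: &'a [Finding], gate_fails_on: G) -> Vec<&'a Finding>
where
    G: Fn(&str) -> bool,
{
    findings
        .iter()
        .filter(|f| verify(f, &gate_fails_on) == Verdict::Confirmed)
        .collect()
}
```

The point of the design is the fallthrough arm: a finding with no reproduction, or one the gate does not fail on, never reaches an edit.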

Exercised — from one-line challenges, the harness built three real systems (now under examples/), and the adversarial pass found bugs the builds' own passing tests missed. The bug counts are recorded in ADR-2606081916:

System (built from a one-line spec)	LOC	Adversarial review found
Concurrent durable job queue (WAL, crash-recovery) — `examples/jobqueue-clean`	~2900	6 real bugs (incl. silent WAL data-loss)
Thread-safe LRU + TTL cache — `examples/lru-clean`	~1300	1 real bug (exception-safety)
Token-bucket rate limiter — `examples/ratelimiter-clean`	~550	0 (clean by design)

The 6 / 1 / 0 spread is the signature of a real tool — it finds bugs when they're there and reports none when they're not. hex supplies the structure that makes it work: the divergent-design pipeline, the skeptical-verify gate, the fix-loop, and the evidence anchor — orchestrating claude -p agents into a disciplined build-and-harden pipeline you can point at a one-line spec and get tested, architecturally-clean code back. (The counts come from those build sessions, recorded in the ADR above; the systems themselves are reproducible from their gates.)

What hex can do today

Concretely, with the receipts:

What works:

The hexagonal architecture is real and self-enforced — hex analyze . grades the workspace A+ / 100 / 0 boundary violations over 712 source files (hex passes its own analyzer; the grade reflects the boundary rules the hex-analysis engine enforces). hex analyze hex-nexus is also A+ — nexus went from F (30/100) before the crate split to A+ after (ADR-2606071340).
The evidence-gated loop genuinely produces real, tested, committed code, and the gate holds under failure (a wandering model commits nothing).
Best-of-N + the claude -p fallback let the loop recover across models automatically — validated live (a local model failed a task; Claude took over and committed).
The cooperative+adversarial harness builds and hardens whole systems from one-line specs — the three above, each gated by its own tests, with the adversarial pass catching real bugs the build missed. hex even used it to find a bug in its own output.

The honest envelope:

The full capability above runs on a frontier API or a logged-in claude CLI. A strictly local setup on commodity hardware has a ceiling (see the next section): there, the local loop is a reliable implementer of bounded work, and the frontier path takes the whole-system design and the hardest tasks. hex routes between them by measured fit, not by guessing.
The benchmark corpus is small; treat any single number as directional, not gospel.

Local AI: the honest picture

hex is model-agnostic (Ollama, vLLM, OpenAI-compatible, Claude). But the agentic loop — multi-turn tool use, not single-shot codegen — is demanding, and we measured it:

It's a RAM problem, not just a VRAM one. Top open models are large MoEs (e.g. Qwen3-Coder-Next is ~51 GB of weights); on a 16 GB-GPU / 30 GB-RAM box they don't fit, even with offload. The reachable set is models of roughly 13 GB resident or less.
No single local model dominates. A benchmark across the reachable models reordered the "best" model on every fixture — devstral leads on string tasks, qwen on algorithmic ones, and the top-of-the-leaderboard local model (gpt-oss:20b) scored last on our grid. Leaderboard scores do not predict agentic-loop performance.
The language matters as much as the model. We ran the same CSV-parser task in Rust, TS, and Go (react, per-model pass rate; data in docs/benchmarks/fixtures/t25-csv-parse*.json):

Model	Rust	TS	Go
qwen2.5-coder:14b	0/5	2/3	0/3
gpt-oss:20b	0/5	1/3	0/3
devstral-small-2:24b	5/5	3/3	2/3
gemma3:12b	4/5	2/3	1/3

The lesson isn't "static typing is hard": TypeScript is uniquely forgiving, while Rust and Go are strict and hard for weaker local models. The two models that recover in TS (qwen, gpt-oss) drop right back to 0/3 in Go, whose strictness (unused imports/vars are compile errors, byte-vs-rune) punishes them almost like Rust's borrow checker. So the local ceiling, and with it how load-bearing the claude -p fallback is, depends heavily on your language: the fallback matters least for TS/JS and most for Rust and Go.
So hex doesn't bet on one model. It runs best-of-N across a complementary pair and falls back to claude -p for tasks locals can't finish. That's the honest path to reliability on this hardware: needed most for Rust and Go, least for TS.

If you have a frontier API or a logged-in claude CLI, hex is strong. If you're strictly local on commodity hardware, hex works but inherits the local models' ceiling — and the benchmark tells you exactly where that is.
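
To read the CSV-parser table as rates rather than fractions, the sketch below copies its counts and asks which models clear a given pass-rate floor in every language. The floor is an illustrative parameter, not a threshold hex uses, and `clears_floor_everywhere` is a hypothetical helper.

```rust
/// (passes, attempts) per language, copied from the CSV-parser table above.
struct Row {
    model: &'static str,
    rust: (u32, u32),
    ts: (u32, u32),
    go: (u32, u32),
}

static GRID: [Row; 4] = [
    Row { model: "qwen2.5-coder:14b", rust: (0, 5), ts: (2, 3), go: (0, 3) },
    Row { model: "gpt-oss:20b", rust: (0, 5), ts: (1, 3), go: (0, 3) },
    Row { model: "devstral-small-2:24b", rust: (5, 5), ts: (3, 3), go: (2, 3) },
    Row { model: "gemma3:12b", rust: (4, 5), ts: (2, 3), go: (1, 3) },
];

/// Pass rate for one (passes, attempts) cell.
fn rate((passes, attempts): (u32, u32)) -> f64 {
    passes as f64 / attempts as f64
}

/// Models whose pass rate is at least `floor` in Rust, TS, and Go alike.
fn clears_floor_everywhere(floor: f64) -> Vec<&'static str> {
    GRID.iter()
        .filter(|row| [row.rust, row.ts, row.go].iter().all(|&cell| rate(cell) >= floor))
        .map(|row| row.model)
        .collect()
}
```

With three to five attempts per cell these rates are coarse, which is why the text treats any single number as directional.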

Architecture

Full detail in ARCHITECTURE.md (the living map; always describes HEAD). The Rust workspace decomposes nexus behind ports (ADR-2606071340); the reusable core crates:

Crate	Role
hex-core	Domain types + all port traits; the gravity center every crate depends on (it has no intra-workspace deps of its own)
hex-exec	The agent engine: single-agent ReAct loop, best-of-N, `claude -p` delegate, the adversarial harness, the resource governor, guarded tools
hex-graph	Code-knowledge-graph engine → `graph-out/graph.json` (`context_for`, `rank_lessons`)
hex-analysis	Tree-sitter boundary checking; powers `hex analyze`
hex-git / hex-state	git plumbing (libgit2) · SpacetimeDB state adapter
hex-nexus	Composition root + daemon (axum `:5555`, dashboard, DI) — the only place adapters are wired
hex-cli	The canonical `hex` entry point

Support crates round out the workspace: hex-agent (architecture-enforcement runtime), hex-parser (parsing), hex-desktop (Tauri dashboard wrapper). SpacetimeDB (required) is the coordination/state core — WASM modules live in spacetime-modules/; because WASM can't touch FS/spawn/network, hex-nexus is the FS-bridge daemon.
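
For readers new to Ports & Adapters, the split described above (port traits in the core, adapters wired only at the composition root) looks roughly like this. Every name here is hypothetical: `InferencePort`, both adapters, `solve`, and `compose` are stand-ins, not hex's actual traits or types.

```rust
/// Port: declared next to the domain, with no knowledge of any adapter.
trait InferencePort {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

/// Stand-in for a local, OpenAI-compatible adapter with a small context.
struct LocalAdapter {
    max_prompt_len: usize,
}

/// Stand-in for a frontier delegate such as `claude -p`.
struct FrontierAdapter;

impl InferencePort for LocalAdapter {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        if prompt.len() <= self.max_prompt_len {
            Ok(format!("local: {}", prompt))
        } else {
            Err("prompt exceeds local context".to_string())
        }
    }
}

impl InferencePort for FrontierAdapter {
    fn complete(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("frontier: {}", prompt))
    }
}

/// Core logic depends only on the port, never on a concrete adapter.
fn solve(ports: &[Box<dyn InferencePort>], prompt: &str) -> Option<String> {
    ports.iter().find_map(|port| port.complete(prompt).ok())
}

/// Composition root: the single place where concrete adapters are named.
fn compose() -> Vec<Box<dyn InferencePort>> {
    let mut ports: Vec<Box<dyn InferencePort>> = Vec::new();
    ports.push(Box::new(LocalAdapter { max_prompt_len: 32 }));
    ports.push(Box::new(FrontierAdapter));
    ports
}
```

A boundary checker in this style only has to confirm that nothing outside `compose` mentions `LocalAdapter` or `FrontierAdapter` by name.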

Quick start

hex bootstrap          # prerequisites, SpacetimeDB, Ollama (if present), config
hex nexus start        # the daemon (dashboard at :5555)
hex do run --file <f> --evidence "<cmd that must exit 0>" "<what to do>"
hex bench agentic --filter <fixture>   # measure a model through the real loop
hex swarm build "<challenge>" --target <dir> --gate "<test>" --review   # design + harden
hex dev deploy         # rebuild + install + restart, one command
hex analyze .          # architecture grade + boundary violations

Governance

ADRs are an append-only ledger — decisions are never edited or deleted; a changed decision gets a new ADR that supersedes the old one. Lifecycle: Proposed → Accepted → Completed, or Rejected | Abandoned | Superseded | Deprecated. Status changes only via hex adr accept|complete|supersede.
Epochs group ADRs by design era (foundation → org-sim (retired) → single-agent → hybrid-inference (current — ADR-2606072243)). hex adr reindex regenerates the INDEX.
ARCHITECTURE.md is the living map; the ADR ledger is its history. If code or older docs contradict the map, the map wins — and the contradiction is worth an ADR.
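
The append-only rule and the lifecycle above can be modelled as a small state machine. This is a sketch under assumptions: the source names the states and the accept|complete|supersede verbs but not the exact transition table, so the allowed transitions below (Proposed to Accepted, Accepted to Completed, Accepted or Completed to Superseded) are a guess, and `Ledger` is a hypothetical type.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
#[allow(dead_code)] // remaining terminal states are listed for completeness
enum Status { Proposed, Accepted, Completed, Rejected, Abandoned, Superseded, Deprecated }

#[derive(Debug)]
struct Adr {
    title: String,
    status: Status,
    supersedes: Option<usize>,
}

#[derive(Default)]
struct Ledger {
    entries: Vec<Adr>, // entries are only ever appended, never removed
}

impl Ledger {
    fn propose(&mut self, title: &str) -> usize {
        self.entries.push(Adr { title: title.to_string(), status: Status::Proposed, supersedes: None });
        self.entries.len() - 1
    }

    /// Moves an ADR to `to` only if its current status is listed in `from`.
    fn transition(&mut self, id: usize, from: &[Status], to: Status) -> Result<(), String> {
        let adr = self.entries.get_mut(id).ok_or_else(|| format!("no ADR {}", id))?;
        if from.contains(&adr.status) {
            adr.status = to;
            Ok(())
        } else {
            Err(format!("cannot move ADR {} from {:?} to {:?}", id, adr.status, to))
        }
    }

    fn accept(&mut self, id: usize) -> Result<(), String> {
        self.transition(id, &[Status::Proposed], Status::Accepted)
    }

    fn complete(&mut self, id: usize) -> Result<(), String> {
        self.transition(id, &[Status::Accepted], Status::Completed)
    }

    /// A changed decision appends a new ADR that points at the old one.
    fn supersede(&mut self, old: usize, title: &str) -> Result<usize, String> {
        self.transition(old, &[Status::Accepted, Status::Completed], Status::Superseded)?;
        let new = self.propose(title);
        self.entries[new].supersedes = Some(old);
        Ok(new)
    }
}
```

Only the status field ever changes on an existing entry; the decision text itself stays as written, which is what keeps the ledger usable as history.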

Influences & attestation

The hex-graph code-knowledge engine is graphify-influenced — a GraphRAG-style code graph (typed nodes + edges, community detection, EXTRACTED/INFERRED/AMBIGUOUS confidence levels) reimplemented natively in Rust. The single-agent execution model (one gateway-mediated ReAct loop fed by code-graph context + memory, ADR-2606061359) converges with ideas from OpenClaw and Hermes Agent (Nous Research). These shaped the design; the implementation is hex's own.

Operational rules (how to drive hex day-to-day) live in CLAUDE.md. The architecture map is ARCHITECTURE.md.
