throw-if-null · MirzaMerdovic · May 14, 2026 · May 14, 2026 · May 14, 2026
diff --git a/docs/README.md b/docs/README.md
@@ -21,6 +21,9 @@ Use it to understand what the project is for, which constraints must hold, which
   - the major runtime parts and how they fit together
 - `docs/architecture/auth-model.md`
   - the current runtime auth model and transitional bot token path
+- `docs/architecture/testing.md`
+  - the repository's behavioral, integration-first testing strategy
+  - preferred test surfaces and when unit tests are justified
 
 ### Decision history
 - `docs/architecture/adrs/`
@@ -65,6 +68,7 @@ Use it to understand what the project is for, which constraints must hold, which
 - Architecture guardrails: read `docs/architecture/invariants.md`
 - System shape: read `docs/architecture/overview.md`
 - Auth and bot identity model: read `docs/architecture/auth-model.md`
+- Testing strategy and preferred test boundaries: read `docs/architecture/testing.md`
 - Terminology and identity definitions: read `docs/glossary.md`
 - "Why is it designed this way?": read the ADRs
 - "What is intentionally imperfect right now?": read `docs/project/debt.md`

diff --git a/docs/architecture/testing.md b/docs/architecture/testing.md
@@ -0,0 +1,153 @@
+# `orfe` testing strategy
+
+## Summary
+
+`orfe` prefers **behavioral tests over structural tests**.
+
+When choosing a test shape, optimize for the most confidence per test:
+
+1. prefer **live E2E** when the flow is high-value, safe, and deterministic
+2. otherwise prefer **integration tests with minimal mocking**
+3. use **unit tests rarely and intentionally**
+
+Broad behavioral coverage is more valuable here than a large count of narrow tests or inflated coverage numbers.
+
+This strategy preserves the architecture invariants from ADR 0001, ADR 0002, ADR 0007, and `docs/architecture/invariants.md`: tests should reinforce `orfe` as a generic runtime, keep the wrapper/core boundary explicit, and treat command slices as the semantic owners of command behavior.
+
+## Decision rule
+
+Before adding a test, ask:
+
+- what public or architectural boundary am I protecting?
+- can a broader behavioral test cover this with the same or better confidence?
+- am I mocking only true external boundaries?
+- if this is a unit test, why would a broader integration test be insufficient or too costly?
+
+If a test mostly restates implementation structure, command metadata, or wrapper passthrough, it is probably not the right default.
+
+## Preferred test surfaces in this repository
+
+### 1. CLI boundary
+
+Prefer CLI tests when protecting:
+
+- command-line parsing
+- usage errors and help output
+- JSON success and error envelopes
+- caller requirements for direct CLI usage
+
+Use the real `runCli` path with minimal stubbing around config, auth, and network boundaries.
+
+### 2. OpenCode tool boundary
+
+Prefer OpenCode tool tests when protecting:
+
+- `context.agent` handling
+- tool-specific input rejection
+- OpenCode result envelopes
+- plugin/tool contract behavior that is not visible from CLI tests
+
+Test the real `executeOrfeTool` path. Only keep plugin-shape tests when the plugin schema itself is part of the public contract under protection.
+
+### 3. Core runtime boundary
+
+Prefer core tests when protecting:
+
+- wrapper-independent command behavior
+- command dispatch, validation, and dependency loading rules
+- auth/config/GitHub requirements at the runtime boundary
+
+These tests should exercise the real `runOrfeCore` composition path rather than mocking internal collaborators slice-by-slice.
+
+### 4. Command + GitHub integration path
+
+For GitHub-backed commands, default to integration tests that:
+
+- run the real command through CLI, tool, or core
+- use real slice definitions and handlers
+- mock only external boundaries such as GitHub HTTP, machine-local auth config, or repo-local config loading
+
+This is the main practical default for `orfe` today.
+
+### 5. Template runtime behavior
+
+Template loading, validation, provenance, and normalization should be tested behaviorally at the template/runtime boundary rather than by inspecting template-loading implementation details.
+
+## Mocking guidance
+
+Mock only true external boundaries when practical:
+
+- GitHub network calls
+- machine-local auth/config files
+- repo-local config discovery when a test needs a controlled repo shape
+- runtime host context such as `context.agent`
+
+Avoid over-mocking internal collaborators when the real integration path is easy to exercise.
+
+## When unit tests are justified
+
+Unit tests are acceptable, but they are the exception rather than the default.
+
+Use them mainly for:
+
+- sensitive auth or security logic
+- tricky normalization or validation logic
+- small pure functions with failure modes that are hard to diagnose from higher-level tests
+- performance-sensitive logic where isolated feedback materially helps
+
+Do **not** default to unit tests just because they are easy to write or increase coverage percentages.
+
+## Low-value test patterns to avoid
+
+Treat these categories skeptically and keep them only when they protect a meaningful public contract:
+
+- **thin forwarding tests** that only prove one wrapper passes inputs straight through
+- **source-text assertions** that inspect implementation files instead of runtime behavior
+- **shallow metadata tests** that mostly restate command literals, option declarations, or examples
+- **duplicated routing/help assertions across layers** when one slice-local suite plus one real boundary suite already covers the contract
+- **over-mocked tests** that replace practical internal integration with isolated doubles
+
+Metadata tests are justified only when the metadata itself is the contract, such as published package entrypoints or the OpenCode plugin schema.
+
+## Current suite review and changes applied in issue #159
+
+This issue reviewed the existing suite against the strategy above and reduced several low-signal categories:
+
+- removed thin forwarding tests for the CLI entrypoint wrapper and OpenCode core-input passthrough
+- removed duplicated runtime-routing/help tests from broad cross-cutting suites when slice-local help tests already covered the same behavior
+- removed metadata-heavy command definition tests that mainly restated static option declarations instead of protecting runtime behavior
+- removed the CLI source shebang source-text assertion because it inspected implementation text rather than observable behavior
+- kept stronger behavioral tests around CLI execution, OpenCode behavior, core runtime behavior, template validation, and GitHub-backed command flows
+- kept contract-oriented metadata checks only where the metadata itself is meaningful public surface, such as package entrypoints and plugin schema exposure
+
+The suite should continue moving toward fewer but broader behavioral tests, especially when a command already has strong CLI/core/tool coverage.
+
+## Live E2E guidance
+
+Live GitHub E2E is worth keeping or adding later for a **small smoke set**, not for every workflow.
+
+Highest-value candidates are:
+
+- bot-auth token minting against a real sandbox installation
+- one read/write issue flow in a disposable sandbox repo
+- one PR creation or update flow in a disposable sandbox repo
+- one project status mutation smoke test if the sandbox setup is reliable
+
+Live E2E should stay selective because `orfe` depends on bot auth, GitHub state, and external API stability. Integration tests with mocked external boundaries remain the main default until safe sandbox E2E is routine.
+
+## Reviewer checklist
+
+When reviewing a new test, prefer these outcomes:
+
+- a meaningful repo boundary is exercised
+- internal mocking is minimal
+- the test protects behavior, not implementation layout
+- a unit test includes a clear reason it is narrower than the default strategy
+- no one is using coverage percentage as the primary success metric
+
+## Related docs
+
+- `docs/README.md`
+- `docs/architecture/invariants.md`
+- `docs/architecture/overview.md`
+- `docs/project/debt.md`
diff --git a/docs/project/debt.md b/docs/project/debt.md
@@ -38,3 +38,8 @@ This file keeps known documentation, architecture, and process debt visible so i
 - **Impact:** the repository now has a versioned template foundation under `.orfe/templates/`, but `.github/ISSUE_TEMPLATE/feature.md` and `.github/pull_request_template.md` still remain active human-facing fallback aids. That creates a temporary dual-source risk for artifact structure expectations.
 - **Current treatment:** treat versioned repo-defined templates as the canonical runtime source for validated agent-authored artifacts, while keeping the GitHub-native templates as transitional workflow aids that `orfe` does not read or depend on.
 - **Follow-up direction:** once contract-driven authoring and validation are in routine use, reduce or realign the GitHub-native templates intentionally so durable structure expectations do not drift.
+
+### 8. Live GitHub E2E coverage is still intentionally limited
+- **Impact:** the testing strategy prefers live E2E where practical, but the current suite still relies mainly on integration tests with mocked GitHub boundaries. That leaves a gap in real bot-auth and live GitHub smoke coverage.
+- **Current treatment:** keep integration tests as the main default because they are deterministic and practical, and reserve live E2E for future high-value smoke coverage only.
+- **Follow-up direction:** add a small sandboxed live E2E smoke set for token minting and a few representative issue/PR/project flows when the repository has a safe, reliable sandbox path.
diff --git a/src/commands/auth/token/definition.test.ts b/src/commands/auth/token/definition.test.ts
diff --git a/src/commands/help/definition.test.ts b/src/commands/help/definition.test.ts
@@ -4,16 +4,22 @@ import { test } from 'vitest';
 
 import { COMMANDS, helpCommand } from '../index.js';
 import { validateCommandInput } from '../registry/index.js';
-import { assertDefinitionIdentity, assertOption } from '../../../test/support/definition-test.js';
 import { createHelpCommandSuccessData, createHelpRootSuccessData } from './definition.js';
 
 test('help definition stays a top-level runtime command with optional targeted lookup', () => {
-  assertDefinitionIdentity(helpCommand, { name: 'help', group: 'help', leaf: 'help', execution: 'runtime', topLevel: true });
-  assertOption(helpCommand, 'command_name', {
-    flag: '--command-name',
-    type: 'string',
-    description: 'Return detailed help for one canonical command name.',
-  });
+  assert.equal(helpCommand.name, 'help');
+  assert.equal(helpCommand.group, 'help');
+  assert.equal(helpCommand.leaf, 'help');
+  assert.equal(helpCommand.execution, 'runtime');
+  assert.equal(helpCommand.topLevel, true);
+  assert.deepEqual(helpCommand.options, [
+    {
+      key: 'command_name',
+      flag: '--command-name',
+      description: 'Return detailed help for one canonical command name.',
+      type: 'string',
+    },
+  ]);
   assert.equal(helpCommand.requiresCaller, false);
   assert.equal(helpCommand.requiresRepoConfig, false);
   assert.equal(helpCommand.requiresAuthConfig, false);

diff --git a/src/commands/issue/comment/definition.test.ts b/src/commands/issue/comment/definition.test.ts
diff --git a/src/commands/issue/create/definition.test.ts b/src/commands/issue/create/definition.test.ts
diff --git a/src/commands/issue/get/definition.test.ts b/src/commands/issue/get/definition.test.ts
diff --git a/src/commands/issue/set-state/cli.test.ts b/src/commands/issue/set-state/cli.test.ts
@@ -246,3 +246,36 @@ test('runCli formats core invalid_usage errors as CLI usage failures', async ()
   assert.match(result.stderr, /Example: ORFE_CALLER_NAME=Greg orfe issue set-state --issue-number 14 --state closed --state-reason completed/);
   assert.match(result.stderr, /See: orfe issue set-state --help/);
 });
+
+test('runCli requires --duplicate-of when closing an issue as duplicate', async () => {
+  const result = await invokeCli(['issue', 'set-state', '--issue-number', '14', '--state', 'closed', '--state-reason', 'duplicate'], {
+    env: { ORFE_CALLER_NAME: 'Greg' },
+    loadRepoConfigImpl: async () => {
+      throw new OrfeError('internal_error', 'loadRepoConfigImpl should not run');
+    },
+  });
+
+  assert.equal(result.exitCode, 2);
+  assert.equal(result.stdout, '');
+  assert.match(result.stderr, /issue set-state requires --duplicate-of when state_reason=duplicate\./);
+  assert.match(result.stderr, /Usage: orfe issue set-state/);
+  assert.match(result.stderr, /See: orfe issue set-state --help/);
+});
+
+test('runCli rejects marking an issue as a duplicate of itself', async () => {
+  const result = await invokeCli(
+    ['issue', 'set-state', '--issue-number', '14', '--state', 'closed', '--state-reason', 'duplicate', '--duplicate-of', '14'],
+    {
+      env: { ORFE_CALLER_NAME: 'Greg' },
+      loadRepoConfigImpl: async () => {
+        throw new OrfeError('internal_error', 'loadRepoConfigImpl should not run');
+      },
+    },
+  );
+
+  assert.equal(result.exitCode, 2);
+  assert.equal(result.stdout, '');
+  assert.match(result.stderr, /issue set-state cannot mark an issue as a duplicate of itself\./);
+  assert.match(result.stderr, /Usage: orfe issue set-state/);
+  assert.match(result.stderr, /See: orfe issue set-state --help/);
+});
diff --git a/src/commands/issue/set-state/definition.test.ts b/src/commands/issue/set-state/definition.test.ts
diff --git a/src/commands/issue/update/definition.test.ts b/src/commands/issue/update/definition.test.ts