Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,9 @@ Use it to understand what the project is for, which constraints must hold, which
- the major runtime parts and how they fit together
- `docs/architecture/auth-model.md`
- the current runtime auth model and transitional bot token path
- `docs/architecture/testing.md`
- the repository's behavioral, integration-first testing strategy
- preferred test surfaces and when unit tests are justified

### Decision history
- `docs/architecture/adrs/`
Expand Down Expand Up @@ -65,6 +68,7 @@ Use it to understand what the project is for, which constraints must hold, which
- Architecture guardrails: read `docs/architecture/invariants.md`
- System shape: read `docs/architecture/overview.md`
- Auth and bot identity model: read `docs/architecture/auth-model.md`
- Testing strategy and preferred test boundaries: read `docs/architecture/testing.md`
- Terminology and identity definitions: read `docs/glossary.md`
- "Why is it designed this way?": read the ADRs
- "What is intentionally imperfect right now?": read `docs/project/debt.md`
Expand Down
153 changes: 153 additions & 0 deletions docs/architecture/testing.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,153 @@
# `orfe` testing strategy

## Summary

`orfe` prefers **behavioral tests over structural tests**.

When choosing a test shape, optimize for the most confidence per test:

1. prefer **live E2E** when the flow is high-value, safe, and deterministic
2. otherwise prefer **integration tests with minimal mocking**
3. use **unit tests rarely and intentionally**

Broad behavioral coverage is more valuable here than a large count of narrow tests or inflated coverage numbers.

This strategy preserves the architecture invariants from ADR 0001, ADR 0002, ADR 0007, and `docs/architecture/invariants.md`: tests should reinforce `orfe` as a generic runtime, keep the wrapper/core boundary explicit, and treat command slices as the semantic owners of command behavior.

## Decision rule

Before adding a test, ask:

- what public or architectural boundary am I protecting?
- can a broader behavioral test cover this with the same or better confidence?
- am I mocking only true external boundaries?
- if this is a unit test, why would a broader integration test be insufficient or too costly?

If a test mostly restates implementation structure, command metadata, or wrapper passthrough, it is probably not the right default.

## Preferred test surfaces in this repository

### 1. CLI boundary

Prefer CLI tests when protecting:

- command-line parsing
- usage errors and help output
- JSON success and error envelopes
- caller requirements for direct CLI usage

Use the real `runCli` path with minimal stubbing around config, auth, and network boundaries.

### 2. OpenCode tool boundary

Prefer OpenCode tool tests when protecting:

- `context.agent` handling
- tool-specific input rejection
- OpenCode result envelopes
- plugin/tool contract behavior that is not visible from CLI tests

Test the real `executeOrfeTool` path. Only keep plugin-shape tests when the plugin schema itself is part of the public contract under protection.

### 3. Core runtime boundary

Prefer core tests when protecting:

- wrapper-independent command behavior
- command dispatch, validation, and dependency loading rules
- auth/config/GitHub requirements at the runtime boundary

These tests should exercise the real `runOrfeCore` composition path rather than mocking internal collaborators slice-by-slice.

### 4. Command + GitHub integration path

For GitHub-backed commands, default to integration tests that:

- run the real command through CLI, tool, or core
- use real slice definitions and handlers
- mock only external boundaries such as GitHub HTTP, machine-local auth config, or repo-local config loading

This is the main practical default for `orfe` today.

### 5. Template runtime behavior

Template loading, validation, provenance, and normalization should be tested behaviorally at the template/runtime boundary rather than by inspecting template-loading implementation details.

## Mocking guidance

Mock only true external boundaries when practical:

- GitHub network calls
- machine-local auth/config files
- repo-local config discovery when a test needs a controlled repo shape
- runtime host context such as `context.agent`

Avoid over-mocking internal collaborators when the real integration path is easy to exercise.

## When unit tests are justified

Unit tests are acceptable, but they are the exception rather than the default.

Use them mainly for:

- sensitive auth or security logic
- tricky normalization or validation logic
- small pure functions with failure modes that are hard to diagnose from higher-level tests
- performance-sensitive logic where isolated feedback materially helps

Do **not** default to unit tests just because they are easy to write or increase coverage percentages.

## Low-value test patterns to avoid

Treat these categories skeptically and keep them only when they protect a meaningful public contract:

- **thin forwarding tests** that only prove one wrapper passes inputs straight through
- **source-text assertions** that inspect implementation files instead of runtime behavior
- **shallow metadata tests** that mostly restate command literals, option declarations, or examples
- **duplicated routing/help assertions across layers** when one slice-local suite plus one real boundary suite already covers the contract
- **over-mocked tests** that replace practical internal integration with isolated doubles

Metadata tests are justified only when the metadata itself is the contract, such as published package entrypoints or the OpenCode plugin schema.

## Current suite review and changes applied in issue #159

This issue reviewed the existing suite against the strategy above and reduced several low-signal categories:

- removed thin forwarding tests for the CLI entrypoint wrapper and OpenCode core-input passthrough
- removed duplicated runtime-routing/help tests from broad cross-cutting suites when slice-local help tests already covered the same behavior
- removed metadata-heavy command definition tests that mainly restated static option declarations instead of protecting runtime behavior
- removed the CLI source shebang source-text assertion because it inspected implementation text rather than observable behavior
- kept stronger behavioral tests around CLI execution, OpenCode behavior, core runtime behavior, template validation, and GitHub-backed command flows
- kept contract-oriented metadata checks only where the metadata itself is meaningful public surface, such as package entrypoints and plugin schema exposure

The suite should continue moving toward fewer but broader behavioral tests, especially when a command already has strong CLI/core/tool coverage.

## Live E2E guidance

Live GitHub E2E is worth keeping or adding later for a **small smoke set**, not for every workflow.

Highest-value candidates are:

- bot-auth token minting against a real sandbox installation
- one read/write issue flow in a disposable sandbox repo
- one PR creation or update flow in a disposable sandbox repo
- one project status mutation smoke test if the sandbox setup is reliable

Live E2E should stay selective because `orfe` depends on bot auth, GitHub state, and external API stability. Integration tests with mocked external boundaries remain the main default until safe sandbox E2E is routine.

## Reviewer checklist

When reviewing a new test, prefer these outcomes:

- a meaningful repo boundary is exercised
- internal mocking is minimal
- the test protects behavior, not implementation layout
- a unit test includes a clear reason it is narrower than the default strategy
- no one is using coverage percentage as the primary success metric

## Related docs

- `docs/README.md`
- `docs/architecture/invariants.md`
- `docs/architecture/overview.md`
- `docs/project/debt.md`
5 changes: 5 additions & 0 deletions docs/project/debt.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,3 +38,8 @@ This file keeps known documentation, architecture, and process debt visible so i
- **Impact:** the repository now has a versioned template foundation under `.orfe/templates/`, but `.github/ISSUE_TEMPLATE/feature.md` and `.github/pull_request_template.md` still remain active human-facing fallback aids. That creates a temporary dual-source risk for artifact structure expectations.
- **Current treatment:** treat versioned repo-defined templates as the canonical runtime source for validated agent-authored artifacts, while keeping the GitHub-native templates as transitional workflow aids that `orfe` does not read or depend on.
- **Follow-up direction:** once contract-driven authoring and validation are in routine use, reduce or realign the GitHub-native templates intentionally so durable structure expectations do not drift.

### 8. Live GitHub E2E coverage is still intentionally limited
- **Impact:** the testing strategy prefers live E2E where practical, but the current suite still relies mainly on integration tests with mocked GitHub boundaries. That leaves a gap in real bot-auth and live GitHub smoke coverage.
- **Current treatment:** keep integration tests as the main default because they are deterministic and practical, and reserve live E2E for future high-value smoke coverage only.
- **Follow-up direction:** add a small sandboxed live E2E smoke set for token minting and a few representative issue/PR/project flows when the repository has a safe, reliable sandbox path.
13 changes: 0 additions & 13 deletions src/commands/auth/token/definition.test.ts

This file was deleted.

20 changes: 13 additions & 7 deletions src/commands/help/definition.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -4,16 +4,22 @@ import { test } from 'vitest';

import { COMMANDS, helpCommand } from '../index.js';
import { validateCommandInput } from '../registry/index.js';
import { assertDefinitionIdentity, assertOption } from '../../../test/support/definition-test.js';
import { createHelpCommandSuccessData, createHelpRootSuccessData } from './definition.js';

test('help definition stays a top-level runtime command with optional targeted lookup', () => {
assertDefinitionIdentity(helpCommand, { name: 'help', group: 'help', leaf: 'help', execution: 'runtime', topLevel: true });
assertOption(helpCommand, 'command_name', {
flag: '--command-name',
type: 'string',
description: 'Return detailed help for one canonical command name.',
});
assert.equal(helpCommand.name, 'help');
assert.equal(helpCommand.group, 'help');
assert.equal(helpCommand.leaf, 'help');
assert.equal(helpCommand.execution, 'runtime');
assert.equal(helpCommand.topLevel, true);
assert.deepEqual(helpCommand.options, [
{
key: 'command_name',
flag: '--command-name',
description: 'Return detailed help for one canonical command name.',
type: 'string',
},
]);
assert.equal(helpCommand.requiresCaller, false);
assert.equal(helpCommand.requiresRepoConfig, false);
assert.equal(helpCommand.requiresAuthConfig, false);
Expand Down
14 changes: 0 additions & 14 deletions src/commands/issue/comment/definition.test.ts

This file was deleted.

15 changes: 0 additions & 15 deletions src/commands/issue/create/definition.test.ts

This file was deleted.

14 changes: 0 additions & 14 deletions src/commands/issue/get/definition.test.ts

This file was deleted.

33 changes: 33 additions & 0 deletions src/commands/issue/set-state/cli.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -246,3 +246,36 @@ test('runCli formats core invalid_usage errors as CLI usage failures', async ()
assert.match(result.stderr, /Example: ORFE_CALLER_NAME=Greg orfe issue set-state --issue-number 14 --state closed --state-reason completed/);
assert.match(result.stderr, /See: orfe issue set-state --help/);
});

test('runCli requires --duplicate-of when closing an issue as duplicate', async () => {
const result = await invokeCli(['issue', 'set-state', '--issue-number', '14', '--state', 'closed', '--state-reason', 'duplicate'], {
env: { ORFE_CALLER_NAME: 'Greg' },
loadRepoConfigImpl: async () => {
throw new OrfeError('internal_error', 'loadRepoConfigImpl should not run');
},
});

assert.equal(result.exitCode, 2);
assert.equal(result.stdout, '');
assert.match(result.stderr, /issue set-state requires --duplicate-of when state_reason=duplicate\./);
assert.match(result.stderr, /Usage: orfe issue set-state/);
assert.match(result.stderr, /See: orfe issue set-state --help/);
});

test('runCli rejects marking an issue as a duplicate of itself', async () => {
const result = await invokeCli(
['issue', 'set-state', '--issue-number', '14', '--state', 'closed', '--state-reason', 'duplicate', '--duplicate-of', '14'],
{
env: { ORFE_CALLER_NAME: 'Greg' },
loadRepoConfigImpl: async () => {
throw new OrfeError('internal_error', 'loadRepoConfigImpl should not run');
},
},
);

assert.equal(result.exitCode, 2);
assert.equal(result.stdout, '');
assert.match(result.stderr, /issue set-state cannot mark an issue as a duplicate of itself\./);
assert.match(result.stderr, /Usage: orfe issue set-state/);
assert.match(result.stderr, /See: orfe issue set-state --help/);
});
35 changes: 0 additions & 35 deletions src/commands/issue/set-state/definition.test.ts

This file was deleted.

23 changes: 0 additions & 23 deletions src/commands/issue/update/definition.test.ts

This file was deleted.

Loading