feat: add hermetic backend e2e harness#8004
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 713a9e1764
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
|
|
||
| echo "Checking/installing fake dependencies..." | ||
| install_if_missing fake_firestore fake-firestore |
There was a problem hiding this comment.
Track e2e dependencies instead of pip-installing them
In a clean checkout or any CI/offline developer environment where PyPI is blocked, bash backend/testing/e2e/run.sh exits before pytest because these required packages are not declared in backend/requirements.txt or another tracked lockfile and the script tries to fetch them dynamically. I checked repo-wide references with rg; the new harness is the only place these packages are named, so the advertised one-command harness is not reproducible without network access.
Useful? React with 👍 / 👎.
| original_init(self, *args, **kwargs) | ||
| fake = get_fake_redis() | ||
| # Delegate all key methods to the fake | ||
| for attr in ( |
There was a problem hiding this comment.
Delegate Redis scripts to fakeredis
Rate-limited endpoints in these e2e tests do not actually exercise fakeredis because database/redis_db.py creates _RATE_LIMIT_LUA = r.register_script(...) at import time, but this delegation list never forwards register_script/evalsha/script_load to the fake client. In contexts like /v3/memories, the script remains bound to the real redis.Redis client and failures are swallowed by the fail-open path, so the harness can pass while Redis behavior is not being tested.
Useful? React with 👍 / 👎.
Greptile SummaryAdds a hermetic backend e2e test harness under
Confidence Score: 4/5Safe to merge as a manually-run local harness; the defects are in the test infrastructure itself and do not touch production code. The network guard's
Important Files Changed
Sequence Diagram%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant Shell as run.sh
participant Pytest as pytest
participant Conf as conftest.py
participant Fakes as fakes/*
participant App as FastAPI app (main.py)
participant DB as MockFirestore / FakeRedis
Shell->>Pytest: python -m pytest testing/e2e/
Pytest->>Conf: import conftest.py
Conf->>Conf: patch google.auth.default
Conf->>Conf: _set_e2e_env() — overwrite credentials
Conf->>Conf: install socket guard (connect/getaddrinfo)
Conf->>Fakes: setup_fake_firestore() → MockFirestore
Conf->>Fakes: setup_fake_redis() → FakeRedis
Conf->>Fakes: patch_google_firestore() — delegate collection/batch/get_all
Conf->>Fakes: patch_redis_client() — delegate scalar ops
Conf->>App: import main → FastAPI app constructed
App-->>Conf: app cached in _app_cache
loop each test
Pytest->>Conf: client fixture → TestClient(app)
Pytest->>Fakes: seed_conversation / seed_memory / seed_action_item
Fakes->>DB: MockFirestore.collection().document().set()
Pytest->>App: TestClient.get / post / patch / delete
App->>DB: Firestore query / Redis op (via patched clients)
DB-->>App: in-memory result
App-->>Pytest: HTTP response
Pytest->>Pytest: assert status / body
end
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
participant Shell as run.sh
participant Pytest as pytest
participant Conf as conftest.py
participant Fakes as fakes/*
participant App as FastAPI app (main.py)
participant DB as MockFirestore / FakeRedis
Shell->>Pytest: python -m pytest testing/e2e/
Pytest->>Conf: import conftest.py
Conf->>Conf: patch google.auth.default
Conf->>Conf: _set_e2e_env() — overwrite credentials
Conf->>Conf: install socket guard (connect/getaddrinfo)
Conf->>Fakes: setup_fake_firestore() → MockFirestore
Conf->>Fakes: setup_fake_redis() → FakeRedis
Conf->>Fakes: patch_google_firestore() — delegate collection/batch/get_all
Conf->>Fakes: patch_redis_client() — delegate scalar ops
Conf->>App: import main → FastAPI app constructed
App-->>Conf: app cached in _app_cache
loop each test
Pytest->>Conf: client fixture → TestClient(app)
Pytest->>Fakes: seed_conversation / seed_memory / seed_action_item
Fakes->>DB: MockFirestore.collection().document().set()
Pytest->>App: TestClient.get / post / patch / delete
App->>DB: Firestore query / Redis op (via patched clients)
DB-->>App: in-memory result
App-->>Pytest: HTTP response
Pytest->>Pytest: assert status / body
end
|
| """Return the fixed dev-test UID.""" | ||
| return DEV_UID | ||
|
|
There was a problem hiding this comment.
sendto guard only handles the 2-arg form
socket.socket.sendto has two call forms: sendto(data, address) and sendto(data, flags, address). The patched signature _guarded_socket_sendto(self, data, address) only covers the first. If any library in the test process calls the 3-arg form (e.g. sock.sendto(b"payload", 0, ("host", 80))), Python invokes _guarded_socket_sendto(sock, b"payload", 0, ("host", 80)) — four positional arguments against a three-parameter function — raising an unhandled TypeError instead of the expected AssertionError. The guard would crash the test with an opaque error rather than surfacing a blocked network call.
|
|
||
|
|
||
| # ─── Utility fixtures ────────────────────────────────────────────────── | ||
|
|
||
|
|
||
| @pytest.fixture() | ||
| def fresh_uid(): | ||
| """Generate a unique UID per test for isolation.""" | ||
| import uuid | ||
|
|
||
| return str(uuid.uuid4()) | ||
|
|
||
|
|
||
| def _now_iso() -> str: | ||
| return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ") | ||
|
|
||
|
|
||
| @pytest.fixture() |
There was a problem hiding this comment.
Session-scoped Firestore store + hardcoded UID "123" accumulates state across tests
fake_firestore is session-scoped, so the same in-memory store is shared by every test in the session. Multiple test classes (TestConversationCRUD, TestConversationProcessing, TestFailureModes, TestMigrationSafety) each seed_conversation("123", ...) with fixed document IDs such as "fail-persist-001", "discard-test-active", "migration-idempotent-001", etc. Tests that assert on list responses (e.g., test_discarded_conversation_filtered, test_mixed_format_coexistence) see all documents seeded by previously-run tests, making the assertions dependent on execution order. A test that passes in isolation can fail when run after another test that left conflicting data in the shared store.
|
Follow-up pushed in Changes made:
Dogfood after the follow-up: GitHub Lint & Format Check passed on the new commit. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 97186a6553
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| def _guarded_getaddrinfo(host, port, *args, **kwargs): | ||
| if host is not None and host not in _ALLOWED_NETWORK_HOSTS: | ||
| raise AssertionError(f"Hermetic e2e blocked DNS lookup for {host!r}") |
There was a problem hiding this comment.
Make network guard failures uncatchable
When a real outbound call happens inside backend code that intentionally catches Exception (for example Stripe startup validation or vector/persona fallback paths), this AssertionError (and the connect guard above) can be swallowed; with the client fixture also disabling logging, the harness still passes even though a real service call was attempted. To preserve the hermetic guarantee in those contexts, raise a sentinel that broad except Exception handlers do not catch or record the attempt and fail during teardown.
Useful? React with 👍 / 👎.
|
Expanded the hermetic harness across the five requested areas in What changed:
Other harness improvements:
Dogfood: Also verified filtered runner behavior: Independent review of the staged diff passed with no blockers. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fabdc5a0f0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| tc = TestClient(app) | ||
| yield tc |
There was a problem hiding this comment.
Run the FastAPI lifespan in the e2e client
In this fixture the TestClient is yielded without entering it as a context manager. FastAPI/Starlette only drive lifespan handlers when the client is used as a context manager, so the backend startup/shutdown handlers in backend/main.py (startup_event/shutdown_event) are skipped for every e2e request. That means regressions in startup tasks or shutdown cleanup can pass this harness; wrap the yield in with TestClient(app) as tc: so these tests exercise the real app lifecycle.
Useful? React with 👍 / 👎.
| def download_as_bytes(self, *args, **kwargs) -> bytes: | ||
| if not self.exists(): | ||
| raise FileNotFoundError(self.name) |
There was a problem hiding this comment.
Raise GCS NotFound from missing fake blobs
When a backend path tests a missing GCS object through this fake, it will see FileNotFoundError instead of google.cloud.exceptions.NotFound/BlobNotFound. The real storage helpers catch the Google exception in paths like download_syncing_temporal_file, download_playback_artifact, and delete_blob, so missing-object e2e scenarios either fail/500 under the fake or have to avoid those branches even though production handles them. Raise the same NotFound class here.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2181d5aad6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| @@ -0,0 +1,171 @@ | |||
| # Hermetic Backend E2E Harness | |||
|
|
|||
| A manually runnable integration test suite that imports the **real omi FastAPI backend** and exercises selected routes against **faked or disabled external dependencies**. It is intended as a local dogfood harness first; there is no CI wiring yet. | |||
There was a problem hiding this comment.
Wire the new e2e suite into CI
This adds a large set of backend test files while explicitly leaving them with “no CI wiring yet”, but backend/AGENTS.md requires new test files to be added to backend/test.sh; I also checked backend/test.sh and workflow YAML with rg and found no testing/e2e invocation. As a result these 50+ hermetic scenarios can silently rot and won’t catch regressions unless a developer remembers to run the separate script manually.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d726e5bbd1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| def _receive_until(websocket, predicate, *, limit=20): | ||
| for _ in range(limit): | ||
| message = websocket.receive() |
There was a problem hiding this comment.
Bound websocket receives with a timeout
If the listen websocket regresses by keeping the connection open but not emitting the expected service_status or segment payload, this blocking receive() never reaches the limit check, so bash backend/testing/e2e/run.sh -k listen_stt can hang instead of failing. Add a per-receive timeout (or run the receive under a bounded wait) so missing websocket events produce a deterministic test failure.
Useful? React with 👍 / 👎.
| self.metadata = None | ||
| self.cache_control = None | ||
| self.content_type = None |
There was a problem hiding this comment.
Persist fake blob metadata across blob instances
Because metadata, cache_control, and content_type are stored only on each transient FakeBlob, production paths that upload with metadata and later call bucket.blob(...).reload().metadata on a fresh object lose that state under the fake. For example utils.other.storage.get_or_create_merged_audio() relies on cached blob metadata to detect valid cached audio, so e2e coverage for that path would rebuild or skip expiry behavior that production exercises; keep per-bucket/name blob state or a sidecar metadata file with the bytes.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ee71643b4a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| @@ -0,0 +1,4 @@ | |||
| fake-firestore==0.13.1 | |||
| fakeredis==2.36.2 | |||
There was a problem hiding this comment.
Install fakeredis with Lua scripting
In a clean e2e install that follows this requirements file, fakeredis is installed without its lua extra, so EVAL/EVALSHA support needed by the backend's registered Lua scripts is absent. Fresh evidence after the earlier Redis delegation fix is that this new tracked requirements file still installs core fakeredis only; routes that hit database/redis_db.py rate-limit scripts or webhook health scripts can then fail open/fallback instead of exercising the fake Redis script behavior, letting the harness report green while that Redis logic remains untested. Use fakeredis[lua] here or add the equivalent lupa dependency.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b64da7fcf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| count = 1 | ||
| r.hset(key, 'failure_count', str(count)) | ||
|
|
||
| disabled = count >= _DEV_FAILURE_THRESHOLD |
There was a problem hiding this comment.
Return true only on the disable transition
When this non-Lua fallback is used in a Redis-compatible store without scripting, concurrent webhook failures can both pass the earlier disabled check, receive HINCRBY counts at or above the threshold, and return True; each caller then runs the auto-disable path and sends the developer notification. The Lua path serializes the disabled check/set, so this fallback is not equivalent under multi-worker traffic right as a webhook crosses the threshold.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5be32d8d26
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| echo "ERROR: e2e pytest exceeded timeout ${PYTEST_TIMEOUT}" | ||
| fi | ||
| else | ||
| python -m pytest testing/e2e/ "$@" |
There was a problem hiding this comment.
Preserve the e2e timeout when timeout is unavailable
In environments where command -v timeout is false, this fallback runs pytest without any bound, so the advertised E2E_PYTEST_TIMEOUT protection is silently disabled and websocket/provider regressions can hang the one-command harness indefinitely. Since the script already documents that those regressions should fail instead of hanging, make the fallback enforce the same timeout or fail with a clear prerequisite error.
Useful? React with 👍 / 👎.
kodjima33
left a comment
There was a problem hiding this comment.
Backend e2e test harness — useful infra; approve only (feature/infra, Nik owns backend test-suite direction).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7974ff5f5f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| hermetic-e2e: | ||
| name: Hermetic Backend E2E | ||
| runs-on: ubuntu-latest | ||
| continue-on-error: true |
There was a problem hiding this comment.
Remove the e2e job's continue-on-error
For backend PRs where bash backend/testing/e2e/run.sh fails or times out, this job is still treated as non-blocking by GitHub Actions, so the workflow can remain green and required checks won't prevent the regression from merging. That defeats the new CI coverage for the hermetic route scenarios; remove continue-on-error or restrict it to an explicitly experimental/manual path.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f9666d2185
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| set +e | ||
| if command -v timeout >/dev/null 2>&1; then | ||
| timeout --preserve-status --kill-after=5s "$PYTEST_TIMEOUT" python -m pytest testing/e2e/ "$@" |
There was a problem hiding this comment.
Prewarm tiktoken in the local runner
In a clean developer/CI environment where the tiktoken cache is empty, this direct pytest launch installs the socket guard before importing the backend, and utils/llm/clients.py calls tiktoken.encoding_for_model('gpt-4') at module import. The workflow has a prewarm step, but bash backend/testing/e2e/run.sh does not, so the advertised local one-command harness can fail on a blocked DNS lookup to the tokenizer blob host before any tests run. Prewarm the tokenizer cache before starting pytest or stub that import for the e2e process.
Useful? React with 👍 / 👎.
| if getattr(module, "db", None) is old_db: | ||
| setattr(module, "db", fake_firestore_instance) | ||
| if getattr(module, "r", None) is old_r: | ||
| setattr(module, "r", fake_redis_instance) |
There was a problem hiding this comment.
Relink aliased fake clients too
This relink only replaces module globals named exactly db or r. I checked repo-wide aliases and routers/integration.py, routers/notifications.py, and utils/fair_use.py bind r as redis_client while utils/retrieval/tools/screen_activity_tools.py binds db as firestore_db; once these modules are imported by the real app, those aliases keep the pre-patch clients and routes through them can hit real localhost/ADC clients or fail-open instead of exercising the hermetic fakes. Iterate module variables by identity rather than only checking the two literal names.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
5 issues found across 31 files
Tip: instead of fixing issues one by one fix them all with cubic
Re-trigger cubic
There was a problem hiding this comment.
1 issue found across 8 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="backend/testing/e2e/run.sh">
<violation number="1" location="backend/testing/e2e/run.sh:80">
P2: The runner unconditionally performs a potentially networked tokenizer download, creating a non-hermetic external dependency in the harness entrypoint.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Fix all with cubic | Re-trigger cubic
| python - <<'PY' | ||
| import tiktoken | ||
|
|
||
| tiktoken.encoding_for_model('gpt-4') |
There was a problem hiding this comment.
P2: The runner unconditionally performs a potentially networked tokenizer download, creating a non-hermetic external dependency in the harness entrypoint.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/testing/e2e/run.sh, line 80:
<comment>The runner unconditionally performs a potentially networked tokenizer download, creating a non-hermetic external dependency in the harness entrypoint.</comment>
<file context>
@@ -70,6 +70,16 @@ fi
+python - <<'PY'
+import tiktoken
+
+tiktoken.encoding_for_model('gpt-4')
+PY
+
</file context>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9d810bb317
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| """The non-Lua fallback should match the Lua script's one-shot disable semantics.""" | ||
|
|
||
| monkeypatch.setattr(webhook_health, "_DEV_FAILURE_THRESHOLD", 2) | ||
| monkeypatch.setattr(webhook_health, "r", fake_redis) |
There was a problem hiding this comment.
Force the non-Lua path in the fallback test
With testing/e2e/requirements.txt now installing fakeredis[lua], assigning webhook_health.r to fake_redis still gives record_dev_webhook_failure() a working register_script, so these assertions exercise the Lua script path rather than _record_dev_webhook_failure_fallback(). In Redis-compatible environments without scripting, the fallback can regress while this test remains green; make this test force _get_dev_failure_script()/script execution to raise or use a fake that lacks scripting support.
Useful? React with 👍 / 👎.
Summary
Adds a hermetic backend e2e harness under
backend/testing/e2e/and wires it into GitHub Actions as a required/blocking backend PR check.The harness imports the real FastAPI backend and exercises selected route/seam coverage against fake/disabled external-service boundaries. It is intended to catch backend boot/route regressions and accidental real-network/service leakage without requiring Firestore, Redis, GCS, Pinecone, Typesense, LLM/STT provider keys, or Google ADC.
What this covers
TestClientexecutionfake-firestorehelpersfakeredis[lua]SERVICE_ACCOUNT_JSON,GOOGLE_APPLICATION_CREDENTIALS, andPINECONE_API_KEYE2E_PYTEST_TIMEOUTso websocket/provider-seam regressions fail instead of hanging.CI
New workflow:
It runs on backend PR changes and is now blocking: failures/timeouts fail PR status.
Verification
Latest local full harness:
Result:
Focused changed-area run:
Earlier local flake loop:
Explicit v1 limitations
The harness intentionally keeps incomplete provider flows as explicit skips/deferred v2 work rather than silent passes:
/v4/listen/ pusher remains v2.Framing
This PR adds a strong Layer 1 hermetic route/seam e2e harness. It is not full production-provider integration coverage.