Cleanup: four high-severity findings from PR #52 review by jcschaff · Pull Request #53 · biosimulations/platform

jcschaff · 2026-06-04T16:56:23Z

Follow-up to PR #52, addressing the four "High" findings from the code review (those that affect real production behavior).

What's in

Finding	Where	Fix
#14 Worker pod disk leak (orphaned OMEX temp files when submit raises)	`biosim_runs/activities.py::submit_biosim_run_activity`	Wrap `biosim_service.run_biosim_sim` in `try/finally`; `os.remove` itself in `try/except OSError` so a missing-file race is non-fatal.
#5 FAILED-row cache pollution + cache-lookup ordering hazard	`biosim_runs/activities.py::poll_biosim_run_activity`	Gate `insert_biosimulator_workflow_run` on `simulation_run.status == SUCCEEDED`. Activity still returns a fully-populated (un-saved) record so `OmexSimWorkflow` can surface the FAILED state — it doesn't consult `database_id`.
#10 Missing Mongo indexes on hot lookups	three DB services + `init_standalone` + test fixtures	New `ensure_indexes()` method (default no-op on the abstract base, Mongo impls override). Called from `init_standalone()` and every test fixture. Indexes listed below.
#2 Status endpoint 404s after Temporal history eviction even though `SimulationRunRecord` rows are still in Mongo	`simulations/router.py::get_simulation_status`	Don't immediately 404 on workflow-query failure — try the DB fallback. New `_conglomerate_status_from_records` helper builds a `ConglomerateStatus` from persisted rows (mapping `CREATED/SUCCEEDED/FAILED` back to `processing/success/failure`). Only 404 if both sources come up empty.

Indexes created (#10)

BiosimSimulationRuns: processing_id, run_id (unique)
BiosimSims: compound (file_hash_md5, image_digest, cache_buster) + biosim_run.id
BiosimOmex: file_hash_md5 (non-unique — get_cached_omex_file_from_*'s check-then-insert doesn't catch DuplicateKeyError, so logical uniqueness is enforced by the caller)

Tests (#2)

Three new router tests:

test_get_simulation_status_falls_back_to_db_when_workflow_query_fails — workflow raises, DB has 2 records (1 SUCCEEDED, 1 FAILED) → 200 with success/failure jobs and correct biosim_run_ids.
test_get_simulation_status_404_when_workflow_and_db_both_empty — workflow raises, DB returns [] → 404.
test_get_simulation_status_hybrid_enriches_biosim_run_id_from_db — workflow returns biosim_run_id=None mid-run, DB has the early-write value → response merges (live status from workflow, run id from DB).

Existing test_get_simulation_status_not_found still passes (no patched runs_db → falls back to None records → 404).

#5 doesn't have a dedicated unit test (the activity's deps are tied to module-global getters; the change is a 4-line guard whose behavior is obvious from inspection, and integration_local already exercises the SUCCEEDED path end-to-end). #14 and #10 are mechanical with the existing integration tests exercising the surrounding flows.

Verification

ruff + mypy --strict clean.
Full pytest -m "not integration": 83 passed, 1 failed (the pre-existing test_slurm_service SSH-creds env test, unchanged through every PR in this stack).

What's left

Medium and Lower findings from the review remain open (#6 HDF5 retry off-by-one, #7 stale workflow_id on cache hit, #8 submit timeout, #9 length=100 cap, #11 ImportError, #12 CancelledError, #13 initial get_sim_run no 404 handling, #15 updated bump). #13 is the most "fires under normal conditions" of the medium group — biosimulations.org eventual consistency between submit and poll. Worth addressing next if you want to keep going.

🤖 Generated with Claude Code

#14 (worker pod disk leak) -- wrap biosim_service.run_biosim_sim in try/finally inside submit_biosim_run_activity so local_omex_path is always removed, even when submit raises (HTTP 5xx, network, auth). The os.remove is itself wrapped in try/except OSError so a missing-file race during cleanup is non-fatal. #5 (FAILED-row cache pollution) -- gate insert_biosimulator_workflow_run in poll_biosim_run_activity on `simulation_run.status == SUCCEEDED`. Non-SUCCEEDED runs still return a fully-populated (un-saved) BiosimulatorWorkflowRun so OmexSimWorkflow can surface the FAILED state -- the workflow does not consult database_id. Eliminates unbounded BiosimSims growth on failing inputs and removes the cache-lookup ordering hazard (cached_runs[0] could otherwise be a FAILED row hiding a later SUCCEEDED one). #10 (missing Mongo indexes) -- add an ensure_indexes() method to all three Mongo DB services (default no-op on the abstract base; Mongo impls override). Called from init_standalone() and from each test fixture so production and tests share the same shape. Indexes: BiosimSimulationRuns: processing_id, run_id (unique) BiosimSims: (file_hash_md5, image_digest, cache_buster) compound + biosim_run.id BiosimOmex: file_hash_md5 (non-unique; the check-then-insert callers don't handle DuplicateKeyError, so uniqueness stays logical). #2 (status 404 after workflow history eviction) -- restructure GET /simulations/{processing_id}: don't 404 when the workflow query raises. Try the SimulationRunRecord DB fallback; only 404 if both sources are empty. A new _conglomerate_status_from_records helper builds ConglomerateStatus from persisted records when the workflow is gone (mapping CREATED/SUCCEEDED/FAILED back to processing/success/failure). Tests: 3 new router tests cover the workflow-not-found-with-DB-fallback path, the both-empty 404, and the hybrid enrichment-from-DB path. Existing test_get_simulation_status_not_found still passes (no records -> 404). Full -m "not integration" gate: 83 passed, 1 failed (the pre-existing test_slurm_service env failure, unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a "Known issues from the PR #52 code review" subsection to docs/workflows-architecture.md's Open items, cataloguing the 8 medium / lower-severity findings not addressed by PR #53. Each entry names the trigger, impact, and a concrete fix sketch so future work has a starting point. Also retires the "Hybrid status read for PR2.5" open item since that fix shipped as part of PR #53 itself. Medium: #13 (initial get_sim_run 404), #8 (submit timeout), #6 (HDF5 retry off-by-one). Lower: #7 (cache-hit stale workflow_id), #15 (updated bump on replay), #11 (lazy ImportError not caught), #12 (CancelledError captured), #9 (length=100 cap). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Five stale spots in docs/workflows-architecture.md surfaced by an audit: 1. Shape B heading said "(open)" for PR #51 — it merged ages ago. 2. Shape C diagram label on the status-endpoint edges still showed only the hybrid read; PR #53 added the DB-only fallback when the workflow query fails (and only 404s when both sources are empty). 3. Shape C diagram's POLL -> DB_SIMS arrow showed an unconditional "write result"; PR #53 (#5 fix) made it SUCCEEDED-only. 4. Sequence diagram had the same two omissions: - POLL -> DBS write now in an alt/else (SUCCEEDED inserts, FAILED / RUN_ID_NOT_FOUND skips). - "loop polling" status read now in an alt/else with the workflow- query happy path AND the DB-fallback path (with its own records-vs-404 inner alt). 5. Storage-layer prose called BiosimSims "one per unique computation" without noting SUCCEEDED-only since PR #53. Pure label/annotation updates; no structural workflow change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mermaid's sequence-diagram parser treats # as a control character in alt/else labels, so '(since PR #53)' broke parsing on two lines. Reworded those two labels (the prose around the diagram still says PR #53 explicitly). Flowchart edge labels in quoted strings parse fine, so the two PR #53 mentions in the Shape C flowchart stay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…us-2026-06-05 Docs: post-PR #53 status + frontend wire-up reality on main

jcschaff and others added 2 commits June 4, 2026 12:55

jcschaff merged commit 96ee556 into main Jun 5, 2026
5 checks passed

jcschaff deleted the feature/runs-cleanup-high-findings branch June 5, 2026 12:10

jcschaff mentioned this pull request Jun 5, 2026

Docs: post-PR #53 status + frontend wire-up reality on main #54

Merged

jcschaff added a commit that referenced this pull request Jun 5, 2026

Merge pull request #54 from biosimulations/docs/runs-convergence-stat…

dd26089

…us-2026-06-05 Docs: post-PR #53 status + frontend wire-up reality on main

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cleanup: four high-severity findings from PR #52 review#53

Cleanup: four high-severity findings from PR #52 review#53
jcschaff merged 2 commits into
mainfrom
feature/runs-cleanup-high-findings

jcschaff commented Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jcschaff commented Jun 4, 2026

What's in

Indexes created (#10)

Tests (#2)

Verification

What's left

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant