Skip to content

Cleanup: four high-severity findings from PR #52 review#53

Merged
jcschaff merged 2 commits into
mainfrom
feature/runs-cleanup-high-findings
Jun 5, 2026
Merged

Cleanup: four high-severity findings from PR #52 review#53
jcschaff merged 2 commits into
mainfrom
feature/runs-cleanup-high-findings

Conversation

@jcschaff

@jcschaff jcschaff commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Follow-up to PR #52, addressing the four "High" findings from the code review (those that affect real production behavior).

What's in

Finding Where Fix
#14 Worker pod disk leak (orphaned OMEX temp files when submit raises) biosim_runs/activities.py::submit_biosim_run_activity Wrap biosim_service.run_biosim_sim in try/finally; os.remove itself in try/except OSError so a missing-file race is non-fatal.
#5 FAILED-row cache pollution + cache-lookup ordering hazard biosim_runs/activities.py::poll_biosim_run_activity Gate insert_biosimulator_workflow_run on simulation_run.status == SUCCEEDED. Activity still returns a fully-populated (un-saved) record so OmexSimWorkflow can surface the FAILED state — it doesn't consult database_id.
#10 Missing Mongo indexes on hot lookups three DB services + init_standalone + test fixtures New ensure_indexes() method (default no-op on the abstract base, Mongo impls override). Called from init_standalone() and every test fixture. Indexes listed below.
#2 Status endpoint 404s after Temporal history eviction even though SimulationRunRecord rows are still in Mongo simulations/router.py::get_simulation_status Don't immediately 404 on workflow-query failure — try the DB fallback. New _conglomerate_status_from_records helper builds a ConglomerateStatus from persisted rows (mapping CREATED/SUCCEEDED/FAILED back to processing/success/failure). Only 404 if both sources come up empty.

Indexes created (#10)

  • BiosimSimulationRuns: processing_id, run_id (unique)
  • BiosimSims: compound (file_hash_md5, image_digest, cache_buster) + biosim_run.id
  • BiosimOmex: file_hash_md5 (non-unique — get_cached_omex_file_from_*'s check-then-insert doesn't catch DuplicateKeyError, so logical uniqueness is enforced by the caller)

Tests (#2)

Three new router tests:

  • test_get_simulation_status_falls_back_to_db_when_workflow_query_fails — workflow raises, DB has 2 records (1 SUCCEEDED, 1 FAILED) → 200 with success/failure jobs and correct biosim_run_ids.
  • test_get_simulation_status_404_when_workflow_and_db_both_empty — workflow raises, DB returns [] → 404.
  • test_get_simulation_status_hybrid_enriches_biosim_run_id_from_db — workflow returns biosim_run_id=None mid-run, DB has the early-write value → response merges (live status from workflow, run id from DB).

Existing test_get_simulation_status_not_found still passes (no patched runs_db → falls back to None records → 404).

#5 doesn't have a dedicated unit test (the activity's deps are tied to module-global getters; the change is a 4-line guard whose behavior is obvious from inspection, and integration_local already exercises the SUCCEEDED path end-to-end). #14 and #10 are mechanical with the existing integration tests exercising the surrounding flows.

Verification

  • ruff + mypy --strict clean.
  • Full pytest -m "not integration": 83 passed, 1 failed (the pre-existing test_slurm_service SSH-creds env test, unchanged through every PR in this stack).

What's left

Medium and Lower findings from the review remain open (#6 HDF5 retry off-by-one, #7 stale workflow_id on cache hit, #8 submit timeout, #9 length=100 cap, #11 ImportError, #12 CancelledError, #13 initial get_sim_run no 404 handling, #15 updated bump). #13 is the most "fires under normal conditions" of the medium group — biosimulations.org eventual consistency between submit and poll. Worth addressing next if you want to keep going.

🤖 Generated with Claude Code

jcschaff and others added 2 commits June 4, 2026 12:55
#14 (worker pod disk leak) -- wrap biosim_service.run_biosim_sim in
    try/finally inside submit_biosim_run_activity so local_omex_path is
    always removed, even when submit raises (HTTP 5xx, network, auth).
    The os.remove is itself wrapped in try/except OSError so a missing-file
    race during cleanup is non-fatal.

#5 (FAILED-row cache pollution) -- gate insert_biosimulator_workflow_run
    in poll_biosim_run_activity on `simulation_run.status == SUCCEEDED`.
    Non-SUCCEEDED runs still return a fully-populated (un-saved)
    BiosimulatorWorkflowRun so OmexSimWorkflow can surface the FAILED
    state -- the workflow does not consult database_id. Eliminates
    unbounded BiosimSims growth on failing inputs and removes the
    cache-lookup ordering hazard (cached_runs[0] could otherwise be a
    FAILED row hiding a later SUCCEEDED one).

#10 (missing Mongo indexes) -- add an ensure_indexes() method to all
    three Mongo DB services (default no-op on the abstract base; Mongo
    impls override). Called from init_standalone() and from each test
    fixture so production and tests share the same shape. Indexes:
      BiosimSimulationRuns: processing_id, run_id (unique)
      BiosimSims: (file_hash_md5, image_digest, cache_buster) compound +
        biosim_run.id
      BiosimOmex: file_hash_md5 (non-unique; the check-then-insert
        callers don't handle DuplicateKeyError, so uniqueness stays
        logical).

#2 (status 404 after workflow history eviction) -- restructure
    GET /simulations/{processing_id}: don't 404 when the workflow query
    raises. Try the SimulationRunRecord DB fallback; only 404 if both
    sources are empty. A new _conglomerate_status_from_records helper
    builds ConglomerateStatus from persisted records when the workflow
    is gone (mapping CREATED/SUCCEEDED/FAILED back to
    processing/success/failure).

Tests: 3 new router tests cover the workflow-not-found-with-DB-fallback
path, the both-empty 404, and the hybrid enrichment-from-DB path. Existing
test_get_simulation_status_not_found still passes (no records -> 404).

Full -m "not integration" gate: 83 passed, 1 failed (the pre-existing
test_slurm_service env failure, unchanged).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a "Known issues from the PR #52 code review" subsection to
docs/workflows-architecture.md's Open items, cataloguing the 8 medium /
lower-severity findings not addressed by PR #53. Each entry names the
trigger, impact, and a concrete fix sketch so future work has a
starting point.

Also retires the "Hybrid status read for PR2.5" open item since that
fix shipped as part of PR #53 itself.

Medium:  #13 (initial get_sim_run 404), #8 (submit timeout), #6 (HDF5
         retry off-by-one).
Lower:   #7 (cache-hit stale workflow_id), #15 (updated bump on replay),
         #11 (lazy ImportError not caught), #12 (CancelledError captured),
         #9 (length=100 cap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jcschaff jcschaff merged commit 96ee556 into main Jun 5, 2026
5 checks passed
@jcschaff jcschaff deleted the feature/runs-cleanup-high-findings branch June 5, 2026 12:10
jcschaff added a commit that referenced this pull request Jun 5, 2026
Five stale spots in docs/workflows-architecture.md surfaced by an audit:

1. Shape B heading said "(open)" for PR #51 — it merged ages ago.
2. Shape C diagram label on the status-endpoint edges still showed only
   the hybrid read; PR #53 added the DB-only fallback when the workflow
   query fails (and only 404s when both sources are empty).
3. Shape C diagram's POLL -> DB_SIMS arrow showed an unconditional
   "write result"; PR #53 (#5 fix) made it SUCCEEDED-only.
4. Sequence diagram had the same two omissions:
   - POLL -> DBS write now in an alt/else (SUCCEEDED inserts, FAILED /
     RUN_ID_NOT_FOUND skips).
   - "loop polling" status read now in an alt/else with the workflow-
     query happy path AND the DB-fallback path (with its own
     records-vs-404 inner alt).
5. Storage-layer prose called BiosimSims "one per unique computation"
   without noting SUCCEEDED-only since PR #53.

Pure label/annotation updates; no structural workflow change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jcschaff added a commit that referenced this pull request Jun 5, 2026
Mermaid's sequence-diagram parser treats # as a control character in alt/else labels,
so '(since PR #53)' broke parsing on two lines. Reworded those two labels (the prose
around the diagram still says PR #53 explicitly). Flowchart edge labels in quoted
strings parse fine, so the two PR #53 mentions in the Shape C flowchart stay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jcschaff added a commit that referenced this pull request Jun 5, 2026
…us-2026-06-05

Docs: post-PR #53 status + frontend wire-up reality on main
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant