Cleanup: four high-severity findings from PR #52 review#53
Merged
Conversation
#14 (worker pod disk leak) -- wrap biosim_service.run_biosim_sim in try/finally inside submit_biosim_run_activity so local_omex_path is always removed, even when submit raises (HTTP 5xx, network, auth). The os.remove is itself wrapped in try/except OSError so a missing-file race during cleanup is non-fatal. #5 (FAILED-row cache pollution) -- gate insert_biosimulator_workflow_run in poll_biosim_run_activity on `simulation_run.status == SUCCEEDED`. Non-SUCCEEDED runs still return a fully-populated (un-saved) BiosimulatorWorkflowRun so OmexSimWorkflow can surface the FAILED state -- the workflow does not consult database_id. Eliminates unbounded BiosimSims growth on failing inputs and removes the cache-lookup ordering hazard (cached_runs[0] could otherwise be a FAILED row hiding a later SUCCEEDED one). #10 (missing Mongo indexes) -- add an ensure_indexes() method to all three Mongo DB services (default no-op on the abstract base; Mongo impls override). Called from init_standalone() and from each test fixture so production and tests share the same shape. Indexes: BiosimSimulationRuns: processing_id, run_id (unique) BiosimSims: (file_hash_md5, image_digest, cache_buster) compound + biosim_run.id BiosimOmex: file_hash_md5 (non-unique; the check-then-insert callers don't handle DuplicateKeyError, so uniqueness stays logical). #2 (status 404 after workflow history eviction) -- restructure GET /simulations/{processing_id}: don't 404 when the workflow query raises. Try the SimulationRunRecord DB fallback; only 404 if both sources are empty. A new _conglomerate_status_from_records helper builds ConglomerateStatus from persisted records when the workflow is gone (mapping CREATED/SUCCEEDED/FAILED back to processing/success/failure). Tests: 3 new router tests cover the workflow-not-found-with-DB-fallback path, the both-empty 404, and the hybrid enrichment-from-DB path. Existing test_get_simulation_status_not_found still passes (no records -> 404). Full -m "not integration" gate: 83 passed, 1 failed (the pre-existing test_slurm_service env failure, unchanged). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a "Known issues from the PR #52 code review" subsection to docs/workflows-architecture.md's Open items, cataloguing the 8 medium / lower-severity findings not addressed by PR #53. Each entry names the trigger, impact, and a concrete fix sketch so future work has a starting point. Also retires the "Hybrid status read for PR2.5" open item since that fix shipped as part of PR #53 itself. Medium: #13 (initial get_sim_run 404), #8 (submit timeout), #6 (HDF5 retry off-by-one). Lower: #7 (cache-hit stale workflow_id), #15 (updated bump on replay), #11 (lazy ImportError not caught), #12 (CancelledError captured), #9 (length=100 cap). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jcschaff
added a commit
that referenced
this pull request
Jun 5, 2026
Five stale spots in docs/workflows-architecture.md surfaced by an audit: 1. Shape B heading said "(open)" for PR #51 — it merged ages ago. 2. Shape C diagram label on the status-endpoint edges still showed only the hybrid read; PR #53 added the DB-only fallback when the workflow query fails (and only 404s when both sources are empty). 3. Shape C diagram's POLL -> DB_SIMS arrow showed an unconditional "write result"; PR #53 (#5 fix) made it SUCCEEDED-only. 4. Sequence diagram had the same two omissions: - POLL -> DBS write now in an alt/else (SUCCEEDED inserts, FAILED / RUN_ID_NOT_FOUND skips). - "loop polling" status read now in an alt/else with the workflow- query happy path AND the DB-fallback path (with its own records-vs-404 inner alt). 5. Storage-layer prose called BiosimSims "one per unique computation" without noting SUCCEEDED-only since PR #53. Pure label/annotation updates; no structural workflow change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jcschaff
added a commit
that referenced
this pull request
Jun 5, 2026
Mermaid's sequence-diagram parser treats # as a control character in alt/else labels, so '(since PR #53)' broke parsing on two lines. Reworded those two labels (the prose around the diagram still says PR #53 explicitly). Flowchart edge labels in quoted strings parse fine, so the two PR #53 mentions in the Shape C flowchart stay. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jcschaff
added a commit
that referenced
this pull request
Jun 5, 2026
…us-2026-06-05 Docs: post-PR #53 status + frontend wire-up reality on main
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up to PR #52, addressing the four "High" findings from the code review (those that affect real production behavior).
What's in
biosim_runs/activities.py::submit_biosim_run_activitybiosim_service.run_biosim_simintry/finally;os.removeitself intry/except OSErrorso a missing-file race is non-fatal.biosim_runs/activities.py::poll_biosim_run_activityinsert_biosimulator_workflow_runonsimulation_run.status == SUCCEEDED. Activity still returns a fully-populated (un-saved) record soOmexSimWorkflowcan surface the FAILED state — it doesn't consultdatabase_id.init_standalone+ test fixturesensure_indexes()method (default no-op on the abstract base, Mongo impls override). Called frominit_standalone()and every test fixture. Indexes listed below.SimulationRunRecordrows are still in Mongosimulations/router.py::get_simulation_status_conglomerate_status_from_recordshelper builds aConglomerateStatusfrom persisted rows (mappingCREATED/SUCCEEDED/FAILEDback toprocessing/success/failure). Only 404 if both sources come up empty.Indexes created (#10)
BiosimSimulationRuns:processing_id,run_id(unique)BiosimSims: compound(file_hash_md5, image_digest, cache_buster)+biosim_run.idBiosimOmex:file_hash_md5(non-unique —get_cached_omex_file_from_*'s check-then-insert doesn't catchDuplicateKeyError, so logical uniqueness is enforced by the caller)Tests (#2)
Three new router tests:
test_get_simulation_status_falls_back_to_db_when_workflow_query_fails— workflow raises, DB has 2 records (1 SUCCEEDED, 1 FAILED) → 200 withsuccess/failurejobs and correctbiosim_run_ids.test_get_simulation_status_404_when_workflow_and_db_both_empty— workflow raises, DB returns[]→ 404.test_get_simulation_status_hybrid_enriches_biosim_run_id_from_db— workflow returnsbiosim_run_id=Nonemid-run, DB has the early-write value → response merges (live status from workflow, run id from DB).Existing
test_get_simulation_status_not_foundstill passes (no patched runs_db → falls back to None records → 404).#5 doesn't have a dedicated unit test (the activity's deps are tied to module-global getters; the change is a 4-line guard whose behavior is obvious from inspection, and
integration_localalready exercises the SUCCEEDED path end-to-end). #14 and #10 are mechanical with the existing integration tests exercising the surrounding flows.Verification
pytest -m "not integration": 83 passed, 1 failed (the pre-existingtest_slurm_serviceSSH-creds env test, unchanged through every PR in this stack).What's left
Medium and Lower findings from the review remain open (#6 HDF5 retry off-by-one, #7 stale
workflow_idon cache hit, #8 submit timeout, #9length=100cap, #11 ImportError, #12CancelledError, #13 initialget_sim_runno 404 handling, #15updatedbump). #13 is the most "fires under normal conditions" of the medium group — biosimulations.org eventual consistency between submit and poll. Worth addressing next if you want to keep going.🤖 Generated with Claude Code