refactor: reuse sql DB on GTFS reload#875
Conversation
Restructure GTFS static data loading/reloading into one shared function. Now, when periodically reloading GTFS static data, we update the existing database in place rather than opening a new database. This has the additional benefit that we no longer need a mutex on GtfsDB, since it is not modified after the Manager is constructed. Because of that, this change also deletes all of the external usages of Manager.staticMutex, and converts it to an implementation detail of Manager (still used to protect a few other fields).
Ahmedhossamdev
left a comment
There was a problem hiding this comment.
Terrific work @fletcherw
LGTM!
I have only one question:
SQLite will not block any reads when we update in place, right?
aaronbrethorst
left a comment
There was a problem hiding this comment.
Hi Fletcher,
Thanks for this one — it's exactly the kind of cleanup this codebase has been begging for. Collapsing the hot-swap dance (temp DB, file rename, recovery paths) into a single in-place import, and turning staticMutex from a handler-visible contract into a private implementation detail, is a big step forward in readability. Net −517 lines with broader handler simplification is the cherry on top.
I do have a couple of concerns I'd like resolved before this lands, plus some smaller fit-and-finish notes.
Critical Issues (0 found)
None.
Important Issues (2 found)
1. Post-commit work isn't retried, leaving the DB permanently missing min_arrival_time / max_departure_time
In importStaticIntoDB (internal/gtfs/static.go:97-128), the flow is:
ImportFromFile/DownloadAndStore→processAndStoreGTFSDataWithSource, which commits a transaction that writes all rows and upserts import metadata (gtfsdb/helpers.go:470).- After the commit succeeds,
BulkUpdateTripTimeBoundsruns as a separate write. - Then
PrecomputeAllDirectionsruns as another separate write.
Failure scenario: the main tx commits, then BulkUpdateTripTimeBounds errors (context cancel on shutdown, transient DB error, etc.). importStaticIntoDB returns (true, err); ReloadStatic propagates the error. On the next reload:
processAndStoreGTFSDataWithSourcechecksGetImportMetadataand finds the committed hash matches the source bytes → returns(false, nil).importStaticIntoDBshort-circuits on!changedand skipsBulkUpdateTripTimeBoundsentirely.
The DB is now stuck in a state where trips exist but min_arrival_time / max_departure_time are NULL forever, silently breaking queries like GetTripsInBlockWithTimeBounds and GetActiveTripsWithNullBlockForRoute (both filter on min_arrival_time <= / max_departure_time >=, which evaluates to false for NULL).
Fix: The underlying query BulkUpdateTripTimeBounds is already guarded by WHERE min_arrival_time IS NULL OR max_departure_time IS NULL, so it's cheap and idempotent. The simplest fix is to run it unconditionally in importStaticIntoDB, not only when changed == true. (Alternatively, move it inside the main import transaction — see issue #2.)
2. Reload is no longer atomic from a reader's perspective
The old ForceUpdate built a temp DB, precomputed everything, and then did an os.Rename to swap the file atomically. Readers saw either fully-old or fully-new state.
The new path commits the main import tx first, then runs BulkUpdateTripTimeBounds and PrecomputeAllDirections as separate writes. During the window between tx-commit and those post-commit steps, readers observe:
- New trips/routes/stops are already visible (committed).
- But
trips.min_arrival_time/max_departure_timeare still NULL until the bulk update runs. stops.directionis NULL untilPrecomputeAllDirectionsfinishes — and that one can take multiple seconds on larger agencies.
Queries that filter on these columns (e.g. block/null-block active-trip selection) will return empty or incorrect results during this window. It's not a total outage, but it's a regression in consistency that the old hot-swap gave us for free.
Fix: Move BulkUpdateTripTimeBounds inside the main import transaction in processAndStoreGTFSDataWithSource — it's a single UPDATE and will add negligible time to a transaction that already writes every row. PrecomputeAllDirections is harder (it's multi-step and uses its own transactions), but the on-demand fallback in AdvancedDirectionCalculator at least covers that gap for the direction API. Worth a line in the PR description acknowledging the trade-off either way.
Suggestions (3 found)
3. context.TODO() in the filter functions (internal/gtfs/realtime.go:516, 535, 553)
These three filter functions previously used context.Background(); they now use context.TODO(). Functionally identical, but TODO is conventionally a "I haven't decided yet" marker, and some linters flag it. Either revert to Background or — ideally — plumb a real context down from the callers (updateFeedRealtime has one available).
4. Redundant fetch-and-parse on every reload
ReloadStatic calls loadGTFSData (which fetches + parses), then importStaticIntoDB → ImportFromFile / DownloadAndStore, which re-reads the same file or re-downloads the same URL and parses again when the hash doesn't match. This double-fetch existed in the old code too, so not a regression, but since you're in here simplifying this code path, it's a reasonable time to thread the already-fetched bytes through instead of refetching. Not a blocker — note it or file a follow-up issue if you'd rather keep this PR focused.
5. IsHealthy → IsReady is a visible contract change
currentTimeHandler used to 503 when isHealthy was false, which happened specifically on a failed DB-swap recovery. The new code uses IsReady, which is just "startup completed" and never flips back. If any monitoring/alerting keyed off the old health semantic, they'll lose that signal. In practice the new behavior is probably fine (the DB stays valid because tx rollback preserves state), but worth a heads-up in the PR body.
Strengths
- Huge simplification. The temp-DB + rename + recovery dance was a nest of edge cases. Replacing it with a single in-place import is a clear win.
- Mutex surface shrunk correctly. Getting rid of public
RLock/RUnlockonManager, thequeriesMuonAdvancedDirectionCalculator, and all 40+ handler-side lock acquisitions is exactly the right direction. - Graceful startup fallback — if reload fails but metadata indicates prior data was imported, the manager continues serving the pre-existing DB. Nice touch.
- Tests updated thoroughly and they pass (
make testclean); lint errors are all pre-existing ingtfsdb/stops_rtree.go, unrelated to this PR.
Recommended Action
- Fix issue #1 by running
BulkUpdateTripTimeBoundsunconditionally (or moving it into the import tx). - Decide on issue #2 — ideally move the time-bounds update into the tx; if you'd rather accept the reader inconsistency window, document it in the PR description.
- Address suggestions #3–#5 as you see fit.
- Ping me when ready and I'll re-review.
Restructure GTFS static data loading/reloading into one shared function. Now, when periodically reloading GTFS static data, we update the existing database in place rather than opening a new database.
This has the additional benefit that we no longer need a mutex on GtfsDB, since it is not modified after the Manager is constructed. Because of that, this change also deletes all of the external usages of Manager.staticMutex, and converts it to an implementation detail of Manager (still used to protect a few other fields).