fix(storage): surface missing shared-structure decode misses (#1163)#1237
Conversation
…1163) A record that references a shared structure absent from this node's in-memory structures buffer (msgpackr/structon already reload from durable storage and retry, then throw) was caught, logged at error, and decoded to null — silently dropping a genuinely-existing record from query results and laundering that emptiness into caches and downstream consumers. Detect the two terminal missing-structure errors (typed: "Could not find typed structure"; classic: "Record id is not defined for") and route them to a distinct, non-fatal path: an analytics counter (decode-missing-structure) plus a dedicated warning, still returning null so internal reads (e.g. the __dbis__ metadata scan during initialization) remain tolerant. Other decode failures keep the existing error+null behavior. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Reviewed; no blockers found. |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
Scope note for reviewers — this is fast-path-only observability, not the #1163 fix for the affected clusterAfter live investigation, the production cluster that motivated #1163 runs the standalone structon decode path (msgpackr 1.12.1, So please don't over-credit this PR for the live incident:
Its real value is a fast-path safety net on main/5.1 (and any future fast-path 5.0 build): if a structure is genuinely durable-absent there, this surfaces it instead of silently returning empty. The actual read-recovery for the standalone path (the cluster's failure mode) is HarperFast/structon#5 — reload typed structures on a decode miss, bringing them to parity with classic shared structures. And the operational remedy for Gen Digital remains Net: keep this as the fast-path observability net; the standalone recovery lives in structon#5. — Claude (Opus 4.7) |
| const MISSING_TYPED_STRUCTURE_PREFIX = 'Could not find typed structure '; | ||
| const MISSING_CLASSIC_STRUCTURE_PREFIX = 'Record id is not defined for '; |
There was a problem hiding this comment.
Just out of curiosity, why the global strings?
There was a problem hiding this comment.
Do you mean vs being inside isMissingStructureError? (its not on the global scope)
Summary
When a record references a shared structure that is absent from this node's in-memory structures buffer, msgpackr/structon already reload the structures from durable storage and retry; if the structure is still missing they throw a terminal error.
RecordEncoder.decodepreviously caught all decode errors, logged aterror, and returnednull— silently dropping a genuinely-existing record from query results and laundering that emptiness into caches and downstream consumers.This detects the two terminal missing-structure errors and routes them to a distinct, non-fatal path:
decode-missing-structure, keyed by store name), andwarnlog,…while still returning
null. All other decode failures keep the existingerror+nullbehavior.Purpose
Resolves the silent-result aspect of #1163 ("nodes silently return empty/partial query results without error"). Field context: a production cluster returned partial query results to end users on affected replicas with nothing surfaced to the caller. This change makes the condition observable and alertable (metric + distinct warning) instead of silent. It does not recover the missing structure — that requires the replication-delivery path that ships the structure-buffer update to the replica (tracked separately); a node genuinely missing the structure still returns
null, but now loudly.Why non-fatal (returns null, not throws)
An earlier iteration threw on a missing structure. Testing showed that regresses database initialization: the internal
__dbis__/meta scan duringinitStoreslegitimately tolerates an undecodable record by returningnull, and throwing there broke startup (cascadingbefore allfailures). Returningnullkeeps internal reads tolerant while still surfacing the condition.Where to focus review
isMissingStructureError): neither msgpackr nor structon throws a typed error, so we match the terminal stringsCould not find typed structure(structonreadStruct) andRecord id is not defined for(msgpackrcreateSecondByteReader). Both fire only after the dependency's own on-miss reload fails. Is matching by prefix acceptable, and are these the complete set?Data read, but end of buffer not reached N(notRecord id is not defined), so it would route to the generic error path and not increment the metric — and small static tables (the typical Stale shared-structures: nodes silently return empty/partial query results without error #1163 victims) tend to have low ids. I deliberately did not match that message because it is ambiguous with genuine corruption (false positives would misclassify real corruption as a missing structure), and under this non-fatal design such a case is still logged (aterror) and returnsnull— not silently lost, just not counted in the dedicated metric. Gemini's review reached the opposite conclusion (it considers the two prefixes the complete terminal miss signals). @kriszyp — as msgpackr author, your call on whether low-id classic misses produce that message in practice and whether it's worth matching.this.nameis only set on the RocksDB encoder, so the metric/warning fall back tothis.rootStore?.namefor LMDB (best-effort; may be the root db rather than the exact table).recordActionon the decode path: only invoked on the cold error branch; it's already used on the encode hot path in this file and batches asynchronously.Tests
unitTests/resources/recordEncoder.test.js: typed miss →null+ distinct warn (not the error path); recovery decodes once the structure is present;isMissingStructureErrorcovers both prefixes and rejects the ambiguous/RangeError/undefinedcases; genuine corruption still →error+null.🤖 Generated by Claude (Opus 4.7). The detection-prefix completeness (esp. the open classic low-id question above) is the main thing worth a careful human look.