Skip to content

fix(storage): surface missing shared-structure decode misses (#1163)#1237

Merged
kriszyp merged 1 commit into
mainfrom
kris/fix-1163-stale-structure-decode
Jun 11, 2026
Merged

fix(storage): surface missing shared-structure decode misses (#1163)#1237
kriszyp merged 1 commit into
mainfrom
kris/fix-1163-stale-structure-decode

Conversation

@kriszyp

@kriszyp kriszyp commented Jun 10, 2026

Copy link
Copy Markdown
Member

Summary

When a record references a shared structure that is absent from this node's in-memory structures buffer, msgpackr/structon already reload the structures from durable storage and retry; if the structure is still missing they throw a terminal error. RecordEncoder.decode previously caught all decode errors, logged at error, and returned null — silently dropping a genuinely-existing record from query results and laundering that emptiness into caches and downstream consumers.

This detects the two terminal missing-structure errors and routes them to a distinct, non-fatal path:

  • an analytics counter (decode-missing-structure, keyed by store name), and
  • a dedicated warn log,

…while still returning null. All other decode failures keep the existing error + null behavior.

Purpose

Resolves the silent-result aspect of #1163 ("nodes silently return empty/partial query results without error"). Field context: a production cluster returned partial query results to end users on affected replicas with nothing surfaced to the caller. This change makes the condition observable and alertable (metric + distinct warning) instead of silent. It does not recover the missing structure — that requires the replication-delivery path that ships the structure-buffer update to the replica (tracked separately); a node genuinely missing the structure still returns null, but now loudly.

Why non-fatal (returns null, not throws)

An earlier iteration threw on a missing structure. Testing showed that regresses database initialization: the internal __dbis__/meta scan during initStores legitimately tolerates an undecodable record by returning null, and throwing there broke startup (cascading before all failures). Returning null keeps internal reads tolerant while still surfacing the condition.

Where to focus review

  • Detection by error-message prefix (isMissingStructureError): neither msgpackr nor structon throws a typed error, so we match the terminal strings Could not find typed structure (structon readStruct) and Record id is not defined for (msgpackr createSecondByteReader). Both fire only after the dependency's own on-miss reload fails. Is matching by prefix acceptable, and are these the complete set?
  • OPEN — classic low-id misses (unresolved cross-model disagreement). Codex flagged that a classic structure miss for a low record id may instead surface as the ambiguous Data read, but end of buffer not reached N (not Record id is not defined), so it would route to the generic error path and not increment the metric — and small static tables (the typical Stale shared-structures: nodes silently return empty/partial query results without error #1163 victims) tend to have low ids. I deliberately did not match that message because it is ambiguous with genuine corruption (false positives would misclassify real corruption as a missing structure), and under this non-fatal design such a case is still logged (at error) and returns null — not silently lost, just not counted in the dedicated metric. Gemini's review reached the opposite conclusion (it considers the two prefixes the complete terminal miss signals). @kriszyp — as msgpackr author, your call on whether low-id classic misses produce that message in practice and whether it's worth matching.
  • LMDB store attribution: this.name is only set on the RocksDB encoder, so the metric/warning fall back to this.rootStore?.name for LMDB (best-effort; may be the root db rather than the exact table).
  • recordAction on the decode path: only invoked on the cold error branch; it's already used on the encode hot path in this file and batches asynchronously.

Tests

unitTests/resources/recordEncoder.test.js: typed miss → null + distinct warn (not the error path); recovery decodes once the structure is present; isMissingStructureError covers both prefixes and rejects the ambiguous/RangeError/undefined cases; genuine corruption still → error + null.


🤖 Generated by Claude (Opus 4.7). The detection-prefix completeness (esp. the open classic low-id question above) is the main thing worth a careful human look.

…1163)

A record that references a shared structure absent from this node's in-memory
structures buffer (msgpackr/structon already reload from durable storage and
retry, then throw) was caught, logged at error, and decoded to null — silently
dropping a genuinely-existing record from query results and laundering that
emptiness into caches and downstream consumers.

Detect the two terminal missing-structure errors (typed: "Could not find typed
structure"; classic: "Record id is not defined for") and route them to a
distinct, non-fatal path: an analytics counter (decode-missing-structure) plus
a dedicated warning, still returning null so internal reads (e.g. the __dbis__
metadata scan during initialization) remain tolerant. Other decode failures
keep the existing error+null behavior.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@claude

claude Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Reviewed; no blockers found.

@kriszyp kriszyp marked this pull request as ready for review June 10, 2026 22:24
@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@kriszyp

kriszyp commented Jun 10, 2026

Copy link
Copy Markdown
Member Author

Scope note for reviewers — this is fast-path-only observability, not the #1163 fix for the affected cluster

After live investigation, the production cluster that motivated #1163 runs the standalone structon decode path (msgpackr 1.12.1, SUPPORTS_STRUCT_HOOKS undefined — confirmed on the running encoders). On that path a missing typed structure surfaces as the ambiguous Data read, but end of buffer not reached, which this PR's detection deliberately does not match (it's indistinguishable from genuine corruption). The terminal strings this PR keys on (Could not find typed structure, Record id is not defined for) are the fast-path (msgpackr ≥ 2.0.1 / 5.1) signatures.

So please don't over-credit this PR for the live incident:

  • It is observability only (counter + warning) — it does not recover a dropped record.
  • On the cluster's standalone build it would not even fire.

Its real value is a fast-path safety net on main/5.1 (and any future fast-path 5.0 build): if a structure is genuinely durable-absent there, this surfaces it instead of silently returning empty.

The actual read-recovery for the standalone path (the cluster's failure mode) is HarperFast/structon#5 — reload typed structures on a decode miss, bringing them to parity with classic shared structures. And the operational remedy for Gen Digital remains storage.randomAccessFields: false (already in 5.0.30) + re-upserting the static rows, plus the replication-delivery fix.

Net: keep this as the fast-path observability net; the standalone recovery lives in structon#5.

— Claude (Opus 4.7)

Comment on lines +49 to +50
const MISSING_TYPED_STRUCTURE_PREFIX = 'Could not find typed structure ';
const MISSING_CLASSIC_STRUCTURE_PREFIX = 'Record id is not defined for ';

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity, why the global strings?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean vs being inside isMissingStructureError? (its not on the global scope)

@kriszyp kriszyp merged commit 5ee54dd into main Jun 11, 2026
52 checks passed
@kriszyp kriszyp deleted the kris/fix-1163-stale-structure-decode branch June 11, 2026 22:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants