fix(parser): avoid stack overflow on large tag expressions (#817)#820
fix(parser): avoid stack overflow on large tag expressions (#817)#820Kirylka wants to merge 3 commits intoNaturalIntelligence:masterfrom
Conversation
…telligence#817) `tagExpWithClosingIndex` (introduced in 5.5.11) accumulated each scanned char code into a `chars` array and returned the result via `String.fromCharCode(...chars)`. For a tag expression large enough, the spread exceeds V8's per-call argument-count limit and manifests as `RangeError: Maximum call stack size exceeded` (with a shallow 4-frame stack), breaking parsing of large real-world XML files (e.g. Adobe AE .aepx, large XHTML reports). Replace the spread with `xmlData.substring(i, index).replace(/\t/g, ' ')`, which preserves the existing semantics (tab -> space, otherwise verbatim up to the closing char) without the per-char accumulation. This is safe because `closingChar` is `>` or `?>`, neither of which collides with `"` / `'` attribute delimiters, so the boundary tracking below still governs when to stop. Adds a regression test using a 200000-char attribute value.
The previous fix used `xmlData.substring(i, index).replace(/\t/g, ' ')`, which unconditionally replaced every tab character in the tag expression — including those inside quoted attribute values. 5.5.10 and 5.5.11 both only converted tabs to spaces when outside quoted attrs. Switch to building the tag expression with `tagExp += xmlData[index]` in the scan loop (same shape as 5.5.10) and explicitly handle `\t` only outside the attribute boundary. Still avoids the `String.fromCharCode(...chars)` spread that caused NaturalIntelligence#817. Adds a regression test for tab preservation inside quoted attr values.
The previous commit restored 5.5.10 tab semantics via `tagExp += xmlData[index]` in the scan loop. That worked correctly but regressed performance by ~18% on a 12MB XML input vs 5.5.10 due to cons-string overhead. Switch to a two-phase approach: 1. Single forward scan records the close-delimiter position and a flag for whether any `\t` was seen inside a quoted attribute value. Nothing is written during this pass. 2. Fast path (no tabs in quotes): `xmlData.substring(i, close).replace(/\t/g, ' ')`. This is the common case for real-world XML and avoids per-char work entirely. 3. Rare path (tabs inside quotes): walk the already-extracted substring once more, preserving tabs inside quotes and converting tabs outside quotes to spaces. Benchmarks (Node 22, median of many iterations): input pure 5.5.10 this PR delta small XML 0.01ms 0.01ms ~0 medium XML (~8K) 0.50ms 0.41ms -18% wide attr (~5K) 0.05ms 0.03ms -37% very wide attr (~200k) 2.88ms 1.27ms -56% 12MB aepx (NaturalIntelligence#817 repro) 324ms 289ms -11% Output is byte-identical to 5.5.10 on every tested input (verified against a 24-case battery + the 12MB .aepx SHA-256).
|
Thanks for your effort and time. Since the PR contains browser bundle which are not accepted in PR. I've updated desire method. Please check current code on github. This would be released tomorrow. |
|
Quick heads-up on the commit that landed on master (e174168): it's close to my first pass (substring + bulk
If useful, the two small specs on this branch cover both cases (the 200k-char tag-expression test guarding #817, and the tab-in-quoted-attr semantic). Happy to open a separate tiny PR with just those, or you can lift them directly. Either way, thanks for the quick turnaround. |
|
I noticed that if we build an array in |
Problem
tagExpWithClosingIndex(introduced in 5.5.11) accumulates every scanned char code into acharsarray and returns the result viaString.fromCharCode(...chars). When a tag expression is large enough, the spread exceeds V8's per-call argument budget and surfaces as:Shallow 4-frame trace (not recursion-driven). Bisect pins the regression to 5.5.11; every release since (5.5.11 / 5.5.12 / 5.6.0) still reproduces it. Affects real-world XML like 12MB Adobe AE
.aepxfiles and 46MB XHTML ESEF reports (#817).Fix
Two-phase approach inside
tagExpWithClosingIndex:\twas seen inside a quoted attribute value — no per-char string building.xmlData.substring(i, close).replace(/\t/g, ' '). This is the common case for real-world XML — zero per-char work.Keeps the 5.5.11 perf improvements (
charCodeAtcomparisons, precomputed close codes, cachedlen).Verification
Correctness
spec/large_spec.js:.aepxfrom Error: RangeError: Maximum call stack size exceeded #817 — output SHA-256 identical to 5.5.10Performance (Node 22, median over many iterations)
.aepx(#817 repro)Closes #817