Skip to content

fix(parser): avoid stack overflow on large tag expressions (#817)#820

Open
Kirylka wants to merge 3 commits intoNaturalIntelligence:masterfrom
Kirylka:fix/issue-817-stack-overflow-long-tag
Open

fix(parser): avoid stack overflow on large tag expressions (#817)#820
Kirylka wants to merge 3 commits intoNaturalIntelligence:masterfrom
Kirylka:fix/issue-817-stack-overflow-long-tag

Conversation

@Kirylka
Copy link
Copy Markdown

@Kirylka Kirylka commented Apr 16, 2026

Problem

tagExpWithClosingIndex (introduced in 5.5.11) accumulates every scanned char code into a chars array and returns the result via String.fromCharCode(...chars). When a tag expression is large enough, the spread exceeds V8's per-call argument budget and surfaces as:

RangeError: Maximum call stack size exceeded
    at fast-xml-parser/lib/fxp.cjs:1:xxxxx
    at st (...)
    at parseXml (...)
    at parse (...)

Shallow 4-frame trace (not recursion-driven). Bisect pins the regression to 5.5.11; every release since (5.5.11 / 5.5.12 / 5.6.0) still reproduces it. Affects real-world XML like 12MB Adobe AE .aepx files and 46MB XHTML ESEF reports (#817).

Fix

Two-phase approach inside tagExpWithClosingIndex:

  1. Single forward scan locates the close-delimiter position and tracks quote boundaries. It only records whether any \t was seen inside a quoted attribute value — no per-char string building.
  2. Fast path (no tab inside a quoted attr): xmlData.substring(i, close).replace(/\t/g, ' '). This is the common case for real-world XML — zero per-char work.
  3. Rare path (tab inside a quoted attr): walk the extracted substring once, preserving tabs inside quotes and normalising tabs outside quotes to spaces. Matches 5.5.10 semantics exactly.

Keeps the 5.5.11 perf improvements (charCodeAt comparisons, precomputed close codes, cached len).

Verification

Correctness

  • All 310 existing specs pass
  • Two new regression tests in spec/large_spec.js:
    • 200k-char attribute value — fails on unpatched 5.5.11+, passes after
    • Tab preserved inside quoted attribute — guards the 5.5.10 semantic
  • A 24-case byte-equality battery (edge cases: tabs inside/outside/adjacent quotes, single vs double quotes, surrogate pairs, long attrs, PI with tabs, entities, boolean attrs, etc.) — every case produces output byte-identical to 5.5.10
  • Verified against the 12MB .aepx from Error: RangeError: Maximum call stack size exceeded #817 — output SHA-256 identical to 5.5.10

Performance (Node 22, median over many iterations)

input pure 5.5.10 this PR delta
small XML 0.01ms 0.01ms ~0
medium XML (~8K) 0.50ms 0.41ms −18%
wide attr (~5K chars) 0.05ms 0.03ms −37%
very wide attr (~200k chars) 2.88ms 1.27ms −56%
12MB .aepx (#817 repro) 324ms 289ms −11%

Closes #817

…telligence#817)

`tagExpWithClosingIndex` (introduced in 5.5.11) accumulated each scanned
char code into a `chars` array and returned the result via
`String.fromCharCode(...chars)`. For a tag expression large enough, the
spread exceeds V8's per-call argument-count limit and manifests as
`RangeError: Maximum call stack size exceeded` (with a shallow 4-frame
stack), breaking parsing of large real-world XML files (e.g. Adobe AE
.aepx, large XHTML reports).

Replace the spread with `xmlData.substring(i, index).replace(/\t/g, ' ')`,
which preserves the existing semantics (tab -> space, otherwise verbatim
up to the closing char) without the per-char accumulation. This is
safe because `closingChar` is `>` or `?>`, neither of which collides
with `"` / `'` attribute delimiters, so the boundary tracking below
still governs when to stop.

Adds a regression test using a 200000-char attribute value.
Kirylka added 2 commits April 16, 2026 20:18
The previous fix used `xmlData.substring(i, index).replace(/\t/g, ' ')`,
which unconditionally replaced every tab character in the tag
expression — including those inside quoted attribute values. 5.5.10 and
5.5.11 both only converted tabs to spaces when outside quoted attrs.

Switch to building the tag expression with `tagExp += xmlData[index]` in
the scan loop (same shape as 5.5.10) and explicitly handle `\t` only
outside the attribute boundary. Still avoids the `String.fromCharCode(...chars)`
spread that caused NaturalIntelligence#817.

Adds a regression test for tab preservation inside quoted attr values.
The previous commit restored 5.5.10 tab semantics via `tagExp += xmlData[index]`
in the scan loop. That worked correctly but regressed performance by ~18% on
a 12MB XML input vs 5.5.10 due to cons-string overhead.

Switch to a two-phase approach:

1. Single forward scan records the close-delimiter position and a flag
   for whether any `\t` was seen inside a quoted attribute value. Nothing
   is written during this pass.
2. Fast path (no tabs in quotes): `xmlData.substring(i, close).replace(/\t/g, ' ')`.
   This is the common case for real-world XML and avoids per-char work
   entirely.
3. Rare path (tabs inside quotes): walk the already-extracted substring
   once more, preserving tabs inside quotes and converting tabs outside
   quotes to spaces.

Benchmarks (Node 22, median of many iterations):

  input                         pure 5.5.10    this PR        delta
  small XML                     0.01ms         0.01ms         ~0
  medium XML (~8K)              0.50ms         0.41ms         -18%
  wide attr (~5K)               0.05ms         0.03ms         -37%
  very wide attr (~200k)        2.88ms         1.27ms         -56%
  12MB aepx (NaturalIntelligence#817 repro)        324ms          289ms          -11%

Output is byte-identical to 5.5.10 on every tested input (verified
against a 24-case battery + the 12MB .aepx SHA-256).
@amitguptagwl
Copy link
Copy Markdown
Member

amitguptagwl commented Apr 17, 2026

Thanks for your effort and time. Since the PR contains browser bundle which are not accepted in PR. I've updated desire method. Please check current code on github. This would be released tomorrow.

@Kirylka
Copy link
Copy Markdown
Author

Kirylka commented Apr 17, 2026

Quick heads-up on the commit that landed on master (e174168): it's close to my first pass (substring + bulk replace(/\t/g, ' ')) and does fix the stack overflow, but it drops two things that were in later commits on this branch:

  1. Tab preservation inside quoted attribute values. v5.5.10 preserved \t literally when it appeared inside "..." — only tabs outside quoted attrs were converted. The committed version replaces every tab in the tag expression, so <item a="x\ty"/> now comes back as "x y" instead of "x\ty". Arguably more XML 1.0 §3.3.3 compliant for CDATA attrs, but a silent behavior change for anyone upgrading from 5.5.10.

  2. Regression tests. This failure mode (a perf refactor introducing the spread bomb in the tag-expression reader) only came in during 5.5.11 itself — without tests the same shape of bug could slip back on the next pass.

If useful, the two small specs on this branch cover both cases (the 200k-char tag-expression test guarding #817, and the tab-in-quoted-attr semantic). Happy to open a separate tiny PR with just those, or you can lift them directly.

Either way, thanks for the quick turnaround.

@amitguptagwl
Copy link
Copy Markdown
Member

I noticed that if we build an array in tagExpWithClosingIndex like this [tagname, true, booleanAttr, true, attr, val] then it will simplify the logic of boolean attributes and raw attributes methods too. But this will need some time. So I'l reverting my changes to release rest of the changes and then we can take this change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Error: RangeError: Maximum call stack size exceeded

2 participants