Skip to content

Fix character reference consumption for consecutive & in text and correct numeric NCR edge handling#172

Closed
Copilot wants to merge 2 commits into
masterfrom
copilot/fix-htmlscanner-character-reference
Closed

Fix character reference consumption for consecutive & in text and correct numeric NCR edge handling#172
Copilot wants to merge 2 commits into
masterfrom
copilot/fix-htmlscanner-character-reference

Conversation

Copy link
Copy Markdown

Copilot AI commented May 7, 2026

This updates HTML character reference parsing to preserve literal ampersands in text when a later named reference exists, and fixes two numeric character reference edge cases in unicode entity resolution.

  • HTMLScanner: preserve & runs in data state

    • In scanEntityRef, added an early && guard so an invalid & start consumes/emits exactly one literal & and leaves following input untouched.
    • This prevents collapsing ampersands in sequences like &&&> and allows only the final valid > to resolve.
  • HTMLUnicodeEntitiesParser: surrogate consumption correctness

    • In setMatchFromCode, surrogate handling now sets matchLength_ = consumedCount_.
    • Prevents leaking trailing literal fragment after replacement (� now fully consumes as ).
  • HTMLUnicodeEntitiesParser: Windows-1252 mapping correction

    • Corrected NCR mapping for 0x98 from U+20DC to U+02DC per HTML/Windows-1252 remap table.
  • Regression coverage

    • Added text-content regression cases for:
      • FOO&&&>BARFOO&&&>BAR
      • FOO�ZOOFOO�ZOO
      • FOO˜ZOOFOO˜ZOO
    • Extended unicode parser unit coverage for surrogate rewind behavior and 0x98 mapping.
// HTMLScanner.scanEntityRef
if ('&' == nextChar) {
    fCurrentEntity.rewind(1);
    if (plainValue != null) {
        plainValue.append('&');
    }
    str.clearAndAppend('&');
    return returnEntityRefString(str, content);
}

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • checkstyle.org
    • Triggering command: /opt/hostedtoolcache/CodeQL/2.25.1/x64/codeql/tools/linux64/java/bin/java /opt/hostedtoolcache/CodeQL/2.25.1/x64/codeql/tools/linux64/java/bin/java -jar /opt/hostedtoolcache/CodeQL/2.25.1/x64/codeql/xml/tools/xml-extractor.jar --fileList=/tmp/codeql-scratch-a1ec6fe11304a4f3/dbs/java/working/files-to-index15099172076840092536.list --sourceArchiveDir=/tmp/codeql-scratch-a1ec6fe11304a4f3/dbs/java/src --outputDir=/tmp/codeql-scratch-a1ec6fe11304a4f3/dbs/java/trap/java (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Original prompt

Fix HTMLScanner character reference consumption so that in text content (data state) consecutive ampersands are not lost when followed by a named character reference.

Repro:
Input text content: FOO&&&>BAR
Current output produced by parser: FOO&&>;BAR
Expected output per HTML spec behavior: FOO&&&>BAR (only the final > becomes >; the preceding & remain literal).

Also fix numeric character reference handling in HTMLUnicodeEntitiesParser:

  1. Surrogate numeric character reference should not leak the literal remainder. Repro: FOO�ZOO currently yields FOO�xDFFF;ZOO, expected FOO�ZOO. Root cause: in HTMLUnicodeEntitiesParser.setMatchFromCode(), surrogate branch sets match_ but not matchLength_. Set matchLength_ = consumedCount_.
  2. Windows-1252 control mapping bug: FOO˜ZOO should map 0x98 -> U+02DC (SMALL TILDE). Current code maps to U+20DC. Change case 0x98 to "\u02DC".

Implementation notes:

  • Repo: HtmlUnit/htmlunit-neko.
  • Investigate src/main/java/org/htmlunit/cyberneko/HTMLScanner.java character reference parsing function that handles '&' and uses HTMLNamedEntitiesParser/HTMLUnicodeEntitiesParser. Ensure invariant: if no valid reference, consume exactly one '&' and emit exactly one '&' without dropping following characters; and if valid reference like > is present later in a run of '&', only that one should be consumed.
  • Add/adjust unit tests for these cases. Search existing tests (likely under src/test/java) and add new ones if missing.

Acceptance criteria:

  • Parsing text content FOO&&&>BAR results in a single text node with value exactly FOO&&&>BAR.
  • Parsing FOO�ZOO yields text FOO�ZOO.
  • Parsing FOO˜ZOO yields text FOO˜ZOO (U+02DC).
  • All tests pass.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: why the html parser generates this dom tree
|
|
|
|
|
|
|


|

from

Assistant: Now I have a thorough understanding of the code. Let me trace through what happens with <b><b><a><p></a> step by step.


Why <b><b><a><p></a> produces that DOM tree

The behaviour is driven entirely by HTMLTagBalancer.java, specifically two mechanisms:

Key mechanism 1 — "Save and re-open formatting elements" when a block arrives (lines 868–962)

// if block element, save immediate parent formatting elements
int depth = 0;
if (element.flags == 0) {   // <-- flags==0 means "neither INLINE nor BLOCK alone" — applies to <p>?
    ...
}
// ... later ...
// re-open formatting elements
for (int i = 0; i < depth; i++) {
    final Info info = fFormattingStack.pop();
    forceStartElement(info.qname, info.attributes, synthesizedAugs());
}

Actually <p> is a block element. When a block element is encountered, the parser:

  1. Scans backward through fElementStack for consecutive formatting elements (INLINE | FORMATTING flag).
  2. Synthesizes close events for those formatting elements.
  3. Emits the block element.
  4. Re-opens (forces new start events for) those same formatting elements inside the block.

<b> is defined as Element.INLINE | Element.FORMATTING and <a> as Element.INLINE (but not FORMATTING). The scan stops at <a> because it is not a formatting element.

Key mechanism 2 — <p> closes <p> in element.closes

<p> has a closes list that includes P itself, so an open <p> is implicitly closed by a new <p>.


Step-by-step parse of <b><b><a><p></a>

Step Input token Action Stack DOM emitted
1 <b> Implicit <html><head></head><body>, then push <b> html > body > b html, head, body, b
2 <b> Push second <b> (no closes rule stops it) html > body > b > b b
3 <a> Push <a> html > body > b > b > a a
4 <p> Block element arrives → scan formatting elements upward from <a>: <a> is INLINE but not FORMATTING → scan stops. Only the directly adjacent <b> (the second one) is FORMATTING. Save it to fFormattingStack, emit synthetic </b>. Stack is now b > a. Then <p> is emitted and pushed. Then the saved <b> is re-opened inside <p>. …> b > a > p > b </b>(synth), <p>, <b>(synth)
5 </a> Look for <a> in element stack. Found at depth 3 (above p and b). But getElementDepth stops at a block boundary — <p> is a block element, so the search breaks early → depth = -1 (not found). The </a> is discarded. unchanged *(dis...

This pull request was created from Copilot chat.

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 7, 2026

Copilot AI changed the title [WIP] Fix HTMLScanner character reference consumption and handling Fix character reference consumption for consecutive & in text and correct numeric NCR edge handling May 7, 2026
Copilot finished work on behalf of rbri May 7, 2026 16:59
Copilot AI requested a review from rbri May 7, 2026 16:59
Copilot finished work on behalf of rbri May 7, 2026 17:33
Copilot finished work on behalf of rbri May 7, 2026 17:35
@rbri rbri closed this May 7, 2026
@rbri rbri deleted the copilot/fix-htmlscanner-character-reference branch May 7, 2026 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants