Fix character reference consumption for consecutive `&` in text and correct numeric NCR edge handling by Copilot · Pull Request #172 · HtmlUnit/htmlunit-neko

Copilot · 2026-05-07T16:53:32Z

This updates HTML character reference parsing to preserve literal ampersands in text when a later named reference exists, and fixes two numeric character reference edge cases in unicode entity resolution.

HTMLScanner: preserve & runs in data state
- In scanEntityRef, added an early && guard so an invalid & start consumes/emits exactly one literal & and leaves following input untouched.
- This prevents collapsing ampersands in sequences like &&&> and allows only the final valid > to resolve.
HTMLUnicodeEntitiesParser: surrogate consumption correctness
- In setMatchFromCode, surrogate handling now sets matchLength_ = consumedCount_.
- Prevents leaking trailing literal fragment after replacement (&#xDFFF; now fully consumes as �).
HTMLUnicodeEntitiesParser: Windows-1252 mapping correction
- Corrected NCR mapping for 0x98 from U+20DC to U+02DC per HTML/Windows-1252 remap table.
Regression coverage
- Added text-content regression cases for:
  - FOO&&&>BAR → FOO&&&>BAR
  - FOO&#xDFFF;ZOO → FOO�ZOO
  - FOOZOO → FOO˜ZOO
- Extended unicode parser unit coverage for surrogate rewind behavior and 0x98 mapping.

// HTMLScanner.scanEntityRef
if ('&' == nextChar) {
    fCurrentEntity.rewind(1);
    if (plainValue != null) {
        plainValue.append('&');
    }
    str.clearAndAppend('&');
    return returnEntityRefString(str, content);
}

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

checkstyle.org
- Triggering command: /opt/hostedtoolcache/CodeQL/2.25.1/x64/codeql/tools/linux64/java/bin/java /opt/hostedtoolcache/CodeQL/2.25.1/x64/codeql/tools/linux64/java/bin/java -jar /opt/hostedtoolcache/CodeQL/2.25.1/x64/codeql/xml/tools/xml-extractor.jar --fileList=/tmp/codeql-scratch-a1ec6fe11304a4f3/dbs/java/working/files-to-index15099172076840092536.list --sourceArchiveDir=/tmp/codeql-scratch-a1ec6fe11304a4f3/dbs/java/src --outputDir=/tmp/codeql-scratch-a1ec6fe11304a4f3/dbs/java/trap/java (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Original prompt

Fix HTMLScanner character reference consumption so that in text content (data state) consecutive ampersands are not lost when followed by a named character reference.

Repro:
Input text content: FOO&&&>BAR
Current output produced by parser: FOO&&>;BAR
Expected output per HTML spec behavior: FOO&&&>BAR (only the final > becomes >; the preceding & remain literal).

Also fix numeric character reference handling in HTMLUnicodeEntitiesParser:

Surrogate numeric character reference should not leak the literal remainder. Repro: FOO&#xDFFF;ZOO currently yields FOO�xDFFF;ZOO, expected FOO�ZOO. Root cause: in HTMLUnicodeEntitiesParser.setMatchFromCode(), surrogate branch sets match_ but not matchLength_. Set matchLength_ = consumedCount_.
Windows-1252 control mapping bug: FOOZOO should map 0x98 -> U+02DC (SMALL TILDE). Current code maps to U+20DC. Change case 0x98 to "\u02DC".

Implementation notes:

Repo: HtmlUnit/htmlunit-neko.
Investigate src/main/java/org/htmlunit/cyberneko/HTMLScanner.java character reference parsing function that handles '&' and uses HTMLNamedEntitiesParser/HTMLUnicodeEntitiesParser. Ensure invariant: if no valid reference, consume exactly one '&' and emit exactly one '&' without dropping following characters; and if valid reference like > is present later in a run of '&', only that one should be consumed.
Add/adjust unit tests for these cases. Search existing tests (likely under src/test/java) and add new ones if missing.

Acceptance criteria:

Parsing text content FOO&&&>BAR results in a single text node with value exactly FOO&&&>BAR.
Parsing FOO&#xDFFF;ZOO yields text FOO�ZOO.
Parsing FOOZOO yields text FOO˜ZOO (U+02DC).
All tests pass.

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: why the html parser generates this dom tree
|
|
|
|
|
|
|

|

from

Assistant: Now I have a thorough understanding of the code. Let me trace through what happens with <a></a> step by step.

Why <a></a> produces that DOM tree

The behaviour is driven entirely by HTMLTagBalancer.java, specifically two mechanisms:

Key mechanism 1 — "Save and re-open formatting elements" when a block arrives (lines 868–962)

// if block element, save immediate parent formatting elements int depth = 0; if (element.flags == 0) { // <-- flags==0 means "neither INLINE nor BLOCK alone" — applies to ? ... } // ... later ... // re-open formatting elements for (int i = 0; i < depth; i++) { final Info info = fFormattingStack.pop(); forceStartElement(info.qname, info.attributes, synthesizedAugs()); }

Actually  is a block element. When a block element is encountered, the parser:

Scans backward through fElementStack for consecutive formatting elements (INLINE | FORMATTING flag).

Synthesizes close events for those formatting elements.

Emits the block element.

Re-opens (forces new start events for) those same formatting elements inside the block.

 is defined as Element.INLINE | Element.FORMATTING and <a> as Element.INLINE (but not FORMATTING). The scan stops at <a> because it is not a formatting element.

Key mechanism 2 —  closes  in element.closes

 has a closes list that includes P itself, so an open  is implicitly closed by a new .

Step-by-step parse of <a></a>

Step Input token Action Stack DOM emitted

1  Implicit <html><head></head><body>, then push  html > body > b html, head, body, b

2  Push second  (no closes rule stops it) html > body > b > b b

3 <a> Push <a> html > body > b > b > a a

4  Block element arrives → scan formatting elements upward from <a>: <a> is INLINE but not FORMATTING → scan stops. Only the directly adjacent  (the second one) is FORMATTING. Save it to fFormattingStack, emit synthetic . Stack is now b > a. Then  is emitted and pushed. Then the saved  is re-opened inside . …> b > a > p > b (synth), , (synth)

5 </a> Look for <a> in element stack. Found at depth 3 (above p and b). But getElementDepth stops at a block boundary —  is a block element, so the search breaks early → depth = -1 (not found). The </a> is discarded. unchanged *(dis...

This pull request was created from Copilot chat.

Agent-Logs-Url: https://github.com/HtmlUnit/htmlunit-neko/sessions/7ff3bc0d-a2b2-4ffc-a2bf-019d40132ce6 Co-authored-by: rbri <[email protected]>

sonarqubecloud · 2026-05-07T16:57:50Z

Quality Gate passed

Issues
25 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
1.6% Duplication on New Code

See analysis details on SonarQube Cloud

Initial plan

e43dc2c

Copilot AI assigned Copilot and rbri May 7, 2026

Copilot started work on behalf of rbri May 7, 2026 16:53 View session

Fix entity parsing for ampersands and numeric references

5230135

Agent-Logs-Url: https://github.com/HtmlUnit/htmlunit-neko/sessions/7ff3bc0d-a2b2-4ffc-a2bf-019d40132ce6 Co-authored-by: rbri <[email protected]>

Copilot AI changed the title ~~[WIP] Fix HTMLScanner character reference consumption and handling~~ Fix character reference consumption for consecutive & in text and correct numeric NCR edge handling May 7, 2026

Copilot finished work on behalf of rbri May 7, 2026 16:59

Copilot AI requested a review from rbri May 7, 2026 16:59

Copilot started work on behalf of rbri May 7, 2026 17:32 View session

Copilot finished work on behalf of rbri May 7, 2026 17:33

Copilot started work on behalf of rbri May 7, 2026 17:34 View session

Copilot finished work on behalf of rbri May 7, 2026 17:35

rbri closed this May 7, 2026

rbri deleted the copilot/fix-htmlscanner-character-reference branch May 7, 2026 17:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix character reference consumption for consecutive `&` in text and correct numeric NCR edge handling#172

Fix character reference consumption for consecutive `&` in text and correct numeric NCR edge handling#172
Copilot wants to merge 2 commits into
masterfrom
copilot/fix-htmlscanner-character-reference

Copilot AI commented May 7, 2026 •

edited

Loading

Why `<b><b><a><p></a>` produces that DOM tree

Key mechanism 1 — "Save and re-open formatting elements" when a block arrives (lines 868–962)

Key mechanism 2 — `<p>` closes `<p>` in `element.closes`

Step-by-step parse of `<b><b><a><p></a>`

Uh oh!

sonarqubecloud Bot commented May 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Step	Input token	Action	Stack	DOM emitted
1	`<b>`	Implicit `<html><head></head><body>`, then push `<b>`	`html > body > b`	`html, head, body, b`
2	`<b>`	Push second `<b>` (no `closes` rule stops it)	`html > body > b > b`	`b`
3	`<a>`	Push `<a>`	`html > body > b > b > a`	`a`
4	`<p>`	Block element arrives → scan formatting elements upward from `<a>`: `<a>` is INLINE but not FORMATTING → scan stops. Only the directly adjacent `<b>` (the second one) is FORMATTING. Save it to `fFormattingStack`, emit synthetic `</b>`. Stack is now `b > a`. Then `<p>` is emitted and pushed. Then the saved `<b>` is re-opened inside `<p>`.	`…> b > a > p > b`	`</b>(synth), <p>, <b>(synth)`
5	`</a>`	Look for `<a>` in element stack. Found at depth 3 (above `p` and `b`). But `getElementDepth` stops at a block boundary — `<p>` is a block element, so the search breaks early → depth = -1 (not found). The `</a>` is discarded.	unchanged	*(dis...

Uh oh!

Conversation

Copilot AI commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

I tried to connect to the following addresses, but was blocked by firewall rules:

Why <b><b><a><p></a> produces that DOM tree

Key mechanism 1 — "Save and re-open formatting elements" when a block arrives (lines 868–962)

Key mechanism 2 — <p> closes <p> in element.closes

Step-by-step parse of <b><b><a><p></a>

Uh oh!

sonarqubecloud Bot commented May 7, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Copilot AI commented May 7, 2026 •

edited

Loading

Why `<b><b><a><p></a>` produces that DOM tree

Key mechanism 2 — `<p>` closes `<p>` in `element.closes`

Step-by-step parse of `<b><b><a><p></a>`