
Openpdf-core rendering integration in openpdf-renderer #1566

Open
andreasrosdal wants to merge 11 commits into LibrePDF:master from andreasrosdal:claude/openpdf-core-integration-Fc9rv

Conversation

@andreasrosdal
Contributor

@andreasrosdal andreasrosdal commented May 18, 2026

Summary

Continues the work to use openpdf-core (PdfReader + PdfContentParser) as the rendering engine in openpdf-renderer. The first commit on this branch extended the basic operator subset; the second commit closes the gap on real-world PDF features — XObjects (forms + images), inline-image safety, and a more robust parse loop.

Commit 1 — expand operator coverage

  • CMYK colors (k, K) and color-space-aware fills/strokes (cs, CS, sc, SC, scn, SCN) for DeviceGray / DeviceRGB / DeviceCMYK.
  • Clipping (W, W*) with proper save/restore through q/Q.
  • Line styling (J, j, M, d, i) plumbed into the BasicStroke.
  • Extended graphics state (gs) honoring CA/ca alpha and LW/ML/LC/LJ.
  • Text rise (Ts).
  • Marked content / compatibility (BMC, BDC, EMC, MP, DP, BX, EX) parsed as no-ops so content inside them still renders.
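As a rough illustration of the line-styling bullet above, the sketch below maps PDF line-style operands onto a `java.awt.BasicStroke`. The class and method names (`PdfStrokeSketch`, `toStroke`) are hypothetical and not taken from this PR; the clamping choices are assumptions made so that `BasicStroke` never rejects its arguments.

```java
import java.awt.BasicStroke;

/** Illustrative sketch only; names are hypothetical, not the PR's code. */
final class PdfStrokeSketch {

    /** PDF line cap (J): 0 = butt, 1 = round, 2 = projecting square. */
    static int toAwtCap(int pdfCap) {
        switch (pdfCap) {
            case 1:  return BasicStroke.CAP_ROUND;
            case 2:  return BasicStroke.CAP_SQUARE;
            default: return BasicStroke.CAP_BUTT;
        }
    }

    /** PDF line join (j): 0 = miter, 1 = round, 2 = bevel. */
    static int toAwtJoin(int pdfJoin) {
        switch (pdfJoin) {
            case 1:  return BasicStroke.JOIN_ROUND;
            case 2:  return BasicStroke.JOIN_BEVEL;
            default: return BasicStroke.JOIN_MITER;
        }
    }

    /** BasicStroke rejects empty, negative, or all-zero dash arrays. */
    private static boolean usableDash(float[] dash) {
        if (dash == null || dash.length == 0) {
            return false;
        }
        boolean anyPositive = false;
        for (float d : dash) {
            if (d < 0f) {
                return false;
            }
            if (d > 0f) {
                anyPositive = true;
            }
        }
        return anyPositive;
    }

    static BasicStroke toStroke(float width, int pdfCap, int pdfJoin,
                                float miterLimit, float[] dash, float phase) {
        return new BasicStroke(
                Math.max(0f, width),
                toAwtCap(pdfCap),
                toAwtJoin(pdfJoin),
                Math.max(1f, miterLimit),             // BasicStroke requires >= 1 for miter joins
                usableDash(dash) ? dash.clone() : null, // null means a solid line
                Math.max(0f, phase));
    }
}
```

An empty PDF dash array (`[] 0 d`) means a solid line, which is why the sketch passes `null` rather than an empty array.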

New convenience entry points on OpenPdfCoreRenderer:

  • renderPage(int, Graphics2D, int, int) — draws directly onto a caller-supplied Graphics2D (Swing, printer, SVG-backed graphics) without allocating a BufferedImage; saves and restores the caller's transform and clip.
  • renderAllPages(float) — convenience that returns one BufferedImage per page in document order.
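The "saves and restores the caller's transform and clip" behavior can be pictured with the minimal sketch below. It is not the PR's implementation; `GraphicsStateGuard` and `drawPreservingState` are hypothetical names, and only standard `Graphics2D` calls are used.

```java
import java.awt.Graphics2D;
import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.util.function.Consumer;

/** Illustrative sketch only; names are hypothetical, not the PR's code. */
final class GraphicsStateGuard {

    static void drawPreservingState(Graphics2D g, Consumer<Graphics2D> painter) {
        AffineTransform savedTransform = g.getTransform(); // getTransform returns a copy
        Shape savedClip = g.getClip();                     // may be null (no clip set)
        try {
            painter.accept(g);
        } finally {
            // Restore the transform first: the saved clip is expressed in that user space.
            g.setTransform(savedTransform);
            g.setClip(savedClip);
        }
    }
}
```

An alternative with the same effect is to paint on `g.create()` and `dispose()` the copy afterwards; which approach the renderer actually uses is not stated in this description.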

Commit 2 — XObject support and inline-image safety

  • Form XObjects (Do with /Subtype /Form) render recursively, applying the form's own /Matrix and /BBox under the current CTM with full state save/restore so form content can't leak out.
  • Image XObjects (Do with /Subtype /Image) decode via:
    • ImageIO for JPEG (DCTDecode) and JPEG 2000 (JPXDecode, where the runtime supports it).
    • A manual raster builder for uncompressed / Flate-decoded 8-bit DeviceGray, DeviceRGB and DeviceCMYK streams. CMYK is approximated to sRGB on the fly since Java2D can't natively draw a CMYK raster.
    • Image XObjects honor the current fill alpha (ca from gs) and the CTM, drawing into the standard (0,0)-(1,1) unit square.
  • Inline images (BI/ID/EI) are pre-stripped from the content stream before PdfContentParser sees them. PdfContentParser has no inline-image handling and the raw image bytes after ID would otherwise derail tokenization for the rest of the page.
  • The content-stream parse loop now treats parser-level failures (malformed dictionary, unterminated array, ...) as "stop early" rather than aborting the whole renderer, matching how operator-level errors were already handled.
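For readers unfamiliar with the CMYK approximation mentioned above, here is a minimal sketch of one common naive conversion, where each RGB channel is `(1 - ink) * (1 - black)`. The PR does not say which formula it uses, so this is an assumption, and `CmykApprox` is a hypothetical name.

```java
import java.awt.image.BufferedImage;

/** Illustrative sketch only; the formula and names are assumptions, not the PR's code. */
final class CmykApprox {

    /** Naive DeviceCMYK to RGB. Inputs are 0..255; returns opaque ARGB. */
    static int toArgb(int c, int m, int y, int k) {
        int r = (255 - c) * (255 - k) / 255;
        int g = (255 - m) * (255 - k) / 255;
        int b = (255 - y) * (255 - k) / 255;
        return 0xFF000000 | (r << 16) | (g << 8) | b;
    }

    /** Builds an RGB image from interleaved 8-bit CMYK samples (4 bytes per pixel). */
    static BufferedImage toImage(byte[] cmyk, int width, int height) {
        if (cmyk.length < width * height * 4) {
            throw new IllegalArgumentException("not enough CMYK samples");
        }
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int p = 0, i = 0; p < width * height; p++, i += 4) {
            int argb = toArgb(cmyk[i] & 0xFF, cmyk[i + 1] & 0xFF,
                              cmyk[i + 2] & 0xFF, cmyk[i + 3] & 0xFF);
            img.setRGB(p % width, p / width, argb);
        }
        return img;
    }
}
```

This ignores ICC profiles entirely, so colors will differ from a color-managed viewer.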

Tests

  • OpenPdfCorePageRendererOperatorsTest (8 tests) builds synthetic PDFs with PdfContentByte/PdfTemplate and renders them back, verifying:
    • CMYK fills, dashed strokes, W-clipped fills, marked-content sequences, text rise.
    • JPEG Image XObject — embed a red JPEG, check the rendered page contains red pixels.
    • Form XObject — stamp a PdfTemplate with an orange fill, check the form content reaches the rasterizer.
    • Inline image safety — hand-rolled stream with a BI/ID/EI block followed by a red rectangle, check the trailing rectangle still renders after the inline image is stripped.
  • OpenPdfCoreRendererTest (16 tests) covers the new renderPage(int, Graphics2D, int, int) and renderAllPages(float) overloads, including argument validation and Graphics2D state restoration.
  • Whole openpdf-renderer module test suite: 84 tests, 0 failures, 0 errors.
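The "contains red pixels" style of check used by these tests could look roughly like the helper below. This is a sketch under assumptions: `PixelProbe`, `countMatching` and the color thresholds are hypothetical, and the real tests may probe pixels differently.

```java
import java.awt.image.BufferedImage;
import java.util.function.IntPredicate;

/** Illustrative sketch only; names and thresholds are assumptions. */
final class PixelProbe {

    /** Counts pixels whose 24-bit RGB value satisfies the given test. */
    static int countMatching(BufferedImage img, IntPredicate rgbTest) {
        int hits = 0;
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                if (rgbTest.test(img.getRGB(x, y) & 0xFFFFFF)) {
                    hits++;
                }
            }
        }
        return hits;
    }

    /** Loose "red" test that tolerates JPEG artifacts and antialiasing. */
    static boolean isReddish(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return r > 200 && g < 60 && b < 60;
    }
}
```

A test would then assert something like `countMatching(page, PixelProbe::isReddish) > 0` on the rendered page image.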

README's operator table updated to reflect the broader coverage; new code examples for Graphics2D and batch rendering.

My name: Andreas Røsdal

claude added 3 commits May 18, 2026 09:07
Second pass at using openpdf-core as the rendering engine in
openpdf-renderer. Extends the Java2D rasterizer driven by
PdfContentParser with the operators most commonly missing on
real-world PDFs:

- CMYK colors (k, K) and color-space-aware fills/strokes
  (cs, CS, sc, SC, scn, SCN) for DeviceGray / DeviceRGB / DeviceCMYK.
- Clipping (W, W*) with proper save/restore through q/Q.
- Line styling (J, j, M, d, i) plumbed into the BasicStroke.
- Extended graphics state (gs) honoring CA/ca alpha and LW/ML/LC/LJ.
- Text rise (Ts).
- Marked content / compatibility operators (BMC, BDC, EMC, MP, DP,
  BX, EX) parsed as no-ops so content inside them still renders.

Adds two new conveniences on OpenPdfCoreRenderer:
- renderPage(int, Graphics2D, int, int) draws directly onto a caller-
  supplied Graphics2D without allocating a BufferedImage, and saves/
  restores the caller's transform and clip.
- renderAllPages(float) returns one BufferedImage per page.

Adds OpenPdfCorePageRendererOperatorsTest that builds synthetic PDFs
with PdfContentByte and renders them back to verify CMYK fills,
dashed strokes, clipping, marked content and text rise all drive
the renderer end-to-end. README updated to reflect the broader
operator table.

Completes the openpdf-core-driven Java2D renderer by handling the
operators most commonly missing on real-world PDFs:

- Do: Form XObjects render recursively, applying the form's own
  /Matrix and /BBox under the current CTM with state save/restore.
  Image XObjects decode via:
   * ImageIO for DCTDecode (JPEG) and JPXDecode (JPEG 2000, when
     supported by the runtime),
   * a manual raster builder for uncompressed / Flate-decoded
     8-bit DeviceGray, DeviceRGB and DeviceCMYK streams (CMYK is
     approximated to sRGB on the fly, since Java2D can't natively
     draw a CMYK raster).
  Image XObjects honor the current fill alpha (ca from ExtGState)
  and the CTM, drawing into the standard (0,0)-(1,1) unit square.

- Inline images (BI/ID/EI) are now pre-stripped from the content
  stream before PdfContentParser sees them; the parser had no
  inline-image handling and the raw image bytes after ID would
  otherwise derail tokenization for the rest of the page.

- The content-stream parse loop now treats parser-level failures
  (malformed dictionaries, unterminated arrays) as "stop early"
  rather than aborting the whole renderer, matching how operator-
  level errors were already handled.
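The "stop early" policy can be sketched independently of openpdf-core as below. `OperationSource` is a hypothetical stand-in for whatever yields parsed operators; the real loop is driven by PdfContentParser and is not reproduced here.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; types and names are hypothetical, not the PR's code. */
final class LenientParseLoop {

    /** Stand-in for a content-stream parser; returns null at end of stream. */
    interface OperationSource {
        String next() throws IOException;
    }

    /** Collects operators until the stream ends or the parser fails. */
    static List<String> drain(OperationSource source) {
        List<String> handled = new ArrayList<>();
        while (true) {
            String op;
            try {
                op = source.next();
            } catch (IOException | RuntimeException parseFailure) {
                break; // keep what was processed so far instead of failing the page
            }
            if (op == null) {
                break;
            }
            handled.add(op);
        }
        return handled;
    }
}
```

The trade-off is that a malformed page renders partially and silently, so a real implementation would probably log the failure.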

Tests added to OpenPdfCorePageRendererOperatorsTest:
- rendersJpegImageXObject builds a red JPEG, embeds it via
  PdfContentByte.addImage, and checks the page contains red pixels.
- rendersFormXObjectViaNestedContentStream stamps a PdfTemplate
  with an orange fill and checks the form's content reaches the
  rasterizer.
- inlineImagesDoNotBreakPageRendering writes a hand-rolled stream
  with a BI/ID/EI block followed by a red rectangle and checks the
  trailing rectangle still renders.

README updated; module test suite: 84 tests, 0 failures.

Addresses the checkstyle 'single-line Javadoc comment should be
multi-line' rule on the new openpdf-core renderer code. Affects
ten one-line Javadocs across OpenPdfCorePageRenderer and one in
OpenPdfCoreRenderer; behavior unchanged.

@codacy-production

codacy-production Bot commented May 18, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues


🟢 Metrics 414 complexity · 12 duplication

Metric Results
Complexity 414
Duplication 12


claude added 3 commits May 18, 2026 09:43
Splits the two over-branchy helpers Codacy flagged into smaller
focused methods, and stops reassigning a method parameter:

- applyExtGState(String) was a flat list of seven null checks
  driving an NPath of 2048. Split into resolveExtGStateDict,
  applyExtGStateAlpha and applyExtGStateLineStyle.
- imageComponents(PdfObject) was a chain of PdfName.equals checks
  on freshly-allocated PdfNames (NPath 3136). Now uses static
  Set<PdfName> lookups (DEVICE_GRAY_NAMES / DEVICE_RGB_NAMES /
  DEVICE_CMYK_NAMES) with named PdfName constants, split across
  componentsForNamedColorSpace, componentsForArrayColorSpace and
  iccBasedComponents.
- imageComponents no longer reassigns its csObj parameter; uses
  a local `direct` reference instead.

Also wraps the long XObject row in openpdf-renderer/README.md
that was exceeding the 120-column limit (was 207).

No behavior change; module test suite still 84/84 green.

Biggest correctness gap on real-world PDFs has been text rendering:
mapFont() picked a generic Java2D family (Serif/Sans/Mono) from the
PostScript font name, so PDFs that embedded their own subsetted fonts
drew with the wrong glyph shapes (and missed glyphs whenever the name
heuristic chose a family that didn't cover the Unicode chars).

This commit closes that gap for the dominant case (embedded TrueType /
FontFile2):

- mapFont() now first calls embeddedFontFor(...), which pulls the
  CMapAwareDocumentFont's FontDescriptor via openpdf-core, finds the
  embedded font program stream (FontFile2 / FontFile3 / FontFile in
  that preference order), and loads it with Font.createFont. The
  resulting AWT Font is cached by FontDescriptor identity so the same
  font program isn't re-parsed for every Tj/TJ call.
- When no font program is embedded, or parsing fails, falls back to
  the previous name-heuristic path (now mapFontByName(...)).
- Failures are cached (as a null Font) so we don't retry every glyph.
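The caching behavior described above (keyed by object identity, with failures remembered as null) can be sketched generically as follows. `IdentityCache` is a hypothetical name; in the renderer the key would be the FontDescriptor and the loader would wrap something like `Font.createFont(Font.TRUETYPE_FONT, inputStream)`, but that wiring is an assumption and is not shown.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustrative sketch only; names are hypothetical, not the PR's code. */
final class IdentityCache<K, V> {

    // IdentityHashMap compares keys with ==, and it permits null values.
    private final Map<K, V> entries = new IdentityHashMap<>();
    private final Function<K, V> loader;

    IdentityCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, or null if loading failed (now or earlier). */
    V get(K key) {
        if (entries.containsKey(key)) {
            return entries.get(key); // may be a cached null, meaning "known failure"
        }
        V value;
        try {
            value = loader.apply(key);
        } catch (RuntimeException loadFailure) {
            value = null;
        }
        entries.put(key, value);
        return value;
    }
}
```

Using `containsKey` rather than a null check is what keeps a failed load from being retried on every text-showing operator.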

Test:
- rendersTextUsingEmbeddedTrueTypeFont embeds LiberationSans-Regular
  (shipped with openpdf-core for font-fallback) in a freshly built PDF,
  renders the page back and verifies dark pixels appear in the text
  region. The embedded program is required: no name-based AWT family
  would match "LiberationSans".

README's "Status" section updated and a candid "Honest limitations &
roadmap" subsection added. It calls out the remaining gaps in priority
order (Type 1 / CFF fonts, Type 3 fonts, ICC color management, patterns
and shadings, inline images, soft masks, indexed/Separation/DeviceN,
encryption) so future contributors know which gap to grab next.

Module test suite: 85 tests, 0 failures.

Codacy markdownlint flagged the bullet list under "XObject coverage:"
for missing a leading blank line (lists should be surrounded by blank
lines). Single-line fix.

@andreasrosdal andreasrosdal changed the title from "Complete openpdf-core rendering integration in openpdf-renderer" to "Openpdf-core rendering integration in openpdf-renderer" May 18, 2026

claude added 5 commits May 18, 2026 10:10
Implements the inline-image roadmap item: instead of pre-stripping inline
image blocks, the preprocessor now promotes each one into a synthetic Image
XObject and substitutes a `/__inline_image__N Do` invocation into the
content stream. The rest of the renderer treats it exactly like a regular
Image XObject and reuses the existing buildGrayImage / buildRgbImage /
buildCmykImage / ImageIO decode paths.

Two framing strategies are used so the parser doesn't get confused by
binary data:

- For DCT / DCTDecode / JPXDecode filters, find the JPEG end-of-image
  marker (FFD9) instead of scanning for "EI" bounded by whitespace, since
  JPEG payloads routinely contain byte sequences that look like EI by
  accident.
- For other filters (including no filter and FlateDecode), keep the
  whitespace-bounded EI heuristic but stop trimming "trailing whitespace"
  greedily -- image bytes can legitimately be 0x00 or 0x0A and the spec
  guarantees exactly one whitespace byte before EI.
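The two framing strategies might be sketched as below. This is not the PR's preprocessor; `InlineImageFraming` and its methods are hypothetical, and taking the first FFD9 is a simplification (a JPEG with an embedded thumbnail could contain an earlier one).

```java
/** Illustrative sketch only; names are hypothetical, not the PR's code. */
final class InlineImageFraming {

    /** PDF whitespace: NUL, TAB, LF, FF, CR, SPACE. */
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D || b == 0x20;
    }

    /** Index just past the first FF D9 at or after 'from', or -1 if absent. */
    static int endOfJpeg(byte[] data, int from) {
        for (int i = Math.max(0, from); i + 1 < data.length; i++) {
            if ((data[i] & 0xFF) == 0xFF && (data[i + 1] & 0xFF) == 0xD9) {
                return i + 2;
            }
        }
        return -1;
    }

    /**
     * Index of the 'E' in the first whitespace-bounded "EI" at or after 'from',
     * or -1. The image payload then ends at (result - 1): exactly one
     * whitespace byte is dropped, never a greedy run.
     */
    static int findEi(byte[] data, int from) {
        for (int i = Math.max(1, from); i + 1 < data.length; i++) {
            boolean isEi = data[i] == 'E' && data[i + 1] == 'I';
            boolean before = isPdfWhitespace(data[i - 1] & 0xFF);
            boolean after = i + 2 == data.length || isPdfWhitespace(data[i + 2] & 0xFF);
            if (isEi && before && after) {
                return i;
            }
        }
        return -1;
    }
}
```

With payload bytes `0x41 0x0A` followed by `0x0A EI`, the sketch keeps both payload bytes and drops only the single separator, which is the case greedy trimming would get wrong.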

Abbreviated dict keys (/W, /H, /BPC, /CS, /F) and full names (/Width,
/Height, ...) are both accepted; abbreviated colorspace values (/G, /RGB,
/CMYK) and full names map to component counts.
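A table-driven way to accept both spellings is sketched below. The abbreviations come from the paragraph above; the class name, the use of plain strings instead of PdfName, and the `-1` "unknown" result are assumptions for illustration.

```java
import java.util.Map;

/** Illustrative sketch only; names are hypothetical, not the PR's code. */
final class InlineImageKeys {

    // Abbreviated inline-image dictionary keys and their full names (no leading slash).
    private static final Map<String, String> FULL_KEYS = Map.of(
            "W", "Width",
            "H", "Height",
            "BPC", "BitsPerComponent",
            "CS", "ColorSpace",
            "F", "Filter");

    // Component count per device color space, abbreviated or full.
    private static final Map<String, Integer> COMPONENTS = Map.of(
            "G", 1, "DeviceGray", 1,
            "RGB", 3, "DeviceRGB", 3,
            "CMYK", 4, "DeviceCMYK", 4);

    /** Expands an abbreviated key; full names pass through unchanged. */
    static String fullKey(String key) {
        return FULL_KEYS.getOrDefault(key, key);
    }

    /** Returns 1, 3 or 4 for device color spaces, or -1 if not recognized. */
    static int components(String colorSpace) {
        return COMPONENTS.getOrDefault(colorSpace, -1);
    }
}
```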

Tests:
- inlineImageRendersAtCtmLocation builds a 2x2 DeviceGray inline image
  with a [black, white; white, black] checker, scales it 120x via a cm,
  and asserts the rendered page contains dark pixels in the right region.
- jpegInlineImageDecodes uses PdfContentByte.addImage(image, ..., true)
  to embed a green JPEG as an inline image, then asserts the rendered
  page contains green pixels.

README's status section now says inline images render, and the
limitations list no longer mentions them.

Also addresses Codacy's "unnecessary fully qualified name" warning on
java.util.List / java.util.Set usage. The class now imports List, Set,
Arrays, ByteArrayOutputStream, StandardCharsets and Rectangle2D directly
instead of inlining the FQNs; 7 call sites simplified.

Module test suite: 86 tests, 0 failures.

Highest-ROI item left on the renderer roadmap: PNG-to-PDF conversion
of palette-based PNGs produces images with
`[/Indexed /DeviceRGB hival lookup]` color spaces, and the renderer was
silently skipping them
(decodeRawRaster falls through to null for non-Device colorspaces).
This commit adds the decode path.

- decodeImage now recognizes `[/Indexed base hival lookup]` (with
  CS_INDEXED constant) and routes to a new decodeIndexedImage.
- decodeIndexedImage reads 8-bit indices from the (already
  Flate-decoded) stream, expands each pixel through the lookup table
  into the base color space's component bytes, then reuses the
  existing buildGrayImage / buildRgbImage / buildCmykImage helpers.
  The base color space's component count is determined via the
  existing imageComponents().
- readIndexedLookup handles both forms the spec allows: a PdfString
  containing the palette bytes, or a PRStream whose decoded content
  is the palette.
- Sub-byte bit depths (1/2/4-bit indices) are explicitly rejected
  for now -- 8-bit is the dominant case for PNG-derived images.
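The palette expansion step can be illustrated with the short sketch below. It is a guess at the shape of the logic, not the PR's decodeIndexedImage: `IndexedExpand` is a hypothetical name, and clamping out-of-range indices to the last palette entry is an assumption.

```java
/** Illustrative sketch only; names and clamping policy are assumptions. */
final class IndexedExpand {

    /**
     * Expands 8-bit palette indices into interleaved base-color-space samples.
     * 'lookup' holds 'components' bytes per palette entry.
     */
    static byte[] expand(byte[] indices, byte[] lookup, int components) {
        if (components <= 0 || lookup.length < components) {
            throw new IllegalArgumentException("palette needs at least one entry");
        }
        int entries = lookup.length / components;
        byte[] out = new byte[indices.length * components];
        for (int p = 0; p < indices.length; p++) {
            // Bytes are signed in Java, so mask before using one as an index.
            int index = Math.min(indices[p] & 0xFF, entries - 1);
            System.arraycopy(lookup, index * components, out, p * components, components);
        }
        return out;
    }
}
```

The expanded array has the same layout as a plain DeviceGray, DeviceRGB or DeviceCMYK raster, which is what lets the existing image builders be reused.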

Test:
- rendersIndexedColorImageXObject builds a 32x32 BufferedImage with
  an IndexColorModel (top half = magenta, bottom = cyan), embeds it
  via Image.getInstance(BufferedImage), and asserts both palette
  colors appear in the rendered page. openpdf-core's Image.getInstance
  preserves the IndexColorModel as `[/Indexed /DeviceRGB ...]`, so
  this exercises the new decode path end-to-end.

README updated: Indexed moved from "limitations" to the supported
Image XObject formats; only sub-byte-packed indexed images remain
called out as unsupported.

Module test suite: 87 tests, 0 failures.

Codacy flagged decodeIndexedImage with NPath 385 (threshold 200).
Splits the method along its natural seams without changing behavior:

- decodeIndexedImage now just wraps the try/catch around
  decodeIndexedImageOrThrow.
- decodeIndexedImageOrThrow handles validation + orchestration.
- readBitsPerComponent extracts the /BitsPerComponent read.
- expandIndexedPalette is the per-pixel arraycopy loop.
- buildImageForBaseComponents is the switch on component count.

No behavior change; module test suite still 87/87 green.

PDFs that draw tables (PdfPTable, hand-rolled re/m/l/S grids, ...) lean
hard on three pieces of stroke handling that this renderer was getting
wrong or skipping:

- Zero-width hairline strokes (PDF §8.4.3.2). `0 w` means "the thinnest
  line the device can render", i.e. one device pixel. The previous
  `Math.max(lineWidth, 0.001f)` collapsed those hairlines to invisibility
  once the page CTM scaled them. Now strokePath() computes an effective
  width of `1 / max(|sx|, |sy|)` from the current transform so a `0 w`
  stroke renders as a one-device-pixel line at any DPI.

- ExtGState line styling beyond LW/ML/LC/LJ. The dash array `/D` and the
  stroke-adjust flag `/SA` are now read out of gs dictionaries; `/D`
  feeds the existing dash-pattern path, `/SA` is tracked through q/Q.

- Crisp axis-aligned borders. KEY_STROKE_CONTROL is now set to
  VALUE_STROKE_NORMALIZE so 0.5pt borders snap to integer device pixels
  instead of smearing into two rows of antialiased grey.
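The hairline rule from the first bullet, `1 / max(|sx|, |sy|)`, can be sketched as follows. `HairlineWidth` is a hypothetical name, reading the scale from `getScaleX()`/`getScaleY()` is an assumption about how sx and sy are obtained, and the fallback for a zero scale is also an assumption.

```java
import java.awt.geom.AffineTransform;

/** Illustrative sketch only; names and the zero-scale fallback are assumptions. */
final class HairlineWidth {

    /** Returns the user-space width to stroke with under the given transform. */
    static float effectiveWidth(float lineWidth, AffineTransform ctm) {
        if (lineWidth > 0f) {
            return lineWidth;
        }
        double scale = Math.max(Math.abs(ctm.getScaleX()), Math.abs(ctm.getScaleY()));
        if (scale == 0.0) {
            return 1f; // e.g. a pure 90-degree rotation has zero diagonal terms
        }
        // One device pixel expressed in user-space units.
        return (float) (1.0 / scale);
    }
}
```

For example, under a 4x scale a `0 w` stroke gets a user-space width of 0.25, which maps back to one device pixel.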

Adds two regression tests: a full PdfPTable render (background fills,
red 2pt header border, body-row text) and a `0 w` hairline render that
asserts the stroke is actually visible after CTM scaling.

https://claude.ai/code/session_01Bobvbg8Ccp2g9S5DRFsnNb

The existing PdfPTable test only exercised single-word cell values
("Col A", "r0c0"). This adds a regression test that pushes harder on
the text-in-table path: multi-line wrapped descriptions, a Phrase
composed of multiple Chunks with different fonts and colors (regular,
bold, italic, RED), varied horizontal alignments, a colored colspan
cell with vertical centering, and a larger header font.

The four assertions cover the parts that are easy to silently break:

- white-on-blue header glyphs (header row text under cell background),
- a red Chunk inside an otherwise-black Phrase (per-Chunk fill color),
- a blue colspan-cell Phrase (text under multi-column layout),
- a multi-line wrapped cell producing several distinct glyph rows.

https://claude.ai/code/session_01Bobvbg8Ccp2g9S5DRFsnNb

@sonarqubecloud


2 participants