Openpdf-core rendering integration in openpdf-renderer #1566
Open
andreasrosdal wants to merge 11 commits into
Second pass at using openpdf-core as the rendering engine in openpdf-renderer.
Extends the Java2D rasterizer driven by PdfContentParser with the operators most commonly missing on real-world PDFs:
- CMYK colors (k, K) and color-space-aware fills/strokes (cs, CS, sc, SC, scn, SCN) for DeviceGray / DeviceRGB / DeviceCMYK.
- Clipping (W, W*) with proper save/restore through q/Q.
- Line styling (J, j, M, d, i) plumbed into the BasicStroke.
- Extended graphics state (gs) honoring CA/ca alpha and LW/ML/LC/LJ.
- Text rise (Ts).
- Marked content / compatibility operators (BMC, BDC, EMC, MP, DP, BX, EX) parsed as no-ops so content inside them still renders.
Adds two new conveniences on OpenPdfCoreRenderer:
- renderPage(int, Graphics2D, int, int) draws directly onto a caller-supplied Graphics2D without allocating a BufferedImage, and saves/restores the caller's transform and clip.
- renderAllPages(float) returns one BufferedImage per page.
Adds OpenPdfCorePageRendererOperatorsTest that builds synthetic PDFs with PdfContentByte and renders them back to verify CMYK fills, dashed strokes, clipping, marked content and text rise all drive the renderer end-to-end.
README updated to reflect the broader operator table.
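The line-styling bullet above (J, j, M, d plumbed into the BasicStroke) can be sketched roughly as follows. This is a minimal illustration, not the renderer's actual code: the class `PdfLineStyle` and the method `toStroke` are hypothetical names. It relies on the fact that the PDF cap codes (0 butt, 1 round, 2 projecting square) and join codes (0 miter, 1 round, 2 bevel) line up with the `java.awt.BasicStroke` constants.

```java
import java.awt.BasicStroke;

/** Hypothetical helper for illustration; not part of openpdf-renderer. */
final class PdfLineStyle {
    private PdfLineStyle() {}

    /** Builds a BasicStroke from PDF line-style operands (w, J, j, M, d). */
    static BasicStroke toStroke(float width, int cap, int join, float miterLimit,
                                float[] dashArray, float dashPhase) {
        // PDF cap/join codes 0..2 carry the same values as the BasicStroke constants.
        int awtCap = (cap >= 0 && cap <= 2) ? cap : BasicStroke.CAP_BUTT;
        int awtJoin = (join >= 0 && join <= 2) ? join : BasicStroke.JOIN_MITER;
        // BasicStroke rejects a miter limit below 1 when the join is JOIN_MITER.
        float awtMiter = Math.max(miterLimit, 1f);
        // An empty dash array means a solid line in PDF; BasicStroke throws on
        // empty, negative or all-zero arrays, so those all map to null here.
        float[] awtDash = usableDash(dashArray) ? dashArray.clone() : null;
        return new BasicStroke(Math.max(width, 0f), awtCap, awtJoin, awtMiter,
                awtDash, Math.max(dashPhase, 0f));
    }

    private static boolean usableDash(float[] dash) {
        if (dash == null || dash.length == 0) {
            return false;
        }
        boolean anyPositive = false;
        for (float d : dash) {
            if (d < 0f) {
                return false;
            }
            if (d > 0f) {
                anyPositive = true;
            }
        }
        return anyPositive;
    }
}
```

Clamping invalid operands rather than throwing is a design choice made for this sketch; it keeps a malformed content stream from aborting the page.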
Completes the openpdf-core-driven Java2D renderer by handling the
operators most commonly missing on real-world PDFs:
- Do: Form XObjects render recursively, applying the form's own
/Matrix and /BBox under the current CTM with state save/restore.
Image XObjects decode via:
* ImageIO for DCTDecode (JPEG) and JPXDecode (JPEG 2000, when
supported by the runtime),
* a manual raster builder for uncompressed / Flate-decoded
8-bit DeviceGray, DeviceRGB and DeviceCMYK streams (CMYK is
approximated to sRGB on the fly, since Java2D can't natively
draw a CMYK raster).
Image XObjects honor the current fill alpha (ca from ExtGState)
and the CTM, drawing into the standard (0,0)-(1,1) unit square.
- Inline images (BI/ID/EI) are now pre-stripped from the content
stream before PdfContentParser sees them; the parser had no
inline-image handling and the raw image bytes after ID would
otherwise derail tokenization for the rest of the page.
- The content-stream parse loop now treats parser-level failures
(malformed dictionaries, unterminated arrays) as "stop early"
rather than aborting the whole renderer, matching how operator-
level errors were already handled.
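The CMYK approximation mentioned in the Image XObject item can be sketched as below. This is only an assumption about the general approach: it uses the naive, profile-free formula R = 255 * (1 - C) * (1 - K) on interleaved 8-bit samples. `CmykRasters` and `toRgbImage` are hypothetical names, and the actual buildCmykImage helper may differ.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper for illustration; not part of openpdf-renderer. */
final class CmykRasters {
    private CmykRasters() {}

    /** Converts interleaved 8-bit CMYK samples (0 = no ink) to a TYPE_INT_RGB image. */
    static BufferedImage toRgbImage(byte[] cmyk, int width, int height) {
        if (cmyk.length < width * height * 4) {
            throw new IllegalArgumentException("CMYK data shorter than width * height * 4");
        }
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int i = 0, p = 0; i < width * height; i++, p += 4) {
            int c = cmyk[p] & 0xFF;
            int m = cmyk[p + 1] & 0xFF;
            int y = cmyk[p + 2] & 0xFF;
            int k = cmyk[p + 3] & 0xFF;
            // Naive conversion with no ICC profile: each channel is dimmed by black.
            int r = (255 - c) * (255 - k) / 255;
            int g = (255 - m) * (255 - k) / 255;
            int b = (255 - y) * (255 - k) / 255;
            image.setRGB(i % width, i / width, (r << 16) | (g << 8) | b);
        }
        return image;
    }
}
```

Colors converted this way will not match a color-managed viewer; the sketch only shows why no CMYK raster needs to reach Java2D.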
Tests added to OpenPdfCorePageRendererOperatorsTest:
- rendersJpegImageXObject builds a red JPEG, embeds it via
PdfContentByte.addImage, and checks the page contains red pixels.
- rendersFormXObjectViaNestedContentStream stamps a PdfTemplate
with an orange fill and checks the form's content reaches the
rasterizer.
- inlineImagesDoNotBreakPageRendering writes a hand-rolled stream
with a BI/ID/EI block followed by a red rectangle and checks the
trailing rectangle still renders.
README updated; module test suite: 84 tests, 0 failures.
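The "checks the page contains red pixels" style of assertion used by these tests could look something like the helper below. It is a sketch under assumptions: `PixelProbe` and `countNear` are hypothetical names, and the per-channel tolerance (useful because JPEG compression and antialiasing shift colors slightly) is an illustrative choice, not taken from the test class.

```java
import java.awt.image.BufferedImage;

/** Hypothetical test helper for illustration; not part of openpdf-renderer. */
final class PixelProbe {
    private PixelProbe() {}

    /**
     * Counts pixels whose red, green and blue channels are each within
     * {@code tolerance} of the 0xRRGGBB value {@code targetRgb}.
     */
    static int countNear(BufferedImage image, int targetRgb, int tolerance) {
        int tr = (targetRgb >> 16) & 0xFF;
        int tg = (targetRgb >> 8) & 0xFF;
        int tb = targetRgb & 0xFF;
        int hits = 0;
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int rgb = image.getRGB(x, y);
                int dr = Math.abs(((rgb >> 16) & 0xFF) - tr);
                int dg = Math.abs(((rgb >> 8) & 0xFF) - tg);
                int db = Math.abs((rgb & 0xFF) - tb);
                if (dr <= tolerance && dg <= tolerance && db <= tolerance) {
                    hits++;
                }
            }
        }
        return hits;
    }
}
```

A test would then render a page to a BufferedImage and assert that `countNear(page, 0xFF0000, tolerance)` is greater than zero.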
Addresses the checkstyle 'single-line Javadoc comment should be multi-line' rule on the new openpdf-core renderer code. Affects ten one-line Javadocs across OpenPdfCorePageRenderer and one in OpenPdfCoreRenderer; behavior unchanged.
Up to standards ✅🟢 Issues
| Metric | Results |
|---|---|
| Complexity | 414 |
| Duplication | 12 |
Splits the two over-branchy helpers Codacy flagged into smaller focused methods, and stops reassigning a method parameter:
- applyExtGState(String) was a flat list of seven null checks driving an NPath of 2048. Split into resolveExtGStateDict, applyExtGStateAlpha and applyExtGStateLineStyle.
- imageComponents(PdfObject) was a chain of PdfName.equals checks on freshly-allocated PdfNames (NPath 3136). Now uses static Set<PdfName> lookups (DEVICE_GRAY_NAMES / DEVICE_RGB_NAMES / DEVICE_CMYK_NAMES) with named PdfName constants, split across componentsForNamedColorSpace, componentsForArrayColorSpace and iccBasedComponents.
- imageComponents no longer reassigns its csObj parameter; uses a local `direct` reference instead.
Also wraps the long XObject row in openpdf-renderer/README.md that was exceeding the 120-column limit (was 207).
No behavior change; module test suite still 84/84 green.
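The set-lookup refactor for named color spaces can be illustrated with the sketch below. To stay self-contained it uses plain strings where the real code uses PdfName constants, and the set contents (full names plus the abbreviated /G, /RGB, /CMYK forms) are an assumption; the class name is hypothetical.

```java
import java.util.Set;

/** Hypothetical stand-in for illustration; the real code keys on PdfName, not String. */
final class ColorSpaceComponents {
    // Assumed set contents: full device names plus inline-image abbreviations.
    private static final Set<String> DEVICE_GRAY_NAMES = Set.of("DeviceGray", "G");
    private static final Set<String> DEVICE_RGB_NAMES = Set.of("DeviceRGB", "RGB");
    private static final Set<String> DEVICE_CMYK_NAMES = Set.of("DeviceCMYK", "CMYK");

    private ColorSpaceComponents() {}

    /** Returns the component count for a named color space, or 0 if unknown. */
    static int componentsForNamedColorSpace(String name) {
        if (name == null) {
            return 0; // Set.of(...) sets throw NullPointerException on contains(null)
        }
        if (DEVICE_GRAY_NAMES.contains(name)) {
            return 1;
        }
        if (DEVICE_RGB_NAMES.contains(name)) {
            return 3;
        }
        if (DEVICE_CMYK_NAMES.contains(name)) {
            return 4;
        }
        return 0;
    }
}
```

Compared with a chain of equals checks on freshly allocated names, the constant sets allocate once and keep the branch count per method low.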
Biggest correctness gap on real-world PDFs has been text rendering: mapFont() picked a generic Java2D family (Serif/Sans/Mono) from the PostScript font name, so PDFs that embedded their own subsetted fonts drew with the wrong glyph shapes (and missed glyphs whenever the name heuristic chose a family that didn't cover the Unicode chars).
This commit closes that gap for the dominant case (embedded TrueType / FontFile2):
- mapFont() now first calls embeddedFontFor(...), which pulls the CMapAwareDocumentFont's FontDescriptor via openpdf-core, finds the embedded font program stream (FontFile2 / FontFile3 / FontFile in that preference order), and loads it with Font.createFont. The resulting AWT Font is cached by FontDescriptor identity so the same font program isn't re-parsed for every Tj/TJ call.
- When no font program is embedded, or parsing fails, falls back to the previous name-heuristic path (now mapFontByName(...)).
- Failures are cached (as a null Font) so we don't retry every glyph.
Test:
- rendersTextUsingEmbeddedTrueTypeFont embeds LiberationSans-Regular (shipped with openpdf-core for font-fallback) in a freshly built PDF, renders the page back and verifies dark pixels appear in the text region. The embedded program is required: no name-based AWT family would match "LiberationSans".
README's "Status" section updated and a candid "Honest limitations & roadmap" subsection added. It calls out the remaining gaps in priority order (Type 1 / CFF fonts, Type 3 fonts, ICC color management, patterns and shadings, inline images, soft masks, indexed/Separation/DeviceN, encryption) so future contributors know which gap to grab next.
Module test suite: 85 tests, 0 failures.
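The identity-keyed cache with cached failures described above might be sketched like this. It is an illustration under assumptions: `EmbeddedFontCache` and `fontFor` are hypothetical names, the descriptor type is left generic, and the font program bytes are supplied by a caller-provided function rather than read through openpdf-core. Only `Font.createFont(int, InputStream)` is a real AWT call.

```java
import java.awt.Font;
import java.awt.FontFormatException;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical cache for illustration; not part of openpdf-renderer. */
final class EmbeddedFontCache<K> {
    // Keys are compared by identity, so two equal-looking descriptors stay distinct.
    private final Map<K, Font> cache = new IdentityHashMap<>();
    private final Function<K, byte[]> programBytes;

    EmbeddedFontCache(Function<K, byte[]> programBytes) {
        this.programBytes = programBytes;
    }

    /** Returns the embedded font, or null when none exists or parsing failed. */
    Font fontFor(K descriptor) {
        if (cache.containsKey(descriptor)) {
            return cache.get(descriptor); // may be a cached null (known failure)
        }
        Font font = null;
        try {
            byte[] bytes = programBytes.apply(descriptor);
            if (bytes != null) {
                font = Font.createFont(Font.TRUETYPE_FONT, new ByteArrayInputStream(bytes));
            }
        } catch (FontFormatException | IOException | RuntimeException e) {
            font = null; // caller falls back to a name-based font
        }
        cache.put(descriptor, font); // failures are stored too, so they are not retried
        return font;
    }
}
```

`Font.createFont` returns a 1-point font, so a caller would typically follow up with `deriveFont(size)` before drawing.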
Codacy markdownlint flagged the bullet list under "XObject coverage:" for missing a leading blank line (lists should be surrounded by blank lines). Single-line fix.
Implements the inline-image roadmap item: instead of pre-stripping inline image blocks, the preprocessor now promotes each one into a synthetic Image XObject and substitutes a `/__inline_image__N Do` invocation into the content stream. The rest of the renderer treats it exactly like a regular Image XObject and reuses the existing buildGrayImage / buildRgbImage / buildCmykImage / ImageIO decode paths.
Two framing strategies are used so the parser doesn't get confused by binary data:
- For DCT / DCTDecode / JPXDecode filters, find the JPEG end-of-image marker (FFD9) instead of scanning for "EI" bounded by whitespace, since JPEG payloads routinely contain byte sequences that look like EI by accident.
- For other filters (including no filter and FlateDecode), keep the whitespace-bounded EI heuristic but stop trimming "trailing whitespace" greedily -- image bytes can legitimately be 0x00 or 0x0A and the spec guarantees exactly one whitespace byte before EI.
Abbreviated dict keys (/W, /H, /BPC, /CS, /F) and full names (/Width, /Height, ...) are both accepted; abbreviated colorspace values (/G, /RGB, /CMYK) and full names map to component counts.
Tests:
- inlineImageRendersAtCtmLocation builds a 2x2 DeviceGray inline image with a [black, white; white, black] checker, scales it 120x via a cm, and asserts the rendered page contains dark pixels in the right region.
- jpegInlineImageDecodes uses PdfContentByte.addImage(image, ..., true) to embed a green JPEG as an inline image, then asserts the rendered page contains green pixels.
README's status section now says inline images render, and the limitations list no longer mentions them.
Also addresses Codacy's "unnecessary fully qualified name" warning on java.util.List / java.util.Set usage. The class now imports List, Set, Arrays, ByteArrayOutputStream, StandardCharsets and Rectangle2D directly instead of inlining the FQNs; 7 call sites simplified.
Module test suite: 86 tests, 0 failures.
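The JPEG framing strategy (locate the FFD9 end-of-image marker rather than a whitespace-bounded EI) can be sketched as a simple byte scan. `JpegFraming` and `endOfJpeg` are hypothetical names, and the sketch assumes the first FFD9 after the start offset ends the image, which would not hold for a JPEG that embeds a thumbnail with its own end marker.

```java
/** Hypothetical helper for illustration; not part of openpdf-renderer. */
final class JpegFraming {
    private JpegFraming() {}

    /**
     * Returns the index just past the first FF D9 marker at or after
     * {@code start}, or -1 when no end-of-image marker is found.
     */
    static int endOfJpeg(byte[] data, int start) {
        for (int i = Math.max(start, 0); i + 1 < data.length; i++) {
            if ((data[i] & 0xFF) == 0xFF && (data[i + 1] & 0xFF) == 0xD9) {
                return i + 2;
            }
        }
        return -1;
    }
}
```

The returned offset marks where the image payload ends, so the bytes that follow (whitespace, then EI) can be skipped without interpreting any "EI"-like sequence inside the JPEG data.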
Highest-ROI item left on the renderer roadmap: every PNG-to-PDF conversion produces images with `[/Indexed /DeviceRGB hival lookup]` color spaces, and the renderer was silently skipping them (decodeRawRaster falls through to null for non-Device colorspaces). This commit adds the decode path.
- decodeImage now recognizes `[/Indexed base hival lookup]` (with CS_INDEXED constant) and routes to a new decodeIndexedImage.
- decodeIndexedImage reads 8-bit indices from the (already Flate-decoded) stream, expands each pixel through the lookup table into the base color space's component bytes, then reuses the existing buildGrayImage / buildRgbImage / buildCmykImage helpers. The base color space's component count is determined via the existing imageComponents().
- readIndexedLookup handles both forms the spec allows: a PdfString containing the palette bytes, or a PRStream whose decoded content is the palette.
- Sub-byte bit depths (1/2/4-bit indices) are explicitly rejected for now -- 8-bit is the dominant case for PNG-derived images.
Test:
- rendersIndexedColorImageXObject builds a 32x32 BufferedImage with an IndexColorModel (top half = magenta, bottom = cyan), embeds it via Image.getInstance(BufferedImage), and asserts both palette colors appear in the rendered page. openpdf-core's Image.getInstance preserves the IndexColorModel as `[/Indexed /DeviceRGB ...]`, so this exercises the new decode path end-to-end.
README updated: Indexed moved from "limitations" to the supported Image XObject formats; only sub-byte-packed indexed images remain called out as unsupported.
Module test suite: 87 tests, 0 failures.
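The palette expansion step (8-bit indices mapped through the lookup table into base-color-space bytes) can be sketched as below. `IndexedPalette` and `expand` are hypothetical names; clamping out-of-range indices to hival and leaving zeros when the lookup table is too short are choices made for this sketch, not confirmed behavior of decodeIndexedImage.

```java
/** Hypothetical helper for illustration; not part of openpdf-renderer. */
final class IndexedPalette {
    private IndexedPalette() {}

    /**
     * Expands one 8-bit index per pixel into {@code components} bytes per
     * pixel, copied from the palette {@code lookup}.
     */
    static byte[] expand(byte[] indices, byte[] lookup, int components, int hival) {
        byte[] out = new byte[indices.length * components];
        for (int i = 0; i < indices.length; i++) {
            // Indices above hival are clamped to the last palette entry.
            int index = Math.min(indices[i] & 0xFF, hival);
            int src = index * components;
            if (src + components > lookup.length) {
                continue; // truncated palette: leave this pixel as zeros
            }
            System.arraycopy(lookup, src, out, i * components, components);
        }
        return out;
    }
}
```

The expanded array has the same layout as a raw image in the base color space, which is why the existing gray / RGB / CMYK builders can be reused unchanged.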
Codacy flagged decodeIndexedImage with NPath 385 (threshold 200). Splits the method along its natural seams without changing behavior:
- decodeIndexedImage now just wraps the try/catch around decodeIndexedImageOrThrow.
- decodeIndexedImageOrThrow handles validation + orchestration.
- readBitsPerComponent extracts the /BitsPerComponent read.
- expandIndexedPalette is the per-pixel arraycopy loop.
- buildImageForBaseComponents is the switch on component count.
No behavior change; module test suite still 87/87 green.
PDFs that draw tables (PdfPTable, hand-rolled re/m/l/S grids, ...) lean hard on three pieces of stroke handling that this renderer was getting wrong or skipping:
- Zero-width hairline strokes (PDF §8.4.3.2). `0 w` means "the thinnest line the device can render", i.e. one device pixel. The previous `Math.max(lineWidth, 0.001f)` collapsed those hairlines to invisibility once the page CTM scaled them. Now strokePath() computes an effective width of `1 / max(|sx|, |sy|)` from the current transform so a `0 w` stroke renders as a one-device-pixel line at any DPI.
- ExtGState line styling beyond LW/ML/LC/LJ. The dash array `/D` and the stroke-adjust flag `/SA` are now read out of gs dictionaries; `/D` feeds the existing dash-pattern path, `/SA` is tracked through q/Q.
- Crisp axis-aligned borders. KEY_STROKE_CONTROL is now set to VALUE_STROKE_NORMALIZE so 0.5pt borders snap to integer device pixels instead of smearing into two rows of antialiased grey.
Adds two regression tests: a full PdfPTable render (background fills, red 2pt header border, body-row text) and a `0 w` hairline render that asserts the stroke is actually visible after CTM scaling.
https://claude.ai/code/session_01Bobvbg8Ccp2g9S5DRFsnNb
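The hairline rule above, an effective width of `1 / max(|sx|, |sy|)`, can be sketched as follows. `HairlineWidth` and `effectiveWidth` are hypothetical names, and reading sx and sy from `getScaleX()` / `getScaleY()` is an assumption that only holds well for unrotated transforms; the renderer's strokePath() may derive the scale differently.

```java
import java.awt.geom.AffineTransform;

/** Hypothetical helper for illustration; not part of openpdf-renderer. */
final class HairlineWidth {
    private HairlineWidth() {}

    /**
     * Returns the stroke width to use in user space: the requested width when
     * positive, otherwise roughly one device pixel under the given transform.
     */
    static float effectiveWidth(float lineWidth, AffineTransform ctm) {
        if (lineWidth > 0f) {
            return lineWidth;
        }
        double scale = Math.max(Math.abs(ctm.getScaleX()), Math.abs(ctm.getScaleY()));
        // A degenerate transform has no usable scale; fall back to one unit.
        return scale > 0.0 ? (float) (1.0 / scale) : 1f;
    }
}
```

The stroke-control hint mentioned in the same list is set with `g.setRenderingHint(RenderingHints.KEY_STROKE_CONTROL, RenderingHints.VALUE_STROKE_NORMALIZE)`, both of which are standard `java.awt.RenderingHints` constants.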
The existing PdfPTable test only exercised single-word cell values
("Col A", "r0c0"). This adds a regression test that pushes harder on
the text-in-table path: multi-line wrapped descriptions, a Phrase
composed of multiple Chunks with different fonts and colors (regular,
bold, italic, RED), varied horizontal alignments, a colored colspan
cell with vertical centering, and a larger header font.
The four assertions cover the parts that are easy to silently break:
- white-on-blue header glyphs (header row text under cell background),
- a red Chunk inside an otherwise-black Phrase (per-Chunk fill color),
- a blue colspan-cell Phrase (text under multi-column layout),
- a multi-line wrapped cell producing several distinct glyph rows.
https://claude.ai/code/session_01Bobvbg8Ccp2g9S5DRFsnNb



Summary
Continues the work to use openpdf-core (PdfReader + PdfContentParser) as the rendering engine in openpdf-renderer. The first commit on this branch extended the basic operator subset; the second commit closes the gap on real-world PDF features: XObjects (forms + images), inline-image safety, and a more robust parse loop.
Commit 1: expand operator coverage
- CMYK colors (k, K) and color-space-aware fills/strokes (cs, CS, sc, SC, scn, SCN) for DeviceGray / DeviceRGB / DeviceCMYK.
- Clipping (W, W*) with proper save/restore through q/Q.
- Line styling (J, j, M, d, i) plumbed into the BasicStroke.
- Extended graphics state (gs) honoring CA/ca alpha and LW/ML/LC/LJ.
- Text rise (Ts).
- Marked content / compatibility operators (BMC, BDC, EMC, MP, DP, BX, EX) parsed as no-ops so content inside them still renders.
New convenience entry points on OpenPdfCoreRenderer:
- renderPage(int, Graphics2D, int, int): draws directly onto a caller-supplied Graphics2D (Swing, printer, SVG-backed graphics) without allocating a BufferedImage; saves and restores the caller's transform and clip.
- renderAllPages(float): convenience that returns one BufferedImage per page in document order.
Commit 2: XObject support and inline-image safety
- Form XObjects (Do with /Subtype /Form) render recursively, applying the form's own /Matrix and /BBox under the current CTM with full state save/restore so form content can't leak out.
- Image XObjects (Do with /Subtype /Image) decode via:
  * ImageIO for JPEG (DCTDecode) and JPEG 2000 (JPXDecode, where the runtime supports it).
  * a manual raster builder for uncompressed / Flate-decoded 8-bit DeviceGray, DeviceRGB and DeviceCMYK streams (CMYK is approximated to sRGB on the fly).
- Image XObjects honor the current fill alpha (ca from gs) and the CTM, drawing into the standard (0,0)-(1,1) unit square.
- Inline images (BI/ID/EI) are pre-stripped from the content stream before PdfContentParser sees them. PdfContentParser has no inline-image handling and the raw image bytes after ID would otherwise derail tokenization for the rest of the page.
Tests
- OpenPdfCorePageRendererOperatorsTest (8 tests) builds synthetic PDFs with PdfContentByte / PdfTemplate and renders them back, verifying:
  * CMYK fills, dashed strokes, W-clipped fills, marked-content sequences, text rise.
  * JPEG Image XObject: build a red JPEG, embed it via PdfContentByte.addImage, check the page contains red pixels.
  * Form XObject: stamp a PdfTemplate with an orange fill, check the form content reaches the rasterizer.
  * Inline images: write a hand-rolled stream with a BI/ID/EI block followed by a red rectangle, check the trailing rectangle still renders after the inline image is stripped.
- OpenPdfCoreRendererTest (16 tests) covers the new renderPage(int, Graphics2D, int, int) and renderAllPages(float) overloads, including argument validation and Graphics2D state restoration.
- openpdf-renderer module test suite: 84 tests, 0 failures, 0 errors.
README's operator table updated to reflect the broader coverage; new code examples for Graphics2D and batch rendering.
My name: Andreas Røsdal