feat(api): add optional MarkItDown OCR support #2145

Draft
Sanderhoff-alt wants to merge 1 commit into vectorize-io:main from Sanderhoff-alt:feat/markitdown-ocr

@Sanderhoff-alt

Contributor

Summary

Adds optional OCR support for the default MarkItDown file parser by wiring MarkItDown's OpenAI-compatible llm_client integration into Hindsight configuration.

This addresses image and scanned-document uploads: MarkItDown advertises image extensions but, without a vision-capable OCR model, cannot extract useful text from screenshots or scanned pages.

Closes #927.

Motivation

Before this change, .jpg, .jpeg, and .png files were accepted by the default markitdown parser path, but deployments without OCR support could end up with low-level parser or no-content errors. Seen from the API or the control plane, those errors did not say whether the file type was unsupported, the parser was misconfigured, or OCR simply was not enabled.

The goal is to keep the default local MarkItDown behavior unchanged, while giving operators a clear opt-in path for OCR using a vision-capable OpenAI-compatible endpoint.

What Changed

  • Adds server-level MarkItDown OCR configuration:
    • HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_ENABLED
    • HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_API_KEY
    • HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_BASE_URL
    • HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_MODEL
    • HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_PROMPT
  • Keeps OCR disabled by default.
  • Falls back to the main LLM API key, base URL, and model when the parser-specific OCR values are unset.
  • Creates an OpenAI-compatible client and passes it to MarkItDown as llm_client, with llm_model and llm_prompt.
  • Adds a built-in OCR prompt focused on faithful transcription, original language preservation, reading order, tables, fields, and uncertain text handling.
  • Makes image uploads fail fast with an actionable configuration error when MarkItDown OCR is disabled.
  • Updates API docs, developer configuration docs, generated skill docs, and .env.example files.
  • Updates control-plane parser description copy across locales so the UI says image OCR depends on server configuration instead of implying the standard parser cannot handle images at all.
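
The client wiring described in the list above could look roughly like the following. This is a minimal sketch, not the code in this PR: `OcrSettings`, `build_markitdown_kwargs`, and the `client_factory` parameter are hypothetical names, and the prompt string is a placeholder rather than the built-in prompt. In a real deployment the factory would presumably be an OpenAI-compatible client class, and the returned keyword arguments would be passed to the MarkItDown constructor.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Placeholder only; the PR's built-in OCR prompt is not reproduced here.
PLACEHOLDER_OCR_PROMPT = "Transcribe all visible text faithfully, in reading order."


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical container for already-resolved OCR settings."""

    enabled: bool = False
    api_key: Optional[str] = None
    base_url: Optional[str] = None
    model: Optional[str] = None
    prompt: Optional[str] = None


def build_markitdown_kwargs(
    settings: OcrSettings,
    client_factory: Callable[..., Any],
) -> dict:
    """Return keyword arguments for MarkItDown; empty when OCR is disabled."""
    if not settings.enabled:
        # Default path: MarkItDown is constructed exactly as before.
        return {}
    if not settings.model:
        # Fail fast instead of sending requests without a model name.
        raise ValueError("MarkItDown OCR is enabled but no model is configured.")
    client = client_factory(api_key=settings.api_key, base_url=settings.base_url)
    return {
        "llm_client": client,
        "llm_model": settings.model,
        "llm_prompt": settings.prompt or PLACEHOLDER_OCR_PROMPT,
    }
```

With OCR disabled the function returns an empty dict and never calls the factory, which matches the goal of leaving the default initialization untouched.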

Configuration Behavior

OCR is opt-in:

export HINDSIGHT_API_FILE_PARSER=markitdown
export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_ENABLED=true

When enabled, the MarkItDown OCR-specific settings take precedence:

export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_API_KEY=your-vision-api-key
export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_BASE_URL=https://vision.example/v1
export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_MODEL=vision-model

If those are not set, the parser falls back to the existing main LLM settings:

export HINDSIGHT_API_LLM_API_KEY=your-api-key
export HINDSIGHT_API_LLM_BASE_URL=https://api.openai.com/v1
export HINDSIGHT_API_LLM_MODEL=gpt-4o-mini

The selected endpoint must support OpenAI Chat Completions with image input, because MarkItDown's OCR integration works through an LLM client and model.
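
The precedence rules above can be illustrated with a small resolver. This is a sketch under stated assumptions, not the PR's `config.py`: `ResolvedOcr`, `resolve_ocr`, and `_first_set` are hypothetical names, and the set of accepted truthy spellings is an assumption. Only the environment variable names come from this PR.

```python
import os
from typing import Mapping, NamedTuple, Optional

OCR_PREFIX = "HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_"
MAIN_PREFIX = "HINDSIGHT_API_LLM_"
TRUTHY = {"1", "true", "yes"}  # assumed spellings; the PR only shows "true"


class ResolvedOcr(NamedTuple):
    enabled: bool
    api_key: Optional[str]
    base_url: Optional[str]
    model: Optional[str]


def _first_set(env: Mapping[str, str], *names: str) -> Optional[str]:
    """Return the first non-empty value among the given variable names."""
    for name in names:
        value = env.get(name)
        if value:
            return value
    return None


def resolve_ocr(env: Mapping[str, str] = os.environ) -> ResolvedOcr:
    """OCR-specific values win; unset ones fall back to the main LLM settings."""
    enabled = env.get(OCR_PREFIX + "ENABLED", "").strip().lower() in TRUTHY
    return ResolvedOcr(
        enabled=enabled,
        api_key=_first_set(env, OCR_PREFIX + "API_KEY", MAIN_PREFIX + "API_KEY"),
        base_url=_first_set(env, OCR_PREFIX + "BASE_URL", MAIN_PREFIX + "BASE_URL"),
        model=_first_set(env, OCR_PREFIX + "MODEL", MAIN_PREFIX + "MODEL"),
    )
```

Each field is resolved independently, so a deployment could override only the model while reusing the main API key and base URL.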

User-Facing Behavior

With OCR disabled, image uploads now fail with an explicit, actionable error similar to:

Image OCR is not enabled for the markitdown parser. Set HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_ENABLED=true and configure a vision-capable OpenAI-compatible model, or choose an OCR-capable parser.

This replaces the less helpful behavior where image conversion could fail later with a generic no-content or parser error.
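
A fail-fast check of this kind might be sketched as follows. The function and exception names (`ensure_image_ocr_available`, `OcrNotEnabledError`) are hypothetical, and the extension set is limited to the three extensions named in this PR; the actual check in the parser may differ.

```python
from pathlib import Path

# Only the extensions named in this PR; the real list may be longer.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

OCR_DISABLED_MESSAGE = (
    "Image OCR is not enabled for the markitdown parser. "
    "Set HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_ENABLED=true and configure "
    "a vision-capable OpenAI-compatible model, or choose an OCR-capable parser."
)


class OcrNotEnabledError(RuntimeError):
    """Hypothetical error type for image uploads while OCR is disabled."""


def ensure_image_ocr_available(filename: str, ocr_enabled: bool) -> None:
    """Raise before any conversion work if an image arrives without OCR."""
    is_image = Path(filename).suffix.lower() in IMAGE_EXTENSIONS
    if is_image and not ocr_enabled:
        raise OcrNotEnabledError(OCR_DISABLED_MESSAGE)
```

Running the check before MarkItDown is invoked is what turns a late, generic conversion failure into an early configuration error.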

For parser fallback chains such as iris,markitdown, this remains compatible with the existing fallback behavior: MarkItDown failures can still allow the next configured parser to run.
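
To show why a fail-fast error stays compatible with a fallback chain, here is a generic sketch of trying parsers in order. How Hindsight actually implements its chain is not shown in this PR, so `parse_with_fallback` and the `(name, callable)` pair shape are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Parser = Tuple[str, Callable[[str], str]]  # (parser name, parse function)


def parse_with_fallback(filename: str, parsers: Sequence[Parser]) -> Tuple[str, str]:
    """Try each parser in order; return (parser name, text) from the first success."""
    errors: List[str] = []
    for name, parse in parsers:
        try:
            return name, parse(filename)
        except Exception as exc:  # any parser failure hands over to the next one
            errors.append(f"{name}: {exc}")
    # Every parser failed; surface all collected messages together.
    raise RuntimeError("All parsers failed: " + "; ".join(errors))
```

Because the OCR-disabled error is raised like any other parser failure, a later parser in the chain still gets its turn, and the actionable message is preserved if every parser fails.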

Compatibility

  • No behavior change for text-based documents, PDFs, Office files, audio, or HTML.
  • No external service is required by default.
  • Existing deployments keep OCR disabled unless they opt in.
  • Existing MarkItDown initialization remains unchanged unless OCR is enabled.

Tests

Added coverage for:

  • MarkItDown does not receive an LLM/OCR client by default.
  • OCR-enabled MarkItDown receives an OpenAI-compatible client, model, default OCR prompt, base URL, API key, and default headers.
  • OCR-enabled MarkItDown fails fast if no model is configured.
  • Image uploads through MarkItDown fail with an actionable OCR-disabled error before calling MarkItDown.
  • MarkItDown OCR config defaults to disabled.
  • OCR config falls back to the main LLM config when parser-specific values are unset.
  • Parser-specific OCR config overrides the main LLM config.

Validation run locally:

uv run pytest \
  tests/test_file_retain.py::test_markitdown_converter \
  tests/test_file_retain.py::test_markitdown_converter_does_not_enable_ocr_by_default \
  tests/test_file_retain.py::test_markitdown_image_without_ocr_has_actionable_error \
  tests/test_file_retain.py::test_markitdown_converter_can_enable_ocr \
  tests/test_file_retain.py::test_markitdown_converter_requires_model_when_ocr_enabled \
  tests/test_config_validation.py::test_markitdown_ocr_defaults_disabled \
  tests/test_config_validation.py::test_markitdown_ocr_falls_back_to_main_llm_config \
  tests/test_config_validation.py::test_markitdown_ocr_specific_config_overrides_main_llm_config

Result: 8 passed.

Also ran:

uv run ruff check hindsight_api/engine/parsers/markitdown.py \
  hindsight_api/engine/memory_engine.py \
  hindsight_api/api/http.py \
  hindsight_api/config.py \
  tests/test_file_retain.py \
  tests/test_config_validation.py

npm run i18n:check

git diff --check

The pre-commit hook also ran generate-docs-skill.sh and lint.sh successfully.

MarkItDown advertises image extensions, but without an OCR model the
image path cannot read screenshots or scanned pages and can fail with
low-level parsing/no-content errors.

Add server-level MARKITDOWN_OCR_* config that is off by default, falls
back to the main LLM API key/base URL/model when unset, and wires those
settings into MarkItDown's llm_client support.

Image uploads now fail fast with an actionable OCR configuration error
when OCR is disabled. Docs and front-end copy also explain that image
OCR depends on server configuration.

Closes vectorize-io#927

Development

Successfully merging this pull request may close these issues.

feat: Support LLM-powered OCR in MarkitdownParser for enhanced document extraction