feat(api): add optional MarkItDown OCR support by Sanderhoff-alt · Pull Request #2145 · vectorize-io/hindsight

Sanderhoff-alt · 2026-06-11T15:31:33Z

Summary

Adds optional OCR support for the default MarkItDown file parser by wiring MarkItDown's OpenAI-compatible llm_client integration into Hindsight configuration.

This addresses image and scanned-document uploads where MarkItDown advertises image extensions but, without a vision-capable OCR model, cannot extract useful text from screenshots or scanned pages.

Closes #927.

Motivation

Before this change, .jpg, .jpeg, and .png files were accepted by the default markitdown parser path, but deployments without OCR support could end up with low-level parser or no-content errors. From the perspective of an API or control-plane user, those errors did not explain whether the file type was unsupported, the parser was misconfigured, or OCR simply was not enabled.

The goal is to keep the default local MarkItDown behavior unchanged, while giving operators a clear opt-in path for OCR using a vision-capable OpenAI-compatible endpoint.

What Changed

- Adds server-level MarkItDown OCR configuration:
  - HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_ENABLED
  - HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_API_KEY
  - HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_BASE_URL
  - HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_MODEL
  - HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_PROMPT
- Keeps OCR disabled by default.
- Falls back to the main LLM API key, base URL, and model when the parser-specific OCR values are unset.
- Creates an OpenAI-compatible client and passes it to MarkItDown as llm_client, with llm_model and llm_prompt.
- Adds a built-in OCR prompt focused on faithful transcription, original language preservation, reading order, tables, fields, and uncertain text handling.
- Makes image uploads fail fast with an actionable configuration error when MarkItDown OCR is disabled.
- Updates API docs, developer configuration docs, generated skill docs, and .env.example files.
- Updates control-plane parser description copy across locales so the UI says image OCR depends on server configuration instead of implying the standard parser cannot handle images at all.
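
The client wiring described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `OcrSettings`, `build_markitdown_kwargs`, and the placeholder prompt text are hypothetical names, and the client factory is injected so the sketch runs without the `openai` package installed. The keyword names `llm_client`, `llm_model`, and `llm_prompt` are the ones this description says are passed to MarkItDown.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Placeholder text only; the PR's built-in OCR prompt is not reproduced here.
DEFAULT_OCR_PROMPT = "Transcribe all visible text faithfully, in reading order."


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical container for the resolved MarkItDown OCR settings."""

    enabled: bool = False
    api_key: Optional[str] = None
    base_url: Optional[str] = None
    model: Optional[str] = None
    prompt: Optional[str] = None


def build_markitdown_kwargs(
    settings: OcrSettings,
    client_factory: Callable[..., Any],
) -> dict[str, Any]:
    """Return keyword arguments for the MarkItDown constructor."""
    if not settings.enabled:
        # OCR off: MarkItDown is initialised as before, with no LLM client.
        return {}
    if not settings.model:
        # Fail fast instead of discovering the gap on the first image upload.
        raise ValueError("MarkItDown OCR is enabled but no model is configured")
    client = client_factory(api_key=settings.api_key, base_url=settings.base_url)
    return {
        "llm_client": client,
        "llm_model": settings.model,
        # Fall back to the built-in prompt when no custom prompt is set.
        "llm_prompt": settings.prompt or DEFAULT_OCR_PROMPT,
    }
```

In a real deployment the factory would presumably be `openai.OpenAI`, with the returned dictionary unpacked into the `MarkItDown(...)` constructor. Returning an empty dictionary when OCR is disabled keeps the default initialisation path untouched.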

Configuration Behavior

OCR is opt-in:

export HINDSIGHT_API_FILE_PARSER=markitdown
export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_ENABLED=true

When enabled, the MarkItDown OCR-specific settings take precedence:

export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_API_KEY=your-vision-api-key
export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_BASE_URL=https://vision.example/v1
export HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_MODEL=vision-model

If those are not set, the parser falls back to the existing main LLM settings:

export HINDSIGHT_API_LLM_API_KEY=your-api-key
export HINDSIGHT_API_LLM_BASE_URL=https://api.openai.com/v1
export HINDSIGHT_API_LLM_MODEL=gpt-4o-mini

The selected endpoint must support OpenAI Chat Completions with image input, because MarkItDown's OCR integration is model/client based.
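
The precedence rules above could be expressed along these lines. This is a sketch under stated assumptions rather than the actual `hindsight_api/config.py` logic: `resolve_ocr_config` and `_get` are hypothetical helpers, and the set of strings treated as "true" is a guess. Only the environment variable names come from this PR.

```python
import os
from typing import Mapping, Optional

OCR_PREFIX = "HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_"
LLM_PREFIX = "HINDSIGHT_API_LLM_"


def _get(env: Mapping[str, str], name: str) -> Optional[str]:
    """Return a stripped value, treating empty strings as unset."""
    value = env.get(name, "").strip()
    return value or None


def resolve_ocr_config(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve OCR settings: parser-specific values first, main LLM values second."""
    # Assumption: these spellings count as enabled; anything else means disabled.
    flag = env.get(OCR_PREFIX + "ENABLED", "").strip().lower()
    resolved: dict = {"enabled": flag in {"1", "true", "yes"}}
    for field in ("API_KEY", "BASE_URL", "MODEL"):
        # The OCR-specific variable wins; otherwise use the main LLM variable.
        resolved[field.lower()] = _get(env, OCR_PREFIX + field) or _get(env, LLM_PREFIX + field)
    return resolved
```

Passing a plain dictionary instead of `os.environ` makes the fallback easy to exercise in tests: setting only the `HINDSIGHT_API_LLM_*` values yields those values, and adding `HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_MODEL` overrides just the model.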

User-Facing Behavior

With OCR disabled, image uploads now fail with an explicit, actionable error similar to:

Image OCR is not enabled for the markitdown parser. Set HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_ENABLED=true and configure a vision-capable OpenAI-compatible model, or choose an OCR-capable parser.

This replaces the less helpful behavior where image conversion could fail later with a generic no-content or parser error.

For parser fallback chains such as iris,markitdown, this remains compatible with the existing fallback behavior: MarkItDown failures can still allow the next configured parser to run.
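
The fail-fast check and its interaction with a fallback chain might look roughly like this. It is an illustrative sketch only: `ParserError`, `markitdown_parse`, and `parse_with_fallback` are hypothetical names, and the extension set is limited to the three image types named in the Motivation section. The error text is the message quoted above.

```python
from pathlib import Path
from typing import Callable, Optional, Sequence

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

OCR_DISABLED_MESSAGE = (
    "Image OCR is not enabled for the markitdown parser. "
    "Set HINDSIGHT_API_FILE_PARSER_MARKITDOWN_OCR_ENABLED=true and configure "
    "a vision-capable OpenAI-compatible model, or choose an OCR-capable parser."
)


class ParserError(Exception):
    """Hypothetical error type raised when a parser cannot handle a file."""


def markitdown_parse(
    filename: str,
    ocr_enabled: bool,
    convert: Callable[[str], str],
) -> str:
    """Reject images up front when OCR is off; otherwise delegate to convert."""
    if Path(filename).suffix.lower() in IMAGE_EXTENSIONS and not ocr_enabled:
        # Raised before the converter is called, so the error is explicit.
        raise ParserError(OCR_DISABLED_MESSAGE)
    return convert(filename)


def parse_with_fallback(
    filename: str,
    parsers: Sequence[Callable[[str], str]],
) -> str:
    """Try each parser in order and return the first successful result."""
    last_error: Optional[ParserError] = None
    for parser in parsers:
        try:
            return parser(filename)
        except ParserError as exc:
            # Remember the failure and let the next configured parser run.
            last_error = exc
    raise last_error or ParserError("no parsers configured")
```

Because the OCR-disabled failure is an ordinary parser error in this sketch, a chain still moves on to the next parser, and the actionable message only reaches the user when every parser in the chain fails.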

Compatibility

- No behavior change for text-based documents, PDFs, Office files, audio, or HTML.
- No external service is required by default.
- Existing deployments keep OCR disabled unless they opt in.
- Existing MarkItDown initialization remains unchanged unless OCR is enabled.

Tests

Added coverage for:

- MarkItDown does not receive an LLM/OCR client by default.
- OCR-enabled MarkItDown receives an OpenAI-compatible client, model, default OCR prompt, base URL, API key, and default headers.
- OCR-enabled MarkItDown fails fast if no model is configured.
- Image uploads through MarkItDown fail with an actionable OCR-disabled error before calling MarkItDown.
- MarkItDown OCR config defaults to disabled.
- OCR config falls back to the main LLM config when parser-specific values are unset.
- Parser-specific OCR config overrides the main LLM config.

Validation run locally:

uv run pytest \
  tests/test_file_retain.py::test_markitdown_converter \
  tests/test_file_retain.py::test_markitdown_converter_does_not_enable_ocr_by_default \
  tests/test_file_retain.py::test_markitdown_image_without_ocr_has_actionable_error \
  tests/test_file_retain.py::test_markitdown_converter_can_enable_ocr \
  tests/test_file_retain.py::test_markitdown_converter_requires_model_when_ocr_enabled \
  tests/test_config_validation.py::test_markitdown_ocr_defaults_disabled \
  tests/test_config_validation.py::test_markitdown_ocr_falls_back_to_main_llm_config \
  tests/test_config_validation.py::test_markitdown_ocr_specific_config_overrides_main_llm_config

Result: 8 passed.

Also ran:

uv run ruff check hindsight_api/engine/parsers/markitdown.py \
  hindsight_api/engine/memory_engine.py \
  hindsight_api/api/http.py \
  hindsight_api/config.py \
  tests/test_file_retain.py \
  tests/test_config_validation.py

npm run i18n:check

git diff --check

The pre-commit hook also ran generate-docs-skill.sh and lint.sh successfully.

MarkItDown advertises image extensions, but without an OCR model the image path cannot read screenshots or scanned pages and can fail with low-level parsing/no-content errors. Add server-level MARKITDOWN_OCR_* config that is off by default, falls back to the main LLM API key/base URL/model when unset, and wires those settings into MarkItDown's llm_client support. Image uploads now fail fast with an actionable OCR configuration error when OCR is disabled. Docs and front-end copy also explain that image OCR depends on server configuration. Closes vectorize-io#927

Sanderhoff-alt wants to merge 1 commit into vectorize-io:main from Sanderhoff-alt:feat/markitdown-ocr
