Multi-modal crawler: given a URL, fetch page markdown (Firecrawl), find images, describe them with a vision LLM (OpenRouter via Instructor), and return markdown with structured image descriptions inlined.
- Scrape — Firecrawl returns page markdown and metadata.
- Extract — Regex hints plus a text LLM list image URLs (deduped, capped).
- Describe — Vision LLM describes each image (bounded concurrency; per-image failures become fallback text).
- Replace — Image references become `<!-- image-desc:{json} -->` plus `*[Image: …]*`.
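The Extract and Describe steps above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the package's implementation: the regex, `candidate_image_urls`, `fake_describe`, and the fallback wording are hypothetical, and the text-LLM half of Extract is left out. Only the defaults (30 images, 5 parallel calls) are taken from this README.

```python
import asyncio
import re

# Hypothetical regex hint for markdown image syntax: ![alt](url)
IMAGE_RE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)")


def candidate_image_urls(markdown: str, max_images: int = 30) -> list[str]:
    """Collect image URLs in document order, dedupe them, then apply the cap."""
    urls = IMAGE_RE.findall(markdown)
    unique = list(dict.fromkeys(urls))  # dict keys keep first-seen order
    return unique[:max_images]


async def fake_describe(url: str) -> str:
    """Stand-in for the vision LLM call (assumption: the real call is async)."""
    await asyncio.sleep(0)
    if url.endswith(".svg"):
        raise ValueError("unsupported image type")
    return f"description of {url}"


async def describe_all(urls: list[str], concurrency: int = 5) -> dict[str, str]:
    """Describe every URL with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def describe_one(url: str) -> tuple[str, str]:
        async with semaphore:
            try:
                return url, await fake_describe(url)
            except Exception as exc:
                # A single bad image becomes fallback text instead of failing the crawl.
                return url, f"(description unavailable: {exc})"

    pairs = await asyncio.gather(*(describe_one(u) for u in urls))
    return dict(pairs)


page = "![a](https://x.test/a.png) ![dup](https://x.test/a.png) ![b](https://x.test/b.svg)"
urls = candidate_image_urls(page)
descriptions = asyncio.run(describe_all(urls))
```

The semaphore bounds the number of simultaneous vision calls, while catching the exception inside `describe_one` keeps one failed image from cancelling the others.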
Optional Langfuse tracing: one span per pipeline step, one generation per LLM call.
- Python 3.11+
- Firecrawl API key
- OpenRouter API key
- Optional: Langfuse keys for observability
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
cp .env.example .env
# Edit .env: FIRECRAWL_API_KEY, OPENROUTER_API_KEY

Settings are read from the process environment and from a .env file in the project root (via Pydantic Settings). Langfuse keys in .env are copied into os.environ on each get_settings() call so the Langfuse SDK can see them.
| Variable | Description |
|---|---|
| `FIRECRAWL_API_KEY` | Required |
| `OPENROUTER_API_KEY` | Required |
| `CLAWCRAWL_TEXT_MODEL` | Image URL extraction (default: `openrouter/google/gemini-2.0-flash-lite-001`). OpenRouter IDs like `google/gemma-…` are auto-prefixed with `openrouter/`. |
| `CLAWCRAWL_VISION_MODEL` | Image description (default: `openrouter/google/gemma-3-27b-it`). Same `openrouter/` prefix rule. |
| `CLAWCRAWL_MAX_IMAGES` | Max images per crawl (default: 30) |
| `CLAWCRAWL_IMAGE_MAX_BYTES` | Max download size per image in bytes (default: 5242880, i.e. 5 MiB) |
| `CLAWCRAWL_REQUEST_TIMEOUT_S` | HTTP timeout for image fetch in seconds (default: 120) |
| `CLAWCRAWL_DESCRIBE_CONCURRENCY` | Parallel vision calls (default: 5) |
| `LANGFUSE_SECRET_KEY` | Optional; Langfuse project secret |
| `LANGFUSE_PUBLIC_KEY` | Optional; Langfuse public key |
| `LANGFUSE_BASE_URL` | Langfuse host (default: https://cloud.langfuse.com) |
| `CLAWCRAWL_LANGFUSE_ENABLED` | Enable tracing (default: true; set false for tests) |
| `CLAWCRAWL_CORS_ORIGINS` | Comma-separated browser origins (default: http://localhost:3000) |
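Two of the rules in the table, the `openrouter/` auto-prefix and the comma-separated origin list, can be illustrated as below. This is a hedged sketch: `normalize_model_id` and `parse_origins` are hypothetical helper names, and the package's actual normalization may handle more cases than shown here.

```python
PREFIX = "openrouter/"


def normalize_model_id(model: str) -> str:
    """Add the openrouter/ prefix unless the ID already carries it."""
    model = model.strip()
    return model if model.startswith(PREFIX) else PREFIX + model


def parse_origins(raw: str) -> list[str]:
    """Split a comma-separated origin list, dropping blanks and stray whitespace."""
    return [part.strip() for part in raw.split(",") if part.strip()]


vision_model = normalize_model_id("google/gemma-3-27b-it")
already_prefixed = normalize_model_id("openrouter/google/gemma-3-27b-it")
origins = parse_origins("http://localhost:3000, https://app.example.com")
```

Under these assumptions, a bare OpenRouter ID and its prefixed form resolve to the same model string, so either can be placed in `.env`.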
Install the package (pip install -e .), configure .env (or export the same variables), then call the crawler from your code. No HTTP server required.
Async (recommended):
import asyncio

from clawcrawl import crawl


async def main() -> None:
    result = await crawl("https://example.com")
    print(result.markdown)
    for image in result.images:
        print(image.url, image.description[:200])


asyncio.run(main())

Sync (scripts, notebooks):
from clawcrawl import crawl_sync
result = crawl_sync("https://example.com")
print(result.markdown)

Custom settings (skip .env or override values):
from clawcrawl import Settings, crawl_sync
settings = Settings(
    firecrawl_api_key="fc-...",
    openrouter_api_key="sk-or-...",
    max_images=10,
    langfuse_enabled=False,
)
result = crawl_sync("https://example.com", settings=settings)

crawl / crawl_sync return a CrawlResponse (Pydantic model):
| Field | Description |
|---|---|
| `url` | Crawled URL |
| `markdown` | Page markdown with image blocks inlined |
| `images` | List of `ImageDescription` (`url`, `description`) |
| `metadata` | Firecrawl metadata dict |
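To make the "image blocks inlined" in `markdown` concrete, here is a small round trip that writes such a block and reads it back. It is a sketch under assumptions: the JSON keys (`url`, `description`) mirror the `ImageDescription` fields, but the exact payload, the newline between the comment and the italic line, and both function names are hypothetical.

```python
import json
import re

# Markdown image reference: ![alt](url "optional title")
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")
# Inlined block marker, assumed to hold a single-line JSON object.
IMAGE_DESC = re.compile(r"<!-- image-desc:(\{.*?\}) -->")


def inline_descriptions(markdown: str, descriptions: dict[str, str]) -> str:
    """Replace each known image reference with a comment plus italic text."""

    def replace(match: re.Match) -> str:
        url = match.group(1)
        text = descriptions.get(url)
        if text is None:
            return match.group(0)  # leave unknown images untouched
        payload = json.dumps({"url": url, "description": text})
        return f"<!-- image-desc:{payload} -->\n*[Image: {text}]*"

    return IMAGE_REF.sub(replace, markdown)


def read_descriptions(markdown: str) -> list[dict]:
    """Recover the structured payloads from enriched markdown."""
    return [json.loads(raw) for raw in IMAGE_DESC.findall(markdown)]


source = "Intro\n\n![chart](https://x.test/chart.png)\n"
enriched = inline_descriptions(source, {"https://x.test/chart.png": "A bar chart"})
recovered = read_descriptions(enriched)
```

Keeping the structured data in an HTML comment means renderers show only the italic text, while downstream tools can still parse the JSON.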
Lower-level access: run_crawl(url, settings) runs the pipeline without the top-level Langfuse “crawl” span (still traces per-step if Langfuse is enabled).
from clawcrawl import get_settings, run_crawl

# run_crawl is awaited, so call it from inside an async function.
result = await run_crawl("https://example.com", get_settings())

The HTTP service is a thin wrapper around clawcrawl.crawl:
uvicorn clawcrawl.main:app --reload --app-dir src --port 8000

- `GET /health` — liveness
- `GET /v1/models?output_modalities=text` — proxy OpenRouter model list (use `text,image` for vision-capable models)
- `POST /v1/crawl` — body: `{"url": "https://example.com", "text_model": "...", "vision_model": "..."}` (optional model overrides; blocking JSON response)
- `POST /v1/crawl/stream` — same body; Server-Sent Events with step-by-step progress and a terminal `crawl_done` or `crawl_error` event
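A client of the streaming endpoint needs to split the response into events and stop at the terminal one. The parser below is a minimal sketch of the Server-Sent Events line format using only the standard library. The `step` event name and the payload fields in the sample are assumptions; only `crawl_done` and `crawl_error` come from this README, and the server may carry the event type differently.

```python
import json
from collections.abc import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event, data) pairs from text/event-stream lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].removeprefix(" "))


# Hypothetical stream: one progress event, then the terminal event.
sample = [
    "event: step\n",
    'data: {"name": "scrape"}\n',
    "\n",
    "event: crawl_done\n",
    'data: {"markdown": "# Title"}\n',
    "\n",
]

final = None
for event, data in iter_sse_events(sample):
    if event in ("crawl_done", "crawl_error"):
        final = (event, json.loads(data))
        break
```

In practice the `lines` argument would be the decoded lines of the HTTP response body, read incrementally so that progress can be shown while the crawl runs.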
Response includes markdown, images (structured descriptions), and metadata from Firecrawl.
Set CLAWCRAWL_CORS_ORIGINS (comma-separated) if the browser calls the API directly instead of through the Next.js dev proxy (default: http://localhost:3000).
Paper-style Next.js app in frontend/:
# Terminal 1 — API on port 8000
uvicorn clawcrawl.main:app --reload --app-dir src --port 8000
# Terminal 2 — UI on port 3000 (proxies /api/* to the API)
cd frontend && npm install && npm run dev

Open http://localhost:3000, paste a URL, and watch the crawl progress rail while the enriched markdown appears on the sheet. Use the gear icon to pick OpenRouter models for the extract and describe steps; use Copy on the result sheet to copy the final markdown.
With the server running:
pip install httpx
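python easy_test.py "https://example.com"

This writes output/<host>.md. Override the API base with CLAWCRAWL_BASE_URL (default http://127.0.0.1:9000). The uvicorn commands in this README listen on port 8000, so set CLAWCRAWL_BASE_URL=http://127.0.0.1:8000 when following them.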
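The `output/<host>.md` naming can be pictured with a few lines of standard-library code. How `easy_test.py` actually derives the file name is not shown in this README, so `output_path_for` and its use of the URL hostname are assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse


def output_path_for(url: str, out_dir: str = "output") -> Path:
    """Map a crawled URL to output/<host>.md (assumed naming scheme)."""
    host = urlparse(url).hostname or "page"  # fall back if the URL has no host
    return Path(out_dir) / f"{host}.md"


target = output_path_for("https://example.com/docs?x=1")
```

With this scheme, crawling two pages on the same host would overwrite the same file, which is worth keeping in mind when testing several URLs.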
pytest -q

Unit tests disable Langfuse and use dummy API keys via env fixtures.
When tracing is enabled, a typical crawl looks like:
crawl → run_crawl → scrape_markdown | extract_image_links + extract_image_links.llm | describe_all → describe_one + describe_one.llm (×N) | replace_images_in_markdown
LLM generations record system/user prompts in input, structured output in output, token counts (input / output / total), and OpenRouter cost when returned on the completion.
src/clawcrawl/
  __init__.py     # Public exports: crawl, crawl_sync, models, settings
  api.py          # Library entrypoint (crawl / crawl_sync)
  main.py         # FastAPI app
  pipeline.py     # Orchestration
  config.py       # Settings from .env
  prompts/        # LLM prompts: <step>/system.md, user.md
  services/       # scrape, image_links, describe, replace
  llm/            # Instructor + OpenRouter clients
  telemetry/      # Langfuse helpers
easy_test.py      # POST URL → output/*.md
frontend/         # Next.js paper UI
tests/
Each pipeline step that calls an LLM has a folder under src/clawcrawl/prompts/ named after the step (e.g. extract_image_links, describe_one). Edit system.md and user.md there; user.md may use {placeholders} filled in by the service code.
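The prompt layout described above can be exercised with a short loader. This is a sketch, not the service code: `load_prompts` is a hypothetical name, the `{image_url}` placeholder is invented for the demo, and filling placeholders with `str.format` is an assumption about how the services substitute values.

```python
import tempfile
from pathlib import Path


def load_prompts(prompts_dir: Path, step: str, **values: str) -> tuple[str, str]:
    """Read <step>/system.md and <step>/user.md, filling user.md placeholders."""
    step_dir = prompts_dir / step
    system = (step_dir / "system.md").read_text(encoding="utf-8")
    user_template = (step_dir / "user.md").read_text(encoding="utf-8")
    return system, user_template.format(**values)


# Demo against a throwaway directory that mimics the prompts/ layout.
with tempfile.TemporaryDirectory() as tmp:
    step_dir = Path(tmp) / "describe_one"
    step_dir.mkdir()
    (step_dir / "system.md").write_text("You describe images.", encoding="utf-8")
    (step_dir / "user.md").write_text(
        "Describe the image at {image_url}.", encoding="utf-8"
    )
    system, user = load_prompts(
        Path(tmp), "describe_one", image_url="https://x.test/a.png"
    )
```

If substitution does work through `str.format`, any literal braces in `user.md` (for example a JSON sample) would need to be doubled as `{{` and `}}`.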
See repository for license terms if applicable.