clawcrawl

Multi-modal crawler: given a URL, fetch page markdown (Firecrawl), find images, describe them with a vision LLM (OpenRouter via Instructor), and return markdown with structured image descriptions inlined.

How it works

1. Scrape: Firecrawl returns the page markdown and metadata.
2. Extract: regex hints plus a text LLM list the image URLs (deduplicated and capped).
3. Describe: a vision LLM describes each image with bounded concurrency; a per-image failure becomes fallback text.
4. Replace: each image reference becomes the original image markdown plus *[Image: …]*.
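The steps above can be sketched roughly as follows. This is a minimal illustration, not the package's implementation: the helper names, the image regex, and the fallback string `description unavailable` are all assumptions, the vision call is passed in as a plain coroutine, and the extract step here is regex-only (the real step also consults a text LLM).

```python
import asyncio
import re
from typing import Awaitable, Callable

# Assumed pattern: plain markdown image references such as ![alt](https://host/pic.png)
IMAGE_RE = re.compile(r"!\[[^\]]*\]\((\S+?)\)")


def extract_image_urls(markdown: str, max_images: int) -> list[str]:
    """Collect image URLs in document order, deduplicated and capped."""
    unique = list(dict.fromkeys(IMAGE_RE.findall(markdown)))
    return unique[:max_images]


async def describe_all(
    urls: list[str],
    describe_one: Callable[[str], Awaitable[str]],
    concurrency: int,
) -> dict[str, str]:
    """Describe every URL with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(url: str) -> str:
        async with semaphore:
            try:
                return await describe_one(url)
            except Exception:
                # Hypothetical fallback text: one bad image does not abort the crawl.
                return "description unavailable"

    descriptions = await asyncio.gather(*(guarded(u) for u in urls))
    return dict(zip(urls, descriptions))


def replace_images(markdown: str, descriptions: dict[str, str]) -> str:
    """Append an *[Image: ...]* note after each described image reference."""

    def annotate(match: re.Match[str]) -> str:
        text = descriptions.get(match.group(1))
        if text is None:
            return match.group(0)
        return f"{match.group(0)} *[Image: {text}]*"

    return IMAGE_RE.sub(annotate, markdown)
```

The semaphore is what keeps concurrency bounded: `asyncio.gather` schedules every image at once, but only `concurrency` of them can be inside the vision call at any moment.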

Optional Langfuse tracing: one span per pipeline step, one generation per LLM call.

Requirements

Python 3.11+
Firecrawl API key
OpenRouter API key
Optional: Langfuse keys for observability

Setup

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
cp .env.example .env
# Edit .env: FIRECRAWL_API_KEY, OPENROUTER_API_KEY

Configuration

Settings are read from the process environment and from a .env file in the project root (via Pydantic Settings). Langfuse keys in .env are copied into os.environ on each get_settings() call so the Langfuse SDK can see them.
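To illustrate the last point, copying Langfuse values into the process environment might look like the sketch below. The function name `export_langfuse_env` and the choice to skip empty values are assumptions made for illustration; the actual logic lives in `config.py` and may differ.

```python
import os
from typing import Mapping, Optional

LANGFUSE_KEYS = ("LANGFUSE_SECRET_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_BASE_URL")


def export_langfuse_env(values: Mapping[str, Optional[str]]) -> list[str]:
    """Copy Langfuse values that are set into os.environ; return the exported names."""
    exported = []
    for name in LANGFUSE_KEYS:
        value = values.get(name)
        if value:  # skip unset or empty values (assumed behaviour)
            os.environ[name] = value
            exported.append(name)
    return exported
```

Values loaded only from `.env` never reach `os.environ` on their own, which is why a step like this is needed before an SDK that reads the environment directly can find them.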

| Variable | Description |
| --- | --- |
| `FIRECRAWL_API_KEY` | Required |
| `OPENROUTER_API_KEY` | Required |
| `CLAWCRAWL_TEXT_MODEL` | Image URL extraction (default: `openrouter/google/gemini-2.0-flash-lite-001`). OpenRouter IDs like `google/gemma-…` are auto-prefixed with `openrouter/`. |
| `CLAWCRAWL_VISION_MODEL` | Image description (default: `openrouter/google/gemma-3-27b-it`). Same `openrouter/` prefix rule. |
| `CLAWCRAWL_MAX_IMAGES` | Max images per crawl (default: `30`) |
| `CLAWCRAWL_IMAGE_MAX_BYTES` | Max download size per image, in bytes (default: `5242880`, i.e. 5 MiB) |
| `CLAWCRAWL_REQUEST_TIMEOUT_S` | HTTP timeout for image fetch, in seconds (default: `120`) |
| `CLAWCRAWL_DESCRIBE_CONCURRENCY` | Parallel vision calls (default: `5`) |
| `LANGFUSE_SECRET_KEY` | Optional; Langfuse project secret |
| `LANGFUSE_PUBLIC_KEY` | Optional; Langfuse public key |
| `LANGFUSE_BASE_URL` | Langfuse host (default: `https://cloud.langfuse.com`) |
| `CLAWCRAWL_LANGFUSE_ENABLED` | Enable tracing (default: `true`; set `false` for tests) |
| `CLAWCRAWL_CORS_ORIGINS` | Comma-separated browser origins (default: `http://localhost:3000`) |
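Two of the rules in the table are simple string transformations: the `openrouter/` prefix rule for model IDs and the comma-separated origin list. The sketch below shows one plausible reading of each; the function names are hypothetical and the real normalization in the package may handle more cases.

```python
OPENROUTER_PREFIX = "openrouter/"


def normalize_model_id(model: str) -> str:
    """Add the openrouter/ prefix unless the ID already carries it."""
    model = model.strip()
    if model.startswith(OPENROUTER_PREFIX):
        return model
    return OPENROUTER_PREFIX + model


def parse_cors_origins(raw: str) -> list[str]:
    """Split a comma-separated origin list, dropping blanks and extra spaces."""
    return [origin.strip() for origin in raw.split(",") if origin.strip()]
```

With this reading, `CLAWCRAWL_VISION_MODEL=google/gemma-3-27b-it` and `CLAWCRAWL_VISION_MODEL=openrouter/google/gemma-3-27b-it` resolve to the same model ID.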

Use as a Python library

Install the package (pip install -e .), configure .env (or export the same variables), then call the crawler from your code. No HTTP server required.

Async (recommended):

import asyncio

from clawcrawl import crawl

async def main() -> None:
    result = await crawl("https://example.com")
    print(result.markdown)
    for image in result.images:
        print(image.url, image.description[:200])

asyncio.run(main())

Sync (scripts, notebooks):

from clawcrawl import crawl_sync

result = crawl_sync("https://example.com")
print(result.markdown)

Custom settings (skip .env or override values):

from clawcrawl import Settings, crawl_sync

settings = Settings(
    firecrawl_api_key="fc-...",
    openrouter_api_key="sk-or-...",
    max_images=10,
    langfuse_enabled=False,
)
result = crawl_sync("https://example.com", settings=settings)

Return value

crawl and crawl_sync both return a CrawlResponse (a Pydantic model):

| Field | Description |
| --- | --- |
| `url` | Crawled URL |
| `markdown` | Page markdown with image blocks inlined |
| `images` | List of `ImageDescription` (`url`, `description`) |
| `metadata` | Firecrawl metadata dict |
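As a small usage sketch for these fields, the helper below builds a text summary from any object shaped like a CrawlResponse. The name `summarize` is hypothetical, and the code relies only on the attributes listed in the table.

```python
def summarize(result) -> str:
    """Build a short report from a CrawlResponse-like object."""
    lines = [f"{result.url}: {len(result.markdown)} chars, {len(result.images)} images"]
    for image in result.images:
        # Truncate long descriptions so the report stays scannable.
        lines.append(f"- {image.url}: {image.description[:80]}")
    return "\n".join(lines)
```

Calling `print(summarize(crawl_sync("https://example.com")))` would then print one header line followed by one line per described image.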

Lower-level access: run_crawl(url, settings) runs the pipeline without the top-level Langfuse “crawl” span (still traces per-step if Langfuse is enabled).

import asyncio

from clawcrawl import get_settings, run_crawl

# run_crawl must be awaited; asyncio.run drives it from synchronous code.
result = asyncio.run(run_crawl("https://example.com", get_settings()))

Run the API

The HTTP service is a thin wrapper around clawcrawl.crawl:

uvicorn clawcrawl.main:app --reload --app-dir src --port 8000

GET /health — liveness
GET /v1/models?output_modalities=text — proxy OpenRouter model list (use text,image for vision-capable models)
POST /v1/crawl — body: {"url": "https://example.com", "text_model": "...", "vision_model": "..."} (optional model overrides; blocking JSON response)
POST /v1/crawl/stream — same body; Server-Sent Events with step-by-step progress and a terminal crawl_done or crawl_error event

Response includes markdown, images (structured descriptions), and metadata from Firecrawl.
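A client for these endpoints could be sketched with the standard library alone, as below. The helper names are hypothetical, and the layout of the streamed events (an `event:` line followed by `data:` lines) is the generic Server-Sent Events format, assumed here rather than confirmed for this service. Only the paths, body fields, and the terminal event names `crawl_done` and `crawl_error` come from the list above.

```python
import json
import urllib.request
from typing import Iterable, Iterator, Optional


def build_crawl_request(
    base_url: str,
    url: str,
    text_model: Optional[str] = None,
    vision_model: Optional[str] = None,
    stream: bool = False,
) -> urllib.request.Request:
    """Build a POST request for /v1/crawl or /v1/crawl/stream."""
    body = {"url": url}
    if text_model:
        body["text_model"] = text_model
    if vision_model:
        body["vision_model"] = vision_model
    path = "/v1/crawl/stream" if stream else "/v1/crawl"
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event, data) pairs from Server-Sent Events text lines."""
    event, data = "message", []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:  # a blank line ends the current event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))
    if data:  # lenient: also emit a final event that lacks a trailing blank line
        yield event, "\n".join(data)
```

For the blocking endpoint, pass the request to `urllib.request.urlopen` and decode the body with `json.load`. For the streaming endpoint, decode the response line by line, feed the lines to `iter_sse_events`, and stop when the event name is `crawl_done` or `crawl_error`.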

If the browser calls the API directly instead of going through the Next.js dev proxy, set CLAWCRAWL_CORS_ORIGINS to a comma-separated list of allowed origins (default: http://localhost:3000).

Web UI

Paper-style Next.js app in frontend/:

# Terminal 1 — API on port 8000
uvicorn clawcrawl.main:app --reload --app-dir src --port 8000

# Terminal 2 — UI on port 3000 (proxies /api/* to the API)
cd frontend && npm install && npm run dev

Open http://localhost:3000, paste a URL, and watch the crawl progress rail while the enriched markdown appears on the sheet. Use the gear icon to pick OpenRouter models for extract and describe steps; use Copy on the result sheet to copy the final markdown.

Quick test client

With the server running:

pip install httpx
python easy_test.py "https://example.com"

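Writes output/<host>.md. Override the API base URL with CLAWCRAWL_BASE_URL (default http://127.0.0.1:9000). The uvicorn commands in this README start the server on port 8000, so set CLAWCRAWL_BASE_URL=http://127.0.0.1:8000 when following them.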
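The two conventions described here, the base URL override and the per-host output file, can be sketched as follows. Both helper names are hypothetical and `easy_test.py` may derive its paths differently; only the environment variable, its stated default, and the `output/<host>.md` pattern come from this README.

```python
import os
from pathlib import Path
from urllib.parse import urlparse

DEFAULT_BASE_URL = "http://127.0.0.1:9000"  # default stated in this README


def api_base_url() -> str:
    """Return the API base URL, honouring the CLAWCRAWL_BASE_URL override."""
    return os.environ.get("CLAWCRAWL_BASE_URL", DEFAULT_BASE_URL).rstrip("/")


def output_path_for(url: str, out_dir: str = "output") -> Path:
    """Map a crawled URL to output/<host>.md."""
    host = urlparse(url).hostname or "unknown"  # "unknown" is an assumed fallback
    return Path(out_dir) / f"{host}.md"
```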

Tests

pytest -q

Unit tests disable Langfuse and use dummy API keys via env fixtures.
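One way to get the same isolation outside the project's fixtures is `unittest.mock.patch.dict`, shown below as a sketch. The helper name `dummy_env` and the key values are placeholders, not the fixtures in `tests/`.

```python
import os
from unittest.mock import patch

DUMMY_ENV = {
    "FIRECRAWL_API_KEY": "test-firecrawl-key",
    "OPENROUTER_API_KEY": "test-openrouter-key",
    "CLAWCRAWL_LANGFUSE_ENABLED": "false",
}


def dummy_env():
    """Context manager that applies dummy settings and restores os.environ on exit."""
    return patch.dict(os.environ, DUMMY_ENV)
```

Wrapping a test body in `with dummy_env():` keeps real keys out of the run and leaves the environment as it was afterwards.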

Langfuse trace tree

When tracing is enabled, a typical crawl looks like:

crawl → run_crawl → scrape_markdown | extract_image_links + extract_image_links.llm | describe_all → describe_one + describe_one.llm (×N) | replace_images_in_markdown

LLM generations record system/user prompts in input, structured output in output, token counts (input / output / total), and OpenRouter cost when returned on the completion.

Project layout

src/clawcrawl/
  __init__.py      # Public exports: crawl, crawl_sync, models, settings
  api.py           # Library entrypoint (crawl / crawl_sync)
  main.py          # FastAPI app
  pipeline.py      # Orchestration
  config.py        # Settings from .env
  prompts/         # LLM prompts: <step>/system.md, user.md
  services/        # scrape, image_links, describe, replace
  llm/             # Instructor + OpenRouter clients
  telemetry/       # Langfuse helpers
easy_test.py       # POST URL → output/*.md
frontend/          # Next.js paper UI
tests/

Prompts

Each pipeline step that calls an LLM has a folder under src/clawcrawl/prompts/ named after the step (e.g. extract_image_links, describe_one). Edit system.md and user.md there; user.md may use {placeholders} filled in by the service code.
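Loading a step's prompts under this layout might look like the sketch below. The function `load_prompts` is hypothetical, and using `str.format` for the {placeholders} is an assumption about how the service code fills them.

```python
from pathlib import Path


def load_prompts(prompts_dir: Path, step: str, **values: str) -> tuple[str, str]:
    """Read <step>/system.md and <step>/user.md, filling {placeholders} in user.md."""
    step_dir = Path(prompts_dir) / step
    system = (step_dir / "system.md").read_text(encoding="utf-8")
    user_template = (step_dir / "user.md").read_text(encoding="utf-8")
    # str.format raises KeyError if a placeholder has no matching value.
    return system, user_template.format(**values)
```

If `str.format` is indeed what fills the placeholders, literal braces in `user.md` (for example a JSON snippet) would need to be doubled as `{{` and `}}`.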

License

See the LICENSE file in the repository for license terms.
