xspct_scan

xspct_scan is an async HTTP daemon that analyses Office, PDF, HTML, image, and archive files for malware indicators. It is designed to integrate with Rspamd and other mail-security pipelines, and exposes a simple HTTP API for on-demand scanning.

Features

Document analysis

  • Office / OLE2 + OOXML — VBA macro extraction and keyword analysis via oletools; automatic decryption of password-protected files with msoffcrypto-tool
  • PDF — deep content analysis via PyMuPDF (JavaScript, URIs, document metadata, encryption) plus structural keyword counts via vendored pdfid
  • HTML / SVG — script extraction, CSS-hiding detection, external resource tracking; SVG files (image/svg+xml) are treated as HTML because they are XML-based and can carry <script> elements, inline event handlers, and external href/src references
  • RTF — embedded object extraction via rtfobj (opt-in per request)
  • Dynamic JS emulation — sandboxed execution with quickjs and deobfuscation with jsbeautifier (optional; QuickJS emulation is disabled by default — enable with xspct_analyzers.javascript.quickjs: true)

Enrichment

  • IOC extraction — URLs, IPs, and domains from all document types
  • Extended IOCs — email addresses, file hashes, CVE IDs, and more via iocsearcher (optional)
  • YARA scanning — static signature matching via yara-python (classic engine, optional Hyperscan acceleration) and/or yara-x (Rust rewrite); both engines can run simultaneously for comparison
  • Image analysis — OCR text extraction via pytesseract (wraps Tesseract) and EasyOCR (both run in parallel; results are merged and deduplicated), QR-code and barcode decoding via pyzbar (requires the libzbar0 system library), and EXIF metadata extraction with GPS coordinate flagging (optional; all require [enrichment])
  • Archive extraction — sandboxed extraction via SFlock2 (zipjail usermode sandbox) covering ZIP, 7z, RAR, TAR/TGZ/TBZ2, CAB, ACE, ISO, EML, MSG, MSO, lzip, and ZPAQ; configurable depth/size limits, password loop for encrypted archives, recursive sub-file analysis. Without SFlock2, extraction falls back to the stdlib zipfile module and py7zr.
  • ClamAV integration — every file (and individual archive members) forwarded to a running clamd daemon for antivirus signature matching; results surfaced in the clamav response field and appended to analyses (optional; requires [enrichment])

Infrastructure

  • Parallel pipeline — analyzers run as concurrent asyncio tasks; partial results returned on timeout (202 Accepted) with analyzers_completed / analyzers_pending fields
  • Redis result cache — optional; survives restarts, shared across instances
  • Prometheus metrics — exposed at /v1/metrics
  • OpenAPI 3.0 — spec at /v1/openapi.json; ReDoc UI at /v1/apidoc/redoc
  • Admin API — live reload of config / passwords / YARA rules via POST /v1/admin/reload
  • API key auth — key sent in the X-Api-Key header, with rotation support; separate admin key
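The parallel-pipeline behaviour listed above can be illustrated with a minimal asyncio sketch. This is not the daemon's actual scheduler: `run_pipeline`, the two demo analyzers, and the `http_status` field are hypothetical names chosen for illustration, and the 200 status for a complete result is an assumption. The sketch only shows how concurrent tasks plus `asyncio.wait` with a timeout produce an `analyzers_completed` / `analyzers_pending` split.

```python
import asyncio


async def run_pipeline(analyzers, data, timeout):
    """Run analyzers concurrently and split them into completed / pending.

    `analyzers` maps a name to an async callable that takes the file bytes.
    Hypothetical helper for illustration; not the daemon's real scheduler.
    """
    tasks = {
        asyncio.create_task(func(data)): name
        for name, func in analyzers.items()
    }
    # Wait for all tasks, but stop waiting once the timeout elapses.
    done, pending = await asyncio.wait(list(tasks), timeout=timeout)

    report = {
        "results": {tasks[t]: t.result() for t in done},
        "analyzers_completed": sorted(tasks[t] for t in done),
        "analyzers_pending": sorted(tasks[t] for t in pending),
        # 202 for partial results follows the README; 200 is an assumption.
        "http_status": 202 if pending else 200,
    }
    # Simplification: cancel unfinished analyzers instead of letting them
    # continue in the background.
    for task in pending:
        task.cancel()
    if pending:
        await asyncio.gather(*pending, return_exceptions=True)
    return report


async def fast_analyzer(data):
    return {"size": len(data)}


async def slow_analyzer(data):
    await asyncio.sleep(5)
    return {"size": len(data)}
```

Calling `asyncio.run(run_pipeline({"iocs": fast_analyzer, "yara": slow_analyzer}, b"abc", timeout=0.2))` reports `iocs` as completed and `yara` as pending. The configuration keys xspct_foreground_slots and xspct_background_slots suggest the daemon keeps timed-out scans running rather than cancelling them as this sketch does.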

Quick start

pip install "git+https://github.com/HeinleinSupport/xspct_scan.git"
xspct_scan /etc/xspct_scan/config.yml

Scan a document:

curl -s -F "doc=@invoice.docx" http://localhost:8080/v1/scan | python3 -m json.tool

Or upload raw bytes:

curl -s -X POST http://localhost:8080/v1/scan \
  --data-binary @invoice.docx \
  -H "Content-Type: application/octet-stream" \
  | python3 -m json.tool

Requirements

  • Python 3.10+

  • libmagic system library

    # Debian / Ubuntu
    sudo apt-get install libmagic1
    
    # RHEL / Fedora
    sudo dnf install file-libs

Installation

From GitHub

pip install "git+https://github.com/HeinleinSupport/xspct_scan.git"

Optional extras

| Extra | Installs | Use when |
| --- | --- | --- |
| uvloop | uvloop | Higher-throughput async event loop |
| redis | redis[asyncio] | Persistent result cache across restarts |
| enrichment | Pillow, pytesseract, pyzbar, easyocr, clamd, jsbeautifier, quickjs, tree-sitter | Image OCR/barcode/EXIF (Tesseract + EasyOCR), ClamAV integration, JS deobfuscation (QuickJS sandbox opt-in via config) |
| openapi | pydantic>=2.0 | OpenAPI 3.0 spec + ReDoc UI |
| advanced | yara-python, yara-x, iocsearcher, py7zr, SFlock2 | YARA scanning, extended IOCs, sandboxed archive extraction (ZIP/RAR/7z/EML/MSG/…) |
| serialization | msgpack, cbor2 | msgpack and CBOR response serialization (negotiated via Accept header or xspct_response_format config) |
| compression | zstandard | zstd response compression (Accept-Encoding: zstd) and transparent zstd decompression of uploaded files |

pip install "xspct_scan[uvloop,redis,enrichment,openapi,advanced] @ git+https://github.com/HeinleinSupport/xspct_scan.git"

From source

git clone https://github.com/HeinleinSupport/xspct_scan.git
cd xspct_scan
pip install -e ".[uvloop,redis,enrichment,openapi,advanced]"

Configuration

Copy the example config and edit to suit:

cp config/xspct_scan.example.yml /etc/xspct_scan/config.yml
xspct_scan /etc/xspct_scan/config.yml

Key settings:

| Key | Default | Description |
| --- | --- | --- |
| xspct_listen_address | 0.0.0.0 | Bind address(es) |
| xspct_listen_port | 8080 | Listen port |
| xspct_api_key | (empty) | Shared secret for X-Api-Key auth |
| xspct_admin_api_key | (empty) | Key for POST /v1/admin/reload |
| xspct_redis_cache.enabled | false | Enable Redis result cache |
| xspct_password_file | | Path to wordlist for decrypting encrypted files |
| xspct_analyzers | (all enabled) | Per-analyzer enable/disable + options |
| xspct_analyzers.javascript.quickjs | false | Enable QuickJS sandbox emulation for JS |
| xspct_include_text | false | Include full extracted text in reports |
| xspct_response_format | auto | Response serialization: auto (negotiate via Accept header), json, msgpack, or cbor |
| xspct_archive_max_depth | 2 | Recursion limit for archive extraction |
| xspct_foreground_slots | 16 | Max concurrent scans holding a client connection open |
| xspct_background_slots | 4 | Max concurrent scans continuing after 202 timeout |

See docs/configuration.md for the full reference.

HTTP API

POST /v1/scan

Submit a document for analysis.

multipart/form-data (field doc):

curl -s -F "doc=@malware.xlsm" http://localhost:8080/v1/scan

application/octet-stream (raw bytes, metadata as query params):

curl -s -X POST "http://localhost:8080/v1/scan?filename=malware.xlsm" \
  --data-binary @malware.xlsm \
  -H "Content-Type: application/octet-stream"
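
For clients written in Python, the raw-bytes upload can be sketched with the standard library alone. `build_scan_request` is a hypothetical helper name; the endpoint, the `filename` query parameter, the Content-Type, and the X-Api-Key header come from this README. The function only constructs the request, so it can be inspected before anything is sent.

```python
import urllib.parse
import urllib.request


def build_scan_request(data, filename, base_url="http://localhost:8080", api_key=None):
    """Build (but do not send) a raw-bytes POST to /v1/scan.

    Hypothetical helper: metadata travels as query parameters and the file
    bytes form the request body, as in the curl example above.
    """
    query = urllib.parse.urlencode({"filename": filename})
    request = urllib.request.Request(
        f"{base_url}/v1/scan?{query}",
        data=data,
        method="POST",
        headers={"Content-Type": "application/octet-stream"},
    )
    if api_key:
        # Only needed when xspct_api_key is configured on the server.
        request.add_header("X-Api-Key", api_key)
    return request
```

To send it, pass the request to `urllib.request.urlopen` and decode the body with `json.load`. Note that `urlopen` raises `HTTPError` for 4xx/5xx responses, while a 202 Accepted is returned normally and can be checked via the response's `status` attribute.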

msgpack / CBOR responses — set the Accept header to request a non-JSON wire format (application/x-msgpack or application/cbor). Requires pip install "xspct_scan[serialization] @ git+https://github.com/HeinleinSupport/xspct_scan.git". The server-wide default is controlled by xspct_response_format.
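
On the client side, format negotiation might look like the following sketch. The two non-JSON MIME types are the ones named above; `application/json` for the JSON case, and the assumption that the server echoes the chosen type in the response Content-Type header, are not confirmed by this README. `accept_header` and `decode_body` are hypothetical helper names. The msgpack and cbor2 imports are deferred so that JSON-only clients need no extra packages.

```python
import json

# MIME types from the README; "application/json" is an assumption.
MIME_BY_FORMAT = {
    "json": "application/json",
    "msgpack": "application/x-msgpack",
    "cbor": "application/cbor",
}


def accept_header(fmt):
    """Return the request header that asks for the given wire format."""
    return {"Accept": MIME_BY_FORMAT[fmt]}


def decode_body(content_type, body):
    """Decode a response body according to its Content-Type header."""
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == MIME_BY_FORMAT["msgpack"]:
        import msgpack  # third-party, imported only when needed

        return msgpack.unpackb(body, raw=False)
    if mime == MIME_BY_FORMAT["cbor"]:
        import cbor2  # third-party, imported only when needed

        return cbor2.loads(body)
    # Anything else is treated as JSON in this sketch.
    return json.loads(body)
```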

zstd-compressed responses — add Accept-Encoding: zstd to receive a zstd-compressed response body (Content-Encoding: zstd). Requires pip install "xspct_scan[compression] @ git+https://github.com/HeinleinSupport/xspct_scan.git".

zstd-compressed uploads — the daemon transparently decompresses a zstd-compressed doc part or octet-stream body (detected via the Zstandard frame magic bytes). The .zst filename suffix is stripped before type detection.
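
The magic-byte detection described above can be sketched as follows. Zstandard frames start with the magic number 0xFD2FB528 stored little-endian, i.e. the bytes 28 B5 2F FD. The function names are hypothetical, and the daemon's own detection logic may differ (for example in how it treats skippable frames).

```python
# Zstandard frame magic number 0xFD2FB528, stored little-endian.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def is_zstd_frame(data):
    """Return True if the payload starts with a Zstandard frame header."""
    return data[:4] == ZSTD_MAGIC


def strip_zst_suffix(filename):
    """Drop a trailing .zst so type detection sees the inner file name."""
    if filename.lower().endswith(".zst"):
        return filename[:-4]
    return filename
```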

Example response:

{
  "filename": "malware.xlsm",
  "file_hash": "sha256...",
  "detected_type": "office",
  "has_macro": true,
  "analyses": [{"type": "AutoExec", "keyword": "AutoOpen", "description": "..."}],
  "iocs": {"urls": ["https://evil.example/payload"], "ips": [], "domains": []},
  "iocs_extended": {"url": ["https://evil.example/payload"], "email": []},
  "yara_matches": [{"engine": "classic", "rule": "Eicar_Test", "tags": [], ...}],
  "pdfid_keywords": null,
  "archive_files": [],
  "exif": {},
  "text_preview": "...",
  "analyzers_completed": ["office", "yara", "iocs"],
  "analyzers_pending": [],
  "status": "finished",
  "time_taken": 0.18
}

If analysis exceeds the configured timeout, the server returns 202 Accepted with partial results. Poll /v1/query?hash=<sha256> for the final result:

curl "http://localhost:8080/v1/query?hash=sha256..."
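
A polling loop for the 202 case could be sketched like this. Several points are assumptions rather than documented behaviour: that the hash is the hex SHA-256 of the uploaded bytes, that /v1/query responses carry the same `status` field as scan responses, and that `"finished"` is the terminal value. `fetch` is a caller-supplied function (hypothetical) that performs the HTTP request and returns the decoded report, which keeps the loop independent of any HTTP library.

```python
import hashlib
import time


def sha256_hex(data):
    """Hex SHA-256 of the file bytes (assumed to match file_hash)."""
    return hashlib.sha256(data).hexdigest()


def poll_until_finished(fetch, file_hash, interval=1.0, max_attempts=30, sleep=time.sleep):
    """Call fetch(file_hash) until the report says it is finished.

    `fetch` must return the decoded /v1/query response as a dict.
    Raises TimeoutError if the report never reaches status "finished".
    """
    for attempt in range(max_attempts):
        report = fetch(file_hash)
        if report.get("status") == "finished":
            return report
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"no finished result for {file_hash} after {max_attempts} attempts")
```

Where the scan response already contains a `file_hash`, using that value avoids any mismatch with a locally computed hash.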

Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/scan | POST | Submit file for analysis |
| /v1/query | GET / POST | Retrieve result by SHA-256 hash |
| /health | GET | Returns {"status":"ok"}; load-balancer check (unversioned) |
| /ping | GET | Returns pong (unversioned) |
| /v1/metrics | GET | Prometheus metrics |
| /v1/openapi.json | GET | OpenAPI 3.0 spec (requires [openapi]) |
| /v1/apidoc/redoc | GET | ReDoc UI (requires [openapi]) |
| /v1/admin/reload | POST | Live-reload config/passwords/YARA rules |

See docs/api-http.md for full request/response details.

Decrypting password-protected files

xspct_scan automatically tries to decrypt encrypted Office and PDF documents using a password list loaded at startup.

Global password list

Point xspct_password_file at a newline-delimited file of candidate passwords (lines starting with # are ignored):

xspct_password_file: /etc/xspct_scan/passwords.txt

The file is reloaded on POST /v1/admin/reload. If the file is not found, a small set of built-in defaults (infected, virus, malware, …) is used.
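
The file format can be illustrated with a small loader sketch. `load_passwords` is a hypothetical name, the fallback list contains only the three defaults named above (the real built-in set is elided in this README), and skipping blank lines is an assumption. Lines are not stripped, since leading or trailing spaces may be part of a password.

```python
from pathlib import Path

# Only the defaults this README names; the actual built-in list is longer.
FALLBACK_PASSWORDS = ["infected", "virus", "malware"]


def load_passwords(path):
    """Read a newline-delimited password list, ignoring '#' comment lines."""
    try:
        text = Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        return list(FALLBACK_PASSWORDS)
    passwords = []
    for line in text.splitlines():
        # Assumption: blank lines are skipped along with comments.
        if not line or line.startswith("#"):
            continue
        passwords.append(line)
    return passwords
```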

Per-request passwords

Extra passwords supplied with the request are tried before the global list:

curl -s \
  -F "doc=@protected.xlsx" \
  -F "passwords=Secret123,CompanyPass" \
  http://localhost:8080/v1/scan

When decryption succeeds, the response includes "decrypted": true and "decryption_password": "Secret123".
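
The ordering rule (per-request passwords first, then the global list) can be sketched as below. `candidate_passwords` is a hypothetical helper; the comma-separated form value follows the curl example, while dropping duplicates is an assumption made here to avoid retrying the same password.

```python
def candidate_passwords(form_value, global_passwords):
    """Return passwords in try-order: per-request entries, then the global list."""
    extra = [p for p in form_value.split(",") if p] if form_value else []
    ordered = []
    seen = set()
    for password in extra + list(global_passwords):
        # Assumption: each distinct password is tried only once.
        if password not in seen:
            seen.add(password)
            ordered.append(password)
    return ordered
```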

YARA scanning

When YARA rules are loaded, YARA runs on every file — PDFs, HTML, Office documents, images, plain text, archive members, and unknown blobs. Two engines can run in parallel for comparison or redundancy:

xspct_analyzers:
  yara:
    enabled: true
    rules_path: /etc/xspct_scan/rules/       # classic yara-python
  yara_x:
    enabled: true
    rules_path: /etc/xspct_scan/rules/       # yara-x (Rust)

Each match in yara_matches carries an "engine" field ("classic" or "yara-x"). Reload rules without a restart via POST /v1/admin/reload.
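
When both engines are enabled, a client can use the "engine" field to compare their results. The sketch below works on the `yara_matches` list from a scan response; the helper names and the returned dictionary keys are hypothetical, and only the `engine` and `rule` fields shown in the example response are used.

```python
from collections import defaultdict


def rules_by_engine(yara_matches):
    """Group matched rule names by the engine that reported them."""
    grouped = defaultdict(set)
    for match in yara_matches:
        grouped[match["engine"]].add(match["rule"])
    return grouped


def engine_disagreements(yara_matches):
    """List rules that only one of the two engines matched."""
    grouped = rules_by_engine(yara_matches)
    classic = grouped.get("classic", set())
    yara_x = grouped.get("yara-x", set())
    return {
        "classic_only": sorted(classic - yara_x),
        "yara_x_only": sorted(yara_x - classic),
    }
```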

Sandboxed archive extraction

Install SFlock2 (included in [advanced]) to enable sandboxed extraction via zipjail:

# Python package
pip install "xspct_scan[advanced] @ git+https://github.com/HeinleinSupport/xspct_scan.git"

# System packages for full native-format support (Debian / Ubuntu)
sudo apt-get install p7zip-full rar unace-nonfree cabextract lzip zpaq

With SFlock2 installed, the following formats are extracted in-sandbox: ZIP, 7z, RAR, TAR, TAR.GZ, TBZ2, CAB, ACE, ISO, EML, MSG, MSO, lzip, ZPAQ. EML and MSG files are routed through the archive pipeline automatically so that email attachments are extracted and analysed.

Image OCR and QR/barcode scanning

When [enrichment] is installed, raster images (JPEG, PNG, GIF, BMP, TIFF, WebP, ICO) are passed through two additional analysis steps:

  1. OCR — Tesseract and EasyOCR both run in parallel and extract embedded text, which is then included in the IOC extraction pipeline (URLs, IPs, domains, etc.). EasyOCR tries a normal and an inverted variant and stops as soon as text is found.
  2. QR / barcode decode — pyzbar decodes any QR codes or 1-D barcodes found in the image; decoded payloads are surfaced in qr_codes and added to the IOC results.

System dependencies

# Debian / Ubuntu
sudo apt-get install tesseract-ocr libzbar0

# RHEL / Fedora
sudo dnf install tesseract zbar

Enabling / disabling

xspct_analyzers:
  image:
    enabled: true   # set to false to skip OCR and QR decode entirely

SVG analysis

SVG files are XML-based vector graphics that can embed <script> tags, inline event handlers (onload, onclick, …), and external references — making them a phishing and malware delivery vector.

xspct_scan detects SVG by MIME type (image/svg+xml) and .svg extension and routes the file through the HTML analyzer rather than the image pipeline. All HTML checks apply: script extraction, CSS-hiding detection, external resource tracking, and IOC extraction.

No additional configuration or packages are required; SVG analysis is active whenever the HTML analyzer is enabled.
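
The routing decision can be sketched as a small function. `pick_analyzer` and its return values are hypothetical, and the sketch assumes that either signal (the image/svg+xml MIME type or the .svg extension) is enough to route a file to the HTML analyzer. The key point is that the SVG check runs before the generic image check.

```python
import os


def pick_analyzer(filename, mime_type):
    """Choose an analyzer name; SVG is checked before other image types."""
    extension = os.path.splitext(filename)[1].lower()
    # Assumption: either the MIME type or the extension marks a file as SVG.
    if mime_type == "image/svg+xml" or extension == ".svg":
        return "html"
    if mime_type.startswith("image/"):
        return "image"
    if mime_type == "text/html":
        return "html"
    return "other"
```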

ClamAV integration

xspct_scan can forward every scanned file (and individual archive members) to a running clamd daemon for antivirus signature matching.

Requirements

# Python library
pip install "xspct_scan[enrichment] @ git+https://github.com/HeinleinSupport/xspct_scan.git"

# ClamAV daemon (Debian / Ubuntu)
sudo apt-get install clamav-daemon
sudo systemctl enable --now clamav-daemon

Configuration

xspct_clamav:
  enabled: true
  socket: /var/run/clamav/clamd.ctl   # Unix socket (preferred); set to '' to use TCP
  host: 127.0.0.1                      # TCP host (used when socket is empty)
  port: 3310                           # TCP port
  timeout: 60                          # per-scan timeout in seconds
  max_size: 26214400                   # skip files larger than this (bytes; default 25 MiB)
  scan_members: true                   # also scan individual archive members

When socket is non-empty, a Unix domain socket is used; otherwise a TCP connection is made to host:port.
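
The socket-versus-TCP rule and the max_size check can be sketched as follows, using the key names from the configuration above. The helper functions are hypothetical and the defaults mirror the example values. `connect` relies on the third-party clamd package's `ClamdUnixSocket` and `ClamdNetworkSocket` classes and is only one way such a client might be opened, not necessarily how xspct_scan does it.

```python
def clamd_endpoint(cfg):
    """Return ("unix", path) or ("tcp", (host, port)) from an xspct_clamav mapping."""
    socket_path = cfg.get("socket") or ""
    if socket_path:
        return ("unix", socket_path)
    return ("tcp", (cfg.get("host", "127.0.0.1"), cfg.get("port", 3310)))


def should_scan(size_bytes, cfg):
    """Files larger than max_size are skipped rather than sent to clamd."""
    return size_bytes <= cfg.get("max_size", 26214400)


def connect(cfg):
    """Open a clamd client for the selected endpoint (lazy third-party import)."""
    import clamd

    kind, target = clamd_endpoint(cfg)
    timeout = cfg.get("timeout", 60)
    if kind == "unix":
        return clamd.ClamdUnixSocket(path=target, timeout=timeout)
    host, port = target
    return clamd.ClamdNetworkSocket(host=host, port=port, timeout=timeout)
```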

Response fields

ClamAV results appear in the scan response under clamav:

{
  "clamav": {
    "status": "infected",
    "signature": "Win.Trojan.Agent-12345"
  }
}

Possible status values: clean, infected, error, skipped (file exceeds max_size), disabled.

Prometheus counters xspct_clamav_clean, xspct_clamav_infected, xspct_clamav_errors, and xspct_clamav_timeouts track ClamAV scan outcomes at /v1/metrics.

Systemd unit

[Unit]
Description=xspct_scan malware scanner
After=network.target

[Service]
Type=simple
User=xspct-scan
ExecStart=/usr/local/bin/xspct_scan /etc/xspct_scan/config.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

Documentation

Full docs are in the docs/ directory and can be built with Sphinx:

pip install "xspct_scan[docs] @ git+https://github.com/HeinleinSupport/xspct_scan.git"
sphinx-build docs docs/_build/html

Licence

EUPL-1.2 — © 2026 Carsten Rosenberg, Heinlein Support GmbH
