CrawlerScope

CrawlerScope collects operator-published crawler, fetcher, monitoring, scanner, and preview-bot network ranges, normalizes them into deployable CIDR data, and publishes a static dashboard plus machine-readable artifacts for infrastructure and security teams.

Live dashboard: ipanalytics.github.io/CrawlerScope
Current dataset: data/current/crawlers.json

Overview

CrawlerScope is a small, auditable data pipeline for bot network intelligence. It tracks published source health, separates authoritative IP feeds from documented user-agent-only identities, and emits artifacts suitable for WAF rules, reverse proxies, allowlists, deny controls, analytics enrichment, and incident triage.

The project intentionally keeps source definitions in data, not code. Collector behavior lives in scripts/update.py; operator sources live in config/sources.json.

Current Dataset

Generated at 2026-05-26T12:01:22Z.

Metric	Count
Services	43
Healthy sources	43
Authoritative IP lists	32
CIDR prefixes	7,180
IPv4 prefixes	6,705
IPv6 prefixes	475
AI crawler/fetcher prefixes	1,653

Category	Services
AI crawlers	13
Search crawlers	9
Monitoring probes	5
Social previews	4
Fetchers	3
SEO crawlers	3
Ad verification	2
Security scanners	2
Archive	1
Analytics crawlers	1

Tracked services

Service	Category	Source type	Prefixes
Google common crawlers	search	official_json	69
Google special crawlers	search	official_json	46
Google user-triggered fetchers	fetcher	official_json	223
Bingbot	search	official_json	28
DuckDuckBot	search	official_json	334
DuckAssistBot	ai	official_json	334
Applebot	search	official_json	12
MojeekBot	search	official_json	1
Naver Yeti	search	official_json	36
YandexBot	search	known_static	13
Baiduspider	search	known_static	2
GPTBot	ai	official_json	17
OAI-SearchBot	ai	official_json	32
ChatGPT-User	ai	official_json	214
OAI-AdsBot	ai	documented_user_agent	0
PerplexityBot	ai	official_json	8
Perplexity-User	ai	official_json	4
ClaudeBot / Claude-SearchBot	ai	documented_user_agent	0
Amazonbot	ai	official_embedded_json	524
Amzn-SearchBot	ai	official_embedded_json	512
Amzn-User	fetcher	official_embedded_json	1,023
Meta-ExternalAgent / Meta-WebIndexer	ai	known_static	4
Bytespider	ai	documented_user_agent	0
MistralAI-User	ai	official_json	4
AhrefsBot	seo	official_json	51
Lumar crawler	seo	official_json	66
SemrushBot	seo	documented_user_agent	0
Censys scanners	security-scanner	known_static	2
Shodan scanners	security-scanner	known_static	9
Datadog Synthetics	monitoring	official_json	113
IAS crawler	ad-verification	official_json	14
TTD-Content crawler	ad-verification	official_text	2,615
UptimeRobot	monitoring	official_text	217
Pingdom probes	monitoring	official_text	158
StatusCake probes	monitoring	official_json	296
Better Stack probes	monitoring	official_text	34
Common Crawl CCBot	archive	official_json	6
Flipboard crawler	social	official_text	136
Parse.ly crawler	analytics	official_json	10
Pinterestbot	social	documented_user_agent	0
LinkedInBot	social	documented_user_agent	0
Telegram link preview	social	official_text	11
RSS API feed parser	fetcher	official_text	2

Architecture

CrawlerScope runs as a scheduled GitHub Actions collector and publishes static artifacts.

flowchart LR
  A["config/sources.json"] --> B["scripts/update.py"]
  B --> C["Fetch operator sources"]
  C --> D["Normalize and collapse CIDR prefixes"]
  D --> E["data/current/crawlers.json"]
  D --> F["data/current/robots-ai.txt"]
  D --> G["data/current/nginx-ai-map.conf"]
  D --> H["data/snapshots/*.json"]
  E --> I["Static dashboard"]
  H --> J["GitHub Release artifacts"]

Source types:

Type	Meaning
`official_json`	Operator-published machine-readable JSON feed
`official_text`	Operator-published plain-text CIDR/IP feed
`official_embedded_json`	Operator page with machine-readable ranges embedded in HTML
`documented_user_agent`	Documented bot identity without a stable public IP list
`known_static`	Useful static seed list, not treated as complete authority
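
The source types above can be turned into a simple enforcement hint. The following is a minimal sketch, not the collector's actual logic: the function names and hint labels are hypothetical, and it assumes that only the three `official_*` types count as authoritative IP feeds.

```python
# Assumption: only operator-published feeds are treated as authoritative.
# The real collector may derive its ipListAuthoritative flag differently.
AUTHORITATIVE_SOURCE_TYPES = {
    "official_json",
    "official_text",
    "official_embedded_json",
}


def is_ip_list_authoritative(source_type: str) -> bool:
    """Return True when the source type denotes an operator-published IP feed."""
    return source_type in AUTHORITATIVE_SOURCE_TYPES


def enforcement_hint(source_type: str) -> str:
    """Map a source type to a hypothetical hint about which signals to trust."""
    if is_ip_list_authoritative(source_type):
        return "ip-and-user-agent"
    if source_type == "known_static":
        # Static seed lists may be incomplete, so an IP match is only a weak signal.
        return "ip-as-weak-signal"
    if source_type == "documented_user_agent":
        # No published ranges: only the user-agent string is available.
        return "user-agent-only"
    raise ValueError(f"unknown source type: {source_type!r}")
```

Raising on unknown types is a deliberate choice in this sketch, so that a new source type added to config/sources.json cannot silently fall into a default bucket.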

Features

Operator-published source collection with source health tracking.
IPv4/IPv6 normalization, CIDR coercion, and prefix collapsing.
Static dashboard with category, operator, source, service, and search filters.
Filtered exports for JSON, CSV, CIDR lists, robots.txt, and Nginx user-agent maps.
Snapshot retention and historical summary tracking.
GitHub Pages publication and automatic dataset releases.
Config-driven source inventory in config/sources.json.
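
The normalization and collapsing step listed above can be illustrated with the standard-library `ipaddress` module. This is a sketch of the general technique rather than the code in scripts/update.py; the function name `normalize_prefixes` is hypothetical.

```python
import ipaddress


def normalize_prefixes(raw_entries):
    """Coerce raw IP/CIDR strings to networks, then collapse per address family."""
    v4, v6 = [], []
    for entry in raw_entries:
        text = entry.strip()
        if not text:
            continue
        # strict=False coerces input with host bits set (192.0.2.5/24 -> 192.0.2.0/24).
        # A bare address becomes a /32 (IPv4) or /128 (IPv6) network.
        net = ipaddress.ip_network(text, strict=False)
        (v4 if net.version == 4 else v6).append(net)
    # collapse_addresses merges adjacent and overlapping networks.
    # It must be called separately for each address family.
    return {
        "ipv4": [str(n) for n in ipaddress.collapse_addresses(v4)],
        "ipv6": [str(n) for n in ipaddress.collapse_addresses(v6)],
    }
```

For example, `normalize_prefixes(["192.0.2.0/25", "192.0.2.128/25", "2001:db8::1"])` merges the two adjacent /25 networks into `192.0.2.0/24` and reports the IPv6 address as `2001:db8::1/128`.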

Quick Start

Run the collector and serve the dashboard locally:

python3 scripts/update.py
python3 -m http.server 8080

Open:

http://127.0.0.1:8080/public/

The dashboard in public/ reads its data from ../data/current, so start the server from the repository root as shown above rather than from inside public/. For GitHub Pages deployment, the workflow copies public/ and data/ into the Pages artifact.

Installation

The collector depends only on the Python standard library, so nothing needs to be installed beyond Python 3.

git clone https://github.com/ipanalytics/CrawlerScope.git
cd CrawlerScope
python3 scripts/update.py

Optional environment controls:

export CRAWLER_SCOPE_USER_AGENT="CrawlerScope/0.1 (+https://example.org/contact)"
export CRAWLER_SCOPE_SNAPSHOT_RETENTION=168
export CRAWLER_SCOPE_HISTORY_RETENTION=720
python3 scripts/update.py
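
A script could read the retention variables along the following lines. This is a sketch only: the helper name is hypothetical, and the fallback values are copied from the example exports above, not confirmed defaults of scripts/update.py. The unit of each retention value is not stated in this README, so the sketch treats them as plain positive integers.

```python
import os

# Placeholder fallbacks taken from the example exports; the collector's
# real defaults are not documented here.
ASSUMED_DEFAULTS = {
    "CRAWLER_SCOPE_SNAPSHOT_RETENTION": 168,
    "CRAWLER_SCOPE_HISTORY_RETENTION": 720,
}


def read_retention(name, environ=None):
    """Read a retention setting as a positive integer, falling back when unset."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name, "").strip()
    if not raw:
        return ASSUMED_DEFAULTS[name]
    value = int(raw)  # raises ValueError for non-numeric input
    if value < 1:
        raise ValueError(f"{name} must be a positive integer, got {value}")
    return value
```

Failing loudly on a malformed value is a design choice here: a silently ignored typo in a retention setting could otherwise delete more snapshots than intended.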

Usage Examples

Export all current CIDRs:

jq -r '.services[].prefixes | .ipv4[], .ipv6[]' data/current/crawlers.json

Export AI crawler CIDRs:

jq -r '.services[] | select(.category == "ai") | .prefixes | .ipv4[], .ipv6[]' data/current/crawlers.json

List sources that are documented but do not publish IP ranges:

jq -r '.services[] | select(.sourceType == "documented_user_agent") | [.id, .service, .sourceUrl] | @tsv' data/current/crawlers.json

Install the generated Nginx map as an include and validate the configuration:

cp data/current/nginx-ai-map.conf /etc/nginx/conf.d/crawler-scope-ai-map.conf
nginx -t
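
Where jq is not available, the same queries can be written in Python. The sketch below relies only on the fields used by the jq examples (`services`, `category`, `sourceType`, `sourceUrl`, `prefixes`); the function names are hypothetical.

```python
import json


def load_services(path="data/current/crawlers.json"):
    """Load the list of service records from the current dataset."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["services"]


def cidrs(services, category=None):
    """Return IPv4 then IPv6 prefixes per service, optionally for one category."""
    out = []
    for svc in services:
        if category is not None and svc.get("category") != category:
            continue
        prefixes = svc.get("prefixes", {})
        out.extend(prefixes.get("ipv4", []))
        out.extend(prefixes.get("ipv6", []))
    return out


def user_agent_only(services):
    """Return (id, service, sourceUrl) for sources that publish no IP ranges."""
    return [
        (svc.get("id"), svc.get("service"), svc.get("sourceUrl"))
        for svc in services
        if svc.get("sourceType") == "documented_user_agent"
    ]
```

With these helpers, `cidrs(load_services(), category="ai")` corresponds to the AI crawler export above.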

Outputs

Path	Description
`data/current/crawlers.json`	Full normalized dataset
`data/current/robots-ai.txt`	Generated AI crawler `robots.txt` block
`data/current/nginx-ai-map.conf`	Nginx `map` for AI crawler user-agents
`data/history/summary.csv`	Historical summary rows
`data/snapshots/*.json`	Timestamped dataset snapshots
`config/sources.json`	Source inventory and classification config

Data Format

Each service record includes source metadata, user-agent patterns, reverse-DNS hints, health status, prefix counts, and split IPv4/IPv6 arrays.

{
  "id": "openai-gptbot",
  "service": "GPTBot",
  "operator": "OpenAI",
  "category": "ai",
  "sourceType": "official_json",
  "sourceOk": true,
  "ipListAuthoritative": true,
  "userAgentPatterns": ["GPTBot"],
  "counts": {
    "prefixes": 17,
    "ipv4": 17,
    "ipv6": 0
  },
  "prefixes": {
    "ipv4": ["20.42.10.176/28"],
    "ipv6": []
  }
}
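
A record in this shape can be checked against a client address with the standard-library `ipaddress` module. The following is a minimal sketch; the function name `ip_matches_service` is hypothetical, and the record is reduced to the fields the check needs.

```python
import ipaddress


def ip_matches_service(ip, service):
    """Return True if the address falls inside any prefix of the service record."""
    addr = ipaddress.ip_address(ip)
    # Only compare against prefixes of the same address family.
    family = "ipv4" if addr.version == 4 else "ipv6"
    for cidr in service.get("prefixes", {}).get(family, []):
        if addr in ipaddress.ip_network(cidr):
            return True
    return False


# Reduced copy of the example record above.
gptbot = {
    "id": "openai-gptbot",
    "prefixes": {"ipv4": ["20.42.10.176/28"], "ipv6": []},
}

print(ip_matches_service("20.42.10.180", gptbot))
print(ip_matches_service("20.42.10.192", gptbot))
```

The first call prints `True` because 20.42.10.176/28 spans 20.42.10.176 through 20.42.10.191; the second prints `False`. A linear scan is adequate for a single record; matching against all 7,180 prefixes on a hot path would likely call for a prebuilt lookup structure.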

Operational Notes

Treat sourceOk=false as a collection failure for that run. The collector falls back to the previous cached prefixes when available.
IP ranges identify published infrastructure, not intent. Use user-agent, reverse DNS, request behavior, and application context where enforcement risk matters.
Static and documented-only sources are included because they are operationally useful, but authoritative flags remain separate.
Release artifacts are generated by GitHub Actions after collection and attached to timestamped dataset releases.
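
The fallback behavior described in the first note can be sketched as follows. This illustrates the pattern only and is not taken from scripts/update.py: the function name and the `usedCache` field are hypothetical, and `fetch` stands for any callable that returns the prefixes of one source.

```python
def collect_with_fallback(fetch, cached_prefixes=None):
    """Run one source fetch and fall back to cached prefixes if it fails."""
    try:
        prefixes = list(fetch())
    except (OSError, ValueError):
        # urllib network errors derive from OSError; JSON decode errors
        # and malformed CIDR strings derive from ValueError.
        return {
            "sourceOk": False,
            "prefixes": list(cached_prefixes or []),
            "usedCache": bool(cached_prefixes),
        }
    return {"sourceOk": True, "prefixes": prefixes, "usedCache": False}
```

Keeping `sourceOk` false while still serving cached prefixes lets consumers tell a stale but usable record apart from a freshly collected one.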

Project Scope

CrawlerScope tracks public crawler, fetcher, monitoring, scanner, analytics, and preview-bot infrastructure that is useful for request classification and network policy. It prioritizes primary operator-published sources. Aggregator repositories may be reviewed for discovery, but their URLs are not used as dataset sources.

Use Cases

WAF allow/deny policy design for crawler traffic.
Search and AI crawler visibility audits.
Security logging enrichment and bot attribution.
Monitoring probe allowlisting.
Fraud/risk triage for automated traffic.
Change tracking for published crawler infrastructure.

Limitations

Some operators publish user-agent documentation but no stable IP feed.
Cloud-hosted crawlers may share network space with unrelated workloads.
CIDR lists can change without notice; scheduled collection shortens the lag between an upstream change and the published dataset but does not eliminate it.
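
Because of these limitations, an IP match or a user-agent match alone may be ambiguous. The sketch below combines both signals for one service record. The function name and result labels are hypothetical, and it assumes that entries in `userAgentPatterns` are case-insensitive substrings, which this README does not confirm.

```python
import ipaddress


def classify_request(ip, user_agent, service):
    """Combine IP-range and user-agent signals for a single service record."""
    addr = ipaddress.ip_address(ip)
    family = "ipv4" if addr.version == 4 else "ipv6"
    ip_hit = any(
        addr in ipaddress.ip_network(cidr)
        for cidr in service.get("prefixes", {}).get(family, [])
    )
    # Assumption: patterns are plain substrings, compared case-insensitively.
    ua = user_agent.lower()
    ua_hit = any(p.lower() in ua for p in service.get("userAgentPatterns", []))

    if ip_hit and ua_hit:
        return "ip-and-user-agent"
    if ip_hit:
        # Shared cloud space: the address matches but the client claims otherwise.
        return "ip-only"
    if ua_hit:
        # Possibly spoofed, or the operator publishes no IP ranges.
        return "user-agent-only"
    return "no-match"
```

How each label maps to an allow or deny decision depends on the enforcement risk, so the sketch leaves that policy to the caller.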

Directory Structure

.
├── config/
│   └── sources.json
├── data/
│   ├── current/
│   ├── history/
│   └── snapshots/
├── public/
│   ├── assets/
│   └── index.html
├── scripts/
│   └── update.py
└── .github/
    └── workflows/

Deployment

The included workflow runs every six hours and can be triggered manually:

on:
  schedule:
    - cron: "23 */6 * * *"
  workflow_dispatch:

The workflow:

Runs scripts/update.py.
Commits updated data/ and config/ changes.
Publishes a timestamped GitHub Release with dataset artifacts.
Deploys the static dashboard to GitHub Pages.

License

CrawlerScope is released under the MIT License.

Disclaimer

CrawlerScope publishes normalized data from public operator sources. Review upstream terms and validate enforcement logic before using the dataset in production controls.
