CrawlerScope collects operator-published crawler, fetcher, monitoring, scanner, and preview-bot network ranges, normalizes them into deployable CIDR data, and publishes a static dashboard plus machine-readable artifacts for infrastructure and security teams.
Live dashboard: ipanalytics.github.io/CrawlerScope
Current dataset: data/current/crawlers.json
CrawlerScope is a small, auditable data pipeline for bot network intelligence. It tracks published source health, separates authoritative IP feeds from documented user-agent-only identities, and emits artifacts suitable for WAF rules, reverse proxies, allowlists, deny controls, analytics enrichment, and incident triage.
The project intentionally keeps source definitions in data, not code. Collector behavior lives in scripts/update.py; operator sources live in config/sources.json.
Generated at 2026-05-26T12:01:22Z.
| Metric | Count |
|---|---|
| Services | 43 |
| Healthy sources | 43 |
| Authoritative IP lists | 32 |
| CIDR prefixes | 7,180 |
| IPv4 prefixes | 6,705 |
| IPv6 prefixes | 475 |
| AI crawler/fetcher prefixes | 1,653 |
| Category | Services |
|---|---|
| AI crawlers | 13 |
| Search crawlers | 9 |
| Monitoring probes | 5 |
| Social previews | 4 |
| Fetchers | 3 |
| SEO crawlers | 3 |
| Ad verification | 2 |
| Security scanners | 2 |
| Archive | 1 |
| Analytics crawlers | 1 |
Tracked services
| Service | Category | Source type | Prefixes |
|---|---|---|---|
| Google common crawlers | search | official_json | 69 |
| Google special crawlers | search | official_json | 46 |
| Google user-triggered fetchers | fetcher | official_json | 223 |
| Bingbot | search | official_json | 28 |
| DuckDuckBot | search | official_json | 334 |
| DuckAssistBot | ai | official_json | 334 |
| Applebot | search | official_json | 12 |
| MojeekBot | search | official_json | 1 |
| Naver Yeti | search | official_json | 36 |
| YandexBot | search | known_static | 13 |
| Baiduspider | search | known_static | 2 |
| GPTBot | ai | official_json | 17 |
| OAI-SearchBot | ai | official_json | 32 |
| ChatGPT-User | ai | official_json | 214 |
| OAI-AdsBot | ai | documented_user_agent | 0 |
| PerplexityBot | ai | official_json | 8 |
| Perplexity-User | ai | official_json | 4 |
| ClaudeBot / Claude-SearchBot | ai | documented_user_agent | 0 |
| Amazonbot | ai | official_embedded_json | 524 |
| Amzn-SearchBot | ai | official_embedded_json | 512 |
| Amzn-User | fetcher | official_embedded_json | 1,023 |
| Meta-ExternalAgent / Meta-WebIndexer | ai | known_static | 4 |
| Bytespider | ai | documented_user_agent | 0 |
| MistralAI-User | ai | official_json | 4 |
| AhrefsBot | seo | official_json | 51 |
| Lumar crawler | seo | official_json | 66 |
| SemrushBot | seo | documented_user_agent | 0 |
| Censys scanners | security-scanner | known_static | 2 |
| Shodan scanners | security-scanner | known_static | 9 |
| Datadog Synthetics | monitoring | official_json | 113 |
| IAS crawler | ad-verification | official_json | 14 |
| TTD-Content crawler | ad-verification | official_text | 2,615 |
| UptimeRobot | monitoring | official_text | 217 |
| Pingdom probes | monitoring | official_text | 158 |
| StatusCake probes | monitoring | official_json | 296 |
| Better Stack probes | monitoring | official_text | 34 |
| Common Crawl CCBot | archive | official_json | 6 |
| Flipboard crawler | social | official_text | 136 |
| Parse.ly crawler | analytics | official_json | 10 |
| Pinterestbot | social | documented_user_agent | 0 |
| LinkedInBot | social | documented_user_agent | 0 |
| Telegram link preview | social | official_text | 11 |
| RSS API feed parser | fetcher | official_text | 2 |
CrawlerScope runs as a scheduled GitHub Actions collector and publishes static artifacts.
flowchart LR
A["config/sources.json"] --> B["scripts/update.py"]
B --> C["Fetch operator sources"]
C --> D["Normalize and collapse CIDR prefixes"]
D --> E["data/current/crawlers.json"]
D --> F["data/current/robots-ai.txt"]
D --> G["data/current/nginx-ai-map.conf"]
D --> H["data/snapshots/*.json"]
E --> I["Static dashboard"]
H --> J["GitHub Release artifacts"]
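The robots-ai.txt branch of the pipeline above can be approximated with a few lines of Python. This is a minimal sketch, not the logic in scripts/update.py: the function name `build_robots_block` is hypothetical, and it assumes each entry in `userAgentPatterns` is a plain robots.txt product token rather than a regular expression.

```python
def build_robots_block(services, category="ai"):
    """Return a robots.txt group that disallows every user-agent token in `category`.

    Hypothetical helper: assumes userAgentPatterns holds plain product tokens.
    """
    tokens = []
    for service in services:
        if service.get("category") != category:
            continue
        for token in service.get("userAgentPatterns", []):
            if token not in tokens:  # keep first-seen order, drop duplicates
                tokens.append(token)
    if not tokens:
        return ""
    # One group: several User-agent lines followed by a single shared rule.
    lines = [f"User-agent: {token}" for token in tokens]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"
```

Grouping all tokens under one `Disallow: /` rule keeps the block short; emitting one group per service would work equally well if per-bot rules are needed later.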
Source types:
| Type | Meaning |
|---|---|
| official_json | Operator-published machine-readable JSON feed |
| official_text | Operator-published plain-text CIDR/IP feed |
| official_embedded_json | Operator page with machine-readable ranges embedded in HTML |
| documented_user_agent | Documented bot identity without a stable public IP list |
| known_static | Useful static seed list, not treated as complete authority |
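Consumers that only see a source type can derive a rough authority hint from it. The sketch below assumes that the three official_* types are the ones treated as authoritative IP feeds, which is consistent with the summary counts (32 authoritative lists out of 43 services) but is not confirmed by the collector code. `AUTHORITATIVE_TYPES` and `is_authoritative` are hypothetical names; when a record carries ipListAuthoritative, prefer that field.

```python
# Assumption: only operator-published feeds count as authoritative IP lists.
AUTHORITATIVE_TYPES = frozenset(
    {"official_json", "official_text", "official_embedded_json"}
)


def is_authoritative(source_type):
    """Return True when the source type denotes an operator-published IP feed."""
    return source_type in AUTHORITATIVE_TYPES


def count_authoritative(services):
    """Count service records whose sourceType is an operator-published feed."""
    return sum(1 for s in services if is_authoritative(s.get("sourceType", "")))
```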
- Operator-published source collection with source health tracking.
- IPv4/IPv6 normalization, CIDR coercion, and prefix collapsing.
- Static dashboard with category, operator, source, service, and search filters.
- Filtered exports for JSON, CSV, CIDR lists, robots.txt, and Nginx user-agent maps.
- Snapshot retention and historical summary tracking.
- GitHub Pages publication and automatic dataset releases.
- Config-driven source inventory in config/sources.json.
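The normalization, CIDR coercion, and prefix collapsing listed above map closely onto Python's standard ipaddress module. The following is a minimal sketch of that technique, not the code in scripts/update.py; `normalize_prefixes` is a hypothetical name, and silently skipping unparseable entries is an assumption about the desired behavior.

```python
import ipaddress


def normalize_prefixes(raw_entries):
    """Coerce raw IP/CIDR strings to networks, then collapse them per address family."""
    v4, v6 = [], []
    for entry in raw_entries:
        text = entry.strip()
        if not text:
            continue
        try:
            # strict=False masks host bits (192.0.2.7/24 becomes 192.0.2.0/24);
            # a bare address becomes a /32 (IPv4) or /128 (IPv6) network.
            network = ipaddress.ip_network(text, strict=False)
        except ValueError:
            continue  # assumption: ignore anything that is not an IP or CIDR
        (v4 if network.version == 4 else v6).append(network)
    # collapse_addresses merges adjacent and overlapping networks; it requires
    # a single address family per call, hence the split lists.
    return {
        "ipv4": [str(n) for n in ipaddress.collapse_addresses(v4)],
        "ipv6": [str(n) for n in ipaddress.collapse_addresses(v6)],
    }
```

For example, two adjacent /25 networks in the input come back as the single /24 that covers them, which keeps generated allowlists and deny rules compact.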
Run the collector and serve the dashboard locally:
python3 scripts/update.py
python3 -m http.server 8080
Open:
http://127.0.0.1:8080/public/
When serving from public/, the app reads data from ../data/current. For GitHub Pages deployment, the workflow copies public/ and data/ into the Pages artifact.
CrawlerScope has no runtime dependency outside the Python standard library for data collection.
git clone https://github.com/ipanalytics/CrawlerScope.git
cd CrawlerScope
python3 scripts/update.py
Optional environment controls:
export CRAWLER_SCOPE_USER_AGENT="CrawlerScope/0.1 (+https://example.org/contact)"
export CRAWLER_SCOPE_SNAPSHOT_RETENTION=168
export CRAWLER_SCOPE_HISTORY_RETENTION=720
python3 scripts/update.py
Export all current CIDRs:
jq -r '.services[].prefixes | .ipv4[], .ipv6[]' data/current/crawlers.json
Export AI crawler CIDRs:
jq -r '.services[] | select(.category == "ai") | .prefixes | .ipv4[], .ipv6[]' data/current/crawlers.json
List sources that are documented but do not publish IP ranges:
jq -r '.services[] | select(.sourceType == "documented_user_agent") | [.id, .service, .sourceUrl] | @tsv' data/current/crawlers.json
Install the generated Nginx map as an include and validate the configuration:
cp data/current/nginx-ai-map.conf /etc/nginx/conf.d/crawler-scope-ai-map.conf
nginx -t
| Path | Description |
|---|---|
| data/current/crawlers.json | Full normalized dataset |
| data/current/robots-ai.txt | Generated AI crawler robots.txt block |
| data/current/nginx-ai-map.conf | Nginx map for AI crawler user-agents |
| data/history/summary.csv | Historical summary rows |
| data/snapshots/*.json | Timestamped dataset snapshots |
| config/sources.json | Source inventory and classification config |
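Where jq is not available, the same CIDR exports can be approximated in Python using only the standard library. This sketch relies on the top-level `services` array and the `prefixes.ipv4` / `prefixes.ipv6` fields visible in the jq filters; `load_dataset` and `export_prefixes` are hypothetical helper names.

```python
import json


def load_dataset(path="data/current/crawlers.json"):
    """Read the normalized dataset from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def export_prefixes(dataset, category=None):
    """Return each service's IPv4 then IPv6 prefixes, optionally for one category."""
    prefixes = []
    for service in dataset.get("services", []):
        if category is not None and service.get("category") != category:
            continue
        ranges = service.get("prefixes", {})
        prefixes.extend(ranges.get("ipv4", []))
        prefixes.extend(ranges.get("ipv6", []))
    return prefixes
```

Calling `export_prefixes(load_dataset(), category="ai")` from the repository root yields the AI crawler CIDRs as a list, ready to be joined with newlines or fed into another tool.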
Each service record includes source metadata, user-agent patterns, reverse-DNS hints, health status, prefix counts, and split IPv4/IPv6 arrays.
{
  "id": "openai-gptbot",
  "service": "GPTBot",
  "operator": "OpenAI",
  "category": "ai",
  "sourceType": "official_json",
  "sourceOk": true,
  "ipListAuthoritative": true,
  "userAgentPatterns": ["GPTBot"],
  "counts": {
    "prefixes": 17,
    "ipv4": 17,
    "ipv6": 0
  },
  "prefixes": {
    "ipv4": ["20.42.10.176/28"],
    "ipv6": []
  }
}
- Treat sourceOk=false as a collection failure for that run. The collector falls back to the previous cached prefixes when available.
- IP ranges identify published infrastructure, not intent. Use user-agent, reverse DNS, request behavior, and application context where enforcement risk matters.
- Static and documented-only sources are included because they are operationally useful, but they stay distinguishable from authoritative IP lists through the ipListAuthoritative flag.
- Release artifacts are generated by GitHub Actions after collection and attached to timestamped dataset releases.
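For request classification, a client IP can be checked against the per-service prefix arrays. The sketch below is one way to do that with the standard library; `match_services` is a hypothetical name, and the `authoritative_only` switch is an assumption about how a consumer might use ipListAuthoritative. In line with the notes above, a match says only that the address falls inside published infrastructure and should be combined with user-agent or reverse-DNS checks before enforcement.

```python
import ipaddress


def match_services(ip, services, authoritative_only=True):
    """Return the ids of services whose published prefixes contain `ip`."""
    address = ipaddress.ip_address(ip)
    family = "ipv4" if address.version == 4 else "ipv6"
    matches = []
    for service in services:
        # Optionally ignore static seed lists and user-agent-only identities.
        if authoritative_only and not service.get("ipListAuthoritative", False):
            continue
        for prefix in service.get("prefixes", {}).get(family, []):
            if address in ipaddress.ip_network(prefix):
                matches.append(service["id"])
                break  # one hit per service is enough
    return matches
```

A linear scan is adequate for a few thousand prefixes; a radix tree or sorted interval lookup would be a better fit for high-volume log enrichment.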
CrawlerScope tracks public crawler, fetcher, monitoring, scanner, analytics, and preview-bot infrastructure that is useful for request classification and network policy. It prioritizes primary operator-published sources. Aggregator repositories may be reviewed for discovery, but their URLs are not used as dataset sources.
- WAF allow/deny policy design for crawler traffic.
- Search and AI crawler visibility audits.
- Security logging enrichment and bot attribution.
- Monitoring probe allowlisting.
- Fraud/risk triage for automated traffic.
- Change tracking for published crawler infrastructure.
- Some operators publish user-agent documentation but no stable IP feed.
- Cloud-hosted crawlers may share network space with unrelated workloads.
- CIDR lists can change without notice; scheduled collection shortens the window in which the dataset is stale but does not eliminate it.
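Because published ranges can change between runs, comparing two datasets is a quick way to see what moved. This is a minimal sketch that assumes snapshot files share the schema of data/current/crawlers.json (a `services` array with `id` and `prefixes`); that assumption and the name `diff_prefixes` are not taken from the project.

```python
def diff_prefixes(old_dataset, new_dataset):
    """Return per-service prefixes added and removed between two datasets."""

    def index(dataset):
        # Map service id -> set of all IPv4 and IPv6 prefixes.
        table = {}
        for service in dataset.get("services", []):
            ranges = service.get("prefixes", {})
            table[service["id"]] = set(ranges.get("ipv4", [])) | set(ranges.get("ipv6", []))
        return table

    old, new = index(old_dataset), index(new_dataset)
    changes = {}
    for service_id in sorted(old.keys() | new.keys()):
        before = old.get(service_id, set())
        after = new.get(service_id, set())
        added, removed = sorted(after - before), sorted(before - after)
        if added or removed:
            changes[service_id] = {"added": added, "removed": removed}
    return changes
```

Services that appear in only one of the two datasets show up with all of their prefixes listed as added or removed, which makes new and retired sources easy to spot.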
.
├── config/
│ └── sources.json
├── data/
│ ├── current/
│ ├── history/
│ └── snapshots/
├── public/
│ ├── assets/
│ └── index.html
├── scripts/
│ └── update.py
└── .github/
    └── workflows/
The included workflow runs every six hours and can be triggered manually:
on:
  schedule:
    - cron: "23 */6 * * *"
  workflow_dispatch:
The workflow:
- Runs scripts/update.py.
- Commits updated data/ and config/ changes.
- Publishes a timestamped GitHub Release with dataset artifacts.
- Deploys the static dashboard to GitHub Pages.
CrawlerScope is released under the MIT License.
CrawlerScope publishes normalized data from public operator sources. Review upstream terms and validate enforcement logic before using the dataset in production controls.