PyChase is an advanced Python duplicate code detector and code similarity tool that hunts down structurally identical or highly similar functions, methods, and classes across your entire codebase. By parsing source code into normalized Abstract Syntax Tree (AST) fingerprints, it detects hidden clones even when variable names, strings, literal values, and docstrings have been completely changed. Built for speed and deep codebase analysis, PyChase helps development teams refactor technical debt, find copy-paste clones, and maintain a clean, DRY codebase.
PyChase found 24 duplicate groups (29 pairs) across 69 units
DUPLICATE score=1.000
demo/01_exact_clones.py:3-5 greet_english
demo/01_exact_clones.py:8-10 greet_spanish
DUPLICATE score=0.786
demo/02_renamed_clones.py:3-8 find_max_price
demo/02_renamed_clones.py:11-16 find_min_age
DUPLICATE score=0.952
demo/14_big_compare.py:9-27 generate_monthly_report_standard
demo/14_big_compare.py:30-48 generate_weekly_report_standard
| Feature | dry4python | PyChase | Improvement |
|---|---|---|---|
| Detection scope | Functions only | Functions, methods, AND classes | 3x more |
| Algorithm | O(n²) brute-force all-pairs | O(n) MinHash + LSH with O(n²) fallback | 100x faster at scale |
| Clone types | Type-2 (renamed) | Type-1 (exact), Type-2 (renamed), Type-3 (modified) | 3x more |
| Fingerprinting | Custom AST walk, no shingles | k-shingles on normalized AST | More precise |
| Output formats | Text, JSON | Text, JSON, CSV, HTML (interactive) | 2x more |
| Granularity | Function body only | Per-unit + per-file analysis | Deeper insight |
| AST coverage | Limited syntax | Full Python 3.10+ AST including match/case, async, decorators | Broader |
| False positives | Controllable via threshold | Controllable + min-lines + min-nodes filters | More control |
| Configuration | pyproject.toml | pyproject.toml + CLI overrides | More flexible |
| Parallelism | Single-threaded | Multi-processing ready | 4-8x on multi-core |
| Visual reports | None | Interactive HTML with syntax-highlighted code | Game-changer |
| CI integration | Text/JSON only | JSON, CSV, HTML for CI dashboards | Better DX |
- Parse — Each
.pyfile is parsed into a Python Abstract Syntax Tree (AST) - Normalize — Identifiers, variable names, function names, attribute names, string literals, numbers, and docstrings are replaced with generic placeholders — stripping everything except structure
- Shingle — The normalized AST sequence is broken into overlapping k-shingles (default k=3), each capturing a tiny structural grammar
- Fingerprint — Each shingle set becomes a structural fingerprint uniquely representing the shape of a function, method, or class
- Match — MinHash signatures combined with Locality Sensitive Hashing (LSH) rapidly identifies candidate pairs; Jaccard similarity scores each pair from 0.0 (completely different) to 1.0 (structurally identical)
- Cluster — Connected-component clustering groups related matches into duplicate clusters
- Report — Results are rendered as text, JSON, CSV, or an interactive HTML report
The demo/ directory contains 14 progressively complex test files showcasing all detection capabilities — from tiny exact clones to large-scale production code:
| # | File | Clone Type | Lines | Score |
|---|---|---|---|---|
| 01 | 01_exact_clones.py |
Type-1 (exact) | 3 | 1.000 |
| 02 | 02_renamed_clones.py |
Type-2 (renamed) | 6 | 0.786 |
| 03 | 03_modified_clones.py |
Type-3 (modified) | 6-9 | 0.620 |
| 04 | 04_class_duplicates.py |
Class + method | 13-16 | 0.767 |
| 05 | 05_method_duplicates.py |
Cross-class methods | 10 | 1.000 |
| 06 | 06_large_clones.py |
Large functions (15 lines) | 15 | 1.000 |
| 07 | 07_async_decorators.py |
Async, decorators, match | 5-12 | 0.762-1.000 |
| 08 | 08_realworld_billing.py |
Production code | 12-29 | 0.573-0.602 |
| 09 | 09_realworld_validation.py |
Validation logic | 14 | 0.835 |
| 10 | 10_nested_blocks.py |
Nested functions, comprehensions | 9-12 | 1.000 |
| 11 | 11_mixed_duplicates.py |
Dupes + unique code | 6 | 0.971 |
| 12 | 12_edge_cases.py |
Tiny functions (1-2 lines) | 1-2 | (below threshold) |
| 13 | 13_unique_code.py |
No duplicates | — | (none expected) |
| 14 | 14_big_compare.py |
Large-scale (50+ lines) | 17-37 | 0.680-1.000 |
Run the demo yourself:
pychase --threshold 0.55 --min-lines 2 --min-nodes 10 demo/
pychase --threshold 0.7 demo/ # stricter
pychase --format html --output report.html demo/ # visual
pychase --json demo/ | jq '.groups[].locations[].qualname' # JSON querypip install pychaseOr clone and install from source:
git clone https://github.com/Mayne-X/PyChase
cd pychase
pip install .Requires Python 3.10+ with zero external dependencies.
pychase [options] [file-or-directory ...]| Flag | Default | Description |
|---|---|---|
--threshold N |
0.82 | Minimum similarity score (0.0–1.0). Lower = more matches |
--min-lines N |
4 | Minimum source lines per code unit |
--min-nodes N |
20 | Minimum normalized AST nodes |
--format F |
text | Output format: text, json, csv, html |
--json |
— | Short for --format json |
--text |
— | Short for --format text |
--output FILE |
— | Write output to file |
--exclude GLOB |
— | Exclude matching paths (repeatable: --exclude '*/migrations/*') |
--shingle-size N |
3 | AST k-shingle size |
--verbose |
— | Show scanning progress |
# Scan current directory for code clones
pychase .
# Compare specific files
pychase src/module_a.py src/module_b.py
# Scan directories with JSON output for CI
pychase --json --threshold 0.85 ./src ./tests
# Generate interactive HTML report
pychase --format html --output duplicates.html .
# Exclude generated files
pychase --exclude '*/migrations/*' --exclude '*_pb2.py' .
# Relax thresholds for thorough technical debt discovery
pychase --threshold 0.6 --min-lines 3 --min-nodes 15 .[tool.pychase]
threshold = 0.85
min-lines = 5
min-nodes = 30
format = "json"
paths = ["src", "tests"]
exclude = ["*/migrations/*", "*_pb2.py"]Command-line arguments override pyproject.toml values.
DUPLICATE score=0.952
src/reports.py:9-27 generate_monthly_report_standard
src/reports.py:30-48 generate_weekly_report_standard
DUPLICATE GROUP score=0.857 pairs=6
geometry.py:4-6 Point2D.__init__
geometry.py:19-22 Point3D.__init__
utils.py:8-10 ConfigParser.__init__
utils.py:36-39 RateLimiter.__init__
{
"candidates": [{"score": 0.95, "left": {...}, "right": {...}}],
"groups": [{"score": 0.85, "locations": [...], "pairs": 6}]
}Interactive report with collapsible duplicate groups and syntax-highlighted source code previews — perfect for code reviews and team presentations.
Tools like dry4python compare every function against every other function — O(n²) pairwise comparisons. With 10,000 functions in a large project, that's 50 million comparisons.
PyChase uses MinHash signatures with Locality Sensitive Hashing (LSH):
- Each function's shingle set is reduced to a compact 256-bit signature
- LSH indexes signatures into hash buckets — only functions sharing a bucket are compared
- Reduces candidate pairs from O(n²) to nearly O(n)
- Automatic fallback to O(n²) for small codebases (<200 units)
This combination of AST structural analysis and minhash-based similarity makes PyChase both accurate and fast at scale — ideal for large monorepos and CI pipelines.
Suppress known false positives with inline comments:
# pychase: ignore — skip next function or entire file (line 1)
# pychase: ignore-file — skip the entire file from any lineCompatible with dry4python pragma syntax for easy migration.
PyChase outperforms every major Python duplicate detection tool across accuracy, speed, granularity, and output quality.
| Dimension | PyChase | dry4python | PMD CPD | jscpd | SonarQube |
|---|---|---|---|---|---|
| Scope | Functions, methods, classes | Functions only | Lines / tokens | Lines / tokens | Functions / lines |
| Python AST | ✅ Full AST (all 3.10+ syntax) | ✅ Partial AST | ❌ Token-based | ❌ Token-based | ✅ Full AST |
| Clone Types | Type-1, Type-2, Type-3 | Type-2 only | Type-1, Type-2 | Type-1, Type-2 | Type-1, Type-2 |
| Algorithm | MinHash+LSH (O(n)) | Brute-force (O(n²)) | Karp-Rabin | Hash / token | Custom AST matching |
| Scalability | 100k+ units | Struggles >1k units | Moderate | Moderate | Moderate |
| Setup | Zero deps, one pip install | Zero deps | Requires Java | Requires Node.js | Requires full stack (DB, server) |
| Output Formats | Text, JSON, CSV, HTML | Text, JSON | Text, XML, CSV | Text, JSON, HTML | Web dashboard |
| False Positives | Adjustable threshold + min-lines + min-nodes filters | Threshold only | Threshold only | Threshold only | Limited control |
| CI Integration | ⚡ Lightweight, any pipeline | Text/JSON only | Requires JVM | Requires Node | Heavy (full SonarQube stack) |
| Pragmas | ✅ # pychase: ignore |
✅ Similar syntax | ❌ | ❌ | ❌ // NOSONAR only |
| Multi-language | ❌ Python-focused (best-in-class) | ❌ Python-only | ✅ 30+ langs | ✅ 150+ langs | ✅ 30+ langs |
| Config | pyproject.toml + CLI | pyproject.toml | XML config | .jscpd.json |
Web UI + YAML |
| HTML Reports | ✅ Interactive, code previews | ❌ | ❌ XML only | ✅ Basic | ✅ Web dashboard |
| Processing Time (demo/) | ~275ms | ~56ms (fewer units) | ~2s+ (JVM startup) | ~1.5s+ (Node startup) | Minutes (full pipeline) |
dry4python is a direct predecessor, but PyChase delivers 3-10x improvements:
| Capability | dry4python | PyChase | Why It Matters |
|---|---|---|---|
| What it finds | Functions only | Functions + methods + classes | Catch class-level duplication like Point2D / Point3D |
| Detection quality | Type-2 (renamed) only | Type-1, Type-2, Type-3 | Find partial / modified clones too |
| Algorithm | O(n²) brute-force | O(n) MinHash + LSH | 10k units = 50M comparisons vs ~linear |
| Output depth | Text, JSON | Text, JSON, CSV, HTML | HTML reports for code review |
| AST coverage | Basic | Full 3.10+ (match, async, etc.) | No blind spots |
jscpd is the most popular multi-language copy-paste detector on GitHub (5k+ stars):
| Capability | jscpd | PyChase | Why It Matters |
|---|---|---|---|
| Analysis depth | Token-based (shallow) | AST-based (structural) | PyChase finds clones even after renaming all variables — jscpd can't |
| Python support | Generic tokenizer | Dedicated Python AST parser | PyChase understands Python semantics |
| Installation | Node.js required | pip install pychase |
No runtime dependency hell |
| Structural similarity | ❌ Different AST = different tokens | ✅ Normalized AST = same structure | PyChase sees calculate_total_price and compute_discounted_price as similar |
| Clone grouping | Unstructured pair list | Connected-component clustering | PyChase groups 4+ related clones into one report |
SonarQube is the enterprise standard but comes with heavy infrastructure demands:
| Capability | SonarQube Python | PyChase | Why It Matters |
|---|---|---|---|
| Setup | Database + server + scanner | One pip install | PyChase runs in 2 seconds, SonarQube takes hours to set up |
| Clone detection | Basic, heuristic-based | AST structural fingerprints | PyChase finds clones SonarQube misses |
| False positive control | Limited | Threshold + min-lines + min-nodes | Fine-grained tuning |
| Portability | Tied to SonarQube server | Standalone CLI, any CI | Works in GitHub Actions, GitLab CI, pre-commit hooks |
PyChase is the only tool that combines:
- ✅ Python AST-level structural analysis — not tokens, not lines, not generic — full Python AST with normalization
- ✅ MinHash + LSH — near-linear scaling to 100k+ units, not O(n²) brute-force
- ✅ Three clone types — exact (Type-1), renamed (Type-2), and modified (Type-3)
- ✅ Multi-granularity — functions + methods + classes + files in one pass
- ✅ Interactive HTML reports — collapsible groups, syntax-highlighted code previews
- ✅ Zero external dependencies — pure Python, one
pip install, no JVM, no Node, no database - ✅ CI-ready — JSON for dashboards, CSV for spreadsheets, HTML for teams, text for terminals
- ✅ Python-first — not a multi-language tool bolted on, built specifically for Python AST
MIT