nameistzzhang/phd_hunter

PhD Hunter 🎓

PhD Advisor Application Assistant - Automate CS professor information collection, intelligent matching analysis, and cold email generation


📖 Full documentation: https://nameistzzhang.github.io/phd_hunter/

✨ Features

Data Collection

  • 📊 CSRankings Crawler - Automatically fetch CS professor rankings and lists
  • 📚 OpenAlex Paper Fetching - Fetch papers via institution + author matching (primary source)
  • 🔗 arXiv Abstract Enrichment - Supplement OpenAlex abstracts with accurate arXiv data
  • 🏠 Homepage Scraping - Scrape professor homepages and generate AI summaries
  • 💾 SQLite Storage - All data persisted locally

AI Analysis

  • 🤖 Professor Matching Scoring - LLM-based direction match (1-5) and admission difficulty (1-5)
  • 💬 Intelligent Chat Analysis - One-click professor analysis report + cold email draft
  • 🎯 Personalized Cold Emails - Customized emails based on your Profile (CV/PS/papers)

Web Frontend

  • 🌐 Modern SPA Interface - Flask-based interactive single-page application
  • 🏷️ Priority Management - Reach / Match / Target / Safety / Not Considered
  • 🔍 Multi-dimensional Filtering - By priority, research area, university, score
  • 👤 Profile Management - Upload CV/PS, manage arXiv papers, set research preferences
  • ⚙️ LLM Configuration - Configure API Key, model, temperature, iterations

🚀 Quick Start

Requirements

  • Python 3.10+
  • uv (recommended) or pip
  • Chrome/Chromium browser (for Selenium homepage scraping)

Installation

# 1. Clone the repository
git clone <repository-url>
cd phd_hunter

# 2. Install dependencies
# Using uv (recommended):
uv sync
# Or using pip:
pip install -e .
# Or using uv pip:
uv pip install -e .

# 3. Install api_infra (REQUIRED for LLM features)
cd src/phd_hunter/api_infra
pip install -e .
cd ../../..

⚠️ Required Configuration

Create the config files before running the application. hound_config.json is required for the AI features; hunt_config.json is optional.

# 1. Configure LLM parameters (REQUIRED for AI features)
cp src/phd_hunter/frontend/hound_config.example.json src/phd_hunter/frontend/hound_config.json
# Edit hound_config.json and fill in your API key and model settings

# 2. Configure crawl parameters (optional)
cp src/phd_hunter/frontend/hunt_config.example.json src/phd_hunter/frontend/hunt_config.json

hound_config.json example:

{
  "api_key": "your-api-key-here",
  "model": "deepseek-v3.2",
  "provider": "yunwu",
  "url": "https://yunwu.ai/v1",
  "temperature": 0.6,
  "max_tokens": 800,
  "scoring_iterations": 3,
  "nickname": "YourName"
}

Note: Without hound_config.json, the Analyzer (chat), Scorer (matching score), and Homepage Crawler will not work. You can still browse professor data and manage priorities without it.
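
A minimal sketch of how the config file could be checked before the LLM features are used. The key names come from the example above; the helper name `load_hound_config` and the choice of which keys count as required are assumptions for illustration, not the project's actual loader.

```python
import json
from pathlib import Path

# Keys taken from the hound_config.json example above.
# Treating these four as mandatory is an assumption.
REQUIRED_KEYS = ("api_key", "model", "provider", "url")
PLACEHOLDER_KEY = "your-api-key-here"


def load_hound_config(path):
    """Hypothetical helper: return the config dict, or None if the file is missing."""
    config_path = Path(path)
    if not config_path.exists():
        # Without the file, only the LLM-backed features are unavailable.
        return None
    config = json.loads(config_path.read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"hound_config.json is missing: {', '.join(missing)}")
    if config["api_key"] == PLACEHOLDER_KEY:
        raise ValueError("api_key still holds the example placeholder")
    return config
```

Calling `load_hound_config("src/phd_hunter/frontend/hound_config.json")` right after copying the example file would raise, because the placeholder key has not been replaced yet.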

Data Collection (CLI Mode)

# 1. Crawl professor data
python main.py crawl --area ai --region world --max-professors 5

# 2. Fetch papers
python main.py fetch-papers --max-papers 10

# 3. Scrape professor homepages (requires LLM config)
python -m phd_hunter.crawlers.homepage_crawler

# 4. Run matching score (requires LLM config)
python -m phd_hunter.hound.scorer

# 5. View statistics
python main.py stats

Start Web Interface

# Start Flask Web Server (default http://localhost:8080)

# Linux / macOS:
PYTHONPATH=src python -m phd_hunter.frontend.app

# Windows (Command Prompt):
set "PYTHONPATH=src" && python -m phd_hunter.frontend.app

# Windows (PowerShell):
$env:PYTHONPATH="src"; python -m phd_hunter.frontend.app

Then open http://localhost:8080 in your browser:

  • Hunt page: Browse professor cards, filter, sort, mark priorities
  • Chat page: Click a professor to start an AI conversation with an auto-generated analysis and cold email draft
  • Profile page: Upload CV/PS, add arXiv papers, set research preferences

📁 Project Structure

phd_hunter/
├── main.py                       # CLI entry
├── pyproject.toml                # Project config
├── README.md                     # This file
├── docs/                         # Sphinx documentation
├── tests/                        # Test files
└── src/phd_hunter/
    ├── __init__.py               # Package init
    ├── models.py                 # Pydantic data models
    ├── database.py               # SQLite database operations
    ├── api_infra/                # LLM API infrastructure
    │   ├── __init__.py
    │   └── core/
    │       └── client.py         # Unified LLM client
    ├── crawlers/
    │   ├── __init__.py           # Export crawlers
    │   ├── base.py               # Crawler base class (with caching)
    │   ├── csrankings.py         # CSRankings crawler (Selenium)
    │   ├── openalex_crawler.py   # OpenAlex crawler (primary paper source)
    │   ├── arxiv_crawler.py      # arXiv crawler (abstract enrichment + manual add)
    │   └── homepage_crawler.py   # Homepage scraper + AI summary
    ├── hound/
    │   ├── __init__.py
    │   ├── scorer.py             # Professor matching scorer
    │   └── scorer_daemon.py      # Background auto-scoring daemon
    ├── analyzer/
    │   ├── __init__.py           # Export analyze_professor, chat_with_professor
    │   ├── analyzer.py           # Professor analysis + cold email core
    │   └── prompts.py            # Analyzer prompt templates
    ├── utils/
    │   ├── logger.py             # Logging config
    │   ├── helpers.py            # Utility functions
    │   └── pdf_extract.py        # PDF text extraction + Profile builder
    └── frontend/                 # Web frontend
        ├── app.py                # Flask API server
        ├── index.html            # Main page
        ├── hound_config.json     # LLM config (create from example!)
        ├── hunt_config.json      # Crawl config (create from example!)
        ├── static/
        │   ├── styles.css        # Stylesheet
        │   ├── app.js            # Frontend logic
        │   └── windsurf.svg      # AI avatar icon
        └── templates/            # HTML templates

🗄️ Database Schema

SQLite database with core tables:

professors table

  • Basic info: name, university, rank, department, email, homepage
  • Research interests, priority (integer from -1 to 3)
  • AI analysis: homepage_summary, direction_match_score, admission_difficulty_score
  • Chat history: messages (JSON)

papers table

  • Paper metadata (title, authors, abstract, year, venue)
  • arXiv ID, PDF link, citation count
  • Linked to professor record

applicant_profile table

  • User Profile: CV text, PS text
  • Research preferences, arXiv paper list
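
To make the table descriptions concrete, here is a rough sketch of the professors and papers tables using Python's built-in sqlite3 module. The column names follow the field lists above, but the exact names, types, and constraints are assumptions; the real definitions live in database.py and may differ. The inserted rows are placeholders.

```python
import json
import sqlite3

# Assumed schema: column names mirror the README field lists,
# types and constraints are illustrative guesses.
SCHEMA = """
CREATE TABLE professors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    university TEXT,
    "rank" INTEGER,
    department TEXT,
    email TEXT,
    homepage TEXT,
    research_interests TEXT,
    priority INTEGER CHECK (priority BETWEEN -1 AND 3),
    homepage_summary TEXT,
    direction_match_score REAL,
    admission_difficulty_score REAL,
    messages TEXT DEFAULT '[]'          -- chat history stored as JSON text
);
CREATE TABLE papers (
    id INTEGER PRIMARY KEY,
    professor_id INTEGER NOT NULL REFERENCES professors(id),
    title TEXT NOT NULL,
    authors TEXT,
    abstract TEXT,
    year INTEGER,
    venue TEXT,
    arxiv_id TEXT,
    pdf_url TEXT,
    citation_count INTEGER
);
"""

conn = sqlite3.connect(":memory:")  # throwaway database for the sketch
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO professors (name, university, priority) VALUES (?, ?, ?)",
    ("Ada Example", "Example University", 1),  # placeholder data
)
conn.execute(
    "INSERT INTO papers (professor_id, title, year) VALUES (?, ?, ?)",
    (1, "A Sample Paper", 2024),
)

# Count papers per professor through the professor_id link.
row = conn.execute(
    "SELECT p.name, COUNT(pa.id) FROM professors p "
    "LEFT JOIN papers pa ON pa.professor_id = p.id GROUP BY p.id"
).fetchone()
chat_history = json.loads(conn.execute("SELECT messages FROM professors").fetchone()[0])
```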

🔧 Core Modules

Analyzer - Professor Analysis & Cold Email

Based on your Profile and professor data, auto-generates:

  1. Professor research direction analysis
  2. Matching points between you and the professor
  3. Cold email writing guidelines
  4. Complete cold email draft

Supports multi-round conversation to refine the draft.

Scorer - Matching Score

Uses LLM to score each professor:

  • Direction Match (1-5): How closely the professor's research direction matches yours
  • Admission Difficulty (1-5): Estimated difficulty of being admitted to the professor's group
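
The config exposes a `scoring_iterations` setting, but this README does not say how repeated runs are combined. The sketch below assumes the simplest option, clamping each run to the 1-5 range and averaging; both the function names and the averaging rule are assumptions, not the Scorer's documented behavior.

```python
from statistics import mean


def clamp_score(value, low=1, high=5):
    """Keep a raw LLM score inside the documented 1-5 range."""
    return max(low, min(high, value))


def aggregate_scores(iteration_scores):
    """Average (direction_match, admission_difficulty) pairs from several runs.

    Hypothetical helper: averaging is an assumed aggregation rule.
    """
    if not iteration_scores:
        raise ValueError("at least one scoring iteration is required")
    direction = mean(clamp_score(d) for d, _ in iteration_scores)
    difficulty = mean(clamp_score(a) for _, a in iteration_scores)
    return round(direction, 1), round(difficulty, 1)


# Three iterations, matching "scoring_iterations": 3 in the example config.
# The out-of-range 7 in the last run is clamped to 5 before averaging.
scores = aggregate_scores([(4, 3), (5, 3), (4, 7)])
```

Averaging several runs is one way to smooth out run-to-run variation in LLM output at the cost of extra API calls.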

Homepage Crawler - Homepage Scraping

Uses Selenium to scrape professor homepages, then LLM extracts:

  • Research focus
  • Recruiting status
  • Homepage content summary

🌐 Web Interface Guide

1. Configure LLM

Click the ⚙️ settings icon in the top-right corner to configure:

  • API Key
  • Provider / Model
  • URL (custom API endpoint)
  • Temperature / Max Tokens
  • Scoring Iterations

2. Complete Your Profile

Go to the Profile page:

  • Upload CV and PS (PDF format)
  • Add links to arXiv papers that interest you
  • Set research preferences

3. Browse Professors

The Hunt page displays all professor cards:

  • Top bar shows statistics: universities, professors, papers, avg scores
  • Use filter bar to filter by priority / area / university / score
  • Click professor card to view details (papers link to arXiv)

Professor Detail Modal:

  • Rescore: Re-run LLM scoring after editing papers
  • Add Paper: Paste an arXiv URL to manually add a paper
  • Delete Paper: Remove incorrect papers with the × button
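
For the Add Paper action above, the pasted URL has to be reduced to an arXiv identifier at some point. The following is a small sketch of that step, assuming new-style identifiers (such as 1706.03762) in `abs` or `pdf` URLs; `parse_arxiv_id` is a hypothetical helper and old-style identifiers are deliberately not handled.

```python
import re

# New-style arXiv IDs: four digits, a dot, four or five digits, optional version.
ARXIV_ID = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(v\d+)?", re.IGNORECASE)


def parse_arxiv_id(url):
    """Return the version-less arXiv ID found in the URL, or None if there is none."""
    match = ARXIV_ID.search(url)
    if match is None:
        return None
    return match.group(1)


abs_id = parse_arxiv_id("https://arxiv.org/abs/1706.03762")
pdf_id = parse_arxiv_id("https://arxiv.org/pdf/1706.03762v5.pdf")
not_arxiv = parse_arxiv_id("https://example.com/paper.pdf")
```

Dropping the version suffix means the abstract page and a versioned PDF link of the same paper map to the same ID, which helps avoid duplicate entries.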

4. AI Chat Analysis

Click Chat to enter the conversation:

  • On first entry, the professor is analyzed automatically and a cold email draft is generated
  • Continue the conversation to modify or ask questions
  • Each message can be individually deleted

📊 CLI Reference

crawl - Crawl professor information

python main.py crawl --area ai --region world --max-professors 5

Parameters:

  • --area: Research area (default: ai)
  • --region: Region filter (default: world)
  • --max-universities: Max university count (default: all)
  • --max-professors: Max professors per university (default: 5)
  • --no-headless: Show browser window
  • --timeout: Page timeout (seconds, default: 30)

fetch-papers - Fetch papers

python main.py fetch-papers --max-papers 10 --max-professors 50

Parameters:

  • --max-papers: Max papers per professor (default: 10)
  • --max-professors: Max professors to process (default: all)
  • --delay: Request interval (seconds, default: 1.0)

stats - Statistics

python main.py stats
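
The three subcommands and their documented defaults can be sketched with argparse as shown below. Flag names and defaults are taken from the parameter lists above; representing "default: all" as `None` and the argument types are assumptions, and the actual main.py may be organized differently.

```python
import argparse


def build_parser():
    """Assumed CLI layout mirroring the documented subcommands and flags."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    crawl = sub.add_parser("crawl", help="Crawl professor information")
    crawl.add_argument("--area", default="ai")
    crawl.add_argument("--region", default="world")
    crawl.add_argument("--max-universities", type=int, default=None)  # None = all
    crawl.add_argument("--max-professors", type=int, default=5)
    crawl.add_argument("--no-headless", action="store_true")
    crawl.add_argument("--timeout", type=int, default=30)

    fetch = sub.add_parser("fetch-papers", help="Fetch papers")
    fetch.add_argument("--max-papers", type=int, default=10)
    fetch.add_argument("--max-professors", type=int, default=None)  # None = all
    fetch.add_argument("--delay", type=float, default=1.0)

    sub.add_parser("stats", help="Show statistics")
    return parser


crawl_args = build_parser().parse_args(["crawl", "--area", "ai", "--max-professors", "5"])
fetch_args = build_parser().parse_args(["fetch-papers", "--max-papers", "10"])
```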

⚠️ Known Limitations

  1. arXiv vs Non-arXiv Papers: OpenAlex covers all venues, but only papers with an arXiv association get enriched with full abstracts and PDF links. Pure conference/journal papers may have limited metadata.
  2. OpenAlex Institution Matching: Author identification relies on OpenAlex's institution linking. Professors with ambiguous names or recent institution changes may occasionally be misidentified.
  3. LLM Cost: Analyzer, Scorer, and Homepage Paper Extraction all require LLM API calls. Watch your budget.
  4. Homepage Scraping: Some professor homepages have anti-bot mechanisms and may fail. Homepage extraction is best-effort; missing data does not block other features.
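
Limitation 1 can be pictured as a merge step: the OpenAlex record is the base, and arXiv data overrides it only when an arXiv match exists. This is a sketch under assumed field names (`abstract`, `arxiv_id`, `pdf_url`) with a hypothetical `enrich_paper` helper, not the crawler's actual code.

```python
def enrich_paper(openalex_paper, arxiv_paper=None):
    """Return a copy of the OpenAlex record, overlaid with arXiv data when present."""
    enriched = dict(openalex_paper)
    if arxiv_paper is None:
        # No arXiv association: keep whatever metadata OpenAlex provided.
        return enriched
    for field in ("abstract", "arxiv_id", "pdf_url"):
        if arxiv_paper.get(field):
            enriched[field] = arxiv_paper[field]
    return enriched


# Placeholder records for illustration only.
base = {"title": "A Sample Paper", "venue": "Example Conf", "abstract": ""}
with_arxiv = enrich_paper(base, {"abstract": "Full abstract text.", "arxiv_id": "1706.03762"})
without_arxiv = enrich_paper(base)
```

A paper without an arXiv match passes through unchanged, which is why pure conference or journal papers can end up with an empty abstract.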

📖 Documentation

Build docs locally:

cd docs && make html

🧪 Development

Run Tests

uv run pytest tests/ -v

Code Checks

uv run black --check src/
uv run ruff check src/

📄 License

MIT License - see LICENSE file

🙏 Acknowledgements


⭐ Star this repo if it helps you!
