AI-powered resume-to-job-description matching system. Scores resumes against job descriptions using semantic embeddings and NER-based skill extraction, with dual views for candidates (single resume analysis) and recruiters (bulk ranking + filtering).
Live deployment: http://13.51.207.145 (AWS EC2 eu-north-1)
┌─────────────────────────────────────────────────────────────────────┐
│                          Client (Browser)                           │
└───────────────────────────────┬─────────────────────────────────────┘
                                │ HTTP :80
                    ┌───────────▼───────────┐
                    │    nginx (port 80)    │
                    │     reverse proxy     │
                    └──────┬──────────┬─────┘
                           │          │
                 / (SPA)   │          │  /api/*  /health
                           │          │
              ┌────────────▼──┐  ┌────▼───────────────────┐
              │  React +      │  │  FastAPI backend       │
              │  Vite SPA     │  │  (port 8000)           │
              │  (port 3000)  │  │                        │
              │               │  │  ┌──────────────────┐  │
              │  ┌──────────┐ │  │  │  ML Pipeline     │  │
              │  │Candidate │ │  │  │                  │  │
              │  │  View    │ │  │  │  NERExtractor    │  │
              │  └──────────┘ │  │  │  (spaCy trf)     │  │
              │  ┌──────────┐ │  │  │       ↓          │  │
              │  │Recruiter │ │  │  │  EmbeddingModel  │  │
              │  │  View    │ │  │  │  (MiniLM-L6)     │  │
              │  └──────────┘ │  │  │       ↓          │  │
              └───────────────┘  │  │  MatchScorer     │  │
                                 │  │  (4-component)   │  │
                                 │  └──────────────────┘  │
                                 │           │            │
                                 │   ┌───────▼──────┐     │
                                 │   │ Redis cache  │     │
                                 │   │ (port 6379)  │     │
                                 │   └──────────────┘     │
                                 │   ┌───────────────┐    │
                                 │   │  PostgreSQL   │    │
                                 │   │  (port 5432)  │    │
                                 │   └───────────────┘    │
                                 └────────────────────────┘
                                              │
                                 ┌────────────▼────────────┐
                                 │         AWS S3          │
                                 │  hirelens-models-*      │  ← fine-tuned weights
                                 │  hirelens-resumes-*     │  ← uploaded PDFs
                                 └─────────────────────────┘
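As a rough illustration of the taxonomy side of the skill-extraction stage, the sketch below matches a tiny skill list against resume and job-description text. The `TAXONOMY` set and the `extract_skills` / `compare_skills` helpers are hypothetical names for illustration only; the project's `NERExtractor` also relies on a spaCy model and the full `data/skills_taxonomy.json`, and may work quite differently.

```python
import re

# Hypothetical mini-taxonomy (assumption); the project ships data/skills_taxonomy.json.
TAXONOMY = {"python", "docker", "aws", "kubernetes", "terraform"}


def extract_skills(text: str, taxonomy: set[str] = TAXONOMY) -> set[str]:
    """Return the taxonomy skills that occur as whole tokens in the text."""
    tokens = set(re.findall(r"[a-z0-9+#]+", text.lower()))
    return taxonomy & tokens


def compare_skills(resume: str, job_description: str) -> tuple[list[str], list[str]]:
    """Split the job's skills into those the resume covers and those it lacks."""
    required = extract_skills(job_description)
    present = extract_skills(resume)
    return sorted(required & present), sorted(required - present)


matched, missing = compare_skills(
    "Built Python services, shipped with Docker on AWS.",
    "Senior DevOps: Python, Docker, AWS, Kubernetes, Terraform.",
)
print(matched)  # ['aws', 'docker', 'python']
print(missing)  # ['kubernetes', 'terraform']
```

Single-token matching like this misses multi-word skills such as "machine learning", which is one reason a taxonomy lookup is usually paired with an NER model.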
CI/CD (GitHub Actions → ghcr.io → EC2)
──────────────────────────────────────
push to main
│
├── Job 1: Backend lint (black + flake8) + pytest (30 tests)
├── Job 2: Frontend lint (ESLint) + Vite production build
├── Job 3: Build & push Docker images → ghcr.io
└── Job 4: SSH deploy to EC2 → docker compose pull + rolling restart
| Metric | Value | Target | Status |
|---|---|---|---|
| NDCG@10 | 0.962 | 0.80 | ✓ |
| Precision@1 | 0.896 | — | — |
| AUC-ROC | 0.837 | 0.80 | ✓ |
| MRR | 0.948 | — | — |
| NER F1 (skill extraction) | 0.771 | 0.88 | ✗ |
| Scoring latency (GPU, single) | 165 ms | <500 ms | ✓ |
| Bulk 50 resumes | 8.1 s (162 ms/resume) | — | — |
| Model size | 608 MB | — | — |
| Test suite | 30/30 passing | — | ✓ |
Evaluation on 712 held-out resume–job pairs. Latency measured on GTX 1650; AWS t3.micro (CPU-only) is ~3–5× slower.
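For readers unfamiliar with the ranking metrics in the table, here is a minimal sketch of NDCG@k and MRR over binary relevance labels. It assumes linear gain and a log2 discount; the function names are illustrative and the project's `src/evaluation/metrics.py` may use a different formulation.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(relevances: list[float]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(ranked_lists: list[list[float]]) -> float:
    """Average reciprocal rank across queries (here: job descriptions)."""
    return sum(reciprocal_rank(r) for r in ranked_lists) / len(ranked_lists)


# Each list holds the true labels of resumes in the order the model ranked them.
print(round(ndcg_at_k([1, 0, 1]), 3))                    # 0.92
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))      # 0.75
```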
| Dataset | Rows | Used For |
|---|---|---|
| Kaggle Resume Dataset | 19,020 resumes | Training pairs + NER eval |
| LinkedIn Job Postings | 3.3M listings | Job description corpus |
| Kaggle Structured Resumes | 1,200 resumes | NER F1 evaluation |
Resumes (data/raw/resumes_clean.csv): scraped and cleaned resume text spanning engineering, data science, finance, marketing, and healthcare roles.
LinkedIn Jobs (data/raw/linkedin_jobs/): 3.3M job postings with structured metadata (skills, industries, salaries, company details). Filtered to English-language postings that include at least a job description and a skills list.
Raw CSVs
│
▼ src/data/ingestion.py
Normalize text, remove PII artifacts, filter short docs (<100 chars)
│
▼ src/data/preprocessing.py
Sentence-case, strip HTML, deduplicate, tokenize
│
▼ src/data/synthetic_gen.py
Weak-supervision pair generation:
- Embed all resumes + JDs with base MiniLM-L6-v2
- Cosine similarity > 0.65 → positive label (1.0)
- Cosine similarity < 0.35 → negative label (0.0)
- Balance ratio: ~1:2 (pos:neg)
│
▼
data/processed/train_pairs.csv (5,692 pairs — 1,897 pos / 3,795 neg)
data/processed/val_pairs.csv (712 pairs — 238 pos / 474 neg)
Fine-tuned sentence-transformers/all-MiniLM-L6-v2 on the generated pairs using CosineSimilarityLoss:
Base model : sentence-transformers/all-MiniLM-L6-v2 (384-dim)
Epochs : 5
Batch size : 32
LR : 2e-5 (cosine schedule, 200 warmup steps)
FP16 : enabled
Device : NVIDIA GTX 1650 (3.9 GB VRAM)
Duration : ~56 minutes
Val Pearson cosine : 0.622 (vs 0.355 for base model)
Val Spearman cosine : 0.561
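To make the "cosine schedule, 200 warmup steps" setting concrete, here is a minimal sketch of linear warmup followed by cosine decay to zero. The total of 890 steps is an assumption derived from 5,692 training pairs at batch size 32 over 5 epochs (178 steps per epoch, last partial batch kept); the training script may count steps differently.

```python
import math

BASE_LR = 2e-5
WARMUP_STEPS = 200
STEPS_PER_EPOCH = math.ceil(5692 / 32)   # 178, assuming the last partial batch is kept
TOTAL_STEPS = STEPS_PER_EPOCH * 5        # 890 (assumption, see lead-in)


def learning_rate(step: int) -> float:
    """Linear warmup to BASE_LR, then half-cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 100, 200, 545, 890):
    print(step, f"{learning_rate(step):.2e}")
```

With these numbers the warmup covers a bit more than the first epoch, so the peak learning rate is only reached early in epoch 2.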
The fine-tuned model raises validation Pearson correlation from 0.355 to 0.622 on the domain-specific resume/JD pairs, a 75% relative improvement over the off-the-shelf base model.
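The two figures behind that comparison can be reproduced with a few lines. This sketch assumes the validation metric is the Pearson correlation between predicted cosine similarities and the 0/1 pair labels; the sample `predicted` and `gold` values are made up for illustration.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def relative_gain(new: float, base: float) -> float:
    """Improvement of `new` over `base`, as a fraction of `base`."""
    return (new - base) / base


# Made-up similarities and labels (assumption), just to exercise pearson().
predicted = [0.9, 0.2, 0.7, 0.1]
gold = [1.0, 0.0, 1.0, 0.0]
print(round(pearson(predicted, gold), 3))

# Reported validation scores: fine-tuned 0.622 vs base 0.355.
print(f"{relative_gain(0.622, 0.355):.1%}")  # 75.2%
```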
| Layer | Technology | Version |
|---|---|---|
| ML / NLP | sentence-transformers | 3.0.1 |
| | PyTorch | 2.3.1 |
| | spaCy (en_core_web_trf) | 3.7.5 |
| | scikit-learn | 1.5.1 |
| | HuggingFace Transformers | 4.42.4 |
| Backend | FastAPI | 0.111.1 |
| | Uvicorn | 0.30.3 |
| | Pydantic v2 | 2.8.2 |
| | SQLAlchemy (async) | 2.0.31 |
| | asyncpg | 0.29.0 |
| Frontend | React | 18 |
| | Vite | 5 |
| | Tailwind CSS | 3 |
| Data | PostgreSQL | 16.3 |
| | Redis | 7.2 |
| | pdfplumber | 0.11.2 |
| | PyMuPDF | 1.24.7 |
| Experiment tracking | MLflow | 2.17.2 |
| Infrastructure | Docker + Docker Compose | — |
| | nginx | alpine |
| | AWS EC2 (t3.micro) | eu-north-1 |
| | AWS S3 | — |
| | GitHub Actions | — |
- Python 3.12
- Node.js 20+
- Docker + Docker Compose
- Git
git clone https://github.com/Akshats-git/HireLens.git
cd HireLens
# Python environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
# spaCy models (not on PyPI — install from GitHub releases)
pip install "https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3-py3-none-any.whl"
pip install "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl"
# Frontend
cd frontend && npm ci && cd ..
cp .env.example .env
# Edit .env — minimum required keys:
# POSTGRES_PASSWORD=<any password>
# APP_SECRET_KEY=<random 32-char hex>
See Environment Variables for the full list.
# Start postgres + redis
docker compose up -d postgres redis
# Run backend (from repo root)
source .venv/bin/activate
PYTHONPATH=. uvicorn backend.main:app --reload --port 8000
# Run frontend (in another terminal)
cd frontend && npm run dev
Open http://localhost:5173. The React dev server proxies /api/ to :8000.
source .venv/bin/activate
pytest tests/ -v
# Generate training pairs from raw data first
python -m src.data.data_pipeline
# Fine-tune
python -m src.models.train --epochs 5 --batch-size 32
# Evaluate
python -m src.evaluation.metrics
- AWS CLI configured (aws configure)
- EC2 key pair at ~/.ssh/hirelens.pem
- GitHub repo secrets set (see below)
| Secret | Value |
|---|---|
| EC2_HOST | EC2 public IP (e.g. 13.51.207.145) |
| EC2_SSH_KEY | Contents of your .pem key file |
| POSTGRES_PASSWORD | Strong password for Postgres |
GITHUB_TOKEN is provided automatically; no setup is needed for the ghcr.io push.
# Launch Ubuntu 22.04 LTS t3.micro in eu-north-1
# Add inbound rules: SSH (22), HTTP (80)
# Bootstrap (installs Docker, nginx, swap, CloudWatch agent)
EC2_HOST=<your-ip> \
SSH_KEY=~/.ssh/hirelens.pem \
GHCR_USER=<github-username> \
GHCR_TOKEN=<github-pat> \
POSTGRES_PASSWORD=<password> \
bash aws/deploy.sh --bootstrap
Every push to main triggers the CI/CD pipeline:
- Python lint + 30 unit tests
- Frontend ESLint + Vite build check
- Build multi-stage Docker images and push to ghcr.io
- SSH into EC2, pull new images, rolling restart (postgres/redis kept running), smoke test
No manual steps needed after the initial bootstrap.
EC2_HOST=13.51.207.145 \
SSH_KEY=~/.ssh/hirelens.pem \
GHCR_USER=<github-username> \
GHCR_TOKEN=<github-pat> \
POSTGRES_PASSWORD=<password> \
bash aws/deploy.sh
| Variable | Required | Default | Description |
|---|---|---|---|
| POSTGRES_PASSWORD | Yes | — | PostgreSQL password |
| DATABASE_URL | Yes | — | Full async DB URL (postgresql+asyncpg://...) |
| REDIS_URL | No | in-memory | Redis connection string |
| APP_SECRET_KEY | Yes | — | 32-byte hex secret for JWT signing |
| JWT_SECRET_KEY | Yes | — | 32-byte hex secret for tokens |
| API_WORKERS | No | 4 | Uvicorn worker count (1 on t3.micro) |
| API_RELOAD | No | false | Hot-reload (dev only) |
| LOG_LEVEL | No | INFO | Loguru log level |
| APP_ENV | No | development | production disables debug features |
| SPACY_MODEL | No | en_core_web_sm | spaCy model (en_core_web_trf for best accuracy) |
| MODEL_CACHE_DIR | No | models/cache | Directory for embedding cache |
| FINE_TUNED_MODEL_PATH | No | models/fine_tuned/hirelens_matcher | Path to fine-tuned sentence-transformer |
| AWS_DEFAULT_REGION | No | us-east-1 | AWS region for S3 |
| AWS_S3_RESUMES_BUCKET | No | — | S3 bucket for uploaded resumes |
| AWS_S3_MODELS_BUCKET | No | — | S3 bucket for model artifacts |
| CORS_ORIGINS | No | http://localhost:3000 | Allowed CORS origins (comma-separated) |
| MLFLOW_TRACKING_URI | No | — | MLflow server URL for experiment logging |
Never commit credentials.
.env and .env.prod are in .gitignore. AWS credentials live in ~/.aws/credentials, not in any project file.
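A small helper can generate the secret values and catch missing settings before the backend starts. This is a sketch only: `generate_secret`, `missing_required`, and `parse_cors_origins` are hypothetical names, not part of the HireLens codebase, and the required-variable list simply mirrors the "Required" column above.

```python
import os
import secrets

# Variables marked "Yes" in the Required column of the table above.
REQUIRED_VARS = ("POSTGRES_PASSWORD", "DATABASE_URL", "APP_SECRET_KEY", "JWT_SECRET_KEY")


def generate_secret(num_bytes: int = 32) -> str:
    """Return a random secret as hex; 32 bytes yields 64 hex characters."""
    return secrets.token_hex(num_bytes)


def missing_required(env: dict[str, str]) -> list[str]:
    """List required variables that are unset or empty in the given mapping."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def parse_cors_origins(raw: str) -> list[str]:
    """Split a comma-separated CORS_ORIGINS value into clean origins."""
    return [origin.strip() for origin in raw.split(",") if origin.strip()]


if __name__ == "__main__":
    print("APP_SECRET_KEY=" + generate_secret())
    print("Missing:", missing_required(dict(os.environ)))
    print(parse_cors_origins(os.environ.get("CORS_ORIGINS", "http://localhost:3000")))
```

Paste the generated values into .env by hand rather than writing them from a script, so the file never passes through shell history or logs.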
| Method | Path | Description |
|---|---|---|
| POST | /api/candidate/analyze | Score one resume PDF against a job description |
Request: multipart/form-data — resume (PDF ≤ 10 MB) + job_description (string ≥ 50 chars)
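The request limits above can be checked on the client before uploading. The sketch below is an assumption about how such a check might look, not the backend's actual validation: it treats 10 MB as 10 × 1024 × 1024 bytes and uses the `%PDF-` file signature to detect PDFs.

```python
MAX_PDF_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" means 10 MiB
MIN_JD_CHARS = 50


def validate_request(pdf_bytes: bytes, job_description: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if not pdf_bytes.startswith(b"%PDF-"):
        errors.append("resume must be a PDF file")
    if len(pdf_bytes) > MAX_PDF_BYTES:
        errors.append("resume exceeds 10 MB")
    if len(job_description.strip()) < MIN_JD_CHARS:
        errors.append("job_description must be at least 50 characters")
    return errors


ok = validate_request(b"%PDF-1.7 sample", "Senior DevOps engineer with AWS and Kubernetes experience.")
bad = validate_request(b"plain text", "Too short")
print(ok)   # []
print(bad)  # ['resume must be a PDF file', 'job_description must be at least 50 characters']
```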
Response:
{
"score_pct": 82.5,
"label": "Good",
"matched_skills": ["python", "docker", "aws"],
"missing_skills": ["kubernetes", "terraform"],
"suggestions": ["Add Kubernetes experience to match senior DevOps requirements"],
"processing_time_ms": 347
}
| Method | Path | Description |
|---|---|---|
| POST | /api/recruiter/bulk-analyze | Score up to 50 resumes, returns ranked list |
| GET | /api/recruiter/candidate/{id} | Full analysis for one candidate |
| POST | /api/recruiter/filter | Filter/re-rank by score, skills, experience level |
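The ranking and filtering behaviour behind these endpoints can be sketched in a few lines. The `Candidate` dataclass and both functions are hypothetical; field names are borrowed from the candidate response (`score_pct`, `matched_skills`), and the real recruiter schemas may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical per-resume result; not the project's actual schema."""
    candidate_id: str
    score_pct: float
    matched_skills: set[str] = field(default_factory=set)


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=lambda c: c.score_pct, reverse=True)


def filter_candidates(
    candidates: list[Candidate],
    min_score: float = 0.0,
    required_skills: frozenset[str] = frozenset(),
) -> list[Candidate]:
    """Keep ranked candidates that meet the score floor and have every required skill."""
    return [
        c for c in rank(candidates)
        if c.score_pct >= min_score and required_skills <= c.matched_skills
    ]


pool = [
    Candidate("a", 62.0, {"python"}),
    Candidate("b", 88.5, {"python", "docker"}),
    Candidate("c", 75.0, {"docker"}),
]
print([c.candidate_id for c in rank(pool)])                                        # ['b', 'c', 'a']
print([c.candidate_id for c in filter_candidates(pool, 70, frozenset({"docker"}))])  # ['b', 'c']
```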
| Method | Path | Description |
|---|---|---|
| GET | /health | Returns {"status":"healthy","models_loaded":true} |
| GET | /docs | Swagger UI (all endpoints) |
HireLens/
├── backend/ # FastAPI application
│ ├── main.py # App factory, lifespan, CORS
│ ├── routers/
│ │ ├── candidate.py # /api/candidate/analyze
│ │ └── recruiter.py # /api/recruiter/*
│ ├── services/
│ │ ├── ml_service.py # Singleton: NER + embeddings + scorer
│ │ ├── cache_service.py # Redis with in-memory fallback
│ │ └── pdf_service.py # pdfplumber + PyMuPDF extraction
│ └── schemas/ # Pydantic request/response models
├── src/
│ ├── models/
│ │ ├── train.py # Fine-tuning script
│ │ └── scorer.py # 4-component MatchScorer
│ ├── features/
│ │ ├── ner.py # spaCy NER + taxonomy skill extraction
│ │ └── embeddings.py # Sentence-transformer wrapper + cache
│ ├── evaluation/
│ │ └── metrics.py # NDCG, MRR, AUC-ROC, NER F1
│ └── data/
│ ├── ingestion.py # Raw CSV loading + cleaning
│ ├── preprocessing.py # Text normalization
│ └── synthetic_gen.py # Weak-supervision pair generation
├── frontend/ # React + Vite + Tailwind
│ └── src/pages/
│ ├── Landing.jsx
│ ├── Candidate.jsx # Single resume upload + score display
│ └── Recruiter.jsx # Bulk upload + ranked table + filters
├── data/
│ ├── raw/ # Original datasets (git-ignored)
│ ├── processed/ # train/val pairs (git-ignored)
│ ├── synthetic/ # Weak-supervision pairs (git-ignored)
│ └── skills_taxonomy.json # 595 tech + 51 soft + 35 cert skills
├── models/
│ ├── fine_tuned/hirelens_matcher/ # 608 MB fine-tuned weights
│ └── cache/embeddings/ # SHA-256 keyed embedding cache
├── docker/
│ ├── Dockerfile.backend # Multi-stage Python image
│ ├── Dockerfile.frontend # Node builder → nginx static
│ └── postgres/init.sql # DB + extension setup
├── aws/
│ ├── deploy.sh # Local → EC2 deploy script
│ ├── setup-ec2.sh # Bootstrap (Docker, nginx, swap)
│ ├── s3-setup.sh # S3 bucket creation with encryption
│ └── cloudwatch-agent.json # CloudWatch log/metric config
├── tests/
│ ├── integration/ # FastAPI endpoint tests (30 total)
│ └── conftest.py # Fixtures with mocked ML service
├── .github/workflows/
│ └── ci-cd.yml # 4-job CI/CD pipeline
├── docker-compose.yml # Local dev (postgres + redis)
├── docker-compose.prod.yml # Production (all 4 services)
├── requirements.txt
├── params.yaml # DVC pipeline parameters
├── dvc.yaml # DVC stage definitions
├── metrics_report.json # Full evaluation + latency report
└── configs/config.yaml # Master configuration
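The tree mentions a SHA-256 keyed embedding cache under models/cache/embeddings/. A minimal in-memory sketch of that idea follows; the `EmbeddingCache` class, its key layout (model name plus text), and the `fake_embed` stand-in are assumptions, and the project's cache also persists to disk or Redis.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """In-memory cache keyed by the SHA-256 of model name + text (illustrative only)."""

    def __init__(self) -> None:
        self._store: dict[str, list[float]] = {}

    @staticmethod
    def key_for(text: str, model_name: str) -> str:
        """Hash model name and text together so different models never collide."""
        payload = f"{model_name}\n{text.strip()}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get_or_compute(
        self, text: str, model_name: str, embed: Callable[[str], list[float]]
    ) -> list[float]:
        """Return the cached vector, calling `embed` only on a cache miss."""
        key = self.key_for(text, model_name)
        if key not in self._store:
            self._store[key] = embed(text)
        return self._store[key]


calls = []


def fake_embed(text: str) -> list[float]:
    """Stand-in for a sentence-transformer encode call (assumption)."""
    calls.append(text)
    return [float(len(text)), 0.0]


cache = EmbeddingCache()
cache.get_or_compute("Python developer", "hirelens_matcher", fake_embed)
cache.get_or_compute("Python developer", "hirelens_matcher", fake_embed)
print(len(calls))  # 1
```

Including the model name in the key means swapping the fine-tuned weights for the base model does not serve stale vectors.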
Add screenshots here after the UI is finalized.
| View | Description |
|---|---|
| Landing page | Role selection (Candidate / Recruiter) |
| Candidate view | Upload resume PDF + paste job description → score breakdown with matched/missing skills |
| Recruiter view | Upload up to 50 resumes → ranked table with score badges, filter by score/skills/experience |
MIT