HireLens

AI-powered resume-to-job-description matching system. Scores resumes against job descriptions using semantic embeddings and NER-based skill extraction, with dual views for candidates (single resume analysis) and recruiters (bulk ranking + filtering).

Live deployment: http://13.51.207.145 (AWS EC2 eu-north-1)

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                          Client (Browser)                           │
└───────────────────────────────┬─────────────────────────────────────┘
                                │ HTTP :80
                    ┌───────────▼───────────┐
                    │    nginx (port 80)     │
                    │   reverse proxy       │
                    └──────┬──────────┬─────┘
                           │          │
              / (SPA)      │          │  /api/*  /health
                           │          │
              ┌────────────▼──┐  ┌────▼──────────────────┐
              │   React +     │  │   FastAPI backend      │
              │   Vite SPA    │  │   (port 8000)          │
              │   (port 3000) │  │                        │
              │               │  │  ┌──────────────────┐  │
              │  ┌──────────┐ │  │  │  ML Pipeline     │  │
              │  │Candidate │ │  │  │                  │  │
              │  │  View    │ │  │  │  NERExtractor    │  │
              │  └──────────┘ │  │  │  (spaCy trf)    │  │
              │  ┌──────────┐ │  │  │       ↓          │  │
              │  │Recruiter │ │  │  │  EmbeddingModel  │  │
              │  │  View    │ │  │  │  (MiniLM-L6)    │  │
              └──────────────┘  │  │       ↓          │  │
                                │  │  MatchScorer     │  │
                                │  │  (4-component)   │  │
                                │  └──────────────────┘  │
                                │          │              │
                                │  ┌───────▼──────┐       │
                                │  │  Redis cache │       │
                                │  │  (port 6379) │       │
                                │  └──────────────┘       │
                                │  ┌───────────────┐      │
                                │  │  PostgreSQL   │      │
                                │  │  (port 5432)  │      │
                                │  └───────────────┘      │
                                └────────────────────────┘
                                           │
                              ┌────────────▼────────────┐
                              │      AWS S3             │
                              │  hirelens-models-*      │  ← fine-tuned weights
                              │  hirelens-resumes-*     │  ← uploaded PDFs
                              └─────────────────────────┘

CI/CD (GitHub Actions → ghcr.io → EC2)
──────────────────────────────────────
push to main
    │
    ├── Job 1: Backend lint (black + flake8) + pytest (30 tests)
    ├── Job 2: Frontend lint (ESLint) + Vite production build
    ├── Job 3: Build & push Docker images → ghcr.io
    └── Job 4: SSH deploy to EC2 → docker compose pull + rolling restart

Achieved Metrics

Metric	Value	Target	Status
NDCG@10	0.962	0.80	✓
Precision@1	0.896	—	—
AUC-ROC	0.837	0.80	✓
MRR	0.948	—	—
NER F1 (skill extraction)	0.771	0.88	✗
Scoring latency (GPU, single)	165 ms	<500 ms	✓
Bulk 50 resumes	8.1 s (162 ms/resume)	—	—
Model size	608 MB	—	—
Test suite	30/30 passing	—	✓

Evaluated on 712 held-out resume–job pairs. Latency was measured on an NVIDIA GTX 1650; the CPU-only AWS t3.micro deployment is roughly 3–5× slower.
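For readers unfamiliar with the ranking metrics in the table, the sketch below shows how NDCG@k and MRR can be computed from binary relevance labels. It is a minimal illustration and not the code in src/evaluation/metrics.py; the function names are assumptions made for this example.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k items of a ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG divided by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(relevances: Sequence[float]) -> float:
    """1 / rank of the first relevant item, or 0.0 if nothing is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings: Sequence[Sequence[float]]) -> float:
    """Average reciprocal rank over several ranked lists (one per query)."""
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)
```

Each input list holds the relevance labels of the ranked resumes for one job description, in the order the system returned them.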

Datasets

1. Raw Sources

Dataset	Rows	Used For
Kaggle Resume Dataset	19,020 resumes	Training pairs + NER eval
LinkedIn Job Postings	3.3M listings	Job description corpus
Kaggle Structured Resumes	1,200 resumes	NER F1 evaluation

Resumes (data/raw/resumes_clean.csv): scraped and cleaned resume text spanning engineering, data science, finance, marketing, and healthcare roles.

LinkedIn Jobs (data/raw/linkedin_jobs/): 3.3M job postings with structured metadata (skills, industries, salaries, company details), filtered to English-language postings that include at least a job description and a skills list.

2. Processing Pipeline

Raw CSVs
    │
    ▼  src/data/ingestion.py
Normalize text, remove PII artifacts, filter short docs (<100 chars)
    │
    ▼  src/data/preprocessing.py
Sentence-case, strip HTML, deduplicate, tokenize
    │
    ▼  src/data/synthetic_gen.py
Weak-supervision pair generation:
  - Embed all resumes + JDs with base MiniLM-L6-v2
  - Cosine similarity > 0.65 → positive label (1.0)
  - Cosine similarity < 0.35 → negative label (0.0)
  - Balance ratio: ~1:2 (pos:neg)
    │
    ▼
data/processed/train_pairs.csv   (5,692 pairs — 1,897 pos / 3,795 neg)
data/processed/val_pairs.csv     (712 pairs — 238 pos / 474 neg)
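The weak-supervision labelling rule in the pipeline above can be sketched as follows. This is a simplified stand-in for src/data/synthetic_gen.py, not its actual contents: the helper names are hypothetical, embeddings are plain Python lists, and discarding pairs that fall between the two thresholds is an assumption inferred from the thresholds listed.

```python
import math
from typing import Optional, Sequence

POS_THRESHOLD = 0.65  # similarity above this -> positive pair (label 1.0)
NEG_THRESHOLD = 0.35  # similarity below this -> negative pair (label 0.0)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weak_label(similarity: float) -> Optional[float]:
    """Map a similarity score to a weak label, or None for the ambiguous band."""
    if similarity > POS_THRESHOLD:
        return 1.0
    if similarity < NEG_THRESHOLD:
        return 0.0
    return None  # assumption: pairs between the thresholds are not used
```

Pairs labelled None would be dropped before the positives and negatives are sampled to the ~1:2 ratio.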

3. Model Training

Fine-tuned sentence-transformers/all-MiniLM-L6-v2 on the generated pairs using CosineSimilarityLoss:

Base model : sentence-transformers/all-MiniLM-L6-v2 (384-dim)
Epochs     : 5
Batch size : 32
LR         : 2e-5 (cosine schedule, 200 warmup steps)
FP16       : enabled
Device     : NVIDIA GTX 1650 (3.9 GB VRAM)
Duration   : ~56 minutes

Val Pearson cosine  : 0.622  (vs 0.355 for base model)
Val Spearman cosine : 0.561

Fine-tuning raises validation Pearson correlation from 0.355 to 0.622, a relative improvement of about 75% over the off-the-shelf base model on the domain-specific resume/JD pairs.
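As background on the training objective: CosineSimilarityLoss in sentence-transformers penalizes the squared difference between the cosine similarity of the two embeddings and the target label (mean squared error by default). The pure-Python sketch below illustrates that idea on toy vectors. It is not the training code in src/models/train.py, and the function names are chosen for this example.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def cosine_similarity_loss(
    pairs: Sequence[Tuple[Vector, Vector]], labels: Sequence[float]
) -> float:
    """Mean squared error between each pair's cosine similarity and its label."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)
```

With labels of 1.0 and 0.0, minimizing this loss pulls matching resume/JD embeddings together and pushes non-matching ones toward orthogonality.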

Tech Stack

Layer	Technology	Version
ML / NLP	sentence-transformers	3.0.1
	PyTorch	2.3.1
	spaCy (en_core_web_trf)	3.7.5
	scikit-learn	1.5.1
	HuggingFace Transformers	4.42.4
Backend	FastAPI	0.111.1
	Uvicorn	0.30.3
	Pydantic v2	2.8.2
	SQLAlchemy (async)	2.0.31
	asyncpg	0.29.0
Frontend	React	18
	Vite	5
	Tailwind CSS	3
Data	PostgreSQL	16.3
	Redis	7.2
PDF	pdfplumber	0.11.2
	PyMuPDF	1.24.7
Experiment tracking	MLflow	2.17.2
Infrastructure	Docker + Docker Compose	—
	nginx	alpine
	AWS EC2 (t3.micro)	eu-north-1
	AWS S3	—
	GitHub Actions	—

Local Setup

Prerequisites

Python 3.12
Node.js 20+
Docker + Docker Compose
Git

1. Clone and install

git clone https://github.com/Akshats-git/HireLens.git
cd HireLens

# Python environment
python -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# spaCy models (not on PyPI — install from GitHub releases)
pip install "https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3-py3-none-any.whl"
pip install "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl"

# Frontend
cd frontend && npm ci && cd ..

2. Environment

cp .env.example .env
# Edit .env — minimum required keys:
#   POSTGRES_PASSWORD=<any password>
#   APP_SECRET_KEY=<random 32-char hex>

See Environment Variables for the full list.
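One way to generate the secret values is Python's standard secrets module, sketched below. Note that a 32-byte secret renders as 64 hex characters, while a 32-character hex string holds only 16 bytes. The sketch uses 32 bytes, matching the Environment Variables table; the helper name is an assumption, and which length the application actually checks is not stated here.

```python
import secrets


def hex_secret(num_bytes: int = 32) -> str:
    """Return a cryptographically strong random secret as a hex string.

    The result has 2 * num_bytes characters (two hex digits per byte).
    """
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Paste the printed lines into .env; never commit them.
    print(f"APP_SECRET_KEY={hex_secret(32)}")
    print(f"JWT_SECRET_KEY={hex_secret(32)}")
```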

3. Start services

# Start postgres + redis
docker compose up -d postgres redis

# Run backend (from repo root)
source .venv/bin/activate
PYTHONPATH=. uvicorn backend.main:app --reload --port 8000

# Run frontend (in another terminal)
cd frontend && npm run dev

Open http://localhost:5173. The Vite dev server proxies /api/ requests to the backend on :8000.

4. Run tests

source .venv/bin/activate
pytest tests/ -v

5. Train model (optional — pre-trained weights included)

# Generate training pairs from raw data first
python -m src.data.data_pipeline

# Fine-tune
python -m src.models.train --epochs 5 --batch-size 32

# Evaluate
python -m src.evaluation.metrics

AWS Deployment

Prerequisites

AWS CLI configured (aws configure)
EC2 key pair at ~/.ssh/hirelens.pem
GitHub repo secrets set (see below)

GitHub Secrets Required

Secret	Value
`EC2_HOST`	EC2 public IP (e.g. `13.51.207.145`)
`EC2_SSH_KEY`	Contents of your `.pem` key file
`POSTGRES_PASSWORD`	Strong password for Postgres

GITHUB_TOKEN is provided automatically — no setup needed for ghcr.io push.

First-time EC2 bootstrap

# Launch Ubuntu 22.04 LTS t3.micro in eu-north-1
# Add inbound rules: SSH (22), HTTP (80)

# Bootstrap (installs Docker, nginx, swap, CloudWatch agent)
EC2_HOST=<your-ip> \
SSH_KEY=~/.ssh/hirelens.pem \
GHCR_USER=<github-username> \
GHCR_TOKEN=<github-pat> \
POSTGRES_PASSWORD=<password> \
bash aws/deploy.sh --bootstrap

Automatic deploys

Every push to main triggers the CI/CD pipeline:

1. Python lint + 30 tests
2. Frontend ESLint + Vite build check
3. Build multi-stage Docker images and push to ghcr.io
4. SSH into EC2, pull new images, rolling restart (postgres/redis kept running), smoke test

No manual steps needed after the initial bootstrap.

Manual re-deploy

EC2_HOST=13.51.207.145 \
SSH_KEY=~/.ssh/hirelens.pem \
GHCR_USER=<github-username> \
GHCR_TOKEN=<github-pat> \
POSTGRES_PASSWORD=<password> \
bash aws/deploy.sh

Environment Variables

Variable	Required	Default	Description
`POSTGRES_PASSWORD`	Yes	—	PostgreSQL password
`DATABASE_URL`	Yes	—	Full async DB URL (`postgresql+asyncpg://...`)
`REDIS_URL`	No	in-memory	Redis connection string; an in-memory cache is used when unset
`APP_SECRET_KEY`	Yes	—	32-byte hex secret for JWT signing
`JWT_SECRET_KEY`	Yes	—	32-byte hex secret for tokens
`API_WORKERS`	No	`4`	Uvicorn worker count (use `1` on t3.micro)
`API_RELOAD`	No	`false`	Hot-reload (dev only)
`LOG_LEVEL`	No	`INFO`	Loguru log level
`APP_ENV`	No	`development`	`production` disables debug features
`SPACY_MODEL`	No	`en_core_web_sm`	spaCy model (`en_core_web_trf` for best accuracy)
`MODEL_CACHE_DIR`	No	`models/cache`	Directory for embedding cache
`FINE_TUNED_MODEL_PATH`	No	`models/fine_tuned/hirelens_matcher`	Path to fine-tuned sentence-transformer
`AWS_DEFAULT_REGION`	No	`us-east-1`	AWS region for S3
`AWS_S3_RESUMES_BUCKET`	No	—	S3 bucket for uploaded resumes
`AWS_S3_MODELS_BUCKET`	No	—	S3 bucket for model artifacts
`CORS_ORIGINS`	No	`http://localhost:3000`	Allowed CORS origins (comma-separated)
`MLFLOW_TRACKING_URI`	No	—	MLflow server URL for experiment logging

Never commit credentials. .env and .env.prod are in .gitignore. AWS credentials live in ~/.aws/credentials, not in any project file.
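To show how the variables in the table might be consumed, here is a minimal loader that fails fast on missing required keys and applies the documented defaults. It is a sketch only: the Settings class and load_settings function are hypothetical and are not the project's configuration code.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_KEYS = ("POSTGRES_PASSWORD", "DATABASE_URL", "APP_SECRET_KEY", "JWT_SECRET_KEY")


@dataclass(frozen=True)
class Settings:
    database_url: str
    api_workers: int
    log_level: str
    cors_origins: list[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, using the table's defaults."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    origins = env.get("CORS_ORIGINS", "http://localhost:3000")
    return Settings(
        database_url=env["DATABASE_URL"],
        api_workers=int(env.get("API_WORKERS", "4")),
        log_level=env.get("LOG_LEVEL", "INFO"),
        # CORS_ORIGINS is comma-separated; drop surrounding whitespace and empties.
        cors_origins=[o.strip() for o in origins.split(",") if o.strip()],
    )
```

Passing the mapping as an argument keeps the loader easy to test without touching the real process environment.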

API Reference

Candidate

Method	Path	Description
`POST`	`/api/candidate/analyze`	Score one resume PDF against a job description

Request: multipart/form-data — resume (PDF ≤ 10 MB) + job_description (string ≥ 50 chars)

Response:

{
  "score_pct": 82.5,
  "label": "Good",
  "matched_skills": ["python", "docker", "aws"],
  "missing_skills": ["kubernetes", "terraform"],
  "suggestions": ["Add Kubernetes experience to match senior DevOps requirements"],
  "processing_time_ms": 347
}
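The request limits stated above (a PDF of at most 10 MB and a job description of at least 50 characters) can be checked on the client before uploading. The sketch below is a hypothetical pre-flight check, not the backend's validation. Treating 10 MB as 10 × 1024 × 1024 bytes and ignoring surrounding whitespace in the job description are assumptions.

```python
MAX_PDF_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB
MIN_JD_CHARS = 50


def validate_analyze_request(pdf_bytes: bytes, job_description: str) -> list[str]:
    """Return human-readable problems; an empty list means the request looks valid."""
    errors: list[str] = []
    # Every PDF file starts with the "%PDF-" header.
    if not pdf_bytes.startswith(b"%PDF-"):
        errors.append("resume does not look like a PDF")
    if len(pdf_bytes) > MAX_PDF_BYTES:
        errors.append("resume exceeds 10 MB")
    if len(job_description.strip()) < MIN_JD_CHARS:
        errors.append("job_description is shorter than 50 characters")
    return errors
```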

Recruiter

Method	Path	Description
`POST`	`/api/recruiter/bulk-analyze`	Score up to 50 resumes and return a ranked list
`GET`	`/api/recruiter/candidate/{id}`	Full analysis for one candidate
`POST`	`/api/recruiter/filter`	Filter/re-rank by score, skills, experience level
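The filter/re-rank behaviour can be pictured with the sketch below, which keeps candidates above a minimum score that have all required skills and sorts them by score. The Candidate class and its field names are assumptions borrowed from the candidate response format, not the actual recruiter schema, and filtering by experience level is omitted.

```python
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class Candidate:
    candidate_id: str
    score_pct: float
    matched_skills: set[str] = field(default_factory=set)


def filter_and_rank(
    candidates: Iterable[Candidate],
    min_score: float = 0.0,
    required_skills: Iterable[str] = (),
) -> list[Candidate]:
    """Keep candidates meeting the score and skill filters, best score first."""
    required = {skill.lower() for skill in required_skills}
    kept = [
        c
        for c in candidates
        if c.score_pct >= min_score
        and required <= {skill.lower() for skill in c.matched_skills}
    ]
    return sorted(kept, key=lambda c: c.score_pct, reverse=True)
```

Skill names are compared case-insensitively so that "Python" and "python" count as the same skill.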

System

Method	Path	Description
`GET`	`/health`	Returns `{"status":"healthy","models_loaded":true}`
`GET`	`/docs`	Swagger UI (all endpoints)
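A smoke test against /health could look like the sketch below, which treats the service as ready only when the payload matches the documented response. The function names are hypothetical, and this is not necessarily how the deploy script performs its smoke test.

```python
import json
from urllib.request import urlopen


def is_ready(body: str) -> bool:
    """True only if the body is JSON reporting a healthy service with models loaded."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(payload, dict)
        and payload.get("status") == "healthy"
        and payload.get("models_loaded") is True
    )


def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """Fetch <base_url>/health and evaluate the response (requires a running server)."""
    with urlopen(f"{base_url.rstrip('/')}/health", timeout=timeout) as resp:
        return resp.status == 200 and is_ready(resp.read().decode("utf-8"))
```

For a local run, check_health("http://localhost:8000") would query the backend directly.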

Project Structure

HireLens/
├── backend/                  # FastAPI application
│   ├── main.py               # App factory, lifespan, CORS
│   ├── routers/
│   │   ├── candidate.py      # /api/candidate/analyze
│   │   └── recruiter.py      # /api/recruiter/*
│   ├── services/
│   │   ├── ml_service.py     # Singleton: NER + embeddings + scorer
│   │   ├── cache_service.py  # Redis with in-memory fallback
│   │   └── pdf_service.py    # pdfplumber + PyMuPDF extraction
│   └── schemas/              # Pydantic request/response models
├── src/
│   ├── models/
│   │   ├── train.py          # Fine-tuning script
│   │   └── scorer.py         # 4-component MatchScorer
│   ├── features/
│   │   ├── ner.py            # spaCy NER + taxonomy skill extraction
│   │   └── embeddings.py     # Sentence-transformer wrapper + cache
│   ├── evaluation/
│   │   └── metrics.py        # NDCG, MRR, AUC-ROC, NER F1
│   └── data/
│       ├── ingestion.py      # Raw CSV loading + cleaning
│       ├── preprocessing.py  # Text normalization
│       └── synthetic_gen.py  # Weak-supervision pair generation
├── frontend/                 # React + Vite + Tailwind
│   └── src/pages/
│       ├── Landing.jsx
│       ├── Candidate.jsx     # Single resume upload + score display
│       └── Recruiter.jsx     # Bulk upload + ranked table + filters
├── data/
│   ├── raw/                  # Original datasets (git-ignored)
│   ├── processed/            # train/val pairs (git-ignored)
│   ├── synthetic/            # Weak-supervision pairs (git-ignored)
│   └── skills_taxonomy.json  # 595 tech + 51 soft + 35 cert skills
├── models/
│   ├── fine_tuned/hirelens_matcher/   # 608 MB fine-tuned weights
│   └── cache/embeddings/             # SHA-256 keyed embedding cache
├── docker/
│   ├── Dockerfile.backend    # Multi-stage Python image
│   ├── Dockerfile.frontend   # Node builder → nginx static
│   └── postgres/init.sql     # DB + extension setup
├── aws/
│   ├── deploy.sh             # Local → EC2 deploy script
│   ├── setup-ec2.sh          # Bootstrap (Docker, nginx, swap)
│   ├── s3-setup.sh           # S3 bucket creation with encryption
│   └── cloudwatch-agent.json # CloudWatch log/metric config
├── tests/
│   ├── integration/          # FastAPI endpoint tests (30 total)
│   └── conftest.py           # Fixtures with mocked ML service
├── .github/workflows/
│   └── ci-cd.yml             # 4-job CI/CD pipeline
├── docker-compose.yml        # Local dev (postgres + redis)
├── docker-compose.prod.yml   # Production (all 4 services)
├── requirements.txt
├── params.yaml               # DVC pipeline parameters
├── dvc.yaml                  # DVC stage definitions
├── metrics_report.json       # Full evaluation + latency report
└── configs/config.yaml       # Master configuration
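The tree mentions a SHA-256 keyed embedding cache under models/cache/embeddings/. A minimal in-memory version of that idea is sketched below. The class, the key layout (model name plus text), and the embed_fn callback are assumptions for illustration and do not reproduce src/features/embeddings.py.

```python
import hashlib
from typing import Callable


def cache_key(text: str, model_name: str = "hirelens_matcher") -> str:
    """Hex SHA-256 digest of the model name and text, separated by a NUL byte."""
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    digest.update(b"\x00")
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()


class EmbeddingCache:
    """Memoizes embeddings so identical text is only embedded once."""

    def __init__(self, embed_fn: Callable[[str], list[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: dict[str, list[float]] = {}
        self.misses = 0  # number of times embed_fn was actually called

    def get(self, text: str) -> list[float]:
        key = cache_key(text)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Including the model name in the key avoids serving embeddings produced by a different model after the weights are swapped.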

Screenshots

Add screenshots here after the UI is finalized.

View	Description
Landing page	Role selection (Candidate / Recruiter)
Candidate view	Upload resume PDF + paste job description → score breakdown with matched/missing skills
Recruiter view	Upload up to 50 resumes → ranked table with score badges, filter by score/skills/experience

License

MIT
