A domain-specific vector store that improves embedding search quality using PCA. By fitting a PCA model on a domain's own documents, embeddings are projected into a lower-dimensional subspace that concentrates domain-relevant variance — resulting in faster ANN search and better retrieval recall.
Based on: Giri, S. (2026). Domain-Focused PCA on Text Embeddings Improves Semantic Retrieval: A Medical Domain Study. Zenodo. https://doi.org/10.5281/zenodo.20320367
General-purpose embedding models (OpenAI, BGE, Sentence-BERT) encode the full geometry of all language. For a narrow domain like legal contracts or medical literature, most of that variance is noise. Fitting PCA on the corpus and projecting everything into that subspace:
- Reduces dimensionality — fewer dimensions = faster approximate nearest-neighbour search
- Improves recall — domain-relevant directions are amplified; irrelevant ones are discarded
- Zero search downtime — re-indexing uses an atomic collection swap; old index serves queries until the new one is ready
Raw (unprojected) embeddings are always preserved in SQLite, so PCA can be rebuilt at any time as the corpus grows.
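The zero-downtime behaviour rests on building the replacement index off to the side and only then switching which collection serves queries. The sketch below illustrates that idea in plain Python; `SwappableIndex`, `rebuild`, and `active_docs` are hypothetical names chosen for illustration and are not part of this package's API.

```python
import threading


class SwappableIndex:
    """Toy registry: readers always see whichever collection is active."""

    def __init__(self):
        self._collections = {}   # collection name -> {doc_id: vector}
        self._active = None      # name of the collection serving queries
        self._lock = threading.Lock()

    def rebuild(self, name, items):
        # Populate the new collection completely before it becomes visible.
        staged = dict(items)
        with self._lock:
            self._collections[name] = staged
            old, self._active = self._active, name   # single pointer switch
            if old is not None and old != name:
                del self._collections[old]           # drop the retired collection

    def active_docs(self):
        with self._lock:
            if self._active is None:
                return {}
            return self._collections[self._active]


index = SwappableIndex()
index.rebuild("legal_v1", [("doc-1", [0.1, 0.2])])
index.rebuild("legal_v2", [("doc-1", [0.3]), ("doc-2", [0.4])])
print(sorted(index.active_docs()))  # ['doc-1', 'doc-2']
```

Because the new collection is filled before the pointer moves, a reader sees either the old contents or the new ones, never a partially populated index.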
The underlying research (Giri, 2026) reports that PCA-32 fitted on a medical corpus achieves MAP 0.9203 against a baseline of 0.8750, a relative improvement of about 5.2% (+0.0453 absolute), with no fine-tuning. Key findings from the paper that inform this implementation:
- Fit on documents only, not queries — joint fitting degrades performance.
- Domain-directed axes are necessary — PCA on a generic corpus does not transfer.
- Gains strengthen with corpus diversity — the more varied the documents, the more headroom PCA has to find meaningful directions.
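The first finding (fit on documents only) can be illustrated with a small NumPy sketch. It uses a plain SVD on synthetic data rather than this library's PCA engine, so the variable names and the `project` helper are assumptions made for illustration only. The point it shows is that the mean and axes are estimated from documents alone and then reused unchanged for queries.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(200, 64)).astype(np.float32)    # synthetic document embeddings
queries = rng.normal(size=(5, 64)).astype(np.float32)   # synthetic query embeddings

# Fit on documents only: centre, then keep the top-k right singular vectors.
k = 16
mean = docs.mean(axis=0)
_, _, vt = np.linalg.svd(docs - mean, full_matrices=False)
components = vt[:k]                                      # shape (k, 64)


def project(x):
    """Apply the document-fitted projection to any batch of embeddings."""
    return (x - mean) @ components.T


docs_p = project(docs)        # projected documents
queries_p = project(queries)  # queries reuse the documents' mean and axes

print(docs_p.shape, queries_p.shape)  # (200, 16) (5, 16)
```

Queries never influence `mean` or `components`, which is the property the paper's finding calls for.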
pip install -e ".[dev]"

Dependencies: numpy, scikit-learn, joblib, faiss-cpu
import numpy as np
from domain_vector_optimizer import DomainVectorStore, FaissBackend
from domain_vector_optimizer.core.domain import DomainConfig
from domain_vector_optimizer.core.enums import PCAStrategy
# 1. Create the store
store = DomainVectorStore(
    backend=FaissBackend(index_dir="./faiss_indices"),
    db_path="./metadata.db",
    model_dir="./pca_models",
)
# 2. Create a domain
domain = store.create_domain(
    name="legal",
    embedding_dim=1536,
    config=DomainConfig(
        pca_strategy=PCAStrategy.EXPLAINED_VARIANCE,
        pca_variance=0.95,  # retain 95% of variance
        min_docs_for_pca=100,
    ),
)
# 3. Add documents
store.add_document(
    domain_id=domain.id,
    doc_id="contract-001",
    embedding=np.array([...]),  # shape: (1536,)
    content="Force majeure clause...",
    metadata={"year": 2023, "type": "contract"},
)
# Bulk add
store.add_documents(domain.id, [
    {"id": "doc-1", "embedding": vec1, "content": "...", "metadata": {"k": "v"}},
    {"id": "doc-2", "embedding": vec2, "content": "..."},
])
# 4. Build PCA — fits on all domain embeddings, then atomically re-indexes
store.build_pca(domain.id)
# Check what happened
info = store.pca_info(domain.id)
print(info["pca_dim"]) # e.g. 312 (from 1536)
print(info["total_explained_variance"]) # e.g. 0.951
# 5. Search — query is automatically projected through domain PCA
results = store.search(
domain_id=domain.id,
query_embedding=np.array([...]),
top_k=10,
)
for r in results:
    print(r.doc_id, r.score, r.content, r.metadata)

DomainVectorStore(backend, db_path, model_dir)

| Parameter | Type | Description |
|---|---|---|
| `backend` | `VectorStoreBackend` | Vector index backend (e.g. `FaissBackend`) |
| `db_path` | `str` | Path to SQLite file for raw embeddings + metadata |
| `model_dir` | `str` | Directory for persisted PCA model files |

| Method | Description |
|---|---|
| `create_domain(name, embedding_dim, config, description)` | Create a new domain; returns `Domain` |
| `get_domain(domain_id)` | Fetch domain by ID |
| `list_domains()` | List all domains |
| `delete_domain(domain_id)` | Delete domain, all its documents, PCA models, and index |

| Method | Description |
|---|---|
| `add_document(domain_id, doc_id, embedding, content, metadata)` | Add or update a document |
| `add_documents(domain_id, documents)` | Bulk add (list of dicts with `id`, `embedding`, optional `content`/`metadata`) |
| `get_document(domain_id, doc_id)` | Fetch a document |
| `delete_document(domain_id, doc_id)` | Remove from store and index |

| Method | Description |
|---|---|
| `build_pca(domain_id, n_components=None)` | Fit PCA and atomically re-index all documents |
| `pca_info(domain_id)` | Returns status, dimensions, explained variance, fitted doc count |

| Method | Description |
|---|---|
| `search(domain_id, query_embedding, top_k, filters)` | Nearest-neighbour search; query is auto-projected if PCA is built |
DomainConfig(
    pca_strategy="explained_variance",  # or "fixed_dim"
    pca_n_components=256,               # used when strategy="fixed_dim"
    pca_variance=0.95,                  # used when strategy="explained_variance"
    min_docs_for_pca=10,                # minimum docs required before build_pca succeeds
    distance_metric="cosine",           # "cosine" | "l2" | "ip"
    batch_size=512,                     # streaming batch size for re-indexing
)

| Strategy | Config key | Behaviour |
|---|---|---|
| `fixed_dim` | `pca_n_components=256` | Always reduce to exactly N dimensions (capped to min(N, n_samples-1, n_features)) |
| `explained_variance` | `pca_variance=0.95` | Let sklearn choose the fewest components that explain ≥95% of variance |
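To make the two strategies concrete, the sketch below calls scikit-learn's `PCA` directly on synthetic data. How this library wires its config keys into scikit-learn internally is an assumption here; the example only shows the underlying scikit-learn behaviour that the table describes, namely a float `n_components` selecting by explained variance and an integer request being capped.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic embeddings with decaying per-dimension variance: 300 samples, 64 features.
X = rng.normal(size=(300, 64)) * np.linspace(3.0, 0.1, 64)

# explained_variance: a float in (0, 1) asks for the fewest components
# whose cumulative explained-variance ratio reaches that threshold.
pca_var = PCA(n_components=0.95, svd_solver="full").fit(X)

# fixed_dim: an integer request, capped as described in the table.
requested = 256
n_samples, n_features = X.shape
capped = min(requested, n_samples - 1, n_features)
pca_fixed = PCA(n_components=capped).fit(X)

print(pca_var.n_components_, float(pca_var.explained_variance_ratio_.sum()))
print(pca_fixed.n_components_)  # 64, because the data has only 64 features
```

With 1536-dimensional embeddings and a large corpus the cap rarely applies, but it matters for small domains where the document count is below the requested dimensionality.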
src/domain_vector_optimizer/
├── store.py # DomainVectorStore — main entry point
├── core/
│ ├── domain.py # Domain, DomainConfig
│ ├── document.py # Document, SearchResult
│ └── enums.py # PCAStatus, PCAStrategy, DistanceMetric
├── storage/
│ ├── base.py # RawStore ABC
│ └── sqlite_store.py # SQLite implementation
├── pca/
│ ├── engine.py # PCAEngine + PCAModel
│ └── model_store.py # joblib-based model persistence
└── index/
    ├── manager.py # Atomic swap re-indexing
    └── backends/
        ├── base.py # VectorStoreBackend ABC
        └── faiss_backend.py # FAISS implementation (in-memory + disk)
pytest tests/ -v

73 tests across three layers:

| File | What it covers |
|---|---|
| `tests/unit/test_pca_engine.py` | Fit strategies, dimension capping, transform dtype, cluster ordering |
| `tests/unit/test_document_store.py` | Domain/doc CRUD, embedding byte roundtrip, streaming, cascade delete |
| `tests/unit/test_faiss_backend.py` | All metrics, upsert/update, deletion, disk persistence |
| `tests/integration/test_end_to_end.py` | Full lifecycle, PCA rebuild, cluster separation quality |
- Raw embeddings are always kept. SQLite stores the original float32 arrays so PCA can be re-fitted any time (e.g. after the corpus doubles).
- Atomic re-indexing. `build_pca` creates a new FAISS collection, populates it fully, then switches the domain's `active_collection` pointer, so searches never hit a half-built index.
- String → int ID mapping. FAISS requires integer IDs; the backend maintains a bidirectional map and handles upserts as delete + re-add.
- Cosine similarity is implemented as L2-normalise + inner product (standard practice; avoids a separate normalised index type).
- Pluggable backend. Implement `VectorStoreBackend` (6 abstract methods) to swap in Qdrant, Pinecone, or any other store.
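The cosine note above can be checked numerically. The following NumPy sketch (synthetic vectors, with a hypothetical `l2_normalise` helper that is not taken from this package) confirms that the inner product of L2-normalised vectors equals cosine similarity computed from its definition.

```python
import numpy as np


def l2_normalise(x):
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms)


rng = np.random.default_rng(0)
docs = rng.normal(size=(50, 32))
query = rng.normal(size=(1, 32))

# Inner product of unit-length vectors...
ip_scores = (l2_normalise(query) @ l2_normalise(docs).T).ravel()

# ...matches cosine similarity computed from the definition.
cos_scores = (query @ docs.T).ravel() / (
    np.linalg.norm(query) * np.linalg.norm(docs, axis=1)
)

print(np.allclose(ip_scores, cos_scores))  # True
```

This is why an inner-product index over pre-normalised vectors can serve cosine queries without a dedicated index type.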
See ARCHITECTURE.md for component diagrams, data flow, and scaling considerations.
If you use this library, please cite the underlying research:
@misc{giri2026domainpca,
author = {Giri, Sandeep},
title = {Domain-Focused PCA on Text Embeddings Improves Semantic Retrieval: A Medical Domain Study},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.20320367},
url = {https://doi.org/10.5281/zenodo.20320367}
}