Domain Vector Optimizer

A domain-specific vector store that uses PCA to improve embedding search quality. A PCA model is fitted on a domain's own documents, and every embedding is projected into a lower-dimensional subspace that concentrates domain-relevant variance. The smaller vectors make ANN search faster, and the study cited below reports better retrieval quality on a medical corpus.

Based on: Giri, S. (2026). Domain-Focused PCA on Text Embeddings Improves Semantic Retrieval: A Medical Domain Study. Zenodo. https://doi.org/10.5281/zenodo.20320367

How it works

General-purpose embedding models (OpenAI, BGE, Sentence-BERT) are trained to represent language in general. For a narrow domain such as legal contracts or medical literature, much of the variance they encode does not help distinguish one document from another. Fitting PCA on the domain corpus and projecting every embedding into the resulting subspace:

Reduces dimensionality: fewer dimensions mean faster approximate nearest-neighbour search
Improves recall: directions that carry the most variance within the domain are kept; low-variance directions are discarded
Zero search downtime: re-indexing uses an atomic collection swap; the old index serves queries until the new one is ready

Raw (unprojected) embeddings are always preserved in SQLite, so PCA can be rebuilt at any time as the corpus grows.
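The rebuild-then-swap idea above can be pictured with a minimal sketch. `SwappableIndex`, its methods, and the dict-based "collections" are hypothetical stand-ins chosen for illustration, not the library's actual classes: a replacement collection is built in full from the preserved raw vectors, and only then does a single reference assignment make it the active one.

```python
import threading


class SwappableIndex:
    """Hypothetical holder for the active collection (illustration only)."""

    def __init__(self, initial):
        self._active = initial          # collection currently serving queries
        self._lock = threading.Lock()   # serialises rebuilds; readers never wait

    def search(self, key):
        # Take one reference up front; a concurrent swap cannot change
        # the collection this call is already reading from.
        collection = self._active
        return collection.get(key)

    def rebuild(self, raw_store, project):
        with self._lock:
            # Build the replacement completely before exposing it.
            staged = {doc_id: project(vec) for doc_id, vec in raw_store.items()}
            self._active = staged       # one reference assignment is the swap


raw_store = {"doc-1": 1.0, "doc-2": 2.0}            # stands in for the SQLite raw store
index = SwappableIndex(dict(raw_store))             # initial, unprojected collection
index.rebuild(raw_store, project=lambda v: v * 10)  # stands in for a PCA projection
print(index.search("doc-1"))                        # prints 10.0
```

Because the raw values are never overwritten, `rebuild` can be called again with a different projection whenever the corpus changes.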

The underlying research (Giri, 2026) reports that PCA-32 fitted on a medical corpus achieves MAP 0.9203 against a baseline of 0.8750, a relative improvement of about 5.2% (+0.0453 absolute), with no fine-tuning. Key findings from the paper that inform this implementation:

Fit on documents only, not queries — joint fitting degrades performance.
Domain-directed axes are necessary — PCA on a generic corpus does not transfer.
Gains strengthen with corpus diversity — the more varied the documents, the more headroom PCA has to find meaningful directions.
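As a rough illustration of the first finding, the sketch below fits PCA on document vectors only and then reuses the same mean and components to project a query. It is a NumPy-only stand-in on synthetic data, not the library's `PCAEngine`; the helper names (`fit_pca`, `project`, `normalise`) and the data shapes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a domain corpus: 500 "documents" in 64 dimensions
# whose variance is concentrated in 8 latent directions plus small noise.
latent = rng.normal(size=(500, 8))
mixing = rng.normal(size=(8, 64))
docs = (latent @ mixing + 0.05 * rng.normal(size=(500, 64))).astype(np.float32)


def fit_pca(X, n_components):
    """Return (mean, components) from an SVD of the centred data."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(X, mean, components):
    """Centre with the fitted mean, then project onto the fitted axes."""
    return (X - mean) @ components.T


def normalise(X):
    """Scale each row to unit length so inner product equals cosine."""
    return X / np.linalg.norm(X, axis=-1, keepdims=True)


# Fit on documents only; queries never influence the axes.
mean, components = fit_pca(docs, n_components=8)
docs_low = normalise(project(docs, mean, components))

# A query is projected with the same mean and components before search.
query = docs[42] + 0.01 * rng.normal(size=64).astype(np.float32)
query_low = normalise(project(query[None, :], mean, components))

scores = (docs_low @ query_low.T).ravel()   # cosine similarity per document
top = np.argsort(-scores)[:5]               # indices of the 5 best matches
print(docs_low.shape, top[0])
```

The query here is a lightly perturbed copy of document 42, so that document should come back as the top hit after projection.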

Installation

pip install -e ".[dev]"

Dependencies: numpy, scikit-learn, joblib, faiss-cpu

Quick start

import numpy as np
from domain_vector_optimizer import DomainVectorStore, FaissBackend
from domain_vector_optimizer.core.domain import DomainConfig
from domain_vector_optimizer.core.enums import PCAStrategy

# 1. Create the store
store = DomainVectorStore(
    backend=FaissBackend(index_dir="./faiss_indices"),
    db_path="./metadata.db",
    model_dir="./pca_models",
)

# 2. Create a domain
domain = store.create_domain(
    name="legal",
    embedding_dim=1536,
    config=DomainConfig(
        pca_strategy=PCAStrategy.EXPLAINED_VARIANCE,
        pca_variance=0.95,       # retain 95% of variance
        min_docs_for_pca=100,
    ),
)

# 3. Add documents
store.add_document(
    domain_id=domain.id,
    doc_id="contract-001",
    embedding=np.array([...]),   # shape: (1536,)
    content="Force majeure clause...",
    metadata={"year": 2023, "type": "contract"},
)

# Bulk add
store.add_documents(domain.id, [
    {"id": "doc-1", "embedding": vec1, "content": "...", "metadata": {"k": "v"}},
    {"id": "doc-2", "embedding": vec2, "content": "..."},
])

# 4. Build PCA — fits on all domain embeddings, then atomically re-indexes
store.build_pca(domain.id)

# Check what happened
info = store.pca_info(domain.id)
print(info["pca_dim"])                    # e.g. 312 (from 1536)
print(info["total_explained_variance"])   # e.g. 0.951

# 5. Search — query is automatically projected through domain PCA
results = store.search(
    domain_id=domain.id,
    query_embedding=np.array([...]),
    top_k=10,
)
for r in results:
    print(r.doc_id, r.score, r.content, r.metadata)

API reference

`DomainVectorStore`

DomainVectorStore(backend, db_path, model_dir)

Parameter	Type	Description
`backend`	`VectorStoreBackend`	Vector index backend (e.g. `FaissBackend`)
`db_path`	`str`	Path to SQLite file for raw embeddings + metadata
`model_dir`	`str`	Directory for persisted PCA model files

Domain management

Method	Description
`create_domain(name, embedding_dim, config, description)`	Create a new domain; returns `Domain`
`get_domain(domain_id)`	Fetch domain by ID
`list_domains()`	List all domains
`delete_domain(domain_id)`	Delete domain, all its documents, PCA models, and index

Document management

Method	Description
`add_document(domain_id, doc_id, embedding, content, metadata)`	Add or update a document
`add_documents(domain_id, documents)`	Bulk add (list of dicts with `id`, `embedding`, optional `content`/`metadata`)
`get_document(domain_id, doc_id)`	Fetch a document
`delete_document(domain_id, doc_id)`	Remove from store and index

PCA

Method	Description
`build_pca(domain_id, n_components=None)`	Fit PCA and atomically re-index all documents
`pca_info(domain_id)`	Returns status, dimensions, explained variance, fitted doc count

Search

Method	Description
`search(domain_id, query_embedding, top_k, filters)`	Nearest-neighbour search; query is auto-projected if PCA is built

`DomainConfig`

DomainConfig(
    pca_strategy="explained_variance",  # or "fixed_dim"
    pca_n_components=256,               # used when strategy="fixed_dim"
    pca_variance=0.95,                  # used when strategy="explained_variance"
    min_docs_for_pca=10,                # minimum docs required before build_pca succeeds
    distance_metric="cosine",           # "cosine" | "l2" | "ip"
    batch_size=512,                     # streaming batch size for re-indexing
)

PCA strategies

Strategy	Config key	Behaviour
`fixed_dim`	`pca_n_components=256`	Always reduce to exactly N dimensions (capped to `min(N, n_samples-1, n_features)`)
`explained_variance`	`pca_variance=0.95`	Let sklearn choose the fewest components that explain ≥95% of variance
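The two strategies can be sketched as a single dimension-selection function. This is a NumPy-only approximation written for illustration: `choose_n_components` is a hypothetical helper, not part of the library, and the library itself delegates the `explained_variance` choice to scikit-learn, whose exact threshold handling may differ slightly.

```python
import numpy as np


def choose_n_components(X, strategy, n_components=256, variance=0.95):
    """Pick a PCA dimensionality for data X of shape (n_samples, n_features)."""
    n_samples, n_features = X.shape
    if strategy == "fixed_dim":
        # Cap the request as described in the table above.
        return min(n_components, n_samples - 1, n_features)
    if strategy == "explained_variance":
        centred = X - X.mean(axis=0)
        singular_values = np.linalg.svd(centred, compute_uv=False)
        ratios = singular_values**2 / np.sum(singular_values**2)
        # Fewest leading components whose cumulative ratio reaches the target.
        k = int(np.searchsorted(np.cumsum(ratios), variance, side="left")) + 1
        return min(k, len(ratios))
    raise ValueError(f"unknown strategy: {strategy!r}")


# Two orthogonal, zero-mean columns carrying 90% and 10% of the variance.
X = np.array([[3.0, 0.0], [-3.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
print(choose_n_components(X, "explained_variance", variance=0.85))  # 1
print(choose_n_components(X, "explained_variance", variance=0.95))  # 2
print(choose_n_components(X, "fixed_dim", n_components=256))        # 2
```

With `fixed_dim`, a request for 256 dimensions on this 4 x 2 matrix is capped to 2, which is why a small corpus can yield fewer dimensions than configured.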

Project structure

src/domain_vector_optimizer/
├── store.py                  # DomainVectorStore — main entry point
├── core/
│   ├── domain.py             # Domain, DomainConfig
│   ├── document.py           # Document, SearchResult
│   └── enums.py              # PCAStatus, PCAStrategy, DistanceMetric
├── storage/
│   ├── base.py               # RawStore ABC
│   └── sqlite_store.py       # SQLite implementation
├── pca/
│   ├── engine.py             # PCAEngine + PCAModel
│   └── model_store.py        # joblib-based model persistence
└── index/
    ├── manager.py             # Atomic swap re-indexing
    └── backends/
        ├── base.py            # VectorStoreBackend ABC
        └── faiss_backend.py   # FAISS implementation (in-memory + disk)

Running tests

pytest tests/ -v

73 tests across the unit and integration suites:

File	What it covers
`tests/unit/test_pca_engine.py`	Fit strategies, dimension capping, transform dtype, cluster ordering
`tests/unit/test_document_store.py`	Domain/doc CRUD, embedding byte roundtrip, streaming, cascade delete
`tests/unit/test_faiss_backend.py`	All metrics, upsert/update, deletion, disk persistence
`tests/integration/test_end_to_end.py`	Full lifecycle, PCA rebuild, cluster separation quality

Design notes

Raw embeddings are always kept. SQLite stores the original float32 arrays so PCA can be re-fitted any time (e.g. after the corpus doubles).
Atomic re-indexing. build_pca creates a new FAISS collection, populates it fully, then switches the domain's active_collection pointer — searches never hit a half-built index.
String → int ID mapping. FAISS requires integer IDs; the backend maintains a bidirectional map and handles upserts as delete + re-add.
Cosine similarity is implemented as L2-normalise + inner product (standard practice; avoids a separate normalised index type).
Pluggable backend. Implement VectorStoreBackend (6 abstract methods) to swap in Qdrant, Pinecone, or any other store.
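Two of the notes above, the string-to-integer ID mapping and cosine similarity as L2-normalise plus inner product, can be illustrated with a toy in-memory index. `ToyCosineIndex` is a hypothetical class written for this example; it uses plain NumPy in place of FAISS and does not implement the real `VectorStoreBackend` interface.

```python
import numpy as np


class ToyCosineIndex:
    """In-memory stand-in for an index that only accepts integer IDs."""

    def __init__(self):
        self._vectors = {}      # internal int ID -> unit-length vector
        self._str_to_int = {}   # external string ID -> internal int ID
        self._int_to_str = {}   # internal int ID -> external string ID
        self._next_id = 0

    def upsert(self, doc_id, vector):
        # Upsert = delete any existing entry, then add under a fresh int ID.
        self.delete(doc_id)
        vec = np.asarray(vector, dtype=np.float32)
        vec = vec / np.linalg.norm(vec)      # L2-normalise once at insert time
        int_id = self._next_id
        self._next_id += 1
        self._str_to_int[doc_id] = int_id
        self._int_to_str[int_id] = doc_id
        self._vectors[int_id] = vec

    def delete(self, doc_id):
        int_id = self._str_to_int.pop(doc_id, None)
        if int_id is not None:
            del self._int_to_str[int_id]
            del self._vectors[int_id]

    def search(self, query, top_k):
        if not self._vectors:
            return []
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        int_ids = list(self._vectors)
        matrix = np.stack([self._vectors[i] for i in int_ids])
        scores = matrix @ q                  # inner product of unit vectors = cosine
        order = np.argsort(-scores)[:top_k]
        return [(self._int_to_str[int_ids[i]], float(scores[i])) for i in order]


index = ToyCosineIndex()
index.upsert("a", [1, 0])
index.upsert("b", [1, 1])
index.upsert("a", [0, 5])                    # update: old vector for "a" is replaced
hits = index.search([0, 1], top_k=2)
print(hits[0][0])                            # prints a
```

Normalising at insert and query time means a plain inner-product index is enough for cosine search, which is the reason given above for not keeping a separate normalised index type.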

Detailed architecture

See ARCHITECTURE.md for component diagrams, data flow, and scaling considerations.

Citation

If you use this library, please cite the underlying research:

@misc{giri2026domainpca,
  author       = {Giri, Sandeep},
  title        = {Domain-Focused PCA on Text Embeddings Improves Semantic Retrieval: A Medical Domain Study},
  year         = {2026},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.20320367},
  url          = {https://doi.org/10.5281/zenodo.20320367}
}
