terno-ai/domain_vector_optimizer

Domain Vector Optimizer

A domain-specific vector store that improves embedding search quality using principal component analysis (PCA). Fitting a PCA model on a domain's own documents projects embeddings into a lower-dimensional subspace that concentrates domain-relevant variance, which speeds up approximate nearest-neighbour (ANN) search and can improve retrieval recall.

Based on: Giri, S. (2026). Domain-Focused PCA on Text Embeddings Improves Semantic Retrieval: A Medical Domain Study. Zenodo. https://doi.org/10.5281/zenodo.20320367


How it works

General-purpose embedding models (OpenAI, BGE, Sentence-BERT) encode the geometry of language in general. For a narrow domain such as legal contracts or medical literature, most of that variance is irrelevant. Fitting PCA on the domain corpus and projecting both documents and queries into the resulting subspace:

  • Reduces dimensionality: fewer dimensions mean faster approximate nearest-neighbour search
  • Improves recall: high-variance, domain-relevant directions are kept; low-variance ones are discarded
  • Zero search downtime: re-indexing uses an atomic collection swap, so the old index serves queries until the new one is ready
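
The projection step described above can be sketched with scikit-learn alone. This is a minimal illustration, not the library's internal code: the random vectors are placeholders for real embeddings, and the sizes (500 documents, 128 dimensions, 32 components) are arbitrary assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data: random vectors standing in for real document/query embeddings.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(500, 128)).astype(np.float32)
query_embedding = rng.normal(size=(128,)).astype(np.float32)

# Fit PCA on the documents only, then project documents and query
# into the same lower-dimensional subspace.
pca = PCA(n_components=32).fit(doc_embeddings)
docs_proj = pca.transform(doc_embeddings)                      # shape (500, 32)
query_proj = pca.transform(query_embedding.reshape(1, -1))[0]  # shape (32,)


def l2_normalise(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Brute-force cosine search in the projected space.
scores = l2_normalise(docs_proj) @ l2_normalise(query_proj)
top_k = np.argsort(-scores)[:10]
print(docs_proj.shape, query_proj.shape, top_k.shape)
```

With random input the ranking is meaningless; the point is only that the query goes through the same fitted transform as the documents before any distances are computed.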

Raw (unprojected) embeddings are always preserved in SQLite, so PCA can be rebuilt at any time as the corpus grows.

The underlying research (Giri, 2026) reports that PCA-32 fitted on a medical corpus achieves MAP 0.9203 against a baseline of 0.8750, a relative improvement of about 5.2%, without any fine-tuning. Key findings from the paper that inform this implementation:

  • Fit on documents only, not queries — joint fitting degrades performance.
  • Domain-directed axes are necessary — PCA on a generic corpus does not transfer.
  • Gains strengthen with corpus diversity — the more varied the documents, the more headroom PCA has to find meaningful directions.
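
For readers unfamiliar with the MAP figures quoted above, here is a small sketch of the standard definition of mean average precision, followed by the arithmetic behind the relative gain. The paper's exact evaluation protocol is not described in this README, so the functions and the toy rankings below are illustrative assumptions.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@rank taken at each rank where a relevant doc appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy rankings (hypothetical): two queries with known relevant documents.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "y", "z"], {"y"}),            # AP = 1/2
]
print(round(mean_average_precision(runs), 4))  # 0.6667

# Relative gain implied by the two MAP values quoted from the paper.
relative_gain = (0.9203 - 0.8750) / 0.8750
print(round(relative_gain * 100, 1))  # 5.2
```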

Installation

pip install -e ".[dev]"

Dependencies: numpy, scikit-learn, joblib, faiss-cpu


Quick start

import numpy as np
from domain_vector_optimizer import DomainVectorStore, FaissBackend
from domain_vector_optimizer.core.domain import DomainConfig
from domain_vector_optimizer.core.enums import PCAStrategy

# 1. Create the store
store = DomainVectorStore(
    backend=FaissBackend(index_dir="./faiss_indices"),
    db_path="./metadata.db",
    model_dir="./pca_models",
)

# 2. Create a domain
domain = store.create_domain(
    name="legal",
    embedding_dim=1536,
    config=DomainConfig(
        pca_strategy=PCAStrategy.EXPLAINED_VARIANCE,
        pca_variance=0.95,       # retain 95% of variance
        min_docs_for_pca=100,
    ),
)

# 3. Add documents
store.add_document(
    domain_id=domain.id,
    doc_id="contract-001",
    embedding=np.array([...]),   # shape: (1536,)
    content="Force majeure clause...",
    metadata={"year": 2023, "type": "contract"},
)

# Bulk add (vec1 and vec2 stand for np.ndarray embeddings of shape (1536,))
store.add_documents(domain.id, [
    {"id": "doc-1", "embedding": vec1, "content": "...", "metadata": {"k": "v"}},
    {"id": "doc-2", "embedding": vec2, "content": "..."},
])

# 4. Build PCA — fits on all domain embeddings, then atomically re-indexes
store.build_pca(domain.id)

# Check what happened
info = store.pca_info(domain.id)
print(info["pca_dim"])                    # e.g. 312 (from 1536)
print(info["total_explained_variance"])   # e.g. 0.951

# 5. Search — query is automatically projected through domain PCA
results = store.search(
    domain_id=domain.id,
    query_embedding=np.array([...]),
    top_k=10,
)
for r in results:
    print(r.doc_id, r.score, r.content, r.metadata)
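
To try the quick start without an embedding model, placeholder documents can be generated with NumPy. The helper below is hypothetical and not part of the library; it only produces dicts in the shape that the bulk-add call above uses (id, embedding, optional content and metadata). Random vectors carry no semantic structure, so this is useful for exercising the pipeline, not for judging search quality.

```python
import numpy as np


def make_placeholder_documents(n_docs, dim=1536, seed=0):
    """Build dicts with the keys used by the bulk-add example (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    vectors = rng.normal(size=(n_docs, dim)).astype(np.float32)
    return [
        {
            "id": f"doc-{i}",
            "embedding": vectors[i],
            "content": f"placeholder text {i}",
            "metadata": {"source": "synthetic"},
        }
        for i in range(n_docs)
    ]


# 120 documents clears the min_docs_for_pca=100 threshold set in the quick start.
documents = make_placeholder_documents(n_docs=120)
print(len(documents), documents[0]["embedding"].shape)
```

Assuming the API shown in the quick start, the resulting list could be passed as the second argument to store.add_documents before calling store.build_pca.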

API reference

DomainVectorStore

DomainVectorStore(backend, db_path, model_dir)

| Parameter | Type | Description |
|---|---|---|
| backend | VectorStoreBackend | Vector index backend (e.g. FaissBackend) |
| db_path | str | Path to SQLite file for raw embeddings + metadata |
| model_dir | str | Directory for persisted PCA model files |

Domain management

| Method | Description |
|---|---|
| create_domain(name, embedding_dim, config, description) | Create a new domain; returns Domain |
| get_domain(domain_id) | Fetch domain by ID |
| list_domains() | List all domains |
| delete_domain(domain_id) | Delete domain, all its documents, PCA models, and index |

Document management

| Method | Description |
|---|---|
| add_document(domain_id, doc_id, embedding, content, metadata) | Add or update a document |
| add_documents(domain_id, documents) | Bulk add (list of dicts with id, embedding, optional content/metadata) |
| get_document(domain_id, doc_id) | Fetch a document |
| delete_document(domain_id, doc_id) | Remove from store and index |

PCA

| Method | Description |
|---|---|
| build_pca(domain_id, n_components=None) | Fit PCA and atomically re-index all documents |
| pca_info(domain_id) | Returns status, dimensions, explained variance, fitted doc count |
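
The atomic re-index that build_pca performs can be pictured with a toy registry. This is a sketch of the general build-then-swap pattern under assumed names (CollectionRegistry, rebuild, active_vectors are all hypothetical); it does not reproduce the library's index manager or its FAISS handling.

```python
import threading


class CollectionRegistry:
    """Toy stand-in for an index manager that swaps collections atomically."""

    def __init__(self):
        self._collections = {}   # collection name -> list of vectors
        self._active = None      # name of the collection serving queries
        self._lock = threading.Lock()

    def active_name(self):
        with self._lock:
            return self._active

    def active_vectors(self):
        # Readers grab a reference under the lock, so they see either the
        # old collection or the new one, never a partially filled one.
        with self._lock:
            return self._collections.get(self._active, [])

    def rebuild(self, new_name, vectors, project):
        # Slow part: build the replacement off to the side, no lock held.
        staged = [project(v) for v in vectors]
        # Fast part: register it, switch the pointer, drop the old collection.
        with self._lock:
            old_name = self._active
            self._collections[new_name] = staged
            self._active = new_name
            if old_name is not None and old_name != new_name:
                del self._collections[old_name]
        return old_name


registry = CollectionRegistry()
raw = [[1.0, 2.0, 3.0]]
registry.rebuild("legal_v1", raw, project=lambda v: v)
before = registry.active_vectors()
# Re-index with a stand-in "projection" that keeps the first two dimensions.
registry.rebuild("legal_v2", raw, project=lambda v: v[:2])
after = registry.active_vectors()
print(before, after, registry.active_name())
```

Because the raw vectors are kept separately, the rebuild can be repeated with a different projection at any time without touching the collection that is currently serving queries.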

Search

| Method | Description |
|---|---|
| search(domain_id, query_embedding, top_k, filters) | Nearest-neighbour search; query is auto-projected if PCA is built |

DomainConfig

DomainConfig(
    pca_strategy="explained_variance",  # or "fixed_dim"
    pca_n_components=256,               # used when strategy="fixed_dim"
    pca_variance=0.95,                  # used when strategy="explained_variance"
    min_docs_for_pca=10,                # minimum docs required before build_pca succeeds
    distance_metric="cosine",           # "cosine" | "l2" | "ip"
    batch_size=512,                     # streaming batch size for re-indexing
)

PCA strategies

| Strategy | Config key | Behaviour |
|---|---|---|
| fixed_dim | pca_n_components=256 | Always reduce to exactly N dimensions (capped to min(N, n_samples-1, n_features)) |
| explained_variance | pca_variance=0.95 | Let sklearn choose the fewest components that explain ≥95% of variance |
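
The two strategies correspond to the two ways scikit-learn's PCA accepts n_components: an integer for a fixed size, or a float in (0, 1) for a variance target. The helper below is a hypothetical sketch of that mapping, including the cap from the table; the library's own PCAEngine may differ in detail.

```python
import numpy as np
from sklearn.decomposition import PCA


def fit_pca(strategy, X, n_components=256, variance=0.95):
    """Fit PCA according to a strategy name (hypothetical helper)."""
    n_samples, n_features = X.shape
    if strategy == "fixed_dim":
        # Cap the request so it never exceeds what the data can support.
        k = min(n_components, n_samples - 1, n_features)
        return PCA(n_components=k, svd_solver="full").fit(X)
    if strategy == "explained_variance":
        # A float in (0, 1) asks sklearn for the fewest components whose
        # cumulative explained variance reaches that fraction.
        return PCA(n_components=variance, svd_solver="full").fit(X)
    raise ValueError(f"unknown strategy: {strategy!r}")


# Placeholder data: 200 samples, 64 features with decaying per-feature scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64)) * np.linspace(3.0, 0.1, 64)

fixed = fit_pca("fixed_dim", X, n_components=256)
by_variance = fit_pca("explained_variance", X, variance=0.95)

print(fixed.n_components_)  # 64: the request for 256 is capped by n_features
print(by_variance.n_components_, by_variance.explained_variance_ratio_.sum())
```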

Project structure

src/domain_vector_optimizer/
├── store.py                  # DomainVectorStore — main entry point
├── core/
│   ├── domain.py             # Domain, DomainConfig
│   ├── document.py           # Document, SearchResult
│   └── enums.py              # PCAStatus, PCAStrategy, DistanceMetric
├── storage/
│   ├── base.py               # RawStore ABC
│   └── sqlite_store.py       # SQLite implementation
├── pca/
│   ├── engine.py             # PCAEngine + PCAModel
│   └── model_store.py        # joblib-based model persistence
└── index/
    ├── manager.py             # Atomic swap re-indexing
    └── backends/
        ├── base.py            # VectorStoreBackend ABC
        └── faiss_backend.py   # FAISS implementation (in-memory + disk)

Running tests

pytest tests/ -v

73 tests across three unit-test modules and one integration suite:

| File | What it covers |
|---|---|
| tests/unit/test_pca_engine.py | Fit strategies, dimension capping, transform dtype, cluster ordering |
| tests/unit/test_document_store.py | Domain/doc CRUD, embedding byte roundtrip, streaming, cascade delete |
| tests/unit/test_faiss_backend.py | All metrics, upsert/update, deletion, disk persistence |
| tests/integration/test_end_to_end.py | Full lifecycle, PCA rebuild, cluster separation quality |

Design notes

  • Raw embeddings are always kept. SQLite stores the original float32 arrays so PCA can be re-fitted any time (e.g. after the corpus doubles).
  • Atomic re-indexing. build_pca creates a new FAISS collection, populates it fully, then switches the domain's active_collection pointer — searches never hit a half-built index.
  • String → int ID mapping. FAISS requires integer IDs; the backend maintains a bidirectional map and handles upserts as delete + re-add.
  • Cosine similarity is implemented as L2-normalise + inner product (standard practice; avoids a separate normalised index type).
  • Pluggable backend. Implement VectorStoreBackend (6 abstract methods) to swap in Qdrant, Pinecone, or any other store.
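
Two of the notes above, the string-to-integer ID mapping and cosine similarity as normalisation plus inner product, can be illustrated with a toy in-memory index built on NumPy. ToyCosineIndex is a hypothetical class written for this example; it does not use FAISS and does not implement the VectorStoreBackend interface.

```python
import numpy as np


class ToyCosineIndex:
    """In-memory index: string IDs mapped to ints, cosine via unit vectors."""

    def __init__(self, dim):
        self.dim = dim
        self._str_to_int = {}   # "contract-001" -> 0
        self._int_to_str = {}   # 0 -> "contract-001"
        self._vectors = {}      # int id -> unit-length vector
        self._next_id = 0

    def __len__(self):
        return len(self._vectors)

    def _unit(self, vector):
        v = np.asarray(vector, dtype=np.float32)
        if v.shape != (self.dim,):
            raise ValueError(f"expected shape ({self.dim},), got {v.shape}")
        norm = np.linalg.norm(v)
        if norm == 0:
            raise ValueError("cannot normalise a zero vector")
        return v / norm

    def delete(self, doc_id):
        int_id = self._str_to_int.pop(doc_id, None)
        if int_id is None:
            return False
        del self._int_to_str[int_id]
        del self._vectors[int_id]
        return True

    def upsert(self, doc_id, vector):
        unit = self._unit(vector)
        self.delete(doc_id)             # upsert = delete + re-add
        int_id = self._next_id          # a fresh integer ID each time
        self._next_id += 1
        self._str_to_int[doc_id] = int_id
        self._int_to_str[int_id] = doc_id
        self._vectors[int_id] = unit

    def search(self, query, top_k=5):
        if not self._vectors:
            return []
        q = self._unit(query)
        ids = list(self._vectors)
        matrix = np.stack([self._vectors[i] for i in ids])
        scores = matrix @ q             # inner product of unit vectors = cosine
        order = np.argsort(-scores)[:top_k]
        return [(self._int_to_str[ids[i]], float(scores[i])) for i in order]


index = ToyCosineIndex(dim=2)
index.upsert("a", [1.0, 0.0])
index.upsert("b", [0.0, 1.0])
hits = index.search([2.0, 0.0], top_k=1)   # query length does not matter
index.upsert("a", [0.0, 3.0])               # replaces the earlier "a"
hits_after = index.search([0.0, 1.0], top_k=2)
print(hits, hits_after, len(index))
```

Normalising at insert time means the search itself is a plain inner product, which is why a single inner-product index type is enough to serve cosine queries.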

Detailed architecture

See ARCHITECTURE.md for component diagrams, data flow, and scaling considerations.


Citation

If you use this library, please cite the underlying research:

@misc{giri2026domainpca,
  author       = {Giri, Sandeep},
  title        = {Domain-Focused PCA on Text Embeddings Improves Semantic Retrieval: A Medical Domain Study},
  year         = {2026},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.20320367},
  url          = {https://doi.org/10.5281/zenodo.20320367}
}
