Migrate storage from better-sqlite3 + sqlite-vec to libSQL with native vector search (local + external sqld/Turso) #435

@arabold

Summary

Replace the better-sqlite3 driver and the sqlite-vec extension with libSQL (the libsql npm binding) and libSQL native vector search. This removes the Node-version ABI lock imposed by better-sqlite3 and unlocks an optional external database backend (self-hosted sqld or Turso Cloud) using the same code path as local file storage.

The user chooses the backend via a connection string:

  • Local (default): embedded libSQL against a local file — same single-binary experience as today.
  • External: a remote sqld/Turso URL (+ auth token). Whether that is self-hosted sqld or managed Turso is entirely the user's choice.
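A minimal sketch of how a connection string might be classified, assuming the URL schemes listed in this proposal (libsql://, http://, https://). The `Backend` type and `resolveBackend` helper are hypothetical names for illustration, not existing project code:

```typescript
// Hypothetical types and helper; none of these names exist in the project today.
type Backend =
  | { kind: "local"; path: string }
  | { kind: "remote"; url: string; authToken?: string };

// Schemes taken from the proposal text; anything else is treated as a file path.
const REMOTE_PREFIXES = ["libsql://", "http://", "https://"];

function resolveBackend(connection: string, authToken?: string): Backend {
  const lower = connection.toLowerCase();
  if (REMOTE_PREFIXES.some((prefix) => lower.startsWith(prefix))) {
    // Only attach the token when one was supplied.
    return { kind: "remote", url: connection, ...(authToken ? { authToken } : {}) };
  }
  // Relative paths, absolute paths and ":memory:" all fall through to local.
  return { kind: "local", path: connection };
}

console.log(resolveBackend("./data/documents.db").kind);
console.log(resolveBackend("libsql://example.invalid", "token").kind);
```

Keeping the decision in one pure function would let the rest of the storage layer branch on `kind` without re-parsing the string.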

Motivation

  • Node version lock: better-sqlite3 compiles against the V8 ABI (NODE_MODULE_VERSION), so its prebuilt binary is pinned per Node major. This is why the project is pinned to Node 22. libSQL's libsql binding uses N-API (napi-rs), giving ABI-stable prebuilds across Node majors. (The N-API support request, WiseLibs/better-sqlite3#271, has gone unanswered by the better-sqlite3 maintainers since 2019, so this is unlikely to be fixed upstream.)
  • Extension wall on remote: Remote sqld/Turso connections cannot load extensions dynamically (no loadExtension), so sqlite-vec's vec0 virtual table cannot be used against an external server. Native libSQL vectors are built into the engine and work identically across local, self-hosted sqld, and Turso Cloud.
  • Multi-process concurrency: Today, running multiple processes (usually read-only) against a shared SQLite file is unstable and unsupported. An external sqld (or embedded replicas) gives a robust path: one server serializes writes, and clients connect over the wire instead of sharing a file + -shm.

Goals

  • Single storage driver (libsql) covering local file, remote sqld/Turso, and (optionally) embedded replica, selected by connection string.
  • Vector search implemented with native libSQL vectors (F32_BLOB columns + vector_distance_cos / optional vector_top_k DiskANN index), no runtime extension required.
  • Preserve current hybrid search behavior (vector + FTS5 with RRF ranking) and search quality.
  • One-time, non-destructive migration of existing local databases (no re-embedding).
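For reference, the RRF step that the hybrid-search goal wants to preserve can be sketched outside SQL as follows. This is the standard reciprocal rank fusion formula, score(d) = Σ weight / (k + rank(d)), not the project's actual query; the function name, the default k = 60 and the weight shape are assumptions for illustration:

```typescript
// Hypothetical helper illustrating reciprocal rank fusion (RRF).
// Each input is a list of document ids ordered best-first, without duplicates.
interface RrfWeights {
  vec: number;
  fts: number;
}

function rrfFuse(
  vectorIds: number[],
  ftsIds: number[],
  k = 60, // conventional smoothing constant; the project's value may differ
  weights: RrfWeights = { vec: 1, fts: 1 },
): Array<{ id: number; score: number }> {
  const scores = new Map<number, number>();
  const accumulate = (ids: number[], weight: number): void => {
    ids.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  };
  accumulate(vectorIds, weights.vec);
  accumulate(ftsIds, weights.fts);
  // Highest fused score first; ties broken by id for a stable order.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score || a.id - b.id);
}

console.log(rrfFuse([1, 2, 3], [3, 1, 4]).map((r) => r.id));
```

Because RRF depends only on ranks, not raw distances, swapping the vector backend should leave the fused ordering unchanged as long as the per-list rankings match.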

Non-Goals

  • Switching to the async @libsql/client. We intend to use the synchronous libsql binding (better-sqlite3-compatible API) to avoid an await-everywhere refactor.
  • Adopting the rewritten Turso DB (Rust) engine — it does not support SQLite loadable extensions and is still beta. Out of scope.
  • Multi-writer concurrency / BEGIN CONCURRENT. Not part of standard libSQL.

Background: why native vectors are an acceptable swap

The current sqlite-vec (0.1.x) path is exact brute-force KNN with over-fetch and post-filter: a global top-k via MATCH … k=?, then a JOIN to documents/pages/versions/libraries filtered by library/version, with the over-fetch controlled by the existing vectorSearchMultiplier. libSQL native vectors give the same shape:

  • Unindexed ORDER BY vector_distance_cos(embedding, ?) LIMIT k is exact and supports arbitrary WHERE filtering (parity with today).
  • Optional vector_top_k('idx', q, k) JOIN … WHERE … adds an approximate DiskANN index for scale — a capability sqlite-vec 0.1.x does not have.

We currently only use cosine distance, which native libSQL supports (vector_distance_cos). Acceptable trade-offs: distance functions limited to cosine/L2/dot/jaccard; DiskANN is approximate if adopted.
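The two query shapes described above could look roughly like this. It is a sketch only: the table names come from this proposal, but the join columns (page_id, version_id, library_id), the name columns, the index name documents_embedding_idx, and both helper functions are assumptions, and passing the query vector as JSON text to vector32() should be verified against the target libSQL version:

```typescript
// Exact search: filtering happens before ORDER BY, so the result is an exact top-k.
const EXACT_SEARCH_SQL = `
  SELECT d.id, vector_distance_cos(d.embedding, vector32(?)) AS distance
  FROM documents d
  JOIN pages p ON p.id = d.page_id
  JOIN versions v ON v.id = p.version_id
  JOIN libraries l ON l.id = v.library_id
  WHERE l.name = ? AND v.name = ? AND d.embedding IS NOT NULL
  ORDER BY distance ASC
  LIMIT ?
`;

// Approximate search: the index returns candidate rowids first, filters apply afterwards,
// so the candidate count is over-fetched and the final LIMIT trims the result.
const ANN_SEARCH_SQL = `
  SELECT d.id, vector_distance_cos(d.embedding, vector32(?)) AS distance
  FROM vector_top_k('documents_embedding_idx', vector32(?), ?) AS t
  JOIN documents d ON d.rowid = t.id
  JOIN pages p ON p.id = d.page_id
  JOIN versions v ON v.id = p.version_id
  JOIN libraries l ON l.id = v.library_id
  WHERE l.name = ? AND v.name = ?
  ORDER BY distance ASC
  LIMIT ?
`;

// Hypothetical helper: number of candidates to request before post-filtering.
function candidateCount(limit: number, vectorSearchMultiplier: number): number {
  return Math.max(limit, Math.ceil(limit * vectorSearchMultiplier));
}

// Hypothetical helper: serialize an embedding as the text form accepted by vector32().
function toVectorParam(embedding: readonly number[]): string {
  return JSON.stringify(embedding);
}

// Parameter order matches the placeholders in ANN_SEARCH_SQL.
function annParams(
  embedding: readonly number[],
  library: string,
  version: string,
  limit: number,
  vectorSearchMultiplier: number,
): Array<string | number> {
  const vec = toVectorParam(embedding);
  return [vec, vec, candidateCount(limit, vectorSearchMultiplier), library, version, limit];
}

console.log(annParams([0.5, 1], "react", "18.0.0", 10, 2.5));
```

In the exact variant the WHERE clause narrows rows before ordering, so over-fetch mainly matters for the vector_top_k variant, where the index is unaware of the library/version filter.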

Proposed architecture

  1. Connection abstraction: new Database(path | url, options) where a libsql:// / http(s):// URL + auth token selects the external backend, and a filesystem path selects local. Storage path resolution stays for the local case.
  2. Connection-aware pragmas: applyMigrations currently uses db.pragma(...) heavily (WAL, foreign_keys, busy_timeout, mmap_size, etc.). libsql does not support pragma(); convert to db.exec("PRAGMA …"), and only apply file-level pragmas (WAL/mmap/cache) for the local backend (no-ops / server-managed on remote).
  3. Native vector schema: store embeddings in a native vector column (e.g. documents.embedding F32_BLOB(<dim>), dimension still configurable) instead of the documents_vec vec0 virtual table. Maintain an optional libsql_vector_idx index. Remove the documents_vec table and its insert/update/delete triggers.
  4. Rewrite hybrid search query in DocumentStore.ts to use native vector functions while keeping FTS5 + RRF intact. Preserve over-fetch/multiplier semantics.
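Step 2 might be sketched as below. The pragma values are placeholders rather than the project's real settings, the split between shared and file-level pragmas is a judgment call to confirm against sqld, and `ExecCapable`, `pragmasFor` and `applyPragmas` are hypothetical names. The only driver call assumed is exec(sql), which this proposal already relies on:

```typescript
type BackendKind = "local" | "remote";

// Placeholder values for illustration only.
const SHARED_PRAGMAS = ["PRAGMA foreign_keys = ON", "PRAGMA busy_timeout = 5000"];
const FILE_LEVEL_PRAGMAS = [
  "PRAGMA journal_mode = WAL",
  "PRAGMA mmap_size = 268435456",
  "PRAGMA cache_size = -64000",
];

// File-level pragmas are skipped on remote backends, where the server manages them.
function pragmasFor(kind: BackendKind): string[] {
  return kind === "local" ? [...SHARED_PRAGMAS, ...FILE_LEVEL_PRAGMAS] : [...SHARED_PRAGMAS];
}

// Minimal structural type so the sketch does not depend on a specific driver.
interface ExecCapable {
  exec(sql: string): unknown;
}

function applyPragmas(db: ExecCapable, kind: BackendKind): void {
  for (const statement of pragmasFor(kind)) {
    db.exec(statement);
  }
}

// Usage with a recording stub in place of a real connection.
const executed: string[] = [];
applyPragmas({ exec: (sql) => executed.push(sql) }, "remote");
console.log(executed);
```

Depending on the structural `ExecCapable` type rather than the driver class keeps this logic testable without opening a database.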

Vector data migration (the important detail)

documents.embedding is the source of truth (migration 011 rebuilds documents_vec from it via json_extract). So existing vectors can be converted to the native representation without re-embedding.

Two distinct paths, because remote cannot replay the legacy vec0 migrations:

  • Existing local DB (in-place upgrade): a new migration converts documents.embedding into the native vector column, builds the vector index, and drops documents_vec + triggers. sqlite-vec must still be loadable at this step, because dropping a vec0 virtual table requires the module to be registered. Keep sqlite-vec bundled through the transition, then retire it in a later release.
  • Fresh DB / external sqld: must not replay migrations 000/003/011 (they CREATE VIRTUAL TABLE … USING vec0, which fails where extensions can't load). Introduce a squashed baseline schema that creates the native-vector schema directly, so brand-new local and all remote databases never touch vec0.

Answering the design question directly: "keep the extension bundled but stop using it" works for the one-time local conversion, but is not sufficient by itself for the external-sqld target — that requires the squashed baseline so vec0 migrations are never replayed remotely.
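The two paths could be selected and expressed roughly as follows. Everything here is a sketch under stated assumptions: `DbState`, `chooseMigrationPlan`, `conversionStatements`, the column name embedding_native and the index name are hypothetical, trigger names are omitted because they are not given in this proposal, and converting the stored JSON text with vector32() assumes the existing embeddings are JSON arrays:

```typescript
type MigrationPlan = "squashed-baseline" | "convert-in-place" | "up-to-date";

interface DbState {
  backend: "local" | "remote";
  appliedMigrations: string[]; // ids already recorded for this database
  hasNativeVectors: boolean; // true once the native vector column exists
}

function chooseMigrationPlan(state: DbState): MigrationPlan {
  if (state.hasNativeVectors) return "up-to-date";
  // Fresh databases, local or remote, never replay the vec0 migrations.
  if (state.appliedMigrations.length === 0) return "squashed-baseline";
  if (state.backend === "remote") {
    // A remote database with legacy migrations cannot exist if vec0 never loaded there.
    throw new Error("Legacy vec0 schema found on a remote backend; convert locally first");
  }
  return "convert-in-place";
}

// Statements for the local in-place conversion; sqlite-vec must be loaded before the DROP.
function conversionStatements(dimension: number): string[] {
  if (!Number.isInteger(dimension) || dimension <= 0) {
    throw new Error(`Invalid vector dimension: ${dimension}`);
  }
  return [
    `ALTER TABLE documents ADD COLUMN embedding_native F32_BLOB(${dimension})`,
    "UPDATE documents SET embedding_native = vector32(embedding) WHERE embedding IS NOT NULL",
    "CREATE INDEX documents_embedding_idx ON documents (libsql_vector_idx(embedding_native))",
    // Dropping the legacy insert/update/delete triggers would go here.
    "DROP TABLE IF EXISTS documents_vec",
  ];
}

console.log(chooseMigrationPlan({ backend: "remote", appliedMigrations: [], hasNativeVectors: false }));
console.log(conversionStatements(1536)[0]);
```

Failing loudly on the remote-with-legacy case makes the "never replay vec0 remotely" rule explicit instead of relying on a late CREATE VIRTUAL TABLE error.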

Risks / validation

  • Search-quality parity: re-run npm run evaluate:search (and the vector e2e suites) against the native implementation. Re-tune RRF weights / over-fetch if distance semantics differ. Make this a gate before merge.
  • Vector binary format compatibility: confirm the embedded libsql engine and the target sqld/Turso server agree on the F32_BLOB / vector32() encoding and dimensionality so a DB created locally works when pointed at a server (and vice-versa).
  • API gaps in libsql: pragma(), backup(), serialize() are unsupported. We don't use backup/serialize in code today (only mentioned in docs), but verify. Audit prepared-statement typings and BigInt handling (Statement<[bigint]>, getById).
  • WAL/-shm assumptions: docs/concepts/data-storage.md and applyMigrations assume local WAL semantics; document that these apply to the local backend only.
  • sqlite-vec version drift: project is on sqlite-vec 0.1.9; only relevant during the transition for the local conversion path. Plan its removal once the squashed baseline + conversion are shipped.

High-level task breakdown

  • Add libsql dependency; introduce a backend/connection abstraction (local path vs remote URL + auth token) with config + docs.
  • Convert applyMigrations pragmas to exec("PRAGMA …") and make file-level pragmas local-only.
  • Add a squashed baseline schema using native vectors for fresh/remote DBs (no vec0).
  • Add an in-place conversion migration for existing local DBs (documents.embedding → native vector column + index; drop documents_vec + triggers); keep sqlite-vec loadable for this step.
  • Rewrite the hybrid vector+FTS RRF query in DocumentStore.ts to native vector functions; preserve over-fetch semantics; add optional libsql_vector_idx.
  • Update ensureVectorTable / model-change / dimension-reconciliation logic for native columns.
  • Update docs: ARCHITECTURE.md, docs/concepts/data-storage.md, README (configuration for external backend), embedding/benchmarking guides.
  • Validate: evaluate:search, vector-persistence/vector-search e2e, docker e2e; add an e2e against a local sqld (and document Turso Cloud usage).
  • Follow-up (separate PR): remove sqlite-vec and its bundled native binaries once the transition window closes; lift the Node 22 version pin.

Out-of-scope follow-ups (worth tracking separately)

  • Embedded-replica mode for fast local reads syncing from a primary (great fit for the "multiple read-only processes" pattern).
  • Evaluating DiskANN (vector_top_k) for large corpora once parity on exact search is established.

Labels: enhancement
