A lightweight text embedding API designed as a drop-in replacement for the OpenAI embeddings endpoint.
Built with FastAPI and fastembed, LocalEmbed is optimized for running local document processing and vector pipelines securely on your own infrastructure.
- OpenAI SDK Compatible: Natively mirrors the `/v1/embeddings` schema. Point your existing OpenAI client to `localhost` and it just works.
- Privacy First: 100% local execution. No data ever leaves your network.
- Zero-Latency Starts: Automatically pre-loads your default model into memory on server boot.
- Container-Native: Multi-stage Docker build utilizing `uv` for a minimal, highly optimized runtime footprint.
- CPU + GPU Ready: Published Docker images for both CPU (`latest`) and NVIDIA GPU (`latest-gpu`) deployments.
- Docker (Recommended)
- For GPU deployment: NVIDIA GPU + drivers + NVIDIA Container Toolkit
- Python 3.12+ (for local development)
LocalEmbed uses optional environment variables for configuration. Create a .env file in the root directory:
- Copy the sample environment file from here:

  ```bash
  cp .env.sample .env
  ```

- Open the `.env` file and set your desired configurations (like `DEFAULT_EMBEDDING_MODEL` or `HF_TOKEN`).
Environment variables:
- `DEFAULT_EMBEDDING_MODEL`: model to preload on startup
- `HF_TOKEN`: optional, useful to avoid model download rate limits
- `MODEL_CACHE_LIMIT`: max number of models kept in memory (LRU eviction)
- `EMBEDDING_THREADS`: CPU threads for embedding computation
- `BATCH_SIZE`: number of inputs processed per batch
- `USE_GPU`: set `true` to force the CUDA provider in local/non-GPU-image runs
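For illustration, a `.env` setting a few of these might look like this (the values shown are examples, not project defaults):

```
DEFAULT_EMBEDDING_MODEL=BAAI/bge-small-en-v1.5
MODEL_CACHE_LIMIT=2
EMBEDDING_THREADS=4
BATCH_SIZE=32
```

Any variable you omit falls back to the application's built-in default.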
The easiest and recommended way to run LocalEmbed is using the pre-built Docker image from Docker Hub.
Run the CPU image:

```bash
docker run -d --pull=always --name localembed --env-file .env -p 8000:8000 heshinth/localembed:latest
```

Or the GPU image:

```bash
docker run -d --pull=always --gpus all --name localembed-gpu --env-file .env -p 8000:8000 heshinth/localembed:latest-gpu
```

The compose file includes environment variables directly within it.
Download the docker-compose.yml file from here
You can edit the file to configure it, then simply run:
```bash
docker compose up -d
```

For GPU, download the docker-compose.gpu.yml file from here, then run:

```bash
docker compose -f docker-compose.gpu.yml up -d
```

For a release tag like v0.1.3, published image tags are:

- CPU: `latest`, `0.1.3`, `0.1`
- GPU: `latest-gpu`, `0.1.3-gpu`, `0.1-gpu`
The API will be available at: http://localhost:8000/v1.
If you want to run the application natively without Docker:
- Install dependencies for CPU mode:

  ```bash
  uv sync --extra cpu
  ```

- Run the FastAPI development server:

  ```bash
  fastapi dev app/main.py
  ```
For local GPU mode:
- Install dependencies for GPU mode:

  ```bash
  uv sync --extra gpu
  ```

- Start with the GPU provider enabled:

  ```bash
  USE_GPU=true fastapi dev app/main.py
  ```
- `GET /v1/health` — Health check
- `POST /v1/embeddings` — Generate text embeddings using local models (OpenAI API compatible)
- `GET /v1/models` — List supported and ready-to-use embedding models
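For reference, the request and response bodies for `POST /v1/embeddings` follow the OpenAI embeddings schema. A sketch of the shapes involved (the field values below are illustrative, and the optional fields your instance returns may differ):

```python
import json

# Request body for POST /v1/embeddings (OpenAI-compatible schema)
request_body = {
    "model": "BAAI/bge-small-en-v1.5",
    "input": ["Hello, world!", "A second document"],
}

# The response mirrors OpenAI's shape: an "object" of type "list" whose
# "data" entries each carry an index and an embedding vector, one per input.
example_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.01, -0.02]},
        {"object": "embedding", "index": 1, "embedding": [0.03, 0.04]},
    ],
    "model": "BAAI/bge-small-en-v1.5",
}

print(json.dumps(request_body))
print(len(example_response["data"]))  # one data entry per input string
```

Because the schema matches, any tooling that already speaks the OpenAI embeddings format can be pointed at this endpoint unchanged.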
LocalEmbed supports all dense text embedding models provided by fastembed.
You can view the full list of supported models in the FastEmbed Documentation, or programmatically query your running instance via the API:
```
GET http://localhost:8000/v1/models
```

Since the `/v1/embeddings` endpoint is OpenAI API compatible, you can easily use the official `openai` Python package to interact with it just like the real OpenAI API:
```python
from openai import OpenAI

# Initialize the client pointing to the local base URL
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-no-key-required"
)

# Generate an embedding
response = client.embeddings.create(
    input=["Hello, world!"],
    model="BAAI/bge-small-en-v1.5"  # Replace with any supported model
)

print(response.data[0].embedding)
```

This project is licensed under the MIT License - see the LICENSE file for details.
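Once you have embeddings back, a common next step is ranking documents by similarity. A minimal cosine-similarity sketch in plain Python (the vectors below are toy values standing in for real model output such as `response.data[i].embedding`):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embedding output
query_vec = [0.1, 0.3, 0.5]
doc_vec = [0.1, 0.3, 0.5]

print(cosine_similarity(query_vec, doc_vec))  # identical vectors score ~1.0
```

For production-scale search you would hand the vectors to a vector database instead, but the scoring principle is the same.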