LLM Test Platform

A self-hosted web platform for comparing LLM performance across multiple models and providers. Define evaluation tasks, run them against different models, and get side-by-side comparisons with AI-powered scoring.

Features

Multi-Provider Support — OpenAI, Anthropic, and any OpenAI-compatible API
Dual Evaluation Modes — Simple chat completion + Agentic (ReAct loop with Docker sandbox)
Real-time Streaming — SSE-based streaming output with thinking/reasoning chain display
AI Scoring — Automatic 5-dimension scoring (accuracy, completeness, coherence, creativity, instruction-following) with multi-judge support
Side-by-Side Compare — Compare outputs from different models on the same task
API Key Encryption — AES-encrypted at rest, decrypted only at runtime
15+ Built-in Templates — Ready-to-use evaluation prompt templates
Agent Trace — Full tool-call trace and iteration history for agentic runs

Tech Stack

Layer	Technology
Backend	NestJS 11 + TypeScript
Database	SQLite via Prisma 7 + libSQL
Frontend	Vue 3 + Vite + Element Plus
State	Pinia
Runtime	Node.js 18+

Quick Start

Prerequisites

Node.js 18+
npm 9+
Docker (optional, for agentic sandbox mode)

Setup

# Clone and install
git clone https://github.com/heekei/llm-test.git
cd llm-test
npm install

# Configure
cd server
cp .env.example .env
# Edit .env — set ENCRYPTION_KEY (generate with: openssl rand -hex 32)

# Initialize database
cd ..
npm run prisma:migrate
npm run prisma:generate

Run

# Terminal 1: Backend (http://localhost:3000)
npm run dev:server

# Terminal 2: Frontend (http://localhost:5173)
npm run dev:web

Add a Provider

1. Open http://localhost:5173/providers
2. Click "Add Provider"
3. Fill in: Name, API Base URL, API Key, Adapter Type (openai/anthropic)
4. Click "Fetch Models" to pull available models from the API

Run Your First Evaluation

1. Go to http://localhost:5173/tasks
2. Click "New Task" and pick a template or write your own prompt
3. Open the task detail page, add target models, and hit "Run"
4. Watch streaming results, then use AI Scoring for automated evaluation

Project Structure

llm-test/
├── server/                 # NestJS backend
│   ├── src/
│   │   ├── agent/          # Agentic mode: tools, Docker sandbox, ReAct loop
│   │   ├── common/         # EncryptionService (AES key encryption)
│   │   ├── llm/            # Adapter pattern (OpenAI & Anthropic)
│   │   ├── models/         # Fetch & cache model lists from providers
│   │   ├── prisma/         # Database service
│   │   ├── providers/      # Provider CRUD
│   │   ├── runs/           # Execution, SSE streaming, scoring
│   │   └── tasks/          # Task CRUD, templates
│   └── prisma/             # Schema & migrations
├── web/                    # Vue 3 frontend
│   └── src/
│       ├── api/            # HTTP client & API modules
│       ├── components/     # Reusable Vue components
│       ├── composables/    # SSE stream consumer
│       ├── data/           # Built-in eval templates
│       ├── router/         # Vue Router config
│       ├── stores/         # Pinia state management
│       └── views/          # Page-level components
└── package.json            # npm workspaces root

API Overview

Method	Path	Description
GET	`/api/providers`	List providers
POST	`/api/providers`	Add provider
GET	`/api/models/:providerId`	Fetch & cache models from provider
GET	`/api/tasks`	List tasks
POST	`/api/tasks`	Create task
POST	`/api/tasks/:id/run`	Run task against targets (SSE)
GET	`/api/runs/stream/:runId`	Stream events for a single run (SSE)
GET	`/api/runs`	List all runs
PATCH	`/api/runs/:id/score`	Set manual score
POST	`/api/runs/:id/ai-score`	Trigger AI scoring
GET	`/api/tasks/:taskId/compare`	Get comparison data for a task
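
As a rough illustration of calling the endpoints in the table above, here is a minimal TypeScript sketch. The paths and the backend address come from this README; the `NewProvider` field names and the JSON response handling are assumptions made for the example and may not match the actual request schema.

```typescript
// Backend dev address, as given in the Run section.
const BASE_URL = "http://localhost:3000";

// URL builders for some of the paths listed in the table above.
const routes = {
  providers: (): string => `${BASE_URL}/api/providers`,
  models: (providerId: string): string =>
    `${BASE_URL}/api/models/${encodeURIComponent(providerId)}`,
  runTask: (taskId: string): string =>
    `${BASE_URL}/api/tasks/${encodeURIComponent(taskId)}/run`,
  compare: (taskId: string): string =>
    `${BASE_URL}/api/tasks/${encodeURIComponent(taskId)}/compare`,
};

// Hypothetical payload: the field names are guesses based on the
// "Add a Provider" form (Name, API Base URL, API Key, Adapter Type).
interface NewProvider {
  name: string;
  baseUrl: string;
  apiKey: string;
  adapterType: "openai" | "anthropic";
}

// Sends POST /api/providers and returns the parsed JSON response.
// The response shape is not documented here, so it is typed as unknown.
async function addProvider(provider: NewProvider): Promise<unknown> {
  const res = await fetch(routes.providers(), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(provider),
  });
  if (!res.ok) {
    throw new Error(`POST /api/providers failed with status ${res.status}`);
  }
  return res.json();
}
```

The global `fetch` used here is available in Node.js 18+, so the sketch needs no extra HTTP dependency.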

Architecture

Evaluation Modes

Simple (mode: "simple") — Single chat completion, streaming text output
Agentic (mode: "agentic") — ReAct loop with tool execution in Docker sandbox
- Built-in tools: bash, python, read_file, write_file, web_request
- Full trace of every iteration, tool call, and result
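
To make the agentic mode more concrete, the following is a minimal sketch of a ReAct-style loop with a trace. It is not the project's implementation: `runReAct`, `ModelStep`, and `TraceEntry` are hypothetical names, the `bash` tool is an in-memory stub rather than a Docker sandbox, and the loop is synchronous for brevity where real model and tool calls would be asynchronous.

```typescript
// One model turn: either request a tool or give the final answer.
type ModelStep =
  | { type: "tool_call"; tool: string; input: string }
  | { type: "final"; answer: string };

// One recorded tool call, so the whole run can be inspected afterwards.
interface TraceEntry {
  iteration: number;
  tool: string;
  input: string;
  result: string;
}

type Tool = (input: string) => string;
type Model = (history: string[]) => ModelStep;

// Alternates model steps and tool executions until the model returns a
// final answer or the iteration budget runs out (answer is then null).
function runReAct(
  task: string,
  model: Model,
  tools: Map<string, Tool>,
  maxIterations = 5,
): { answer: string | null; trace: TraceEntry[] } {
  const history: string[] = [`task: ${task}`];
  const trace: TraceEntry[] = [];
  for (let iteration = 1; iteration <= maxIterations; iteration++) {
    const step = model(history);
    if (step.type === "final") {
      return { answer: step.answer, trace };
    }
    const tool = tools.get(step.tool);
    const result = tool ? tool(step.input) : `unknown tool: ${step.tool}`;
    trace.push({ iteration, tool: step.tool, input: step.input, result });
    history.push(`observation: ${result}`); // feed the result back to the model
  }
  return { answer: null, trace };
}

// Usage with a stub tool and a scripted model (both stand-ins for illustration).
const stubTools = new Map<string, Tool>([["bash", (input) => `ran: ${input}`]]);
const scriptedModel: Model = (history) =>
  history.length === 1
    ? { type: "tool_call", tool: "bash", input: "echo hi" }
    : { type: "final", answer: history[history.length - 1] ?? "" };

const demoRun = runReAct("say hi", scriptedModel, stubTools);
```

The iteration cap matters in practice: without it, a model that never emits a final answer would loop indefinitely.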

SSE Streaming Flow

POST /api/tasks/:id/run
  → creates TaskRun records
  → GET /api/runs/stream/:runId (SSE)
    → created → delta/thinking → complete/error → done
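
The event sequence above could be consumed on the client roughly as follows. This is a sketch under assumptions: the event names (`created`, `delta`, `thinking`, `complete`, `error`, `done`) come from the flow above, but the framing with `event:`/`data:` fields and the idea that `delta` and `thinking` carry raw text chunks are guesses, and `parseSse` and `collectOutput` are hypothetical helpers. It also handles only `\n` line endings, which is a simplification of the SSE format.

```typescript
interface RunEvent {
  event: string;
  data: string;
}

// Splits a text buffer into complete SSE frames (separated by a blank line).
// Returns the parsed events plus any trailing partial frame, which the
// caller should prepend to the next network chunk.
function parseSse(buffer: string): { events: RunEvent[]; rest: string } {
  const frames = buffer.split("\n\n");
  const rest = frames.pop() ?? "";
  const events: RunEvent[] = [];
  for (const frame of frames) {
    let event = "message"; // default event type when no "event:" field is sent
    const dataLines: string[] = [];
    for (const line of frame.split("\n")) {
      if (line.startsWith("event:")) {
        event = line.slice(6).trim();
      } else if (line.startsWith("data:")) {
        dataLines.push(line.slice(5).replace(/^ /, "")); // drop one leading space
      }
    }
    // Frames without any data field are skipped.
    if (dataLines.length > 0) {
      events.push({ event, data: dataLines.join("\n") });
    }
  }
  return { events, rest };
}

// Folds events into the visible output, the reasoning text, and a done flag.
function collectOutput(events: RunEvent[]): {
  text: string;
  thinking: string;
  finished: boolean;
} {
  let text = "";
  let thinking = "";
  let finished = false;
  for (const e of events) {
    if (e.event === "delta") text += e.data;
    else if (e.event === "thinking") thinking += e.data;
    else if (e.event === "done") finished = true;
  }
  return { text, thinking, finished };
}
```

Keeping the unparsed remainder in `rest` is what lets the parser cope with frames that are split across network chunks.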

AI Scoring

A judge LLM scores each output across 5 weighted dimensions (accuracy, completeness, coherence, creativity, instruction-following). Multiple judges can score the same run, and their scores are stored as a JSON array.
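
One way to combine weighted dimension scores from several judges is sketched below. The five dimension names come from this README, but the weights, the score scale, and the averaging across judges are assumptions for illustration; `weightedScore` and `aggregateJudges` are hypothetical helpers, not the project's API.

```typescript
type Dimension =
  | "accuracy"
  | "completeness"
  | "coherence"
  | "creativity"
  | "instructionFollowing";

type DimensionScores = Record<Dimension, number>;

interface JudgeScore {
  judge: string; // name of the judge model
  scores: DimensionScores;
}

// Hypothetical weights; the real weighting is not documented here.
const WEIGHTS: DimensionScores = {
  accuracy: 0.3,
  completeness: 0.2,
  coherence: 0.2,
  creativity: 0.1,
  instructionFollowing: 0.2,
};

// Weighted mean of one judge's scores, normalised by the total weight
// so the weights do not have to sum to exactly 1.
function weightedScore(
  scores: DimensionScores,
  weights: DimensionScores = WEIGHTS,
): number {
  const dims = Object.keys(weights) as Dimension[];
  const totalWeight = dims.reduce((sum, d) => sum + weights[d], 0);
  const weighted = dims.reduce((sum, d) => sum + scores[d] * weights[d], 0);
  return weighted / totalWeight;
}

// Plain average of the per-judge weighted scores.
function aggregateJudges(judgeScores: JudgeScore[]): number {
  if (judgeScores.length === 0) {
    throw new Error("at least one judge score is required");
  }
  const total = judgeScores.reduce((sum, j) => sum + weightedScore(j.scores), 0);
  return total / judgeScores.length;
}
```

Averaging across judges is the simplest aggregation; a median or a per-judge breakdown would be equally plausible choices.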

Encryption

API keys are encrypted at rest with AES-256-CBC. The encryption key is derived from the ENCRYPTION_KEY environment variable, and API keys are decrypted only when making API calls.
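
For readers unfamiliar with this pattern, here is a minimal sketch of AES-256-CBC encryption with Node's built-in `crypto` module. It is not the project's EncryptionService: it assumes ENCRYPTION_KEY is the 64-character hex string produced by `openssl rand -hex 32` and is used directly as the key, and the `iv:ciphertext` storage format and function names are hypothetical.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Assumption: the key is supplied as 64 hex characters (32 bytes).
// The project may derive its key from ENCRYPTION_KEY differently.
function loadKey(hexKey: string): Buffer {
  const key = Buffer.from(hexKey, "hex");
  if (key.length !== 32) {
    throw new Error("ENCRYPTION_KEY must decode to 32 bytes for AES-256");
  }
  return key;
}

// Encrypts with a fresh random IV and returns "ivHex:ciphertextHex".
// This storage format is chosen only for the sketch.
function encryptApiKey(plain: string, key: Buffer): string {
  const iv = randomBytes(16); // CBC mode uses a 16-byte IV
  const cipher = createCipheriv("aes-256-cbc", key, iv);
  const encrypted = Buffer.concat([cipher.update(plain, "utf8"), cipher.final()]);
  return `${iv.toString("hex")}:${encrypted.toString("hex")}`;
}

// Reverses encryptApiKey: splits out the IV, then decrypts the ciphertext.
function decryptApiKey(stored: string, key: Buffer): string {
  const [ivHex, dataHex] = stored.split(":");
  if (!ivHex || !dataHex) {
    throw new Error("Malformed ciphertext");
  }
  const decipher = createDecipheriv("aes-256-cbc", key, Buffer.from(ivHex, "hex"));
  const decrypted = Buffer.concat([
    decipher.update(Buffer.from(dataHex, "hex")),
    decipher.final(),
  ]);
  return decrypted.toString("utf8");
}
```

A typical call would be `encryptApiKey(apiKey, loadKey(process.env.ENCRYPTION_KEY ?? ""))`. Because a new IV is generated per call, encrypting the same API key twice yields different ciphertexts.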

Roadmap

Benchmark dataset support (pre-defined test sets)
Batch evaluation with concurrency control
Export results as CSV/JSON
User authentication & multi-tenancy
More built-in agents and tools
PostgreSQL support
Dark mode

Development

See server/README.md and web/README.md for development guides.

# Type check
npm run typecheck

# Lint (server only)
cd server && npm run lint

# Database GUI
npm run prisma:studio

License

MIT — see LICENSE for details.
