A self-evolving agent sidecar that intercepts LLM requests, runs iterative planning cycles grounded in personal semantic memory, and writes preference facts back to a vector store after every confirmed session — so every future generation is shaped by every past correction.
Every LLM has the same two failure modes — and they compound each other.
1. It assumes instead of asking. When you send a coding task or a personalized writing request, the model picks the most statistically likely interpretation and commits to it immediately. It doesn't pause to ask whether you want idiomatic or explicit code, terse or expressive prose, a particular structure or tone. It guesses — and you spend the next several turns correcting a direction it should have clarified upfront.
2. It forgets everything you tell it. Every tweak you make in a session — the phrasing you corrected, the structure you rejected, the style you pushed it toward — vanishes the moment the context window closes. There is no persistence within a session beyond the visible thread, no persistence across sessions, and no global memory of who you are or how you work. You re-teach the model the same preferences every time.
Story Drafter breaks both failure modes at the service layer.
Story Drafter sits between your frontend and the main LLM as a FastAPI sidecar. It exposes two endpoints:
- POST /draft: runs a LangGraph pipeline that retrieves personal preference facts from ChromaDB via mem0, assembles a structured context, and generates a draft using a configurable drafter LLM.
- POST /confirm: receives the final approved draft and the full drafting exchange, extracts new preference facts via a background LLM call, and writes them back to ChromaDB asynchronously.
The client drives an iterative loop: call /draft, collect user feedback, call /draft again with the extended message history — repeat until the draft is approved, then call /confirm. Each iteration appends the previous draft and feedback as new turns, giving the drafter full context of the refinement trajectory.
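The loop described above can be sketched from the client's side. This is a minimal illustration, not the project's client code: `run_draft_loop`, its callback parameters, and `max_rounds` are hypothetical names, and the HTTP calls to /draft and /confirm are passed in as plain callables so the control flow stays visible.

```python
from typing import Callable, Optional

Message = dict  # {"role": "...", "content": "..."}


def run_draft_loop(
    messages: list[Message],
    call_draft: Callable[[list[Message]], str],
    get_feedback: Callable[[str], Optional[str]],
    call_confirm: Callable[[list[Message], str], None],
    max_rounds: int = 10,
) -> str:
    """Call /draft until the user approves, then call /confirm once."""
    loop_turns: list[Message] = []  # only the draft/feedback exchange
    for _ in range(max_rounds):
        # Each request carries the original history plus every prior
        # draft/feedback pair, so the drafter sees the whole trajectory.
        draft = call_draft(messages + loop_turns)
        feedback = get_feedback(draft)  # None is treated as "approved"
        if feedback is None:
            call_confirm(loop_turns, draft)
            return draft
        loop_turns.append({"role": "assistant", "content": draft})
        loop_turns.append({"role": "user", "content": feedback})
    raise RuntimeError("draft was not approved within max_rounds")
```

Keeping `loop_turns` separate from the original `messages` makes it easy to send only the refinement exchange to /confirm later.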
This is not prompt engineering. This is not fine-tuning. This is a Continuous Personalization Loop via Interactive Grounding — the paradigm emerging at the frontier of agentic AI research:
- PAHF (Personalized Agents from Human Feedback) — every correction in the drafting loop is a live preference signal. No model retraining. No dataset curation. The memory updates itself from the interaction.
- Self-Evolving Memory — the agent treats its own vector store as mutable. After each confirmed session, a background extraction step reflects on the full drafting exchange and patches the semantic profile of the user.
- Interactive Grounding — instead of one-shot generation, the service surfaces candidates and refines them over multiple turns. Each refinement step closes the gap between the model's prior and the user's actual intent — and that delta is what gets persisted.
Client (any frontend)
│
├─► POST /draft
│     │
│     ▼
│   LangGraph pipeline
│     ├─ retrieve_memories(user_id, query)
│     │    └─► ChromaDB (mem0): semantic search → ranked preference facts
│     │
│     └─ generate_draft(context)
│          ├─ [system] agent_prompt + <meta_prompt> blocks
│          ├─ [user]   <user_preferences> (injected from mem0)
│          ├─ [user]   <system_prompt> (from messages[0].system)
│          ├─ [user]   <conversation> (full message history, XML-tagged)
│          └─ [user]   task_prompt (planning or drafting instruction)
│
│   ← { "draft": "..." }
│
│   [client appends draft + feedback to messages, loops]
│
└─► POST /confirm
      │
      ▼
    FastAPI BackgroundTask
      └─► mem0 LLM: extract discrete preference facts from drafting exchange
            └─► ChromaDB: upsert facts under user_id
                  └─► available on next /draft call
Each call to /draft receives the full message history including all prior draft/feedback turns appended by the client. The LangGraph node assembles:
[system] agent_prompt
         + <tag>meta_prompt</tag> for each configured meta prompt
[user]   <user_preferences>
           <item>retrieved preference fact</item> ...
         </user_preferences>
         <system_prompt>messages[0].system content</system_prompt>
[user]   <conversation>
           <user>...</user>
           <assistant>...</assistant>
           ...
         </conversation>
[user]   task_prompt
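As a rough illustration of that layout, the assembly step might look like the sketch below. `assemble_prompt` and its parameter names are hypothetical, the XML escaping is an added assumption, and the real node in drafter/agent.py may group or order the blocks differently.

```python
from xml.sax.saxutils import escape


def assemble_prompt(agent_prompt, meta_prompts, preferences,
                    system_prompt, conversation, task_prompt):
    """Build the four-message layout: system, preferences, conversation, task."""
    # [system] agent prompt followed by one tagged block per meta prompt.
    system = agent_prompt
    for tag, text in meta_prompts.items():
        system += f"\n<{tag}>{escape(text)}</{tag}>"

    # [user] retrieved preference facts plus the caller's system prompt.
    items = "\n".join(f"  <item>{escape(p)}</item>" for p in preferences)
    context = (
        f"<user_preferences>\n{items}\n</user_preferences>\n"
        f"<system_prompt>{escape(system_prompt)}</system_prompt>"
    )

    # [user] the conversation, one XML element per user/assistant turn.
    turns = "\n".join(
        f"  <{m['role']}>{escape(m['content'])}</{m['role']}>"
        for m in conversation
        if m["role"] in ("user", "assistant")
    )

    return [
        {"role": "system", "content": system},
        {"role": "user", "content": context},
        {"role": "user", "content": f"<conversation>\n{turns}\n</conversation>"},
        {"role": "user", "content": task_prompt},
    ]
```

Escaping the inserted text keeps a stray `<` in a user message from being read as a tag boundary.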
mem0 queries ChromaDB using the last 3 non-system messages as the search query. A fallback_model is wired in via LangChain's .with_fallbacks(), so a failed call to the primary drafter model is retried against the fallback.
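One way to derive that search text is sketched here. `build_memory_query` is a hypothetical helper, and joining the messages with newlines is an assumption; the service may combine them differently before handing the text to mem0.

```python
def build_memory_query(messages, last_n=3):
    """Concatenate the content of the last `last_n` non-system messages."""
    if last_n <= 0:
        return ""
    non_system = [m for m in messages if m.get("role") != "system"]
    return "\n".join(m["content"] for m in non_system[-last_n:])
```

Using only the most recent turns keeps retrieval focused on what the user is asking for right now rather than on the whole thread.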
The confirm payload contains only the drafting loop exchange (draft_loop_messages + final approved draft) — not the full conversation history. This keeps the mem0 extraction focused on stylistic and preference signals from the refinement session rather than narrative content.
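A client could build that payload along these lines. The field names follow the /confirm request shown in the quick start; `build_confirm_payload`, `loop_start`, and the rule that feedback turns are the user messages inside the loop are assumptions made for illustration.

```python
def build_confirm_payload(full_messages, loop_start, final_draft, user_id):
    """Keep only the draft/feedback exchange, dropping earlier conversation."""
    # Everything before loop_start is narrative context and is not sent.
    loop = full_messages[loop_start:]
    return {
        "messages": loop,
        "final_draft": final_draft,
        # Assumption: every user turn inside the loop is a feedback turn.
        "feedback_history": [m["content"] for m in loop if m["role"] == "user"],
        "user_id": user_id,
    }
```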
The server returns {"ok": true} immediately. The mem0 LLM call and ChromaDB write happen in a BackgroundTask — the client is never blocked on memory consolidation.
| Layer | Technology |
|---|---|
| Drafter service | FastAPI + LangGraph (port 6677) |
| Preference memory | mem0 + ChromaDB (port 8100) |
| Observability | Arize Phoenix (port 6006) |
| LLMs | OpenRouter (any model) · configurable embedding provider |
1. Configure
cp .env.example .env
# set OPENROUTER_API_KEY
# set GOOGLE_APPLICATION_CREDENTIALS (for Google-based embedding model)
2. Start services
docker compose up --build
ChromaDB :8100 · Phoenix :6006 · Drafter :6677
3. Health check
curl http://localhost:6677/health/ready
4. Call the API
# First draft
curl -s -X POST http://localhost:6677/draft \
-H 'Content-Type: application/json' \
-d '{
"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}],
"model": "<your-model-id>",
"agent_prompt": "...",
"task_prompt": "...",
"user_id": "user_123"
}'
# After client collects feedback and appends draft+feedback to messages, call /draft again
# On approval, confirm
curl -s -X POST http://localhost:6677/confirm \
-H 'Content-Type: application/json' \
-d '{
"messages": [...draft loop exchange...],
"final_draft": "...",
"feedback_history": ["feedback turn 1", "feedback turn 2"],
"user_id": "user_123",
"custom_instructions": "..."
}'
5. Browse memory
.venv/bin/streamlit run scripts/dashboard.py

| Variable | Default | Description |
|---|---|---|
| MEM0_LLM_MODEL | (none) | LLM for preference extraction (any OpenRouter model) |
| MEM0_EMBED_MODEL | (none) | Embedding model |
| CHROMA_HOST | localhost | ChromaDB host |
| CHROMA_PORT | 8100 | ChromaDB port |
| CHROMA_COLLECTION | agent_prefs | Collection name |
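
For reference, these variables and defaults could be read as in the sketch below. It is not the project's config.py: `MemorySettings` and `load_memory_settings` are hypothetical names, and only the variable names and default values come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MemorySettings:
    llm_model: Optional[str]    # MEM0_LLM_MODEL, no default
    embed_model: Optional[str]  # MEM0_EMBED_MODEL, no default
    chroma_host: str
    chroma_port: int
    chroma_collection: str


def load_memory_settings(env: Optional[Mapping[str, str]] = None) -> MemorySettings:
    """Read memory settings from the environment, applying the table defaults."""
    env = os.environ if env is None else env
    return MemorySettings(
        llm_model=env.get("MEM0_LLM_MODEL"),
        embed_model=env.get("MEM0_EMBED_MODEL"),
        chroma_host=env.get("CHROMA_HOST", "localhost"),
        chroma_port=int(env.get("CHROMA_PORT", "8100")),
        chroma_collection=env.get("CHROMA_COLLECTION", "agent_prefs"),
    )
```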
main.py              FastAPI entry point (port 6677)
drafter/
  agent.py           LangGraph graph: retrieve_memories → generate_draft
  memory.py          mem0 wrapper: search_memories(), add_memory()
  llm_utils.py       make_llm(): OpenRouter and Google Gemini
  models.py          Pydantic schemas: DraftRequest, DraftResponse, ConfirmRequest
  config.py          Settings from .env
scripts/
  dashboard.py       Streamlit memory browser
See ARCHITECTURE.md for the full request/response flow and state machine.
Most "personalized AI" products are personalized at training time — a frozen snapshot of aggregate preferences baked into weights. Story Drafter is personalized at inference time, continuously, from individual corrections made in live sessions.
Every drafting loop tightens the model's prior on a specific user. The preference vector store grows denser with each session. That's not a feature. That's the architecture.