Semantic Retrieval: Cortez Remembers What Matters

Cortez's memory got a real retrieval engine — semantic search over a room's knowledge and its own history, so replies carry what's actually relevant instead of everything at once.

Previously, every reply stuffed as much of a room's accumulated knowledge (“neurons”) into the prompt as would fit, which was both expensive and imprecise. Now GroupGPT runs local, embedding-based retrieval instead:

  • Semantic neuron search — a cross-room cosine-similarity search over a vector index surfaces the knowledge actually relevant to the current message, instead of dumping everything.
  • Private-room history search — a second index lets Cortez pull the most relevant prior messages from the room's own history into context.
  • Zero-token FAQ — common product questions (“what is GroupGPT”, “what can you do”, and friends) are answered from a curated, semantically-matched FAQ without calling the model at all — instant, free responses.
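All three lookups reduce to the same operation: embed the incoming message, then compare it against stored vectors by cosine similarity. Neuron and history search keep the top k matches; the FAQ uses a cutoff, answering only when the best match is close enough and otherwise falling through to the model. A minimal Python sketch under loud assumptions, not GroupGPT's code: tiny hand-written 3-dimensional vectors stand in for 384-dimensional MiniLM embeddings, a brute-force scan stands in for the sqlite-vec index, the reranker is omitted, and the names `cosine`, `top_k`, `faq_answer` and the 0.85 threshold are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity dot(a, b) / (|a| * |b|); 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Vector, index: Dict[str, Vector], k: int = 3) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k most similar."""
    scored = [(key, cosine(query, vec)) for key, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def faq_answer(
    query: Vector, faq: List[Tuple[Vector, str]], threshold: float = 0.85
) -> Optional[str]:
    """Return the closest curated answer if it clears the threshold, else None.

    None means "no confident match": fall through to a normal model call.
    """
    best_score, best_answer = 0.0, None
    for vec, answer in faq:
        score = cosine(query, vec)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None


if __name__ == "__main__":
    # Toy 3-d "embeddings"; real MiniLM vectors would have 384 dimensions.
    neurons = {
        "pricing": [0.9, 0.1, 0.0],
        "roadmap": [0.1, 0.9, 0.0],
        "team": [0.0, 0.2, 0.9],
    }
    faq = [
        ([1.0, 0.0, 0.0], "<curated answer to 'what is GroupGPT'>"),
        ([0.0, 1.0, 0.0], "<curated answer to 'what can you do'>"),
    ]
    message = [0.8, 0.2, 0.0]
    print(top_k(message, neurons, k=2))      # pricing ranks first, then roadmap
    print(faq_answer(message, faq))          # close enough to the first FAQ entry
    print(faq_answer([0.5, 0.5, 0.7], faq))  # None: no confident match
```

In this sketch the threshold is the main design lever: set it too low and loosely related questions receive a canned answer; set it too high and the zero-token path rarely fires.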

Under the hood, everything runs locally: a compact MiniLM embedding model (384-dimensional) plus a reranker, backed by sqlite-vec virtual tables on a WAL-mode SQLite connection. The models warm up at boot on a degraded-mode-safe path, so a cold embedding model never blocks chat, and a backfill seeded the index with the existing corpus. World chat and end-to-end-encrypted rooms are always excluded from indexing for privacy.
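The storage and privacy rules in the paragraph above can be illustrated with Python's standard library alone. This is a sketch under stated assumptions, not the actual implementation: a plain SQLite table of packed float32 blobs plus a Python-side cosine scan replaces the sqlite-vec virtual table, and the room kinds `"world"` and `"e2ee"`, the class `WarmModel`, the helpers `open_index`, `index_message`, `search_room`, and the keyword-counting `toy_embedder` are all hypothetical. It shows three behaviours: the connection runs in WAL mode, the embedder loads on a background thread so callers get `None` (and skip retrieval) instead of waiting, and excluded rooms are never written to the index.

```python
import array
import math
import os
import sqlite3
import tempfile
import threading
from typing import Callable, List, Optional, Sequence

Embedder = Callable[[str], Sequence[float]]

# Hypothetical labels for the two room kinds that are never indexed.
EXCLUDED_KINDS = {"world", "e2ee"}


class WarmModel:
    """Loads an embedder on a background thread so callers never wait for it."""

    def __init__(self, loader: Callable[[], Embedder]) -> None:
        self._model: Optional[Embedder] = None
        self.ready = threading.Event()
        threading.Thread(target=self._load, args=(loader,), daemon=True).start()

    def _load(self, loader: Callable[[], Embedder]) -> None:
        self._model = loader()
        self.ready.set()

    def get(self) -> Optional[Embedder]:
        return self._model  # None while the model is still cold


def open_index(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # WAL lets readers keep querying while a writer (e.g. a backfill) appends.
    conn.execute("PRAGMA journal_mode=WAL")
    # Plain table standing in for a sqlite-vec virtual table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS message_vecs ("
        "id INTEGER PRIMARY KEY, room_id TEXT NOT NULL, embedding BLOB NOT NULL)"
    )
    return conn


def index_message(conn: sqlite3.Connection, model: WarmModel,
                  room_id: str, room_kind: str, text: str) -> bool:
    """Embed and store one message; return False if it was skipped."""
    if room_kind in EXCLUDED_KINDS:
        return False  # privacy rule: never written to the index
    embed = model.get()
    if embed is None:
        return False  # degraded mode: don't block, leave it for a backfill
    blob = array.array("f", embed(text)).tobytes()  # packed float32 values
    conn.execute("INSERT INTO message_vecs (room_id, embedding) VALUES (?, ?)",
                 (room_id, blob))
    conn.commit()
    return True


def search_room(conn: sqlite3.Connection, model: WarmModel,
                room_id: str, query: str, k: int = 5) -> List[int]:
    """Return ids of the k stored messages in room_id most similar to query."""
    embed = model.get()
    if embed is None:
        return []  # cold model: answer without retrieved history
    q = list(embed(query))
    q_norm = math.sqrt(sum(x * x for x in q))
    scored = []
    rows = conn.execute("SELECT id, embedding FROM message_vecs WHERE room_id = ?",
                        (room_id,))
    for msg_id, blob in rows:
        vec = array.array("f")
        vec.frombytes(blob)
        norm = q_norm * math.sqrt(sum(x * x for x in vec))
        dot = sum(a * b for a, b in zip(q, vec))
        scored.append((dot / norm if norm else 0.0, msg_id))
    scored.sort(reverse=True)  # highest cosine similarity first
    return [msg_id for _, msg_id in scored[:k]]


def toy_embedder() -> Embedder:
    """Hypothetical stand-in for MiniLM: counts a few keywords."""
    vocab = ["price", "plan", "launch", "bug"]
    return lambda text: [float(text.lower().count(w)) for w in vocab]


if __name__ == "__main__":
    model = WarmModel(toy_embedder)
    model.ready.wait()
    conn = open_index(os.path.join(tempfile.mkdtemp(), "index.db"))
    index_message(conn, model, "r1", "private", "the price plan changed")
    index_message(conn, model, "r1", "private", "launch is next week")
    index_message(conn, model, "lobby", "world", "price?")  # skipped
    print(search_room(conn, model, "r1", "what is the price", k=1))  # [1]
```

Skipping writes while the model is cold keeps chat responsive in this sketch; the trade-off is gaps in the index, which a backfill pass over stored messages can later fill.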

Why it matters

This is the difference between an assistant that gestures at everything it's ever been told and one that recalls the right thing at the right moment. Replies become more relevant and dramatically cheaper, because the prompt carries only what's pertinent rather than everything that fits, and the FAQ answers the most-asked questions without a model call at all.