A faster hot path. The path from “user hits send” to “first token streams back” got a series of targeted optimizations:
- Parallelized pre-stream retrieval so the relevance lookups that feed the prompt run concurrently instead of serially.
- An LRU cache for query embeddings so repeated or similar lookups skip re-embedding.
- Cached prepared statements in the vector store to shave per-query database overhead.
- A token-budgeted conversation window that bounds the single biggest cost in the hot path — the running history sent to the model — instead of letting it grow unchecked.
- Smaller trims, too: a non-semantic tag was dropped, retrieved history messages were capped, and the per-request date block (~23 tokens on every message) was slimmed down.
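To make two of these ideas concrete, here is a minimal Python sketch of concurrent retrieval and a token-budgeted history window. It is an illustration under stated assumptions, not GroupGPT's actual code: the names `Message`, `estimate_tokens`, `window_by_budget`, `lookup`, and `retrieve_all` are hypothetical, the four-characters-per-token estimate stands in for a real tokenizer, and the `asyncio.sleep` call stands in for a real vector-store query.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. A real system would
    # use the model's own tokenizer here.
    return max(1, len(text) // 4)


def window_by_budget(history: list[Message], budget: int) -> list[Message]:
    """Keep the newest messages whose combined estimate fits the budget."""
    kept: list[Message] = []
    used = 0
    # Walk from newest to oldest and stop at the first message that
    # would push the running total over the budget.
    for msg in reversed(history):
        cost = estimate_tokens(msg.text)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


async def lookup(source: str, query: str) -> str:
    # Hypothetical relevance lookup; the sleep simulates I/O latency.
    await asyncio.sleep(0.05)
    return f"{source}: results for {query!r}"


async def retrieve_all(query: str, sources: list[str]) -> list[str]:
    # gather() runs the lookups concurrently and returns results in the
    # same order as the inputs, so total wait is about one lookup, not N.
    return await asyncio.gather(*(lookup(s, query) for s in sources))
```

With a fixed budget, the prompt stops growing once the conversation exceeds it. The oldest messages fall out of the window first, which is one reasonable policy among several (summarizing older turns is another).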
A server that doesn't fall over. Two reliability fixes landed. First, whole-server crashes were stopped at the source. Second, the connection logic was hardened to tolerate many users behind one IP. That matters for offices, campuses, and shared networks, where many legitimate people share a single address and were tripping rate limits meant for abusers.
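One way to tolerate many users behind one IP is to limit each user tightly and each address only loosely. The Python sketch below shows that idea. It is an assumption about the general approach, not a description of what GroupGPT's connection logic actually does, and `SlidingWindowCounter`, `RequestGate`, and the limits used are all hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowCounter:
    """Counts hits per key over the trailing `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)

    def has_room(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        return len(hits) < self.limit

    def record(self, key: str, now: float) -> None:
        self._hits[key].append(now)


class RequestGate:
    """Tight per-user limit plus a much looser per-IP ceiling."""

    def __init__(self, user_limit: int, ip_limit: int, window: float = 60.0) -> None:
        self.per_user = SlidingWindowCounter(user_limit, window)
        self.per_ip = SlidingWindowCounter(ip_limit, window)

    def allow(self, user_id: str, ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Check both limits before recording anything, so a rejected
        # request never eats into the quota shared by the whole address.
        if not self.per_user.has_room(user_id, now):
            return False
        if not self.per_ip.has_room(ip, now):
            return False
        self.per_user.record(user_id, now)
        self.per_ip.record(ip, now)
        return True
```

With the per-IP ceiling set well above the per-user limit, a full office can share one address without anyone being throttled, while a single client hammering the server still hits its own limit quickly.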
Why it matters
Latency is felt on every single message: shaving time-to-first-token makes the whole product feel quicker, and trimming per-message tokens lowers cost at scale. A chat backend that crashes, or locks out an entire office because its users share an IP, isn't one you can rely on. This pass made GroupGPT both faster and sturdier.