Anatomy of a “Database Disk Image Is Malformed” Bug

What happened

The backend began throwing the same error on nearly every database query:

database disk image is malformed

In SQLite terms that's SQLITE_CORRUPT (primary result code 11), the kind of message that makes your stomach drop, because it usually means a data file has been physically damaged. Rooms, accounts, messages: anything that touched the main database failed.

But there was a strange tell. Redeploying the server fixed it — for a while. Then, under real traffic, the same errors came roaring back. A genuinely corrupt file doesn't heal itself on restart. So the file probably wasn't corrupt. Something at runtime was producing a malformed image, and a fresh deploy reset whatever that was.
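
One way to test that hunch is to ask SQLite directly. The sketch below uses Python's built-in sqlite3 module as a stand-in for the Node stack in this story (an assumption made purely for illustration), and inspect_database is a hypothetical helper name. It reports the file's journal mode and the result of PRAGMA integrity_check, which returns a single row reading ok when no damage is found.

```python
import sqlite3


def inspect_database(path: str) -> dict:
    """Report the journal mode and integrity-check result for a SQLite file.

    Hypothetical helper for illustration; not part of the original codebase.
    """
    conn = sqlite3.connect(path)
    try:
        journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
        # integrity_check yields one row containing "ok" when the file is
        # sound, otherwise one row per problem it found.
        problems = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
    return {
        "journal_mode": journal_mode,
        "healthy": problems == ["ok"],
        "problems": problems,
    }
```

Running a check like this against a copy of the file on a local disk, rather than on the live volume, is probably the safer choice. A clean result there would support the idea that the damage was a runtime artifact and not a property of the file.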

The root cause: two connections, one file, two journal modes

GroupGPT recently shipped semantic retrieval — the feature that lets the assistant recall the most relevant knowledge and history instead of stuffing everything into the prompt. It's powered by a local vector store built on sqlite-vec and better-sqlite3.

To save space, that vector store had been pointed at the exact same SQLite file the rest of the app uses through Prisma. And on initialization it ran:

PRAGMA journal_mode = WAL

That single line was the bug. Here's the collision:

Prisma opened the file with SQLite's default rollback-journal mode.
The vector store opened the same file with a separate connection and switched it to WAL (write-ahead logging). That setting is recorded in the database file itself, so it persists for every connection, and it creates -wal and -shm sidecar files next to the database.
WAL's shared-memory coordination (-shm) relies on real memory-mapping and file-locking semantics. On the network/overlay filesystem backing the production volume, those semantics aren't reliable.

Two libraries, one file, conflicting journal modes, on a filesystem that can't safely run WAL across processes. Under concurrent read-and-write load the two connections saw inconsistent views of the file and SQLite reported exactly what it saw: a malformed image. A redeploy reset the WAL state and reopened both connections fresh — so it “worked again,” until load rebuilt the conflict.
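
The key detail is that the journal mode belongs to the file, not to the connection that set it. Here is a minimal sketch of that behavior, again using Python's built-in sqlite3 module in place of Prisma and better-sqlite3 (an assumption for illustration; the function and table names are hypothetical). It only shows the mode change leaking from one connection to another on a local disk. It does not reproduce the corruption, which depended on the production filesystem.

```python
import os
import sqlite3


def demo_shared_journal_mode(db_path: str) -> dict:
    """Show that one connection's WAL switch changes the file for everyone."""
    # Connection A stands in for the app's main client (Prisma in this story).
    app = sqlite3.connect(db_path)
    app.execute("CREATE TABLE rooms (id INTEGER PRIMARY KEY, name TEXT)")
    app.commit()
    before = app.execute("PRAGMA journal_mode").fetchone()[0]

    # Connection B stands in for the vector store and flips the file to WAL.
    vec = sqlite3.connect(db_path)
    vec.execute("PRAGMA journal_mode = WAL").fetchone()
    vec.execute("CREATE TABLE embeddings (id INTEGER PRIMARY KEY, v BLOB)")
    vec.commit()

    # Connection A never asked for WAL, but after its next read it is in WAL too.
    app.execute("SELECT count(*) FROM rooms").fetchone()
    after = app.execute("PRAGMA journal_mode").fetchone()[0]

    # Check for the sidecar files while both connections are still open.
    sidecars = {s: os.path.exists(db_path + s) for s in ("-wal", "-shm")}
    vec.close()
    app.close()
    return {"before": before, "seen_by_app": after, "sidecars": sidecars}
```

On a fresh file, the first connection should start in the default delete mode and end up in wal without ever issuing the pragma itself, which is the situation the main database found itself in.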

The fix: give the vector store its own file

The vector store's tables are derived data — they can be rebuilt from the source records at any time — so they never needed to live in the primary database at all. The fix was to give the vector store its own dedicated SQLite file, sitting beside the main database but completely independent of it. WAL mode on a dedicated, single-writer file is perfectly safe; the problem was only ever the sharing.
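
In code, the separation can be as small as deriving a sibling path and refusing to open anything else. This is a sketch under stated assumptions: it uses Python's sqlite3 module as a stand-in for better-sqlite3, and the helper names and the .vectors.sqlite suffix are hypothetical, not the names used in the real fix.

```python
import sqlite3
from pathlib import Path


def vector_db_path(main_db_path: str, suffix: str = ".vectors.sqlite") -> Path:
    """Derive a sibling file for the vector store (naming is hypothetical)."""
    main = Path(main_db_path)
    return main.with_name(main.stem + suffix)


def open_vector_store(main_db_path: str) -> sqlite3.Connection:
    """Open the vector store on its own file, never on the primary database."""
    path = vector_db_path(main_db_path)
    # Guard against the original bug: the two stores must not share a file.
    if path.resolve() == Path(main_db_path).resolve():
        raise ValueError("vector store must not share the primary database file")
    conn = sqlite3.connect(str(path))
    # WAL is now scoped to the dedicated file; the primary file is untouched.
    conn.execute("PRAGMA journal_mode = WAL").fetchone()
    return conn
```

The explicit guard is a design choice rather than a necessity. It turns a misconfigured path into a loud startup failure instead of a repeat of the same incident.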

Because retrieval was already written to degrade gracefully — falling back to keyword matching when the vector store has no answer — the chat kept working the entire time the new vector file was empty. Once the fix deployed (which also reset the main database's WAL state and brought production back), we re-embedded the existing corpus into the new file: roughly 1,300 messages and 75 knowledge entries. Semantic ranking was fully restored, and crucially, the heavy write load of that backfill ran without a single corruption error — the proof that the two databases are now properly separated.
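
The graceful degradation described above can be sketched roughly like this. It is an illustration, not the production code: retrieve and keyword_search are hypothetical names, the vector search is passed in as a plain callable, and the keyword scoring (counting shared words) is a deliberately naive assumption.

```python
from typing import Callable, List, Sequence

SearchFn = Callable[[str, int], Sequence[str]]


def keyword_search(corpus: Sequence[str], query: str, limit: int) -> List[str]:
    """Rank texts by how many query words they share (naive stand-in)."""
    terms = set(query.lower().split())
    scored = []
    for text in corpus:
        score = len(terms & set(text.lower().split()))
        if score:
            scored.append((score, text))
    # Stable sort: ties keep their original corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:limit]]


def retrieve(query: str, vector_search: SearchFn,
             keyword_fallback: SearchFn, limit: int = 5) -> List[str]:
    """Prefer semantic hits; fall back to keywords when there are none."""
    hits = list(vector_search(query, limit))
    if hits:
        return hits
    # An empty (or not yet backfilled) vector store lands here.
    return list(keyword_fallback(query, limit))
```

With a shape like this, an empty vector file simply means every query takes the second branch until the backfill catches up, which matches what happened while the new file was being populated.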

Why it matters

The lesson here is broader than one pragma. Sharing a single SQLite file across two different libraries is fragile, because each one assumes it controls the connection, the journaling, and the locking. And WAL mode and networked volumes don't mix: WAL replaces the rollback journal with a write-ahead log that is coordinated through a shared-memory index (the -shm file), and that scheme quietly assumes every connection lives on one host with a local disk. Separating the derived vector data onto its own file removed both hazards at once. The most reassuring part of a good fix isn't that the error stopped; it's watching the exact workload that used to break it run clean.