
The model has no memory and we keep pretending it does

Every time you start a new conversation with a language model, it wakes up with no idea who you are. No memory of what you decided last week. No record of the codebase you showed it three sessions ago. Clean slate, every time.

This is not a bug they haven’t gotten around to fixing. It’s the architecture. Transformers are stateless across requests by design: the weights are frozen at inference time, and nothing from one request carries into the next unless you send it again in the prompt. The attention mechanism that makes them so capable at reasoning within a window is also why that window is the whole world. When it closes, everything’s gone.

I started building greflect, a multi-agent experiment, and this hit me almost immediately. The use case was simple on paper: an AI that remembers our conversations, understands context across sessions, can pull up what we decided about the authentication system three weeks ago. The kind of thing that sounds obvious until you realize nothing in the standard model stack actually does it. I kept bumping into the wall. You can have a long context window. Gemini 2.5 will give you a million tokens; Claude 3.5 at the time was already at two hundred thousand. But raw context length solves the wrong problem. You don’t need to stuff everything in. You need to retrieve the right piece.

why RAG sounds better than it is

Retrieval-augmented generation (RAG) is the standard answer to the memory problem, and it’s a genuinely good idea with a lot of room to disappoint you in practice.

The mechanic: you take your documents, conversations, code, whatever, and embed them. That means running them through a model that converts text into high-dimensional vectors, where semantically similar things land near each other in the space. Store those vectors in a database that can do fast nearest-neighbor search. At query time, embed the question, find the closest chunks, stuff them into the prompt as context. The model “remembers” because you handed it the right page.
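Here is a minimal sketch of that loop in Python. It is not greflect’s code. The `embed` function is a toy bag-of-words counter standing in for a real embedding model, so it only matches shared words and would never connect “login flow” to “authentication system”. `VectorStore` and `build_prompt` are hypothetical names, and the linear scan stands in for a real nearest-neighbor index.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """Hypothetical in-memory store; search is a brute-force linear scan."""

    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self.items.append((embed(text), text))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def build_prompt(question: str, store: VectorStore, k: int = 2) -> str:
    """Stuff the k closest chunks into the prompt as context."""
    context = "\n".join(f"- {chunk}" for chunk in store.search(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


store = VectorStore()
store.add("We decided to use JWT tokens for the authentication system")
store.add("The deploy script runs on every merge to main")
print(build_prompt("what did we decide about the authentication system", store, k=1))
```

Swapping the toy `embed` for a real model is what turns word matching into semantic matching. The rest of the shape stays roughly the same.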

The demo works every time. You ask about something you indexed, it comes back with the right chunk, the model synthesizes a good answer. It feels like memory.

Then you build the real thing.

First problem: chunking. How you split documents into retrievable pieces matters enormously and there’s no clean default. Fixed-size windows break across sentence boundaries and lose coherence. Make chunks too large and retrieval is imprecise. Make them too small and the returned chunk has no surrounding context, so the model is working with a fragment. The shape of the chunk has to match the shape of questions you expect to ask, and you don’t always know that shape in advance.
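To make the trade-off concrete, here is a rough sketch comparing the two approaches. `fixed_chunks` and `sentence_chunks` are hypothetical helpers, the regex sentence splitter is deliberately crude, and the size limit and overlap are placeholders you would have to tune for your own data.

```python
import re


def fixed_chunks(text: str, size: int) -> list[str]:
    """Naive fixed-size character windows; these can cut mid-sentence."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str, max_chars: int, overlap: int = 1) -> list[str]:
    """Pack whole sentences into chunks of up to max_chars characters.

    The last `overlap` sentences of a chunk are carried into the next one
    when they fit. A single sentence longer than max_chars still becomes
    its own (oversized) chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > max_chars:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward, dropping any that no longer fit.
            current = current[-overlap:] if overlap > 0 else []
            while current and len(" ".join(current + [sentence])) > max_chars:
                current.pop(0)
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


notes = (
    "We chose JWT for auth. Sessions expire after one hour. "
    "Refresh tokens live for thirty days."
)
print(fixed_chunks(notes, 30))       # windows end wherever the count runs out
print(sentence_chunks(notes, 70))    # chunks end on sentence boundaries
```

Neither version knows anything about the questions you will ask, and that is still the hard part. The sentence-aware one just avoids handing the model half a thought.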

Second problem: the retrieval is fuzzy by design. Cosine similarity between embedding vectors doesn’t care about exact words; it cares about semantic neighborhood. This is the feature: “authentication system” might surface chunks that say “login flow” or “session handling” because they live nearby in vector space. But it also returns neighbors that just happen to be close for reasons you didn’t intend. The results are noisy. Naive RAG retrieves candidates, not answers.

Third problem: the model will still make things up. Retrieved context being present doesn’t mean it’s used faithfully. If the retrieved chunk is incomplete or slightly off-topic, the model fills the gap with plausible-sounding confabulation. RAG reduces hallucination on known facts; it doesn’t eliminate it.

What actually helps: re-ranking. Retrieve a wider candidate set with fast approximate vector search, then score the candidates with a slower, more expensive cross-encoder that looks at query and passage together. The two-stage pattern, coarse retrieval and precise re-ranking, is where the gap between a demo and something you’d trust starts to close. Hybrid search helps too, fusing vector similarity with keyword matching so you don’t lose exact-string precision entirely.

retrieve(query, k=50) → rerank(query, candidates) → top_k(5) → generate

That’s the pattern. It’s not complicated, but it adds latency and infrastructure, which is why most tutorials skip it.
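A hedged sketch of the two stages in Python, with the expensive parts swapped out for toys. `word_overlap` stands in for vector similarity and `phrase_match` stands in for a cross-encoder. Both are assumptions made up for illustration, not real models. What matters is the shape: a cheap scorer over everything, then a costlier scorer that reads query and passage together over a short list.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve(query: str, corpus: list[str], cheap_score: Scorer, k: int = 50) -> list[str]:
    """Stage 1: score every chunk cheaply and keep a wide candidate set."""
    ranked = sorted(corpus, key=lambda chunk: cheap_score(query, chunk), reverse=True)
    return ranked[:k]


def rerank(query: str, candidates: list[str], precise_score: Scorer, top_k: int = 5) -> list[str]:
    """Stage 2: rescore only the candidates with the slower scorer."""
    ranked = sorted(candidates, key=lambda chunk: precise_score(query, chunk), reverse=True)
    return ranked[:top_k]


def word_overlap(query: str, chunk: str) -> float:
    """Toy stand-in for vector similarity: number of shared words."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def phrase_match(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: the longest run of consecutive
    query words that appears verbatim in the chunk."""
    words = query.lower().split()
    text = " " + " ".join(chunk.lower().split()) + " "
    best = 0
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            if " " + " ".join(words[i:j]) + " " in text:
                best = max(best, j - i)
    return float(best)


corpus = [
    "the system restarted after the authentication server update",
    "we decided the authentication system will use short-lived tokens",
    "lunch is at noon",
]
query = "authentication system decision"
candidates = retrieve(query, corpus, word_overlap, k=2)
print(rerank(query, candidates, phrase_match, top_k=1))
```

In this toy corpus the first two chunks tie on shared words, so stage one cannot tell them apart. The second stage breaks the tie because it looks at how the query words sit together in the passage.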

what I actually learned

greflect isn’t done and the memory problem isn’t solved. What I have is a clearer picture of what “solved” would need to look like.

The delete problem is real and annoying. Updating a vector database when source documents change, or when a decision gets reversed, is not graceful. Embeddings don’t have clean overwrite semantics the way a row in a table does. You re-index, you mark stale, you hope retrieval surfaces the new version. It’s operational overhead that the “just embed your docs” framing glosses over.
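One way to make “mark stale, re-index” less of a hope is to key every entry by its source and supersede old versions explicitly. The sketch below is an assumption about how that could look, not how any particular vector database works. `MemoryIndex`, `Entry`, and `upsert` are hypothetical names, and the embedding step is left out to keep the bookkeeping visible.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Entry:
    source_id: str
    text: str
    version: int
    stale: bool = False


@dataclass
class MemoryIndex:
    """Hypothetical index that tracks versions per source document."""

    entries: list = field(default_factory=list)
    hashes: dict = field(default_factory=dict)  # source_id -> hash of latest text

    def upsert(self, source_id: str, text: str) -> bool:
        """Index new text for a source; returns False if nothing changed."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.hashes.get(source_id) == digest:
            return False  # unchanged, skip the re-index
        version = 1
        for entry in self.entries:
            if entry.source_id == source_id:
                entry.stale = True  # supersede every older version
                version = max(version, entry.version + 1)
        # A real store would also re-embed the text here.
        self.entries.append(Entry(source_id, text, version))
        self.hashes[source_id] = digest
        return True

    def live(self) -> list:
        """Entries that retrieval should be allowed to return."""
        return [e for e in self.entries if not e.stale]

    def purge(self) -> int:
        """Hard-delete stale entries; returns how many were removed."""
        before = len(self.entries)
        self.entries = self.live()
        return before - len(self.entries)


index = MemoryIndex()
index.upsert("auth-decision", "Use server-side sessions")
index.upsert("auth-decision", "Reversed: use JWT")
print([e.text for e in index.live()])
```

Filtering on `stale` at query time means a reversed decision cannot resurface. The cost is that you now maintain a mapping from sources to entries, which is exactly the overhead the tutorials leave out.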

The privacy question also keeps nagging at me. Everything you index usually sits as plaintext in a database, because the original text has to be stored next to its vector to be handed back to the model. For personal memory across sessions, that includes conversations, code decisions, context you’d consider sensitive. Building something you’d actually use means thinking about what you’re persisting and where, which most RAG tutorials treat as someone else’s problem.

Claude Code just went GA this month and the direction of travel is clear: more agents with longer task horizons, more need for persistent external memory, more of exactly this problem. The tools are getting better. The retrieval infrastructure is getting cheaper and easier to run. What isn’t getting better automatically is the design judgment: what to chunk, what to index, how to handle updates, when to trust a retrieval result and when to distrust it.

The model still wakes up with no memory. We’re just getting more sophisticated about what we hand it when it does.