Tags: agentic-ai, memento-vault, architecture

Building a knowledge vault that costs zero LLM tokens at retrieval

7 min read

I kept losing context between Claude Code sessions. Decisions I’d made, patterns I’d discovered, gotchas I’d hit once and didn’t want to hit again. Every new session started blank. After a few months of this I built Memento Vault to capture that knowledge automatically.

The problem gets worse as you go deeper into agentic development. Once you’re running parallel agents in worktrees and orchestrating multi-step tasks across sessions, the amount of context that evaporates between runs compounds fast. A single developer might lose track of it; a team running multiple agents loses track of things they didn’t even know the agents discovered.

It’s an Obsidian-based system with atomic notes, automated triage, and a background consolidation agent. Open source, MIT licensed. The design choice I’d like to focus on in this post is the retrieval layer, because I went a different direction than most AI memory systems.

Why zero-cost retrieval matters

Most AI memory systems (Honcho, Zep, Mem0, MemGPT) spend thousands of tokens per retrieval call. They run an LLM to rerank results, summarize context, or decide what’s relevant. That adds up. If the agent checks memory frequently during a long session, retrieval costs start rivaling the primary task.

Retrieval cost per call: Memento (BM25 + vector) ~0.1k tokens; Honcho / Zep / Mem0 (LLM rerank) ~7k tokens. Memento uses zero LLM calls in the retrieval path.

There’s a context pollution problem too. Research on context rot shows some frontier models degrade with as few as 100 tokens in context (Chroma’s research confirms the pattern across 18 models). Every loosely relevant retrieval result dilutes the signal.

Memento doesn’t use an LLM in the retrieval path at all. It runs BM25 keyword search and vector similarity through QMD (a local search engine over markdown). Results go into the agent’s context as short pointers, 200-500 characters each. No summarization, no reranking.

The trade-off is precision. An LLM reranker would filter more intelligently. In practice, BM25 for exact term matches combined with vector search for semantic similarity catches what matters, and the cost difference is zero tokens per retrieval versus thousands.
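
Here’s a minimal sketch of that hybrid scoring in Python, assuming rank_bm25 and locally precomputed embeddings. The names and the 50/50 blend are illustrative, not Memento’s actual code:

```python
# Hybrid retrieval sketch: BM25 for exact terms, cosine similarity for
# semantic matches, no LLM anywhere in the path.
from dataclasses import dataclass

import numpy as np
from rank_bm25 import BM25Okapi


@dataclass
class Note:
    path: str
    text: str
    embedding: np.ndarray  # precomputed locally, e.g. by QMD


def search_vault(notes: list[Note], query: str, query_emb: np.ndarray, k: int = 5):
    bm25 = BM25Okapi([n.text.lower().split() for n in notes])
    keyword = bm25.get_scores(query.lower().split())

    embs = np.stack([n.embedding for n in notes])
    semantic = embs @ query_emb / (
        np.linalg.norm(embs, axis=1) * np.linalg.norm(query_emb) + 1e-9
    )

    def norm(x):  # rescale each signal to [0, 1] before blending
        return (x - x.min()) / (np.ptp(x) + 1e-9)

    # The equal weighting is a guess; Memento's real blend may differ.
    score = 0.5 * norm(keyword) + 0.5 * norm(semantic)
    top = np.argsort(score)[::-1][:k]
    # Return short pointers (path + snippet), never full documents.
    return [(notes[i].path, notes[i].text[:500]) for i in top]
```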

The three-hook architecture

Memento integrates with Claude Code through three hooks:

  • Session start loads the memory index and any notes tagged as relevant to the current project. Runs synchronously since the agent needs this before it does anything. This is the fast path: file I/O only, no search.
  • Per-prompt fires on each user message and searches the vault for related notes via BM25. Runs asynchronously so it doesn’t block the conversation. Results show up as system reminders on the next response. Around 800ms per call.
  • Per-file-read fires when the agent reads a file. It extracts keywords from the file path, searches the vault, and surfaces relevant decisions or gotchas before the agent modifies code it doesn’t have full context for. Directory-level caching keeps this cheap after the first hit (see the sketch after this list).
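
A hedged sketch of that third hook, with directory-level caching; all names here are illustrative, and vault_search stands in for the QMD-backed query Memento actually runs:

```python
# Per-file-read hook sketch: keywords come from the path, and results are
# cached per directory so repeated reads in the same area are free.
import os
import re
from functools import lru_cache


def vault_search(keywords: tuple[str, ...]) -> tuple[str, ...]:
    # Stand-in for the real BM25 + vector query against the vault.
    return tuple(f"note matching '{k}'" for k in keywords)


def path_keywords(file_path: str) -> tuple[str, ...]:
    # "src/apollo/cachePolicy.ts" -> ("apollo", "cachepolicy")
    parts = re.split(r"[/\\._\-]", file_path.lower())
    stop = {"", "src", "lib", "test", "index", "ts", "js", "py", "md"}
    return tuple(p for p in parts if p not in stop)


@lru_cache(maxsize=256)
def _notes_for_dir(directory: str) -> tuple[str, ...]:
    return vault_search(path_keywords(directory))


def on_file_read(file_path: str) -> tuple[str, ...]:
    # The first read in a directory pays for the search; later reads in
    # the same directory hit the lru_cache.
    return _notes_for_dir(os.path.dirname(file_path))
```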

Most memory systems use a single retrieval endpoint. Having three means the right context arrives at the right moment instead of everything dumping in at session start.

Atomic notes with epistemic metadata

Each note in the vault has YAML frontmatter:

```yaml
---
title: "Description of what was learned"
type: discovery            # one of: discovery | decision | pattern | tool
tags: [relevant, tags]
certainty: 3               # 1-5
validity-context: "when this knowledge applies"
date: 2026-03-29
---
```
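
For illustration, a note with that schema parses and validates with plain PyYAML; the sample values below are placeholders, not real vault content:

```python
import yaml

raw = """\
title: "Prefer TTL over manual cache busting"
type: pattern
tags: [caching, apollo]
certainty: 4
validity-context: "Applies to the React frontend, not the mobile app"
date: 2026-03-29
"""
note = yaml.safe_load(raw)
assert note["type"] in {"discovery", "decision", "pattern", "tool"}
assert 1 <= note["certainty"] <= 5
```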

certainty ranges from 1 to 5. A discovery from a single session starts at 3. Confirmed across multiple sessions, it goes up; if it turns out to be wrong, it goes down or gets removed. Retrieval weights high-certainty notes over speculative ones and applies temporal decay with a 90-day half-life (notes at certainty 4-5 are exempt from decay).
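
As a sketch, the weighting might look like this. The exact formula is my assumption; the 90-day half-life and the certainty 4-5 exemption come from the design above:

```python
from datetime import date


def note_weight(certainty: int, note_date: date, today: date) -> float:
    base = certainty / 5.0  # assumed mapping from certainty to base weight
    if certainty >= 4:
        return base  # certainty 4-5 notes are exempt from decay
    age_days = (today - note_date).days
    return base * 0.5 ** (age_days / 90.0)  # 90-day half-life


# A certainty-3 note is worth half its base weight after 90 days:
print(note_weight(3, date(2026, 3, 29), date(2026, 6, 27)))  # 0.3
```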

validity-context scopes when the knowledge applies. “Only valid for PostgreSQL 15+” or “Applies to the React frontend, not the mobile app.” Without this, the agent applies patterns in the wrong context. I learned that one the hard way.

The Inception agent

After a session ends, a background agent called Inception runs over the vault. It clusters all notes using HDBSCAN on QMD embeddings. No LLM involved, just math. Then it finds new clusters that weren’t present in the last run and synthesizes pattern notes from them. The synthesis step is the only part that calls an LLM. Right now it runs on an OpenAI Codex subscription so the marginal cost is zero. The plan is to make this agent-agnostic.
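
The clustering step, roughly; real runs use QMD embeddings and whatever parameters Inception ships with, so the values below are stand-ins:

```python
import numpy as np
import hdbscan

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(550, 384))  # one vector per vault note

clusterer = hdbscan.HDBSCAN(min_cluster_size=3)
labels = clusterer.fit_predict(embeddings)  # -1 = noise, else a cluster id

cluster_ids = sorted(set(labels) - {-1})
# Inception diffs these against the previous run's clusters and hands only
# the top new ones to an LLM for pattern synthesis.
```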

In practice, individual session notes (“hit this bug in Apollo cache”) gradually consolidate into higher-level patterns (“TTL misconfiguration causes stale reads across three different caching layers”). Clustering with a synthesis step on top.

Anthropic is building something similar. Claude Code’s source accidentally shipped with a sourcemap in late March, and the TypeScript contains an autoDream service: a background sub-agent that consolidates memory files, prunes stale notes, and resolves contradictions, in four phases (orient, gather signal, consolidate, prune). It triggers after enough sessions accumulate. The feature is behind a flag and not wired up for most users yet, but the direction is obvious. When it ships, it’ll do for MEMORY.md (the auto-memory files Claude Code harnesses use) what Inception does for the vault. I take it as validation that the approach is right, even if the implementation details differ.

The cost model scales with cluster count, not vault size. HDBSCAN on 550 notes finds around 38 clusters. Inception picks the top 10 and synthesizes those. A vault with 5,000 notes would find more clusters but still only run 10 LLM calls per session. About $0.04 per run at Haiku rates, $0 on the subscription.

Dedup keeps it from getting noisy. Three layers: a ledger of processed notes, title overlap detection for near-duplicates, and an LLM skip check for semantic duplicates with different titles.
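
The title-overlap layer could be as simple as a string-similarity ratio; the 0.8 threshold here is my guess, not Memento’s actual value:

```python
from difflib import SequenceMatcher


def is_near_duplicate(title: str, seen_titles: list[str], threshold: float = 0.8) -> bool:
    # Layer two of three: cheap string similarity before any LLM check.
    return any(
        SequenceMatcher(None, title.lower(), t.lower()).ratio() >= threshold
        for t in seen_titles
    )
```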

How it compares

| Feature | Memento Vault | Honcho / Zep / Mem0 |
| --- | --- | --- |
| Per-retrieval LLM cost | 0 tokens | 1,600-7,000 tokens |
| Context injection size | ~50-125 tokens (~200-500 chars) | 1,600-7,000 tokens |
| Storage | Local Obsidian files | Cloud database |
| Consolidation | Background HDBSCAN + LLM | Real-time LLM |
| Hook granularity | 3 hooks (start, prompt, file-read) | 1 endpoint |
| Cost at scale | Zero (local search) | Per-retrieval LLM cost |

Honcho’s agentic retrieval (their Dialectic agent) hit 90.4% on LongMemEval_S by using an LLM to decide what to search, how many times, and when to stop. That’s a good result. It also costs LLM calls on every query.

I benchmarked Memento against 30 real sessions (341 prompts, 362 file reads, 16 projects). Hooks inject about 139 tokens per session, while a single concierge agent call (an on-demand search that queries the full vault when you ask it a question) costs roughly 72,500 tokens, so the hooks are 522x cheaper. Wall-clock overhead averages 13 seconds per session, about 1.1% of a 20-minute session.

Context injection per session: Memento’s hooks at 139 tokens vs. 72,500 tokens for a single concierge call, 522x cheaper.

On LongMemEval retrieval (500-question LongMemEval_S set, custom adapter, turn-level granularity): NDCG@10 = 0.892, MRR = 0.907, recall@5 = 0.909. End-to-end accuracy is 70% overall (100% single-session, 80% knowledge-update, 60% temporal reasoning, 40% multi-session). Multi-session and temporal are the weak spots. Good enough for what it costs.

Memento is less sophisticated at retrieval. It’s also 522x cheaper and runs locally. In a workflow where the agent checks memory dozens of times per session, that difference compounds.

What actually matters

I spent a lot of time on the retrieval pipeline. BM25 tuning, vector search, adaptive depth, benchmark numbers. Worth doing, but the retrieval layer is the replaceable part.

The vault has 800+ notes of domain knowledge, architectural decisions, patterns, gotchas. That accumulates regardless of what tool reads it. If I switched from Claude Code tomorrow, the vault comes with me. Markdown files in a folder. Any search tool can index them, any LLM can read them.

Overhead is 13 seconds and 139 tokens per session. One concierge call costs about as much as 522 sessions of hook overhead, which at my usage works out to a break-even of one prevented concierge call every 18 weeks. Cheap retrieval, expensive capture.


Next up: five revisions of the skill that kicks off every ticket at dala.care, and how each revision fixed a different class of problem I hadn’t anticipated.