
Building a Hybrid Retriever for 16,894 Obsidian Files

From the guide: Claude Code Comprehensive Guide

A grep through 16,894 markdown files takes 11-66 seconds depending on the term and returns hundreds of low-relevance matches. A vector search returns semantically related content but misses the exact function name you typed. A hybrid retriever that fuses both methods returns the right answer in 23 milliseconds (end-to-end, including query embedding) from a single 83 MB SQLite file with zero API calls.1

The obsessive note-taker’s problem is not collection. The problem is retrieval. Obsidian makes capturing frictionless. Accumulate enough files and the vault becomes a write-only database: easy to add to, impossible to query. Search by filename works until filenames become meaningless. Full-text search works until the same keyword appears in 400 documents. Tags work until you forget to tag something.

An HN commenter asked for the full architecture behind the retrieval system I built for my Obsidian vault.2 Here it is: the chunking strategy, the embedding model, the dual-index SQLite schema, the fusion math with real numbers, and the failure modes I found after querying the system hundreds of times.

TL;DR

The retriever combines FTS5 BM25 keyword search with Model2Vec vector similarity search, fused via Reciprocal Rank Fusion (RRF) into a single ranked list. Everything runs locally in one SQLite database: 49,746 chunks from 16,894 files in 83 MB. Full reindex takes four minutes. Incremental updates run in under ten seconds. The system integrates with Claude Code through hooks, giving the agent access to the vault’s knowledge without loading files into context. BM25 catches exact identifiers and function names. Vector search catches semantic matches across different terminology. RRF merges both without requiring score calibration. The honest trade-off: well-tagged shallow content can outrank poorly-structured deep content because BM25 rewards keyword density, not depth.


Key Takeaways

For note-takers with large vaults. In my experience, full-text search alone became unusable past a few thousand files — and existing Obsidian search plugins (Smart Connections, Omnisearch) index within the app, not as an external library that other tools can query.1 Adding vector search on top of BM25 catches the queries where you remember the concept but not the keyword. The retriever runs entirely on SQLite with no external services, no GPU, and no API costs. Model2Vec embeds at CPU speed because the model is 30 MB of static word vectors, not a transformer.3

For developers building retrieval systems. RRF is the fusion method that requires the least tuning. The formula uses only rank positions, not raw scores, so you never need to calibrate BM25 scores against cosine distances. Start with k=60 and equal weights. Tune only after measuring failure cases on your own data. The sqlite-vec extension brings vector KNN search into SQLite without a separate vector database.4

For Claude Code users. The retriever runs as a library that hooks can call. A PreToolUse hook queries the vault before the agent starts working. The agent sees 2-3 KB of focused results with file path attribution instead of loading entire files. The integration keeps context windows small while giving the agent access to 16,894 files of knowledge.

Minimum viable version. The simplest starting point: create an FTS5 virtual table over your markdown files (BM25-only, no embeddings). Add sqlite-vec and Model2Vec when keyword search starts missing semantic matches. Add RRF fusion last. Each layer works independently. The full stack requires Python 3, one 30 MB model download, and pip install model2vec sqlite-vec. No GPU, no Docker, no external services. Total disk footprint for 16,894 files: 83 MB.

Want the full operational guide? The Obsidian AI Infrastructure reference covers vault architecture, plugin configuration, MCP server setup, incremental indexing recipes, and troubleshooting — the step-by-step companion to the architecture deep-dive in this post.


Why Keyword Search Alone Fails at Scale

Full-text search breaks down at vault scale in predictable ways. FTS5 with BM25 ranking excels at exact matches: search for requestAnimationFrame and every file containing that exact token appears, ranked by term frequency and document length.5 Robertson and Zaragoza’s survey of probabilistic relevance models confirms BM25’s strength: the algorithm performs well on keyword-heavy queries with minimal parameter tuning.14 The failure mode is synonyms and concept matching. Search for “how to handle authentication failures” and BM25 returns every file mentioning “authentication” or “failures” individually, diluting the results with tangentially related content.

Vector search solves the synonym problem. Embed the query and find chunks whose embeddings sit close in vector space. “How to handle authentication failures” matches content about “login error recovery” and “session expiration handling” because the embedding captures semantic similarity across different terminology.6 Karpukhin et al. demonstrated with Dense Passage Retrieval (DPR) that dense embeddings outperform BM25 on open-domain question answering by 9-19% in top-20 accuracy, precisely because dense representations capture meaning beyond lexical overlap.15 The failure mode is the opposite: vector search misses exact identifiers. Search for the function name _rrf_fuse and vector search returns content about fusion and ranking algorithms but may rank the actual function definition below a conceptual explanation.

Neither method alone covers both failure modes. A single query illustrates the difference (not proof of superiority — aggregate evaluation requires a golden set, which the system does not yet have). The query “PostToolUse hook for context compression” returns different top-3 results from each method:

Rank | BM25 Only | Vector Only | Hybrid (RRF)
---- | --------- | ----------- | ------------
1 | hook-stdlib.sh “PostToolUse Handler” | context-is-the-new-memory.md “Compression Layers” | context-is-the-new-memory.md “Compression Layers”
2 | settings.json “PostToolUse Events” | token-budget-analysis.md “Context Engineering” | hook-stdlib.sh “PostToolUse Handler”
3 | compress-output.sh “Tool Output Filter” | agent-memory-patterns.md “Retrieval Integration” | compress-output.sh “Tool Output Filter”

BM25 found the exact hook file and settings reference (keyword match on “PostToolUse”) but missed the conceptual context engineering note. Vector search found the compression strategy notes (semantic match on “context compression”) but missed the specific hook implementation. RRF promoted the notes that matter for both the concept and the implementation, placing the strategy note and the hook file in positions one and two.13

Research on MS MARCO passage ranking supports the pattern in web search benchmarks: hybrid retrieval consistently outperforms either BM25 or dense retrieval alone, with the largest gains on queries that contain both specific terms and abstract concepts.7,16


The Architecture: Three Layers That Compound

The system has three independent layers. Each works without the others, but together they compound.

Layer 1: Intake. A 733-line Python scoring pipeline rates every incoming signal on four dimensions: relevance, actionability, depth, and authority. Signals scoring 0.55 or above route automatically to one of 12 domain folders. Signals between 0.40 and 0.55 queue for manual review. Below 0.40, the pipeline drops the signal. The pipeline has processed 7,771 signals across 14 months without manual tagging.1 The intake layer determines what enters the vault. The retrieval layer makes it findable.
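
The threshold routing can be sketched in a few lines. This is a hypothetical simplification: the real 733-line pipeline's dimension weighting is not shown in the post, so a plain average of the four dimensions is assumed here, and the function name is invented.

```python
# Hypothetical sketch of the intake thresholds described above.
# Assumption: the four dimensions are averaged; the production
# pipeline's actual weighting is not described in the post.
ACCEPT, REVIEW = 0.55, 0.40

def route_signal(relevance, actionability, depth, authority):
    score = (relevance + actionability + depth + authority) / 4
    if score >= ACCEPT:
        return "auto-file"      # routed to one of the 12 domain folders
    if score >= REVIEW:
        return "manual-review"  # queued for human triage
    return "drop"
```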

Layer 2: Retrieval. The hybrid search engine covered in detail below. The engine chunks every file at heading boundaries, embeds the chunks with Model2Vec, and indexes them in SQLite with both a vec0 table for vector KNN and an FTS5 virtual table for BM25. A query runs against both indexes simultaneously, and RRF fuses the results into a single ranked list.

Layer 3: Integration. Claude Code hooks that wire the retriever into the agent’s workflow. A hook fires on prompt submission, queries the vault for relevant context, and injects the top results into the conversation. The agent sees focused chunks with source attribution instead of raw file contents:

# Illustrative output (format matches production, content simplified)
## Relevant Memory Context

### OAuth Token Rotation (security-patterns)
Rotate tokens on 401 response. Store refresh token in keychain,
not environment variable. Implement retry with backoff...

### Session Expiration Handling (auth-architecture)
Three expiration modes: absolute (24h), sliding (30min idle),
refresh (7d with rotation). Hook into 401 interceptor...

Each result carries the section heading and source project, capped at a 500-token budget to avoid context bloat.
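
A minimal sketch of a budget-capped formatter for that context block, assuming the retriever hands back dicts with heading, project, and text fields (the production field names are not shown in the post) and using a rough 4-characters-per-token estimate:

```python
# Hypothetical formatter for the injected context block; the result
# field names (heading, project, text) are assumptions.
TOKEN_BUDGET = 500
CHARS_PER_TOKEN = 4  # rough estimate

def format_context(results, budget=TOKEN_BUDGET):
    lines, used = ["## Relevant Memory Context"], 0
    for r in results:
        block = f"\n### {r['heading']} ({r['project']})\n{r['text']}"
        cost = len(block) // CHARS_PER_TOKEN
        if used + cost > budget:
            break  # stop before blowing the budget
        lines.append(block)
        used += cost
    return "\n".join(lines)
```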

The retriever also enables a second integration point: a PostToolUse hook that compresses tool outputs before they enter the conversation. Raw tool output contains timestamps, ordering artifacts, and verbose formatting that vary between runs. The retriever replaces the raw dump with a stable, focused subset. The agent never sees the noise, only the relevant extract. A side benefit: because the retriever’s output is deterministic for the same query (same index state produces the same ranked results), the compressed output helps prompt caching. Repeated queries against unchanged data produce identical context blocks, and the CLI’s automatic prompt caching reuses the cached prefix.

The broader infrastructure story explains how hooks, skills, and agents compose into a programmable layer around the model.

The layers are decoupled by design. The intake scoring knows nothing about embeddings. The retriever knows nothing about signal routing rules. But intake ensures the vault contains high-quality content, retrieval surfaces the right subset for any query, and integration delivers that subset to the agent without context bloat. I wrote about the theoretical framing of context as the critical resource. The retriever is the practical implementation.


Chunking: Where Retrieval Quality Starts

Chunking determines the granularity of search results. Chunk too large and vector search returns entire files where only one paragraph is relevant. Chunk too small and the embedding loses context needed for semantic matching. Research on RAG pipelines confirms that chunk size has a larger impact on retrieval quality than model choice for most use cases, with 200-500 token chunks performing best for paragraph-level retrieval tasks.18

The chunker splits at H2 (##) heading boundaries, preserving markdown structure.8 A note about OAuth token rotation with three H2 sections becomes three chunks, each self-contained enough for the embedding to capture its meaning. The indexer stores the heading text and parent note title as metadata alongside each chunk, providing context for BM25 matching even when the chunk text itself is sparse.

# chunker.py: H2 splitting with heading context
MIN_CHUNK_CHARS = 30
MAX_CHUNK_CHARS = 2000

def _split_at_headings(body):
    sections = []
    current_heading = ""
    current_lines = []
    for line in body.split("\n"):
        if line.startswith("## "):
            if current_lines:
                text = "\n".join(current_lines).strip()
                if text:
                    sections.append((current_heading, text))
            current_heading = line[3:].strip()
            current_lines = []
        else:
            current_lines.append(line)
    if current_lines:
        text = "\n".join(current_lines).strip()
        if text:
            sections.append((current_heading, text))
    return sections

The chunker splits sections exceeding 2,000 characters further: first at H3 boundaries, then at paragraph breaks. It drops sections under 30 characters. The chunker also skips Related, See Also, Links, and References sections, which are typically lists of wiki-links rather than searchable content.

Two design choices matter for retrieval quality. First, the indexer stores the heading context string ("OAuth Token Rotation | note | security, authentication") in a separate column and indexes it in FTS5 with a lower weight (0.3) than the chunk text (1.0). BM25 still matches on the heading when the chunk body does not contain the search term, but the heading match scores lower than a body match. Second, the chunker extracts frontmatter tags and note type and includes them in the heading context, so a search for “security” matches notes tagged with security even when the body text uses different terminology.


Embedding: 30 MB Model, Zero API Calls

The embedding model is Model2Vec’s potion-base-8M, a static word embedding model with 7.6 million parameters producing 256-dimensional vectors.3 On the MTEB benchmark suite, potion-base-8M achieves 89% of all-MiniLM-L6-v2’s performance (50.03 vs 56.09 average) at up to 500x the inference speed, making it practical for indexing large corpora on consumer hardware.9,17 One caveat: the model’s MTEB Retrieval sub-score is notably lower (31.71) than its Classification (64.44) or STS (73.24) scores. MTEB’s retrieval benchmarks test document-level ranking on web corpora, not paragraph-level matching on homogeneous markdown chunks. The gap matters less when chunks are short, topically focused, and written in a consistent vocabulary. Unlike transformer-based embedding models, Model2Vec does not run attention layers over the input. The model distills a sentence transformer’s knowledge into static token embeddings, producing vectors through weighted averaging rather than sequential computation.9

Why do static embeddings work for this use case? Short markdown chunks (200-400 words on average) contain concentrated vocabulary about a single topic. The weighted average of those token vectors lands in a meaningful region of the embedding space because there is little off-topic dilution. In practice, a 2,000-word document covering three different subjects tends to produce a blurred centroid that sits between topic clusters rather than within one. A chunk about OAuth token rotation, by contrast, produces a vector that clusters tightly with other authentication content. Static embeddings trade contextual disambiguation (the word “bank” in “river bank” vs “bank account”) for raw speed. In a personal knowledge base where each chunk covers one concept, the ambiguity penalty is small and the paper reports up to 500x inference speedup.9

# embedder.py: lazy-loading Model2Vec in a dedicated venv
DEFAULT_MODEL = "minishlab/potion-base-8M"
EMBEDDING_DIM = 256

class Model2VecEmbedder:
    def __init__(self, model_name=DEFAULT_MODEL):
        self._model_name = model_name
        self._model = None

    def _ensure_model(self):
        if self._model is not None:
            return
        _activate_venv()  # Add memory venv to sys.path
        from model2vec import StaticModel
        self._model = StaticModel.from_pretrained(self._model_name)

    def embed_batch(self, texts):
        self._ensure_model()
        vecs = self._model.encode(texts)
        return [v.tolist() for v in vecs]

The practical consequence: a full reindex of 16,894 files completes in four minutes on an Apple M3 Pro. Incremental indexing (only changed files, detected by mtime comparison) runs in under ten seconds on a typical day’s edits.1

The model runs in an isolated virtual environment at ~/.claude/venvs/memory/ to avoid dependency conflicts with the rest of the toolchain. The embedder lazy-loads the model on first use, not at import time, so importing the module costs nothing when the retriever falls back to BM25-only mode.

Why not a larger model? Two reasons. First, the 256-dimensional vectors keep the SQLite database at 83 MB for 49,746 chunks. Higher-dimensional vectors (768 or 1,024) would triple or quadruple the database size for marginal quality improvement on short markdown chunks.10 Second, API-based embeddings (OpenAI’s text-embedding-3-small at $0.02 per million tokens, for example) introduce latency, cost, and a network dependency for a system that should work offline.11 The full vault re-embed costs roughly $0.30 at API prices, trivial in isolation; the real costs are the round-trip latency across 49,746 chunks and the privacy exposure of sending personal notes to an external API.
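
As a sanity check on those numbers, the raw float32 vector storage alone can be computed directly (the remainder of the 83 MB file is chunk text, metadata, and the FTS index):

```python
# Back-of-envelope check: raw float32 vector bytes per dimensionality.
CHUNKS = 49_746

def raw_vector_mb(dim, chunks=CHUNKS):
    return chunks * dim * 4 / 1_000_000  # 4 bytes per float32 component

# 256 dims keep raw vectors around 51 MB; 768 dims roughly triple that,
# before counting the index structures layered on top.
```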

A model hash mechanism tracks embedding compatibility. The indexer stores a hash derived from the model name and vocabulary size. If the model changes, incremental indexing detects the mismatch and triggers a full reindex automatically.
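
A sketch of such a compatibility hash; the exact digest scheme is an assumption, the post only states that the hash derives from model name and vocabulary size:

```python
import hashlib

def model_hash(model_name, vocab_size):
    # Hypothetical: any change to the model name or vocabulary size
    # produces a different hash, which the indexer treats as a signal
    # to trigger a full reindex.
    key = f"{model_name}:{vocab_size}".encode()
    return hashlib.sha256(key).hexdigest()[:16]
```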


The SQLite Schema: Three Tables, One File

The entire index lives in one SQLite file (vectors.db, 83 MB) using WAL mode for concurrent read safety.12 Three tables serve different purposes:

-- Chunk content and metadata
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    file_path TEXT NOT NULL,
    section TEXT NOT NULL,
    chunk_text TEXT NOT NULL,
    heading_context TEXT DEFAULT '',
    mtime_ns INTEGER NOT NULL,
    embedded_at REAL NOT NULL
);

-- FTS5 for BM25 search (content-synced to chunks)
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    chunk_text, section, heading_context,
    content=chunks, content_rowid=id
);

-- sqlite-vec for vector KNN search
CREATE VIRTUAL TABLE chunk_vecs USING vec0(
    id INTEGER PRIMARY KEY,
    embedding float[256]
);

The FTS5 table uses a content-sync pattern: it references the chunks table directly rather than storing a duplicate copy of the text.5 One gotcha: content-sync tables do not propagate deletes automatically. Before removing rows from the chunks table, the indexer must issue an explicit INSERT INTO chunks_fts(chunks_fts, rowid, chunk_text, section, heading_context) VALUES('delete', ?, ?, ?, ?) command, passing the old column values so FTS5 can locate and remove the row’s postings, or the FTS5 index becomes silently inconsistent. Column weights in BM25 queries assign 1.0 to chunk text, 0.5 to section headings, and 0.3 to heading context:
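
The delete gotcha is easy to demonstrate end-to-end with Python’s built-in sqlite3 module (this assumes an SQLite build with FTS5 compiled in, which is standard):

```python
import sqlite3

# In-memory database with the same three FTS5 columns as the schema above.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE chunks(id INTEGER PRIMARY KEY, chunk_text TEXT,
                    section TEXT, heading_context TEXT);
CREATE VIRTUAL TABLE chunks_fts USING fts5(
    chunk_text, section, heading_context,
    content=chunks, content_rowid=id);
""")
con.execute("INSERT INTO chunks VALUES (1, 'rotate tokens on 401', 'OAuth', 'security')")
con.execute("""INSERT INTO chunks_fts(rowid, chunk_text, section, heading_context)
               SELECT id, chunk_text, section, heading_context FROM chunks""")
before = con.execute(
    "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH 'tokens'").fetchall()

# The 'delete' command must receive the OLD column values so FTS5 can
# find and remove this row's postings.
row = con.execute(
    "SELECT chunk_text, section, heading_context FROM chunks WHERE id = 1").fetchone()
con.execute("""INSERT INTO chunks_fts(chunks_fts, rowid, chunk_text, section, heading_context)
               VALUES ('delete', 1, ?, ?, ?)""", row)
con.execute("DELETE FROM chunks WHERE id = 1")
after = con.execute(
    "SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH 'tokens'").fetchall()
```

Skipping the special 'delete' insert leaves stale postings in chunks_fts that still match queries but point at rows that no longer exist.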

# vector_index.py: BM25 search with column weights
bm25(chunks_fts, 1.0, 0.5, 0.3) as score

The sqlite-vec extension stores 256-dimensional float vectors as packed binary data and supports KNN queries with cosine distance.4 Python’s struct.pack serializes the vectors:

def _serialize_vector(vec):
    return struct.pack(f"{len(vec)}f", *vec)

The schema handles graceful degradation by design. If sqlite-vec fails to load (missing extension, incompatible platform), the retriever falls back to BM25-only search. The vec_available property tracks whether vector search is operational.


Reciprocal Rank Fusion: The Math That Makes It Work

RRF merges two ranked lists without requiring score calibration.7 Why not combine the raw scores directly? BM25 returns negative relevance scores (more negative = more relevant in SQLite’s FTS5 implementation) while cosine distance returns values between 0 and 2. Comparing these scales requires normalization that is sensitive to query distribution. RRF sidesteps the problem entirely by using only rank positions, not scores. The formula assigns each document a score based on where it appeared in each list:

score(d) = Σ (weight_i / (k + rank_i))

Where k is a constant (60 in the implementation, following the original Cormack et al. paper7), rank_i is the document’s rank in result list i, and weight_i is an optional per-list multiplier (default 1.0 for both).

Here is a worked example with real ranks. Consider a query: “how does the review aggregator handle disagreements.” Five chunks surface in the combined results:

Chunk | BM25 Rank | Vec Rank | BM25 RRF | Vec RRF | Fused Score
----- | --------- | -------- | -------- | ------- | -----------
review-aggregator.py “Disagreement Resolution” | 3 | 1 | 1/63 = 0.0159 | 1/61 = 0.0164 | 0.0323
deliberation-config.json “Review Weights” | 1 | 8 | 1/61 = 0.0164 | 1/68 = 0.0147 | 0.0311
code-review MOC “Multi-Agent Review” | 7 | 2 | 1/67 = 0.0149 | 1/62 = 0.0161 | 0.0310
jiro-artisan.sh “Review State Machine” | 2 | 12 | 1/62 = 0.0161 | 1/72 = 0.0139 | 0.0300
quality-loop.md “Evidence Gate” | - | 3 | 0 | 1/63 = 0.0159 | 0.0159

The first chunk wins because it ranks well in both lists. BM25 matched “review,” “aggregator,” and “disagreements” in the text. Vector search matched the semantic concept of conflict resolution in code review. The second chunk ranked first in BM25 (exact keyword match on “review” in the config file) but eighth in vector search (the config JSON is semantically sparse). RRF pulled it down appropriately. The last chunk appeared only in vector results, so it received an RRF score from one source only.

# retriever.py: RRF fusion core
RRF_K = 60

def _rrf_fuse(self, bm25_results, vec_results,
              bm25_weight=1.0, vec_weight=1.0):
    scores = {}
    for rank, r in enumerate(bm25_results, start=1):
        cid = r["id"]
        if cid not in scores:
            scores[cid] = {"rrf_score": 0.0}  # plus chunk metadata fields
        scores[cid]["rrf_score"] += bm25_weight / (self._rrf_k + rank)
        scores[cid]["bm25_rank"] = rank

    for rank, r in enumerate(vec_results, start=1):
        cid = r["id"]
        if cid not in scores:
            scores[cid] = {"rrf_score": 0.0}  # plus chunk metadata fields
        scores[cid]["rrf_score"] += vec_weight / (self._rrf_k + rank)
        scores[cid]["vec_rank"] = rank

    ranked = sorted(scores.items(),
                    key=lambda kv: kv[1]["rrf_score"], reverse=True)
    return [SearchResult(chunk_id=cid, **data) for cid, data in ranked]

The default candidate pool is 30 results from each source before fusion, producing up to 60 candidates. The retriever returns the top 10 fused results. An optional max_tokens parameter truncates results to fit within a token budget, estimating at 4 characters per token.
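
The worked example above can be reproduced in a few lines; the short dictionary keys are stand-ins for the five chunks in the table:

```python
# Reproduce the worked RRF example with k=60 and equal weights.
K = 60
bm25_ranks = {"aggregator": 3, "config": 1, "moc": 7, "jiro": 2}
vec_ranks = {"aggregator": 1, "config": 8, "moc": 2, "jiro": 12, "quality": 3}

scores = {}
for ranks in (bm25_ranks, vec_ranks):
    for doc, rank in ranks.items():
        scores[doc] = scores.get(doc, 0.0) + 1.0 / (K + rank)

fused = sorted(scores, key=scores.get, reverse=True)
# fused[0] is "aggregator": strong in both lists beats first-in-one-list.
```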


Indexing: Full and Incremental

The indexer supports two modes. Full reindex clears the database and rebuilds from scratch. Incremental indexing compares file modification times (mtime_ns) against stored timestamps and only re-processes changed files.1

# index_vault.py: incremental detection
stale = index.get_stale_files(vault_mtimes)  # mtime changed or new
deleted = index.get_deleted_files(vault_paths)  # no longer in vault
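
The two calls above can be sketched as pure functions over {path: mtime_ns} maps; the signatures and in-memory representation are assumptions, since the real index reads stored mtimes from the chunks table:

```python
def get_stale_files(stored_mtimes, vault_mtimes):
    # Files whose mtime changed since the last index, plus files
    # that were never indexed at all.
    return [p for p, m in vault_mtimes.items() if stored_mtimes.get(p) != m]

def get_deleted_files(stored_mtimes, vault_paths):
    # Files indexed previously that no longer exist in the vault.
    return [p for p in stored_mtimes if p not in vault_paths]
```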

Embedding runs in batches of 64 texts to amortize Model2Vec’s overhead.8 A progress counter prints every 500 files during full reindex. A SIGINT handler enables graceful shutdown, finishing the current file before stopping.

The configuration file uses an allowlist model to control folder indexing. The vault has 22 allowed folders and 5 permanently excluded folders (personal health notes, career documents, Obsidian internal directories).20 The indexer processes only files within allowed folders and skips everything else.

One critical design choice: the indexer runs a credential filter on every chunk before storing it. Personal notes contain API keys, bearer tokens, database connection strings, and private keys pasted during debugging sessions. The credential filter matches 21 vendor-specific patterns (OpenAI keys, GitHub PATs, AWS access keys, Stripe tokens, and 17 others) plus 11 generic detectors for database URLs, JWTs, bearer tokens, password assignments, and high-entropy base64 strings.20 The filter replaces matched content with [REDACTED:pattern-name] tokens and logs which patterns fired but never logs the secret itself.

# chunker.py: credential filtering before storage
cleaned_text, scan_result = clean_content(sub_text)
if not scan_result.is_clean:
    logger.info("Scrubbed %d credential(s) from %s [%s]",
                scan_result.match_count, file_path, sub_heading)

Indexing personal notes without credential filtering would create a searchable database of secrets. The filter runs before embedding, so the vector representations never encode credential patterns. A query for “API key” returns notes that discuss API key management, not notes that contain actual keys.
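
A minimal sketch of such a filter with two illustrative detectors; the pattern names and regexes here are assumptions, not the production set of 21 vendor-specific and 11 generic patterns:

```python
import re

# Illustrative detectors only; the real filter's patterns are not
# published in the post.
PATTERNS = {
    "openai-key": re.compile(r"sk-[A-Za-z0-9]{20,}"),
    "bearer-token": re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._-]{20,}"),
}

def clean_content(text):
    match_count = 0
    for name, pattern in PATTERNS.items():
        # Replace the secret with a named redaction token; never keep
        # or log the matched value itself.
        text, n = pattern.subn(f"[REDACTED:{name}]", text)
        match_count += n
    return text, match_count
```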


What Goes Wrong: Honest Failure Modes

After hundreds of queries against the production index, four failure patterns are clear.

Keyword-dense shallow content outranks deep content. A short note tagged security, authentication, oauth with a three-sentence summary scores higher in BM25 than a 2,000-word deep dive on OAuth implementation that uses the terminology once in the introduction and then switches to specific protocol details. BM25 rewards term frequency relative to document length, a property that Robertson and Zaragoza documented as the “term frequency saturation” component of the algorithm.5,14 The shallow note has higher keyword density. RRF partially corrects for the issue because vector search ranks the deep content higher (the embedding captures the semantic depth), but the shallow note still appears in the fused results when it probably should not.

Structured data indexes poorly. JSON configuration files, YAML frontmatter blocks, and code snippets with variable names produce low-quality BM25 matches. A search for “review configuration” matches every JSON file with a review key. Vector search handles structured data slightly better because the embedding captures the key-value relationships, but structured content is fundamentally harder to chunk than prose. Flattening JSON to key-path: value pairs before embedding would improve retrieval quality for configuration-heavy notes.

Chunk boundary splits context. The chunker splits a paragraph that spans the boundary between two H2 sections into two chunks. Each chunk contains half the explanation. Neither chunk embeds well because the embedding lacks the full context. The chunker mitigates the issue with heading context (carrying the parent heading into the metadata), but the body text still loses continuity at the boundary. Overlapping windows would help but increase the chunk count and database size.

Temporal relevance is invisible. The retriever has no notion of recency. A note from 14 months ago about an early architecture decision ranks equally with a note from yesterday about the current implementation. For a knowledge base that evolves, newer notes often supersede older ones. The retriever does not know that.


What Comes Next: The Expansion Roadmap

Five additions would address the failure modes and extend the system’s capabilities.

Learning-to-rank re-ranking layer. After RRF fusion, a lightweight re-ranker could adjust scores based on metadata signals: note recency, tag relevance to the query domain, link density (highly-linked notes are often more authoritative). The re-ranker would run on the fused top-30 results, not the full corpus, keeping latency under the 23ms baseline.

Query intent classification. Different queries need different retrieval strategies. An exact identifier lookup (_rrf_fuse) should weight BM25 heavily. A conceptual question (“how does review handle disagreements”) should weight vector search. A lightweight classifier that adjusts bm25_weight and vec_weight per query would improve precision without changing the fusion architecture.

Temporal decay. Weight recent notes slightly higher for queries about current state. A decay function applied post-fusion would reduce the score of chunks from files last modified more than N months ago. The mtime_ns timestamp already exists in the schema; the decay needs only a weighting function in the retriever.
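
One possible decay function, assuming an exponential half-life; the roadmap does not specify a curve, so both the shape and the default half-life here are assumptions:

```python
import time

def decay_weight(mtime_ns, now_s=None, half_life_months=6.0):
    # Hypothetical post-fusion decay: halve a chunk's contribution
    # every half_life_months, using the mtime_ns already in the schema.
    now_s = time.time() if now_s is None else now_s
    age_months = max(now_s - mtime_ns / 1e9, 0.0) / (30 * 24 * 3600)
    return 0.5 ** (age_months / half_life_months)
```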

Evaluation harness with golden queries. The system currently has no automated quality measurement. A set of 50-100 curated query-answer pairs would enable retrieval quality regression testing: run the test suite after any change to chunking, embedding, or fusion parameters and verify that recall@10 does not degrade. The BEIR benchmark demonstrated that retrieval systems can vary by 20+ points in nDCG@10 across different query distributions, making domain-specific evaluation essential.19 Without a golden set, improvements are anecdotal.

Cross-note relationship indexing. Obsidian wiki-links ([[note-name]]) encode explicit relationships between notes. The current system ignores link structure entirely. Indexing link targets as metadata would let the retriever boost chunks from notes that many other high-scoring notes link to, similar to PageRank for the vault.
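
Extracting link targets is the small first step toward that. A sketch that captures the target note name while dropping aliases and heading anchors (the regex is an assumption about vault conventions, covering [[note]], [[note|alias]], and [[note#section]] forms):

```python
import re

# Capture up to the first ']', '|', or '#' inside [[...]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def link_targets(markdown):
    return [target.strip() for target in WIKILINK.findall(markdown)]
```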

The embedding space topology analysis I ran on the full vault reveals where these improvements would have the most impact. Dense clusters (AI tooling, security) already retrieve well because terminology is consistent. Sparse bridge regions between clusters are where the retriever struggles most, and where relationship indexing and intent classification would provide the largest gains.


FAQ

Why SQLite instead of a dedicated vector database?

The entire retrieval stack runs in one file with zero external dependencies. SQLite’s WAL mode handles concurrent reads from multiple Claude Code sessions. The sqlite-vec extension adds vector KNN search without requiring a separate Pinecone, Weaviate, or Qdrant instance.4 At 49,746 chunks, query latency is 23ms.1 A dedicated vector database would add operational complexity (hosting, backups, authentication) for a single-user knowledge base that fits in 83 MB.

Why Model2Vec instead of OpenAI embeddings or a larger model?

Three reasons: latency, privacy, and cost. Model2Vec runs locally at CPU speed with no network call.3 Personal notes never leave the machine. API-based embeddings would cost roughly $0.30 per full reindex for the current vault size,11 negligible in isolation, but the round-trip latency across 49,746 chunks and the privacy exposure of personal content are the actual costs.

What is Reciprocal Rank Fusion and when should you use it?

RRF requires no training data, no score calibration, and no hyperparameter tuning beyond the constant k.7 A learned fusion model would require labeled relevance judgments for training, which do not exist for a personal knowledge base. RRF is the fusion method with the lowest barrier to producing useful results. Use RRF when combining ranked lists from retrieval methods that produce incompatible score types.

How does a local retriever connect to Claude Code?

A PreToolUse hook calls the retriever’s search() method with the current prompt, formats the top results as a context block with file paths and section headings, and injects the block into the conversation. The agent sees focused chunks, not raw files. A max_tokens parameter ensures the injected context fits within a budget.

How do you prevent secrets from being indexed in a retrieval system?

Run a credential filter on every chunk before storage. The filter in this system matches 21 vendor-specific patterns and 11 generic detectors for JWTs, bearer tokens, and private keys.20 It replaces matched content with [REDACTED:pattern-name] tokens and runs before embedding, so vector representations never encode credential patterns.


References


  1. Author’s production data. 49,746 chunks, 16,894 files, 83.56 MB SQLite database, 7,771 signals processed across 14 months. Query latency (23ms) measured via time.perf_counter() in retriever.py, wrapping the full search path: BM25 lookup, query embedding via Model2Vec, vector KNN search, and RRF fusion. grep -rl measured at 11-66 seconds depending on term frequency (Apple M3 Pro, APFS). Full reindex measured at ~4 minutes on Apple M3 Pro. Incremental measured at <10 seconds for typical daily changes. FTS5-only search became unusable for the author above ~3,000 files due to keyword collision rates. 

  2. HN thread: “Stop Burning Your Context Window”. Comments from danw1979 and tclancy requesting a detailed write-up. 

  3. Model2Vec: Distill a Small Fast Model from any Sentence Transformer. Minish Lab, 2024. The potion-base-8M model uses static word embeddings distilled from a sentence transformer, producing 256-dimensional vectors without running attention layers. 

  4. sqlite-vec: A vector search SQLite extension. Alex Garcia, 2024. Provides vec0 virtual tables for KNN vector search within SQLite, using the same query interface as standard tables. 

  5. SQLite FTS5 Extension. SQLite documentation. FTS5 provides full-text search with BM25 ranking, content-sync tables, and configurable column weights via the bm25() auxiliary function. 

  6. Reimers, N. and Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP, 2019. Foundational work on dense semantic similarity for text retrieval, establishing the vector search approach used in hybrid retrieval systems. 

  7. Cormack, G.V., Clarke, C.L.A., and Buettcher, S. Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods. SIGIR, 2009. Introduces RRF with k=60 as a parameter-free method for combining ranked lists that outperforms trained fusion models. 

  8. Author’s implementation. chunker.py splits at H2 boundaries in the _split_at_headings function, with fallback to H3 then paragraph splitting for sections exceeding 2,000 characters. MIN_CHUNK_CHARS=30, MAX_CHUNK_CHARS=2000. index_vault.py embeds in batches of 64 (BATCH_SIZE=64). 

  9. van Dongen, T. et al. Model2Vec: Turn any Sentence Transformer into a Small Fast Model. arXiv, 2025. Describes the distillation approach producing static embeddings from sentence transformers with 50-500x inference speedup. 

  10. Author’s measurement. 256-dim vectors at 49,746 chunks produce 83 MB SQLite. Extrapolating to 768-dim vectors: ~215 MB. To 1024-dim: ~280 MB. Marginal quality improvement on short markdown chunks (avg 200-400 words) does not justify the storage and latency increase. 

  11. OpenAI Embeddings Pricing. text-embedding-3-small: $0.02 per million tokens. Estimated vault cost per full reindex: ~$0.30 based on average chunk length of ~200 tokens. 

  12. SQLite Write-Ahead Logging. SQLite documentation. WAL mode allows concurrent readers with a single writer, suitable for the retriever’s read-heavy access pattern. 

  13. Author’s query trace. Ran “PostToolUse hook for context compression” against BM25-only, vector-only, and hybrid modes. Results captured from retriever.py with method field tracking which search path produced each result. 

  14. Robertson, S. and Zaragoza, H. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 2009. Survey of the BM25 family of ranking functions and their theoretical foundations. 

  15. Karpukhin, V. et al. Dense Passage Retrieval for Open-Domain Question Answering. EMNLP, 2020. Demonstrated that learned dense representations outperform BM25 by 9-19% on open-domain QA benchmarks, establishing dense retrieval as a complement to lexical search. 

  16. Luan, Y. et al. Sparse, Dense, and Attentional Representations for Text Retrieval. TACL, 2021. Analysis of hybrid sparse-dense retrieval on MS MARCO, showing consistent improvements over single-modality approaches. 

  17. MTEB: Massive Text Embedding Benchmark. Muennighoff, N. et al., 2023. potion-base-8M scores 50.03 average MTEB vs 56.09 for all-MiniLM-L6-v2 (89.2% retention). Per-task breakdown: Classification 64.44, Clustering 32.93, Retrieval 31.71, STS 73.24. Source: Model2Vec results. 

  18. Gao, Y. et al. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv, 2024. Survey of RAG architectures including analysis of chunking strategies and their impact on retrieval quality. 

  19. Thakur, N. et al. BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS, 2021. Demonstrates high variance in retrieval performance across domains, underscoring the need for domain-specific evaluation. 

  20. Author’s configuration and credential filter implementation. memory-config.json defines 22 allowed_folders and 5 excluded_always entries. credential_filter.py defines 21 vendor-specific CREDENTIAL_PATTERNS (OpenAI through Turnstile) plus 9 generic single-line patterns (DB URLs, bearer tokens, JWTs, passwords, secrets, API keys, auth tokens, base64 secrets) and 2 multiline patterns (RSA/SSH private keys, PGP keys). Total: 32 patterns. 
