
Topology of a Second Brain: What 15,000 Signals Look Like in Embedding Space

15,800 notes. 49,746 chunks. Each chunk a 256-dimensional vector. I ran UMAP on the full dataset, projected it to three dimensions, and rotated the result slowly on screen. My second brain had a shape, and the shape revealed something the notes themselves never told me: my intellectual work clusters around three dense hubs (Claude Code, design systems, AI research) connected by thin bridges of intersection notes, surrounded by a sparse halo of orphaned signals that connect to nothing.

The shape of your knowledge tells you where you think, where you avoid thinking, and where your ideas have room to collide. The same context architecture that structures agent behavior structures human knowledge.

TL;DR

Embedding spaces give knowledge bases a spatial structure that reveals intellectual topology. Kat (@poetengineer__) demonstrated three topologies for Obsidian vaults: centralized (one core idea connecting everything), decentralized (clustered topic hubs), and distributed (edges between ideas labeled by semantic relationships).1 My 15,800-file vault with 49,746 chunks exhibits a decentralized topology with three dominant clusters. Pesce et al.’s work on phase transitions in neural network pruning provides a mathematical framework for understanding when simplification (curation, archiving, filtering) crosses a threshold that breaks the knowledge structure’s function.2 Below: what embeddings capture, three knowledge topologies with real vault data, how to diagnose your own topology, and an interactive explorer built from my actual vault.


What Embeddings Actually Capture

A text embedding converts a passage of text into a list of numbers. The tokenization visualizer post covered how text becomes tokens. Embeddings go further: tokens become coordinates in a high-dimensional space where distance corresponds to meaning.

Two passages about “Claude Code hooks for context injection” sit close together in embedding space. A passage about “Claude Code hooks” and a passage about “iOS SwiftUI navigation” sit far apart. The distance is not keyword overlap. Two passages can share zero words and still land near each other if they discuss the same concepts. Two passages can share many words (“the system processes the data”) and land far apart if the surrounding context differs.
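Closeness here is usually measured with cosine similarity between the vectors. A toy sketch of the idea — the four-dimensional vectors below are illustrative stand-ins, not real embeddings:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative 4-dimensional "embeddings" (real ones have 256 dimensions)
hooks_a = np.array([0.9, 0.8, 0.1, 0.0])  # "Claude Code hooks for context injection"
hooks_b = np.array([0.8, 0.9, 0.2, 0.1])  # another passage about hooks
swiftui = np.array([0.1, 0.0, 0.9, 0.8])  # "iOS SwiftUI navigation"

print(cosine_similarity(hooks_a, hooks_b))  # high: same semantic neighborhood
print(cosine_similarity(hooks_a, swiftui))  # low: far apart in embedding space
```

The same function applied to real embedding vectors is what "distance" means everywhere in this post.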

My vault uses Model2Vec’s potion-base-8M model: 7.6 million parameters producing 256-dimensional embeddings.3 The model distills from a larger sentence transformer (bge-base-en-v1.5) and achieves roughly 90% of all-MiniLM-L6-v2’s performance while running as a static model – orders of magnitude faster on both CPU and GPU. Each of the 49,746 chunks in my vault becomes a point in 256-dimensional space.

256 dimensions are impossible to visualize directly. Dimensionality reduction techniques like UMAP project the high-dimensional structure down to 2D or 3D while preserving local neighborhoods.4 Points that were close in 256 dimensions tend to stay close in three. The global structure is approximate, but the clusters are real.
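A minimal projection sketch, using scikit-learn's PCA as a stand-in (the interactive explorer in this post uses PCA per the references; umap-learn's UMAP class slots into the same place when neighborhood preservation matters):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Stand-in for real chunk embeddings: 500 points in 256 dimensions
embeddings = rng.normal(size=(500, 256))

# Project to 3D for visualization. With umap-learn this line would be
# umap.UMAP(n_components=3, metric='cosine').fit_transform(embeddings)
points_3d = PCA(n_components=3).fit_transform(embeddings)
print(points_3d.shape)  # (500, 3)
```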


Three Topologies of Knowledge

Kat’s exploration of Obsidian note embeddings identified three distinct knowledge topologies.1 Each topology reflects a different intellectual structure, and each has different failure modes.

Centralized: One Core Idea Connecting All

In a centralized topology, most notes connect through a single dominant theme. The embedding space shows one dense cluster at the center with thin tendrils extending outward. A developer who writes exclusively about React would see this topology: React is the hub, and every note on testing, state management, deployment, and tooling connects through it.

Strength: Deep expertise in the central domain. Search works well because most queries land in the same neighborhood.

Failure mode: Brittleness. If the central theme becomes irrelevant (a career change, a technology sunset), the entire knowledge structure loses its organizing principle. Notes that only make sense in relation to the center become orphaned.

Decentralized: Clustered Topic Hubs

In a decentralized topology, notes form several distinct clusters connected by bridge notes. My vault exhibits this topology with three dominant hubs:

Cluster               Chunks     % of Total   Key Themes
AI & ML               ~13,100    26%          Claude Code, agent architecture, LLM research
Design                ~7,200     14%          UI systems, typography, color science, visual design
Development           ~5,100     10%          FastAPI, SwiftUI, web engineering, databases
Inbox (unprocessed)   ~13,700    28%          Raw signals, unsorted captures

The remaining 22% distributes across Inspiration, Productivity, Science, and smaller categories.

Strength: Resilience. Losing one cluster does not destroy the others. Interdisciplinary connections form at the boundaries between clusters, generating the most novel insights.

Failure mode: Fragmentation. If bridge notes between clusters are too thin, the clusters become intellectual silos. My vault has a thin bridge between Design and Claude Code (notes about designing agent UIs, prompt interface patterns) but almost no bridge between Design and raw Development (backend architecture notes rarely connect to visual design). The gap is a blind spot: I think about design and I think about backend engineering, but I rarely think about them together.

Distributed: Edges Labeled by Relationships

In a distributed topology, connections between notes carry semantic labels describing how ideas relate. Kat’s implementation used an LLM to generate edge labels between neighboring notes.1 Instead of anonymous proximity, each connection has a description: “contradicts,” “extends,” “provides evidence for,” “applies in different domain.”

Strength: Navigability. A distributed topology answers not just “what’s related?” but “how is it related?” The labeling enables higher-order reasoning: finding notes that contradict a thesis, not just notes that mention it.

Failure mode: Cost. Generating edge labels for every connection pair scales quadratically. For my vault’s 49,746 chunks, exhaustive edge labeling would require approximately 1.2 billion LLM calls. Practical implementations label only edges within a similarity threshold.
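The quadratic cost and the thresholded alternative can be sketched as follows; the synthetic embeddings and the radius query stand in for whatever vector index a real implementation would use:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Exhaustive edge labeling scales quadratically with chunk count
chunks = 49_746
print(chunks * (chunks - 1) // 2)  # 1,237,307,385 pairs, ~1.2 billion

# Thresholded alternative: only consider pairs above a similarity cutoff.
# Cosine similarity >= 0.7 means cosine distance <= 0.3.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 32))  # stand-in for real embeddings
n = len(embeddings)

nn = NearestNeighbors(metric='cosine', radius=0.3).fit(embeddings)
neighbor_lists = nn.radius_neighbors(embeddings, return_distance=False)
# Every point matches itself, and each undirected pair is counted twice
candidate_edges = (sum(len(lst) for lst in neighbor_lists) - n) // 2
print(candidate_edges)  # far fewer edges to label than the exhaustive count
```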


Phase Transitions: When Simplification Breaks Structure

Pesce, He, and Caldarelli studied phase transitions in neural network pruning and found a sharp threshold: networks exhibit “a transition from a cooperative, functional phase to a disordered phase with collapsed performance.”2 Below the threshold, removing connections barely affects function. At the threshold, function collapses abruptly. The transition follows scaling laws consistent with second-order critical behavior – the same mathematics that describes ice melting into water.

The parallel to knowledge curation is direct. My signal scoring pipeline reduced the Inbox from 14,771 notes to 5,886 through a relevance threshold. The reduction improved search quality: fewer low-relevance results, tighter clusters, faster retrieval. But was signal lost? Did the simplification cross a phase transition threshold?

The pruning research suggests the answer depends on connectivity, not quantity. Removing isolated nodes (notes with no semantic neighbors) has negligible impact on the network’s function. Removing bridge nodes (notes that connect otherwise-separate clusters) can collapse the structure even if the removed notes seem individually unimportant.

My triage pipeline raised the relevance threshold from 0.30 to 0.40. The 60% reduction in inbox size was measured by count. I did not measure the impact on topology. A phase-transition-aware curation strategy would:

  1. Identify bridge notes before filtering (notes with high betweenness centrality in the similarity graph)
  2. Exempt bridge notes from relevance filtering regardless of their individual scores
  3. Monitor cluster connectivity metrics after each curation pass
  4. Alert when a curation step reduces inter-cluster bridge density below a threshold
# Sketch: bridge note detection before curation
from sklearn.neighbors import NearestNeighbors

def identify_bridge_notes(embeddings, labels, n_neighbors=10):
    """Find notes that connect otherwise-separate clusters.

    embeddings: one vector per note
    labels: cluster label per note (e.g. from k-means)
    """
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1, metric='cosine')
    nn.fit(embeddings)
    distances, indices = nn.kneighbors(embeddings)

    # Bridge score: fraction of a note's neighbors that belong to
    # a different cluster than the note itself
    bridge_scores = []
    for i, neighbors in enumerate(indices):
        own_cluster = labels[i]
        others = neighbors[1:]  # drop the self-match
        cross_cluster = sum(1 for n in others if labels[n] != own_cluster)
        bridge_scores.append(cross_cluster / len(others))

    return bridge_scores
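Step 1 in the list above calls for betweenness centrality on the similarity graph. A sketch of that computation, assuming the networkx library and a k-nearest-neighbor graph (the graph construction details are illustrative, not the pipeline's actual code):

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors

def betweenness_scores(embeddings, k=10):
    """Build a k-NN similarity graph and score each note by betweenness
    centrality: how often it lies on shortest paths between other notes.
    Bridge notes between clusters score high."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric='cosine').fit(embeddings)
    _, indices = nn.kneighbors(embeddings)

    graph = nx.Graph()
    graph.add_nodes_from(range(len(embeddings)))
    for i, neighbors in enumerate(indices):
        for j in neighbors[1:]:  # skip the self-match
            graph.add_edge(i, int(j))

    return nx.betweenness_centrality(graph)

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(200, 16))  # stand-in for real embeddings
scores = betweenness_scores(embeddings)
top_bridges = sorted(scores, key=scores.get, reverse=True)[:10]
```

Exact betweenness is expensive on large graphs; for tens of thousands of chunks, networkx's sampled variant (the `k` parameter of `betweenness_centrality`) keeps it tractable.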

Diagnosing Your Knowledge Topology

You don’t need 15,000 notes to analyze your knowledge topology. Any collection of 100+ notes with embeddings will reveal structure. Three diagnostic questions:

1. How Many Clusters Exist?

Run k-means or DBSCAN on your embeddings and count distinct clusters. Fewer than 3 suggests centralized topology. Between 3 and 8 suggests decentralized. More than 8 may indicate either a genuinely distributed topology or insufficient curation (many clusters means many topics, which may mean no depth in any of them).
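A minimal cluster-count sketch with scikit-learn's DBSCAN on synthetic stand-in data. Note that `eps` is data-dependent: the value below fits these synthetic blobs, not real 256-dimensional embeddings.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in: 3 well-separated clusters in 16 dimensions
embeddings, _ = make_blobs(n_samples=300, n_features=16, centers=3,
                           cluster_std=0.5, random_state=7)

labels = DBSCAN(eps=4.0, min_samples=5).fit_predict(embeddings)
# DBSCAN labels noise points -1; exclude them from the cluster count
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)  # 3 separated blobs -> 3 clusters
```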

2. How Dense Are the Bridges?

For each pair of clusters, count the notes that have nearest neighbors in both clusters. A bridge density below 2% of the smaller cluster’s size indicates a potential silo. My Design-to-Development bridge is approximately 1.4% – below the threshold, confirming the blind spot I observed.
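One possible reading of that metric, under the assumption that a bridge note is one whose k nearest neighbors span both clusters (the neighbor count and counting convention are choices, not the only valid ones):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def bridge_density(embeddings, labels, cluster_a, cluster_b, k=10):
    """Share of the smaller cluster's size made up of notes whose
    k nearest neighbors include members of both clusters."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric='cosine').fit(embeddings)
    _, indices = nn.kneighbors(embeddings)

    bridges = 0
    for i, neighbors in enumerate(indices):
        if labels[i] not in (cluster_a, cluster_b):
            continue
        neighbor_labels = {labels[j] for j in neighbors[1:]}  # skip self
        if cluster_a in neighbor_labels and cluster_b in neighbor_labels:
            bridges += 1

    smaller = min(np.sum(labels == cluster_a), np.sum(labels == cluster_b))
    return bridges / smaller

# Two synthetic clusters pointing in orthogonal directions: no bridges
rng = np.random.default_rng(3)
cluster0 = rng.normal(scale=0.3, size=(100, 16)) + np.eye(16)[0] * 10
cluster1 = rng.normal(scale=0.3, size=(100, 16)) + np.eye(16)[1] * 10
embeddings = np.vstack([cluster0, cluster1])
labels = np.array([0] * 100 + [1] * 100)
density = bridge_density(embeddings, labels, 0, 1)
print(density)  # 0.0 for fully separated clusters
```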

3. What Percentage Is Orphaned?

An orphaned note has no neighbor within a cosine similarity threshold (typically 0.7). Orphaned notes are not necessarily bad – they may represent genuinely novel ideas. But an orphan rate above 15% suggests either inconsistent capture (notes that don’t match your knowledge domain) or embedding quality issues.

My vault orphan rate: approximately 8%. Most orphans are raw Inbox captures that haven’t been processed into structured notes. The orphan rate drops to 3% when excluding the Inbox, indicating that processed notes integrate well into the existing topology.
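An orphan scan can be sketched with the same nearest-neighbor machinery; since cosine distance is one minus cosine similarity, the 0.7 similarity cutoff becomes a 0.3 distance cutoff:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def orphan_rate(embeddings, min_similarity=0.7):
    """Fraction of notes with no neighbor above the similarity cutoff."""
    nn = NearestNeighbors(n_neighbors=2, metric='cosine').fit(embeddings)
    distances, _ = nn.kneighbors(embeddings)
    # Column 0 is the self-match (distance 0); column 1 is the
    # nearest real neighbor
    nearest = distances[:, 1]
    return float(np.mean(nearest > 1.0 - min_similarity))

rng = np.random.default_rng(5)
scattered = rng.normal(size=(50, 64))  # random directions: all orphans
clustered = rng.normal(size=(1, 64)) + rng.normal(scale=0.01, size=(50, 64))
print(orphan_rate(scattered))   # near-orthogonal vectors, everything orphaned
print(orphan_rate(clustered))   # tight cluster, nothing orphaned
```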


What the Clusters Reveal

The visualization above uses 500 randomly sampled chunks from my vault. The clusters map to real intellectual neighborhoods.

The AI & ML hub (26% of chunks) is the densest cluster. Claude Code architecture, agent design patterns, LLM research papers, and prompt engineering techniques form a tight neighborhood. The density reflects volume: I read and capture more AI/ML content than any other category. Density also creates a search quality advantage – queries in this domain return highly relevant results because the embedding space is well-populated.

The Design hub (14%) sits at a distance from AI & ML. Typography systems, color science, UI component patterns, and visual design references form their own cluster. The separation is appropriate: design and AI engineering use different vocabularies, different reasoning frameworks, and different evaluation criteria. But the separation also means that queries like “how should agent output be formatted for developer review” fall in the gap between the two clusters, returning results from one side or the other but rarely from the intersection.

The Development hub (10%) overlaps more with AI & ML than with Design. FastAPI patterns, database design, and SwiftUI architecture share conceptual vocabulary with the AI engineering notes (both discuss code, architecture, testing). The vocabulary overlap produces a blended zone where DevOps-for-agents and infrastructure-for-AI notes live.

The Inbox halo (28%) surrounds everything. Raw captures, unsorted signals, and unprocessed bookmarks form a sparse cloud with weak connections to established clusters. The signal scoring pipeline that reduced the Inbox from 14,771 to 5,886 notes eliminated mostly from this halo: notes with low similarity to any established cluster.

The Inspiration cluster (6%) occupies a position between Design and Inbox. Kinetic typography references, motion design studies, and visual art captures form a loose neighborhood. The cluster exists because I capture visual inspiration consistently but rarely process those captures into structured notes. The cluster reveals a pattern: I consume visual inspiration broadly but produce design work narrowly. The gap between consumption and production is visible in the topology as a cluster with high incoming density (captures) but low outgoing connections (notes that build on the inspiration).

Cross-cluster bridges are the most interesting features. The thinnest bridge connects Design and Development: approximately 1.4% of the smaller cluster’s notes have nearest neighbors in both clusters. Compare that to the AI-to-Development bridge at 8.3%, which reflects how much of my development work involves AI infrastructure. The bridge density predicts where novel work emerges. My boids-to-agents post came from a bridge note that connected emergent behavior research (AI & ML cluster) with flocking algorithm implementation (Development cluster). Without the bridge, those two bodies of notes would never have collided.

A second embedding database exists alongside the vault: the toolchain search DB with 4,518 chunks across 653 files.5 The toolchain topology is radically different: a single dense cluster (Claude Code configuration) with small satellite clusters for testing, hooks, and skills. The monoculture topology works for a toolchain because a toolchain has a single purpose. A knowledge vault with a monoculture topology would be a warning sign.


Reshaping Your Topology

The topology is not fixed. Four deliberate actions reshape knowledge structure.

Write bridge notes. If two clusters lack connections, write notes that explicitly connect concepts across them. My Design-to-AI bridge is thin because I rarely write about designing agent interfaces. A note titled “UX Patterns for Agent Output” that cites both design principles and agent architecture research would create a bridge point.

Detect orphans. Run a monthly orphan scan and decide: integrate, archive, or delete. Orphaned notes that represent nascent ideas should be connected through bridge notes to existing clusters. Orphaned notes that represent one-time references can be archived.

Monitor after curation. Before and after any bulk curation (deleting, archiving, filtering), measure cluster connectivity. If inter-cluster bridge density drops, the curation removed bridge notes that should be preserved.

Read at the boundaries. The most valuable reading targets are not deeper into your densest cluster. They are at the edges between clusters. A paper that bridges AI engineering and visual design will generate more novel connections than another paper deepening the already-dense AI cluster.


Key Takeaways

  • Embedding spaces give knowledge bases a shape. That shape reveals intellectual topology: where you concentrate attention, where you avoid it, and where ideas connect across domains.
  • Three topologies have different failure modes. Centralized is brittle. Decentralized fragments without bridge notes. Distributed is expensive to maintain but richest for navigation.
  • Phase transitions make curation non-linear. Removing notes below a threshold barely affects structure. At the threshold, function collapses. Bridge notes must be identified and protected before any bulk curation.
  • The Inbox halo is the curation frontier. Raw captures form a sparse cloud around established clusters. Signal scoring filters the halo, but the topology reveals whether filtering preserved or destroyed bridge connections.
  • Read at the boundaries. The highest-value notes connect clusters, not deepen them. Orphan detection and bridge density metrics guide reading priorities.

FAQ

What are text embeddings and how do they represent knowledge?

Text embeddings convert passages of text into lists of numbers (vectors) in a high-dimensional space where distance corresponds to semantic meaning. Two passages about similar topics sit close together regardless of whether they share words. A 256-dimensional embedding model like potion-base-8M converts each text chunk into 256 coordinates. When applied to an entire knowledge base, the collection of vectors forms a spatial structure where clusters, bridges, and gaps reveal the intellectual topology of the content.

How can I visualize my Obsidian vault's embedding space?

Generate embeddings for your notes using a sentence embedding model (Model2Vec's potion-base-8M is fast and free), then project the high-dimensional vectors to 2D or 3D using UMAP. Store the embeddings in a database (SQLite with the vec extension works well), run UMAP projection, and visualize with any 3D plotting library. The resulting point cloud reveals your vault's cluster structure: dense regions where you write frequently, sparse gaps between topics, and bridge zones where different domains intersect.

What is a phase transition in knowledge curation?

A phase transition in knowledge curation is a threshold where removing notes causes the knowledge structure to collapse abruptly rather than degrade gradually. Research on neural network pruning shows that networks maintain function as connections are removed, until a sharp threshold where performance collapses. The same dynamic applies to knowledge bases: removing isolated, low-value notes has minimal impact, but removing bridge notes that connect clusters can fragment the topology even if those notes seem individually unimportant. Phase-transition-aware curation identifies and protects bridge notes before filtering.

How many notes do I need for meaningful topology analysis?

Meaningful cluster structure emerges from approximately 100 notes with embeddings. Fewer than 100 notes may not form distinct clusters. Between 100 and 500 notes reveals basic topology (2-4 clusters). Between 500 and 5,000 notes reveals nuanced structure with bridge zones and orphan patterns. Above 5,000 notes, the topology stabilizes and additional notes deepen existing clusters more than they create new ones. The key metric is not total count but cluster diversity: do your notes span at least three distinct topic areas?


References


  1. Kat (@poetengineer__), “Exploring shapes of thoughts: extracted my Obsidian notes’ embeddings and arranged them as a 3D network using 3 different topologies,” posted on X, February 2026. Three topologies: centralized, decentralized, distributed with LLM-labeled edges. 

  2. Pesce, Diego, Yang-Hui He, and Guido Caldarelli, “Phase Transitions in Neural Networks Pruning,” arXiv:2602.15224, February 2026. arxiv.org. Sharp transition from cooperative/functional phase to disordered phase, scaling laws consistent with second-order critical behavior. 

  3. MinishLab, “Model2Vec: Fast State-of-the-Art Static Embeddings,” 2024. github.com/MinishLab/model2vec. potion-base-8M: 7.6M parameters, 256-dim embeddings, ~90% of all-MiniLM-L6-v2 performance. 

  4. McInnes, Leland, John Healy, and James Melville, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” arXiv:1802.03426, 2018. arxiv.org. Preserves global structure better than t-SNE with superior runtime performance. 

  5. Author’s semantic memory system. Model2Vec + sqlite-vec + FTS5 BM25 + RRF hybrid search across 49,746 chunks. Modules: embedder.py, vector_index.py, chunker.py, retriever.py in ~/.claude/lib/memory/

  6. Author’s signal scoring pipeline. Reduced Inbox from 14,771 to 5,886 notes (60% reduction) through relevance threshold tuning. Documented in The Signal Scoring Pipeline

  7. Author’s vault topology analysis. 500-point random sample from 49,746 chunks, topic classification by vault directory structure, PCA projection to 3D for interactive visualization. 
