Signal Scoring Pipeline: Deterministic Knowledge Triage

Most knowledge management is vibes. You save a note because it “feels important.” Six months later you have 7,000 notes and no idea which ones matter. I built a deterministic scoring pipeline that tells me.

The system is 733 lines of Python. It scores every incoming signal across four weighted dimensions, computes a composite score, and routes the signal to one of 12 domain folders, an inbox for manual review, or the void. No tagging by hand. No “review later” that never happens. The algorithm decides.

TL;DR

A weighted composite score (relevance 35%, actionability 30%, depth 20%, authority 15%) produces a 0.0-1.0 rating for each signal. Routing uses two thresholds to create three bins: >= 0.55 auto-writes to a domain folder, 0.30-0.55 queues for manual review, < 0.30 silently skips. Over 7,700 notes processed across 14 months, with Development (2,837) and Design (1,709) dominating the distribution. Most interesting flaw: the depth dimension measures metadata richness, not content quality, so a well-tagged tweet about a flower photo scores 0.85 on depth.


The Problem With Vibes

I use Obsidian as a knowledge base. Signals arrive from RSS feeds, Twitter bookmarks, GitHub stars, newsletters, and manual capture. Before the pipeline, every signal went to a single inbox folder, and within two months that inbox held 400+ unprocessed notes.

Standard advice (“review weekly, tag as you go, use a folder system”) assumes the review happens. It doesn’t. Inboxes become write-only: items enter but never leave. Knowledge you captured is functionally identical to knowledge you never captured. Clay Shirky framed the problem precisely: “It’s not information overload. It’s filter failure.”8

I needed a system that could evaluate 7,700+ notes faster than I can read them, using criteria I define once and apply uniformly. Not a recommendation engine. A scoring algorithm.


The Composite Score

The scoring formula is a weighted linear combination of four dimensions, a standard approach in multi-criteria decision analysis (MCDA):9

composite = (
    relevance     * 0.35 +
    actionability * 0.30 +
    depth         * 0.20 +
    authority     * 0.15
)

Each dimension produces a float between 0.0 and 1.0. The formula rounds the composite to three decimal places. Weights reflect a deliberate priority order: what matters to me (relevance) outweighs what I can use (actionability), which outweighs how rich the metadata is (depth), which outweighs how trustworthy the source is (authority).1

I considered and rejected two additional dimensions. Recency (newer signals score higher) was dropped because knowledge management is not a news feed — a 2019 essay on API design is no less valuable than a 2026 one. Social proof (likes, retweets, stars) was dropped because popularity correlates with appeal, not with utility to my specific workflow. Four dimensions are enough to route signals accurately. Adding more would increase tuning complexity without meaningfully improving routing quality.


The Four Dimensions

Relevance (35%): Interest Matching

Relevance answers a single question: does this signal relate to something I care about? A hand-curated keyword dictionary with 43 entries maps each interest to a score from 0.15 (nft — barely relevant but I want to see what’s happening) to 1.0 (claude code, swiftui — core to my daily work). The scoring blends the best match with the average of all matches:

# 60% best match, 40% average of all matches
return min(1.0, best_score * 0.6 + avg_score * 0.4)

Zero-match items get a baseline of 0.25, not 0.0. The system penalizes an unknown topic less harshly than an irrelevant one. The baseline is the most frequently tuned parameter: too high and irrelevant content floods the inbox, too low and genuinely new interests get filtered out before I see them.2
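
For clarity, here is a sketch of the full relevance scorer implied by the snippets above; the keyword values shown are illustrative, not the actual 43-entry dictionary:

# Illustrative excerpt of the keyword dictionary; the real one has 43 entries
INTEREST_KEYWORDS = {
    "claude code": 1.00,
    "swiftui":     1.00,
    "fastapi":     0.85,
    "design":      0.60,
    "nft":         0.15,
}

def score_relevance(text: str, keywords: dict[str, float] = INTEREST_KEYWORDS) -> float:
    """Blend the best keyword match with the average of all matches."""
    matches = [weight for kw, weight in keywords.items() if kw in text.lower()]
    if not matches:
        return 0.25  # baseline for unknown topics, deliberately above 0.0
    best_score = max(matches)
    avg_score = sum(matches) / len(matches)
    # 60% best match, 40% average of all matches
    return min(1.0, best_score * 0.6 + avg_score * 0.4)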

Actionability (30%): Learning Potential

Actionability answers: can I do something with this? A theoretical essay about SwiftUI’s architecture is relevant but not actionable. A tutorial that walks through building a SwiftUI component is both. The dimension matches against 22 action-oriented keywords, including tutorial, guide, how-to, build, and github.com. URLs get special treatment:

if "github.com" in url:
    hits += 2  # Repositories are inherently actionable
if "/docs" in url or "/tutorial" in url:
    hits += 1

Scoring is step-function based, not linear: 0 hits → 0.10, 1 hit → 0.40, 2 hits → 0.60, 3+ hits → min(1.0, 0.70 + hits * 0.05). The step function rewards the presence of actionability signals more than their quantity. One tutorial link is worth more than the difference between three and four keywords.3
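
Putting the keyword matching, the URL bonuses, and the step function together, the actionability scorer looks roughly like this (the keyword set is abbreviated; the production list has 22 entries):

ACTION_KEYWORDS = {"tutorial", "guide", "how-to", "build", "example"}  # abbreviated; 22 in production

def score_actionability(text: str, url: str) -> float:
    """Step-function scoring: the presence of action signals matters more than their count."""
    hits = sum(1 for kw in ACTION_KEYWORDS if kw in text.lower())
    if "github.com" in url:
        hits += 2  # Repositories are inherently actionable
    if "/docs" in url or "/tutorial" in url:
        hits += 1
    if hits == 0:
        return 0.10
    if hits == 1:
        return 0.40
    if hits == 2:
        return 0.60
    return min(1.0, 0.70 + hits * 0.05)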

Depth (20%): Metadata Richness

Depth is purely structural. It measures the presence and length of fields, not content quality:

Signal                     Score
Has title                  +0.20
Has description            +0.20
Description > 50 chars     +0.15
Description > 150 chars    +0.15
Has any tags               +0.15
Has 3+ tags                +0.10
Has URL                    +0.05
Maximum                     1.00

Depth is the dimension I trust least. A tweet about a flower photo with a full description and four tags scores depth 0.85. Rich metadata, irrelevant content. Depth proxies for “the source provided structured data,” which correlates with but does not guarantee quality.4

Authority (15%): Source Credibility

Authority starts at a baseline of 0.40 and adjusts by source type:

if source in ("twitter", "x"):        score = 0.50
elif source in ("blog", "newsletter"): score = 0.60
elif source in ("github", "docs"):     score = 0.70

A domain allowlist overrides upward (never downward): github.com, anthropic.com, apple.com, arxiv.org, docs.python.org, and others set authority to at least 0.75. The allowlist encodes a judgment that these sources earn higher trust by default.
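
Combined with the allowlist override, a sketch of the authority scorer (the allowlist shown contains only the domains named above, not the full production list):

# Domains named above; the production allowlist is longer
TRUSTED_DOMAINS = {"github.com", "anthropic.com", "apple.com", "arxiv.org", "docs.python.org"}

def score_authority(source: str, url: str) -> float:
    """Source-type baseline plus an allowlist that only ever adjusts upward."""
    score = 0.40
    if source in ("twitter", "x"):
        score = 0.50
    elif source in ("blog", "newsletter"):
        score = 0.60
    elif source in ("github", "docs"):
        score = 0.70
    if any(domain in url for domain in TRUSTED_DOMAINS):
        score = max(score, 0.75)  # override upward, never downward
    return score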


Threshold Routing

Three routing bins determine what happens to each scored signal:

THRESHOLD_AUTO_WRITE = 0.55   # → domain folder
THRESHOLD_INBOX      = 0.30   # → 00-Inbox (manual review)
# Below 0.30 → silently skipped

The pipeline writes signals scoring >= 0.55 directly to one of 12 domain folders, inferred from tag and title matching. Mid-range signals (0.30-0.55) go to an inbox for manual review. Anything below 0.30 never reaches the vault. The 0.55 threshold emerged from empirical tuning: at 0.50, too many marginally relevant signals auto-routed to domain folders and diluted their usefulness. At 0.60, the inbox grew too large as borderline-useful signals missed auto-routing. The 0.30 floor ensures that any well-structured signal from a known source type reaches at least the inbox for human review.

The 0.30-0.55 range is the “ambiguous zone” where the system is least confident. An optional --llm-triage flag sends these signals to Claude for evaluation, which can adjust the composite score by up to ±0.20, potentially shifting a signal across the auto-write threshold. Claude only sees ambiguous signals, never high-scoring or low-scoring ones. Spending API tokens on signals the deterministic scorer already handled would be a waste.5
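
The gating and clamping around the LLM call is only a few lines, using the threshold constants defined above. In this sketch, llm_score_adjustment is a hypothetical stand-in for the Claude Code CLI invocation described in footnote 5; the prompt and response parsing are omitted:

def maybe_llm_triage(signal: dict, composite: float, use_llm: bool) -> float:
    """Send only the ambiguous zone (0.30-0.55) to the LLM; clamp its adjustment to +/-0.20."""
    if not use_llm or not (THRESHOLD_INBOX <= composite < THRESHOLD_AUTO_WRITE):
        return composite
    adjustment = llm_score_adjustment(signal)       # hypothetical wrapper around the Claude call
    adjustment = max(-0.20, min(0.20, adjustment))  # never trust the model beyond the clamp
    return round(min(1.0, max(0.0, composite + adjustment)), 3)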

Domain inference uses a voting system. Each tag maps to a domain, and each keyword in the title adds a vote. The domain with the most votes wins. Ties break by dict insertion order: deterministic, but semantically arbitrary. Default fallback: “Inspiration.”
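
The voting logic fits in a short sketch; the tag-to-domain mapping shown is illustrative, not the production table:

from collections import Counter

# Illustrative mapping; the production table covers all 12 domains
TAG_TO_DOMAIN = {
    "swiftui": "Development",
    "fastapi": "Development",
    "design": "Design",
    "claude code": "Claude-Code",
}

def infer_domain(signal: dict) -> str:
    """Each mapped tag or title keyword casts one vote; most votes wins, fallback is Inspiration."""
    votes = Counter()
    for tag in signal.get("tags", []):
        domain = TAG_TO_DOMAIN.get(tag.lower())
        if domain:
            votes[domain] += 1
    title = signal.get("title", "").lower()
    for keyword, domain in TAG_TO_DOMAIN.items():
        if keyword in title:
            votes[domain] += 1
    if not votes:
        return "Inspiration"  # default fallback
    return votes.most_common(1)[0][0]  # ties keep the first domain encountered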


The Results

After 7,700+ notes processed over 14 months:

Domain                       Notes   % of Total
Development                  2,837        36.8%
Design                       1,709        22.2%
Inspiration                    565         7.3%
Claude-Code                    414         5.4%
AI-Tools                       414         5.4%
Productivity                   346         4.5%
Ideas                          296         3.8%
Science                        231         3.0%
Health-Life                    191         2.5%
Architecture                   142         1.8%
Startups                        26         0.3%
Tools                           22         0.3%
Inbox                          420         5.5%
Silently skipped (< 0.30)      ~87        ~1.1%

The table rows sum to 7,613 routed signals. The remaining ~87 items (scoring below the 0.30 inbox threshold) were silently skipped, bringing the total processed to 7,700+. The distribution reflects reality. I consume more development and design content than anything else. Inbox items (420) represent the ambiguous zone, signals the algorithm could not confidently auto-route.6


What the Algorithm Got Wrong

The Depth Trap

A nemophila photo tweet scored composite 0.36, relevance 0.25, actionability 0.10, depth 0.85, authority 0.50. It routed to the inbox because depth (0.85) and authority (0.50) compensated for near-zero relevance and actionability. Rich metadata, irrelevant content: a pretty photo of flowers.

The example illustrates the fundamental limitation of metadata-proxy scoring. Depth measures “the source provided structured data,” not “the content is valuable.” Twitter provides full descriptions and tags for every tweet. A well-tagged tweet about breakfast scores depth 0.85. Information retrieval research calls the underlying tension the precision/recall trade-off: optimizing for recall (catching everything relevant) inevitably admits false positives.10

The fix I considered and rejected: Reducing the depth weight from 0.20 to 0.10 would reduce false positives from well-tagged irrelevant content but would also penalize genuinely deep content from sources with sparse metadata. The current weight is a compromise. A more sophisticated approach would replace the structural depth check with a lightweight text quality heuristic — for example, counting unique nouns or measuring vocabulary diversity with a type-token ratio (sketched below). I have not implemented this because the current system auto-routes roughly 93% of signals, and the ~5.5% that land in the inbox for manual review are a manageable load. The depth dimension is honest about its limitation: it measures structure, not substance.
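
For reference, the kind of heuristic I mean is only a few lines: a type-token ratio over the description text. This is a sketch of the rejected-for-now idea, not code that ships in the pipeline:

import re

def type_token_ratio(text: str) -> float:
    """Vocabulary diversity: unique words divided by total words. A crude proxy for substantive prose."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)

# One possible blend: keep the structural depth score but discount thin vocabulary
# depth = structural_depth * (0.5 + 0.5 * type_token_ratio(description))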

The Relevance Baseline Problem

A 0.25 baseline for zero-match relevance means that any well-structured signal from a reasonable source scores at least 0.30 and lands in the inbox. The baseline creates a false positive floor: the inbox accumulates signals that are well-tagged and from reasonable sources but have nothing to do with my interests.

The actual fix: Periodic inbox review remains necessary. The pipeline reduces the review surface from 7,700 items to 420 (a ~95% reduction), but it cannot eliminate manual review for the ambiguous zone.


Implementation Notes

The pipeline runs as a CLI tool. Input is a JSON array of signals (from RSS, Twitter API, or manual entry). Output is Obsidian-compatible markdown files written to domain folders.

python triage.py --input signals.json --vault ~/obsidian-signals/
python triage.py --input signals.json --vault ~/obsidian-signals/ --llm-triage
python triage.py --input signals.json --min-score 0.60  # Stricter routing

Pre-filtering runs before scoring: URL deduplication against existing vault notes, empty content filtering, and a noise source blocklist. Duplicate notes and spam sources never reach the scoring stage.
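
A sketch of the pre-filter stage, assuming the existing vault URLs and the noise blocklist have already been loaded into sets (the parameter names are illustrative):

def prefilter(signals: list[dict], known_urls: set[str], noise_sources: set[str]) -> list[dict]:
    """Drop duplicates, empty signals, and blocklisted sources before scoring runs."""
    kept = []
    for signal in signals:
        url = signal.get("url", "")
        if url and url in known_urls:
            continue  # already in the vault
        if not signal.get("title") and not signal.get("description"):
            continue  # empty content
        if signal.get("source", "").lower() in noise_sources:
            continue  # noise source blocklist
        kept.append(signal)
    return kept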

The scoring functions are pure: no side effects, no API calls, no filesystem access. Each function takes a signal dict and returns a scores dict. The pure approach makes them testable in isolation and composable with the LLM triage stage, which only runs on the ambiguous subset.7

Minimal Scoring Function

The core scoring logic fits in a single function. Here is a simplified version that captures the essential pattern:

def score_signal(signal: dict, keywords: dict[str, float]) -> dict:
    """Score a signal across four weighted dimensions. Returns composite 0.0-1.0."""
    title = signal.get("title", "").lower()
    desc = signal.get("description", "").lower()
    text = f"{title} {desc}"

    # Relevance: best keyword match blended with average
    matches = [v for k, v in keywords.items() if k in text]
    relevance = (max(matches) * 0.6 + sum(matches) / len(matches) * 0.4
                 if matches else 0.25)

    # Actionability: step function based on action keyword hits
    action_words = {"tutorial", "guide", "how-to", "build", "example", "github.com"}
    hits = sum(1 for w in action_words if w in text)
    actionability = {0: 0.10, 1: 0.40, 2: 0.60}.get(hits, min(1.0, 0.70 + hits * 0.05))

    # Depth: metadata richness (structure, not quality)
    depth = sum([
        0.20 if signal.get("title") else 0,
        0.20 if signal.get("description") else 0,
        0.15 if len(desc) > 50 else 0,
        0.15 if len(desc) > 150 else 0,
        0.15 if signal.get("tags") else 0,
        0.10 if len(signal.get("tags", [])) >= 3 else 0,
        0.05 if signal.get("url") else 0,
    ])

    # Authority: source-based baseline
    source = signal.get("source", "").lower()
    authority = {"twitter": 0.50, "blog": 0.60, "github": 0.70, "docs": 0.70}.get(source, 0.40)

    composite = round(relevance * 0.35 + actionability * 0.30 + depth * 0.20 + authority * 0.15, 3)
    return {"relevance": relevance, "actionability": actionability,
            "depth": depth, "authority": authority, "composite": composite}

The production version adds the domain allowlist, URL-specific bonuses, and pre-filtering. But the scoring logic above demonstrates the complete pattern: four pure functions, one weighted sum, deterministic output.

Sample Dataset: Five Signals, Five Scores

Run the score_signal function above against these five signals to verify the scoring behavior. The keyword dictionary used: {"claude code": 1.0, "swiftui": 1.0, "fastapi": 0.85, "design": 0.60, "python": 0.75}.

Signal 1 (composite 0.910 → Auto-write):
{"title": "Building a SwiftUI Component Library", "description": "A step-by-step tutorial showing how to build reusable SwiftUI views with previews and documentation. Includes working example code and a GitHub repository at github.com/example.", "tags": ["swiftui", "ios", "tutorial", "components"], "source": "blog", "url": "https://example.com/swiftui-lib"}

Signal 2 (composite 0.578 → Auto-write):
{"title": "Python 3.13 Release Notes", "description": "Official release notes documenting new features, deprecations, and breaking changes in Python 3.13 including improved error messages and typing enhancements.", "tags": ["python", "release"], "source": "docs", "url": "https://docs.python.org/3.13/whatsnew"}

Signal 3 (composite 0.353 → Inbox):
{"title": "Beautiful nemophila photos from Hitachi Seaside Park", "description": "Spring bloom season produces stunning blue flower fields stretching to the horizon. Best visited in late April through early May for peak color.", "tags": ["photography", "japan", "flowers", "travel"], "source": "twitter"}

Signal 4 (composite 0.232 → Skipped):
{"title": "Thread about tech hiring", "description": "", "tags": [], "source": "twitter"}

Signal 5 (composite 0.872 → Auto-write):
{"title": "FastAPI Guide: Build Production APIs", "description": "A comprehensive tutorial showing how to build, test, and deploy a production FastAPI application. Covers authentication, rate limiting, background tasks, and CI/CD. Full example code on github.com.", "tags": ["fastapi", "python", "tutorial", "guide", "api"], "source": "blog", "url": "https://example.com/fastapi-prod"}

Signal 1 arithmetic (step by step): Relevance: “swiftui” matches (1.0). One match, so 1.0 * 0.6 + 1.0 * 0.4 = 1.0. Actionability: “tutorial” + “build” + “example” + “github.com” = 4 hits → min(1.0, 0.70 + 4 * 0.05) = 0.90. Depth: title(0.20) + desc(0.20) + >50chars(0.15) + >150chars(0.15) + tags(0.15) + >=3tags(0.10) + url(0.05) = 1.00. Authority: “blog” → 0.60. Composite: 1.0*0.35 + 0.90*0.30 + 1.00*0.20 + 0.60*0.15 = 0.35 + 0.27 + 0.20 + 0.09 = 0.910.

Signal 1 scores high across all dimensions: exact keyword match (SwiftUI = 1.0), four action keywords (“tutorial,” “build,” “example,” “github.com”), full metadata, and a trusted source. Signal 2 lands just above the auto-write threshold: strong relevance (Python = 0.75) and high authority (docs = 0.70), but floor-level actionability (0.10; release notes are informational, not a tutorial). Signal 3 is the nemophila example from the Depth Trap section: high depth (0.80 for this version of the signal) and moderate authority (0.50), but near-zero relevance and actionability produce a composite that lands in the inbox rather than auto-routing. Signal 4 has no description, no tags, and no keyword match — it falls below the 0.30 floor and is silently skipped. Signal 5 demonstrates how a strong keyword match and dense action language combine: “fastapi” (0.85) matches in both the title and description (“python” appears only in the tags, which the simplified function does not scan), and five action keywords push actionability to 0.95.

Readers can paste the score_signal function and these five signals into a Python REPL to verify every number above.
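
A short driver makes that verification copy-pasteable. It assumes the five signal dicts above are collected into a list named signals, and it reuses score_signal with the keyword dictionary from this section:

KEYWORDS = {"claude code": 1.0, "swiftui": 1.0, "fastapi": 0.85, "design": 0.60, "python": 0.75}

def route(composite: float) -> str:
    """Map a composite score to one of the three routing bins."""
    if composite >= 0.55:
        return "Auto-write"
    if composite >= 0.30:
        return "Inbox"
    return "Skipped"

for i, signal in enumerate(signals, start=1):
    scores = score_signal(signal, KEYWORDS)
    print(f"Signal {i}: composite={scores['composite']:.3f} -> {route(scores['composite'])}")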


Key Takeaways

For engineers building triage systems:

  • Deterministic scoring beats manual curation at scale. 7,700 notes in 14 months. Manual triage at approximately 25 seconds per signal (read title, scan description, decide folder) would require 53+ hours of review time. The pipeline processed them in minutes with a ~93% auto-routing rate (5.5% sent to inbox for manual review, ~1.1% silently skipped below the 0.30 threshold). Approximately 95% of all signals required no manual intervention at all (auto-routed + skipped).

  • Metadata proxies have known failure modes. Depth measures structure, not quality. Authority measures source, not accuracy. Both proxies work at aggregate scale but produce predictable false positives at the individual signal level. Acknowledging the failure modes is more honest than claiming the algorithm “works.”

For knowledge management practitioners:

  • Weighted composites expose your actual priorities. Weights of 35/30/20/15 are not arbitrary. They encode a specific judgment: relevance matters more than actionability, which matters more than metadata richness, which matters more than source credibility. Making weights explicit and tunable is the difference between a system and a habit.

  • The ambiguous zone is irreducible. Signals between 0.30 and 0.55 are genuinely ambiguous: the deterministic scorer cannot resolve them. LLM triage helps but does not eliminate the zone. Manual review of the ambiguous subset remains necessary.

Exercise: Tune your own weights. Start with equal weights (25/25/25/25). Process 50 signals. Review the 10 most surprising routing decisions — signals that went to the wrong place. For each, identify which dimension’s weight would have corrected the routing. Adjust one weight at a time, reprocess the same 50 signals, and compare. Retune no more than once per week.
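
A small harness makes the exercise concrete. This sketch reuses score_signal and the route helper shown earlier, and recombines the four dimension scores under two weight sets; the equal-weight starting point is the one suggested above:

OLD_WEIGHTS = {"relevance": 0.25, "actionability": 0.25, "depth": 0.25, "authority": 0.25}
NEW_WEIGHTS = {"relevance": 0.35, "actionability": 0.30, "depth": 0.20, "authority": 0.15}

def recombine(scores: dict, weights: dict[str, float]) -> float:
    """Recompute the composite from the four dimension scores under a different weight set."""
    return round(sum(scores[dim] * w for dim, w in weights.items()), 3)

def routing_changes(signals: list[dict], keywords: dict[str, float]) -> list[tuple[str, str, str]]:
    """Return (title, old route, new route) for every signal whose routing bin changed."""
    changed = []
    for signal in signals:
        scores = score_signal(signal, keywords)
        old_route = route(recombine(scores, OLD_WEIGHTS))
        new_route = route(recombine(scores, NEW_WEIGHTS))
        if old_route != new_route:
            changed.append((signal.get("title", ""), old_route, new_route))
    return changed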


FAQ

Why use weighted composite scoring instead of machine learning for knowledge triage?

Composite scoring is deterministic: the same input always produces the same score, and you can explain exactly why. An ML classifier might achieve higher accuracy but introduces opacity — you cannot easily explain why a signal scored 0.62 instead of 0.54. For personal knowledge management where trust in the system matters more than marginal accuracy gains, deterministic scoring earns that trust. The weights are explicit, tunable, and auditable. ML classifiers also require labeled training data, which means reviewing hundreds of signals manually before the system works at all — the exact manual review burden the pipeline exists to eliminate.

Can I tune the weights dynamically over time?

Yes, and you should. The weights started at 25/25/25/25 (equal) and evolved to 35/30/20/15 over six months based on observed failure modes. The key signal for retuning: when your inbox consistently contains a specific type of false positive (e.g., well-tagged irrelevant content), increase the weight of the dimension that would have filtered it (relevance, in that case). Retune no more than one dimension at a time, and run a week of signals through both old and new weights before committing.

How does the LLM triage stage work?

The optional --llm-triage flag sends only ambiguous signals (0.30-0.55 composite score) to Claude for evaluation. The LLM receives the signal’s title, description, tags, and URL, then returns a score adjustment (-0.20 to +0.20) and a domain classification. The adjustment can push a signal above the 0.55 auto-write threshold or below the 0.30 skip threshold. Cost is approximately $0.02-0.04 per signal. Running LLM triage on only the ambiguous subset (420 of 7,700) costs $8-17 instead of $150-300 for the full set.

Can weighted composite scoring work for support tickets, content moderation, or lead qualification?

The architecture is general. Composite scoring with weighted dimensions applies to any domain where items need deterministic routing: support ticket triage (urgency, customer tier, topic complexity, SLA status), content moderation (severity, confidence, context sensitivity, reporter credibility), lead qualification (fit score, engagement level, intent signals, company size), and anomaly detection (deviation magnitude, historical frequency, business impact, source reliability). The specific weights and thresholds change per domain, but the four-dimension weighted composite pattern transfers directly. The compounding engineering principle applies: once the scoring architecture exists, adapting it to a new domain requires only redefining the keyword dictionaries and retuning the weights.


Composite scoring applies to any domain where items need deterministic routing. My specific weights and thresholds encode personal priorities, but the architecture is general. For more on how accumulated knowledge creates nonlinear value, see Mental Compound Interest. Each pipeline component (scoring, routing, triage) is independently useful and compounds with the others, following the compounding engineering pattern.



Notes

  1. I did not derive the weight distribution mathematically. I tuned it over six months of use. Initial weights were equal (25/25/25/25). Relevance increased to 35% after observing that high-depth, low-relevance signals (well-tagged irrelevant content) flooded the inbox. Actionability increased to 30% after observing that theoretical content with high relevance but no practical application accumulated without use.

  2. The 0.25 baseline for zero-match relevance is a deliberate design choice. Setting it to 0.0 would mean that any signal outside the curated keyword list scores maximum 0.65 (0 + action + depth + authority with no relevance contribution), making it nearly impossible for genuinely new topics to reach the auto-write threshold. 

  3. I chose step-function scoring for actionability over linear scoring because actionability is closer to a boolean than a continuous variable. A tutorial is actionable. A news article about a tutorial is not. The step function captures this binary nature better than a gradient. 

  4. I originally named the depth dimension “quality” and intended it to measure content richness. After observing that it measured metadata richness instead, I renamed it to “depth” to reflect its actual behavior. The name change is deliberate honesty about what the metric captures. 

  5. LLM triage uses Claude Code CLI (claude --print --model opus) with a structured prompt that asks for a score adjustment (-0.20 to +0.20) and a domain classification. Author’s cost estimate: approximately $0.02-0.04 per signal. Running LLM triage on all 7,700 signals would cost $150-300. Running it on only the 420 ambiguous signals costs $8-17. 

  6. Author’s domain distribution data as of February 2026. Counts reflect cumulative routing since December 2024. The distribution has been stable since month three, with Development and Design consistently accounting for 55-60% of routed signals. 

  7. I chose pure scoring functions as a deliberate architectural decision. The alternative (scoring functions that check the filesystem for duplicates or call APIs for enrichment) would have been more accurate but untestable without mocking. The pure approach sacrifices some accuracy for testability and composability. 

  8. Clay Shirky, “It’s Not Information Overload. It’s Filter Failure,” Web 2.0 Expo keynote, 2008. youtube.com/watch?v=LabqeJEOQyI. Shirky’s framing applies directly: an inbox with 400+ unprocessed notes is not an overload of information but an absence of filtering. See also Alvin Toffler, Future Shock, Random House, 1970, which coined “information overload” as the difficulty of making decisions when exposed to too much information. 

  9. Weighted linear combinations are a standard technique in multi-criteria decision analysis (MCDA). The approach here is a simplified Weighted Sum Model (WSM), one of the oldest MCDA methods. Saaty, T.L., The Analytic Hierarchy Process, McGraw-Hill, 1980, provides a rigorous methodology for deriving weights from pairwise comparisons — a more formal approach than the empirical tuning used here. The simpler additive model used in this pipeline follows Fishburn, P.C., “Additive Utilities with Incomplete Product Set,” Journal of Mathematical Psychology, 4(1), pp. 104-110, 1967. 

  10. The precision/recall trade-off is a fundamental concept in information retrieval. Increasing recall (catching more relevant items) necessarily admits more irrelevant items, reducing precision. The depth dimension optimizes for recall by rewarding any well-structured signal, which is why irrelevant but well-tagged content passes the threshold. See Manning, C.D., Raghavan, P., & Schütze, H., Introduction to Information Retrieval, Cambridge University Press, 2008, Chapter 8. nlp.stanford.edu/IR-book/ 
