
Seventeen Thousand Signals

My Obsidian vault contains 17,913 signal notes. Each one is a research paper, blog post, security advisory, or community discussion that my scanner identified as potentially relevant to one of nine topics I track: AI safety, LLM agents, Claude/Anthropic, SwiftUI/iOS, design systems, creative coding, ML research, science, and security.

Of those 17,913 signals, I have read perhaps 200 closely. Another 500 influenced a decision, a blog post, or a design choice. The remaining 17,213 are noise that I scanned, scored, and filed without acting on.

The noise is not wasted. The noise is the instrument.

The Scoring Problem

Every signal gets a composite score from 0 to 1, weighted across four dimensions: relevance (does it match my topics), actionability (can I do something with it), depth (is there substance), and authority (is the source credible). Signals scoring above 0.55 get written to domain folders. Signals between 0.40 and 0.55 go to inbox. Below 0.40, they are skipped.
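The routing described above can be sketched in a few lines. The 0.55 and 0.40 thresholds are the ones stated here; the per-dimension weights are assumptions for illustration, not the actual values in scan_intel.py.

```python
# Hypothetical sketch of the composite score and routing. The thresholds
# (0.55 domain, 0.40 inbox floor) are from the post; the weights are assumed.
WEIGHTS = {"relevance": 0.35, "actionability": 0.25, "depth": 0.20, "authority": 0.20}

def composite_score(dims: dict) -> float:
    """Weighted sum of the four 0-1 dimensions."""
    return sum(WEIGHTS[k] * dims[k] for k in WEIGHTS)

def route(score: float) -> str:
    """Map a composite score to its destination bucket."""
    if score > 0.55:
        return "domain"   # written to a domain folder
    if score >= 0.40:
        return "inbox"    # held for manual review
    return "skip"         # filed nowhere

# Example: a strong paper clears the domain threshold.
paper = {"relevance": 0.9, "actionability": 0.6, "depth": 0.8, "authority": 0.75}
print(route(composite_score(paper)))  # domain (score 0.775)
```

The useful property of this shape is that tuning is data-only: recalibrating means editing two thresholds and four weights, not rewriting logic.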

The thresholds are calibrated, not chosen. They emerged from months of scanning, reviewing what landed in each bucket, and adjusting until the signal-to-noise ratio felt right. The initial domain cutoff was too high (it missed papers that turned out to be important); an inbox floor of 0.30 was too low (the inbox filled with garbage). The current thresholds produce approximately 15-30 domain writes and 10-20 inbox items per scan across all topics.

The scoring system has biases I understand:

Research papers start at 0.75 authority. An arXiv paper with a matching category and keywords scores 0.75 before any content evaluation. This is deliberate: peer-reviewed research from relevant fields has baseline credibility that blog posts and HN discussions do not.

Security advisories start at 0.95 authority. A CVE from NVD or a GHSA from GitHub scores high regardless of content because the existence of a vulnerability advisory is itself the signal. The content is secondary to the fact.

HN discussions start at 0.55 authority. Community discussions are valuable for sentiment and discovery but unreliable for facts. A high-points HN story about a new paper is a discovery mechanism, not a source. The paper itself is the source.

These baselines encode my judgment about source reliability. A different person with different priorities would set different baselines. The baselines are not objective truth. They are a codified opinion about where trust comes from.
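The baseline numbers above are data, not logic, so they can live in a plain lookup. The values are the ones stated in this section; the dict structure and the 0.50 default are assumptions about how scan_intel.py might store them.

```python
# Codified opinion about where trust comes from. Values are from the post;
# the structure and the fallback default are assumed.
AUTHORITY_BASELINE = {
    "arxiv": 0.75,       # peer-adjacent research: credible before content review
    "nvd": 0.95,         # the advisory's existence is itself the signal
    "osv": 0.95,
    "ghsa": 0.95,
    "hackernews": 0.55,  # discovery mechanism, not a source
}

def baseline_authority(source_type: str) -> float:
    """Start authority at the codified trust level for the source."""
    return AUTHORITY_BASELINE.get(source_type, 0.50)  # assumed default

print(baseline_authority("ghsa"))  # 0.95
```

Swapping in your own priors means editing this table, nothing else.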

What the Noise Teaches

A heavy scan can produce 80-100 domain writes and 20-40 inbox items. The majority are noise: papers I will never read, advisories for software I do not use, discussions about topics I track but do not act on.

The noise teaches three things:

The shape of the field. When ai-safety scans consistently return papers on mechanistic interpretability and RLHF, that tells me where the research community is focused. When llm-agents scans suddenly produce five papers on agentic code review in one week, that tells me a trend is forming. The individual papers may be noise. The frequency distribution is signal.

The baseline for surprise. A paper scoring 0.65 in the ai-safety topic is unremarkable. A paper scoring 0.91 is surprising. The surprise is meaningful only because I have a baseline of what 0.65 looks like. The noise establishes the baseline. The signal is the deviation from baseline.

The gaps in my coverage. When the LiteLLM supply chain attack happened, my scan-intel pipeline caught it through HN keyword matching. The pipeline did not have security advisory sources (NVD, OSV, GHSA) at the time. The gap was invisible until an incident fell through it. I expanded the pipeline to add three security advisory sources the following week. The noise from those new sources is teaching me what normal advisory traffic looks like. The next gap will be visible sooner.

The Expansion

The pipeline started with 6 sources. It now has 12:

| Source | Type | What It Catches |
| --- | --- | --- |
| arXiv | API | Research papers by category and keyword |
| Semantic Scholar | API | Academic papers with citation data |
| Hacker News | API | Community discussion with point-weighted relevance |
| HuggingFace Daily Papers | API | ML papers curated by the HF community |
| Lobsters | RSS | Technical community discussion |
| Simon Willison | Atom | AI tooling commentary from a practitioner |
| Anthropic blog | Scrape | Official Anthropic announcements |
| Papers With Code | Scrape | Papers with implementations |
| Apple ML Research | Scrape | Apple’s ML research publications |
| NVD | API | CVEs with CVSS scoring (added March 2026) |
| OSV | API | Package-specific advisories for 15 monitored packages |
| GitHub Advisories | CLI | GHSA entries with alias cross-referencing |

Each source added noise. Each source also caught something the others missed. The LangChain path traversal vulnerability appeared in GHSA but not on HN. The Claudini autoresearch paper appeared on arXiv 12 hours before it surfaced on HN. The LiteLLM credential stealer appeared in OSV with the MAL-2026-2144 identifier that NVD did not yet carry.

The alias-based dedup system collapses cross-source duplicates. The same CVE appearing in NVD, OSV, and GHSA produces one signal note, not three. In the first live run, 6 out of 85 security signals were alias-deduped. The dedup rate will increase as the sources mature.
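Alias-based dedup reduces to set intersection: a signal is a duplicate if any of its identifiers (primary ID or aliases) has been seen before. This is a minimal sketch of that idea, with illustrative identifiers; the real three-layer dedup in scan_intel.py may differ.

```python
# Minimal alias dedup: the same advisory arriving from NVD, OSV, and GHSA
# under different identifiers collapses to one signal. IDs are illustrative.
def dedup_by_alias(signals: list) -> list:
    """Keep only signals whose id/alias set is entirely unseen."""
    seen = set()
    unique = []
    for sig in signals:
        ids = {sig["id"], *sig.get("aliases", [])}
        if ids & seen:
            seen |= ids       # record new aliases so later variants also match
            continue
        seen |= ids
        unique.append(sig)
    return unique

batch = [
    {"id": "CVE-2026-0001", "aliases": ["GHSA-aaaa-bbbb"]},  # hypothetical NVD entry
    {"id": "GHSA-aaaa-bbbb", "aliases": ["CVE-2026-0001"]},  # same advisory via GHSA
    {"id": "MAL-2026-9999"},                                 # unrelated OSV entry
]
print(len(dedup_by_alias(batch)))  # 2: the CVE/GHSA pair collapsed
```

Note the `seen |= ids` on the duplicate branch: a later variant that carries only a new alias of an already-filed advisory still gets caught.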

The Triage Discipline

Seventeen thousand signals demand a triage discipline. Mine is simple: scan the output, read the high scores, file the rest.

A typical scan takes 3 minutes to run and 2 minutes to review. I read every signal above 0.80 (usually 2-5 per scan). I skim the 0.60-0.80 range for surprises. I ignore everything below 0.60 unless a keyword catches my eye.

The scanning is habitual. Morning scan, evening scan. Some days produce 100+ domain writes (when a new arXiv batch drops). Some days produce zero (when the 7-day lookback window has been fully deduped). The variance is normal. The habit is constant.

The signals that matter most are the ones that change what I build or write. The Claudini paper (0.83) became a blog post. The LiteLLM supply chain attack (0.67 from HN, then confirmed via OSV at 0.62) became a blog post and two citation updates to existing posts. The LICA dataset (found manually, not by scan-intel) became a design taste engine plan. The SlopCodeBench paper (0.77) became a citation candidate for the compound context post.

Most signals do not become anything. They file silently into the vault, establish the baseline, and wait for the day when a new signal connects to an old one and produces an insight that neither signal contained alone.

The Vault as Memory

The vault is not a reading list. I do not intend to read the 17,213 signals I have not read. The vault is a queryable memory of what the field produced in the time I have been watching.

When I write a blog post about supply chain security, I can search the vault for every signal tagged “security” and “supply-chain” in the last 90 days. The search returns the LiteLLM attack, the Trivy compromise, the MCPTox benchmark, the Clinejection attack, and a dozen CVEs affecting AI infrastructure packages. Each one is a potential citation, a data point, or a counter-argument.
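A tag-and-recency query over a folder of markdown notes needs very little machinery. This is a rough sketch, assuming notes carry inline `#tag` markers; the actual vault layout and Dataview queries may look quite different.

```python
# Rough sketch of "query the vault": find notes modified in the last N days
# that carry all of the requested #tags. The inline-tag convention is assumed.
import time
from pathlib import Path

def search_vault(root: str, tags: set, days: int = 90) -> list:
    cutoff = time.time() - days * 86400
    hits = []
    for note in Path(root).rglob("*.md"):
        if note.stat().st_mtime < cutoff:
            continue  # outside the lookback window
        text = note.read_text(encoding="utf-8", errors="ignore")
        if all(f"#{t}" in text for t in tags):  # naive substring tag match
            hits.append(note)
    return hits
```

Dataview does this declaratively inside Obsidian; the point of the sketch is that the vault is plain files, so the query layer is replaceable.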

When I plan a new feature, I can search for signals related to the domain. The LICA dataset appeared in a scan-intel run as a 0.72 design-systems signal. I would not have found it through targeted search because I was not looking for graphic design datasets. The scan surfaced it because the keywords (“design systems,” “typography”) matched. The vault made the connection.

The 17,213 unread signals are not wasted effort. They are indexed context that I can query when I need it. The scanning is cheap. The indexing is automatic. The value is latent until the moment a question connects to an answer that was filed months ago.


FAQ

What tools do you use?

The scanner is a custom Python script (scan_intel.py, ~1,200 lines) that fetches from 12 sources, scores with a triage engine, deduplicates across three layers (URL, paper ID, advisory aliases), and writes markdown notes to an Obsidian vault. The vault uses Dataview for querying. Configuration is in JSON. State (seen IDs) is in JSON with 90-day pruning.
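The fetch, score, dedup, write shape described here fits in one loop. Function names below are hypothetical stand-ins, not scan_intel.py's real internals.

```python
# The FAQ's architecture as a minimal loop. All names are hypothetical
# stand-ins; the 0.40 inbox floor is from the post.
def run_scan(sources, score, dedup, write_note):
    signals = [sig for src in sources for sig in src()]  # fetch from every source
    fresh = dedup(signals)                               # URL / paper-ID / alias layers
    for sig in fresh:
        sig["score"] = score(sig)
        if sig["score"] >= 0.40:                         # below the floor: skipped
            write_note(sig)

# Usage with trivial stand-ins:
notes = []
run_scan(
    sources=[lambda: [{"id": "x", "relevance": 0.9}]],
    score=lambda s: s["relevance"],
    dedup=lambda xs: xs,
    write_note=notes.append,
)
print(len(notes))  # 1
```

Because each stage is a plain function, swapping a source or the scoring engine does not touch the loop.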

How much does this cost?

Zero. All sources are free-tier APIs or public RSS feeds. arXiv, Semantic Scholar, OSV, and the HN Algolia API require no authentication. NVD has a free tier with rate limits (5 requests per 30 seconds). GitHub advisories use the gh CLI which authenticates via your existing GitHub session.

How do you avoid information overload?

The scoring thresholds and the triage discipline. I spend 2 minutes per scan reviewing the output. Signals below 0.60 are filed without reading. The vault grows but my attention does not scale with it. The vault is a memory, not a reading assignment.

Can I use this system?

The architecture is portable: fetch from APIs, score with weighted criteria, deduplicate, write to a knowledge base. The specific sources, keywords, and thresholds are calibrated to my interests. You would need to define your own topics, keywords, and authority baselines. The scoring engine and dedup logic are domain-agnostic.
