Deep Research Agents Need Evidence Graphs

On May 15, 2026, Zhen Zhang and coauthors published Argus, a deep research agent system that treats research as evidence assembly rather than brute-force parallel search.1

That distinction matters.

Deep research agents can run many searches, open many pages, and write long answers. A long answer does not prove that the agent found the missing evidence. Parallel search can duplicate the same source cluster, push more excerpts into context, and still leave the hard part unsupported.

Deep research agents need evidence graphs. The agent should know which claims need support, which evidence pieces already exist, which pieces remain missing, and which final sentences depend on which sources.

TL;DR

Deep research agents should not measure progress by how many searches they ran or how much context they filled. They should measure progress by evidence coverage.

Argus gives the field a useful shape. Its Searcher gathers evidence traces for sub-queries, while its Navigator maintains a shared evidence graph, checks which pieces remain missing, dispatches more search work, and produces a source-traced final answer.1 That moves deep research away from “run more agents” and toward “assemble the missing proof.”

The same pattern appears across recent agent research. paper.json gives papers addressable claims and scope limits.2 ACDL gives agent context a formal description language.3 Exploration work argues that agents need verifiable checkpoints before acting.4 ARIS frames the central long-horizon research failure as plausible unsupported success.5 AgentForesight argues for online auditing before a decisive error cascades through a multi-agent run.6

The practical rule: every deep research answer should carry an evidence graph or a review packet that can show what the agent proved, what it inferred, and what remains unresolved.

Key Takeaways

For agent builders:

  • Track evidence as a graph of claims, sources, gaps, and dependencies.
  • Dispatch search work toward missing evidence instead of repeating broad queries.

For product teams:

  • Show source coverage, unresolved claims, and duplicated-search waste.
  • Let reviewers inspect why the final answer cites each source.

For researchers:

  • Separate evidence gathering from answer synthesis.
  • Evaluate coverage and traceability, not only final answer score.

For operators:

  • Treat a confident long report as unfinished until the evidence graph closes its important gaps.
  • Ask which claims still lack primary support before accepting the answer.

Why Does Parallel Search Stall?

Parallel search feels like progress.

Give ten agents the same research question, and the system produces motion. Agents search, summarize, compare, and return partial findings. The final synthesis can look thorough because the transcript contains many sources.

The failure hides inside redundancy.

| Parallel Search Behavior | Failure Mode |
| --- | --- |
| Many agents query similar terms | Sources overlap instead of complementing each other. |
| Each agent follows the first promising trail | Hard missing evidence stays untouched. |
| Context fills with excerpts | The synthesizer loses room for reasoning over gaps. |
| Final answer merges summaries | Unsupported claims can survive the merge. |
| Review starts at the final prose | Reviewer must reverse-engineer evidence coverage. |

Argus names that problem directly. The paper argues that deep research answers combine complementary evidence pieces, while parallel rollouts often duplicate rather than complete those pieces.1 More rollouts can push the aggregation context toward its limit without filling the missing parts.1

The lesson is not “never parallelize.” The lesson is “parallelize against a map.”
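One rough way to see whether parallel rollouts complement or duplicate each other is to compare total source hits with distinct sources. The sketch below is an illustration only, not a metric from the Argus paper; `duplication_report`, the worker names, and the example.org URLs are all hypothetical.

```python
from collections import Counter

def duplication_report(rollouts: dict[str, list[str]]) -> dict:
    """Summarize how much parallel rollouts overlap on sources.

    `rollouts` maps a worker name to the source URLs it returned.
    """
    # Count each source once per worker, so one worker re-reading a page
    # does not look like cross-worker duplication.
    hits = [url for urls in rollouts.values() for url in set(urls)]
    counts = Counter(hits)
    total = len(hits)
    unique = len(counts)
    return {
        "total_hits": total,
        "unique_sources": unique,
        # Share of hits that repeated a source some other worker also found.
        "duplicate_ratio": 0.0 if total == 0 else (total - unique) / total,
        "shared_sources": sorted(u for u, n in counts.items() if n > 1),
    }

# Hypothetical rollouts: three workers, mostly the same two pages.
rollouts = {
    "worker_a": ["https://example.org/paper", "https://example.org/blog"],
    "worker_b": ["https://example.org/paper", "https://example.org/blog"],
    "worker_c": ["https://example.org/paper", "https://example.org/dataset"],
}
report = duplication_report(rollouts)
print(report)
```

A high duplicate ratio does not prove the run failed, but it is a cheap signal that extra workers are circling the same source cluster.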

What Does Argus Add?

Argus splits deep research into two roles.

The Searcher collects evidence traces for a sub-query through ReAct-style interaction.1 The Navigator maintains a shared evidence graph, verifies which pieces remain missing, dispatches Searchers to gather those pieces, and reasons over the completed graph to produce a source-traced final answer.1

That role split changes the work object.

| Old Work Object | Argus Work Object |
| --- | --- |
| Search transcript | Evidence trace |
| Source pile | Shared evidence graph |
| Query fan-out | Missing-piece dispatch |
| Final prose | Source-traced answer |
| Broad synthesis | Coverage-aware synthesis |

The Navigator gives the agent a memory of what the answer still lacks. Without that layer, parallel workers can keep returning evidence for the same easy claim.

Argus also reports performance gains. With a 35B-A3B MoE backbone, the paper reports that Argus gains 5.5 points with a single Searcher and 12.7 points with eight parallel Searchers, averaged across eight benchmarks.1 The important detail is not only the score. The important detail is the architecture that makes extra searchers useful.

Searchers become useful because the Navigator points them at missing evidence.
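The role split can be illustrated with a toy loop. This is a minimal sketch of the idea, not the Argus implementation: `navigate`, `fake_searcher`, the `CORPUS` lookup table, and the piece names are all hypothetical, and a real Searcher would run ReAct-style tool interaction instead of a dictionary lookup.

```python
from typing import Callable, Optional

# A Searcher takes a sub-query and returns an evidence trace, or None on a miss.
Searcher = Callable[[str], Optional[str]]

def navigate(required: list[str], searcher: Searcher, max_rounds: int = 3) -> dict:
    """Dispatch searches only for pieces the graph still lacks."""
    graph: dict[str, Optional[str]] = {piece: None for piece in required}
    dispatched: list[str] = []
    for _ in range(max_rounds):
        missing = [p for p, ev in graph.items() if ev is None]
        if not missing:
            break  # every required piece has a trace
        for piece in missing:
            dispatched.append(piece)
            graph[piece] = searcher(piece)
    return {
        "graph": graph,
        "dispatched": dispatched,
        "unresolved": [p for p, ev in graph.items() if ev is None],
    }

# Hypothetical Searcher: a lookup table stands in for real tool use.
CORPUS = {
    "release date": "trace: vendor changelog entry",
    "benchmark score": "trace: results table in the paper",
}

def fake_searcher(sub_query: str) -> Optional[str]:
    return CORPUS.get(sub_query)

result = navigate(["release date", "benchmark score", "license terms"], fake_searcher)
print(result["unresolved"])
```

Pieces that already have a trace are never searched again, so later rounds spend their budget only on the piece that is still missing, and the unresolved list stays visible at the end.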

What Should An Evidence Graph Track?

An evidence graph should represent the answer before prose hardens around it.

At minimum, it should track:

| Node Type | Purpose |
| --- | --- |
| Claim | The sentence or sub-claim the answer wants to make. |
| Source | The primary or secondary source that supports a claim. |
| Evidence | The exact excerpt, table, figure, command output, or observation. |
| Gap | A claim with weak, missing, stale, or indirect support. |
| Conflict | Two sources or observations that disagree. |
| Scope limit | A boundary that prevents overclaiming. |
| Definition | A term whose meaning affects downstream claims. |
| Task decision | A choice the agent made because of evidence state. |

Edges matter more than nodes.

| Edge | Meaning |
| --- | --- |
| supports | Evidence supports a claim. |
| limits | Scope limit narrows a claim. |
| contradicts | Source conflicts with a claim or source. |
| depends_on | Claim needs another claim or definition. |
| missing_for | Gap blocks a claim. |
| dispatches | Navigator asks a Searcher to fill a gap. |
| used_in | Final answer sentence depends on a source or evidence node. |

The graph does not need academic graph-database ceremony. A JSON object, trace table, or review packet can work. The important property is inspectability: another reviewer can see why the answer says what it says.
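As a minimal sketch of how little machinery this needs, the node and edge types above fit in a small class that serializes to JSON. The class name `EvidenceGraph`, its methods, the node IDs, and the "Tool X" example are assumptions made for illustration, not a published schema.

```python
import json
from dataclasses import dataclass, field, asdict

NODE_TYPES = {"claim", "source", "evidence", "gap", "conflict",
              "scope_limit", "definition", "task_decision"}
EDGE_TYPES = {"supports", "limits", "contradicts", "depends_on",
              "missing_for", "dispatches", "used_in"}

@dataclass
class EvidenceGraph:
    nodes: dict = field(default_factory=dict)   # node id -> {"type", "text"}
    edges: list = field(default_factory=list)   # {"src", "type", "dst"}

    def add_node(self, node_id: str, node_type: str, text: str) -> None:
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = {"type": node_type, "text": text}

    def add_edge(self, src: str, edge_type: str, dst: str) -> None:
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.append({"src": src, "type": edge_type, "dst": dst})

    def why(self, node_id: str) -> list:
        """Return every edge pointing at a node: the reasons it stands."""
        return [e for e in self.edges if e["dst"] == node_id]

# Hypothetical content, for illustration only.
g = EvidenceGraph()
g.add_node("claim_01", "claim", "Tool X supports streaming output.")
g.add_node("ev_01", "evidence", "Excerpt from the hypothetical Tool X docs.")
g.add_node("limit_01", "scope_limit", "Only verified for version 2.")
g.add_edge("ev_01", "supports", "claim_01")
g.add_edge("limit_01", "limits", "claim_01")
print(json.dumps(asdict(g), indent=2))
```

Calling `g.why("claim_01")` returns both the supporting excerpt and the limit that narrows the claim, which is the inspectability property in miniature.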

Why Do Evidence Graphs Help Reviewers?

Reviewers need a smaller object than the whole transcript.

A deep research transcript can include dozens of tool calls, sources, summaries, retries, and notes. The reviewer usually wants answers to sharper questions:

  • Which final claims have direct support?
  • Which claims depend on secondary interpretation?
  • Which source appears more than once under different summaries?
  • Which missing question did the agent stop pursuing?
  • Which citation supports only background, not the key claim?
  • Which limitation should narrow the final answer?

An evidence graph gives reviewers that surface.

| Reviewer Question | Evidence Graph Answer |
| --- | --- |
| Where did the key claim come from? | Claim node with supports edges. |
| Did the agent overstate the paper? | Scope-limit edge attached to the claim. |
| Did workers duplicate effort? | Multiple sources supporting the same easy node while gap nodes remain open. |
| Can the answer ship? | No high-risk claim nodes remain unsupported. |
| What should another agent do next? | Dispatch from unresolved gap nodes. |

That shape pairs naturally with review packets. A final answer should not only give prose. It should give the evidence state that produced the prose.
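Several of those reviewer questions reduce to simple queries over claim records. The sketch below assumes a hypothetical record shape (`id`, `risk`, `sources`) and an arbitrary `duplicate_threshold`; both are illustrative choices, not part of any cited system.

```python
def review(claims: list[dict], duplicate_threshold: int = 2) -> dict:
    """Answer reviewer questions from claim records.

    Each claim is {"id": str, "risk": "high" | "low", "sources": [str, ...]}.
    """
    unsupported = [c["id"] for c in claims if not c["sources"]]
    blocking = [c["id"] for c in claims
                if c["risk"] == "high" and not c["sources"]]
    crowded = [c["id"] for c in claims
               if len(c["sources"]) > duplicate_threshold]
    return {
        # Ship only when no high-risk claim lacks support.
        "can_ship": not blocking,
        "blocking": blocking,
        # Unresolved gaps are the next dispatch targets.
        "next_dispatch": unsupported,
        # Heavily sourced claims matter only while gaps stay open.
        "possible_duplicate_effort": crowded if unsupported else [],
    }

# Hypothetical records: one over-sourced easy claim, one open high-risk gap.
claims = [
    {"id": "claim_01", "risk": "low", "sources": ["s1", "s2", "s3"]},
    {"id": "claim_02", "risk": "high", "sources": []},
    {"id": "claim_03", "risk": "low", "sources": ["s4"]},
]
verdict = review(claims)
print(verdict)
```

Here the run cannot ship: `claim_02` is high-risk and unsupported, while `claim_01` collected three sources that did nothing to close the gap.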

How Does paper.json Fit?

Evidence graphs need better source objects.

If every academic paper enters the graph as one undifferentiated PDF, the graph still has coarse nodes. A claim node can link to a paper, but it cannot easily link to a sub-claim, scope limit, definition, or reproduction command.

paper.json improves the input layer. The proposal gives papers stable claim IDs, explicit does-not-claim lists, per-figure shell commands, and stable definition IDs.2 A research agent can use those IDs as graph nodes.

| Paper Surface | Evidence Graph Node |
| --- | --- |
| claims[].id | Claim node. |
| does_not_claim[] | Scope-limit node. |
| definitions[].id | Definition node. |
| reproducibility.commands[] | Evidence-production node. |
| Repository URL | Source node. |
| Schema version | Provenance metadata. |

That connection matters for citation quality. The answer can cite C2 from a paper instead of vaguely citing the whole paper. The graph can also record that C2 carries a limitation from does_not_claim[].

Evidence graphs and agent-readable papers solve adjacent problems. The paper file makes evidence easier to address. The graph makes evidence easier to assemble.
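A sketch of that import step might look like the following. Only the field paths named above (`claims[].id`, `does_not_claim[]`, `definitions[].id`, `reproducibility.commands[]`) come from the proposal as described here; the inner `text` keys, the assumption that limits and commands are plain strings, the `paper_to_nodes` helper, and the sample paper are all hypothetical.

```python
def paper_to_nodes(paper_id: str, paper: dict) -> list[dict]:
    """Turn a paper.json-like dict into addressable graph nodes."""
    nodes = []
    for claim in paper.get("claims", []):
        nodes.append({"id": f"{paper_id}#{claim['id']}", "type": "claim",
                      "text": claim.get("text", "")})
    # Assumed shape: each does_not_claim entry is a plain string.
    for i, limit in enumerate(paper.get("does_not_claim", [])):
        nodes.append({"id": f"{paper_id}#limit_{i}", "type": "scope_limit",
                      "text": limit})
    for definition in paper.get("definitions", []):
        nodes.append({"id": f"{paper_id}#{definition['id']}",
                      "type": "definition",
                      "text": definition.get("text", "")})
    # Assumed shape: each command is a plain shell string.
    commands = paper.get("reproducibility", {}).get("commands", [])
    for i, command in enumerate(commands):
        nodes.append({"id": f"{paper_id}#cmd_{i}",
                      "type": "evidence_production", "text": command})
    return nodes

# Hypothetical paper file, for illustration only.
paper = {
    "claims": [{"id": "C1", "text": "Method A beats baseline B."},
               {"id": "C2", "text": "The gain holds at small scale."}],
    "does_not_claim": ["No claim about production latency."],
    "definitions": [{"id": "D1", "text": "Small scale means under 1B params."}],
    "reproducibility": {"commands": ["make figure1"]},
}
nodes = paper_to_nodes("demo", paper)
print([n["id"] for n in nodes])
```

With IDs like `demo#C2`, an answer sentence can point at one claim, and the graph can attach `demo#limit_0` to it with a `limits` edge.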

How Does Context Description Fit?

Deep research agents also need to know what entered context and when.

ACDL, the Agentic Context Description Language, targets that problem at the prompt layer. The paper argues that agent systems lack a standard way to describe prompt composition and context dynamics, relying instead on prose, diagrams, or code inspection.3 ACDL gives systems constructs for role message sequences, dynamic content, time-indexed references, and conditional or iterative structure.3

An evidence graph should connect to context state.

| Context Fact | Evidence Risk |
| --- | --- |
| Source entered context before a claim | Agent may quote or paraphrase it. |
| Scope limit did not enter context | Final prose may overclaim. |
| Conflicting source arrived late | Synthesis may ignore it. |
| Searcher saw only one branch | Evidence trace may be narrow. |
| Navigator dispatched a new query | Gap node caused targeted search. |

Context shape affects evidence shape. A source cannot support the answer if the synthesizer never saw the relevant passage. A limitation cannot constrain the answer if no one put it into context.

Deep research systems need both objects: a context description and an evidence graph.
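One small check connects the two objects: given a log of when each node entered context, list the nodes the synthesizer could not have seen in time. This is not ACDL syntax, only a plain Python sketch; `unseen_before`, the step numbers, and the node IDs are hypothetical.

```python
def unseen_before(events: list[tuple[int, str]],
                  node_ids: list[str], step: int) -> list[str]:
    """Return node ids that had not entered context before `step`.

    `events` is a log of (step_number, node_id) context insertions.
    """
    first_seen: dict[str, int] = {}
    for t, node_id in events:
        first_seen[node_id] = min(t, first_seen.get(node_id, t))
    # A node missing from the log is treated as arriving at `step` (too late).
    return [n for n in node_ids if first_seen.get(n, step) >= step]

# Hypothetical log: synthesis starts at step 4.
events = [(1, "src_a"), (2, "limit_a"), (5, "src_b")]
late_or_missing = unseen_before(events, ["limit_a", "src_b", "limit_b"], step=4)
print(late_or_missing)
```

In this example `src_b` arrived after synthesis began and `limit_b` never arrived, so neither could have shaped the answer even though both may appear in the graph.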

Why Does Exploration Matter?

Research agents can exploit too early.

“Look Before You Leap” names premature exploitation as a failure mode for LLM agents in unfamiliar environments.4 The paper introduces Exploration Checkpoint Coverage as a verifiable metric for whether agents discover key states, objects, and affordances before task execution.4

Deep research has the same shape. Agents can find one plausible trail and start answering before they understand the source space.

An evidence graph should preserve an exploration phase:

  1. Identify claim classes the answer will need.
  2. Map likely source types.
  3. Search for primary sources before commentary.
  4. Record missing source classes as gap nodes.
  5. Dispatch targeted searches for gaps.
  6. Only synthesize after important gaps close or receive explicit caveats.

That exploration phase prevents the agent from treating the first good source as the center of the answer.

The graph gives the agent a reason to keep looking: an open gap remains visible.
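The last step of that phase can be expressed as a gate. The sketch below assumes a hypothetical per-claim-class state with a count of primary sources and an optional caveat; the function name and the example classes are illustrative only.

```python
def ready_to_synthesize(claim_classes: dict[str, dict]) -> tuple[bool, list[str]]:
    """Gate synthesis on gap state.

    Each value is {"primary_sources": int, "caveat": str or None}.
    A class passes if it has a primary source or an explicit caveat.
    """
    open_gaps = [name for name, state in claim_classes.items()
                 if state["primary_sources"] == 0 and not state["caveat"]]
    return (not open_gaps, open_gaps)

# Hypothetical exploration state.
state = {
    "vendor claims": {"primary_sources": 2, "caveat": None},
    "benchmark results": {"primary_sources": 0, "caveat": None},
    "pricing": {"primary_sources": 0,
                "caveat": "No primary pricing source found."},
}
ready, gaps = ready_to_synthesize(state)
print(ready, gaps)
```

The pricing class passes because its gap carries an explicit caveat, while the benchmark class blocks synthesis until a targeted search closes it or someone writes the caveat down.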

What Goes Wrong Without The Graph?

Long-running research agents can fail without looking broken.

ARIS frames the central failure as plausible unsupported success: a long-running agent produces claims with evidence support that remains incomplete, misreported, or inherited from its own framing.5 That failure can pass casual review because the final report looks polished.

AgentForesight attacks a related problem in multi-agent systems. It argues that a single decisive error can cascade through a long-horizon trajectory, while post-hoc attribution arrives too late to intervene.6 Its online auditor sees only the current prefix and must decide whether to continue or alarm before the full trajectory ends.6

Evidence graphs help with both problems.

| Failure | Graph Response |
| --- | --- |
| Plausible unsupported success | Unsupported claim nodes remain visible. |
| Misreported source support | supports edges can be checked against excerpts. |
| Inherited framing | Scope and conflict nodes challenge the initial frame. |
| Cascading decisive error | Gap or conflict nodes can trigger a pause before synthesis. |
| Post-hoc review overload | Reviewer inspects graph state, not only final prose. |

The graph does not guarantee truth. It gives truth a structure the team can audit.

What Should Deep Research Products Show?

Deep research products should expose evidence state.

A user should not only see a final answer with footnotes. The interface should show:

| Surface | User Value |
| --- | --- |
| Claim coverage | Which claims have direct, indirect, or missing support. |
| Evidence graph | How sources connect to answer sections. |
| Gap list | Which questions remain unanswered. |
| Duplicate-source cluster | Where search workers repeated effort. |
| Conflict list | Which sources disagree. |
| Scope limits | Which caveats constrain the answer. |
| Source trace | Which search or read produced each evidence node. |
| Reviewer decision | Keep, revise, block, or continue researching. |

That interface gives users a way to steer the run. They can ask the agent to fill a specific gap instead of saying “research more.” They can reject a weak claim without discarding the whole answer. They can see when the agent has enough evidence to stop.

Good deep research UX should make missing evidence visible before final prose hides it.

What Should Teams Build First?

Start with a simple evidence table before building a graph engine.

| Field | Minimum Shape |
| --- | --- |
| Claim ID | claim_01, claim_02, or imported paper claim ID. |
| Claim text | The sentence the answer wants to support. |
| Source URL | Canonical URL or paper ID. |
| Evidence excerpt | Short source-backed passage or result. |
| Support type | Direct, indirect, background, conflict, or missing. |
| Scope limit | Caveat that narrows the claim. |
| Search trace | Query, tool, timestamp, and agent role. |
| Status | Supported, weak, conflicting, missing, or refused. |

Then add dispatch:

  1. Before synthesis, list all high-value missing claims.
  2. Send each missing claim to a Searcher with a narrow query.
  3. Require the Searcher to return evidence or an explicit miss.
  4. Update the graph.
  5. Synthesize only from supported and caveated claims.

That first version can remain plain. A Markdown table can beat an invisible transcript if it forces the agent to show evidence coverage.
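A plain first version could be as small as the sketch below: a row type, a dispatch pass over missing claims, and a Markdown rendering. It covers only a subset of the fields listed above, and the `Row` class, the `searched` flag, the `fill_gaps` and `to_markdown` helpers, and the example.org lookup are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Row:
    claim_id: str
    claim_text: str
    source_url: str = ""
    excerpt: str = ""
    status: str = "missing"   # supported, weak, conflicting, missing, refused
    searched: bool = False    # hypothetical flag: an explicit miss was recorded

# A Searcher returns (source_url, excerpt), or None for an explicit miss.
Finder = Callable[[str], Optional[tuple[str, str]]]

def fill_gaps(rows: list[Row], searcher: Finder) -> None:
    """Send each missing claim to a Searcher and update the table in place."""
    for row in rows:
        if row.status != "missing":
            continue
        hit = searcher(row.claim_text)
        row.searched = True
        if hit is not None:
            row.source_url, row.excerpt = hit
            row.status = "supported"

def to_markdown(rows: list[Row]) -> str:
    lines = ["| Claim ID | Status | Source URL |", "| --- | --- | --- |"]
    for r in rows:
        lines.append(f"| {r.claim_id} | {r.status} | {r.source_url or '-'} |")
    return "\n".join(lines)

# Hypothetical lookup standing in for a real Searcher.
KNOWN = {"The tool is open source.": ("https://example.org/license", "MIT text")}

rows = [
    Row("claim_01", "The tool has a CLI.", "https://example.org/docs",
        "usage section", status="supported"),
    Row("claim_02", "The tool is open source."),
    Row("claim_03", "The tool scales to 1M documents."),
]
fill_gaps(rows, KNOWN.get)
usable = [r.claim_id for r in rows if r.status == "supported"]
print(to_markdown(rows))
```

After the pass, `claim_03` is still missing but marked as searched, so the synthesizer can leave it out or caveat it instead of asserting it silently.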

The Worthy Standard

Deep research agents should earn trust by showing their evidence structure.

More searches can help. More agents can help. Longer context can help. None of those inputs prove that the final answer covered the missing pieces.

A worthy deep research run should answer four questions:

  • Which claims did the agent try to prove?
  • Which sources support each claim?
  • Which gaps or conflicts remain?
  • Which final sentences depend on which evidence?

When those answers stay visible, users can review the work. When those answers disappear into polished prose, users must trust a summary without seeing the proof shape.

Deep research needs evidence graphs because research is not a search-count problem. Research is a missing-piece problem.

Quick Summary

Deep research agents need evidence graphs because parallel search can duplicate easy source clusters while important claims remain unsupported. Argus provides a strong pattern: a Searcher gathers evidence traces, while a Navigator tracks a shared evidence graph, dispatches work toward missing pieces, and produces a source-traced answer.1

The same lesson connects to adjacent research. paper.json improves paper-level source objects.2 ACDL describes how context enters agent systems.3 Exploration checkpoints make information gathering verifiable.4 ARIS and AgentForesight show why polished long-horizon outputs need evidence and online review before errors cascade.5, 6

The operating rule is direct: do not ask a deep research agent only for an answer. Ask for the evidence graph that made the answer possible.

FAQ

What is an evidence graph for deep research agents?

An evidence graph connects claims, sources, excerpts, gaps, conflicts, scope limits, and final answer sentences. It lets reviewers see which evidence supports each part of a deep research answer.

Why is parallel search not enough?

Parallel search can duplicate sources and fill context without finding missing evidence. Deep research agents need a shared map of what the answer still lacks.

What did Argus contribute?

Argus split deep research into Searcher and Navigator roles. The Searcher gathers evidence traces, while the Navigator maintains a shared evidence graph, dispatches searches for missing pieces, and produces a source-traced final answer.1

How does paper.json relate to evidence graphs?

paper.json gives academic papers stable claim IDs, scope limits, definitions, and reproduction commands. Evidence graphs can use those IDs as precise nodes instead of citing a whole paper loosely.2

What should a product show users?

A product should show claim coverage, evidence links, unresolved gaps, duplicated search clusters, source conflicts, scope limits, and review decisions before asking users to trust the final prose.


References


  1. Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, and Xinyu Wang, “Argus: Evidence Assembly for Scalable Deep Research Agents,” arXiv:2605.16217v1, submitted May 15, 2026. Source for the Searcher/Navigator design, shared evidence graph, missing-piece dispatch, source-traced final answers, and reported score gains. 

  2. Arquimedes Canedo, “paper.json: A Coordination Convention for LLM-Agent-Actionable Papers,” arXiv:2605.16194v1, submitted May 15, 2026. Source for stable claim IDs, explicit does-not-claim lists, per-figure reproduction commands, stable definition IDs, and the need for agent-readable paper surfaces. 

  3. Noga Peleg Pelc, Gal A. Kaminka, and Yoav Goldberg, “A Language for Describing Agentic LLM Contexts,” arXiv:2605.01920v1, submitted May 3, 2026. Source for ACDL, context composition, context dynamics, role message sequences, dynamic content, time-indexed references, and the critique of informal context descriptions. 

  4. Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai, Yaorui Shi, Qi Gu, Xunliang Cai, and Fuli Feng, “Look Before You Leap: Autonomous Exploration for LLM Agents,” arXiv:2605.16143v1, submitted May 15, 2026. Source for premature exploitation, Exploration Checkpoint Coverage, and the Explore-then-Act framing. 

  5. Ruofeng Yang, Yongcan Li, and Shuai Li, “ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration,” arXiv:2605.03042v1, submitted May 4, 2026. Source for the plausible unsupported success failure mode in long-running research agents and the need for adversarial review of intermediate research artifacts. 

  6. Yiming Zhang, Pei Zhou, Jiahao Liu, Yifan Chen, Runzhe Yang, Zhenhailong Wang, Jiayi Pan, Chen Qian, Dong Li, and Heng Ji, “AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems,” arXiv:2605.08715v2, revised May 13, 2026. Source for decisive-error cascades, online auditing, trajectory-prefix review, and early alarm framing. 
