Research Papers Need Agent-Readable Claim Files

May 18, 2026 17 min read

ai agents research paper-json citations reproducibility ai-engineering

On May 15, 2026, Arquimedes Canedo proposed paper.json: a companion JSON file that lets a research paper expose stable claim IDs, explicit scope limits, per-figure reproduction commands, and stable definition IDs next to the PDF.¹

That small file points at a large problem.

Research agents now read papers, extract claims, cite sources, reproduce figures, build follow-up work, and summarize scope.¹ Prose still serves human readers. Prose alone gives agents too much room to cite the wrong sub-claim, generalize past the evidence, invent reproduction commands, or rebuild a definition from memory.

Research papers need agent-readable claim files. A paper should give agents a typed surface for what the paper claims, what the paper does not claim, how key terms work, and how evidence connects to figures and code.

TL;DR

Agent-readable claim files turn a paper from a prose-only artifact into a paper plus an addressable evidence surface. The PDF remains the human object. The claim file gives agents stable IDs, scope boundaries, definitions, and reproduction commands.

The paper.json proposal makes the case with a concrete schema and worked repo. The draft describes five conventions: stable claim IDs, an explicit does-not-claim list, exact per-figure shell commands, minimum viable compliance through one hand-written JSON file, and stable definition IDs.¹ The companion repository includes paper.json, schema.json, validator.py, resolve.py, the PDF, and the Typst source.²

The wider agent-research cluster supports the same direction. Argus treats deep research as evidence assembly rather than brute-force parallel search.³ ACDL gives agent contexts a formal description language.⁴ Exploration research shows agents need verifiable checkpoints before action.⁵ Agent-designed architecture work raises the stakes for paper-level reproducibility when agents generate scientific claims.⁶

The practical rule: publish the prose for humans and the claim file for agents.

Key Takeaways

For paper authors:
- Add stable IDs for claims, definitions, theorems, figures, and follow-up work.
- Write scope limits as first-class fields, not defensive prose hidden near the end.

For reviewers:
- Check whether machine-readable claims match the paper, not only whether the schema validates.
- Treat stale or exaggerated claim files as citation-risk defects.

For research-agent builders:
- Fetch the claim file before summarizing, citing, reproducing, or building on a paper.
- Cite claim IDs and definition IDs when a task depends on exact scope.

For journals and repositories:
- Accept a low-friction file beside the PDF before asking authors to adopt a full platform.
- Validate structure automatically and leave semantic review to humans and specialist agents.

Why Do Prose Papers Fail Research Agents?

Academic prose compresses evidence into a narrative.

That narrative helps humans. A careful reader can follow hedges, compare sections, infer which result supports which claim, and notice where the paper stops. Agents often process papers differently. They scan, chunk, retrieve, quote, summarize, and compose new artifacts under time and context limits.

That creates predictable failure modes.

Prose-Only Surface	Agent Failure
Claim appears inside a paragraph	Agent cites the wrong sub-claim or cites the whole paper.
Scope limit appears in discussion	Agent turns a bounded result into a general claim.
Figure command lives in a repo	Agent invents a plausible command or skips reproduction.
Definition appears once	Agent reconstructs the term inaccurately later.
Follow-up work sits in prose	Agent treats an open question as a proved result.

Canedo names several of those failures directly: sub-claims lack sub-paper citation handles, scope overextension passes through prose summaries, and figure commands often sit outside the paper in code repositories.¹

The fix does not require replacing the paper. The fix requires adding an interface that makes paper claims easier to address.

What Should A Claim File Contain?

An agent-readable claim file should expose the pieces that agents misuse most often.

Field	Agent Job
`id`	Name the paper with a stable slug.
`version`	Tell agents which claim surface they read.
`claims[]`	Let agents cite sub-claims by stable ID.
`does_not_claim[]`	Block scope overreach before a summary spreads.
`definitions[]`	Preserve author-written meanings for key terms.
`reproducibility.commands[]`	Give exact commands for figures, tables, or checks.
`follow_up_work[]`	Separate future work from evidence already shown.
`repository`	Give agents the canonical code and file location.
`schema`	Let tools validate structure before use.

The paper.json worked example includes a draft version, repository URL, author metadata, abstract, claims, scope exclusions, reproducibility commands, and schema-backed validation.² Its schema requires core fields such as id, title, version, status, authors, abstract, claims, does_not_claim, and reproducibility.²

Structure does not prove truth. Structure makes truth reviewable.

That distinction matters. The paper.json file explicitly says a passing validator cannot prove semantic correctness, completeness, or figure reproduction quality.² A stale claim file can do more harm than no claim file at all, because agents may trust a neat field over messy prose.

So the standard needs two layers:

- Structural validation: Does the file parse, contain required fields, and preserve declared IDs?
- Semantic review: Does the file faithfully represent the paper?

Authors can automate the first layer. Reviewers must own the second.
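The structural layer can be sketched in a few lines of Python. This is a minimal illustration, not the repository's validator.py: the required-field list is taken from the schema description in this article, and the function name `structural_problems` is hypothetical.

```python
import json

# Required top-level fields as described in this article; the real schema.json may differ.
REQUIRED_FIELDS = ["id", "title", "version", "status", "authors",
                   "abstract", "claims", "does_not_claim", "reproducibility"]


def structural_problems(claim_file: dict) -> list[str]:
    """Return structural problems; an empty list says nothing about meaning."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in claim_file]
    seen = set()
    for section in ("claims", "definitions"):
        for item in claim_file.get(section, []):
            item_id = item.get("id")
            if not item_id:
                problems.append(f"{section} entry without an id")
            elif item_id in seen:
                problems.append(f"duplicate id: {item_id}")
            else:
                seen.add(item_id)
    return problems


# Deliberately broken example: most required fields are absent and C1 appears twice.
example = json.loads('{"id": "my-paper", "claims": [{"id": "C1"}, {"id": "C1"}]}')
for problem in structural_problems(example):
    print(problem)
```

An empty result only means the file has the expected shape. Whether each claim matches the paper remains a question for the semantic review layer.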

Why Do Stable Claim IDs Matter?

Agents cite too coarsely when the only addressable unit is the whole paper.

A paper may contain a method claim, an evaluation claim, a limitation claim, a benchmark claim, and a follow-up claim. A human reader can cite the paper and explain which part matters. An agent often turns that whole-paper citation into a vague authority token.

Stable claim IDs give agents a smaller target.

Citation Target	Result
Whole paper	“The paper shows X.”
Section heading	“The method section says X.”
Stable claim ID	“Claim C2 states X under scope limit Y.”

Canedo’s draft reports pilot evidence for claim-ID retrieval. In the harder conceptual retrieval condition, agents using JSON claims scored 1.20 out of 2 on average, while agents searching prose scored 0.60 out of 2.² The paper labels that result as pilot evidence, not proof at scale.²

That caution improves the proposal. The argument does not depend on pretending the first pilot settled the field. It only asks authors to create a better review object.

Claim IDs let reviewers ask sharper questions:

- Did the agent cite C1 or the whole paper?
- Did the summary preserve the qualifier in C2?
- Did the downstream system build on C3 without checking the command?
- Did the agent confuse a definition ID with a result claim?

Those questions beat “did the summary sound right?”

Why Do Scope Limits Need Their Own Field?

Agents often overstate papers because limitations hide in prose.

A paper may say its benchmark covers five tasks, its method needs a specific environment, or its result does not generalize beyond a controlled setup. A human reader can hold that nuance. An agent summary can drop the qualifier after one rewrite.

An explicit does_not_claim[] field makes scope limits visible before reuse.

Hidden Scope Limit	Claim-File Shape
“We do not evaluate clinical safety.”	`does_not_claim: clinical safety`
“Our method assumes tool traces exist.”	`does_not_claim: trace-free operation`
“The pilot uses five examples.”	`does_not_claim: population-level proof`
“The command validates structure only.”	`does_not_claim: semantic correctness`

The paper.json proposal lists several exclusions for its own work. It does not claim C1, C2, or C3 as proved, does not claim the validator guarantees semantic correctness, does not claim the convention solves agent reading, and does not claim compatibility with every scholarly metadata standard.²

That list gives agents something useful: boundaries they can quote.

Scope fields also help evaluators. If an agent summary says “paper.json proves claim IDs improve agent citation accuracy,” the evaluator can compare the sentence against the does_not_claim[] field and flag overreach. Without a field, the evaluator must infer scope from prose.
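One way an evaluator might surface candidates for that comparison is a crude word-overlap check. The sketch below is an assumption-heavy illustration: the stopword list, the threshold `min_shared`, and the wording of the example exclusions are all hypothetical, and lexical overlap can only flag sentences for a human or specialist agent to judge.

```python
import re

# Small illustrative stopword list; a real evaluator would need something better.
STOPWORDS = {"the", "a", "an", "of", "is", "are", "that", "this",
             "does", "not", "to", "in", "as", "or", "and"}


def content_words(text: str) -> set[str]:
    """Lowercase the text and keep alphanumeric tokens that are not stopwords."""
    return {w for w in re.findall(r"[a-z0-9_]+", text.lower()) if w not in STOPWORDS}


def possible_overreach(summary: str, does_not_claim: list[str],
                       min_shared: int = 2) -> list[str]:
    """Return exclusions that share at least min_shared content words with the summary."""
    summary_words = content_words(summary)
    return [entry for entry in does_not_claim
            if len(summary_words & content_words(entry)) >= min_shared]


# Hypothetical exclusion wording, loosely modeled on the list described above.
exclusions = [
    "Claim IDs are proved to improve agent citation accuracy.",
    "The validator guarantees semantic correctness.",
]
summary = "paper.json proves claim IDs improve agent citation accuracy"
flagged = possible_overreach(summary, exclusions)
print(flagged)
```

A flagged entry is a prompt for review, not a verdict. A summary can share words with an exclusion while still respecting it.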

Why Should Figure Commands Sit Beside Claims?

Reproduction often fails at the command boundary.

Many papers point to a repository. The exact figure command may live in a script, a Make target, a notebook, a README note, or no obvious place. An agent can search the repo and assemble a command that looks plausible. A plausible command that nobody ever ran creates false confidence.

An agent-readable claim file should list reproduction commands directly.

The paper.json worked example includes commands for generating the validator, validating paper.json against paper.typ, and compiling the Typst paper into a PDF.² Canedo’s draft reports pilot evidence that JSON-supplied reproduction commands improved figure-command retrieval versus prose methods sections that point to a repository.²

The command field should stay modest:

Requirement	Reason
Exact command	Prevents invented shell fragments.
Expected artifact	Lets agents check output shape.
Environment note	Avoids hidden dependency guesses.
Figure or table ID	Connects command to paper evidence.
Known non-goal	Prevents agents from treating a smoke check as full reproduction.

Agents should not treat a command field as success. The command field gives the agent a target to run, record, and report.
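The run, record, and report loop might look like the following sketch. The entry keys `figure_id`, `command`, and `expected_artifact` are hypothetical field names chosen to mirror the requirements table, not the paper.json schema. The code also assumes the caller has already placed the process inside a suitable sandbox.

```python
import subprocess
import sys
from pathlib import Path


def run_and_record(entry: dict, cwd: str = ".") -> dict:
    """Run one reproduction entry and record the outcome without judging it."""
    # The command is an argument list, so no shell string is assembled or guessed.
    result = subprocess.run(entry["command"], cwd=cwd, capture_output=True, text=True)
    artifact = entry.get("expected_artifact")
    return {
        "figure_id": entry.get("figure_id"),
        "returncode": result.returncode,
        "stdout": result.stdout,
        "stderr": result.stderr,
        # None means the entry named no artifact, so nothing could be checked.
        "artifact_found": (Path(cwd) / artifact).exists() if artifact else None,
    }


# Hypothetical entry: a stand-in command that prints "ok" and writes no file.
entry = {
    "figure_id": "figure-2",
    "command": [sys.executable, "-c", "print('ok')"],
    "expected_artifact": "figure_2_hypothetical.png",
}
record = run_and_record(entry)
print(record["returncode"], record["artifact_found"])
```

Here the command exits cleanly but the expected artifact is missing, which is exactly the kind of partial result an agent should report rather than summarize as success.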

Where Do Definitions Fit?

Wrong definitions can cause more damage than wrong claims.

A wrong claim usually corrupts one sentence. A wrong definition infects every later sentence that uses the term. Agents that reconstruct definitions from prose can create a vocabulary that sounds internal to the paper while drifting away from the author’s meaning.

Stable definition IDs address that risk.

Canedo’s fifth convention gives definitions stable IDs, and the draft argues that author-written definitions should beat agent-reconstructed definitions for later reuse.¹ The repository’s resolver supports fragments such as #C1, #D1, #T1, and #F1, mapping IDs to claims, definitions, theorems, and follow-up items.²
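Fragment resolution of this kind can be approximated with a prefix lookup. The sketch below is not the repository's resolve.py: the mapping from prefix to section name, the `theorems` key, and the function name `resolve_fragment` are assumptions made for illustration.

```python
# Prefix-to-section mapping; the key names are assumptions, not the repository's layout.
SECTION_BY_PREFIX = {
    "C": "claims",
    "D": "definitions",
    "T": "theorems",
    "F": "follow_up_work",
}


def resolve_fragment(claim_file: dict, fragment: str) -> dict:
    """Resolve a fragment such as '#C1' to the matching entry, or raise KeyError."""
    item_id = fragment.lstrip("#")
    section = SECTION_BY_PREFIX.get(item_id[:1])
    if section is None:
        raise KeyError(f"unknown ID prefix in {fragment!r}")
    for item in claim_file.get(section, []):
        if item.get("id") == item_id:
            return item
    raise KeyError(f"{fragment!r} not found in {section!r}")


# Hypothetical claim file with one claim and one definition.
paper = {
    "claims": [{"id": "C1", "statement": "The method improves retrieval accuracy on benchmark X."}],
    "definitions": [{"id": "D1", "term": "retrieval accuracy"}],
}
print(resolve_fragment(paper, "#D1")["term"])
```

Raising an error for an unknown ID, rather than returning a near match, keeps an agent from quietly citing the wrong item.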

That mechanism matters for downstream systems.

Downstream Task	Definition Risk
Literature review	Agent merges terms from two papers with different meanings.
Benchmark extraction	Agent treats a metric name as if every paper defines it alike.
Code generation	Agent implements the wrong object because the definition drifted.
Follow-up experiment	Agent optimizes for a term the author never meant.

Claim files should make terms addressable. Agents should quote or resolve definitions before applying them.

How Should Research Agents Use Claim Files?

Agents need a reading protocol.

Before summarizing or citing a paper, a research agent should:

1. Fetch the paper’s claim file when available.
2. Validate the file structure.
3. Resolve the requested claim, definition, figure, theorem, or follow-up ID.
4. Cross-check the resolved item against the PDF when the task carries real stakes.
5. Preserve scope limits in every summary.
6. Run reproduction commands only inside an appropriate sandbox.
7. Report command output, missing files, and failed checks as evidence.
8. Fall back to prose only when the claim file lacks the needed item.

That protocol should produce a review packet:

Packet Field	Evidence
Paper	Title, version, repository, and PDF URL.
Claim file	URL, version, schema status, and validation output.
Resolved IDs	Claim IDs, definition IDs, figure IDs, or follow-up IDs used.
Scope limits	Relevant `does_not_claim[]` entries.
Reproduction	Commands run, outputs, failures, and environment.
Human check	Any claim the agent could not verify from the file or PDF.
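A review packet like the one in the table could be represented as a small data structure. The sketch below is one possible shape under stated assumptions: the class `ReviewPacket`, its field names, and the helper `build_packet` are hypothetical, and the packet only records which requested IDs the claim file actually declares.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ReviewPacket:
    paper_title: str
    claim_file_version: str
    schema_valid: bool
    resolved_ids: list[str] = field(default_factory=list)
    scope_limits: list[str] = field(default_factory=list)
    reproduction: list[dict] = field(default_factory=list)
    needs_human_check: list[str] = field(default_factory=list)


def build_packet(claim_file: dict, used_ids: list[str], schema_valid: bool) -> ReviewPacket:
    """Split the IDs an agent used into declared ones and ones a human must check."""
    declared = {c["id"] for c in claim_file.get("claims", [])}
    declared |= {d["id"] for d in claim_file.get("definitions", [])}
    return ReviewPacket(
        paper_title=claim_file.get("title", ""),
        claim_file_version=claim_file.get("version", ""),
        schema_valid=schema_valid,
        resolved_ids=[i for i in used_ids if i in declared],
        scope_limits=list(claim_file.get("does_not_claim", [])),
        needs_human_check=[i for i in used_ids if i not in declared],
    )


# Hypothetical claim file; the agent cited C1 and a nonexistent C7.
paper = {
    "title": "My Paper Title",
    "version": "0.1.0",
    "claims": [{"id": "C1"}],
    "does_not_claim": ["The method improves retrieval accuracy outside benchmark X."],
}
packet = build_packet(paper, ["C1", "C7"], schema_valid=True)
print(asdict(packet))
```

The `reproduction` list is left empty here. An agent would append one record per command it actually ran.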

The goal is not more paperwork. The goal is fewer unsupported citations.

How Does The Wider Agent Research Cluster Point The Same Way?

Recent agent research keeps returning to the same theme: agents need structured evidence surfaces, not more ungrounded fluency.

Argus treats deep research as evidence assembly. The system uses a Searcher and Navigator, where the Navigator tracks a shared evidence graph and dispatches search work toward missing evidence pieces.³ That design reinforces the need for papers to expose evidence pieces that agents can assemble.

ACDL targets context descriptions. The authors argue that agent systems need a precise, readable language for how prompts and interaction history evolve across steps.⁴ Claim files do a parallel job at the paper layer: they describe how the paper’s claims, definitions, and commands should enter agent context.

Exploration research adds another angle. “Look Before You Leap” introduces Exploration Checkpoint Coverage, a verifiable metric for whether an agent discovers key states, objects, and affordances before acting.⁵ Research agents need the same discipline before citing or reusing a paper. They should discover claims, definitions, limits, and commands before they act.

AIRA raises the stakes. The AIRA-Compose and AIRA-Design paper describes a multi-agent architecture search that proposes new foundation-model architectures, and it reports downstream gains over baselines.⁶ If agents can generate scientific design claims, papers describing those claims need machine-readable boundaries and reproduction hooks.

ARIS names a failure that fits the whole category: long-running research agents can produce plausible unsupported success when evidence support stays incomplete, misreported, or inherited from the executor’s framing.⁷ Claim files give research agents less room to inherit unsupported framing from prose alone.

The pattern is consistent. Serious research agents need explicit evidence objects.

What Can Authors Ship Now?

Authors do not need journal approval to start.

The first version can live beside the paper:

{
  "id": "my-paper",
  "title": "My Paper Title",
  "version": "0.1.0",
  "status": "draft",
  "repository": "https://github.com/example/my-paper",
  "claims": [
    {
      "id": "C1",
      "statement": "The method improves retrieval accuracy on benchmark X under condition Y.",
      "evidence": ["figure-2", "table-1"]
    }
  ],
  "does_not_claim": [
    "The method improves retrieval accuracy outside benchmark X."
  ],
  "definitions": [
    {
      "id": "D1",
      "term": "retrieval accuracy",
      "definition": "The percentage of queries whose top-ranked result matches the labeled answer."
    }
  ],
  "reproducibility": {
    "environment": "Python 3.11",
    "commands": ["python scripts/reproduce_figure_2.py"]
  }
}

The first file should answer five questions:

- Which exact claims can agents cite?
- Which claims should agents refuse to infer?
- Which definitions must remain stable?
- Which commands reproduce the evidence?
- Which version of the claim surface did the agent read?

That minimum gives agents a safer starting point. It also gives reviewers a concrete diff when the paper changes.
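A reviewer-facing diff between two versions of a claim file can be computed directly from the claim IDs. This is a minimal sketch assuming each claim entry carries the `id` and `statement` keys used in the example file above. The function name `claim_diff` is hypothetical.

```python
def claim_diff(old: dict, new: dict) -> dict:
    """Compare claims by ID between two versions of a claim file."""
    old_claims = {c["id"]: c["statement"] for c in old.get("claims", [])}
    new_claims = {c["id"]: c["statement"] for c in new.get("claims", [])}
    shared = old_claims.keys() & new_claims.keys()
    return {
        "versions": (old.get("version"), new.get("version")),
        "added": sorted(new_claims.keys() - old_claims.keys()),
        "removed": sorted(old_claims.keys() - new_claims.keys()),
        "changed": sorted(i for i in shared if old_claims[i] != new_claims[i]),
    }


# Two hypothetical versions: C1 loses its qualifier, C2 is new.
v1 = {"version": "0.1.0", "claims": [
    {"id": "C1", "statement": "The method improves accuracy on benchmark X under condition Y."},
]}
v2 = {"version": "0.2.0", "claims": [
    {"id": "C1", "statement": "The method improves accuracy on benchmark X."},
    {"id": "C2", "statement": "The method runs in linear time."},
]}
print(claim_diff(v1, v2))
```

A changed statement under an unchanged ID, as with C1 here, is the case reviewers most need to see, because downstream citations of that ID now point at different wording.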

What Should Reviewers And Platforms Check?

Reviewers should not rubber-stamp a valid JSON file.

They should compare the file against the paper.

Check	Failure
Claim parity	Claim file states more than the paper proves.
Scope parity	Key limitation appears in prose but not `does_not_claim[]`.
Definition parity	Definition in JSON conflicts with author wording.
Command parity	Command no longer reproduces the named artifact.
Version parity	PDF changed but claim file stayed stale.
ID parity	Paper mentions C1 or D1 that JSON lacks, or JSON declares orphan IDs.

Platforms can automate part of that work.

They can check JSON syntax, required fields, ID format, duplicate IDs, missing references, URL reachability, command presence, and version metadata. They can also ask an agent to compare the claim file with the prose and produce a review packet for humans.

Human review still decides meaning. Automation only makes drift visible.
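The ID parity check from the table is one of the easier ones to automate. The sketch below assumes IDs follow a letter-plus-number pattern with the prefixes C, D, T, and F, and that the claim file uses the section names shown. Both are assumptions, and a pattern this simple will also match unrelated tokens such as "F1" in "F1 score".

```python
import re

# Assumed ID shape: one of C, D, T, F followed by digits.
ID_PATTERN = re.compile(r"\b[CDTF]\d+\b")
# Assumed section names; "theorems" in particular is a guess.
SECTIONS = ("claims", "definitions", "theorems", "follow_up_work")


def id_parity(paper_text: str, claim_file: dict) -> dict:
    """Report IDs mentioned in the prose but absent from the JSON, and the reverse."""
    mentioned = set(ID_PATTERN.findall(paper_text))
    declared = set()
    for section in SECTIONS:
        declared |= {item["id"] for item in claim_file.get(section, [])}
    return {
        "missing_from_json": sorted(mentioned - declared),
        "orphans_in_json": sorted(declared - mentioned),
    }


# Hypothetical prose and claim file: C2 is only in the prose, C3 only in the JSON.
text = "Claim C1 holds under definition D1. See also C2."
claim_file = {
    "claims": [{"id": "C1"}, {"id": "C3"}],
    "definitions": [{"id": "D1"}],
}
report = id_parity(text, claim_file)
print(report)
```

Either list being non-empty signals drift between the paper and its claim file. Deciding which side is wrong still falls to a human reviewer.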

What Should The Standard Refuse?

Agent-readable claim files should stay small enough to adopt and strict enough to matter.

Refuse three temptations.

First, refuse platform dependency. A file beside the PDF beats a new platform that no author adopts. Canedo’s draft argues that minimum viable compliance should require one hand-written JSON file, not new tooling or platform registration.¹

Second, refuse fake certainty. A schema can validate shape. It cannot prove semantic truth. Claim files should say what they prove, what they do not prove, and how reviewers can check drift.

Third, refuse hidden strategy. Agents need evidence handles, not private author prompts. A public claim file should expose claims, definitions, limits, and commands. It should not expose private peer-review notes, hidden evaluation rubrics, credentials, or unreleased data paths.

Good standards reduce ambiguity without demanding trust in secret machinery.

The Worthy Standard

The worthy paper does more than persuade a human reader. It gives future readers, agents, reviewers, and builders a way to reuse the work without stretching it.

An agent-readable claim file should make the paper easier to trust by making its boundaries easier to inspect.

The standard is simple:

- Give every important claim an address.
- Give every scope limit a field.
- Give every key definition a stable ID.
- Give every reproduced figure an exact command.
- Give every agent a reason to cite the paper narrowly.

Research agents will keep reading papers. Authors can leave them to scrape prose, or authors can give them a surface built for evidence.

The second path creates better citations, safer summaries, and fewer plausible claims with no reliable anchor.

Quick Summary

Research papers need agent-readable claim files because agents already summarize, cite, test, and reuse academic work. Prose alone gives agents too much room to cite whole papers instead of sub-claims, overstate scope, invent commands, or drift on definitions.

paper.json offers a practical starting point: stable claim IDs, explicit scope exclusions, per-figure commands, minimum viable adoption through one JSON file, and stable definition IDs.¹ Its worked repository adds schema validation, a resolver, and a concrete example file.²

The best first version is small: claims, non-claims, definitions, reproduction commands, version metadata, and a repository link. The file should not replace the paper. It should make the paper safer for agents to read.

FAQ

What is an agent-readable claim file?

An agent-readable claim file is a structured file beside a paper that exposes claims, scope limits, definitions, reproduction commands, and related metadata in a format agents can retrieve and cite.

Does paper.json replace the PDF?

No. The PDF remains the human-readable paper. The claim file gives agents an addressable evidence surface so they can cite and test paper claims more safely.

What problem does paper.json try to solve?

paper.json targets recurring agent-reading failures: wrong sub-claim citations, scope overextension, hidden figure commands, and unstable definitions.¹

Does a passing schema prove the claim file is correct?

No. A schema can validate required fields, IDs, and structure. Human or specialist-agent review still needs to check whether the claim file faithfully represents the paper.

What should authors include first?

Authors should start with stable claim IDs, a does_not_claim[] section, stable definitions, exact reproduction commands, a repository URL, and a version for the claim file.

References

1. Arquimedes Canedo, “paper.json: A Coordination Convention for LLM-Agent-Actionable Papers,” arXiv:2605.16194v1, submitted May 15, 2026. Source for the companion JSON proposal, stable claim IDs, explicit does-not-claim list, per-figure shell commands, minimum viable compliance claim, stable definition IDs, and caution that the claims remain open hypotheses.
2. Arquimedes Canedo, “paper-json,” GitHub repository, accessed May 18, 2026. Source for repository files including paper.json, schema.json, validator.py, resolve.py, paper.pdf, paper.typ, the worked example, schema-required fields, validation limits, reproduction commands, and fragment resolver behavior.
3. Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, and Xinyu Wang, “Argus: Evidence Assembly for Scalable Deep Research Agents,” arXiv:2605.16217v1, submitted May 15, 2026. Source for Searcher/Navigator roles, shared evidence graph, missing evidence dispatch, and evidence assembly framing for deep research agents.
4. Noga Peleg Pelc, Gal A. Kaminka, and Yoav Goldberg, “A Language for Describing Agentic LLM Contexts,” arXiv:2605.01920v1, submitted May 3, 2026. Source for ACDL, the need to describe agent context composition and dynamics, and the critique of informal prose, ad hoc diagrams, and code inspection as insufficient context descriptions.
5. Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai, Yaorui Shi, Qi Gu, Xunliang Cai, and Fuli Feng, “Look Before You Leap: Autonomous Exploration for LLM Agents,” arXiv:2605.16143v1, submitted May 15, 2026. Source for premature exploitation, Exploration Checkpoint Coverage, and the Explore-then-Act paradigm.
6. Alberto Pepe, Chien-Yu Lin, Despoina Magka, Bilge Acun, Yannan Nellie Wu, Anton Protopopov, Carole-Jean Wu, and Yoram Bachrach, “Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design,” arXiv:2605.15871v1, submitted May 15, 2026. Source for multi-agent neural architecture discovery, 24-hour exploration, reported architecture families, and downstream accuracy/scaling claims.
7. Ruofeng Yang, Yongcan Li, and Shuai Li, “ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration,” arXiv:2605.03042v1, submitted May 4, 2026. Source for the plausible unsupported success failure mode in long-running research agents and the need for adversarial review of intermediate research artifacts.