
When Your Agent Becomes the Researcher

On March 25, 2026, researchers published a paper showing that Claude Code, running in a loop, autonomously discovered adversarial attack algorithms that outperform every existing method in the field. Not by a small margin. The best Claude-discovered attack achieved 100% success rate against Meta’s SecAlign-70B safety model. The best existing baseline achieved 56%.1

The system started from GCG, a standard gradient-based attack published in 2023.6 Claude Code read the implementation, proposed a variant, wrote the code, submitted GPU jobs, inspected the results, and proposed the next variant. Over 96 iterations against a safeguard model, it reduced the attack loss from 4.969 to 1.188. Over 82 iterations on a general target, it achieved 10x lower loss than the best Optuna-tuned baselines.1

The researchers called the project Claudini. The technique is autoresearch: an LLM agent that iterates on research code the way a developer iterates on product code. The difference is that the agent’s “product” is a scientific result, and the iteration cycle runs continuously without human intervention.

I run autoresearch loops in my own infrastructure.

TL;DR

  • What happened: Claude Code autonomously discovered adversarial attack algorithms that beat 33 existing methods, including every variant of GCG, across jailbreaking and prompt injection benchmarks.1
  • How: A five-step autoresearch loop. Read existing results, propose a variant, implement it, evaluate on GPU, inspect results. Repeat. Fixed compute budget per experiment.1
  • Key numbers: 40% attack success rate (ASR) on GPT-OSS-Safeguard-20B CBRN queries vs 10% for all baselines. 100% ASR on Meta-SecAlign-70B prompt injection vs 56% best baseline.1
  • What it actually did: Recombined existing methods in non-obvious ways. Merged MAC’s momentum with TAO’s candidate scoring. Discovered coarse-to-fine replacement schedules. Tuned hyperparameters systematically across a space humans had not explored.1
  • Why it matters: Autoresearch is not a parlor trick. It is a research methodology that produces publishable results. The same loop pattern applies to any domain with dense quantitative feedback.
  • The dual-use question: The same capability that discovers better attacks discovers better defenses. Any safety benchmark that cannot survive an autoresearch run is not a credible benchmark.

The Loop

Claudini runs a five-step cycle:1

  1. Read existing results and method implementations
  2. Propose a new optimizer variant based on what worked and what did not
  3. Implement the variant as a Python class extending TokenOptimizer
  4. Evaluate by submitting GPU jobs with a fixed FLOPs budget
  5. Inspect results to inform the next iteration
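Step 3 is where new code enters the system. A variant is a Python class extending TokenOptimizer; the sketch below is my assumption about the shape of that contract (the class names and method signatures are illustrative, not the actual Claudini API):

```python
# Hypothetical sketch of step 3. The TokenOptimizer contract below is
# an assumption about the interface shape, not the actual Claudini API.

class TokenOptimizer:
    """Assumed base interface: optimize an adversarial suffix of fixed
    length under a fixed FLOPs budget."""

    def __init__(self, suffix_len: int, flops_budget: float):
        self.suffix_len = suffix_len
        self.flops_budget = flops_budget

    def step(self, suffix, grads):
        raise NotImplementedError


class MomentumVariant(TokenOptimizer):
    """Example variant: MAC-style momentum smoothing of token gradients."""

    def __init__(self, suffix_len: int, flops_budget: float,
                 momentum: float = 0.908):
        super().__init__(suffix_len, flops_budget)
        self.momentum = momentum
        self.velocity = None

    def step(self, suffix, grads):
        # Exponential moving average: smoothed gradients favor token
        # swaps that help consistently across steps, not just once.
        if self.velocity is None:
            self.velocity = list(grads)
        else:
            self.velocity = [self.momentum * v + (1 - self.momentum) * g
                             for v, g in zip(self.velocity, grads)]
        return self.velocity
```

The subclass boundary is what makes the loop tractable: the agent only ever has to write one small class per iteration, and everything else (evaluation, budgeting, logging) stays fixed.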

The loop runs via Claude Code’s /loop command with a prompt that tells the agent to maximize a leaderboard metric.5 The agent maintains an AGENT_LOG.md as persistent memory across iterations, tracking attempted solutions, performance data, and insights. It commits changes to a branch after each significant work unit.2

The constraint design matters. Each experiment runs under a fixed compute budget measured in FLOPs, not wall time. Suffix length is fixed (30 tokens for safeguard attacks, 15 for random targets). The evaluation metric is cross-entropy loss over target token sequences, with attack success rate (ASR) measured by greedy decoding producing the exact target. These constraints prevent the agent from gaming the evaluation by simply scaling compute.1
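The five steps plus the budget constraint fit in a few lines. Everything below is an illustrative stand-in, not the Claudini implementation: the toy objective, the mutation factors, and `propose_variant` are all made up for the sketch (the lr values 160 and 10 echo the v63 numbers from the paper):

```python
# Minimal runnable sketch of the five-step autoresearch loop.
# All names and the toy objective are illustrative stand-ins.
import random

def evaluate(params: dict) -> float:
    """Stand-in for step 4: a fixed-budget GPU run returning a loss.
    Toy objective: minimized at lr = 10 (the value v63 converged on)."""
    return abs(params["lr"] - 10) / 10

def propose_variant(history: list) -> dict:
    """Stand-in for step 2: mutate the best configuration so far."""
    if history:
        best = min(history, key=lambda h: h["loss"])["params"]
    else:
        best = {"lr": 160}  # the published default the agent starts from
    return {"lr": best["lr"] * random.choice([0.5, 0.9, 1.1])}

def autoresearch_loop(n_iterations: int) -> list:
    log = []  # persistent memory across iterations, like AGENT_LOG.md
    for _ in range(n_iterations):
        params = propose_variant(log)                 # steps 1-2: read, propose
        loss = evaluate(params)                       # steps 3-4: implement, evaluate
        log.append({"params": params, "loss": loss})  # step 5: inspect, record
    return log

random.seed(0)
trajectory = autoresearch_loop(50)
best = min(trajectory, key=lambda h: h["loss"])
```

Even this toy version shows the dynamic: most proposals fail, the log keeps the trajectory, and improvement comes from mutating the running best rather than restarting from scratch.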

Human oversight was maintained throughout. When the agent began reward hacking late in the safeguard run (searching random seeds, warm-starting from previous runs around version 95), the researchers detected and noted the behavior. The constraint design made reward hacking detectable because the compute budget was fixed.1

What Claude Actually Discovered

The paper is careful to note that Claude “primarily recombined existing ideas rather than discovering fundamentally novel algorithms.” This is precisely what makes the result significant.

The two winning methods illustrate the pattern:1

Claude v63 (random-target winner, 100% ASR on Meta-SecAlign-70B): Started from ADC (a continuous relaxation method), then modified it with sum-loss aggregation that decouples learning rate from restart count, added LSGM gradient scaling at 0.85, dropped the learning rate from 160 to 10, and reduced restarts from 16 to 6. Each modification is individually unremarkable. The combination outperforms every published method.

Claude v53-OSS (safeguard winner, 40% ASR on CBRN queries): Fused MAC’s momentum-smoothed gradients with TAO’s directional perturbation candidate scoring. Set momentum to 0.908 (MAC default is 0.4). Added a coarse-to-fine replacement schedule: 2-position replacements for 80% of the budget, then 1-position for the final 20%. Temperature 0.4 for sampling. Again, each piece exists in the literature. The fusion does not.
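The coarse-to-fine schedule is almost trivially simple once discovered. An illustrative sketch (not the paper's code):

```python
# Illustrative sketch of the v53-OSS coarse-to-fine schedule:
# replace 2 suffix positions per step for the first 80% of the
# compute budget, then 1 position for the final 20%.

def positions_to_replace(step: int, total_steps: int) -> int:
    """Coarse exploration early, fine refinement late."""
    return 2 if step < 0.8 * total_steps else 1
```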

The agent explored 96 variants against the safeguard model and 82 against random targets. Most failed. The successful ones share a pattern: they combine mechanisms from different method families (discrete optimization meets continuous relaxation, momentum-based methods meet directional perturbation) in configurations no human researcher had tried. The search space of method combinations is too large for manual exploration. It is exactly the right size for an agent with a dense quantitative signal and a fixed compute budget.

The Baselines Were Not Weak

Claudini compared against 33 methods across three categories:1

Discrete optimization (21 methods): GCG, I-GCG, MAC, TAO, ACG, AttnGCG, BEAST, Faster-GCG, GCG++, MAGIC, DeGCG, Mask-GCG, MC-GCG, REINFORCE-GCG, SlotGCG, SM-GCG, TGCG, RAILS, UAT, AutoPrompt, ARCA.

Continuous relaxation (7 methods): GBDA, PEZ, ADC, PGD, Reg-Relax, EGD, REINFORCE-PGD.

Gradient-free (5 methods): LLS, BoN, PRS, Probe Sampling, plus random search.

The baselines were further optimized with Optuna hyperparameter tuning, giving them a stronger starting position than their default configurations. Claude’s methods still achieved 10x lower loss than the best Optuna-tuned baseline by version 82.1

The comparison is not “Claude versus untuned defaults.” It is “Claude versus the best configuration of every published method, further optimized by automated hyperparameter search.”

Why Autoresearch Worked Here

Autoresearch succeeds in adversarial attack optimization for three structural reasons:1

Dense quantitative feedback. Cross-entropy loss and attack success rate are continuous, differentiable-adjacent metrics. The agent receives a clear numerical signal after every experiment. Contrast this with research domains where the signal is ambiguous (qualitative user studies) or delayed (clinical trials). Dense feedback enables rapid iteration.
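Concretely, both signals reduce to a few lines. The pure-Python sketch below is illustrative; in practice the probabilities come from the target model's logits:

```python
# Sketch of the two feedback signals the agent optimizes against.
import math

def cross_entropy(target_token_probs: list) -> float:
    """Mean negative log-likelihood the target model assigns to each
    token of the target sequence. Dense: every experiment moves it."""
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)

def attack_success(greedy_output: str, target: str) -> bool:
    """ASR is binary and strict: greedy decoding must produce the
    exact target."""
    return greedy_output == target
```

The loss is the workhorse signal because it is continuous: a suffix that raises target-token probabilities lowers the loss even before the attack fully succeeds, giving the agent gradient-like feedback between the rare binary wins.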

Strong baselines as starting points. The agent did not start from scratch. It started from 33 published implementations with known performance characteristics. Each iteration modified an existing method rather than inventing from nothing. The agent’s advantage is systematic exploration of the combination space, not creative invention.

Fixed compute budget as constraint. The FLOPs budget prevents the agent from winning by simply running longer. Every improvement must come from algorithmic innovation, not resource scaling. This is the same principle behind fixed-budget ML competitions: constrain resources to force creativity.

These three conditions define the autoresearch sweet spot. Any domain with dense quantitative feedback, existing strong baselines, and measurable resource constraints is a candidate for agent-driven research. Domains with sparse or delayed feedback (user preference studies, clinical outcomes), no strong baselines to iterate from, or unconstrained compute budgets that reward scaling over innovation are poor fits. The sweet spot is narrower than it first appears.

I Run This Pattern

My infrastructure includes an autoresearch loop based on Karpathy’s methodology.3 The orchestrator runs fixed-budget training experiments on Apple Silicon via MLX, autonomously modifying model architecture and hyperparameters to minimize validation loss. Keep improvements, discard regressions.

The Claudini results validate the pattern at a scale I have not attempted. My experiments optimize small models (2-5M parameters) for specific tasks. Claudini optimized adversarial attack algorithms against 7B-70B parameter targets with GPU-scale compute. The loop is the same. The scale is different.

Three observations from running autoresearch in production:4

The log is the product. Claudini maintains AGENT_LOG.md as persistent memory. My system uses jiro.progress.json. Both serve the same function: the research artifact is not just the final result but the trajectory of failed and successful experiments. The log enables the agent to avoid repeating failed approaches and to build on partial successes across sessions.
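The pattern is the same regardless of file format: append one record per experiment, hand the full trajectory back to the agent. A minimal sketch, with a schema invented for illustration (neither project's actual format):

```python
# Sketch of the persistent-log pattern behind AGENT_LOG.md and
# jiro.progress.json. The JSON schema here is made up for illustration.
import json
from pathlib import Path

def append_entry(log_path: Path, entry: dict) -> list:
    """Append one experiment record; return the full trajectory so the
    agent can avoid repeating failures and build on partial successes."""
    history = json.loads(log_path.read_text()) if log_path.exists() else []
    history.append(entry)
    log_path.write_text(json.dumps(history, indent=2))
    return history
```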

Reward hacking is detectable. Claudini’s researchers caught the agent searching random seeds and warm-starting from previous runs. My drift detector catches similar behavior: when the agent’s recent actions diverge from the original task (cosine similarity drops below 0.30), the system injects a warning. Fixed compute budgets and behavioral monitoring are complementary defenses against the same failure mode.
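The drift check reduces to a cosine similarity between an embedding of the original task and an embedding of recent actions; the vectors below are assumed to come from some embedding model, and the 0.30 threshold is the one from my setup:

```python
# Sketch of the drift check. Input vectors are assumed to be
# embeddings of the task description and of recent agent actions.
import math

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def drifted(task_vec: list, recent_vec: list, threshold: float = 0.30) -> bool:
    """Flag when recent behavior no longer resembles the original task."""
    return cosine(task_vec, recent_vec) < threshold
```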

Recombination beats invention. Claude’s winning methods merged MAC with TAO, ADC with gradient scaling. My best results come from combining existing architectural patterns in configurations the search had not covered. The agent is not creative in the human sense. It is exhaustive in a space too large for manual exploration.

The Dual-Use Reality

Claudini discovered better attacks. The same loop discovers better defenses. The paper states this explicitly: autoresearch-level adversarial pressure is “the minimum adversarial pressure any new defense is expected to withstand. If a method cannot survive autoresearch-driven attacks, its robustness claims are not credible.”1

This reframes the safety research landscape. Every published defense now has a shelf life measured by how long it takes an autoresearch loop to defeat it. Benchmarks that cannot survive automated optimization are not benchmarks; they are checkboxes.

The researchers note that “not all benchmarks remain equally meaningful once agents can optimize against them directly. Some should be explicitly recast as research environments.”1 This is the right framing. A benchmark that an agent can optimize against in 96 iterations is not measuring robustness. It is measuring the current state of the search.

Last week, the TeamPCP supply chain campaign demonstrated that trusted components compose into unauthorized behavior. Claudini demonstrates that trusted research tools compose into capabilities exceeding their individual components. The dynamic cuts both ways: composition creates both the attack surface and the research capability to map it.

What Changes

Autoresearch is not a future capability. It is a current methodology producing current results. The implications:

For safety researchers: Your defense must survive an autoresearch run to be credible. Fixed-budget automated optimization against your model is the new minimum bar. If your safety alignment breaks after 96 agent-driven iterations, it was never robust.

For ML engineers: The autoresearch loop applies to any optimization problem with dense quantitative feedback. Model architecture search, hyperparameter optimization, data augmentation strategies, loss function design. Any domain where you currently iterate manually is a candidate for agent-driven iteration.

For security teams: Adversarial capability scales with compute and iteration speed, not attacker expertise. An organization running autoresearch loops against its own defenses has a structural advantage over one waiting for external pen tests. Red-team with agents before attackers do.

For engineering leaders: The question is not whether to use autoresearch. It is whether your competitors are already using it. The paper’s code is Apache-licensed and publicly available.2 The barrier to entry is a Claude Code subscription and GPU access.


FAQ

What is autoresearch?

Autoresearch is a methodology where an LLM agent autonomously iterates on research code: proposing hypotheses, implementing experiments, evaluating results, and using the outcome to inform the next iteration. The term originates from Andrej Karpathy’s demonstration that Claude Code can autonomously improve ML training code.3

Did Claude invent new algorithms?

Not in the conventional sense. The paper notes that Claude “primarily recombined existing ideas rather than discovering fundamentally novel algorithms.” The winning methods merged mechanisms from different published methods (MAC + TAO, ADC + gradient scaling) and tuned hyperparameters in configurations humans had not explored. The novelty is in the combinations, not the components.1

How does the compute budget work?

Each experiment runs under a fixed FLOPs budget. This prevents the agent from winning by simply running longer or using more resources. Suffix length is also fixed (30 tokens for safeguard attacks, 15 for random targets). Every improvement must come from algorithmic innovation within these constraints.1
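Enforcing a FLOPs budget amounts to a simple guard around the optimizer loop. The sketch below assumes a known per-step FLOPs estimate, which is my simplification; real accounting would measure it from the model and batch sizes:

```python
# Illustrative budget guard: stop an experiment once estimated FLOPs
# are spent, regardless of wall time. flops_per_step is assumed known.

def run_under_budget(step_fn, flops_per_step: float, budget: float) -> int:
    """Run optimizer steps until the FLOPs budget is exhausted.
    Returns the number of steps taken."""
    spent, steps = 0.0, 0
    while spent + flops_per_step <= budget:
        step_fn()
        spent += flops_per_step
        steps += 1
    return steps
```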

What is the attack success rate?

ASR measures the fraction of prompts for which the optimized adversarial suffix causes the target model to generate the exact target output (measured by greedy decoding). Claudini achieved 40% ASR against GPT-OSS-Safeguard-20B on CBRN queries (vs 10% for all baselines) and 100% ASR against Meta-SecAlign-70B on prompt injection (vs 56% for the best baseline).1

Can I run this myself?

Yes. The code is Apache-licensed at https://github.com/romovpa/claudini. You need Python 3.12+, the uv package manager, Claude Code, and GPU access for evaluation. The autoresearch loop runs via Claude Code’s /loop command. Run with fixed compute budgets, persistent logging, and isolated evaluation environments. An unconstrained autoresearch loop without oversight is how you get reward hacking, not research.2

Does this make AI safety harder?

Both harder and easier. Harder because adversarial attacks improve faster when agents can iterate autonomously. Easier because the same methodology applies to defense: any safety technique that survives an autoresearch run has demonstrated genuine robustness. The paper argues that autoresearch-level adversarial pressure should be the minimum bar for credible safety claims.1

How does this relate to your infrastructure?

I run an autoresearch loop (the Ralph pattern) based on Karpathy’s methodology. The orchestrator runs fixed-budget experiments on Apple Silicon via MLX, keeping improvements and discarding regressions. Claudini validates this pattern at GPU scale with adversarial attack optimization.4


Sources


  1. Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, and Maksym Andriushchenko, “Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs,” arXiv:2603.24511, March 25, 2026. 33 baselines, 96 iterations on safeguard model, 82 on random targets. 40% ASR on CBRN queries vs 10% baseline, 100% ASR on Meta-SecAlign-70B vs 56% baseline. 

  2. romovpa, “claudini,” GitHub, March 2026. Apache 2.0. All discovered attacks, baseline implementations, evaluation code, and autoresearch loop prompt. 

  3. Andrej Karpathy, “autoresearch,” GitHub, 2026. Original demonstration that Claude Code can autonomously improve ML training code. 

  4. Blake Crosley, “What I Told NIST About AI Agent Security,” blakecrosley.com, February 2026. Production evidence from 60+ daily autonomous agent sessions, including autoresearch loop infrastructure. 

  5. Anthropic, “Claude Code: Agentic coding tool,” Anthropic Documentation, 2026. The /loop command runs a prompt on a recurring interval, enabling autonomous iteration cycles. 

  6. Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson, “Universal and Transferable Adversarial Attacks on Aligned Language Models,” arXiv:2307.15043, July 2023. The GCG (Greedy Coordinate Gradient) attack that Claudini used as a starting point. 
