The OODA Loop for Prompt Engineering: What Five Failures Taught Me
Colonel John Boyd’s OODA loop was developed for fighter pilot decision-making in the 1960s, where the pilot who completed the observe-orient-decide-act cycle faster gained a decisive advantage, regardless of aircraft capabilities. I discovered that the same principle applies to prompt engineering, but only after five expensive failures taught me that writing the prompt is the least important step.1
TL;DR
The OODA loop (Observe, Orient, Decide, Act) provides a systematic framework for prompt engineering that prevents the most common failure mode: acting (writing a prompt) before observing (understanding the full context). After building 44 skills — each a structured prompt with auto-activation logic — I’ve learned that the prompt itself accounts for roughly 20% of the outcome. The other 80% is observation (what context does the model need?), orientation (what type of task is this?), and decision (what prompt pattern fits that task type?). The result: prompts that succeed on the first attempt rather than requiring iterative refinement.
The Prompt That Failed Five Times
Before I learned to observe before acting, I wrote prompts like a developer writing code: jump straight to the solution.
Failure 1: The blog evaluator. My first attempt at a blog quality evaluation prompt: “Evaluate this blog post and give it a score from 1-10.” The model returned a vague paragraph with “7/10” and no actionable feedback. I iterated four times before realizing the problem was not the prompt wording — the problem was that I had not defined what “quality” meant.
The OODA fix: I spent 30 minutes observing my own evaluation process. I identified six specific dimensions I cared about: reader value (25%), technical accuracy (20%), educational quality (20%), writing quality (15%), factual integrity (15%), and SEO effectiveness (5%). The weighted rubric became the blog-evaluator skill, and every evaluation since then has produced consistent, actionable scores.2
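To make the rubric concrete, here is a minimal sketch of the weighted scoring it implies. The dimension names and weights come from the rubric above; the code itself is illustrative, not the skill’s actual implementation.

```python
# Illustrative sketch of a weighted rubric score. Dimensions and weights
# match the blog-evaluator rubric described above; the scoring function
# itself is an assumption, not the skill's real implementation.

RUBRIC = {
    "reader_value": 0.25,
    "technical_accuracy": 0.20,
    "educational_quality": 0.20,
    "writing_quality": 0.15,
    "factual_integrity": 0.15,
    "seo_effectiveness": 0.05,
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (0-10) into one weighted score."""
    missing = RUBRIC.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return round(sum(scores[dim] * weight for dim, weight in RUBRIC.items()), 2)

if __name__ == "__main__":
    print(weighted_score({
        "reader_value": 8, "technical_accuracy": 7, "educational_quality": 9,
        "writing_quality": 6, "factual_integrity": 8, "seo_effectiveness": 5,
    }))  # 7.55
```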
Failure 2: The code reviewer. My first review prompt: “Review this code for bugs and security issues.” The model returned 15 findings, 12 of which were stylistic nitpicks. The signal-to-noise ratio made the review useless.
The OODA fix: I oriented the task as three separate sub-tasks (correctness, security, conventions) and built three dedicated reviewer subagents, each with restricted tool access and specific evaluation criteria. The correctness reviewer only flags logic errors. The security reviewer only flags OWASP vulnerabilities. The conventions reviewer only flags pattern deviations. Noise dropped to near zero because each prompt is narrowly scoped to one dimension.3
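The division is easier to see as data. The sketch below mirrors the correctness/security/conventions split described above, though the prompt wording and the dictionary structure are illustrative rather than the actual subagent definitions.

```python
# Illustrative: three narrowly scoped review prompts, one dimension each.
# The correctness / security / conventions split follows the text above;
# the wording and structure are assumptions, not the real subagents.

REVIEWERS = {
    "correctness": (
        "You review code for logic errors only. Ignore style, naming, and "
        "security. Report only findings that change program behavior."
    ),
    "security": (
        "You are a senior security engineer reviewing code for OWASP Top 10 "
        "vulnerabilities only. Ignore style issues and logic nitpicks."
    ),
    "conventions": (
        "You check code against the project's established patterns. Report "
        "only deviations from documented conventions."
    ),
}

def build_review_prompts(code: str) -> dict[str, str]:
    """Produce one narrowly scoped prompt per review dimension."""
    return {name: f"{scope}\n\nCode to review:\n{code}"
            for name, scope in REVIEWERS.items()}
```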
Failure 3: The translation prompt. “Translate this blog post to Korean.” The model translated but lost all markdown formatting, stripped footnote references, and rewrote technical terms that should have remained in English.
The OODA fix: I observed what “translate” actually meant for my use case: preserve markdown structure, preserve footnote numbering, keep code blocks untranslated, keep proper nouns in English, adapt idioms rather than transliterating. The constraint list became longer than the translation instruction. Each constraint eliminated a failure mode that “translate this” would have produced.4
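A sketch of what the constraint-driven prompt looks like once assembled: the constraints are the ones listed above, while the exact wording and the helper function are illustrative.

```python
# Illustrative: the translation constraints from the text, assembled into an
# explicit prompt. Each line closes off one failure mode that a bare
# "translate this" would produce. The wording is an assumption.

TRANSLATION_CONSTRAINTS = [
    "Preserve all markdown structure (headings, lists, tables).",
    "Preserve footnote numbering and references exactly.",
    "Leave code blocks untranslated.",
    "Keep proper nouns and established technical terms in English.",
    "Adapt idioms naturally rather than transliterating them.",
]

def build_translation_prompt(post: str, target_language: str = "Korean") -> str:
    constraints = "\n".join(f"- {c}" for c in TRANSLATION_CONSTRAINTS)
    return (
        f"Translate the blog post below to {target_language}.\n\n"
        f"Constraints:\n{constraints}\n\n"
        f"Post:\n{post}"
    )
```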
The OODA Loop Applied to Prompting
Phase 1: Observe
Before writing a single word of the prompt, observe the problem space:
What is the actual task? Not the surface request, but the underlying need. “Summarize this document” might actually mean “extract the three decisions made in this meeting so I can follow up on action items.”
What does the model need to know? Enumerate the context required for a correct response. Missing context produces hallucination. Excessive context wastes tokens and can distract the model.
What does the output look like? Define the format, length, tone, and structure of the desired output before writing the prompt. Vague output expectations produce vague outputs.5
Observation checklist:

- [ ] Actual task (not surface request) identified
- [ ] Required context enumerated
- [ ] Output format defined
- [ ] Success criteria specified
- [ ] Failure modes anticipated
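The same checklist can be captured as a record that must be complete before any prompt gets written. The field names mirror the checklist; the class itself is a sketch, not part of my tooling.

```python
# Sketch: the observation checklist as a record that has to be filled in
# before the prompt is written. Field names mirror the checklist above.
from dataclasses import dataclass, field

@dataclass
class Observation:
    actual_task: str                 # the underlying need, not the surface request
    required_context: list[str]      # what the model must be given
    output_format: str               # format, length, tone, structure
    success_criteria: list[str]      # how a good response is recognized
    failure_modes: list[str] = field(default_factory=list)  # anticipated ways it goes wrong

    def is_complete(self) -> bool:
        return all([self.actual_task, self.required_context,
                    self.output_format, self.success_criteria])
```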
Phase 2: Orient
Orient the task within the model’s capability space:
Task type classification. Is the task extraction (pulling information from provided text), generation (creating new content), transformation (converting between formats), analysis (evaluating and reasoning about content), or classification (categorizing input)?6
Each task type has established prompt patterns. My 44 skills reflect these patterns: the blog-evaluator skill uses analysis patterns (weighted rubric, structured scoring). The blog-writer-core skill uses generation patterns (style rules, constraint lists, example structures). The citation-verifier skill uses extraction patterns (pull claims, match against sources).
Complexity assessment. Can the task be completed in a single prompt, or does the task require decomposition? A rule of thumb: if the task requires more than three distinct cognitive operations, decompose.
My deliberation system takes decomposition further: when confidence is low (score below 0.70), the system spawns multiple agents to explore the problem independently, then ranks their responses by quality. Single-prompt complexity thresholds vary, but I decompose any task that mixes research, analysis, and generation.
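Written down, the decomposition heuristics are small enough to fit in a few lines. The 0.70 confidence threshold and the three-operation rule of thumb are the ones described above; the functions themselves are an illustrative sketch.

```python
# Sketch of the decomposition heuristics described above. The 0.70 threshold
# and the "more than three cognitive operations" rule come from the text;
# the function boundaries are illustrative.

def should_decompose(cognitive_operations: int,
                     mixes_research_analysis_generation: bool) -> bool:
    """Rule of thumb: decompose past three distinct cognitive operations,
    or whenever research, analysis, and generation are mixed."""
    return cognitive_operations > 3 or mixes_research_analysis_generation

def should_deliberate(confidence: float, threshold: float = 0.70) -> bool:
    """Low confidence triggers the multi-agent deliberation path."""
    return confidence < threshold
```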
Constraint mapping. What constraints apply? Token limits, output format requirements, factual accuracy needs, tone requirements, audience considerations. Each constraint becomes an explicit instruction in the prompt.
Phase 3: Decide
Based on the observation and orientation, decide on the prompt architecture:
Prompt pattern selection:
| Task Type | Recommended Pattern | My Real Example |
|---|---|---|
| Extraction | Schema-guided extraction | Citation verifier: extract claims, match footnotes |
| Generation | Constraint list + examples | Blog writer: 14 mandatory style rules, tone guide |
| Transformation | Input-output pairs + preserve list | i18n translator: preserve markdown, code, footnotes |
| Analysis | Weighted rubric + structured output | Blog evaluator: 6 categories, weighted scoring |
| Classification | Labeled examples + decision tree | Content depth checker: 5 originality signals, score 0-5 |
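The table is effectively a lookup, and I treat it as one. A sketch of that lookup follows; the task types and patterns are lifted from the table, everything else is illustrative.

```python
# Sketch: the decision table above as a lookup from task type to prompt pattern.
from enum import Enum

class TaskType(Enum):
    EXTRACTION = "extraction"
    GENERATION = "generation"
    TRANSFORMATION = "transformation"
    ANALYSIS = "analysis"
    CLASSIFICATION = "classification"

RECOMMENDED_PATTERNS = {
    TaskType.EXTRACTION: "schema-guided extraction",
    TaskType.GENERATION: "constraint list + examples",
    TaskType.TRANSFORMATION: "input-output pairs + preserve list",
    TaskType.ANALYSIS: "weighted rubric + structured output",
    TaskType.CLASSIFICATION: "labeled examples + decision tree",
}

def recommended_pattern(task_type: TaskType) -> str:
    return RECOMMENDED_PATTERNS[task_type]
```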
Role assignment. Roles work when the task benefits from a particular perspective. My security-reviewer subagent receives the role “senior security engineer reviewing code for OWASP Top 10 vulnerabilities” — the role focuses the output on security concerns. Roles fail when the role contradicts the task (“You are a creative writer” for a factual analysis task).7
Phase 4: Act
Write the prompt using the decisions from Phase 3. The prompt follows a consistent structure:
```
[ROLE]        (if applicable)
[CONTEXT]     (the information the model needs)
[TASK]        (the specific instruction)
[FORMAT]      (the expected output structure)
[CONSTRAINTS] (restrictions and requirements)
[EXAMPLES]    (if using few-shot)
```
The structure is not a template to fill mechanically. The structure is a checklist: have the observation, orientation, and decision phases produced enough clarity to write each section? If any section is unclear, return to the appropriate earlier phase.8
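Once each section is clear, the assembly itself is mechanical. Here is a minimal sketch of that assembly; the section order matches the structure above, and the function is illustrative rather than a template engine I actually ship.

```python
# Sketch: assembling a prompt from the sections above. Empty sections are
# skipped; an empty section is also a signal to revisit an earlier phase.

def assemble_prompt(task: str, context: str = "", role: str = "",
                    output_format: str = "",
                    constraints: list[str] | None = None,
                    examples: list[str] | None = None) -> str:
    sections: list[str] = []
    if role:
        sections.append(role)
    if context:
        sections.append(f"Context:\n{context}")
    sections.append(f"Task:\n{task}")
    if output_format:
        sections.append(f"Output format:\n{output_format}")
    if constraints:
        sections.append("Constraints:\n" + "\n".join(f"- {c}" for c in constraints))
    if examples:
        sections.append("Examples:\n" + "\n\n".join(examples))
    return "\n\n".join(sections)
```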
My Prompt Library: 44 Skills as Structured Prompts
My Claude Code skills system is essentially a prompt library organized by task type. Each skill follows the OODA structure:
```markdown
---
description: FastAPI backend development patterns and conventions
allowed-tools: [Read, Grep, Glob, Edit, Write, Bash]
---

# FastAPI Development Expertise

## Project Structure
[CONTEXT: expected file layout, naming conventions]

## Route Patterns
[CONSTRAINTS: response format, error handling, dependency injection]

## Database Patterns
[CONSTRAINTS: SQLAlchemy 2.0+ async, Pydantic v2 models]
```
The skill description handles observation (when should the skill activate?). The allowed-tools field handles orientation (what capabilities does the task need?). The body handles decision and action (what patterns should the model follow?).9
The blog-writer-core skill encodes 14 mandatory style rules — constraints I discovered through failures:
- Active voice only (“Engineers installed” not “was installed by”)
- No “this” as subject (always specify the referent)
- Every claim cited with a footnote
- Code blocks tagged with language identifiers
- No em dashes (use commas or semicolons)
Each rule exists because I published a post that violated it. Rule #1 came from the blog-quality-gate hook catching 7 passive voice sentences. Rule #3 came from publishing an uncited claim about a McKinsey statistic. The OODA observation phase identified the failure; the constraint in the prompt prevents recurrence.10
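Some of those rules are mechanically checkable, which is what makes a deterministic quality gate possible. The sketch below shows the idea for three of them; the rules come from the list above, while the regexes are rough illustrative approximations (real passive-voice detection needs more than a regex), not the hook’s actual implementation.

```python
# Sketch: deterministic checks for three of the mechanically checkable rules.
# The rules come from the list above; the regexes are rough approximations,
# not the blog-quality-gate hook's actual implementation.
import re

CHECKS = {
    "no em dashes": re.compile("\u2014"),
    "no 'this' as sentence subject": re.compile(r"(?:^|[.!?]\s+)This\s+(?:is|was|means)\b"),
    "passive voice (rough heuristic)": re.compile(r"\b(?:was|were|is|are|been|being)\s+\w+ed\b"),
}

def lint(text: str) -> list[str]:
    """Return the names of rules the text appears to violate."""
    return [name for name, pattern in CHECKS.items() if pattern.search(text)]

if __name__ == "__main__":
    sample = "The package was installed by the engineers. This is fine, mostly."
    print(lint(sample))  # flags passive voice and 'this' as sentence subject
```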
The Iteration Loop
The OODA loop is inherently iterative. After acting (sending the prompt) and observing the result:
- Observe the output: What is correct? What is wrong? What is missing?
- Orient the gap: Is the problem context (missing information), format (wrong structure), or capability (task too complex for single prompt)?
- Decide the fix: Add context, adjust format instructions, or decompose the task.
- Act with the revised prompt.
Each iteration cycle should change exactly one variable. Changing multiple prompt elements simultaneously makes it impossible to identify which change produced which effect.11
My blog evaluation workflow follows the full iteration loop:
1. Lint (deterministic) → fix structural issues
2. Evaluate (LLM) → score on 6 dimensions
3. Critique (LLM) → identify specific improvements
4. Fix (LLM) → apply surgical improvements
5. Re-evaluate → verify score improved
Each step uses a different prompt optimized for its task type. The lint step uses extraction (find violations). The evaluate step uses analysis (score against rubric). The critique step uses generation (produce improvement suggestions). The fix step uses transformation (apply changes preserving structure). The chain produces better results than a single monolithic “improve this post” prompt.12
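The chain is easy to express as a sketch: each step consumes the previous step’s output, and only the lint step is deterministic. Step names and order come from the list above; the function signatures are assumptions.

```python
# Sketch: the five-step chain above with each LLM-backed step stubbed as a
# callable. Step names and order follow the text; signatures are assumptions.
from typing import Callable

def run_pipeline(draft: str,
                 lint: Callable[[str], str],
                 evaluate: Callable[[str], dict],
                 critique: Callable[[str, dict], list[str]],
                 fix: Callable[[str, list[str]], str]) -> tuple[str, dict]:
    """Lint -> evaluate -> critique -> fix -> re-evaluate."""
    draft = lint(draft)                    # deterministic: fix structural issues
    scores = evaluate(draft)               # analysis: score on 6 dimensions
    suggestions = critique(draft, scores)  # generation: specific improvements
    revised = fix(draft, suggestions)      # transformation: surgical edits
    final_scores = evaluate(revised)       # re-evaluate: verify the score improved
    return revised, final_scores
```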
Key Takeaways
For engineers building AI features:

- Apply the full OODA cycle before writing prompts; 5 minutes of observation and orientation saves 30 minutes of iterative prompt refinement
- Classify the task type (extraction, generation, transformation, analysis, classification) before selecting a prompt pattern; each type has established patterns that outperform generic prompting
- Build a prompt library organized by task type; my 44 skills represent validated prompt patterns that I reuse across projects
For product teams using AI daily:

- When an AI output disappoints, diagnose whether the failure is in observation (wrong task identified), orientation (wrong approach), decision (wrong prompt pattern), or action (wrong prompt wording); the fix differs for each phase
- Constraints prevent more failures than clever prompt wording; my blog writer’s 14 mandatory rules produce more consistent quality than any amount of “please write well”
References
1. Boyd, John R., “Destruction and Creation,” unpublished paper, 1976.
2. Author’s evaluator skill. 6-category weighted rubric developed through iterative prompt failure. Located at ~/.claude/skills/.
3. Author’s reviewer subagent architecture. Three specialized reviewers (correctness, security, conventions) with restricted tool access and narrow evaluation criteria.
4. Author’s i18n translation system. Constraint-driven translation preserving markdown structure, footnotes, code blocks, and proper nouns across 6 languages.
5. Anthropic, “Prompt Engineering Guide,” 2025.
6. Wei, Jason et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” NeurIPS 2022.
7. Shanahan, Murray et al., “Role Play with Large Language Models,” Nature, 623, 493-498, 2023.
8. Anthropic, “Prompt Engineering Guide,” 2025. Prompt structure best practices.
9. Author’s Claude Code skills system. 44 skills functioning as a structured prompt library with OODA-aligned structure.
10. Author’s writer-core skill. 14 mandatory style rules, each derived from a published quality failure.
11. Zamfirescu-Pereira, J.D. et al., “Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts,” CHI 2023.
12. Author’s quality pipeline. 5-step evaluate-fix-reevaluate loop using task-specific prompts at each stage.