AI Theater: Why 90% of Companies 'Use AI' But Only 23% Create Value
McKinsey’s 2025 Global AI Survey found that 90% of organizations report using AI in some capacity, yet only 23% deploy AI agents at production scale. The remaining 67% perform AI theater: visible investment without measurable outcomes.1
I’ve witnessed three flavors of AI theater across my career and practiced one myself.
TL;DR
AI theater describes organizational behavior where companies invest visibly in AI (hiring AI teams, announcing AI initiatives, running AI pilots) without creating measurable business value. After 12 years in product design leadership at ZipRecruiter and a year building AI agent infrastructure independently, I’ve seen both sides: organizations performing AI theater and my own early work that bordered on it. The gap between AI adoption and AI value creation has three root causes: misaligned incentives that reward activity over outcomes, technical debt that prevents AI systems from accessing production data, and organizational structures that isolate AI teams from business decision-makers.
The Adoption-Value Gap
McKinsey surveyed 1,400 executives across industries. The headline finding: AI usage has reached near-ubiquity. The buried finding: value creation has not kept pace.2
| Metric | Percentage |
|---|---|
| Organizations “using AI” | 90% |
| Organizations with AI in production | ~33% |
| Organizations scaling AI agents | 23% |
| Organizations stuck in pilot | 67% |
| Organizations reporting significant ROI from AI | ~15% |
The gap between “using” and “creating value” is not a maturity curve that all companies will naturally traverse. The majority of companies stuck in pilot share structural characteristics that prevent progression without deliberate organizational change.3
Three Flavors I’ve Witnessed
Flavor 1: The Announcement Game
At a company I advised informally, the product team announced an “AI-powered search” feature that amounted to passing user queries through a foundation model API with no fine-tuning, no evaluation framework, and no metrics beyond “we launched it.” The press release generated coverage. The feature generated a 2% usage rate and was quietly deprecated six months later.
The diagnostic question: does the AI feature have usage metrics, retention rates, and customer satisfaction scores? Or does the team only track “we shipped an AI feature”?4
Flavor 2: The Pilot Factory
A mid-size company I know through my professional network ran 12 AI proofs-of-concept across departments in 2024. Each pilot had a dedicated team, a specific use case, and a 90-day timeline. One pilot reached production. The other 11 produced impressive demos that executives showed at board meetings. The organization lacked the infrastructure (MLOps, data pipelines, monitoring) required to operate AI systems at scale.
The diagnostic question: how many of the organization’s AI pilots from 2024 now run in production without manual intervention?5
Flavor 3: The Hire-and-Hope Strategy
A former colleague joined a company as “Head of AI,” expecting to transform operations. The AI team built impressive demos that wowed executives but couldn’t access production databases, customer-facing systems, or business metric dashboards. Every data request required a ticket to the data engineering team, with a 2-3 week turnaround. After 18 months, the team pivoted to building internal chatbots.6
The diagnostic question: does the AI team have direct access to production databases, customer-facing systems, and business metric dashboards? Or does every data request require a ticket to another team?
My Own AI Theater Moment
I’ll be honest: my early Claude Code hook system had elements of AI theater. I built 25 hooks in the first month. Many were impressive demos: context injection, philosophy enforcement, design principle validation. But I hadn’t measured whether they improved code quality, reduced bugs, or saved time. I was optimizing for the feeling of sophistication rather than measurable outcomes.
The turning point was building the blog quality linter. Unlike the earlier hooks, the linter had measurable criteria: citation accuracy, meta description length, code block language tags, footnote integrity. I could count findings before and after. I could measure false positive rates. The linter moved from “AI-powered” to “measurably valuable” because I defined success criteria before building.
My anti-theater checklist now:

1. Define the metric before building. “What number changes if this works?” If I can’t answer, I’m building theater.
2. Measure the baseline. How does the current process perform without AI? My blog posts had an average of 4.2 linter findings before the automated system. After: 0.3.
3. Track ongoing value. My 95 hooks run on every session. The recursion-guard has blocked 23 runaway spawn attempts. The git-safety-guardian has intercepted 8 force-push attempts. Those are real numbers.7
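To make items 1 and 2 concrete, here is a minimal sketch of what a measurable linter rule looks like, assuming a hypothetical `posts/` directory of markdown files; the rule names and thresholds are illustrative, not my actual hook code. The point is that every rule returns countable findings, so “findings per post” exists as a number before any automation ships.

```python
import re
from pathlib import Path

# Hypothetical rules mirroring the linter criteria described above.
# Each rule returns a list of findings so "findings per post" is countable.

def check_meta_description(text: str) -> list[str]:
    """Flag meta descriptions outside an assumed 120-160 character window."""
    match = re.search(r"^description:\s*(.+)$", text, re.MULTILINE)
    if not match:
        return ["missing meta description"]
    length = len(match.group(1).strip())
    if not 120 <= length <= 160:
        return [f"meta description is {length} chars (want 120-160)"]
    return []

def check_code_fence_languages(text: str) -> list[str]:
    """Flag fenced code blocks that omit a language tag."""
    fences = re.findall(r"^```(\w*)\s*$", text, re.MULTILINE)
    # Fences alternate open/close in well-formed markdown; [::2] keeps the openers.
    bare = sum(1 for lang in fences[::2] if not lang)
    return [f"{bare} code block(s) missing a language tag"] if bare else []

def findings_per_post(post_dir: Path) -> float:
    """The anti-theater metric: average linter findings per post."""
    posts = list(post_dir.glob("*.md"))
    total = 0
    for post in posts:
        text = post.read_text(encoding="utf-8")
        total += len(check_meta_description(text))
        total += len(check_code_fence_languages(text))
    return total / len(posts) if posts else 0.0

if __name__ == "__main__":
    print(f"findings per post: {findings_per_post(Path('posts')):.1f}")
```

Running the same script before and after wiring the rules into a pre-publish gate gives the baseline and the current state, which is exactly the 4.2-to-0.3 comparison above.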
Root Causes
Misaligned Incentives
Most organizations reward AI teams for activity (pilots launched, models trained, features announced) rather than outcomes (revenue generated, costs reduced, decisions improved). Activity metrics are easier to measure and report.8
The incentive misalignment cascades. AI teams optimize for launching impressive pilots because launches get celebrated. Production operations get ignored because maintenance is invisible.
Technical Debt Blocks Data Access
AI systems require access to production data. Production data lives in systems built before AI was a strategic priority. The data infrastructure investment typically costs 3-5x the model development cost. Organizations that budget for “AI” without budgeting for “data infrastructure that enables AI” consistently under-deliver.9
Organizational Isolation
AI teams positioned as “innovation teams” or “centers of excellence” operate outside the product development process. Companies that successfully scale AI embed AI engineers within product teams, following the same model that proved effective for embedded designers and embedded analysts. The organizational pattern matters more than the technology.10
What Actually Works
Start with the Decision, Not the Model
Organizations that create AI value start by identifying a specific business decision that AI could improve. The decision-first approach constrains the AI system to a measurable outcome: quantify current decision quality, measure AI-assisted quality, calculate the difference.11
My blog linter follows this pattern. The decision: “Which blog posts meet quality standards for publishing?” The metric: linter findings per post. The baseline: 4.2 findings per post without the linter. The current state: 0.3 findings per post with the linter and automated pre-publish gate.
Invest in Data Infrastructure First
The organizations that scale AI beyond pilots invest in data infrastructure before model development:
- Data pipelines that continuously deliver clean production data
- Feature stores that maintain consistent feature definitions
- Monitoring systems that detect model degradation (see the sketch after this list)
- Governance frameworks that track data lineage12
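Of those four, monitoring is the piece most often skipped, so here is a minimal sketch of one way to detect model degradation: compare the distribution of recent model scores against a reference window with a population stability index (PSI) and alert past a threshold. The 10-bucket layout and 0.2 alert level are common rules of thumb, not figures from the survey or from a specific production system.

```python
import math
from typing import Sequence

def psi(reference: Sequence[float], recent: Sequence[float], buckets: int = 10) -> float:
    """Population Stability Index between two score distributions.

    Rule of thumb (an assumption, tune per model): < 0.1 stable,
    0.1-0.2 drifting, > 0.2 investigate before trusting the model.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / buckets or 1.0

    def proportions(scores: Sequence[float]) -> list[float]:
        counts = [0] * buckets
        for s in scores:
            idx = int((s - lo) / width)
            counts[min(max(idx, 0), buckets - 1)] += 1
        # Small floor avoids log(0) when a bucket is empty.
        return [max(c / len(scores), 1e-6) for c in counts]

    ref_p, rec_p = proportions(reference), proportions(recent)
    return sum((r - q) * math.log(r / q) for q, r in zip(ref_p, rec_p))

def check_degradation(reference: Sequence[float], recent: Sequence[float]) -> None:
    score = psi(reference, recent)
    if score > 0.2:
        # In production this would page someone or open a ticket, not just
        # print: the point is that silent drift becomes an event someone owns.
        print(f"ALERT: score distribution shifted (PSI={score:.3f})")
    else:
        print(f"ok (PSI={score:.3f})")
```

Scheduled against a frozen reference window and yesterday’s scores, a check like this is the difference between a model that quietly rots and a model someone is accountable for.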
Embed AI in Product Teams
AI engineers who sit within product teams share the team’s goals, understand the team’s constraints, and see the team’s data daily. Google’s most successful internal AI applications (spam detection, ad ranking, search quality) were built by AI engineers embedded within the product teams responsible for those systems.13
The Agent Frontier
The McKinsey report highlights AI agents as the next inflection point. Among organizations already creating value from AI, 62% are experimenting with agents. Among organizations still in pilot mode, only 8% are working with agents.14
Agents compound the challenges of AI theater. An agent that autonomously takes actions requires higher confidence in model output, stronger monitoring, and clearer governance. My deliberation system addresses this with task-adaptive consensus thresholds (85% for security decisions, 50% for documentation) and spawn budget enforcement. Organizations that cannot successfully deploy a recommendation model will not successfully deploy an autonomous agent.
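For illustration, here is a minimal sketch of how task-adaptive consensus thresholds and a spawn budget might be wired together, using the 85% and 50% figures from the paragraph above; the names, default threshold, and budget size are hypothetical, not the actual deliberation system.

```python
from dataclasses import dataclass

# Consensus required before an agent's proposed action is executed.
# The 85% / 50% figures come from the text; other values are illustrative.
CONSENSUS_THRESHOLDS = {
    "security": 0.85,
    "documentation": 0.50,
}
DEFAULT_THRESHOLD = 0.70  # assumption for task types not listed above

@dataclass
class SpawnBudget:
    """Hard cap on how many sub-agents a task may spawn."""
    limit: int = 5
    used: int = 0

    def try_spawn(self) -> bool:
        if self.used >= self.limit:
            return False  # budget exhausted: block the runaway spawn
        self.used += 1
        return True

def approve(task_type: str, votes: list[bool], budget: SpawnBudget) -> bool:
    """Approve an agent action only if enough reviewers agree and the
    task still has spawn budget left for the follow-up work."""
    threshold = CONSENSUS_THRESHOLDS.get(task_type, DEFAULT_THRESHOLD)
    agreement = sum(votes) / len(votes) if votes else 0.0
    return agreement >= threshold and budget.try_spawn()

budget = SpawnBudget(limit=3)
print(approve("security", [True, True, True, False], budget))  # 0.75 < 0.85 -> False
print(approve("documentation", [True, False], budget))         # 0.50 >= 0.50 -> True
```

The short-circuit matters: if consensus fails, no spawn budget is consumed, so a blocked action never eats into the cap for legitimate follow-up work.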
Key Takeaways
For executives:

- Audit AI initiatives for outcome metrics (revenue, cost, decision quality) rather than activity metrics; if the team reports activity without outcomes, the organization is performing AI theater
- Budget 3-5x model development cost for data infrastructure; the infrastructure is the prerequisite for every AI production system

For AI/ML leaders:

- Embed AI engineers within product teams rather than building centralized AI teams; organizational proximity to production systems determines scaling success
- Kill pilots that cannot articulate a path to production within 90 days; a pilot without a production plan is a demo

For individual practitioners:

- Define measurable success criteria before building any AI feature; “what number changes?” is the anti-theater question
- Track ongoing value, not launch metrics; my git-safety-guardian has intercepted 8 force-push attempts, and that number matters more than “we deployed a safety hook”
References
1. McKinsey & Company, “The State of AI in 2025,” McKinsey Global AI Survey, 2025.
2. McKinsey & Company, “Superagency in the Workplace: Empowering People to Unlock AI’s Full Potential,” McKinsey Global Institute, 2025.
3. Davenport, Thomas & Ronanki, Rajeev, “Artificial Intelligence for the Real World,” Harvard Business Review, January-February 2018.
4. Nagle, Tadhg et al., “Only 8% of Companies That Do AI Are Scaling It,” MIT Sloan Management Review, 2020.
5. Sculley, D. et al., “Hidden Technical Debt in Machine Learning Systems,” NeurIPS 2015.
6. Fountaine, Tim et al., “Building the AI-Powered Organization,” Harvard Business Review, July-August 2019.
7. Author’s Claude Code infrastructure metrics: 95 hooks, git-safety-guardian interception count, recursion-guard spawn blocking count. Tracked in ~/.claude/state/.
8. Brynjolfsson, Erik & McAfee, Andrew, “The Business of Artificial Intelligence,” Harvard Business Review, 2017.
9. Sambasivan, Nithya et al., “‘Everyone Wants to Do the Model Work, Not the Data Work’: Data Cascades in High-Stakes AI,” CHI 2021.
10. Iansiti, Marco & Lakhani, Karim R., Competing in the Age of AI, Harvard Business Review Press, 2020.
11. Agrawal, Ajay et al., Prediction Machines, Harvard Business Review Press, 2018.
12. Polyzotis, Neoklis et al., “Data Lifecycle Challenges in Production Machine Learning,” SIGMOD 2018, ACM.
13. Sculley, D. et al., “Machine Learning: The High-Interest Credit Card of Technical Debt,” NeurIPS 2014. Originally published as Google internal research on ML production readiness.
14. McKinsey & Company, “Agents for Enterprise: The Next Frontier,” McKinsey Digital Report, 2025.