The Steve Test: Does the Work Deserve to Exist?

Earlier this week I shipped one piece of work where the right move was refusing to ship. A translation pipeline writes this site’s content into Cloudflare D1 for ten locales. Three translation jobs hit a rate limit at the same time; the fallback mechanism silently wrote the English source into D1 as the “translation” for six of those locales, then updated the checkpoint hashes to match the English content. On disk it looked like success. The pipeline reported “complete.” The metric said ten locales shipped. Every automated check passed.

A German reader landing on /de/guides/claude-code would have hit a paragraph of German, a paragraph of English, another paragraph of German, and a section header in English. No marker explained the transitions. The page looked deliberate and made the entire site feel untrustworthy: the kind of product that would probably also fail at whatever else it was for. The Jiro test (is the work correct?) passed at every layer I had instrumented. And yet the thing was not worthy. It looked like a product and behaved like an insult.

The right answer was to stop, audit every English-fallback batch, surgically clear the checkpoint hashes, run the translation pipeline one job at a time, verify locale-by-locale, commit, push. Roughly six hours of wall-clock translation time and one guardrail patch to prevent recurrence. The artifact on disk was “working” the whole time. The artifact in production was damage I had caused.

The shape of the failure is the same whether the artifact is a translation pipeline, a signup flow, a feature toggle, a CSV export, or a marketing page. Every automated check passes. The thing that lands in front of a real user is damage.

The governing question in product is not Is it done? or Does it work? The governing question is Does the work deserve to exist? Every artifact has to answer it before it ships, alongside a separate test for correctness. Correctness without worthiness produces inventory: things that function and ship and fail to earn belief. Worthiness without correctness produces theater: things that feel right and break. Both tests must pass.

The shorthand I use is the Steve Test. It sits next to the Jiro Test for rigor and evidence. They are two different questions applied by the same mind, and I do not ship work that fails either.

TL;DR

The Steve Test is the second test every artifact must pass. Jiro asks whether the work is correct; Steve asks whether the work deserves to exist as part of the whole you are building. The test is concrete, not abstract. It measures against a real user at a real moment, and its most consistent output is refusal. I cut scope. I delete features. What remains has to carry my name. Both tests must pass. If Jiro fails you stop and fix. If Steve fails you rebuild. Three honest rebuilds is the cap; after that, the problem is upstream in the framing.

Why “Is it done?” Is the Wrong Question

Most product teams optimize for shippable. The measurable outputs are commits, deploys, velocity charts, the presence of something in production. The failure mode is predictable: teams produce a steady stream of artifacts that function correctly and spend user belief quietly. The onboarding completes, the second session never happens. Support tickets cluster around tasks the product claims to handle. The churn curve decays toward zero instead of flattening into a core.

Done and worthy are different measurements. A feature can be done and fail to deserve its existence. A page can render and insult the reader. A translation can be technically present and be effectively a lie. The test for whether something is done is whether it executes the specified behavior. The test for whether something is worthy is whether a real user, at a real moment, is better off because you shipped it.

My argument in Minimum Worthy Product is that MVP culture, in practice, collapsed two questions into one: learn fast degraded to ship fast, and shipping became the metric that mattered.⁷ The cure is the dual standard. Minimum cuts scope. Worthy holds the remaining surface to a bar the user can feel. The Steve Test is the tool you use when you’re deciding whether the remaining surface has cleared the bar.

The Governing Question

Does the work deserve to exist?

I use the question literally. When I finish a piece of work and have to decide whether to ship it, I ask the question out loud. The answer is a verdict, not a vibe. If the answer is yes, the work is eligible to ship and the next question is whether it also passes the Jiro test for correctness. If the answer is no, the work needs a rebuild or a deletion. If the answer is not yet, but it will be within a defined corrective move, keep working.

The question only functions if the answer is allowed to be no. A Steve Test that rubber-stamps whatever is in front of it is not a test. The signal that the test is actually running is that some work fails it.

The User Pole: Worthiness Is Never Abstract

The Steve Test fails the moment it becomes a taste argument about the work in isolation. Measure worthiness against a specific user at a specific moment. The test question properly expressed is: Does the work deserve to exist for this user at this moment?

For blakecrosley.com, the user is a reader landing cold from search with an unanswered question about Claude Code or Codex or infrastructure. The moment is the first five seconds after the page loads, before the page has earned any trust. A page that loads fast, presents its point of view legibly, and tells the reader something their search did not already answer has cleared the bar. A page that punishes them with a slow bundle, a cookie banner they cannot dismiss, and a wall of generic SEO-shaped text has failed, no matter how well it scores on the Jiro axis.

For ResumeGeni, the user is a job seeker at a moment I treat as vulnerable. Most of the early waitlist arrived after a layoff or in the middle of an interview cycle.⁶ The worthy version refuses to treat them as a conversion metric. The copy does not hedge. The parser does not lie about what it found. The advice is concrete enough that they can act on it. The flow does not abandon them mid-process because the happy path was all I shipped.

The User Pole is the check that keeps the Steve Test from becoming a closet for my own preferences. If I cannot name the user, name their moment, and explain how the shipped surface respects and strengthens them, I have not applied the test. I have asserted it.

The Dual Test: Jiro Plus Steve

The full doctrine requires two tests, not one.¹ I do not ship anything that fails either.

The Jiro Test asks: Is this made correctly? It requires evidence. Edge cases handled. Invisible details finished. Claims cite concrete proof. No hedging: I believe is not evidence. Tests pass. No regressions. The Evidence Gate is the version of the Jiro Test I run against code reports and agent output, and the Jiro quality philosophy is the deeper writeup.

The Steve Test asks: Does the work deserve to exist? It requires worthiness. Coherence with the whole. A visible point of view. A named refusal or deletion that kept the surface clean. A delight or clarity mechanism the reader can identify, not hand-wave about. Fitness to carry your name.

Arbitration is simple. If Jiro fails, stop and fix. Incorrect work does not ship; the Jiro veto is absolute. If Jiro passes and Steve fails, rebuild. Correct but unworthy work does not ship either. If both fail, the problem is upstream, in the brief or the framing or the scope. Only work that passes both is eligible to ship.

Most shipping cultures quietly run only one of the two. Engineering-led cultures often run a strong Jiro and a weak Steve: the product ships when the tests pass, and worthiness becomes someone else’s department. Design-led cultures sometimes run a strong Steve and a weak Jiro: the product ships when it looks right, and correctness becomes an operational problem. Both failure modes produce recognizable artifacts. Beautiful bullshit. Correct but soulless work. You’ve seen them. You may have shipped them.

The two tests sit side by side:

Dimension	Jiro Test	Steve Test
Question asked	Is this made correctly?	Does the work deserve to exist?
Required evidence	Tests pass, edge cases handled, invisible details finished, claims cite proof	User named at a real moment, refusal or deletion stated, whole-widget coherence, signature willingness
Failure mode	Brittle, broken, or silently wrong	Correct but soulless; inventory that functions and spends user belief
Veto rule	Absolute. Incorrect work does not ship.	Rebuild, up to three honest attempts, then escalate upstream.
What “pass” produces	“Verification ran; here is the output.”	“Here is what I refused, here is the point of view, here is why it deserves its users.”

The Steve Test does not judge artifacts in isolation. It judges them as part of the total experience the user encounters.

The translation incident I opened with is the cleanest recent example, and the specific failure mechanism is worth spelling out because it shows the shape of the trap. The pipeline’s fallback mechanism wrote the English source into D1 when the Claude subprocess hit a rate limit. The D1 writer accepted those bytes because they were bytes. The checkpoint updater hashed the stored content and wrote that hash to disk. Because the stored “translation” was identical to the English source, the hashes matched exactly: identical bytes producing identical hashes. The next --update pass compared the English source hash against the stored hash, found them equal, and marked the batch unchanged. Every component passed its own Jiro test in isolation. The composition is where the defect lived. A user saw six locales of mixed-language prose for hours before a human opened one of the pages.

“We own the whole widget” is the rule. Product behavior, UX flows, language, data truth, support systems, packaging, documentation accuracy, prompts, rules, memory, skills, hooks, scripts, orchestration, output structure. The whole of it, not just the artifact you most recently shipped. No local win is acceptable if it degrades the integrity of the whole. The Steve Test fires when a surface passes locally and the total user experience does not.

Deletion: A Steve Layer That Only Adds Is False

The single most useful thing the Steve Test produces is a deletion.

A review that finishes without removing anything has not actually run. The act of asking Does the work deserve to exist? against a surface with real complexity produces, almost always, at least one piece of scope, copy, feature, or affordance that cannot defend its presence. A review that finds zero such pieces is usually a review conducted in performance of review, not in practice of it.

What the Steve Test looks for, concretely:

Surfaces that are there because they felt safe to include, not because they earn their place.
Copy that hedges because removing the hedge would require a real claim.
Features carried forward from a previous version that no longer serve the current promise.
Affordances added to handle edge cases that do not actually occur at the current scope.
Configuration options that offload a decision the maker should have made.
Documentation that describes behavior the product no longer does.
Tests that cover code that should be deleted.

Deletion is the act that distinguishes taste from accumulation. The worthy surface is smaller than the version you would have shipped if nobody had asked the question. I have never regretted removing something from a shipped product. I have regularly regretted what I left in.

Refusal as the Primary Act

Related to deletion, and stronger: the Steve Test often fires before anything is built, not after. The primary act of taste is refusal: the decision to not make the thing at all.

The Steve Jobs line that matters here is “I’m as proud of what we don’t do as I am of what we do,” reported in a 2006 BusinessWeek cover story on Apple’s product discipline.⁵ People quote the line often because it is true in a specific technical way. Every shipped product is surrounded by an invisible halo of products you refused to ship. That halo is the actual point of view. It is the evidence that a human made the decision rather than the backlog.

For blakecrosley.com, the stack is a record of what I pruned. I considered React and refused it. I considered Tailwind and refused it. A bundler sat on the table for the first two weeks of the rebuild before I cut it. The absence of node_modules/ anywhere in the repository is not a design choice; it is a refusal I hold over time against the gravitational pull of default tooling. Those refusals shape the work more than any inclusion does. They define what I was willing to hold to standard.

For ResumeGeni, the validation came back clean. Three hundred forty email signups, twelve direct inquiries, three unsolicited offers to pay for early access.⁶ The backlog that demand surfaced was large: templates, team collaboration, analytics dashboards, LinkedIn integration, Indeed integration, version history, multiple export formats, a mobile app. I refused all of it on the first shipped version. What I could not refuse: accurate parsing, honest gap assessment, concrete rewrites that fit the job description, an export that opens cleanly in Word, a flow that makes a vulnerable user feel safe. The refused things are not gone forever. They are waiting on the far side of the worthy first surface.

A surface that refuses nothing is a surface with no point of view. If the team refused nothing, the scope is wrong.

Call Bullshit Early, On Yourself First

Refusal and deletion only work if the test can actually name false success when it appears. The Steve Test has to call bullshit, and call it quickly, on:

Fake progress. Motion that looks like progress and produces no thing.
Contaminated evidence. A test that passes for the wrong reason; stats that prove what you wanted proven rather than what was true.
Vanity counting. Commit counts, shipped-artifact counts, velocity charts: all present, all beside the point.
Incoherent systems that ship cleanly in isolation and degrade each other in combination. The translation incident above is one of these.
“Everything is on track” reports that paper over a decision nobody made. The Steve Test is the enemy of these.
Low-value machine activity. Work a computer does because it can, not because the output earns its place.

The hardest and most consistently useful version of the rule is: call bullshit on your own output first. The translation incident in the opening of this essay is exactly that. The pipeline reported success. The logs were clean. Every metric I had instrumented passed. The work was bullshit, and I had to call it on myself before the users did. Anti-bullshit on your own work is the discipline that keeps the test honest. Politeness is not a shield against the truth; busyness is not a substitute for value.

Surface Gradient: Calibrating the Pass

The Steve Test is not a single pass with a single threshold. The harder a surface is to take back, the harder the pass must be.²

Surface	Reversibility	Steve pass required
A reply in a chat	Trivially revisable	Soft
A memory write	Sticky within a context	Moderate
A commit on a feature branch	Costly to unwind	Firm
A merge to main	Harder to reverse	Hard
A public deploy	Read by strangers	Strict
A marketing claim	Quoted back at you	Strictest

The test is the same question. The stringency of the answer required scales with the blast radius. You can correct a chat reply the next turn; nobody dies. A deploy to production that fails the Steve Test burns user belief that took months to earn, and the correction costs more than the original shipping decision saved.

The practical consequence is that shipping cadence should slow as surfaces get harder to reverse, not speed up. Teams that ship at uniform velocity across surfaces of very different reversibility are telegraphing that they do not think the test varies. Usually they have stopped running it.

Taste Is Not Temperament

There is one more distinction the Steve Test requires, because it is the one that gets lost most often. Invoking Steve means invoking his taste, not his temperament.

The doctrine explicitly prohibits the following:³

Cruelty.
Humiliation.
Theatrical contempt.
Visionary cosplay.
Aggression as a substitute for judgment.
Grudge theater.
Drama mistaken for standards.

The signal that the Steve layer is actually operating is calm refusal. “No, the work is not yet worthy,” delivered as a verdict and followed by the specific corrective move. Not performance. Not humiliation of the person who built the unworthy version, often yourself. Not visible contempt for the team. Not broadcasting that you have higher standards. The work either clears the bar or it doesn’t. The bar is not the aesthetic of severity.

The distinction matters because the caricature version of Steve Jobs centers on the temperament. Anyone who has been managed by someone performing Steve knows what I mean. The cruelty does not make the work better. It makes the workplace worse and, because it substitutes theater for judgment, it also makes the shipped work worse.

The operative test for whether you are in Steve’s taste layer or in Steve’s temperament cosplay is whether the output of the test is a specific technical move. “The work does not deserve to exist” is not a conclusion; it is the opening of a question. The answer has to name the axis that failed, the corrective move, and the next test. If the review ends at “the work is bad” without naming what would make it worthy, the review was performance.

The Rebuild Cap and the Anti-Paralysis Clause

A high standard without a stop rule becomes avoidance. The discipline I apply to every piece of non-trivial work has a rebuild cap of three honest attempts.

An honest attempt means you identified the failed axis, named the specific corrective move, changed the approach materially, and re-evaluated the work against both tests. Three repetitions of the same polish pass do not count as three attempts. The repetitions count as one failed attempt repeated three times.

After three honest rebuilds that fail to produce worthy work, the problem is almost never the craft. The problem lives upstream, in the framing, the scope, the brief, or the team composition. Stop rebuilding the surface and look at the premise. Sometimes the promise was too big for the scope you can realistically hold to standard. Sometimes the validation was softer than you thought. Sometimes the problem is not a product problem at all.

The rebuild cap solves two opposite failure modes at once. It refuses to bless weak work, and it stops refinement from becoming hiding. Refusing to ship is not inherently virtuous. Endless rebuilding in pursuit of perfection is its own failure mode: craft without courage. The target is not perfection. The target is worthy and shipped. Not pure and pending forever.

If you are on the fourth rebuild of the same surface, you have stopped making a product and started using the project as a place to hide.

Observable Tells: Was the Test Actually Run?

The Steve Test is easy to claim and difficult to actually run. The discipline is stating, concretely, what the test produced.

Before I call a piece of non-trivial work complete, I make sure I can answer all of the following:⁴

Who is the user? Not a persona archetype. A real person in a real situation.
What point of view does this surface carry? If you cannot name one, there is no point of view, only a backlog.
What did you refuse or remove to keep this clean? The deletion is the evidence that the test ran.
How does this serve the whole widget? The piece has to cohere with every other piece. Local wins that degrade the whole are not wins.
What evidence proves it is sound? The Jiro axis. Verification ran; the claim is supported.
Why is it worthy? Stated plainly. If the answer is vague, the test did not run.
Would I sign my name to this without flinching? The taste oracle. When judgment is uncertain, the governing question in the stack.

If I cannot answer all seven with specifics, I asserted the doctrine rather than applying it. I send the work back.

Worked example: the seven tells against a real surface

Here are the tells applied to one specific surface shipped after the translation incident: a concurrency guard I wrote into the translation pipeline. Roughly 100 lines of Python that refuse to start guide_bootstrap.py or blog_translate_batch.py if another translation process is already running.

Who is the user? Me at the start of a translation run two weeks from now, when I will have forgotten the exact rate-limit failure mode that burned six locales. Also any agent that invokes either script in a future session.
What point of view does the surface carry? Concurrent translation runs are never the right tool. The guard encodes that verdict in code so the next caller does not have to remember it.
What did I refuse or remove to keep this clean? I refused to make the guard a warning. I refused to give it a short, convenient override flag like --force. The only override is an environment variable, I18N_ALLOW_CONCURRENT=1, long enough that nobody types it by accident.
How does it serve the whole widget? The translation pipeline, the D1 writer, and the fallback mechanism are individually correct. The guard is the piece that keeps them from combining into a whole that silently corrupts.
What evidence proves it is sound? I verified the guard with two manual checks. One: clean return when no other translation job runs. Two: non-zero exit when a real guide_bootstrap.py subprocess is alive. Both checks ran against the actual scripts, not mocks.
Why is it worthy? The same concurrent-run race that corrupted six locales cannot produce mixed-language D1 rows while the guard is in place. It is not prevention in all cases; it is prevention for the specific failure mode the doctrine just learned.
Would I sign my name to this? Yes. One job, clean override semantics, documented in a memory note so a future session inherits the reason the guard exists.

The point of the worked example is that each tell has a specific answer. When I cannot produce one, the test did not run yet.

Contrast that with what failure looks like. When I drafted the concurrency guard, my first attempt at tell 3 was: “I refused to over-engineer it.” The sentence is the kind of answer that sounds fine and says nothing. Refusing to over-engineer names no specific thing I considered and rejected. It is a posture, not a refusal. I forced myself to write the real answer: I refused to make the guard a warning; I refused a convenient override flag name; I refused a behavior where the guard would fail open if it could not list processes. Those are decisions. The first version was doctrine theater.

Key Takeaways

For founders and solo builders: - Name the real user at the real moment before you call any surface worthy. Abstract worthiness is unusable. - Every shipped surface should have at least one visible refusal. If you removed nothing, the scope is wrong. - Apply the surface gradient. Deploy-to-production requires a harder pass than a draft; marketing claims require the hardest pass of all.

For product leaders and PMs: - Measure whether the Steve Test is actually running by counting refusals and deletions per release cycle. Zero is a smell. - Separate “works” from “deserves to exist” in your own review checklists. Treat them as independent axes. - Protect the rebuild budget. Three honest attempts, then escalate. Do not let perfectionism become hiding and do not let deadline pressure kill the test.

For engineering leads: - Name a Jiro gate and a Steve gate for every surface your team ships. Both must pass. - Invest in the invisible details. The difference between correct and worthy usually lives in the seams. - Refuse the temperament. Calm refusal is the signal. Performance of severity is the failure mode.

For designers: - Point of view is not decoration. It is the mechanism that makes the product recognizable. - A worthy surface refuses things, visibly. Your job includes naming what you cut. - The operative test in ambiguity: would you sign your name to the decision without flinching?

Close: Would I Sign My Name to This?

The governing question in product is Does the work deserve to exist? The governing question when judgment is uncertain is the simplest one in the stack.

Would I sign my name to this without flinching?

If the answer is yes, the work is eligible to ship. If the answer is “not yet, but it will be within three honest rebuilds,” keep working. If the answer is no, and stays no after three honest attempts, rebuild the brief, not the surface.

I run this test every time. On the code. On the copy. On the commit message. On the docs. On the product surface. On this essay.

This essay cut three sections I thought I needed when I started drafting: a long biographical section on Jobs, a primer on the “dent in the universe” line, and a defense of why the doctrine uses a real person’s name rather than an abstraction. None of them served the user I was writing for: a builder trying to decide whether to ship a surface they are not sure about. The cuts made the piece smaller and more honest. And the translation failure that opens the essay produced one permanent artifact beyond the fix: a concurrency guard in the translation pipeline that now refuses to start if another translation job is already running. Rejected work produced a change to the rules. The doctrine learned.

Steve passes. Jiro passes. It ships.

FAQ

What is the Steve Test?

The Steve Test asks whether a piece of work deserves to exist for a real user at a real moment. It sits next to the Jiro quality philosophy: Jiro checks correctness, evidence, and edge cases; Steve checks worthiness, coherence, refusal, and taste.

How is the Steve Test different from the Jiro Test?

Jiro asks whether the work is made correctly. Steve asks whether the correct work should ship at all. A feature can pass tests and still fail the user, the product, or the point of view behind the surface.

When should a team run the Steve Test?

Run it before irreversible surfaces ship: public deploys, marketing claims, onboarding flows, pricing pages, docs, and product features that will shape user trust. Lightweight work can use a softer pass, but public surfaces need a strict one.

What should the test produce?

The test should produce a specific decision: ship, rebuild, delete, or refuse. A real pass names the user, the moment, the point of view, the evidence, and at least one thing the team removed to preserve the whole.

Can the Steve Test become perfectionism?

Yes. That is why the doctrine needs a rebuild cap. Three honest rebuilds are enough; after that, the problem usually lives upstream in the brief, scope, team, or premise.

The Review Template

Copy this into a scratchpad, a PR description, a Notion page, or the top of your commit message. Fill it in before you declare non-trivial work done. If you cannot answer a row with a specific thing, the test has not run yet.

Steve Test: Review Record

User:          [real person in a real situation, not a persona archetype]
Moment:        [the specific moment the user encounters the work]
Point of view: [what this surface asserts; what it refuses to do]
Refusal:       [what I considered and cut, or declined to build at all]
Whole-widget:  [how this coheres with every adjacent piece of the product]
Evidence:      [Jiro axis: verification ran; specific proof the claim is supported]
Worthy verdict:[yes / no / not yet within N defined rebuilds]
Signature:     [would I sign my name to this without flinching? if no, stop]

The template has exactly one job. Force the tester to produce a specific answer on every row. Vague answers do not clear the bar. “I refused to over-engineer it” is not a refusal; “I refused a convenient override flag and a fail-open path” is. “It serves the user” is not a whole-widget answer; “it closes the specific compositional gap that caused the last incident” is. When a row resists specificity, the work is not ready; that resistance is the test telling you where the rebuild goes.

This template is the operational artifact of the essay. Everything else on this page is explanation for why the rows exist.

References

The dual-test arbitration (Jiro Test + Steve Test) is the operating doctrine I run on every project. The Jiro side lives in Why My AI Agent Has a Quality Philosophy and the Evidence Gate. MWP introduces the dual test in shipping context; this essay is the Steve-specific treatment. The broader taste-as-judgment case lives in Taste Is a Technical System. ↩
The surface gradient (deploys and marketing claims require a harder pass than drafts) is a specific calibration rule. It answers the question how strict should the test be here? with how hard is this to take back? Harder to reverse equals stricter pass. The rule is operational doctrine, not philosophy; I use it to decide how long I hold work before declaring it shipped. ↩
The exclusion list is operating doctrine, not a historical claim about causation. I prohibit the caricature behaviors (public humiliation, contempt as a management tool, drama mistaken for standards) as a practice rule, regardless of whether any individual leader paired them with good taste. See Walter Isaacson’s Steve Jobs (Simon & Schuster, 2011) for the temperament record. Isaacson’s own distillation of what he argues is worth emulating sits in The Real Leadership Lessons of Steve Jobs (Harvard Business Review, April 2012). ↩
The seven observable tells come from my canonical operating doctrine. The User Pole constraint behind tell #1 draws on published UX research: Jakob Nielsen’s Jakob’s Law of Internet User Experience and Don Norman’s The Design of Everyday Things (Basic Books, 2013), chapter 3, for how mental models form and why the gap between designer and user models drives most product failures. The remaining tells (refusal, whole-widget service, evidence, worthiness, signature willingness) are doctrine, not research claims. ↩
The quote “I’m as proud of what we don’t do as I am of what we do” is most commonly traced to Peter Burrows, Ronald Grover, and Heather Green, “Steve Jobs’ Magic Kingdom,” BusinessWeek, February 6, 2006 (the cover story on Apple’s product discipline). The original URL on businessweek.com is not reliably accessible post-Bloomberg acquisition; a stable secondary reproduction with attribution is The Quotations Page entry. Cite the primary only when you have direct archive access to the article. ↩
Author’s ResumeGeni waitlist and response records, April 2026. The 340-signup, 12-inquiry, and 3 early-access payment offer counts also appear in Minimum Worthy Product and the Startup Validation Stack post, drawn from the same raw data. The “moment I treat as vulnerable” framing is my own categorization of the user context, based on intake survey responses; it is not a generalized claim about all job seekers everywhere. ↩↩
My argument about MVP practice, not Ries’s original MVP concept. See Minimum Worthy Product for the full case, which cites Eric Ries’s The Lean Startup (Crown Business, 2011) as the primary source for the MVP-as-learning-instrument framing and argues that the degradation into “permission to ship weak work” is cultural, not textual. ↩