Evaluations: XCTest for Model Quality (macOS 27)

A normal unit test asserts that add(2, 2) returns 4, and if it does not, the build is red. An AI feature breaks that contract on the first line, because the same prompt can produce a different sentence every run, and “different” is not “wrong.” You cannot write #expect(summary == "the expected summary") against a model, because there is no single expected string. What you can measure is whether the output is good enough, often enough, against criteria you define. Apple’s new Evaluations framework gives you that measurement harness, with type-safe Swift APIs that run as part of your development workflow [1]. It is macOS 27.0 only: a developer tool you run at test time on your Mac, not a runtime feature you ship inside an app [1].

The shape will feel familiar if you have written tests. You define a dataset, generate model responses, apply metrics, and aggregate results, then read which approach performed best and where individual responses fell short [1]. The mental model is XCTest for model quality: same loop (arrange a case, run it, assert something), different assertion (a graded metric instead of equality).
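To make "good enough, often enough" concrete, here is a minimal sketch of that loop in plain Swift. It does not use the Evaluations framework at all: SketchCase, fakeSummarize, and passRate are hypothetical names I made up, and the "model" is a stub that rotates through canned answers to imitate run-to-run variation.

```swift
import Foundation

// Hypothetical stand-ins: none of these names come from the Evaluations framework.
struct SketchCase {
    let prompt: String
    let requiredFact: String
}

// Stand-in for a model call. A real model is non-deterministic;
// this stub varies its wording by run index to imitate that.
func fakeSummarize(_ testCase: SketchCase, run: Int) -> String {
    let templates = [
        "In short: \(testCase.requiredFact).",
        "Summary: \(testCase.requiredFact), per the source.",
        "The text covers several topics."  // a weak answer that omits the fact
    ]
    return templates[run % templates.count]
}

// Graded check: the fraction of runs whose output contains the required fact.
func passRate(for testCase: SketchCase, runs: Int) -> Double {
    var passes = 0
    for run in 0..<runs {
        if fakeSummarize(testCase, run: run).contains(testCase.requiredFact) {
            passes += 1
        }
    }
    return Double(passes) / Double(runs)
}

let sketchCase = SketchCase(prompt: "Summarize the release notes",
                            requiredFact: "the API is macOS only")
let rate = passRate(for: sketchCase, runs: 6)

// The assertion is a threshold on a rate, not equality on a string.
print(rate >= 0.6 ? "good enough" : "needs work")
```

The threshold (0.6 here) is an arbitrary choice for the sketch; in practice you pick it per feature and per criterion.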

In session 298, Apple frames the core problem this way: generative-AI features break a contract fundamental to software testing, because the same input can produce different outputs, which makes unit tests insufficient [25].

TL;DR / Key Takeaways

  • The Evaluations framework is macOS 27.0 only, a developer-tooling layer for measuring intelligence-powered features, run as part of your testing workflow on a Mac [1].
  • Evaluator is a closure-based evaluator you write inline; the Evaluation protocol is the type you implement to run a system under test against a dataset and apply evaluators [2][3].
  • Metric carries a named result through factory methods (passing, failing, scoring, ignore), and ScoreDimension names a graded axis for a model-as-judge evaluator [4][5].
  • ModelSample is a general-purpose evaluation sample; SampleGenerator is an actor that generates samples from a language model as an async stream; results land in a DataFrame with typed columns including a responseColumn [6][7][8].
  • ToolCallEvaluator verifies agentic tool calls against a TrajectoryExpectation, with ArgumentMatcher defining how each argument is validated [9][10][11].
  • EvaluationTrait runs an evaluation inside a test and records the result as attachments, which is the bridge into a Swift Testing run [12].

The post walks the API surface, explains why each piece exists, and ties it back to the Foundation Models tool-calling work it grades. Where I show a call whose exact signature Apple has not published, I mark it as illustrative and tell you to confirm against the docs.

The Evaluator Model

The framework’s core question is “how good was this output,” and it answers with a small vocabulary: an evaluator decides, a metric records, a score dimension grades, and a result aggregates.

The Evaluation protocol is the type you implement to define an evaluation, which runs your system under test against a dataset and applies evaluators to measure performance [3]. The declaration is minimal:

// macOS 27.0 beta
protocol Evaluation : Sendable

The protocol is Sendable because evaluation is concurrent work: many samples, run in parallel, the same way Swift Testing runs test cases in parallel by default [13]. You implement the protocol when you want a reusable, named evaluation. When you want something quick, you reach for Evaluator instead, which Apple describes as a closure-based evaluator for inline use without defining a custom type [2]:

// macOS 27.0 beta
struct Evaluator<Input> where Input : SampleProtocol,
    Input.ExpectedValue : Decodable,
    Input.ExpectedValue : Encodable,
    Input.ExpectedValue : Sendable

Apple’s discussion states the closure receives the input sample and the response, with access to both .value and .transcript [2]. The .value is the model’s typed output; the .transcript is the record of how it got there. An inline evaluator is a one-line #expect block for model quality: no custom type, just the judgment as a closure. The following call shape is illustrative; confirm the initializer against Apple’s docs:

import Evaluations

// Illustrative call shape — confirm against Apple's docs.
let nonEmpty = Evaluator<ModelSample<String>> { input, response in
    response.value.isEmpty ? .failing("empty output") : .passing()
}

What the closure returns is a Metric. Apple describes Metric as a named metric that carries a result value, and names the factory methods explicitly: passing, failing, scoring, and ignore each return a new Metric with the result stored inside [4]:

// macOS 27.0 beta
struct Metric

The four factories map onto the kinds of judgment an AI feature needs. A summarizer either includes the required fact or it does not, so it gets passing or failing. A tone-grading rubric runs from 1 to 5, so it gets scoring. A sample you want out of the aggregate (malformed input, a known-bad fixture) gets ignore, which keeps the row in the dataset without polluting the statistics. The range from passing to graded scoring is what the framework promises: simple pass or fail checks through detailed scoring with model-as-judge patterns [1].
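The effect of the four result kinds on an aggregate is easier to see in code. The sketch below is plain Swift and does not touch the real Metric type: SketchOutcome, SketchSummary, and summarize are hypothetical names, and how the framework itself treats ignored rows in its statistics is something to confirm against Apple's docs.

```swift
// Hypothetical stand-in for the four kinds of result; not the framework's Metric type.
enum SketchOutcome {
    case passing
    case failing(reason: String)
    case scoring(Double)  // for example, a rubric score from 1 to 5
    case ignored
}

struct SketchSummary {
    var passed = 0
    var failed = 0
    var scores: [Double] = []
    var ignored = 0

    // Pass rate over the rows that produced a pass or fail verdict.
    // Ignored and scored rows do not enter the denominator.
    var passRate: Double? {
        let judged = passed + failed
        return judged == 0 ? nil : Double(passed) / Double(judged)
    }
}

func summarize(_ outcomes: [SketchOutcome]) -> SketchSummary {
    var summary = SketchSummary()
    for outcome in outcomes {
        switch outcome {
        case .passing:
            summary.passed += 1
        case .failing:
            summary.failed += 1
        case .scoring(let value):
            summary.scores.append(value)
        case .ignored:
            summary.ignored += 1  // still counted as a row, but kept out of the rate
        }
    }
    return summary
}

let outcomeSummary = summarize([
    .passing,
    .failing(reason: "missing required fact"),
    .passing,
    .ignored,
    .scoring(4)
])
```

With the ignored row excluded, the pass rate is two out of three judged rows rather than two out of five total rows, which is the "keeps the row without polluting the statistics" behavior described above.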

The graded end of that range is where ScoreDimension earns its place. Apple defines it as a named scoring dimension for a model judge evaluator, where each dimension defines a name (used as the DataFrame column), an optional description, and a definition of what each score means [5]:

// macOS 27.0 beta
struct ScoreDimension

A single output can be good on one axis and bad on another. A drafted email can be factually correct and tonally wrong. ScoreDimension lets you score those axes separately (correctness, tone, concision) so the aggregate tells you which dimension dropped, not only that overall quality dropped. The dimension name becomes a column, which means the scores land in a structured table you can sort and compare, not a wall of prose.
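A small sketch shows why separate axes matter when you compare two prompt versions. This is plain Swift with hypothetical names (SketchScores, meanByDimension) and made-up scores; it imitates the idea of one column per dimension with a dictionary keyed by dimension name, and it is not the ScoreDimension API.

```swift
// Hypothetical stand-in: one judged response, scored 1...5 on named axes.
typealias SketchScores = [String: Double]

// Average each dimension across all judged responses.
func meanByDimension(_ rows: [SketchScores]) -> [String: Double] {
    var sums: [String: Double] = [:]
    var counts: [String: Int] = [:]
    for row in rows {
        for (dimension, score) in row {
            sums[dimension, default: 0] += score
            counts[dimension, default: 0] += 1
        }
    }
    var means: [String: Double] = [:]
    for (dimension, sum) in sums {
        means[dimension] = sum / Double(counts[dimension] ?? 1)
    }
    return means
}

// Made-up scores for two prompt versions.
let oldPrompt: [SketchScores] = [
    ["correctness": 5, "tone": 4],
    ["correctness": 4, "tone": 4]
]
let newPrompt: [SketchScores] = [
    ["correctness": 5, "tone": 2],
    ["correctness": 4, "tone": 3]
]

let before = meanByDimension(oldPrompt)
let after = meanByDimension(newPrompt)

// Report only the dimensions whose mean dropped between versions.
let dropped = before.keys.filter { (after[$0] ?? 0) < (before[$0] ?? 0) }.sorted()
print(dropped)
```

A single blended score would show a modest decline; the per-dimension view shows that correctness held steady and tone is what regressed.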

That table is an EvaluationResult. Apple describes it as the results of running a model evaluation, a structure that contains the summary and detailed results from an evaluation run [14]:

// macOS 27.0 beta
struct EvaluationResult

The two-level shape (summary plus detail) is what makes evals actionable. The summary answers “did this prompt do better than the last one.” The detail answers “which samples fell short,” so you can open the worst rows and read what the model produced [1].

Samples and Generation

An evaluation needs cases to run against, and the framework’s unit of a case is a sample. ModelSample is the general-purpose one. Apple describes it as a general-purpose language model evaluation sample that accepts string-based prompts and instructions [6]:

// macOS 27.0 beta
struct ModelSample<ExpectedValue> where ExpectedValue : Decodable,
    ExpectedValue : Encodable,
    ExpectedValue : Sendable

The generic ExpectedValue is the typed expectation: a string for a free-text task, a structured Codable type for one with a known answer. The Codable plus Sendable constraints match Evaluator’s Input.ExpectedValue, because the expectation has to serialize into the result table and cross concurrency boundaries during a parallel run. For multimodal prompts, Apple notes you create a custom conformance or use an initializer with a prebuilt prompt [6].

Hand-writing every sample does not scale, which is why the framework provides SampleGenerator, an actor that generates evaluation samples using a language model [7]:

// macOS 27.0 beta
actor SampleGenerator<SampleType> where SampleType : ModelSampleProtocol

It is an actor because it owns mutable generation state (accepted and rejected samples) that async iteration mutates, and actor isolation is Swift’s guardrail against data races on that state. Apple’s workflow: create a generator, configure its properties, then call it to produce new samples as an async stream; after iteration, you access all generated samples, or any the validator rejected [7]. The rejected bucket is the detail worth pausing on. A generator that silently dropped bad samples would hide its own failure rate; exposing the rejects lets you audit the dataset the way you would audit hand-written fixtures.

Once samples run, the responses land in a DataFrame with typed column descriptors, so you read the table without stringly-typed lookups. Apple names the response column exactly: responseColumn, a typed column descriptor for the model responses in the detailed DataFrame [8]:

// macOS 27.0 beta
var responseColumn: ResultColumn<Self.Subject> { get }

The ResultColumn<Value> generic is what makes the column typed: a descriptor for a DataFrame column, parameterized by the value it holds [15]. Alongside responseColumn, the framework exposes an inputColumn for the input samples and an expectedColumn for the expected values, each a typed ResultColumn [16][17]. Typed columns are the same instinct as Swift Testing’s expression capture: instead of pulling a value out of an untyped bag and hoping the cast holds, you read it through a descriptor that knows its type. When you feed results into MetricsAggregator for mean, median, and standard deviation, the columns are how you address the data without guessing at keys [18].
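For readers who want the three summary statistics spelled out, here is a plain-Swift sketch over an array of scores. The function names are mine, not MetricsAggregator's, and the standard deviation below is the population form (dividing by n); which form the framework reports is an assumption to check against Apple's docs.

```swift
// Hypothetical helpers; these are not MetricsAggregator's API.

func mean(_ values: [Double]) -> Double? {
    values.isEmpty ? nil : values.reduce(0.0, +) / Double(values.count)
}

func median(_ values: [Double]) -> Double? {
    guard !values.isEmpty else { return nil }
    let sorted = values.sorted()
    let mid = sorted.count / 2
    // Even count: average the two middle values. Odd count: take the middle one.
    return sorted.count % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
}

// Population standard deviation: divides by n, not n - 1.
func standardDeviation(_ values: [Double]) -> Double? {
    guard let average = mean(values) else { return nil }
    let squaredDeviations = values.reduce(0.0) { partial, value in
        partial + (value - average) * (value - average)
    }
    return (squaredDeviations / Double(values.count)).squareRoot()
}

// Made-up scores standing in for one metric column.
let columnScores: [Double] = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(columnScores)!, median(columnScores)!, standardDeviation(columnScores)!)
```

Reading the three together is the point: a mean that holds steady while the standard deviation grows means the feature got less consistent, even if the average looks unchanged.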

Tool and Trajectory Checking

The hardest AI feature to test is the agentic one, because correctness is not a string, it is a sequence of actions. When a Foundation Models session calls OCRTool, then BarcodeReaderTool, then your own catalog lookup, the question is not “did the final sentence match” but “did the model take the right path” [19]. ToolCallEvaluator grades that path directly. Apple describes it as an evaluator that verifies agentic tool calls against an expected trajectory [9]:

// macOS 27.0 beta
struct ToolCallEvaluator<Input> where Input : ModelSampleProtocol,
    Input.Expectation == TrajectoryExpectation

The where clause is the load-bearing part: the input’s expectation must be a TrajectoryExpectation. Apple describes that type as the expected pattern of tool calls for an evaluation, specified across three axes [10]:

// macOS 27.0 beta
struct TrajectoryExpectation

Apple states ToolCallEvaluator supports ordered sequences, unordered expectations, disallowed tool checks, and group steps, and produces both a strict and partial result from a single evaluation pass [9]. Each maps to a real agent failure. Ordered sequences catch a model that calls the right tools in the wrong order. Unordered expectations say “these calls must happen, order aside.” Disallowed tool checks catch a model that reaches for a tool it should never touch (the safety case: an agent calling a destructive tool when it should have stayed read-only). Group steps express “any one of these,” for branches where more than one path is acceptable.

The strict-plus-partial pair fits how agentic quality actually degrades. A new prompt rarely takes a feature from perfect to broken; it takes it from “right trajectory every time” to “right trajectory most of the time, with one call out of order.” A strict-only result reports that as a flat failure and tells you nothing about how close the model got. The partial result quantifies the near-misses, which is the signal you tune against.
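One way to picture a strict and a partial result coming out of a single pass is the sketch below. It is my own plain-Swift approximation, not Apple's algorithm: the names are hypothetical, "partial" is defined here as the share of expected calls that appear in the right relative order (a longest-common-subsequence ratio), and "strict" as all expected calls in order with no disallowed tool used.

```swift
// Hypothetical stand-in; the framework's scoring rules may differ.
struct SketchTrajectoryResult {
    let strict: Bool     // every expected call, in order, and nothing disallowed
    let partial: Double  // fraction of expected calls matched in order
}

// Length of the longest common subsequence of two tool-name lists.
func lcsLength(_ a: [String], _ b: [String]) -> Int {
    if a.isEmpty || b.isEmpty { return 0 }
    var table = Array(repeating: Array(repeating: 0, count: b.count + 1),
                      count: a.count + 1)
    for i in 1...a.count {
        for j in 1...b.count {
            if a[i - 1] == b[j - 1] {
                table[i][j] = table[i - 1][j - 1] + 1
            } else {
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
            }
        }
    }
    return table[a.count][b.count]
}

func grade(actual: [String],
           expectedInOrder: [String],
           disallowed: Set<String>) -> SketchTrajectoryResult {
    let usedDisallowed = actual.contains { disallowed.contains($0) }
    let matched = lcsLength(expectedInOrder, actual)
    let partial = expectedInOrder.isEmpty
        ? 1.0
        : Double(matched) / Double(expectedInOrder.count)
    let strict = !usedDisallowed && matched == expectedInOrder.count
    return SketchTrajectoryResult(strict: strict, partial: partial)
}

let expectedPath = ["ocr", "barcode", "catalogLookup"]
let perfectRun = grade(actual: ["ocr", "barcode", "catalogLookup"],
                       expectedInOrder: expectedPath, disallowed: ["deleteRecord"])
let swappedRun = grade(actual: ["barcode", "ocr", "catalogLookup"],
                       expectedInOrder: expectedPath, disallowed: ["deleteRecord"])
let unsafeRun = grade(actual: ["ocr", "barcode", "deleteRecord", "catalogLookup"],
                      expectedInOrder: expectedPath, disallowed: ["deleteRecord"])
```

The swapped run fails strictly but still scores two of three in order, which is the near-miss signal. In this sketch a disallowed call fails only the strict result and leaves the partial score alone; that is a design choice of the sketch, not a documented behavior.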

Per-call correctness lives one level down, in the arguments. A model can call the right tool with the wrong values, and ArgumentMatcher defines how each argument is validated. Apple describes it as the values that define how to validate a tool-call argument [11]:

// macOS 27.0 beta
enum ArgumentMatcher

Apple’s discussion lists the validation rules: require exact values, verify key presence, check ranges, match patterns, or use a language model for semantic matching [11]. The semantic-matching case is the one a plain equality assertion cannot express. If a tool argument is a free-text query, two different strings can be equally correct, and a regular == would fail a perfectly good call. Delegating that argument to a model-as-judge match is the same escape hatch ScoreDimension provides at the output level, applied at the argument level. The following shows only the overall shape of wiring an expectation into the evaluator; it is illustrative, so confirm the initializers and case spellings against the docs:

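import Evaluations

// Illustrative: confirm shapes against Apple's docs.
// The where clause on ToolCallEvaluator requires the sample's expectation
// to be TrajectoryExpectation, so the sample is parameterized on it here.
// This assumes TrajectoryExpectation satisfies ModelSample's constraints.
let expectation = TrajectoryExpectation(/* ordered / unordered / disallowed steps */)
let evaluator = ToolCallEvaluator<ModelSample<TrajectoryExpectation>>(/* expectation */)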
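The five validation rules Apple lists for ArgumentMatcher can be sketched as a plain-Swift enum. Everything here is hypothetical: SketchArgumentMatcher and its case names are mine, arguments are simplified to optional strings, and the semantic case takes an ordinary closure where the real framework would consult a language model.

```swift
import Foundation

// Hypothetical stand-in; the real ArgumentMatcher's cases may be spelled differently.
enum SketchArgumentMatcher {
    case exact(String)
    case present                     // the key must exist, with any value
    case range(ClosedRange<Double>)  // a numeric value must fall inside the range
    case pattern(String)             // a regular expression the value must match
    case semantic((String) -> Bool)  // stand-in for a model-as-judge check
}

// A nil value means the argument was missing from the tool call.
func matches(_ matcher: SketchArgumentMatcher, value: String?) -> Bool {
    guard let value = value else { return false }
    switch matcher {
    case .exact(let expected):
        return value == expected
    case .present:
        return true
    case .range(let bounds):
        guard let number = Double(value) else { return false }
        return bounds.contains(number)
    case .pattern(let regex):
        return value.range(of: regex, options: .regularExpression) != nil
    case .semantic(let judge):
        return judge(value)
    }
}

// Two differently worded queries can both be acceptable under a semantic check,
// where an exact match would reject one of them.
let mentionsShoes = SketchArgumentMatcher.semantic { $0.lowercased().contains("shoes") }
let firstQueryOK = matches(mentionsShoes, value: "red running shoes")
let secondQueryOK = matches(mentionsShoes, value: "Shoes for running, in red")
```

The closure in the semantic case is where a judge model would sit; a keyword check is used here only so the sketch runs without one.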

The agentic loop here is the one the Foundation Models tool-calling post describes from the runtime side. There, GenerationOptions.ToolCallingMode governs how aggressively the on-device model reaches for tools, and the framework can shift the mode after the first call to bound a request’s tool activity [19]. ToolCallEvaluator is the measurement side of that same behavior: you set the calling posture at runtime, then you grade the trajectory at test time to confirm the posture produced the path you intended. The runtime knob and the test-time evaluator are two ends of one feature.

How This Fits a Workflow

Evals are not a separate ritual you run quarterly. They belong next to your tests, in the same loop, run on the same Mac. The framework’s bridge into that loop is EvaluationTrait. Apple describes it as a test trait that runs an evaluation and records the result as attachments [12]:

// macOS 27.0 beta
struct EvaluationTrait

The word “trait” is deliberate. Swift Testing’s whole configuration model is traits applied to @Test and @Suite declarations: .enabled(if:), .disabled(_:), .tags(...), and the rest [13]. EvaluationTrait slots an evaluation into that vocabulary, so an eval runs the way a test runs, under the same swift test invocation, with the same parallelism. Recording the result as attachments reuses the mechanism Swift Testing already has for custom attachments, so an eval’s detailed result rides along in the test report and CI artifacts you collect [13]. The framework also exposes an EvaluationContext so code inside the test scope can read the result after the evaluation completes [20].

That trait answers “where do evals live.” They live in your test target, beside your unit tests, gated by the same traits. A .tags(.evals) annotation lets you scope a run to only the model-quality checks after a prompt change, the way .tags(.regression) scopes a regression run [13]. Fast, unit-style tests stay green on every edit; the slower, model-driven evals run on the prompts and tool definitions that touch the model.

The eval-versus-test split mirrors the runtime-versus-tooling split from the agentic workflow post, which draws the line between the on-device model the app ships and the tooling model the developer runs to write the app [21]. Evaluations sits on the build side of that line: a macOS 27 developer tool, run during iteration, that measures whether the runtime feature you ship is good enough [1]. You do not ship the Evaluations framework to users any more than you ship XCTest. You ship the confidence it produces.

The framework also stays open on which model you grade. Apple notes it works with any model available to your code, and the dataset side is fed through a Loader, a protocol for types that supply a dataset, with built-in concrete types or a custom conformance for your own sources [1][22]. For model-as-judge grading, ModelJudgeEvaluator sends the query, response, and optional reference data to a judge model that returns scores for one or more dimensions [23], with the judge prompt configurable through ModelJudgePrompt, which bundles instructions, response presentation, and reference-data injection into one composable value [24]. Use Claude through your own code path for the judge if that is the model your stack already trusts; the framework does not lock you to one.

FAQ

What is the Evaluations framework, and what platform does it run on?

The Evaluations framework measures the quality of your app’s intelligence-powered features using type-safe Swift APIs that integrate into your development workflow [1]. It is macOS 27.0 only: a developer-tooling framework you run at test time on a Mac, not a runtime feature shipped inside an app [1]. You define datasets, generate model responses, apply metrics, and aggregate results, then read which approach performed best and where individual responses fell short [1].

Why can’t I use XCTest or Swift Testing equality assertions for AI features?

An equality assertion needs a single expected value, and a non-deterministic model does not have one: the same prompt can produce a different valid output each run. Evaluations replaces equality with graded judgment. A Metric records a passing, failing, scoring, or ignore result, and a ScoreDimension grades an output on a named axis [4][5]. You still run the evals inside a test, through EvaluationTrait, so they live in the same target and the same swift test loop as your other tests [12].

How does the framework evaluate agentic tool-calling behavior?

Through ToolCallEvaluator, which verifies agentic tool calls against an expected trajectory [9]. You describe the path as a TrajectoryExpectation across three axes; the evaluator supports ordered sequences, unordered expectations, disallowed tool checks, and group steps, producing both strict and partial results from one pass [9][10]. Per-argument validation uses ArgumentMatcher (exact values, key presence, ranges, patterns, or model-based semantic matching) [11].

What is the difference between an Evaluator and the Evaluation protocol?

Evaluator is a closure-based evaluator for inline use without defining a custom type; its closure receives the input sample and the response, with access to .value and .transcript [2]. The Evaluation protocol is the type you implement to define a reusable, named evaluation that runs your system under test against a dataset and applies evaluators [3]. Reach for Evaluator for a quick inline check, the Evaluation protocol for a structured, repeatable one.

Where do generated samples and results live?

SampleGenerator is an actor that produces samples from a language model as an async stream; after iteration you read the accepted samples or the ones the validator rejected [7]. Results land in a DataFrame with typed ResultColumn descriptors: responseColumn, inputColumn, and expectedColumn [8][16][17][15]. MetricsAggregator computes mean, median, and standard deviation over that data [18].

The full Apple Ecosystem cluster: the Foundation Models framework explainer; the runtime vs tooling LLM distinction; the iOS 27 tool-calling control this framework grades; and Swift Testing vs XCTest, whose trait model EvaluationTrait plugs into. The hub is at the Apple Ecosystem Series. For broader iOS-with-AI-agents context, see the iOS Agent Development guide.



  1. Apple Developer, “Evaluations” framework overview. Available macOS 27.0 beta. Abstracted as “measure the quality of your app’s intelligence-powered features.” Apple’s discussion states you define datasets, generate model responses, apply metrics, and aggregate results with type-safe Swift APIs that integrate into your development workflow; the framework evaluates features against metrics from simple pass or fail checks to detailed scoring with model-as-judge patterns, aggregates results into summaries that show which approach performs best and where individual responses fall short, and works with any model available to your code. (macOS 27.0 beta) 

  2. Apple Developer, “Evaluator”. A structure (struct Evaluator<Input> with Input : SampleProtocol and Input.ExpectedValue conforming to Decodable, Encodable, Sendable) abstracted as a closure-based evaluator; Apple’s discussion states the closure receives the input sample and the response, providing access to .value and .transcript. (macOS 27.0 beta) 

  3. Apple Developer, “Evaluation”. A protocol (protocol Evaluation : Sendable) abstracted as a type that defines an evaluation; Apple’s discussion states the evaluation runs your system under test against a dataset and applies evaluators to measure performance. (macOS 27.0 beta) 

  4. Apple Developer, “Metric”. A structure (struct Metric) abstracted as a named metric that carries a result value; Apple’s discussion states the factory methods passing, failing, scoring, and ignore return a new Metric with the result stored inside. (macOS 27.0 beta) 

  5. Apple Developer, “ScoreDimension”. A structure (struct ScoreDimension) abstracted as a named scoring dimension for a model judge evaluator; Apple’s discussion states each dimension defines a name (used as the DataFrame column), an optional description, and a definition of what each score means. (macOS 27.0 beta) 

  6. Apple Developer, “ModelSample”. A structure (struct ModelSample<ExpectedValue> with ExpectedValue conforming to Decodable, Encodable, Sendable) abstracted as a general-purpose language model evaluation sample; Apple’s discussion states it accepts string-based prompts and instructions, and that multimodal prompts use a custom conformance or an initializer with a prebuilt prompt. (macOS 27.0 beta) 

  7. Apple Developer, “SampleGenerator”. Declared actor SampleGenerator<SampleType> with SampleType : ModelSampleProtocol, abstracted as an actor that generates evaluation samples using a language model; Apple’s discussion states you create a generator, configure its properties, then call it to produce new samples as an async stream, after which you access all generated samples or any the validator rejected. (macOS 27.0 beta) 

  8. Apple Developer, “responseColumn”. An instance property (var responseColumn: ResultColumn<Self.Subject> { get }) abstracted as a typed column descriptor for the model responses in the detailed DataFrame. (macOS 27.0 beta) 

  9. Apple Developer, “ToolCallEvaluator”. A structure (struct ToolCallEvaluator<Input> with Input : ModelSampleProtocol and Input.Expectation == TrajectoryExpectation) abstracted as an evaluator that verifies agentic tool calls against an expected trajectory; Apple’s discussion states it produces both a strict and partial result from a single evaluation pass and supports ordered sequences, unordered expectations, disallowed tool checks, and group steps. (macOS 27.0 beta) 

  10. Apple Developer, “TrajectoryExpectation”. A structure (struct TrajectoryExpectation) abstracted as the expected pattern of tool calls for an evaluation; Apple’s discussion states it specifies expected tool-calling behavior across three axes. (macOS 27.0 beta) 

  11. Apple Developer, “ArgumentMatcher”. An enumeration (enum ArgumentMatcher) abstracted as the values that define how to validate a tool-call argument; Apple’s discussion states you can require exact values, verify key presence, check ranges, match patterns, or use a language model for semantic matching. (macOS 27.0 beta) 

  12. Apple Developer, “EvaluationTrait”. A structure (struct EvaluationTrait) abstracted as a test trait that runs an evaluation and records the result as attachments. (macOS 27.0 beta) 

  13. Author’s analysis in Swift Testing: The Framework Replacing XCTest, May 2, 2026, covering @Test, @Suite, #expect, #require, parallel-by-default execution, custom attachments, and the trait vocabulary (.enabled(if:), .disabled(_:), .serialized, .timeLimit(...), .tags(...), .bug(...)) that EvaluationTrait extends, with citations to Apple’s Swift Testing and Trait references. 

  14. Apple Developer, “EvaluationResult”. A structure (struct EvaluationResult) abstracted as the results of running a model evaluation; Apple’s discussion states it contains the summary and detailed results from an evaluation run. (macOS 27.0 beta) 

  15. Apple Developer, “ResultColumn”. A structure (struct ResultColumn<Value>) abstracted as a typed descriptor for a column in an evaluation result DataFrame. (macOS 27.0 beta) 

  16. Apple Developer, “inputColumn”. An instance property (var inputColumn: ResultColumn<Self.Sample> { get }) abstracted as a typed column descriptor for the input samples in the detailed DataFrame. (macOS 27.0 beta) 

  17. Apple Developer, “expectedColumn”. An instance property (var expectedColumn: ResultColumn<Self.Sample.ExpectedValue> { get }) abstracted as a typed column descriptor for the expected values in the detailed DataFrame. (macOS 27.0 beta) 

  18. Apple Developer, “MetricsAggregator”. A structure (struct MetricsAggregator) abstracted as a utility for computing aggregate statistics from evaluation metrics; Apple’s discussion states it calculates summary statistics like mean, median, and standard deviation, processing metric data from a DataFrame to produce aggregated results. (macOS 27.0 beta) 

  19. Author’s analysis in Foundation Models in iOS 27: Tool-Calling Control, June 8, 2026, covering GenerationOptions.ToolCallingMode, the framework’s after-first-call mode shift, and the built-in Vision tools OCRTool and BarcodeReaderTool.

  20. Apple Developer, “EvaluationContext”. A structure (struct EvaluationContext) abstracted as a context that provides the evaluation result within a test scope; Apple’s discussion states you access the result after the evaluation completes. (macOS 27.0 beta) 

  21. Author’s analysis in Foundation Models Agentic Workflow: In-App vs Tooling LLM, May 1, 2026, on the runtime/tooling LLM distinction and the trust boundary between the shipped on-device model and the developer’s tooling model. 

  22. Apple Developer, “Loader”. A protocol (protocol Loader<Sample> : Sendable) abstracted as a protocol for types that supply a dataset for evaluation; Apple’s discussion states you use one of the built-in concrete types or implement the protocol directly for custom data sources. (macOS 27.0 beta) 

  23. Apple Developer, “ModelJudgeEvaluator”. A structure (struct ModelJudgeEvaluator<Input> with Input : ModelSampleProtocol) abstracted as an evaluator that uses a language model as a judge to score responses; Apple’s discussion states it sends the query, response, and optional reference data to a judge model that returns scores for one or more dimensions. (macOS 27.0 beta) 

  24. Apple Developer, “ModelJudgePrompt”. A structure (struct ModelJudgePrompt<Input> with Input : ModelSampleProtocol) abstracted as a configuration for how a model-as-judge evaluator constructs its prompt; Apple’s discussion states it bundles the instructions, response presentation, and reference-data injection into a single composable value. (macOS 27.0 beta) 

  25. Apple, WWDC26 session 298, Meet the Evaluations framework. Apple states generative-AI features “break a contract that is fundamental to software testing” because “the same input can produce different outputs,” and concludes that “unit tests are insufficient.” 
