Core AI: Running Models on Apple Silicon

Apple’s on-device AI stack has had a missing rung. Foundation Models gives you the system LLM, sealed and free. Core ML runs a fixed converted model with the converter making the hardware decisions for you. MLX ships an array framework you embed and a model you select. iOS 27 adds the rung below all three: Core AI, a framework whose one-line abstract is “Run AI models in your app on Apple silicon.”[1] It is the model-execution surface, the place you reach when you want to drive specialization, caching, and inference scheduling yourself instead of accepting a higher layer’s defaults.

In session 324, Apple positions Core AI as the same inference framework that powers on-device Apple Intelligence, now opened up for your own app’s intelligence.[15]

The framing matters because Core AI sits under the abstractions most apps should use. Apple describes it as designed with Apple silicon in mind, letting your app use the latest model architectures and inference techniques across the CPU, GPU, and Neural Engine, with a Swift API that makes common tasks simple while giving you more control over model specialization, caching, and inference performance when needed.[1] The thesis of this post: reach for Core AI when you have a model you want to run with explicit control over where and how it executes, and stay at Core ML or Foundation Models when you do not. The framework rewards a specific need, not a default preference.

TL;DR / Key Takeaways

  • Core AI separates an unspecialized AIModelAsset (inspect a model’s structure and metadata cheaply) from a specialized AIModel (run inference on a device), with AIModelCache holding the device-specific artifacts and AssetError for asset operation failures.[2][3][6][4]
  • Inference data flows through NDArray, a multidimensional array of scalar values, described by an NDArrayDescriptor that fixes shape, scalar type, and memory layout expectations.[5][7]
  • You target hardware with ComputeUnitKind (CPU, GPU, or Neural Engine) through SpecializationOptions, and schedule asynchronous work onto a ComputeStream.[8][9][10]
  • An InferenceFunction owns the weights and buffers and runs inference; an InferenceFunctionDescriptor lets you inspect its input, output, and state signature first. The function is Sendable, so you can run it concurrently.[14][13]
  • Models load from an .aimodel bundle on disk, and Core AI ships preparation, conversion, and debugging tools alongside the framework. Reach for Core AI when you need explicit control over specialization and scheduling; otherwise stay at Core ML or Foundation Models.[2][1]

Two Words That Run the Whole Design: Asset and Model

The first thing Core AI asks you to internalize is that a model on disk and a model that runs inference are different objects, and specializing one into the other is expensive. The framework gives each a type.

An AIModelAsset is “an unspecialized source model asset.”[2] You create it from the URL of an .aimodel bundle on disk, and you use it to inspect a model without paying the specialization cost. Apple is explicit about why the split exists: a model asset lets you query model information without performing specialization, which is an expensive operation. From an asset you can read function signatures, input and output descriptions, compute and storage types, and author-provided metadata. What you cannot do is run inference; an asset is for inspection only.[2]

// Call shape is illustrative; confirm the exact initializer against Apple's docs.
let asset = try AIModelAsset(url: bundleURL)   // an .aimodel bundle on disk
// Inspect signatures, input/output descriptions, compute and storage types,
// and author-provided metadata — without specializing.

The AIModel is the other half: “a specialized model for running inference on a device.”[3] An AIModel represents a specialized .aimodel asset optimized for the current device’s hardware, and you create one by loading the asset from disk.[3] The asset answers “what is this model?”; the model answers “run it here, now.” The cost asymmetry between the two is the reason the API forces you to name which one you want. Inspecting a hundred candidate models to pick one is cheap if you only build assets; it would be ruinous if every inspection specialized.
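The cost asymmetry can be sketched with plain Swift. Everything below is a toy stand-in written for this post: ToyAsset, ToySpecializedModel, ToySpecializer, and the metadata fields are hypothetical names, not Core AI types. The sketch only shows the shape of the workflow, in which the search reads metadata and the expensive step runs once.

```swift
// Toy stand-ins only: these are NOT Core AI types. They model the cost split
// between a cheap, inspectable asset and an expensive specialized model.
struct ToyAsset {
    let name: String
    let parameterCount: Int        // hypothetical author-provided metadata
    let functionNames: [String]    // hypothetical function signatures
}

struct ToySpecializedModel {
    let source: ToyAsset
}

final class ToySpecializer {
    private(set) var specializationCount = 0

    // Stands in for the expensive, device-specific step.
    func specialize(_ asset: ToyAsset) -> ToySpecializedModel {
        specializationCount += 1
        return ToySpecializedModel(source: asset)
    }
}

// Pick the smallest candidate that exposes a "decode" function,
// reading metadata only. Nothing is specialized during the search.
func pickCandidate(from assets: [ToyAsset]) -> ToyAsset? {
    return assets
        .filter { $0.functionNames.contains("decode") }
        .min { $0.parameterCount < $1.parameterCount }
}

// One hundred candidates; only even-numbered ones expose "decode".
let candidates = (1...100).map { index in
    ToyAsset(
        name: "candidate-\(index)",
        parameterCount: index * 1_000_000,
        functionNames: index % 2 == 0 ? ["encode", "decode"] : ["encode"]
    )
}

let specializer = ToySpecializer()
if let chosen = pickCandidate(from: candidates) {
    _ = specializer.specialize(chosen)   // the only expensive call
}
```

After the search, specializer.specializationCount is 1 even though one hundred candidates were examined, which is the property the asset/model split is meant to protect.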

Specialization produces device-specific artifacts, and those artifacts have a home: AIModelCache, “a cache that stores the specialized model artifacts for inference.”[6] The cache holds the optimized, device-specific artifacts that a model loads to execute its inference functions, and Apple notes that each cache entry contains a specialized asset formed from a specific .aimodel or .aimodelc and a specialization combination.[6] The practical reading: specialization is not something you want to repeat on every launch. The cache is how Core AI lets the expensive step happen once and the cheap step (loading cached artifacts) happen thereafter.

When asset operations go wrong (a missing bundle, a malformed .aimodel, an unreadable file), Core AI surfaces an AssetError, “an error that occurs during model asset operations.”[4] Treat it the way you treat any I/O boundary: the asset lives on disk, disk operations fail, and the type system tells you exactly where to put the catch.

Tensors: NDArray and Its Descriptor

Inference moves numbers in and numbers out, and Core AI’s container for those numbers is NDArray, “a multidimensional array of scalar values used for model inference.”[5] If you have worked with NumPy ndarray, MLX arrays, or MLMultiArray, the shape of the idea is familiar: an n-dimensional block of scalars with a defined layout. An NDArray stores its data in a layout defined by its shape and the rest of its descriptive properties.[5]

The companion type is NDArrayDescriptor, “a description of an array’s shape, scalar type, and memory layout expectations.”[7] A descriptor is the contract. Apple’s framing is direct: the descriptor contains the expectations for an array value you provide to an inference function, and most expectations are strict. If the descriptor specifies a scalar type of .float32, the array you provide must use .float32.[7] You do not guess the shape and type a function wants; you ask the function’s descriptor and conform to it.

// Call shape is illustrative; confirm exact property/method names against Apple's docs.
let inputDescriptor = function.descriptor.inputs.first!   // an NDArrayDescriptor
// The descriptor fixes shape, scalar type, and layout; the array you build
// must satisfy those expectations (e.g. .float32 means .float32).

The design lesson here mirrors the asset/model split. Core AI consistently puts a cheap description object in front of an expensive value object. You read the descriptor to learn the contract, then allocate the NDArray that satisfies it, rather than allocating first and discovering a mismatch at inference time. For image inputs specifically, Core AI also defines an ImageDescriptor, “a description of an image’s dimensions and pixel format,” so a vision model’s pixel input gets the same descriptor-first treatment.[11]
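The descriptor-first habit is easy to model without the framework. The sketch below uses hypothetical stand-ins (ToyArrayDescriptor, ToyArray, ToyScalarType, ToyContractError) rather than Core AI's NDArray or NDArrayDescriptor, and it assumes strictness means "a mismatch is an error, not a conversion", which is one reading of Apple's description.

```swift
// Toy stand-ins only: NOT Core AI's NDArray or NDArrayDescriptor.
enum ToyScalarType: Equatable {
    case float16, float32, int32
}

struct ToyArrayDescriptor {
    let shape: [Int]
    let scalarType: ToyScalarType
}

struct ToyArray {
    let shape: [Int]
    let scalarType: ToyScalarType
    var storage: [Float]   // simplified: one storage type regardless of scalarType
}

enum ToyContractError: Error, Equatable {
    case scalarTypeMismatch(expected: ToyScalarType, actual: ToyScalarType)
    case shapeMismatch(expected: [Int], actual: [Int])
}

// Read the contract first, then allocate an array that satisfies it.
func makeArray(satisfying descriptor: ToyArrayDescriptor) -> ToyArray {
    let count = descriptor.shape.reduce(1, *)
    return ToyArray(
        shape: descriptor.shape,
        scalarType: descriptor.scalarType,
        storage: Array(repeating: 0, count: count)
    )
}

// Strict check: a mismatch is an error, never a silent conversion.
func validate(_ array: ToyArray, against descriptor: ToyArrayDescriptor) throws {
    guard array.scalarType == descriptor.scalarType else {
        throw ToyContractError.scalarTypeMismatch(
            expected: descriptor.scalarType, actual: array.scalarType)
    }
    guard array.shape == descriptor.shape else {
        throw ToyContractError.shapeMismatch(
            expected: descriptor.shape, actual: array.shape)
    }
}
```

Building the array from the descriptor (makeArray) makes the mismatch path unreachable in normal use; validate exists for arrays that arrive from elsewhere.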

Choosing Where Inference Runs

Apple silicon has three places to compute: CPU, GPU, and Neural Engine. The reason Core AI exists rather than only Core ML is that Core AI lets you say which of those the framework targets, instead of inferring it.

ComputeUnitKind is “a type of hardware compute unit available for model inference.”[8] You use compute unit kinds with the specialization options to control which hardware the framework targets when specializing a model, and by default specialization uses all available compute units on the device.[8] The default is the right answer for most work, which is the point: you only override it when you have a reason (a latency-sensitive path you want pinned to the Neural Engine, a debugging pass you want forced onto the CPU, a GPU-heavy pipeline you are coordinating with other GPU work).

You pass that intent through SpecializationOptions, the structure that carries the choices made at specialization time.[9] Specialization is the expensive step from earlier, and SpecializationOptions is where the compute-unit targeting and other specialization decisions live. Because a cache entry keys off a specific asset and specialization combination, changing your options changes which cached artifact you get back, which closes the loop between targeting and caching.[6]
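The link between targeting and caching can be sketched as a dictionary keyed by asset plus options. The types here (ToyComputeUnit, ToyOptions, ToyCacheKey, ToyModelCache) are hypothetical and say nothing about how AIModelCache is actually implemented; the only behaviour borrowed from the documentation is the all-units default and the idea that an entry is tied to an asset and a specialization combination.

```swift
// Toy stand-ins only: NOT Core AI's AIModelCache or SpecializationOptions.
enum ToyComputeUnit: String, Hashable, CaseIterable {
    case cpu, gpu, neuralEngine
}

struct ToyOptions: Hashable {
    // Mirrors the documented default: target every available unit.
    var computeUnits: Set<ToyComputeUnit> = Set(ToyComputeUnit.allCases)
}

struct ToyCacheKey: Hashable {
    let assetID: String
    let options: ToyOptions
}

struct ToyArtifact {
    let description: String
}

final class ToyModelCache {
    private var entries: [ToyCacheKey: ToyArtifact] = [:]
    private(set) var specializationCount = 0

    // Returns the cached artifact for this asset and option combination,
    // running the (pretend) expensive specialization only on a miss.
    func artifact(forAsset assetID: String,
                  options: ToyOptions = ToyOptions()) -> ToyArtifact {
        let key = ToyCacheKey(assetID: assetID, options: options)
        if let hit = entries[key] { return hit }
        specializationCount += 1
        let units = options.computeUnits.map(\.rawValue).sorted().joined(separator: "+")
        let artifact = ToyArtifact(description: "\(assetID)@\(units)")
        entries[key] = artifact
        return artifact
    }
}
```

Requesting the same asset twice with default options specializes once; requesting it again with ToyOptions(computeUnits: [.cpu]) is a different key and triggers a second specialization.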

Scheduling is the other axis of “how it runs,” and Core AI models it as a ComputeStream, “a stream of work to be run asynchronously.”[10] You encode work onto a compute stream, and Apple notes that multiple inferences encoded to the same stream are serialized as needed based on the values read and written.[10] Two implications follow. First, a stream is your ordering primitive: encode dependent inferences to one stream and Core AI sequences them by data dependency. Second, work is asynchronous by default, so the stream is also how you keep the calling thread free while the Neural Engine or GPU does the work.
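One way to picture "serialized as needed based on the values read and written" is a small dependency model. This is an assumption about the general technique, not a description of ComputeStream's internals: ToyWork, conflicts, and scheduleSteps are hypothetical names, and the rule used is the usual one that two pieces of work must be ordered when either writes a value the other reads or writes.

```swift
// Toy model only: NOT Core AI's ComputeStream. It illustrates one way
// ordering by read/write dependency could play out.
struct ToyWork {
    let name: String
    let reads: Set<String>
    let writes: Set<String>
}

// Two pieces of work conflict when one writes a value the other touches.
func conflicts(_ earlier: ToyWork, _ later: ToyWork) -> Bool {
    let writeThenTouch = !earlier.writes.isDisjoint(with: later.reads.union(later.writes))
    let readThenWrite = !earlier.reads.isDisjoint(with: later.writes)
    return writeThenTouch || readThenWrite
}

// Assigns each piece of work, in encode order, to the earliest step that
// comes after every earlier piece it conflicts with. Work sharing a step
// has no data dependency and could overlap.
func scheduleSteps(for encoded: [ToyWork]) -> [String: Int] {
    var stepByIndex: [Int] = []
    for (index, work) in encoded.enumerated() {
        var step = 0
        for earlierIndex in 0..<index where conflicts(encoded[earlierIndex], work) {
            step = max(step, stepByIndex[earlierIndex] + 1)
        }
        stepByIndex.append(step)
    }
    var result: [String: Int] = [:]
    for (index, work) in encoded.enumerated() {
        result[work.name] = stepByIndex[index]
    }
    return result
}

let encoded = [
    ToyWork(name: "encodeImage", reads: ["pixels"], writes: ["imageEmbedding"]),
    ToyWork(name: "encodeText", reads: ["tokens"], writes: ["textEmbedding"]),
    ToyWork(name: "fuse", reads: ["imageEmbedding", "textEmbedding"], writes: ["logits"]),
]
let steps = scheduleSteps(for: encoded)
```

In this toy run the two encoders land in step 0 because they touch disjoint values, while fuse lands in step 1 because it reads what both encoders write.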

Inference Functions: The Thing That Actually Runs

A loaded .aimodel is not a single callable. Models expose named functions (an encoder, a decoder, a vision tower, a prefill versus a decode step), and Core AI’s unit of execution is the InferenceFunction: “a function that performs inference on input values and produces output values.”[14]

Before you call one, you inspect it. InferenceFunctionDescriptor is “a description of an inference function’s signature,” and you use a descriptor to inspect the names and types of a function’s inputs, outputs, and states before running inference.[13] States are the detail worth pausing on: a function with state is how a stateful model (a KV cache in a transformer decode loop, for instance) keeps information between calls, and the descriptor tells you a function has them before you try to drive it.
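What "state between calls" means can be shown with a deliberately tiny decode loop. Nothing here is Core AI API: ToyFunctionSignature, ToyDecodeState, and decodeStep are hypothetical, and the "KV cache" is just an array of seen tokens. The point is the order of operations, which is to check the signature for state and then thread that state through every call.

```swift
// Toy stand-ins only: NOT Core AI's state API. The "cache" here is a plain
// array that a pretend decode step reads and extends on every call.
struct ToyDecodeState {
    var seenTokens: [Int] = []
}

struct ToyFunctionSignature {
    let inputNames: [String]
    let outputNames: [String]
    let stateNames: [String]

    var isStateful: Bool { !stateNames.isEmpty }
}

// One decode step: the output depends on the new token and on everything
// the state has accumulated from earlier calls.
func decodeStep(token: Int, state: inout ToyDecodeState) -> Int {
    state.seenTokens.append(token)
    return state.seenTokens.reduce(0, +)
}

let signature = ToyFunctionSignature(
    inputNames: ["token"],
    outputNames: ["logit"],
    stateNames: ["kvCache"]
)

var decodeState = ToyDecodeState()
var decodeOutputs: [Int] = []
if signature.isStateful {
    // The same state value is passed to every call, so each step sees
    // what the previous steps stored.
    for token in [3, 1, 4] {
        decodeOutputs.append(decodeStep(token: token, state: &decodeState))
    }
}
```

Each output is the running sum of the tokens seen so far, so the second call's result differs from what it would be with a fresh state. That difference is what a descriptor's state list warns you about.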

The InferenceFunction itself owns the resources inference needs, including model weights and intermediate buffers. You load a function from a model and call run(inputs:states:outputViews:) to perform inference.[14] The signature of run is named in Apple’s own discussion, so the three things a call needs are explicit: the input values, the state values, and the output views you want written.

// run(inputs:states:outputViews:) is named in Apple's docs; surrounding
// loading/value-construction shapes are illustrative — confirm against Apple's docs.
let function: InferenceFunction = /* load from an AIModel */
let outputs = try function.run(
    inputs: inputValues,        // InferenceValue per input
    states: stateValues,        // any stateful values the function declares
    outputViews: outputViews
)

Two properties make the function pleasant under load. It is Sendable, so you can run it concurrently from multiple tasks, and Apple notes it automatically allocates additional intermediate buffers as needed to support that concurrency.[14] You do not serialize calls behind a lock to protect shared scratch space; the function manages its own buffers per concurrent caller. That is a meaningful difference from APIs where a single inference handle is effectively single-threaded.
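A loose analogy for "buffers per concurrent caller" in plain Swift: a Sendable value with immutable weights whose run method allocates its own scratch space on each call. ToyFunction, ResultBox, and runConcurrently are hypothetical names for this post, and the sketch uses Dispatch rather than anything from Core AI; how the real framework allocates its intermediate buffers is not something this code claims to show.

```swift
import Foundation
import Dispatch

// Toy stand-in only: NOT Core AI's InferenceFunction. Each call allocates its
// own scratch buffer, so concurrent callers never share mutable state.
struct ToyFunction: Sendable {
    let weights: [Float]   // immutable, so safe to share across threads

    func run(input: [Float]) -> Float {
        precondition(input.count == weights.count, "input must match weights")
        var scratch = [Float](repeating: 0, count: weights.count)  // per-call buffer
        for index in weights.indices {
            scratch[index] = weights[index] * input[index]
        }
        return scratch.reduce(0, +)
    }
}

// Thread-safe collector for results produced on different threads.
final class ResultBox: @unchecked Sendable {
    private let lock = NSLock()
    private var values: [Int: Float] = [:]

    func store(_ value: Float, at index: Int) {
        lock.lock(); defer { lock.unlock() }
        values[index] = value
    }

    func snapshot() -> [Int: Float] {
        lock.lock(); defer { lock.unlock() }
        return values
    }
}

// Runs the same function value from several concurrent callers. The lock
// protects only the result collection, never the call to run itself.
func runConcurrently(_ function: ToyFunction, callers: Int) -> [Int: Float] {
    let box = ResultBox()
    DispatchQueue.concurrentPerform(iterations: callers) { index in
        let input = [Float](repeating: Float(index), count: function.weights.count)
        box.store(function.run(input: input), at: index)
    }
    return box.snapshot()
}

let concurrentResults = runConcurrently(ToyFunction(weights: [1, 2, 3]), callers: 8)
```

The design choice worth noticing is where the lock sits: around the shared result dictionary, not around inference. That is the arrangement the paragraph above describes.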

The values flowing through run are InferenceValue instances, “a value that an inference function accepts as input or produces as output.”[12] An InferenceValue wraps either an NDArray or a pixel buffer, and you retrieve a result after inference using its value property.[12] The wrapper is what lets one run signature carry both tensor inputs and image inputs without separate overloads: a text model passes NDArray-backed values, a vision model passes pixel-buffer-backed ones, and the function reads the descriptor to know which it expects.
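The "one signature, two payloads" idea maps naturally onto a Swift enum with associated values. The sketch below is a hypothetical model (ToyInferenceValue, ToyTensor, ToyPixelBuffer, ToyExpectedInput); Apple's documentation describes InferenceValue as a struct, so the enum here is only a way to illustrate the wrapping, not a guess at the real declaration.

```swift
// Toy stand-ins only: NOT Core AI's InferenceValue, NDArray, or pixel buffer.
struct ToyTensor: Equatable {
    let shape: [Int]
    let scalars: [Float]
}

struct ToyPixelBuffer: Equatable {
    let width: Int
    let height: Int
}

// One wrapper type lets a single run signature accept either payload.
enum ToyInferenceValue: Equatable {
    case tensor(ToyTensor)
    case pixels(ToyPixelBuffer)
}

enum ToyExpectedInput {
    case tensor, pixels
}

// The function's declared expectation decides which payload is acceptable.
func accepts(_ value: ToyInferenceValue, expected: ToyExpectedInput) -> Bool {
    switch (value, expected) {
    case (.tensor, .tensor), (.pixels, .pixels):
        return true
    default:
        return false
    }
}

// A single entry point carries both kinds of input without overloads.
func describe(_ inputs: [String: ToyInferenceValue]) -> [String] {
    let ordered = inputs.sorted { $0.key < $1.key }
    return ordered.map { entry -> String in
        switch entry.value {
        case .tensor(let tensor):
            return "\(entry.key): tensor \(tensor.shape)"
        case .pixels(let buffer):
            return "\(entry.key): pixels \(buffer.width)x\(buffer.height)"
        }
    }
}
```

A text model's inputs and a vision model's inputs can then share one dictionary type, with accepts standing in for the descriptor check that says which payload a given input expects.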

When to Reach for Core AI

The hardest part of Core AI is not the API. It is knowing that you should be here at all rather than one layer up. The honest decision tree:

  • Foundation Models when Apple’s system model does the task. Summarize, classify, extract, rewrite, structured output: these belong to Foundation Models, which costs you no weights, no memory budget, and no specialization step. If your feature fits, stop there. Dropping to Core AI to re-implement what the system model already does is wasted work.
  • Core ML when you have a fixed, converted model and want the converter to make the hardware and optimization decisions for you. Core ML targets the Neural Engine with tight power and latency for a locked-down production model, and it asks nothing of you about specialization or scheduling. If you do not want to think about compute-unit targeting or compute streams, that is the signal to stay at Core ML.
  • MLX when you want a research-grade array framework you embed and iterate on: your own training loop, quantized open-weight models, LoRA fine-tunes, fast experimentation. MLX is a library you ship with weights, not a system model-execution surface. It wins on flexibility and iteration speed.
  • Core AI when you have a model to run and you want the framework’s explicit handles: an AIModelAsset you inspect before committing, SpecializationOptions that pin compute units, an AIModelCache you manage, a ComputeStream you schedule onto, and InferenceFunctions you call concurrently. You reach here when the higher layers’ defaults are the thing in your way, and you can name which default you need to override.

The throughline across the whole stack: each layer down trades a default away for a handle. Foundation Models hands you everything and asks nothing. Core AI hands you the levers and asks you to know which to pull. If you cannot name the specialization, caching, or scheduling control you need, you do not need Core AI yet.
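The decision tree can be written down as a function, which makes the "can you name the control?" test concrete. InferenceLayer, FeatureNeeds, and recommendedLayer are hypothetical names for this post, and the order in which MLX and Core AI are checked is an assumption of the sketch rather than something the list above prescribes.

```swift
// A hypothetical encoding of the decision tree above; the type and property
// names are illustrative, not part of any Apple framework.
enum InferenceLayer: String {
    case foundationModels = "Foundation Models"
    case coreML = "Core ML"
    case mlx = "MLX"
    case coreAI = "Core AI"
}

struct FeatureNeeds {
    var systemModelHandlesTask = false
    var needsCustomTrainingOrFineTuning = false
    // Controls you can actually name: specialization, caching, scheduling.
    var namedControls: Set<String> = []
}

func recommendedLayer(for needs: FeatureNeeds) -> InferenceLayer {
    // If the system model already does the task, stop there.
    if needs.systemModelHandlesTask { return .foundationModels }
    // Training loops and fine-tunes point at an embedded array framework.
    if needs.needsCustomTrainingOrFineTuning { return .mlx }
    // Only drop to the execution surface when a named control is required.
    if !needs.namedControls.isEmpty { return .coreAI }
    // Otherwise a fixed converted model with framework defaults is enough.
    return .coreML
}
```

A feature with no named controls falls through to Core ML, and adding "scheduling" to namedControls is what moves it to Core AI.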

How a Model Gets to Core AI

The framework is the runtime half of a larger workflow. Apple notes that Core AI includes additional tools for model preparation, integration, and debugging alongside the framework: you prepare your models for Apple silicon, convert them into the .aimodel format, and use a companion app that supports visualization and numeric debugging.[1] The source material for this post truncates the names of those tools, so confirm the exact tool names and their invocation against Apple’s Core AI documentation rather than trusting any reconstruction.[1] What is verified is the pipeline shape: a source model is prepared, converted to .aimodel, loaded as an AIModelAsset for inspection, specialized into an AIModel, and run through its InferenceFunctions, with AIModelCache keeping the specialized artifacts so the expensive step happens once.[1][2][3][6][14]

FAQ

What is Apple’s Core AI framework?

Core AI is iOS 27’s low-level framework for running AI models on Apple silicon, summarized by Apple as “Run AI models in your app on Apple silicon.”[1] It runs model inference across the CPU, GPU, and Neural Engine through a Swift API that makes common tasks simple while giving you control over model specialization, caching, and inference performance when you need it.[1] It sits below Foundation Models and Core ML as a model-execution surface.

What is the difference between AIModelAsset and AIModel?

AIModelAsset is an unspecialized source asset you create from an .aimodel bundle’s URL on disk. You use it to inspect a model’s function signatures, input and output descriptions, compute and storage types, and metadata without specializing, because specialization is expensive. An asset cannot run inference.[2] AIModel is the specialized model optimized for the current device’s hardware that does run inference; you create one by loading the asset from disk.[3] The split lets you inspect cheaply and specialize only when you commit.

How does Core AI choose between the CPU, GPU, and Neural Engine?

You control hardware targeting with ComputeUnitKind through SpecializationOptions. A compute unit kind names a type of hardware compute unit available for inference, and you use it to control which hardware the framework targets when specializing a model; by default specialization uses all available compute units on the device.[8][9] You override the default only when you have a specific reason, such as pinning a latency-sensitive path to one compute unit.

What is an InferenceFunction and how do I run it?

An InferenceFunction performs inference on input values and produces output values, owning the model weights and intermediate buffers.[14] You inspect its signature first through an InferenceFunctionDescriptor, which describes the names and types of the function’s inputs, outputs, and states, then load the function from an AIModel and call run(inputs:states:outputViews:).[13][14] The function is Sendable and allocates intermediate buffers automatically to support concurrency, so multiple tasks can run it at once.[14]

Should I use Core AI instead of Core ML or Foundation Models?

Use Foundation Models when the system model does the task, and Core ML when you have a fixed converted model and want the converter to make hardware and optimization decisions for you. Reach for Core AI when you want explicit control over specialization (SpecializationOptions, ComputeUnitKind), caching (AIModelCache), and scheduling (ComputeStream) that the higher layers handle on your behalf.[8][9][6][10] If you cannot name the control you need, stay one layer up.

The full Apple Ecosystem cluster: MLX on Apple Silicon for the array framework you embed when you want your own model and training loop; Apple Silicon’s TBDR and unified memory for the hardware substrate that makes CPU/GPU/Neural Engine sharing work; Core ML on-device inference for the fixed-model layer above Core AI; and Foundation Models for Apple’s sealed system LLM at the top of the stack. The hub is at the Apple Ecosystem Series. For broader iOS-with-AI-agents context, see the iOS Agent Development guide.

References


  1. Apple Developer Documentation: Core AI (iOS 27.0 beta). “Run AI models in your app on Apple silicon.” Core AI runs the latest model architectures and inference techniques across the CPU, GPU, and Neural Engine, with a Swift API that gives control over specialization, caching, and inference performance; it includes additional tools for model preparation, conversion to .aimodel, integration, and debugging. 

  2. Apple Developer Documentation: AIModelAsset (iOS 27.0 beta). “An unspecialized source model asset.” Created from the URL of an .aimodel bundle on disk; used to inspect a model’s structure and metadata (function signatures, input/output descriptions, compute and storage types, author-provided metadata) without performing the expensive specialization step. It cannot perform inference. 

  3. Apple Developer Documentation: AIModel (iOS 27.0 beta). “A specialized model for running inference on a device.” Represents a specialized .aimodel asset optimized for the current device’s hardware; you create one by loading the asset from disk. 

  4. Apple Developer Documentation: AssetError (iOS 27.0 beta). “An error that occurs during model asset operations.” Declared as struct AssetError.

  5. Apple Developer Documentation: NDArray (iOS 27.0 beta). “A multidimensional array of scalar values used for model inference.” Stores data in a layout defined by its descriptive properties. Declared as struct NDArray.

  6. Apple Developer Documentation: AIModelCache (iOS 27.0 beta). “A cache that stores the specialized model artifacts for inference.” Holds the optimized, device-specific artifacts a model loads to execute its inference functions; each entry is a specialized asset formed from a specific .aimodel or .aimodelc and specialization combination. Declared as final class AIModelCache.

  7. Apple Developer Documentation: NDArrayDescriptor (iOS 27.0 beta). “A description of an array’s shape, scalar type, and memory layout expectations.” Contains the expectations for an array value provided to an inference function; most expectations are strict (a .float32 scalar type requires a .float32 array). Declared as struct NDArrayDescriptor.

  8. Apple Developer Documentation: ComputeUnitKind (iOS 27.0 beta). “A type of hardware compute unit available for model inference.” Used with the specialization options to control which hardware the framework targets when specializing a model; by default specialization uses all available compute units on the device. Declared as enum ComputeUnitKind.

  9. Apple Developer Documentation: SpecializationOptions (iOS 27.0 beta). The structure carrying the choices made at specialization time, including compute-unit targeting via ComputeUnitKind. Declared as struct SpecializationOptions.

  10. Apple Developer Documentation: ComputeStream (iOS 27.0 beta). “A stream of work to be run asynchronously.” Work is encoded onto the stream; multiple inferences encoded to the same stream are serialized as needed based on the values read and written. Declared as final class ComputeStream.

  11. Apple Developer Documentation: ImageDescriptor (iOS 27.0 beta). “A description of an image’s dimensions and pixel format.” Declared as struct ImageDescriptor.

  12. Apple Developer Documentation: InferenceValue (iOS 27.0 beta). “A value that an inference function accepts as input or produces as output.” Wraps either an NDArray or a pixel buffer; retrieved after inference using its value property. Declared as struct InferenceValue.

  13. Apple Developer Documentation: InferenceFunctionDescriptor (iOS 27.0 beta). “A description of an inference function’s signature.” Used to inspect the names and types of a function’s inputs, outputs, and states before running inference. Declared as struct InferenceFunctionDescriptor.

  14. Apple Developer Documentation: InferenceFunction (iOS 27.0 beta). “A function that performs inference on input values and produces output values.” Owns the resources needed for inference, including model weights and intermediate buffers; loaded from an AIModel and called via run(inputs:states:outputViews:). It is Sendable and automatically allocates additional intermediate buffers to support concurrent execution. Declared as struct InferenceFunction.

  15. Apple, WWDC26 session 324, Meet Core AI. Apple states Core AI “is the inference framework powering on-device Apple Intelligence” and “now, it’s available for you to use, bringing that same power to your app’s own intelligence.” 