Metal 4 Essentials: What The New Core API Actually Changes

Metal 4 is not a rewrite. Apple shipped it as a parallel API surface alongside the original Metal types, with an MTL4 prefix on the new ones, so an app can adopt the new core API incrementally without rewriting existing render code.1 The framing matters: the Metal framework itself has been shipping for over a decade; the Metal 4 core API is what changed at WWDC25.

Three things the new core API actually changes for app developers:

  1. Multi-threaded command buffer encoding becomes a first-class pattern.
  2. The compute encoder absorbs blit and acceleration-structure encoders into one unified surface.
  3. Machine learning runs as a first-class pass type alongside render and compute, executing Core ML models on the GPU timeline without round-tripping to the CPU.

The sections below walk through what each of those three changes looks like in practice, the new types developers reach for, and the reasons Apple’s documentation gives for the shape they took.

TL;DR

  • MTL4CommandQueue, MTL4CommandBuffer, MTL4RenderCommandEncoder, MTL4ComputeCommandEncoder, MTL4MachineLearningCommandEncoder are the new types.1 Original MTL-prefixed types remain. You incrementally adopt by mixing both.
  • Command buffers get their working memory from a separate MTL4CommandAllocator, which lets multiple threads encode to multiple buffers in parallel. A single commit:count: call submits the batch to the queue.1
  • MTL4ComputeCommandEncoder replaces three earlier encoders: MTLBlitCommandEncoder, MTLComputeCommandEncoder, and MTLAccelerationStructureCommandEncoder.1 One encoder, three jobs.
  • MTL4MachineLearningCommandEncoder runs Core ML models inside a Metal command buffer.2 The system picks GPU or Apple Neural Engine for each model. Tensors carry inputs and outputs; the same command buffer mixes ML inference with render and compute work.
  • Resource binding moves to argument tables (MTL4ArgumentTable) instead of per-encoder bind methods. All resources are untracked; you synchronize with explicit barriers.1

Why A Parallel API Surface

Apple’s framing for the parallel-types choice, verbatim from the docs: “Metal 4 introduces several types with the MTL4 prefix that are completely independent from the original MTL types they replace, such as MTL4CommandQueue versus MTLCommandQueue. Other types are common to all versions of Metal.”1

The runtime check is simple: the app detects whether the system supports Metal 4, creates an MTL4CommandQueue if it does, falls back to MTLCommandQueue otherwise. The type of queue the app creates determines which family of types the rest of the rendering code uses.1
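A minimal sketch of that check, assuming the support test is a GPU-family query (the .metal4 family case below is an assumption; substitute whatever support check Apple documents for your deployment target):

import Metal

guard let device = MTLCreateSystemDefaultDevice() else { fatalError("no Metal device") }

if device.supportsFamily(.metal4) {               // assumed family case; see note above
    // Metal 4 path: this queue type drives MTL4-prefixed types downstream.
    let commandQueue = device.makeMTL4CommandQueue()
    // ...build the Metal 4 render path around commandQueue...
} else {
    // Fallback: the original queue drives the original MTL types.
    let commandQueue = device.makeCommandQueue()
    // ...build the existing render path around commandQueue...
}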

The other half of the design lets the two families interoperate. MTLEvent and MTLSharedEvent synchronize across both MTLCommandQueue and MTL4CommandQueue instances.1 An app shipping a substantial Metal 1 codebase can switch a single subsystem to Metal 4 without breaking the synchronization patterns the rest of the app depends on.
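A sketch of that pattern (legacyCommandBuffer, metal4Queue, and metal4CommandBuffer are illustrative bindings; encodeSignalEvent(_:value:) is the long-standing original-API call, while the queue-level waitForEvent(_:value:) on the Metal 4 side is an assumed name):

// Original-API subsystem signals when its command buffer's work completes.
let sharedEvent = device.makeSharedEvent()!
legacyCommandBuffer.encodeSignalEvent(sharedEvent, value: 1)
legacyCommandBuffer.commit()

// Metal 4 subsystem defers its dependent work until that signal lands.
metal4Queue.waitForEvent(sharedEvent, value: 1)   // assumed MTL4CommandQueue method
metal4Queue.commit(metal4CommandBuffer, count: 1)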

That answers a question app developers have been asking since WWDC25: do I need to rewrite my Metal code? No. The shape of the API encourages incremental adoption per subsystem.

Multi-Threaded Command Buffer Encoding

The headline runtime improvement: command buffer memory comes from a companion type, MTL4CommandAllocator, rather than from the queue. Each thread can encode work to its own command buffer using its own allocator, and the queue commits buffers as a batch.

Apple’s API shape:1

let device: MTLDevice = ...
let commandQueue: MTL4CommandQueue = device.makeMTL4CommandQueue()
var commandAllocators: [MTL4CommandAllocator] = ...   // one per frame in flight
let commandBuffer: MTL4CommandBuffer = device.makeCommandBuffer()

// Per frame:
let frameAllocator = commandAllocators[frameNumber % kMaxFramesInFlight]
frameAllocator.reset()                                // return this allocator's memory to its pool
commandBuffer.beginCommandBuffer(allocator: frameAllocator)
// ...encode commands to commandBuffer...
commandBuffer.endCommandBuffer()
commandQueue.commit(commandBuffer, count: 1)          // the commit:count: call can submit a batch of buffers

Two operational changes from the original API:

Command buffers are reusable. Apple’s docs: “You can reuse and repurpose each command buffer indefinitely by starting over, encoding new commands, and committing it again, instead of allocating a new buffer.”1 Earlier Metal required a fresh transient buffer for each commit.

No automatic resource retention. “Each MTL4CommandBuffer instance doesn’t create strong references to resources.”1 The behavior is similar to makeCommandBufferWithUnretainedReferences() from the older API. Apps need to manage resource lifetimes explicitly so that resources stay alive until the GPU finishes the work.
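One way to do that, as a minimal sketch: a hypothetical FrameResources helper keeps strong references to each frame’s resources and drops them once a MTLSharedEvent, signaled by the submission path (elided here), reaches that frame’s value:

import Metal
import Foundation

// Hypothetical helper: per-frame strong references, released on GPU completion.
final class FrameResources {
    private var live: [Int: [MTLResource]] = [:]   // frame index -> strong refs
    private let lock = NSLock()
    private let listener = MTLSharedEventListener(dispatchQueue: .global())

    // Keep resources alive until the GPU signals `event` with `value`.
    func hold(_ resources: [MTLResource], forFrame frame: Int,
              until event: MTLSharedEvent, value: UInt64) {
        lock.lock(); live[frame] = resources; lock.unlock()
        event.notify(listener, atValue: value) { [weak self] _, _ in
            guard let self else { return }
            self.lock.lock(); self.live[frame] = nil; self.lock.unlock()  // GPU is done: drop refs
        }
    }
}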

The frame-allocator pattern Apple ships in its sample code uses one allocator per frame in flight (the sample uses three) and rotates through them as frames advance through the encode → render → display lifecycle.1 Calling reset() on the allocator at the start of each frame returns its memory to the pool for reuse.
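Put together, the parallel pattern looks roughly like this. It is a sketch, not a drop-in: workerCount, commandAllocators, and commandBuffers are illustrative names, and the exact Swift bridging of the batched commit is an assumption mirroring the commit:count: shape above:

// One allocator and one command buffer per worker thread.
let workerCount = 4
DispatchQueue.concurrentPerform(iterations: workerCount) { i in
    let allocator = commandAllocators[i]
    allocator.reset()
    let buffer = commandBuffers[i]
    buffer.beginCommandBuffer(allocator: allocator)
    // ...encode this worker's slice of the frame into buffer...
    buffer.endCommandBuffer()
}
// One batched submission for all the buffers.
commandQueue.commit(commandBuffers, count: workerCount)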

The Unified Compute Encoder

MTL4ComputeCommandEncoder is “a new type that combines the functionality of its three predecessors: MTLBlitCommandEncoder, MTLComputeCommandEncoder, MTLAccelerationStructureCommandEncoder.”1

The earlier API required apps to switch encoder types based on the work shape: blit for resource copies and texture transfers, compute for kernel dispatches, acceleration-structure for ray tracing scene management. Metal 4 collapses those three into one surface. An app encoding a frame that builds an acceleration structure, dispatches a denoising kernel, and copies a texture to a presentation buffer can now do all three through a single encoder type.
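As a conceptual sketch only (the method calls below mirror the three original encoders’ Swift signatures and are assumptions about the MTL4 surface; pipeline setup and the barriers the untracked-resource model requires are elided):

let encoder = commandBuffer.makeComputeCommandEncoder()   // one encoder, three job types

// Acceleration-structure build (formerly MTLAccelerationStructureCommandEncoder).
encoder.build(accelerationStructure: accel, descriptor: accelDescriptor,
              scratchBuffer: scratch, scratchBufferOffset: 0)

// Kernel dispatch (formerly MTLComputeCommandEncoder).
encoder.dispatchThreadgroups(gridSize, threadsPerThreadgroup: groupSize)

// Texture copy (formerly MTLBlitCommandEncoder).
encoder.copy(from: denoisedTexture, to: presentationTexture)

encoder.endEncoding()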

The render encoder also picked up a behavior change. MTL4RenderCommandEncoder supports encoding a render pass across command buffers by suspending work at the end of one render encoder and resuming it after the beginning of the next one in the sequence.1 Apple’s framing: “This technique conceptually replaces the MTLParallelRenderCommandEncoder protocol and simplifies encoding a render pass in parallel with multiple threads because each thread can have its own render encoder instead of tying all of them to a single render encoder.”1

The pattern holds: parallel encoding becomes the natural shape, and the API stops requiring a single coordinator type.

Argument Tables Replace Per-Encoder Resource Bindings

The original Metal encoder API exposed methods like setVertexBuffer(_:offset:index:) and setFragmentTexture(_:index:) on each encoder, with separate per-stage binding tables internal to the encoder.1 Metal 4 replaces that pattern with an explicit MTL4ArgumentTable instance.

Apple’s framing of the design payoff: “Metal 4 encoders don’t require memory for storing the binding tables for every resource type, at every stage. Each table consumes only the memory it needs to store its resource bindings.”1

The flow:1

let descriptor = MTL4ArgumentTableDescriptor()
descriptor.maxBufferBindCount = ...     // the table allocates only what these counts require
descriptor.maxTextureBindCount = ...
let argumentTable = device.makeArgumentTable(descriptor: descriptor)

argumentTable.setResource(buffer, bufferIndex: 0)   // bind once on the table...
argumentTable.setSamplerState(sampler, index: 0)

renderEncoder.setArgumentTable(argumentTable, stages: [.vertex, .fragment])   // ...then assign to encoder stages

A single argument table can serve multiple encoders, including encoders on different command buffers, as long as the resources it binds are appropriate for all of them. Apple’s docs note: “The memory and runtime savings add up with each common resource your encoders share, and each time you assign the argument table to a new encoder.”1

There is a tradeoff. Earlier versions of Metal supported hazard tracking for textures and heaps that opted in via MTLTextureDescriptor.hazardTrackingMode or MTLHeapDescriptor.hazardTrackingMode.1 In Metal 4, “the framework considers all resources untracked. You need to synchronize pipeline stages that can concurrently access a resource if any of the shaders in these pipelines modify it.”1 Apps add explicit barriers to delay a stage until a previous stage finishes. That is more code than the older opt-in tracking, in exchange for predictable performance and lower runtime overhead.
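The barrier shape, again as a conceptual sketch (the entry point and stage masks below are assumptions, not confirmed Metal 4 API):

// A kernel writes outputTexture...
computeEncoder.dispatchThreadgroups(gridSize, threadsPerThreadgroup: groupSize)

// ...so work that reads it must be ordered explicitly behind the write.
computeEncoder.barrier(afterStages: .dispatch, beforeStages: .blit)   // assumed signature

// This copy now starts only after the dispatch above finishes.
computeEncoder.copy(from: outputTexture, to: stagingTexture)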

Machine Learning As A First-Class Pass

MTL4MachineLearningCommandEncoder is the most architecturally significant addition. Apple’s framing:2

“Metal 4 introduces the ability to run CoreML models efficiently from within the Metal workflow. This is useful for apps that need to apply the output from a model in a Metal context, such as rendering a scene or running a compute dispatch.”

Two things are happening at once. First, ML inference runs on the GPU timeline, in the same command buffer as render and compute work. The app does not round-trip through the CPU between model inference and the render pass that consumes the output. Second, the system picks the inference engine: “The system automatically chooses an inference engine, such as a device’s GPU or Apple Neural Engine (ANE) for each machine learning model. The GPU can run additional, independent render or compute work with the GPU when the system chooses to run a model on the ANE.”2

The development workflow:2

  1. Convert a Core ML model into a Metal ML package using metal-package-builder, included in Xcode 26’s bundled tools.
  2. Add the Metal ML package to the Xcode project. Xcode compiles it to a Metal library at build time.
  3. At runtime, the app creates an MTL4MachineLearningPipelineState from that library.
  4. The encoder takes the pipeline state, an MTLHeap for scratch memory, and MTLTensor instances for inputs and outputs.
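A minimal sketch of step 4 (the encoder-creation and dispatch calls follow the shape of Apple’s machine-learning pass docs, but treat the exact names and signatures as assumptions; heap sizing and tensor setup are elided):

let mlEncoder = commandBuffer.makeMachineLearningCommandEncoder()   // assumed factory name
mlEncoder.setPipelineState(mlPipelineState)                // MTL4MachineLearningPipelineState from the compiled library
mlEncoder.setArgumentTable(mlArgumentTable)                // binds the input/output MTLTensor instances
mlEncoder.dispatchNetwork(intermediatesHeap: scratchHeap)  // MTLHeap scratch for intermediate tensors
mlEncoder.endEncoding()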

MTLTensor is a new resource type for multidimensional data arrays.2 Apple’s docs note that the type works with common ML weight types like int8 and fp16. Tensors carry inputs into the model and outputs out of it; for transient data between inference invocations, Metal Shading Language adds tensor types that live on the GPU directly:2

  • tensor_handle: a handle to an MTLTensor created on the CPU
  • tensor_inline: a tensor defined on the GPU as a view into a tensor or buffer
  • cooperative_tensor: a tensor that distributes its elements among the threads working with it

The cooperative_tensor type is the latency-sensitive case: “Cooperative tensors provide temporary memory for transient tensors by equally distributing their data among the threads that work with that tensor. This memory distribution reduces memory bandwidth by allocating the memory from thread-private or threadgroup-private address spaces, which is important for latency-critical, machine learning algorithms.”2

MSL also picked up tensor operators that work directly in shader code: convolution, matrix multiplication, reduction.2 Apps that need to manipulate weights between inference passes can do so without copying tensors back to CPU memory or running a separate compute pass; the operators fit into normal MSL kernels.

There is one boundary worth quoting: “Machine learning encoders run Core ML models but they can’t build new networks or modify layers and inputs of existing ones; for those tasks, see Core ML and Metal Performance Shaders Graph.”2 Metal 4’s ML encoder is for shipping inference, not for training or model construction.

What Metal 4 Means For The Apple Stack

Three takeaways for app developers planning Metal 4 adoption:

  1. Adopt incrementally, per subsystem. The parallel MTL4-prefixed types and event-based interop with the original API are designed for partial migration. Pick a subsystem with clear performance pressure (a render path, a compute pipeline, a model inference loop) and migrate it first.1
  2. Multi-threaded encoding is the new normal. The allocator-per-thread pattern, the commit:count: batch submission, and the suspend/resume render-pass mechanism all assume parallel encoding as the shape that performant apps will use. Single-thread encoding still works, but the framework’s runtime gains compound with multi-thread adoption.1
  3. ML runs in the same command buffer as everything else. For apps that combine on-device model inference with rendering or compute (image-processing pipelines that filter through a Core ML model, real-time effects that depend on classifier output, AR experiences whose rendering depends on a per-frame inference result), the ability to encode ML inference into the same command buffer as the render pass that consumes it is the qualitative change.2

The Metal Shading Language additions deserve their own treatment. Tensor types and operators in shader code, plus operation descriptors for custom operations, change what a Metal kernel can express. That is a separate post.

This post is part of the Apple Ecosystem cluster: the Foundation Models on-device LLM post covers the framework that runs on top of this stack; the custom adapter lifecycle post covers developer-managed specialization; the Core ML on-device inference post covers the ML framework whose models Metal 4 now runs inline. The hub is the Apple Ecosystem Series.

FAQ

Is Metal 4 a separate framework from Metal?

No. The framework is still Metal. Apple’s docs describe Metal 4 as “the Metal 4 core API”: a set of new types with the MTL4 prefix that ship alongside the original MTL types in the same framework.1 Apps incrementally adopt the new types by detecting Metal 4 support at runtime and creating the appropriate queue type.

Do I need iOS 26 to use Metal 4?

The Metal framework itself supports iOS 8 and later, but the Metal 4 core API is the surface Apple introduced at WWDC25. Detect support at runtime and create either an MTL4CommandQueue or an MTLCommandQueue based on what the device supports.1

What’s the relationship between Metal 4 ML passes and Foundation Models?

They run on different stacks. MTL4MachineLearningCommandEncoder runs Core ML models converted to Metal ML packages, in the same command buffer as render and compute work.2 Foundation Models is a separate framework that runs Apple’s on-device system language model with its own session API, covered in the Foundation Models on-device LLM post. The two are complementary: an app can use Foundation Models for text generation and Metal 4 ML passes for vision or audio model inference inside its render loop.

Why is the compute encoder unified now?

Apple’s docs combine MTLBlitCommandEncoder, MTLComputeCommandEncoder, and MTLAccelerationStructureCommandEncoder into MTL4ComputeCommandEncoder.1 The justification is operational rather than performance-only: a single encoder type for compute, blit, and acceleration-structure work simplifies pipeline management and reduces encoder churn in apps that interleave the three.

Are store-action options still available in Metal 4?

Not for MTL4RenderCommandEncoder. Apple’s docs note: “Store-action options (see MTLStoreActionOptions) aren’t available because they don’t apply to Apple silicon GPUs.”1 The architectural decision reflects Apple’s GPU-only target for the Metal 4 core API.

Do I have to use argument tables?

In Metal 4, yes. The encoder protocols don’t expose per-resource bind methods. You configure resource bindings on an MTL4ArgumentTable and assign that table to one or more encoder stages.1 The runtime payoff is that the table only allocates memory for bindings it actually uses, instead of fixed-size per-stage tables on every encoder.

References


  1. Apple Developer, “Understanding the Metal 4 core API”. Type hierarchy comparison (MTL4 vs MTL), command queue and buffer behavior, command allocator pattern, encoder unification, argument tables, hazard tracking, suspend/resume render passes. Retrieved 2026-05-04. 

  2. Apple Developer, “Machine learning passes”. MTL4MachineLearningCommandEncoder, MTLTensor, MSL tensor types (tensor_handle, tensor_inline, cooperative_tensor), metal-package-builder, system inference engine selection (GPU/ANE), MSL tensor operators. Retrieved 2026-05-04. 
