Metal for Machine Learning in 2026

Apple’s M5 ships a new hardware block called the neural accelerator, and it does not sit on the Neural Engine or off in some coprocessor. It sits inside each GPU shader core, alongside the existing pipelines, built to accelerate dense compute-bound work like the prefill stage of an LLM.1 That placement is the whole story of Metal in 2026. The GPU you reach for to draw triangles is now the GPU you reach for to run matrix multiplications, and Apple spent WWDC26 handing developers the APIs to use it directly. Metal is a first-class machine learning compute surface, and the graphics framing no longer covers what it does.

For years the on-device ML story ran above Metal: Core ML targeted the Neural Engine, MLX wrapped the GPU in a NumPy-shaped array API, MetalFX gave games a sealed upscaler. Metal itself was the substrate those frameworks stood on, not a layer most ML developers wrote against. WWDC26 changed the altitude. Apple shipped a Metal Shading Language API for tensor math, made custom kernels pluggable into Core AI, and showed neural networks running inline inside shaders. The reasons to drop to the Metal level are now concrete enough to name.

TL;DR

  • The M5 neural accelerator is a new hardware block located inside each GPU shader core, designed to accelerate dense compute-bound work such as LLM prefill.1
  • TensorOps is a Metal Shading Language API that accelerates tensor operations (matrix multiplication, convolution) on the GPU and automatically uses available hardware acceleration across Apple silicon generations.1
  • Metal tensors natively support quantized data types: 4- and 8-bit integers in the macOS and iOS 26 update, with 4- and 8-bit float and 2-bit integer types plus E8M0 block-wise scale factors arriving in macOS and iOS 27.1
  • Neural rendering spans three levels in Metal 4: MetalFX denoising as a black box, the ML command encoder for trained models in your command buffer, and TensorOps for tiny networks built directly in a shader.2
  • MLX Swift keeps the high-level numerical-computing path: array code that reads like the math, lazy evaluation, automatic differentiation, and GPU execution by default.4
  • New performance tooling (look-back trace collection, metalperftrace, and the StateReporting API) closes the loop on measuring all of it across long sessions.3

Metal tensors and custom ML kernels

Apple frames the ML stack as layers: Core AI and MLX at the top for minimal-code deployment, Metal Performance Shaders below for high-performance kernels, and underneath all of it Metal Performance Primitives and the TensorOps library doing the low-level acceleration.1 You drop to the Metal level for a specific reason. ML research moves fast, so you might implement a custom operation that plugs into a higher-level framework like Core AI. You might be contributing to a framework like MLX or llama.cpp. Or you might be writing a Metal-based application that needs the op inline.1

Apple’s Shiyao on the M5 neural accelerator: a hardware block inside each shader core, built for dense compute-bound work like LLM prefill (WWDC26, session 330).

TensorOps is a Metal Shading Language API that accelerates tensor operations on the GPU, matrix multiplication and convolution among them, and it takes full advantage of the M5 neural accelerator without you writing per-generation code.1 The library reads the available hardware and uses it. That portability is the point of working through TensorOps instead of hand-rolling SIMD-group math against a specific GPU.

The most practical 2026 addition is native quantization. Models keep growing and inference is typically memory-bandwidth bound, so compressing weights both fits more of the model into memory and saves bandwidth.1 The standard move: take 16-bit half-precision weights, reduce them to 4 bits, and pair the quantized values with scale factors that map them back into range at compute time.1 TensorOps now handles quantized data types natively. The macOS and iOS 26 update added 4- and 8-bit integer types; macOS and iOS 27 extend that to 4- and 8-bit floating-point types and 2-bit integers.1 You create a quantized tensor almost exactly as you create a regular one: fill in the descriptor’s properties, specify a quantized dataType, and call newTensorWithDescriptor on your Metal device.1
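To make the "reduce to 4 bits, pair with a scale" move concrete, here is a minimal sketch in plain Python. It is not TensorOps code: the block size, the symmetric rounding rule, and every function name are assumptions chosen for illustration, and the real library performs the equivalent dequantization on the GPU.

```python
# Sketch: symmetric 4-bit integer quantization with one scale per block.
# BLOCK, the rounding rule, and all names here are illustrative assumptions,
# not TensorOps behavior.

BLOCK = 32  # assumed number of weights that share one scale


def quantize_block(weights):
    """Map floats to signed 4-bit integers in [-8, 7] plus one float scale."""
    peak = max(abs(w) for w in weights)
    scale = peak / 7.0 if peak > 0 else 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize_block(q, scale):
    """Recover approximate weights at compute time: value = q * scale."""
    return [v * scale for v in q]


def pack_nibbles(q):
    """Pack two signed 4-bit values per byte, low nibble first."""
    out = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


# Deterministic stand-in weights in roughly [-0.8, 0.8].
weights = [((i * 37) % 64 - 32) / 40.0 for i in range(BLOCK)]

q, scale = quantize_block(weights)
restored = dequantize_block(q, scale)
packed = pack_nibbles(q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

At half precision those 32 weights occupy 64 bytes; the packed 4-bit payload is 16 bytes plus the scale, and the rounding error per weight stays within half a scale step.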

The scale factors get a clean home too. In macOS and iOS 27, a single MTLTensor can carry its scales alongside the quantized data as an additional scale plane, supporting the FP8 E8M0 block-wise scale factor format, where each scale element applies to a block of data elements.1 You build a descriptor for the scale plane, set its dataType and block factors, and attach an auxiliary plane map to the original tensor descriptor; quantized data, scales, and metadata pack into one tensor object.1 In a kernel, you declare the scale plane (for example fp8_e8m0 with a 32-by-1 block size, so every 32 data elements share one scale), declare the full tensor type, set up a matmul2d_descriptor with the tile sizes, create a matmul2d op, pass in the quantized tensors, and TensorOps handles dequantization for you.1
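The block-wise scale idea can be sketched numerically. The example below assumes that E8M0 follows the OCP Microscaling convention (an 8-bit exponent with a bias of 127 and no sign or mantissa bits, so each scale is a power of two); whether Metal's fp8_e8m0 matches that detail exactly is an assumption, and the helper names are hypothetical. It shows one scale byte shared by each run of 32 four-bit values.

```python
import math

BLOCK = 32       # 32-by-1 block: every 32 data elements share one scale
E8M0_BIAS = 127  # assumed exponent bias, following the OCP Microscaling convention


def e8m0_decode(code):
    """An E8M0 byte is a bare exponent: value = 2 ** (code - bias)."""
    return 2.0 ** (code - E8M0_BIAS)


def e8m0_encode_for(peak, qmax):
    """Pick the smallest power-of-two scale that keeps peak / scale <= qmax."""
    if peak == 0:
        return E8M0_BIAS  # encodes a scale of 1.0
    exponent = math.ceil(math.log2(peak / qmax))
    return max(0, min(254, exponent + E8M0_BIAS))


def quantize(values, qmax=7):
    """Return (signed 4-bit codes, one E8M0 scale byte per block of 32)."""
    codes, scales = [], []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        s = e8m0_encode_for(max(abs(v) for v in block), qmax)
        scale = e8m0_decode(s)
        scales.append(s)
        codes.extend(max(-qmax - 1, min(qmax, round(v / scale))) for v in block)
    return codes, scales


def dequantize(codes, scales):
    """Each code is multiplied by the scale of the block it belongs to."""
    return [c * e8m0_decode(scales[i // BLOCK]) for i, c in enumerate(codes)]


# Two blocks with very different magnitudes, so each needs its own scale.
values = [math.sin(i * 0.1) * (0.02 if i < 32 else 3.0) for i in range(64)]
codes, scales = quantize(values)
restored = dequantize(codes, scales)

# Storage cost: 4-bit data plus one 8-bit scale amortized over 32 elements.
bits_per_weight = 4 + 8 / BLOCK
```

With a 32-element block the scale adds a quarter of a bit per weight, so the effective cost is 4.25 bits per weight against 16 for half precision.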

When you need a custom quantization format the library does not know, dequantize directly into a cooperative tensor and pass it to the matmul2d op. Cooperative tensors distribute their storage across the thread-private memory of the threads in the matmul, so the data stays in registers and you skip the round trip through threadgroup memory.1

The flagship example in session 330 is FlashAttention, a fused implementation of the attention operation at the core of transformer models. Attention multiplies Q by K, computes SoftMax over the rows of the intermediate matrix, then multiplies by V; FlashAttention fuses all three into one kernel.1 You use the execution_simdgroup operation scope so each SIMD group owns complete rows of the intermediate matrix and can compute SoftMax without exchanging data between groups, store the intermediate matrix in a cooperative tensor, and compute the row reductions with reduce_rows.1 The 2026 improvement that matters most: in macOS 26 you had to store the cooperative tensor to threadgroup memory before the second matmul, but now you feed it straight in via get_left_input_cooperative_tensor after an is_compatible_as_left layout check.1 If the check fails, you fall back to the threadgroup round trip; either way the op.run call is identical.1
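The arithmetic that lets the three steps fuse is the running-maximum ("online") softmax. The sketch below is plain Python for a single query row, not the Metal kernel from the session: the tile size and function names are assumptions, and the usual 1/sqrt(d) scaling is omitted to match the description above. It checks that streaming over K and V in tiles gives the same result as materializing the whole score row.

```python
import math


def attention_naive(q, K, V):
    """Reference: scores = q . K^T, softmax over the row, then weights . V."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(V[0])
    return [sum(e * v[j] for e, v in zip(exps, V)) / total for j in range(dim)]


def attention_fused(q, K, V, tile=2):
    """Stream over K/V in tiles, keeping only a running max, sum, and output."""
    dim = len(V[0])
    run_max = -math.inf
    run_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(run_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(run_max - new_max)
        run_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            run_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        run_max = new_max
    return [a / run_sum for a in acc]


q = [0.5, -1.0, 2.0]
K = [[1.0, 0.0, 1.0], [0.0, 2.0, -1.0], [3.0, 1.0, 0.5],
     [-1.0, -1.0, 1.0], [0.2, 0.4, 0.6]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

reference = attention_naive(q, K, V)
fused = attention_fused(q, K, V)
```

Because each row carries its own running maximum and sum, a unit that owns complete rows never needs values from another unit, which is the property the SIMD-group scope relies on.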

Apple closed the loop by integrating that custom kernel into a real model. Core AI converts PyTorch models and supports custom Metal kernels, so the team defined the FlashAttention body as a string in Python, registered a TorchMetalKernel, replaced the default Hugging Face attention with one that calls the kernel, and exported a SAM 3 image-segmentation model from PyTorch as an optimized Core AI asset that correctly segmented a car in a test image.1 That is the path from a hand-written GPU kernel to a shipping model.

Real-time neural rendering

The second front is rendering. Many techniques that historically used analytical methods, including neural denoising, neural textures, and learned tone mapping, can now run as machine learning at any stage of the pipeline to improve quality, performance, or memory footprint.2 Metal 4 offers three levels of control, and the level you pick is a deliberate trade.

Apple’s Yulia walks the three levels of ML in a Metal 4 rendering pipeline, from MetalFX to TensorOps networks running on the M5 and A19 Pro neural accelerator (WWDC26, session 359).

At the highest level, MetalFX provides a ready-to-use neural denoiser and upscaler as a fully integrated black box, designed for the low-latency demands of a live viewport.2 In a path tracer your frame budget might allow one sample per pixel to stay interactive, and one sample is noisy. MetalFX takes that noisy frame plus auxiliary inputs (diffuse albedo, depth, and a few others) and produces a clean, near-final image using both spatial and temporal techniques.2 Maxon’s Redshift Live, a real-time path tracer in Cinema 4D, ships it, and Apple distilled the integration into three best practices: keep your auxiliary inputs noise-free (diffuse albedo is the strongest denoising signal), store what the viewer actually sees (primary surface replacement for mirrors, Fresnel-blended albedo for glass), and get your motion vectors right (MetalFX expects dejittered motion vectors, or edges shimmer).2

The middle level is the Metal 4 ML command encoder, which runs a pre-trained model directly in your command buffer with no context switch.2 Apple’s worked example is neural tone mapping. A renderer’s post-processing chain (tone mapping, color grade, film emulation) grows arbitrarily complex, and a neural network can learn the whole transformation. You train a network like HDRNet (a 2017 architecture from Gharbi and colleagues) in PyTorch, export it to an MTLPackage, load it with a function descriptor and a machine learning pipeline descriptor, then dispatch it with an argument table.2 The neural tone mapper then replaces the entire multi-stage post-processing chain with one neural evaluation, encoded in the same command buffer alongside the path tracer and MetalFX, executing in the same frame.2
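One way to picture what the network learns is to treat the existing chain as a function and sample it for supervised pairs. The sketch below is a scalar toy, not Apple's chain and not HDRNet: the three operators (a Reinhard curve, a smoothstep-based contrast grade, and a gamma encode) and their parameters are stand-ins chosen for illustration.

```python
# Sketch: a multi-stage post-processing chain viewed as one function to learn.
# Every operator and constant here is an illustrative assumption.


def reinhard(x):
    """Compress HDR radiance in [0, inf) into [0, 1)."""
    return x / (1.0 + x)


def contrast_grade(x, strength=0.3):
    """Blend toward a smoothstep S-curve to add contrast."""
    s = x * x * (3.0 - 2.0 * x)
    return (1.0 - strength) * x + strength * s


def gamma_encode(x, gamma=2.2):
    """Encode linear values for display."""
    return x ** (1.0 / gamma)


def post_chain(hdr):
    """The composed chain a single network would be trained to reproduce."""
    return gamma_encode(contrast_grade(reinhard(hdr)))


# Input/target pairs: HDR value in, final display value out.
hdr_samples = [0.0, 0.05, 0.18, 1.0, 4.0, 16.0]
training_pairs = [(h, post_chain(h)) for h in hdr_samples]
```

A real tone mapper works on full images and color channels; the point here is only that the training target is the output of the chain you already have.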

The deepest level is TensorOps inside a shader, the same library from session 330 but used to build tiny networks of a few thousand parameters or fewer, trained on one scene and not meant to generalize.2 Because the network fits inline among your ALU and texture-sampling instructions, it enables online training: a model that adapts every frame to changing world conditions. Apple’s example is a sky probe for image-based lighting, where a dynamic day-night cycle makes a precomputed lighting signal go stale. The fix is a fully connected MLP, a 3-4-4-3 network taking a direction as three floats and returning illumination as a color, evaluated forward in the shader and trained online with a backpropagation pass each frame.2 Using SIMD-group execution scope, all participating threads work on the same matrix multiplication and gain cooperative tensors whose storage spreads across threads, so you matmul2d through each layer, apply activations in place, and read the output right there in the shader without a round trip to main memory. That is the same neural accelerator path from session 330, now feeding a pixel.2
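For a sense of scale, a 3-4-4-3 network with biases has only 51 parameters. The sketch below trains one in plain Python with a small backpropagation pass per simulated frame. It is a CPU illustration, not shader code: the tanh activation, squared-error loss, learning rate, batch of 8 directions per frame, and the analytic sky function used as ground truth are all assumptions.

```python
import math
import random

SIZES = [3, 4, 4, 3]  # direction in, two hidden layers of 4, RGB out


def init(rng):
    """One (weights, biases) pair per layer, with small random weights."""
    layers = []
    for n_in, n_out in zip(SIZES, SIZES[1:]):
        w = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((w, [0.0] * n_out))
    return layers


def forward(layers, x):
    """Return activations of every layer; tanh on hidden layers only."""
    acts = [x]
    for i, (w, b) in enumerate(layers):
        z = [sum(wi * a for wi, a in zip(row, acts[-1])) + bi
             for row, bi in zip(w, b)]
        acts.append(z if i == len(layers) - 1 else [math.tanh(v) for v in z])
    return acts


def train_step(layers, x, target, lr=0.05):
    """One backpropagation pass for squared error, updating in place."""
    acts = forward(layers, x)
    delta = [o - t for o, t in zip(acts[-1], target)]  # gradient at the output
    for i in reversed(range(len(layers))):
        w, b = layers[i]
        inp = acts[i]
        # Gradient for the layer below, taken before this layer's weights change.
        back = [sum(w[j][k] * delta[j] for j in range(len(w)))
                for k in range(len(inp))]
        for j, d in enumerate(delta):
            b[j] -= lr * d
            for k, a in enumerate(inp):
                w[j][k] -= lr * d * a
        if i > 0:
            delta = [g * (1.0 - a * a) for g, a in zip(back, inp)]  # tanh'


def sky(direction, sun=0.5):
    """Hypothetical ground-truth sky color along a unit direction."""
    up = 0.5 + 0.5 * direction[1]
    return [0.2 + 0.6 * up * sun, 0.3 + 0.5 * up, 0.4 + 0.5 * up * (1.0 - sun)]


def random_direction(rng):
    v = [rng.gauss(0, 1) for _ in range(3)]
    n = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / n for c in v]


def mean_loss(layers, probes):
    total = 0.0
    for d in probes:
        out = forward(layers, d)[-1]
        total += sum((o - s) ** 2 for o, s in zip(out, sky(d)))
    return total / len(probes)


rng = random.Random(0)
net = init(rng)
probes = [random_direction(rng) for _ in range(64)]
param_count = sum(len(w) * len(w[0]) + len(b) for w, b in net)

loss_before = mean_loss(net, probes)
for frame in range(400):   # one small training pass per simulated frame
    for _ in range(8):     # a handful of sampled directions per frame
        d = random_direction(rng)
        train_step(net, d, sky(d))
loss_after = mean_loss(net, probes)
```

In the shader version described above, the per-layer multiply is a matmul2d over cooperative tensors and the sky samples come from the live scene instead of an analytic function.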

Measuring it: performance tooling

Running ML on the GPU means competing for the same cycles, bandwidth, and thermal budget your rendering needs, so measurement stops being optional. Session 388 is about long game sessions, but the tooling applies to any Metal workload, ML included, because a frame-rate drop during a neural tone-mapping pass looks exactly like any other GPU stall until you can see what the app was doing.3

Apple’s Ruiwei on the system always recording Metal performance metrics in the background, so you can look back hours and collect a trace after a session ends (WWDC26, session 388).

The headline change in iOS and macOS 27: the system always records Metal performance and resource metrics in the background, storing aggregated and optional per-frame data (CPU, GPU, FPS, memory) for days.3 You collect after the fact. On macOS, metalperftrace collect --last 5h pulls a trace for the last five hours; on iOS, a one-time setup adds a Performance Trace button to Control Center that processes a look-back trace on demand.3 You analyze with metalperftrace overview, and --json emits structured output you can feed to a regression script or hand to an AI agent for triage; Instruments plots everything on a timeline and flags deviating stats in color.3

The piece that makes a raw FPS graph actionable is the StateReporting API, available in Swift and Objective-C.3 You define domains, each a finite state machine for one orthogonal area of functionality, and report transitions with a label plus optional stable metadata and per-state volatile metadata.3 For an ML-heavy app you might run a domain tracking which neural pass is active, so when frame time spikes you can ask metalperftrace to aggregate FPS for a specific state label and see directly that, say, the tone-mapping pass is the cost.3 Apple’s caution: keep transitions to the cadence of user actions or slower, because the system throttles high-frequency reporting and you lose data.3 After shipping, MetricKit in iOS and macOS 27 exposes Metal frame-rate information broken down by your StateReporting states, delivered as daily reports from devices.3
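The idea of attributing frame rate to a labeled state can be modeled in a few lines. The sketch below is a conceptual Python model only: the StateDomain class, its methods, and the sample data are hypothetical and are not the StateReporting API or metalperftrace output. It shows one domain as a finite state machine and an aggregation of mean FPS per state label.

```python
from collections import defaultdict


class StateDomain:
    """Hypothetical stand-in for one reporting domain: a finite state machine."""

    def __init__(self, name, initial):
        self.name = name
        self.transitions = [(0.0, initial)]  # (timestamp in seconds, label)

    def report_transition(self, timestamp, label):
        """Record that the domain entered `label` at `timestamp`."""
        self.transitions.append((timestamp, label))

    def state_at(self, timestamp):
        """Label active at `timestamp`; assumes transitions are in time order."""
        current = self.transitions[0][1]
        for when, label in self.transitions:
            if when <= timestamp:
                current = label
        return current


def mean_fps_by_state(domain, frames):
    """frames: list of (timestamp, frame_time_ms). Returns label -> mean FPS."""
    buckets = defaultdict(list)
    for timestamp, frame_ms in frames:
        buckets[domain.state_at(timestamp)].append(frame_ms)
    return {label: 1000.0 / (sum(ms) / len(ms)) for label, ms in buckets.items()}


# One domain tracking which neural pass is active.
passes = StateDomain("neural-pass", initial="idle")
passes.report_transition(1.0, "denoise")
passes.report_transition(2.0, "tone-mapping")
passes.report_transition(3.0, "idle")

# Made-up frame times: the tone-mapping state is the slow one.
frames = [(0.5, 8.0), (1.5, 10.0), (2.2, 25.0), (2.7, 25.0), (3.5, 8.0)]
fps = mean_fps_by_state(passes, frames)
```

Transitions here happen about once per second, in line with the advice to keep them at the cadence of user actions or slower.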

MLX: the high-level path stays

Not every ML developer wants to write GPU kernels, and Apple did not push everyone down. MLX Swift remains the expressive layer for numerical and array computing on Apple silicon, and session 328 is a clean argument for staying there when you can.4

Apple’s David Koski on where MLX Swift fits among Accelerate, BNNS, Metal Performance Shaders, and Swift Numerics (WWDC26, session 328).

MLX uses n-dimensional arrays as its central abstraction, like NumPy, so most NumPy code translates with minimal changes and the code reads like the math you are implementing.4 Two properties carry the framework: lazy evaluation, where operations build a compute graph that runs only when you call eval or read a value, and the function transformations that lazy evaluation powers, including grad for automatic differentiation.4 By default the GPU runs the work, and Apple’s framing is that 10x over a scalar CPU loop is certainly possible depending on the algorithm.4 Session 328 makes the case with three examples: the Mandelbrot set as a two-line loop over a whole grid of complex numbers, a heat-distribution solver where a single conv2d call applies the Jacobi stencil across the grid, and curve fitting where grad derives the exact gradient with no hand-written derivatives.4 MLX ships the full toolkit beyond that and is open source under an MIT license with Swift, Python, C++, and C front-ends sharing the same concepts, so you can prototype in Python and ship in Swift.4
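The heat-distribution example rests on the fact that a Jacobi update is a 3x3 convolution. The sketch below spells that stencil out in plain Python so the arithmetic is visible; it is not MLX code, the grid size and boundary values are assumptions, and an array framework would apply the same kernel to every cell in one call instead of two nested loops.

```python
# Sketch: the Jacobi stencil for steady-state heat, written as an explicit
# 3x3 kernel. Grid size, boundary values, and iteration count are assumptions.

KERNEL = [
    [0.0, 0.25, 0.0],
    [0.25, 0.0, 0.25],
    [0.0, 0.25, 0.0],
]


def jacobi_step(grid):
    """Apply the stencil to every interior cell; boundary cells stay fixed."""
    n = len(grid)
    new = [row[:] for row in grid]
    for r in range(1, n - 1):
        for c in range(1, n - 1):
            new[r][c] = sum(
                KERNEL[i][j] * grid[r - 1 + i][c - 1 + j]
                for i in range(3)
                for j in range(3)
            )
    return new


N = 7
grid = [[0.0] * N for _ in range(N)]
grid[0] = [1.0] * N  # hot top edge; the other three edges are held at 0

for _ in range(200):
    grid = jacobi_step(grid)

center = grid[N // 2][N // 2]
```

By symmetry the center of a square plate with one hot edge settles at a quarter of that edge's temperature, which gives a quick sanity check on the solver.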

How it relates to Core AI, Core ML, and MLX

The four sessions describe one stack at different altitudes, and the right mental model is a ladder.

  • Core AI and Core ML are the deployment layer. You convert a trained model and run it, and Apple owns the acceleration. Core AI’s support for custom Metal kernels is the bridge: the FlashAttention kernel from session 330 plugs into a Core AI model, so a Metal-level optimization reaches a high-level deployment without a rewrite.1
  • MLX is the array-computing and research layer, where you write math-shaped Swift, get autodiff and GPU execution for free, and run or fine-tune your own models. It is the layer most ML work should start at.4
  • Metal Performance Shaders and TensorOps are the kernel layer. You drop here when you need a custom operation, when you are building a framework like MLX itself, or when the network has to run inline in a shader. The kernel layer exposes the neural accelerator most directly.1

The hardware substrate ties them together. The neural accelerator lives in the GPU shader core, and Metal is how you reach the GPU, so every layer above eventually rides the same silicon over Apple’s unified-memory, tile-based architecture. Picking a layer is the same judgment as the MLX-versus-Foundation-Models decision: drop down only when you can name the capability the layer above does not give you, because each step down trades convenience for control.

FAQ

What is the M5 neural accelerator?

The neural accelerator is a new hardware block in the M5 chip family, located directly inside each GPU shader core alongside the other GPU pipelines.1 Apple designed it to accelerate dense compute-bound work, such as the prefill stage of an LLM, and the TensorOps library uses it automatically when present.1 Apple also referenced the neural accelerator on the A19 Pro GPU in the neural rendering session.2

Which quantized data types do Metal tensors support?

The macOS and iOS 26 update added 4- and 8-bit integer types.1 In macOS and iOS 27, TensorOps extends support to 4- and 8-bit floating-point types and 2-bit integer types, plus FP8 E8M0 block-wise scale factors carried in a tensor’s scale plane.1 Apple notes these smaller types carry additional alignment requirements, so consult the Metal documentation before adopting them.1

What are the three levels of neural rendering in Metal 4?

MetalFX is the highest level, a ready-to-use black-box neural denoiser and upscaler.2 The Metal 4 ML command encoder is the middle level, running a pre-trained model exported to an MTLPackage directly in your command buffer.2 TensorOps is the deepest level, letting you build tiny networks inside a shader, including online training that adapts a model every frame.2

When should I use MLX instead of writing Metal kernels?

Use MLX Swift when your goal is expressive numerical or array code with good performance, where the code should read like the math and you want autodiff and GPU execution by default.4 Drop to TensorOps and Metal kernels when you need a custom operation that MLX or Core AI does not provide, when you are contributing to an ML framework, or when the network has to run inline in a shader.1

How do I measure ML performance on the GPU in 2026?

In iOS and macOS 27 the system always records Metal performance metrics in the background, so you collect a look-back trace after the fact with metalperftrace collect --last <duration> on macOS or the Control Center Performance Trace button on iOS.3 Analyze with metalperftrace overview (with --json for scripting) or Instruments, and add the StateReporting API to label what your app was doing so you can attribute slowdowns to a specific pass.3


The Apple Ecosystem cluster continues: MLX on Apple Silicon for the array-framework path, Apple Silicon TBDR for the GPU substrate underneath all of it, Core AI for running models on-device, and Core ML on-device inference for the broader ML stack. The series hub is the Apple Ecosystem Series, and for building agentic features on this foundation see the iOS Agent Development guide.

References


  1. Apple, WWDC26 session 330, “Optimize custom machine learning operations with Metal tensors”. The M5 neural accelerator inside each shader core; TensorOps as a Metal Shading Language API for matrix multiplication and convolution; native quantized data types (4- and 8-bit integers in the macOS/iOS 26 update; 4- and 8-bit floats and 2-bit integers plus FP8 E8M0 block-wise scale factors in macOS/iOS 27); scale planes and auxiliary plane maps; newTensorWithDescriptor, matmul2d_descriptor, matmul2d, tensor_handle/tensor_inline; cooperative tensors; the FlashAttention kernel (execution_simdgroup scope, reduce_rows, map_iterator, get_left_input_cooperative_tensor, is_compatible_as_left/right_input, op.run); and the SAM 3 / Core AI integration via TorchMetalKernel. Apple directs developers to the Metal Performance Primitives documentation and programming guide for the full API reference. 

  2. Apple, WWDC26 session 359, “Build real-time neural rendering pipelines with Metal”. The three levels of neural rendering in Metal 4 (MetalFX, the ML command encoder, TensorOps in a shader); MetalFX denoising/upscaling, its auxiliary inputs, the transparency overlay and denoiser strength mask, and the three best practices (clean inputs, primary surface replacement and Fresnel-blended albedo, dejittered motion vectors) as adopted by Maxon’s Redshift Live; neural tone mapping via HDRNet exported to an MTLPackage; the sky-probe MLP (3-4-4-3) with online training; SIMD-group execution scope and cooperative tensors; and the reference to the A19 Pro GPU neural accelerator. Apple directs developers to the Metal Performance Primitives (MPP) Programming Guide for code details. 

  3. Apple, WWDC26 session 388, “Find and fix performance issues in your Metal games”. The always-on background recording of Metal performance and resource metrics in iOS and macOS 27; look-back trace collection via the metalperftrace command-line tool (collect --last, overview, --json, --include-state-transitions) on macOS and the Control Center Performance Trace button on iOS; analysis in Instruments; the StateReporting API (domains as finite state machines, reportTransition, stable and volatile metadata, reportVolatileMetadataUpdate) and its integration with the Metal Performance HUD, metalperftrace, and Instruments; and MetricKit’s Metal frame-rate reporting broken down by StateReporting states. 

  4. Apple, WWDC26 session 328, “Explore numerical computing in Swift with MLX”. MLX Swift as a NumPy-like n-dimensional array framework; lazy evaluation building a compute graph that runs on eval or value read; grad for automatic differentiation; GPU execution by default with up to 10x over scalar CPU code; the Mandelbrot, heat-distribution (Jacobi and SOR via conv2d), and curve-fitting examples; the broader toolkit (linear algebra, FFTs, convolutions, reductions, scans, random number generation); and MLX’s open-source MIT license with Swift, Python, C++, and C front-ends. See also Apple Machine Learning Research, MLX Swift.
