Apple Silicon TBDR: What App Developers Actually Get

Apple silicon GPUs do not render the way other GPUs render. Apple’s Metal documentation describes the architecture by name: “The GPUs in Apple silicon implement a rendering technique called tile-based deferred rendering (TBDR) that optimizes performance and power efficiency.”1 The TBDR shape is the reason the Metal 4 API, the on-device ML stack, and the imageblock-and-tile-shader programming model exist as they do.

The sections below walk through the four Apple-documented features TBDR enables and what each buys an app: imageblocks, tile shaders, raster order groups, and the enhanced multisample antialiasing implementation. The earlier post on Metal 4 essentials covered the core API surface; the focus here is the GPU substrate that surface targets.

TL;DR

  • TBDR breaks the render destination into tiles, runs many in parallel on separate GPU cores, and defers shading until after all geometry is evaluated for each tile.1
  • Tile memory has bandwidth many times faster than device memory, latency many times lower, and significantly lower energy cost.1
  • A11 and later Apple GPUs add imageblocks, tile shading, raster order groups, and imageblock sample coverage control. Apps reach all of them through Metal.1
  • Imageblocks let an app define custom per-pixel data structures in tile memory, persist data across draws and dispatches, and mix render with compute work in a single pass.1
  • Raster order groups synchronize fragment threads that target the same pixel, removing the read-modify-write race that breaks order-dependent blending.1

What TBDR Actually Is

Apple’s framing, verbatim: “The GPU breaks up the render destination into a grid of smaller regions, called tiles. It processes each tile with one of its GPU cores, often running many at the same time. The GPU defers, or postpones, the rendering phase for each tile until after it evaluates all the geometry for that tile.”1

The contrast with immediate-mode (IM) GPUs is also Apple’s framing: “An IM GPU fully processes primitives, such as lines and triangles, regardless of whether or not they’re visible in the rendering.”1 TBDR avoids that work by gathering all geometry for a tile first, then shading only what survives occlusion. Apple states the win directly: “A TBDR GPU avoids doing unnecessary work by processing all of the geometry of a render pass at the same time and shading only the visible primitives.”1

Tile memory is the payoff. Apple describes its advantages over device memory:1

  • “Bandwidth that’s many times faster than device memory”
  • “Access latency that’s many times lower than device memory”
  • “Energy consumption that’s significantly less than accessing device memory”

Two render passes can also overlap on the hardware. Apple notes: “While the GPU runs the final stages of a render pass to tile memory, it can start the vertex stage of a future render pass. The GPU can use more hardware blocks at the same time by running both stages in parallel because they tend to use different compute and memory components.”1

That’s the substrate. Everything below uses it.

Imageblocks: Custom Per-Pixel Data In Tile Memory

Apple’s definition of an imageblock: “Imageblocks are tiles of structured image data stored in local memory, allowing you to describe image data in tile memory that Apple GPUs can manipulate efficiently.”1 They are 2D data structures with a width, height, and pixel depth, and “each pixel in an imageblock can consist of multiple components, and you can address each component as its own image slice.”1 Apple’s example: an imageblock that holds three image slices for albedo, specular, and normal components.

The shape Apple documents:1

  • Available to both kernel and fragment functions.
  • Persist for the lifetime of a tile, across draws and dispatches.
  • Existing render code automatically creates imageblocks that match render attachment formats.
  • Apps can define custom imageblocks in shaders with additional channels, arrays, and nested structures.
  • A fragment shader sees only the imageblock data at that fragment’s position; a compute function thread can access the whole imageblock.

Persistence across draws and dispatches is the operationally interesting part. Apple’s framing: “Imageblock persistence means that you can mix render and compute operations in a single rendering pass with tile shaders, where both can access the same local memory. You can create sophisticated algorithms that remain in local GPU memory by keeping multiple operations within a tile.”1

For an app shipping a multi-stage rendering pipeline (deferred shading, screen-space effects, custom blending), keeping intermediate results in tile memory instead of round-tripping them through device memory is where TBDR hands back per-frame budget.
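A minimal Metal Shading Language sketch of the custom-imageblock idea, following Apple's albedo/specular/normal example. The struct and field names are illustrative, not an Apple-defined type:

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical per-pixel layout kept in tile memory for a deferred pipeline.
// Each [[color(n)]] field is its own addressable imageblock slice.
struct GBufferData
{
    half4 albedo_specular [[color(1)]];  // slice: surface color + specular power
    half4 normal          [[color(2)]];  // slice: world-space normal
    float depth           [[color(3)]];  // slice: eye-space depth
};

// A fragment function sees only the imageblock data at its own position:
// it returns the struct, and the GPU writes it into tile memory.
fragment GBufferData gbuffer_fragment(float4 pos [[position]])
{
    GBufferData out;
    out.albedo_specular = half4(1.0h);                 // illustrative values
    out.normal          = half4(0.0h, 1.0h, 0.0h, 0.0h);
    out.depth           = 1.0f;
    return out;
}
```

When the render pipeline's attachments match these formats, this is the "existing render code automatically creates imageblocks" case from the list above; the custom-struct route becomes interesting once an app adds channels or nesting beyond the attachment formats.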

Tile Shaders: Render And Compute, Same Pass

Apple’s framing of tile shaders: “Tile shaders are compute or fragment functions that execute as part of a render pass. They allow your app to compute and save data to tile memory that’s persistent on the GPU between render passes.”1

The traditional GPU model is what tile shaders sidestep. Apple’s words: “Traditional GPUs separate rendering and compute commands into distinct passes. These passes typically can’t communicate directly with each other. Apps work around this limitation by saving the results from one pass into device memory and then loading that data back for the next pass. In some scenarios, such as in a multiphase rendering algorithm, apps may copy intermediate data to device memory many times.”1

Tile shaders move that intermediate data into tile memory. Apple’s documented payoff: “Apps that use tile shaders can avoid storing intermediate results out to device memory and save time by storing data in faster tile memory.”1
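A sketch of what that looks like in the Metal Shading Language: a compute function that runs as a tile stage inside the render pass and operates on the imageblock in place. The struct, kernel name, and the operation itself are illustrative:

```metal
#include <metal_stdlib>
using namespace metal;

struct TileData
{
    half4 color [[color(0)]];
};

// A tile-stage kernel: a compute function dispatched inside a render pass.
// Unlike a fragment thread, it can address the whole imageblock, and the
// data it writes stays in tile memory for later draws in the same pass.
kernel void adjust_tile(imageblock<TileData, imageblock_layout_implicit> blk,
                        ushort2 tid [[thread_position_in_threadgroup]])
{
    TileData d = blk.read(tid);           // read this thread's texel from tile memory
    d.color = min(d.color, half4(1.0h));  // illustrative in-tile operation
    blk.write(d, tid);                    // write back; no device-memory round trip
}
```

Host-side, such a kernel is bound through a tile render pipeline (`MTLTileRenderPipelineDescriptor`) and launched with `dispatchThreadsPerTile(_:)` on the same render command encoder as the surrounding draws.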

For Metal 4 apps, tile shaders pair with the unified MTL4ComputeCommandEncoder design covered in the Metal 4 essentials post. The encoder unification and the tile-shader programming model are the same architectural decision read at two layers: collapse render-vs-compute boundaries that exist on traditional GPUs because the Apple GPU hardware does not need them.

Raster Order Groups: Ordering Concurrent Fragment Threads

The problem raster order groups solve, in Apple’s words: “Metal guarantees the GPU blends in draw call order, giving the illusion that the GPU renders the scene sequentially. … The fragment shaders for each triangle run concurrently on their own thread. The fragment shader for the rear triangle might not execute before the fragment shader for the front triangle, which can be a problem for a shader that needs the results from another triangle’s shader for its custom blending function. Because of concurrency, this read-modify-write sequence can create a race condition.”1

The mechanism: “Raster order groups overcome this access conflict by synchronizing threads that target the same pixel coordinates and sample (if you activate per-sample shading).”1

The implementation surface: “To implement raster order groups, annotate pointers to memory with an attribute qualifier. Shaders that access pixels through those pointers go in per-pixel submission order. The hardware waits for any older fragment shader threads that overlap the current thread to finish before the current thread proceeds.”1
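A hedged sketch of that surface: a fragment function doing its own read-modify-write into a device buffer, with the attribute qualifier on the pointer. `kFramebufferWidth` and the blend math are placeholders:

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical framebuffer width, supplied as a function constant.
constant uint kFramebufferWidth [[function_constant(0)]];

// Without the raster_order_group qualifier, this read-modify-write races
// between overlapping fragment threads. With it, accesses through `accum`
// happen in per-pixel submission order: the hardware stalls this thread
// until older fragment threads covering the same pixel have finished.
fragment void ordered_blend(float4 pos [[position]],
                            device half4 *accum [[buffer(0), raster_order_group(0)]])
{
    uint idx = uint(pos.y) * kFramebufferWidth + uint(pos.x);
    half4 dst = accum[idx];                  // read: waits on older overlapping threads
    accum[idx] = dst * 0.5h + half4(0.25h);  // illustrative custom blend
}
```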

Recent Apple GPUs extend the mechanism. Apple’s words: “Metal on recent Apple GPUs extends raster order groups with additional capabilities. They allow you to synchronize individual channels of an imageblock and threadgroup memory. You can also create multiple order groups, which give you finer-grained synchronization and minimize how often your threads wait for access.”1

Apple’s worked example is deferred shading. The traditional two-phase approach writes a g-buffer of multiple textures to device memory, then reads them back for the lighting phase. Apple’s framing: “You can eliminate the need for the intermediate textures by using multiple order groups to coalesce both render phases into one. To do that, keep the geometry buffer in tile-sized chunks so they can remain in local imageblock memory.”1

The split Apple recommends:1

  • First order group: the three g-buffer fields (albedo, normal, depth).
  • Second order group: the accumulated lighting result.
  • “Apple GPUs can order the two groups separately so that outstanding writes into the second group don’t impede the reads from the first group.”1

The threads still synchronize at the end of execution to accumulate the lighting result; the win is that the nonconflicting g-buffer reads now run concurrently rather than serially.
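The split above can be sketched as an imageblock struct whose fields carry the two order-group annotations. This follows the shape of Apple's deferred shading example, but the names and the lighting math are illustrative:

```metal
#include <metal_stdlib>
using namespace metal;

// Single-pass deferred shading layout: the g-buffer fields sit in order
// group 0, the lighting accumulator in group 1. A lighting fragment can
// read the group-0 fields without waiting on other lights' outstanding
// writes to the group-1 accumulator.
struct DeferredPixel
{
    half4 albedo [[color(1), raster_order_group(0)]];
    half4 normal [[color(2), raster_order_group(0)]];
    float depth  [[color(3), raster_order_group(0)]];
    half4 light  [[color(0), raster_order_group(1)]];
};

struct LightOut
{
    half4 light [[color(0), raster_order_group(1)]];
};

// Each light volume's fragment shader reads the g-buffer slices in place
// and accumulates into the lighting slice; nothing leaves tile memory.
fragment LightOut accumulate_light(DeferredPixel px)
{
    half3 n = normalize(px.normal.xyz);
    half ndotl = max(n.y, 0.0h);            // illustrative directional term
    LightOut out;
    out.light = px.light + px.albedo * ndotl;
    return out;
}
```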

MSAA That Tracks Unique Samples Per Pixel

Apple’s documented MSAA implementation on A11+ GPUs differs from the textbook description. Apple’s framing: “The hardware tracks whether each pixel contains a primitive’s edge so it runs the per-sample blending only when necessary. If another primitive covers the samples within a pixel, the GPU blends only once for the entire pixel.”1

Apple’s example walks the optimization. A pixel covered by two overlapping triangle edges has three unique colors at four sample positions. Apple’s words: “Apple GPUs prior to A11 blend each of the pixel’s three covered samples. Starting with A11, Apple GPUs blend only twice because two samples share the same color.”1

Color reduction goes further. Apple: “Apple GPUs can reduce the number of unique colors in a pixel. For example, if the GPU renders an opaque triangle on top of the earlier triangles, it represents the pixel by a single color.”1

Apps can extend the implementation with tile shaders. Apple’s documented use case: “You can implement a custom resolve algorithm by modifying the sample coverage data in the tile shaders. For example, consider a complex scene that contains separate render phases for opaque and translucent geometry. You can add a tile shader that resolves the sample data for the opaque geometry before blending the translucent geometry.”1

The tile shader runs on data in local memory and can be part of the opaque geometry phase, keeping the resolve in tile memory rather than round-tripping through a separate pass.

What This Means For App Architecture

Three takeaways that fall out of Apple’s documented surface.

  1. Tile memory is the budget. The four features above (imageblocks, tile shaders, raster order groups, sample coverage) all exist to keep work in tile memory and out of device memory. Apple’s documented numbers: bandwidth many times faster than device memory, latency many times lower, energy significantly less.1 An app architecture that respects that budget runs faster and cooler than one that does not.

  2. Render and compute are not different worlds. Apple’s GPU does not split render and compute into distinct passes the way traditional GPUs do. Imageblock persistence and tile shaders let an app run multi-phase algorithms inside a single render pass. The Metal 4 unified compute encoder is the API-level expression of the same architectural fact.

  3. Concurrency is the default; ordering is the opt-in. Raster order groups are how an app says “this read-modify-write sequence depends on order.” The default is unordered concurrency, which is the GPU’s natural shape. Apps that need ordered access for blending, transparency, or g-buffer writes annotate the specific pointers and let the hardware sequence the threads.

The full Apple Ecosystem cluster: the Metal 4 core API for the parallel API surface that targets this hardware; the Foundation Models on-device LLM for the framework that runs ML on the same silicon; Core ML on-device inference for the broader ML stack. The hub is at the Apple Ecosystem Series.

FAQ

Is TBDR specific to Metal 4?

No. Apple silicon GPUs have implemented TBDR across many GPU generations; Metal 4 is the new core API surface that targets them. The TBDR features documented here (imageblocks, tile shaders, raster order groups, A11+ sample coverage control) work through Metal across both the original MTL-prefixed API and the MTL4-prefixed Metal 4 types.1

What’s the difference between an imageblock and threadgroup memory?

Apple’s documented distinction: “Threadgroup memory is suitable for unstructured data, but an imageblock is more suitable for image data.”1 Imageblocks carry a 2D structure with a width, height, pixel depth, and named per-pixel components; threadgroup memory is a flat allocation. Apps that need structured image data with addressable slices use imageblocks; apps that need scratch buffers for compute kernels use threadgroup memory.
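A small illustration of the distinction, assuming a tile-stage kernel that takes both; the names are illustrative:

```metal
#include <metal_stdlib>
using namespace metal;

struct Texel { half4 color [[color(0)]]; };

// The same tile-stage kernel can take both: a structured imageblock for
// the image data and a flat threadgroup allocation for scratch values.
kernel void tile_scratch(imageblock<Texel, imageblock_layout_implicit> blk,
                         threadgroup float *scratch [[threadgroup(0)]],
                         ushort2 tid [[thread_position_in_threadgroup]],
                         ushort  lid [[thread_index_in_threadgroup]])
{
    // Structured, 2D-addressed read from an imageblock slice...
    half4 c = blk.read(tid);
    // ...versus an unstructured, flat-indexed scratch write.
    scratch[lid] = float(c.r);
}
```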

Why do raster order groups exist if Metal already guarantees draw-call-order blending?

Metal guarantees the appearance of sequential blending, but the GPU runs fragment shaders concurrently. Apple’s framing: a shader that does its own custom blending against another triangle’s results hits a race condition because the two threads are not actually sequential. Raster order groups are the mechanism that synchronizes only the threads that target the same pixel, leaving the rest concurrent.1

When should I write my own MSAA resolve algorithm?

Apple documents one concrete case: a scene with separate phases for opaque and translucent geometry, where the resolve runs after the opaque phase but before the translucent blending.1 For most apps, the hardware’s built-in MSAA implementation handles the work; custom resolves are a tool for the specific edge cases Apple’s docs describe.

How does Apple’s MSAA optimization save work?

Apple’s hardware tracks the number of unique samples per pixel as it renders new primitives. Apple’s example: a pixel covered by two triangle edges has three unique colors at four sample positions; A11+ GPUs blend twice rather than three times because two samples share a color, and a later opaque triangle reduces the pixel back to a single color.1 The optimization runs at hardware level; apps get it without API changes.

Is the Apple GPU architecture documented anywhere besides the TBDR page?

Apple’s “Apple silicon” topic in the Metal documentation links to the TBDR page that backs this post. Apple’s WWDC sessions on Metal also cover GPU architecture details, and the Metal Shading Language Specification covers the shader-level surface. Apple has not published the underlying silicon-level details (cluster counts, ALU widths, raster engine specifics) for a given Apple GPU generation in developer documentation; treat any such number found outside developer.apple.com as unverified.

References

  1. Apple Developer, “Tailor your apps for Apple GPUs and tile-based deferred rendering”. The TBDR architecture, A11+ enhancements (imageblocks, tile shaders, raster order groups, imageblock sample coverage control), tile memory characteristics, deferred shading worked example, MSAA optimization. Retrieved 2026-05-04.
