RealityKit And The Spatial Mental Model

The post names the spatial mental model RealityKit asks for. SwiftUI on visionOS gets you a window in 3D space. RealityKit gets you the rest of the room. The difference is the mental model, and the mental model is the architecture.

A SwiftUI developer arriving at visionOS for the first time tends to build the same shapes they would on iOS: a WindowGroup with views inside. The result is a flat panel hovering in space. The panel is fine for a lot of apps. The room around the panel is what visionOS adds, and the room is RealityKit’s territory.

RealityKit is not SwiftUI in 3D. The architecture is different in shape, not just dimension. SwiftUI is a value-typed view tree with a result-builder DSL on top of an observation system; the framework’s job is to compute the next render from current state. RealityKit is an entity-component-system (ECS) with a scene graph rooted in anchors that bind virtual entities to real-world reference points; the framework’s job is to maintain a 3D scene as the user and the world move around it. The same Swift project might use both at the same time, and the boundary between them is where most spatial-design mistakes happen.

This post walks through the spatial mental model: five concrete ways it differs from a SwiftUI window, and the routing question of when each framework is the right answer.

TL;DR

  • RealityKit is an entity-component-system (ECS). An entity is a node; components are typed data attached to the node; systems run logic over entities that have specific component combinations.
  • Anchors bind virtual entities to real-world reference points. RealityKit exposes anchoring targets through AnchoringComponent.Target (.world, .head, .hand(_:location:), .plane(...), .image(...), .referenceObject(...)). On visionOS, ARKit supplies the backing anchor structs (WorldAnchor, HandAnchor, ImageAnchor, ObjectAnchor, PlaneAnchor) that ARKit reports through anchor-update streams.
  • A 3D scene is not a view tree. The render loop is continuous, the scene graph mutates over time, and the rendering layer is GPU-driven (Metal underneath) rather than diff-based.
  • RealityView is the SwiftUI bridge. It places a RealityKit scene inside a SwiftUI view tree; the boundary is one direction (SwiftUI hosts RealityKit, not the other way around).
  • The routing rule: if the user wants a window, ship SwiftUI. If the user wants the room, ship RealityKit. Apps that need both put a RealityView inside a SwiftUI window and accept that the two halves coordinate explicitly.

Five Differences From A SwiftUI Window

The shape changes the moment you place anything outside the panel. Five differences matter.

1. Entity-Component-System, Not View-Body-State

A SwiftUI view is a value-typed type with a body computed property and property-wrapper-backed state.1 The framework re-runs the body when state changes; the diff feeds into the renderer.

A RealityKit entity is a reference-typed object that sits in a scene graph. Components are typed structs attached to entities (ModelComponent, Transform, CollisionComponent, PhysicsBodyComponent, custom Component types you define).2 Systems are types conforming to the System protocol; the framework runs each registered system once per frame, and System.update(context:) is where the system reads and writes components on the entities that match its query.

import RealityKit

let cube = Entity()
// The mesh and material are components, not properties of a subclass.
cube.components.set(ModelComponent(
    mesh: .generateBox(size: 0.1),
    materials: [SimpleMaterial(color: .blue, isMetallic: false)]
))
// Input opt-in: both of these are required before gestures can reach the cube.
cube.components.set(InputTargetComponent())
cube.components.set(CollisionComponent(shapes: [.generateBox(size: [0.1, 0.1, 0.1])]))

The cube does not have a body. The cube has a set of components. Adding InputTargetComponent and CollisionComponent is what makes the cube respond to gestures; remove them and gestures pass straight through to the entity behind it. Adding PhysicsBodyComponent is what makes the cube fall under gravity; remove it and the cube floats. The composition of components determines the entity’s behavior.

The mental shift: in SwiftUI, you describe what should be on screen as a function of state. In RealityKit, you describe what an entity is (its components) and let the systems decide what happens to it.
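
A custom component is nothing more than a type conforming to Component. A minimal sketch, assuming a hypothetical SpinComponent that marks entities a later system should rotate (the name and fields are illustrative, not an Apple API); the system that consumes it appears in section 3, where the per-frame loop is the topic:

import RealityKit

// Hypothetical component: pure data describing what the entity is, not what it does.
struct SpinComponent: Component {
    var radiansPerSecond: Float = .pi
}

// Register custom component types once, e.g. at app launch.
SpinComponent.registerComponent()

// Composition: the cube from above becomes "a spinnable thing" by gaining a component.
cube.components.set(SpinComponent(radiansPerSecond: .pi / 2))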

2. Anchors, Not Coordinates

A SwiftUI view’s coordinate system is the window. Position 0,0 is the top-left corner; positions are in points; the view’s frame is its space. The window is the universe.

A RealityKit scene’s coordinate system depends on its anchor. The anchoring layer has two faces. RealityKit’s AnchorEntity (driven by an AnchoringComponent.Target) is what you place an entity onto; ARKit’s anchor structs are the backing data the system uses to keep the target in sync with the real world.3

The RealityKit anchoring targets you reach for inside an AnchorEntity or AnchoringComponent are:

  • .world(transform:): a point in real-world space defined by a transform
  • .head: locked to the user’s head pose; the entity follows the user’s gaze
  • .hand(_:location:): locked to a specific hand joint (palm, fingertip, wrist, etc.)
  • .plane(...): locked to a detected horizontal or vertical surface (table, wall, floor)
  • .image(...) / .referenceImage(...): locked to a recognized 2D image in the environment
  • .referenceObject(...): locked to a recognized real-world 3D object

On visionOS, ARKit supplies the backing anchor data through WorldAnchor, HandAnchor, ImageAnchor, ObjectAnchor, and PlaneAnchor structs delivered through ARKitSession anchor-update providers. (Body tracking is iOS/iPadOS-only on Apple’s body-tracking surface; visionOS does not expose .body as a RealityKit anchoring target.)

The anchor is what makes the scene “real.” A virtual chess board placed on a .plane target for the user’s table stays on the table when the user walks around it; a chess board placed at fixed coordinates relative to a .head target follows the user’s head and feels like a hallucination.

The mental shift: position is not a number. Position is two questions: what is the anchor, and where does the entity sit relative to it? A virtual object without a sensible anchor is a hallucination; the anchor is what tells the user the object belongs to the room.
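
A sketch of the two chess boards from the example above, assuming "ChessBoard" is an illustrative asset name; the plane-anchored version is the one that feels like it belongs to the room:

import RealityKit

// Anchored to a detected horizontal table surface: the board stays on the table.
let tableAnchor = AnchorEntity(.plane(.horizontal,
                                      classification: .table,
                                      minimumBounds: [0.4, 0.4]))
if let board = try? Entity.load(named: "ChessBoard") {
    tableAnchor.addChild(board)
}
// Add tableAnchor to the RealityView content (or scene) to activate the anchoring.

// Anchored to the head: the same board follows the wearer everywhere they look.
let headAnchor = AnchorEntity(.head)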

3. Continuous Render Loop, Not Diff-Based

SwiftUI renders when state changes. The framework decides when a re-render is needed and computes the minimum tree change. Between renders, the screen is static.

RealityKit drives a frame-based simulation and rendering loop. The scene graph mutates over time as physics, animation systems, and input handlers update entity transforms and component values, and the renderer (Metal-backed) draws the active scene each frame. Per-frame logic lives in System.update(context:); that hook is the framework’s invitation to mutate the scene every tick.4

The mental shift: time is part of the scene. A SwiftUI view body that runs once is fine; a RealityKit entity needs to consider what happens on frame N+1, N+2, N+3. The update(context:) method on a custom System is where you write per-frame logic; the Component value you mutate inside update is what the renderer reads on the next pass.
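
A minimal sketch of a system that consumes the hypothetical SpinComponent from section 1; the query, the per-frame update, and the deltaTime scaling are the whole pattern:

import RealityKit
import simd

// Hypothetical system: rotates every entity that carries a SpinComponent.
struct SpinSystem: System {
    static let query = EntityQuery(where: .has(SpinComponent.self))

    init(scene: RealityKit.Scene) {}

    func update(context: SceneUpdateContext) {
        // Runs once per frame; deltaTime is the elapsed time since the last tick.
        for entity in context.entities(matching: Self.query, updatingSystemWhen: .rendering) {
            guard let spin = entity.components[SpinComponent.self] else { continue }
            let step = simd_quatf(angle: spin.radiansPerSecond * Float(context.deltaTime),
                                  axis: [0, 1, 0])
            entity.transform.rotation = step * entity.transform.rotation
        }
    }
}

// Register once, e.g. at app launch, alongside SpinComponent.registerComponent().
SpinSystem.registerSystem()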

4. RealityView Is The Bridge, In One Direction

SwiftUI views compose other SwiftUI views. RealityKit entities compose other RealityKit entities. The boundary between the two is RealityView, a SwiftUI view type that hosts a RealityKit scene.5

import SwiftUI
import RealityKit

struct ContentView: View {
    var body: some View {
        RealityView { content in
            // Build the scene
            let cube = Entity()
            cube.components.set(ModelComponent(
                mesh: .generateBox(size: 0.1),
                materials: [SimpleMaterial(color: .blue, isMetallic: false)]
            ))
            content.add(cube)
        } update: { content in
            // Mutate the scene in response to SwiftUI state changes
        }
    }
}

The make closure runs once when the view first appears. The update closure runs whenever SwiftUI state the view depends on changes. Inside both closures, you have access to the RealityKit scene through the content parameter; you add entities, mutate transforms, register systems.

The boundary is one-directional. SwiftUI hosts RealityKit. RealityKit does not host SwiftUI; you cannot put a SwiftUI view “inside” an entity and expect it to render as part of the 3D scene the way a SwiftUI subview renders inside a parent. The exception is attachments (in RealityView’s attachments: parameter): you declare named SwiftUI views, retrieve them as ViewAttachmentEntity values, and position and scale them inside the 3D scene like any other entity.5 Attachments are not embedded SwiftUI views inside an entity; they are 2D SwiftUI surfaces wrapped as entities the renderer can place in 3D. By default they hold a fixed orientation; if you want them to face the wearer, attach a BillboardComponent to the attachment entity.
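
A minimal sketch of the attachments path, assuming a hypothetical label view; the attachment is declared as SwiftUI, retrieved as an entity, and positioned like any other child:

import SwiftUI
import RealityKit

struct LabeledCubeView: View {
    var body: some View {
        RealityView { content, attachments in
            let cube = ModelEntity(
                mesh: .generateBox(size: 0.1),
                materials: [SimpleMaterial(color: .blue, isMetallic: false)]
            )
            content.add(cube)

            // Retrieve the declared SwiftUI view as a ViewAttachmentEntity.
            if let label = attachments.entity(for: "label") {
                label.position = [0, 0.12, 0]               // 12 cm above the cube
                label.components.set(BillboardComponent())  // keep it facing the wearer
                cube.addChild(label)
            }
        } attachments: {
            Attachment(id: "label") {
                Text("Blue cube").padding()
            }
        }
    }
}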

5. Gestures In 3D Are Different

SwiftUI gestures (.onTapGesture, DragGesture, etc.) operate in screen space. The system knows where the finger is relative to the view; the framework dispatches based on hit-testing in 2D.

RealityKit gestures operate in scene space.6 The system knows where the user is looking (the gaze ray), where the user’s hands are (hand-tracking joints), and which entity the gaze plus pinch intersects. The dispatch model is “the user looked at this entity and pinched,” the spatial equivalent of a tap.

For an entity to receive gestures, it needs InputTargetComponent and a CollisionComponent that defines the hit-test geometry. Without InputTargetComponent, the entity is invisible to the gesture system; without CollisionComponent, the gesture system has no shape to hit-test against. Both have to be present.
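
A minimal sketch of the opt-in pair together with the SwiftUI-side gesture that targets entities; the recolor in the handler is illustrative, not the point:

import SwiftUI
import RealityKit

struct TappableCubeView: View {
    var body: some View {
        RealityView { content in
            let cube = ModelEntity(
                mesh: .generateBox(size: 0.1),
                materials: [SimpleMaterial(color: .blue, isMetallic: false)]
            )
            // The input opt-in pair: without both, the tap never arrives.
            cube.components.set(InputTargetComponent())
            cube.components.set(CollisionComponent(shapes: [.generateBox(size: [0.1, 0.1, 0.1])]))
            content.add(cube)
        }
        // Look-and-pinch dispatch: the gesture targets an entity, not a screen region.
        .gesture(
            TapGesture()
                .targetedToAnyEntity()
                .onEnded { value in
                    value.entity.components[ModelComponent.self]?.materials =
                        [SimpleMaterial(color: .red, isMetallic: false)]
                }
        )
    }
}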

The mental shift: gesture targets are not screen regions. Gesture targets are 3D entities that have explicitly opted in to input. The rest of the scene is “decoration the user can look through.”

When SwiftUI Is Enough

Plenty of visionOS apps do not need RealityKit at all. Three patterns where SwiftUI alone is the right answer:

Window-shaped apps with no spatial content. A meditation timer, a notebook, a settings pane, a chat interface. The app is information you read or interact with through 2D affordances. Putting it in a WindowGroup and keeping it flat is the right call. visionOS treats SwiftUI windows as floating glass panels with system chrome; the user gets a comfortable reading experience without you writing a line of RealityKit.

Multi-window apps that compose flat panels in space. A code editor with separate windows for the editor, the terminal, and the documentation. The user wants the windows arranged in 3D space (to the left, to the right, behind), but each window is itself a SwiftUI view. The 3D arrangement is the OS’s job; the panels are flat.

Documentation viewers, photo galleries, video players. Content the user consumes through the panel. The panel is the rendering surface; the third dimension is just the panel’s spatial position in the room.

The rule: if the content is 2D (text, images, video, controls), the right framework is SwiftUI. The third dimension is where the panel is positioned, not what is rendered inside it.

When RealityKit Is Required

The cases where SwiftUI is not enough:

3D content the user can walk around. A virtual object on the user’s table (a model car, a sculpture, a building). The object has volume; the user can move around it; the object should occlude correctly with the room. The right framework is RealityKit, anchored to a .plane target.

Spatial UI that responds to the room. Buttons that float above a real-world keyboard, annotations attached to a real-world object, a virtual tape measure laid along a real wall. The UI’s position is determined by world geometry, not by a window’s coordinate space. RealityKit anchors do the binding; SwiftUI attachments inside a RealityView provide the 2D affordances.

Continuous spatial simulation. A flock of birds, a ball rolling across the floor, a fluid simulation, anything where the scene state evolves over time. The continuous render loop is the right tool; SwiftUI’s diff-based renderer would either miss frames or burn battery.

Hand-tracking interactions. Pinch-to-grab, two-handed scaling, drawing in mid-air. The input model requires ARKit’s HandTrackingProvider (with HandAnchor updates) plus a .hand(_:location:) anchor target; SwiftUI does not expose that surface. A sketch of the ARKit side follows this list.

Body-tracked AR. Mirroring the user’s pose onto a virtual character, tracking the user’s body for a fitness app, recognizing real-world objects. The capture and inference happen in ARKit (RealityKit’s lower-level companion); RealityKit renders the result.

The rule: if the content is 3D and lives in the room (volumetric, anchored, simulated, or hand-driven), the right framework is RealityKit. SwiftUI is the chrome around it.
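
The hand-tracking case has the least obvious entry point from the SwiftUI side. A minimal sketch of the ARKit half on visionOS, assuming the app has already been granted hand-tracking authorization:

import ARKit

// Stream hand-anchor updates; each update carries a HandAnchor with joint transforms.
let session = ARKitSession()
let handTracking = HandTrackingProvider()

func trackHands() async throws {
    try await session.run([handTracking])
    for await update in handTracking.anchorUpdates {
        let anchor = update.anchor                 // HandAnchor
        guard anchor.isTracked,
              let fingertip = anchor.handSkeleton?.joint(.indexFingerTip) else { continue }
        // anchor.originFromAnchorTransform * fingertip.anchorFromJointTransform
        // gives the fingertip pose in world space; drive an entity from it.
        _ = fingertip
    }
}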

The Composition Pattern

Most non-trivial visionOS apps end up using both. The pattern that ships well:

  1. The app’s chrome (settings, navigation, lists, forms, inspector panels) lives in SwiftUI windows.
  2. The spatial scene (the volumetric content the user manipulates) lives in a RealityView inside its own window or volume.
  3. The two communicate through SwiftUI state. A button in a SwiftUI panel toggles a @State boolean; the RealityView’s update: closure reads the boolean and mutates the entity in the scene.
  4. RealityKit-side state changes that need to surface to SwiftUI go through callbacks the RealityView’s make: closure registers (subscribe(to:) on the scene’s event publisher).7 A sketch of that direction follows the gallery example below.

struct GalleryView: View {
    @State private var selectedSculpture: SculptureID?

    var body: some View {
        HStack {
            // SwiftUI side: list of sculptures
            List(allSculptures) { sculpture in
                Button(sculpture.name) {
                    selectedSculpture = sculpture.id
                }
            }
            .frame(width: 300)

            // RealityKit side: 3D rendering of the selected sculpture
            RealityView { content in
                // Build initial scene
            } update: { content in
                guard let id = selectedSculpture else {
                    content.entities.removeAll()
                    return
                }
                // Mutate scene to show the selected sculpture
                presentSculpture(id, in: content)
            }
        }
    }
}

The split is honest about which framework owns which job. SwiftUI owns the list, the buttons, the layout, the state. RealityKit owns the 3D rendering, the entities, the continuous simulation. The state crosses the boundary as a single @State value; neither framework reaches into the other.
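
Item 4 in the list above is the one the gallery example does not show. A minimal sketch of the reverse direction, assuming a collision event is the RealityKit-side change worth surfacing (the view and property names are illustrative):

import SwiftUI
import RealityKit

struct CollisionReportingView: View {
    @State private var lastCollision = ""
    @State private var subscription: EventSubscription?

    var body: some View {
        VStack {
            Text(lastCollision.isEmpty ? "No collisions yet" : "Hit: \(lastCollision)")

            RealityView { content in
                // ... build a scene whose entities carry CollisionComponents ...

                // RealityKit -> SwiftUI: register the callback in make:,
                // write the @State value in the handler, retain the subscription.
                subscription = content.subscribe(
                    to: CollisionEvents.Began.self, on: nil, componentType: nil
                ) { event in
                    lastCollision = event.entityA.name
                }
            }
        }
    }
}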

What I Would Build Differently In My Stack

Three patterns that become easy to recognize once you have shipped a spatial scene:

Anchor-first, entity-second. When designing a feature, decide the anchor before the geometry. A virtual instrument anchored to the user’s hand is a different product than the same instrument anchored to a .plane target on the table. The anchor decides the user’s relationship with the object; the geometry is implementation detail.

Components, not subclasses. It is tempting to subclass Entity to build domain types like ChessPiece: Entity. The composition pattern beats inheritance every time: a chess piece is an Entity with a ChessPieceComponent (custom data: color, type, position), a ModelComponent (the 3D mesh), an InputTargetComponent, and a CollisionComponent. New behaviors are new components, not new subclasses.
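
A sketch of the chess piece as composition, assuming illustrative field names and a placeholder mesh; the point is that the domain data is just another component next to the standard ones:

import RealityKit

// Hypothetical domain component: the chess-specific data, nothing else.
struct ChessPieceComponent: Component {
    enum Kind { case pawn, knight, bishop, rook, queen, king }
    enum Side { case white, black }
    var kind: Kind
    var side: Side
    var square: String   // e.g. "e4"
}

func makePiece(_ kind: ChessPieceComponent.Kind,
               side: ChessPieceComponent.Side,
               at square: String) -> Entity {
    let piece = Entity()
    piece.components.set(ChessPieceComponent(kind: kind, side: side, square: square))
    piece.components.set(ModelComponent(
        mesh: .generateBox(size: [0.03, 0.05, 0.03]),   // placeholder mesh
        materials: [SimpleMaterial(color: side == .white ? .white : .black, isMetallic: false)]
    ))
    piece.components.set(InputTargetComponent())
    piece.components.set(CollisionComponent(shapes: [.generateBox(size: [0.03, 0.05, 0.03])]))
    return piece
}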

Systems for cross-cutting logic. When ten entities need the same behavior (gravity, collision response, audio attenuation, gesture state), write a System that operates on the relevant components. The system runs once per frame across all matching entities. The alternative (putting the logic on each entity) produces the per-entity, per-frame duplication the ECS pattern was invented to avoid.

When RealityView Is The Wrong Answer

A few cases where reaching for RealityView is the wrong call:

Single 3D image, no interaction. A static 3D logo or product render. Use a SwiftUI Model3D view instead.8 Model3D is the cheap path for “load a USDZ and display it”; RealityView is for scenes you build and mutate.
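
A sketch of the cheap path, assuming a bundled USDZ named "CompanyLogo" (the asset name is illustrative):

import SwiftUI
import RealityKit

struct LogoView: View {
    var body: some View {
        // One line of content: load the bundled USDZ and display it. No scene graph, no ECS.
        Model3D(named: "CompanyLogo")
    }
}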

iOS apps with simple AR overlays. RealityKit’s ARView (the older, UIKit-hosted surface) is often the right call when the AR experience is one feature inside a larger iOS app. RealityView is SwiftUI-native and lives well inside SwiftUI; the ARView workflow is sometimes simpler when the rest of the app is UIKit.

2D drawing on a panel. A whiteboard, a photo annotation tool, a flat shape editor. The right tool is Canvas (SwiftUI’s GPU-accelerated immediate-mode drawing view) or a wrapped MTKView when you need raw Metal. RealityView is overkill if you are not building in 3D space.

What The Pattern Means For Apps Shipping On visionOS 2+

Three takeaways.

  1. RealityKit and SwiftUI compose; they do not collapse. Use SwiftUI for window-shaped chrome and 2D affordances; use RealityKit for the room-shaped 3D content. The boundary is RealityView, and the boundary is one-directional.

  2. The mental model is ECS plus anchors. An entity is what it is composed of. An anchor decides how the entity relates to the user’s real space. The pair (components, anchor) is the design unit.

  3. The render loop is continuous. Time is part of the scene. Per-frame logic goes in System.update(context:); per-state-change logic goes in RealityView.update:. Mixing the two layers (writing per-frame logic in SwiftUI’s body, writing state-driven logic in System.update) is the most common architecture mistake.

The full Apple Ecosystem cluster: typed App Intents for Apple Intelligence; MCP servers for cross-LLM agents; the routing question between them; Foundation Models for on-device LLM and the Tool protocol; Live Activities for the iOS Lock Screen state machine; the watchOS runtime contract on Apple Watch; SwiftUI internals for the framework substrate; Liquid Glass patterns for the visual layer; multi-platform shipping for cross-device reach. The hub is at the Apple Ecosystem Series. For broader iOS-with-AI-agents context, see the iOS Agent Development guide.

FAQ

Is RealityKit a replacement for SwiftUI on visionOS?

No. RealityKit and SwiftUI compose. SwiftUI handles 2D windows, controls, and chrome; RealityKit handles 3D scenes anchored to real-world reference points. Most non-trivial visionOS apps use both, with RealityView as the bridge that places a RealityKit scene inside a SwiftUI view tree.

When should I use RealityView vs Model3D?

Use Model3D for displaying a single static 3D asset (a USDZ file, a single product render). Use RealityView for building or mutating a 3D scene over time (multiple entities, physics, gestures, hand tracking, anchored content). Model3D is the cheap path; RealityView is the full ECS surface.

What’s the difference between an Entity and a Component in RealityKit?

An entity is a node in the scene graph. A component is typed data attached to the node. ModelComponent gives the entity a mesh; InputTargetComponent makes it gesture-eligible; CollisionComponent defines hit-test geometry; PhysicsBodyComponent makes it respond to gravity. Custom Component types you define hold domain data. Behavior is composition over inheritance: an entity’s behavior is the sum of its components.

What are anchors and why do they matter?

Anchors bind virtual content to real-world reference points: the user’s head, hand, a detected surface, a recognized image, a recognized object, or a persistent world point. The anchor decides the user’s relationship with the entity. A virtual object on a .plane target (the table) stays put when the user walks around; a virtual object on a .head target follows the user’s head. Picking the right anchor is the first design decision in a spatial feature.

Can RealityKit run on iOS, not just visionOS?

Yes. RealityKit ships on iOS, iPadOS, macOS, and visionOS. ARKit-driven AR experiences use RealityKit’s iOS surface. The visionOS surface adds spatial-specific anchor types (head, hand, world) that iOS does not expose; the core ECS pattern is shared.

References


  1. Author’s analysis in What SwiftUI Is Made Of, April 30, 2026, covering the value-typed view tree, result-builder DSL, and observation system. 

  2. Apple Developer, “RealityKit” and “RealityKit Systems”. The Entity / Component / System architecture and the standard component types (ModelComponent, Transform, CollisionComponent, PhysicsBodyComponent, InputTargetComponent). 

  3. Apple Developer, “AnchorEntity”, “AnchoringComponent”, “Scene content anchors”, and ARKit’s “Anchor”. RealityKit anchoring targets (.world, .head, .hand(_:location:), .plane, .image, .referenceObject) and the ARKit anchor structs that supply the backing data on visionOS (WorldAnchor, HandAnchor, ImageAnchor, ObjectAnchor, PlaneAnchor). 

  4. Apple Developer, “RealityKit Systems” and the WWDC 2024 session “Build a great visionOS app”. RealityKit’s frame-driven simulation and rendering, plus the System.update(context:) per-frame hook. 

  5. Apple Developer, “RealityView”, “RealityViewAttachments”, and “BillboardComponent”. The SwiftUI bridge into RealityKit, the ViewAttachmentEntity retrieval pattern, and the optional billboard behavior when a 2D attachment should face the wearer. 

  6. Apple Developer, “Adding 3D content to your app” and “InputTargetComponent”. Gesture dispatch in spatial scenes; the role of InputTargetComponent and CollisionComponent as the input opt-in pair. 

  7. Apple Developer, “Scene” and the subscribe(to:on:_:) Combine-based event publisher that lets RealityKit-side state changes surface back to SwiftUI through callbacks registered in the make: closure. 

  8. Apple Developer, “Model3D”. The SwiftUI view for displaying a model asset; the cheap path before reaching for the full RealityKit ECS surface. 
