Running Agentic AI on the Mac with MLX
At WWDC 2026, an Apple engineer asked a local agent on his Mac to fetch the recent pull requests from the MLX repository, summarize the changes, and flag what needed attention. The model reasoned, called the GitHub CLI, read the diffs, and produced a summary. Only the git commands touched the network; the model ran entirely on his hardware.1 That demo is the whole thesis of this post: the agentic loop, the part where a model decides, calls a tool, observes the result, and decides again, now runs locally on the Mac with MLX. No cloud, no API keys, no per-token cost. And Apple shipped the rest of the story alongside it: how to scale that loop across multiple Macs, how to secure agentic features against a new class of attack, and how to debug the loop when it silently does the wrong thing.
This post walks the four WWDC 2026 sessions that, together, make local agentic AI on the Mac a real engineering surface and not a tech demo. Everything below comes from those sessions directly.
TL;DR
- MLX runs the entire agentic loop locally on the Mac through a four-layer stack: MLX at the foundation, MLX-LM for models, MLX-LM Server as an OpenAI-compatible HTTP server, and any agent that speaks the OpenAI chat completions protocol on top.1
- Setup is three steps:
pip installMLX-LM, runmlx_lm.serverwith a tool-calling model, and point your agent’s base URL at localhost.1 - When one Mac is not enough, MLX distributes a model across several over Thunderbolt 5 using RDMA and Apple’s open-source JACCL library, running trillion-parameter models and speeding up inference and fine-tuning roughly threefold on a four-node cluster.2
- Agentic features open a new attack surface: indirect prompt injection. Apple’s mitigation story leans on deterministic guardrails:
.onToolCallconfirmations and.historyTransformspotlighting in Foundation Models, plus risk-based confirmations and lock-screen authentication in App Intents.3 - The Foundation Models Instrument in Xcode 27 makes the loop observable: per-request lanes, a tree view of the model’s chain of thought, and the metrics (Time to First Token, Tokens per Second, Total Latency) you need to catch silent failures and slow inferences.4
The local agentic stack (Session 232)
Angelos from the MLX team walks through the three-step setup, starting at 2:42.
The chat experience most developers know puts the work back on the human. As the session frames it: “You send a prompt to the language model. The model sends a response back. If you need to act on that response, run a command, check a file, or fix an error, that’s on you.”1 An agent closes that gap. The agent talks to the model to decide what to do, calls tools to do it, observes the results, and goes back to the model for the next step. User to agent, agent to model, agent to tools, cycling until the task is done.
What makes the loop interesting on Apple silicon is that all of it runs locally. MLX presents the capability as four layers, from the bottom up: MLX, “our open-source array framework purpose-built for Apple silicon,” handling computation, Metal acceleration, and memory; MLX-LM to load, run, quantize, and fine-tune models from Hugging Face; MLX-LM Server, “an OpenAI-compatible HTTP server that exposes your local model through a standard API” with structured tool calling and reasoning-model support; and at the top, any agent that speaks the OpenAI chat completions protocol, whether that is Xcode, OpenCode, a Pi agent, or a custom script.1 The standard interface is the load-bearing choice: “any agent framework works out of the box,” and tools like Ollama, LM Studio, and vLLM already build on MLX and MLX-LM.1
Setup is three steps. Install MLX-LM with a single pip install. Start the server with a tool-calling model:
mlx_lm.server --model <a-tool-calling-model>
Then point your agent at the local server by setting its base URL to localhost. As the session notes, “the agent doesn’t know or care that the model is running on your Mac rather than in the cloud.”1 In OpenCode, that means defining a local provider whose URL is localhost and whose model name matches what the server expects, then telling OpenCode to use that local model for everything.
The interesting part is how MLX earns its keep on agentic workloads specifically. The session names three challenges. The first is prompt processing: “Agentic sessions usually comprise hundreds of thousands of tokens and most of those are not generated.”1 Every time the model receives tool output, it processes all that new context before reasoning further, and that cost recurs through the whole loop. The M5 chip’s dedicated Neural Accelerators make matrix multiplication four times faster than on M4, and with MLX’s specialized kernels, “this translates almost exactly to prompt processing speedup,” with no special arguments or code changes required.1 The second challenge is concurrency: agents spawn subagents, and MLX-LM Server handles the simultaneous requests with continuous batching, dynamically grouping them so subagents “don’t stall waiting in a queue.”1 The third is model size, which is where the next session picks up.
Angelos closed with a demo past read-and-report: from a blank Xcode project, he asked the agent to build a SwiftUI drawing app for iPad. The agent inspected the directory, made a plan, wrote the code, and used xcodebuild to compile and fix its own errors, producing a working app in about two minutes, then iterating to add rounded end caps on request.1 A final demo wired the same running MLX server into Xcode’s Intelligence settings as a locally hosted chat provider, so Xcode itself could find and fix an introduced bug. “Local AI means your code never leaves your Mac.”1
Scaling across Macs (Session 233)
Tatiana builds a four-Mac cluster step by step, starting at 2:21.
Eventually one machine runs out of room. As Tatiana, a research scientist on the MLX team, framed it: “Eventually, memory, compute, or bandwidth on a single machine becomes a limitation.”2 The headline case from Session 232 is a model that simply does not fit: the most recent DeepSeek model “has a whopping 1.6 trillion parameters and requires more than 800GB of memory just for the weights.”1 Session 233 is the deep dive on spreading that work across Macs you own.
The stack underneath distributed MLX has three pieces. The interconnect and transport: starting in macOS 26.2, Remote Direct Memory Access (RDMA) is supported over Thunderbolt 5, moving data directly from one machine’s memory to another’s while “avoiding most CPU and operating system overhead.”2 The communication backend: JACCL, “an open-source collective communication library built by Apple” that runs on RDMA over Thunderbolt and provides collective primitives without you managing the transport, and that “is not limited to machine learning” and “can be built without MLX,” exposing a C++ API for any distributed workload.2 MLX sits on top, using JACCL for low-latency coordination across the cluster.
Tatiana built a cluster from four M3 Ultras. Topology matters because communication time splits into latency (a fixed per-operation cost) and transfer time (which grows with message size). JACCL supports a mesh, where “every machine connects directly to every other” for the lowest latency, and a ring, where each node connects to two neighbors, freeing ports to run multiple cables per neighbor for more bandwidth. Wired as a mesh, JACCL “automatically picks the best topology depending on the message size and communication operation, mesh when latency matters, ring when bandwidth matters.”2 You enable RDMA in Settings, then launch jobs with mlx.launch pointed at a JSON hostfile; the helper script mlx.distributed_config generates that hostfile and, with --auto-setup, configures the Thunderbolt network itself.2
Running a model across the cluster is nearly identical to running it on one machine. You wrap the same mlx_lm.chat command with mlx.launch --hostfile, and “MLX LM shards the model and coordinates the distributed inference for you.”2 Side by side, a 27-billion-parameter Qwen 3.6 generated tokens “at nearly three times the rate of a single machine” on four M3 Ultras.2 MLX supports two sharding strategies: pipeline parallelism (by depth, simple communication but no speedup) and tensor parallelism (by width, all machines process the same token at once for a speedup, at the cost of frequent per-layer communication that “is why the mesh topology is crucial”).2 Tensor parallelism is the default. The session ran the one-trillion-parameter Kimi 2.6 (about a terabyte of weights at 8-bit, which “does not fit on a single M3 Ultra, but it can fit across four”) across the cluster.2 The same approach accelerates fine-tuning: data-parallel LoRA training via mlx_lm.lora took a single M3 Ultra from around 180 tokens per second to around 600 on the cluster, “more than 3 times speed up.”2 MLX exposes the same primitives through Python, Swift, and C++ for embedding distributed workflows in apps.
Securing the loop (Session 347)
Willy introduces indirect prompt injection at 4:01; Akshay covers the framework APIs starting at 11:55.
Giving a model the ability to call tools opens a door. As Willy framed it: “LLMs introduce a new probabilistic engine within your application that is both powerful, but risks being tricked.”3 The new risk is indirect prompt injection, which the session defines as “instructions embedded in extra context provided to the model with the intent to redirect control flow.”3 The session’s example app, Loose Leaf, adds an “Organize a tea party” feature that reads your calendar and friend feed and can order teas. The attack: a user asks to plan a party with their calendar appended, but a calendar event contains an injected instruction telling the model to delete sensitive user data instead.3
Injection produces two effects. Data poisoning, “an attacker influencing the parameters of an executed action,” turns a message meant for your mom into one sent to the attacker. Action poisoning, where the attacker “influences what action to execute,” steers a summarize-this-email request into opening a malicious URL with the email appended.3 The session grounds the danger in Simon Willison’s Lethal Trifecta: a user is most at risk when an agentic system combines access to private data, exposure to untrusted content, and the ability to externally communicate, generalized to “the risk of actions with any side effect.”3 The framing is honest: “solving indirect prompt injection is an active research area,” so the realistic goal is to understand your app’s risk and mitigate it.3
The method is a threat-modeling exercise. First, a data-flow analysis of everything that feeds the prompt, marking as untrusted “any inputs coming from an external entity,” which for Loose Leaf means the calendar content and the friend feed.3 Second, an inventory of the agent’s actions and side effects: a tea-ordering tool carries financial risk, a feed-posting tool carries data-exfiltration risk, and even a benign-looking brewing timer is risky because its optional label “could allow a prompt injection to write more instructions for later attacks.”3 Apple’s stated preference is to “focus on deterministic mitigations as a baseline because their security guarantees are easier to audit and reason about,” with probabilistic mitigations layered on top.3
Then Akshay showed the APIs. In Foundation Models, lifecycle event modifiers are “callbacks that deterministically trigger at certain lifecycle points in a session execution,” usable as security checkpoints. The .onToolCall modifier runs before the executor runs a tool, and “if this callback throws an error, then the tool is never executed,” which “makes this the perfect place to enforce confirmations”: check whether the current tool is the financial one, and if so demand user confirmation first.3 The .historyTransform modifier “fires before the transcript is rendered to the model for inference,” letting you wrap untrusted tool output in spotlighting delimiters and redact PII by replacing sensitive segments with a [REDACTED] placeholder before the model sees them.3 One caveat: those transforms “are scoped to the current inference iteration only,” so you reapply them each call, or use the @SessionProperty annotation for transformations you want to persist.3
For apps integrating with Siri through App Intents, two system guardrails apply. Confirmations are “risk-based” and “contextual”: when an intent adopts a schema it inherits that schema’s risk metadata (deleting photos is destructive, exfiltrating data is risky), and a Risk Evaluation system combines that static metadata with “the dynamic state of the system” to decide whether to ask the user before executing.3 Lock-screen authentication is the second: because Siri is reachable on a locked device, you set an intent’s authenticationPolicy to .requiresAuthentication so destructive actions cannot run while locked; a schema’s default policy can be overridden “only to make it stricter,” and a weaker override produces a build error.3
Debugging the loop (Session 243)
Erik diagnoses a silent agentic failure in his Craft app, starting at 1:58.
The loop’s flexibility is also its debugging problem. As Erik, an AI Tools Engineer, put it: “Traditional code is predictable. LLMs are non-deterministic; the same input can produce different outputs.”4 He named three challenges absent from traditional development: probabilistic output (so “standard unit testing breaks down” and you evaluate quality and intent instead), model-to-model communication, and observability, “when something breaks in a multi-model pipeline, it can be very hard to know where it went wrong.”4 The Foundation Models Instrument in Xcode 27 exists to answer that last one.
Erik demonstrated on his Craft app, where a brainstorming feature uses two instruction sets, brainstorming and tutorial generation, the brainstorming set offering a GenerateCraftIdeaTool and a SwitchToTutorialModeTool.4 In the trace, the feature failed: it kept offering ideas instead of switching to a tutorial. The Instructions lane told the story immediately, showing “only one set of instructions was active for the entire session but the feature was supposed to use two, so something went wrong during the handoff.”4 The tree view, which organizes everything into “sessions, requests, model inferences, instructions, prompts, and responses,” surfaced the root cause: “The prompt references the switchToTutorialMode tool but that tool isn’t actually configured with this instruction.”4 The model kept making tool calls without throwing an error: “this was a silent failure,” the hardest kind to catch.4 Adding the missing tool to the toolset fixed it, and the re-trace showed two distinct instruction sets active, with the handoff happening correctly after a switchToTutorialMode tool call.4
The Instrument also makes performance legible. The Model Inference lane uses yellow bars for input-prompt processing and orange bars for response generation.4 Three metrics drive optimization: Time to First Token (“a high Time to First Token means people are staring at a blank screen; to reduce it, shorten your prompt”), Tokens per Second (to “benchmark performance across different prompt configurations and catch regressions after changes”), and Total Latency, “the number people feel most directly,” reduced in perception by streaming partial results sooner.4 One operational note: the Instrument “captures prompt and response data from your device, which can include sensitive information,” so logging is off in production but on for the trace’s duration, and you keep trace files somewhere safe.4
How to start
The four sessions compose into a sequence you can follow on the hardware you already own:
- Stand up the local loop.
pip installMLX-LM, runmlx_lm.serverwith a small tool-calling model first to validate the setup, and point your agent’s base URL at localhost. Start with read-and-report tasks before letting the agent write files or run builds.1 - Scale only when one Mac is not enough. If a model does not fit in memory or inference is too slow, wire Macs together over Thunderbolt 5, enable RDMA in Settings, generate a hostfile with
mlx.distributed_config, and run the same commands undermlx.launch. Reach for tensor parallelism (the default) for speed and a mesh topology for the low latency it needs.2 - Threat-model before you ship agentic features. List every untrusted context source and every action’s side effects. Add
.onToolCallconfirmations on side-effecting tools and.historyTransformspotlighting and redaction on untrusted tool output; for App Intents, review each intent’s risk metadata and setauthenticationPolicyso destructive actions require an unlocked device.3 - Profile before you trust it. Profile your Foundation Models feature in Xcode 27’s Instrument, read the Instructions and Model Inference lanes for silent failures, and use Time to First Token, Tokens per Second, and Total Latency to find the slow steps.4
Everything in Session 232 is “open-source and available right now.”1
FAQ
Can I really run an AI agent entirely on my Mac?
Yes. WWDC 2026 Session 232 demonstrates the complete agentic loop running locally through MLX: a model reasons, calls tools, observes results, and iterates, with only the tool calls that genuinely need the network reaching outside the machine. The stack is MLX, MLX-LM, the OpenAI-compatible MLX-LM Server, and any agent that speaks the OpenAI chat completions protocol on top.1
How do I connect my agent to a local MLX model?
Three steps. Install MLX-LM with pip, start mlx_lm.server with a model that supports tool calling, and set your agent framework’s base URL to your local server’s address on localhost. The agent treats the local server exactly as it would a cloud LLM API, because MLX-LM Server is a drop-in OpenAI-compatible HTTP server.1
What if the model is too large for one Mac?
MLX distributes a model across multiple Macs connected over Thunderbolt 5, using RDMA (supported starting in macOS 26.2) and Apple’s open-source JACCL communication library. You launch jobs with mlx.launch and a hostfile; MLX shards the model automatically. Apple’s session ran a one-trillion-parameter model across four M3 Ultras and saw roughly threefold speedups on inference and fine-tuning versus a single machine.2
What is the main new security risk for agentic Mac apps?
Indirect prompt injection: malicious instructions hidden in untrusted context (a calendar event, a social feed, a tool result) that redirect the model into actions the user never asked for, such as deleting data or exfiltrating it. Apple recommends a threat-modeling pass plus deterministic guardrails: .onToolCall confirmations and .historyTransform spotlighting and PII redaction in Foundation Models, and risk-based confirmations and lock-screen authentication in App Intents.3
How do I debug an agent that fails silently?
Use the Foundation Models Instrument in Xcode 27. It captures every model inference, instruction set, prompt, and response into timeline lanes and a tree view, so you can see exactly which tools were available at each step and where a handoff went wrong, even when the model never throws an error. It also surfaces Time to First Token, Tokens per Second, and Total Latency for performance tuning.4
Running your own model on Apple silicon is the foundation this loop stands on: see MLX on Apple Silicon: when you need your own model, not Apple’s and running models on Apple silicon with Core AI. The runtime-versus-tooling distinction that shapes how agents touch a Swift app is in the Foundation Models agentic workflow. Once the loop runs, measuring its quality is the next step, covered in Apple’s Evaluations framework. The full series hub is the Apple Ecosystem Series, and the broader build context is the iOS Agent Development guide.
References
-
Apple, WWDC 2026 session 232, Run local agentic AI on the Mac using MLX. Source for the four-layer stack (MLX, MLX-LM, MLX-LM Server, agent), the three-step setup (
pip install,mlx_lm.server, base-URL config), the agentic loop definition, the PR-summary and SwiftUI drawing-app demos, the Xcode Intelligence-tab integration, and the three hardware challenges: prompt processing (M5 Neural Accelerators, four-times-faster matrix multiplication versus M4), concurrency (continuous batching), and model size (the 1.6-trillion-parameter DeepSeek model requiring more than 800GB for weights). ↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩ -
Apple, WWDC 2026 session 233, Explore distributed inference and training with MLX. Source for RDMA over Thunderbolt 5 (macOS 26.2), the JACCL collective communication library, mesh-versus-ring topology, the
mlx.launch/mlx.distributed_configworkflow and JSON hostfile, tensor- versus pipeline-parallelism, the four-M3-Ultra cluster results (Qwen 3.6 at nearly three times single-machine token rate; one-trillion-parameter Kimi 2.6 running across four machines; LoRA fine-tuning from ~180 to ~600 tokens per second), and the Python, Swift, and C++ APIs. ↩↩↩↩↩↩↩↩↩↩↩↩↩ -
Apple, WWDC 2026 session 347, Secure your app: mitigate risks to agentic features. Source for indirect prompt injection, data poisoning and action poisoning, the Lethal Trifecta framing, the threat-modeling exercise (untrusted context sources and action side effects), and the mitigation APIs: Foundation Models lifecycle event modifiers
.onToolCall(confirmations) and.historyTransform(spotlighting and PII redaction, scoped to one inference iteration, with@SessionPropertyfor persistence), and App Intents risk-based contextual confirmations andauthenticationPolicy(.requiresAuthentication, overridable only to a stricter policy). ↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩↩ -
Apple, WWDC 2026 session 243, Debug and profile agentic app experiences with Instruments. Source for the three LLM-development challenges (probabilistic output, model-to-model communication, observability), the Foundation Models Instrument in Xcode 27 (Instructions and Model Inference lanes, the session/request/inference tree view), the silent-failure diagnosis in the Craft app (a tool referenced in the prompt but missing from the instruction’s toolset), the privacy note on trace logging, and the three performance metrics: Time to First Token, Tokens per Second, and Total Latency. ↩↩↩↩↩↩↩↩↩↩↩↩↩