MLX on Apple Silicon: When You Need Your Own Model, Not Apple's

Apple’s Foundation Models framework hands you one model: the system’s, sealed, free, and updated on Apple’s schedule. For most on-device language work that is the right tool, and reaching past it is a mistake. But some work needs a model you choose: a specific open-weight LLM, a version you pin, a fine-tune trained on your own data, or a capability the system model does not have. When you need your own model running on-device, the layer below Foundation Models is MLX [1].

MLX is Apple’s array framework for machine learning on Apple Silicon, with a Swift API (MLX Swift) you embed directly in an app [2]. It is not a system framework you call; it is a library you ship, along with the model weights. That difference is the whole trade, and understanding it is how you decide whether to drop down a layer or stay where Apple put you.

TL;DR

  • MLX is a NumPy-like array framework built for Apple Silicon, with lazy evaluation, composable function transformations, and a Metal backend [2].
  • The unified memory model is why it works on a phone. Arrays live in one memory pool the CPU and GPU share, so MLX runs across both over the same buffers with no host-to-device copy tax [3].
  • Run an open-weight LLM on-device with LLMModelFactory, pointing at a quantized model like mlx-community/Llama-3.2-3B-Instruct-4bit, then generate through a ChatSession [4].
  • Fine-tune with LoRA adapters: train a small adapter, ship adapters.safetensors, and load(into:) swaps the base model’s Linear layers for LoRALinear at runtime [5].
  • The cost of your own model: app size (weights are big), memory pressure, no system integration, and you own every update. Foundation Models has none of those costs because Apple pays them.

What MLX is, and why Apple Silicon makes it possible

MLX gives you arrays and operations that look like NumPy, plus the transformations machine learning needs: automatic differentiation, vectorization, and lazy evaluation that builds a computation graph and only runs it when you read a result [2]. On its own that describes a dozen frameworks. What makes MLX run a multi-billion-parameter model on a device in your pocket is the memory model.

On a machine with a discrete GPU, data lives in system RAM and you copy it across a bus to the GPU’s separate memory to compute, then copy results back. That copy is the tax, and for a large model it is brutal. Apple Silicon has unified memory: one pool the CPU, GPU, and Neural Engine all address directly. MLX is built around that fact [3]. An array is not “on the CPU” or “on the GPU”; it is in memory, and any processor operates on it in place. No copies, no bus tax. A 3-billion-parameter model quantized to 4 bits needs about 1.5 GB for its weights (half a byte per parameter), plus a little quantization metadata and the key-value cache it builds while generating, and it runs without the round-trips that would make the same work impractical on a discrete-GPU machine of similar memory. The hardware decision Apple made years ago is the reason on-device inference of a real model is viable at all, and the tile-based, unified-memory architecture is the substrate MLX stands on.
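The weight arithmetic behind that claim can be sketched in a few lines of plain Swift. This is a back-of-the-envelope estimate, not an MLX API: the QuantizationScheme type and weightBytes function are hypothetical names, and the group size of 64 with a 16-bit scale and 16-bit bias per group is an assumption about a typical group-wise 4-bit layout.

```swift
// Rough weight-memory estimate for a quantized model.
// Assumption: group-wise quantization stores one 16-bit scale and one 16-bit
// bias for every group of 64 weights. Treat these numbers as illustrative.
struct QuantizationScheme {
    var bitsPerWeight: Double
    var groupSize: Double            // weights that share one scale/bias pair
    var overheadBitsPerGroup: Double // metadata stored once per group

    // Bits actually stored per weight once the metadata is spread over the group.
    var effectiveBits: Double {
        bitsPerWeight + overheadBitsPerGroup / groupSize
    }
}

func weightBytes(parameters: Double, scheme: QuantizationScheme) -> Double {
    parameters * scheme.effectiveBits / 8.0
}

let fourBit = QuantizationScheme(bitsPerWeight: 4, groupSize: 64, overheadBitsPerGroup: 32)
let sixteenBit = QuantizationScheme(bitsPerWeight: 16, groupSize: 1, overheadBitsPerGroup: 0)

let parameterCount = 3.0e9
let quantizedGB = weightBytes(parameters: parameterCount, scheme: fourBit) / 1e9
let sixteenBitGB = weightBytes(parameters: parameterCount, scheme: sixteenBit) / 1e9

print("4-bit weights:", quantizedGB, "GB")
print("16-bit weights:", sixteenBitGB, "GB")
```

Under those assumptions the 4-bit model needs roughly 1.7 GB for weights against 6 GB at 16 bits, which is the gap between fitting on a phone and not.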

Running an LLM on-device

The path from “I want a specific model” to text on screen is short. MLX Swift’s LLM layer loads a quantized model from the Hugging Face Hub and runs it [4]:

let container = try await LLMModelFactory.shared.loadContainer(
    from: HubClient.default,
    using: TokenizersLoader(),
    configuration: .init(id: "mlx-community/Llama-3.2-3B-Instruct-4bit")
)

let session = ChatSession(container)
let response = try await session.respond(to: "Summarize this in one line: \(text)")

For token-by-token UI, generate a stream instead and render chunks as they arrive [4]:

let input = try await container.prepare(input: UserInput(prompt: prompt))
let stream = try await container.generate(input: input, parameters: GenerateParameters())
for await event in stream {
    if case let .chunk(text) = event { /* append to UI */ }
}

Two details carry most of the practical weight. First, the 4bit in the model ID is not optional sugar: quantization is what makes the model fit in memory and run at usable speed on a device. You ship 4-bit (or lower) weights, not full precision. Second, the weights are large even quantized, so you decide deliberately whether to bundle them in the app (instant, but a fat download) or fetch them on first launch (lean binary, but a wait and a failure path to handle). Foundation Models never poses that question because the model is already on the device. With MLX, the weights are your problem.
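The bundle-or-fetch decision usually ends up as a small piece of startup logic. The sketch below shows one way to structure it, using Foundation only. WeightSource, resolveWeights, and the example paths are hypothetical names for illustration, not part of MLX Swift, and the actual download is left to the caller.

```swift
import Foundation

// Hypothetical types for illustration; none of this is MLX Swift API.
enum WeightSource: Equatable {
    case bundled(URL)     // shipped inside the app: instant, larger download
    case cached(URL)      // fetched on an earlier launch
    case needsDownload    // caller must fetch, show progress, and handle failure
}

// Prefer bundled weights, then a previously downloaded copy, else ask for a download.
// The `exists` closure is injectable so the logic can be exercised without touching disk.
func resolveWeights(
    bundledDirectory: URL?,
    cacheDirectory: URL,
    exists: (URL) -> Bool = { FileManager.default.fileExists(atPath: $0.path) }
) -> WeightSource {
    if let bundled = bundledDirectory, exists(bundled) {
        return .bundled(bundled)
    }
    if exists(cacheDirectory) {
        return .cached(cacheDirectory)
    }
    return .needsDownload
}

// Example: nothing bundled and nothing cached yet, so the app must download.
let cacheURL = URL(fileURLWithPath: "/tmp/models/example-4bit")
let source = resolveWeights(bundledDirectory: nil, cacheDirectory: cacheURL, exists: { _ in false })
print(source)
```

Making the download case an explicit value forces the UI to account for the wait and the failure path instead of discovering them in production.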

Fine-tuning: a LoRA adapter, not a new model

The reason to bring your own model is rarely the base model itself; it is teaching it your domain. Full fine-tuning of a multi-billion-parameter model on-device is not the move. LoRA (low-rank adaptation) is: you train a small set of adapter weights that adjust the base model’s behavior, leaving the base untouched. The adapter is megabytes, not gigabytes [5].
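The size gap follows from the shape of the adapter. For a Linear layer with an inputDim × outputDim weight, a rank-r LoRA adapter adds two thin matrices totaling r × (inputDim + outputDim) parameters. A quick sketch of that count, with an illustrative layer width and rank that are not taken from any particular model:

```swift
// Parameter count for one Linear layer versus its LoRA adapter.
// A rank-r adapter adds two small matrices: A (rank x inputDim) and B (outputDim x rank).
func baseParameters(inputDim: Int, outputDim: Int) -> Int {
    inputDim * outputDim
}

func loraParameters(inputDim: Int, outputDim: Int, rank: Int) -> Int {
    rank * (inputDim + outputDim)
}

// Illustrative numbers only.
let layerDim = 3072
let adapterRank = 8

let baseCount = baseParameters(inputDim: layerDim, outputDim: layerDim)
let adapterCount = loraParameters(inputDim: layerDim, outputDim: layerDim, rank: adapterRank)

print("base:", baseCount, "adapter:", adapterCount)
print("adapter is 1/\(baseCount / adapterCount) of the layer")
```

With these numbers the adapter holds 49,152 parameters against 9,437,184 in the layer it adapts, a factor of 192, which is why the shipped file stays small.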

MLX Swift loads a trained adapter from a directory holding adapter_config.json and adapters.safetensors, then applies it to a model already loaded in a container [5]:

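let adapter = try LoRAContainer.from(directory: adapterURL)
await container.update { context in
    do {
        // Swaps the model's Linear layers for LoRALinear.
        try adapter.load(into: context.model)
    } catch {
        // Surface the failure; otherwise the base model keeps running unadapted.
        print("LoRA adapter failed to load: \(error)")
    }
}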

load(into:) replaces the model’s standard Linear layers with LoRALinear layers that fold in the adapter’s low-rank deltas, so inference now reflects your fine-tune. Because the model lives inside the container, you apply the adapter through container.update, and you can hot-swap adapters at runtime (unload(from:) one, load(into:) another) to give a single base model different behavior per feature. The pattern mirrors what Apple offers for the system model through Foundation Models custom adapters: the difference is that here you own the base model, the training pipeline, and the result, instead of adapting a model you cannot see.
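What “folding in the low-rank deltas” means numerically is that the layer output becomes W·x + scale · B·(A·x). The toy below works through that formula with plain Swift arrays. ToyLoRALinear is a hypothetical stand-in for illustration; it is not the LoRALinear type from MLX Swift and makes no claim about how that type is implemented.

```swift
// Matrix-vector product for a small dense matrix stored as rows.
func matVec(_ m: [[Double]], _ v: [Double]) -> [Double] {
    m.map { row in zip(row, v).reduce(0.0) { $0 + $1.0 * $1.1 } }
}

// Toy LoRA layer: frozen base weight plus a scaled low-rank correction.
struct ToyLoRALinear {
    var weight: [[Double]]   // frozen base weight, outputDim x inputDim
    var a: [[Double]]        // adapter A, rank x inputDim
    var b: [[Double]]        // adapter B, outputDim x rank
    var scale: Double

    func callAsFunction(_ x: [Double]) -> [Double] {
        let baseOut = matVec(weight, x)          // W x
        let delta = matVec(b, matVec(a, x))      // B (A x)
        return zip(baseOut, delta).map { $0 + scale * $1 }
    }
}

let layer = ToyLoRALinear(
    weight: [[1, 0], [0, 1]],   // identity base layer
    a: [[1, 1]],                // rank 1
    b: [[0.5], [-0.5]],
    scale: 2.0
)

let adapted = layer([3, 4])
let baseOnly = matVec(layer.weight, [3, 4])
print("base:", baseOnly, "adapted:", adapted)
```

The base weight is never modified, which is what makes swapping adapters cheap: removing the A and B terms returns exactly the base output.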

The decision: Foundation Models, MLX, or cloud

Three layers, and choosing wrong costs you either capability or a pile of avoidable work.

  • Foundation Models when the system model can do the task. Free, private, zero weights to ship, zero memory you manage, and system integration you get for nothing. Default here. The on-device language tasks Apple built it for (summarize, classify, extract, rewrite, structured output) belong here, full stop.
  • MLX when you need a model the system does not give you: a specific open-weight LLM, a pinned version that does not shift under an OS update, a domain fine-tune, or an architecture (a vision-language model, a non-text model) outside Foundation Models’ scope. You pay in app size, memory, and ownership, and you buy control.
  • Cloud when the model genuinely must be large: frontier reasoning, long-context analysis, anything the biggest models do that a few-billion-parameter on-device model cannot. On-device is not a replacement for a frontier model; it is a different point on the curve.

The honest reading: MLX is a deliberate step down for a specific reason, not a better default. If you cannot name the capability Foundation Models lacks for your feature, you do not need MLX, and shipping it means carrying gigabytes of weights and managing a memory budget you would otherwise never have had to think about.
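The three-way choice above can be written down as a small decision function. This is only a sketch of the reasoning: InferenceLayer, FeatureNeeds, and chooseLayer are hypothetical names, and checking frontier scale before the own-model reasons is an assumed ordering rather than a rule.

```swift
enum InferenceLayer {
    case foundationModels, mlx, cloud
}

// Hypothetical description of what a feature requires.
struct FeatureNeeds {
    var systemModelCanDoTask: Bool
    var needsFrontierScale: Bool   // frontier reasoning, long-context analysis
    var needsOwnModel: Bool        // specific weights, pinned version, fine-tune, other architecture
}

func chooseLayer(for needs: FeatureNeeds) -> InferenceLayer {
    // Default: if the system model can do it, stop here.
    if needs.systemModelCanDoTask { return .foundationModels }
    // Assumed ordering: a model that must be large cannot be rescued by MLX.
    if needs.needsFrontierScale { return .cloud }
    // Only drop to MLX when a concrete own-model reason has been named.
    if needs.needsOwnModel { return .mlx }
    // No named reason: stay on the default rather than carry weights.
    return .foundationModels
}

let fineTuneFeature = FeatureNeeds(
    systemModelCanDoTask: false,
    needsFrontierScale: false,
    needsOwnModel: true
)
print(chooseLayer(for: fineTuneFeature))
```

The final fallthrough is the point of the exercise: without a named capability gap, the function never returns MLX.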

When not to reach for MLX

  • The system model already does it. Re-read the Foundation Models tasks. If yours is on the list, stop here.
  • You cannot afford the weights. A quantized small model is still a large asset. If app size or first-launch download is a real constraint for your users, that constraint may decide the question by itself.
  • You need the Neural Engine’s lowest-power path for a fixed model. For a known, shipped model that does not change, Core ML and its converter target the Neural Engine with the tightest power and latency. MLX shines for flexibility and research-grade iteration; Core ML shines for a locked-down production model. They are different tools, and “on-device ML” is not one decision.
  • You will not maintain it. Your own model means you own its updates, its security, and its drift. Apple updates the system model for you. If you are not staffed to own a model, do not adopt one.

The skill MLX rewards is restraint about when to use it. The framework is genuinely remarkable: a real language model, fine-tuned to your domain, running entirely on the device with no server and no per-token cost, on hardware whose memory architecture was built for exactly this. That capability is worth reaching for when you have named the reason. Reach for it without one and you have traded Apple’s free, maintained, integrated model for a heavier, unmaintained copy you now own. The judgment is the whole job.



  1. Positioning MLX relative to the Foundation Models framework: Foundation Models exposes Apple’s fixed on-device system model (see Apple Foundation Models: The On-Device LLM Framework); MLX runs models you select and fine-tune. The two address different needs at different layers of the on-device stack. 

  2. Apple Machine Learning Research, MLX and MLX Swift. MLX is an array framework for machine learning on Apple Silicon with a NumPy-like API, composable function transformations (automatic differentiation, vectorization), lazy computation, and a Metal backend. MLX Swift is the Swift API for embedding it in apps. 

  3. MLX documentation, unified memory. MLX arrays live in shared memory; operations can run on CPU or GPU without transferring data between separate memory pools, which is the property that makes Apple Silicon’s unified memory architecture efficient for on-device model execution. Background on the hardware: Apple Silicon’s TBDR and unified memory.

  4. Apple Machine Learning Research, MLX Swift Examples / MLX Swift LM. LLMModelFactory.shared.loadContainer(from:using:configuration:) loads a quantized model (for example mlx-community/Llama-3.2-3B-Instruct-4bit) from the Hugging Face Hub; ChatSession provides respond(to:) for single calls, and container.generate(input:parameters:) yields a stream of .chunk(text) events for incremental output via GenerateParameters and UserInput.

  5. Apple Machine Learning Research, MLX Swift LM LoRA adapters reference. LoRAContainer.from(directory:) loads an adapter from a directory containing adapter_config.json and adapters.safetensors; applied through container.update, adapter.load(into: context.model) replaces the model’s Linear layers with LoRALinear layers, and unload(from:) removes one so adapters can be hot-swapped at runtime. Compare Apple’s system-model path in Foundation Models custom adapters.

  6. Author’s hands-on MLX work: an autonomous ML research loop running fixed-budget training experiments on Apple Silicon via MLX, autonomously modifying architecture and hyperparameters to minimize validation bits-per-byte and keeping only improvements. The unified-memory and quantization behavior described here reflects that experimentation. 
