Core ML On-Device Inference: The Patterns That Actually Ship

Core ML is the on-device inference engine that ships with every modern Apple device. The framework dispatches to the Neural Engine when available, the GPU when not, and the CPU as the last resort, choosing the fastest path automatically based on the model and the hardware [1]. The result on a recent iPhone is inference at sub-millisecond to low-tens-of-milliseconds latency for most production model sizes, at zero marginal cost per call, with no network roundtrip and no third-party data exposure.

The framework’s reputation as “obscure plumbing” is dated. Core ML now powers Apple Intelligence’s on-device LLM, the Photos app’s semantic search, the Camera app’s scene recognition, and most third-party apps that ship local ML. The patterns that make a Core ML deployment actually ship, rather than merely work-on-my-Mac, are a small set: model conversion, dispatch hinting, latency budgeting, and quantization. This post walks through each one against Apple’s documentation.

TL;DR

  • Core ML runs .mlpackage and .mlmodel files on Apple Silicon’s Neural Engine, GPU, and CPU. Dispatch is automatic but can be hinted via MLModelConfiguration.computeUnits [2].
  • Model conversion happens through coremltools (PyTorch, TensorFlow, ONNX → Core ML). The conversion is a tooling task, not a runtime task; once a model is converted and bundled, the app loads and runs it.
  • Apple Silicon’s unified memory architecture means model weights are not copied between CPU, GPU, and NE; the same memory backs all three [3]. That architectural detail is what makes sub-millisecond inference possible.
  • Quantization (INT8, INT4 in recent Core ML versions) shrinks model size and speeds inference on the Neural Engine, with a measurable accuracy cost that depends on the model.
  • The agent-workflow tie-in: Foundation Models (Apple Intelligence’s on-device LLM) ships as a Core ML model behind a high-level Swift API; the same dispatch and quantization patterns apply.

The Mental Model: Three Compute Paths, One Memory

Apple Silicon (M-series Macs and A-series iPhones from A12 Bionic forward) ships three inference targets:

Neural Engine. A specialized accelerator for matrix multiplication at low precision. Fastest for the operations modern ML models depend on (convolutions, attention, embeddings). Lowest power consumption. Limited to specific operation types and tensor shapes; unsupported ops fall back to GPU or CPU per-layer.

GPU. General-purpose parallel compute through Metal. Slower than the Neural Engine for ML-shaped work but faster than CPU. Handles operations the Neural Engine does not support.

CPU. The fallback. Slow for ML inference, but always available, always supports every operation, and predictable.

The unified memory architecture means the same physical RAM backs all three [3]. A model’s weights, loaded once, do not get copied when dispatch shifts between targets. That fact turns multi-target dispatch from a per-layer copy cost into a per-layer scheduling decision.

MLModelConfiguration.computeUnits controls dispatch:

let config = MLModelConfiguration()
config.computeUnits = .all          // default: NE, GPU, CPU
// Other options:
// .cpuAndGPU
// .cpuAndNeuralEngine
// .cpuOnly
let model = try MyModel(configuration: config)

.all is the default and the right choice for almost every app. The framework picks the fastest path per operation, and that per-operation decision is faster than any heuristic the developer would write. The rare reasons to override are forcing .cpuOnly for testing parity (the same model can produce slightly different outputs on different paths, and the test wants the deterministic one) or forcing .cpuAndGPU to free the Neural Engine for another concurrent task.

Model Conversion: The Tooling Task

Most ML models are trained in PyTorch, TensorFlow, or directly through Apple’s Create ML. Core ML accepts .mlpackage files, the modern format introduced in Xcode 13 that supersedes the older .mlmodel [4]. Conversion happens through coremltools, Apple’s open-source Python package [5].

A typical PyTorch-to-Core ML conversion follows three steps:

  1. Load the trained PyTorch model and place it in inference mode.
  2. Trace the model with an example input tensor matching the production input shape.
  3. Convert the traced model with coremltools against a target deployment iOS version.

import torch
import coremltools as ct

model = MyTrainedModel()
model.load_state_dict(torch.load("weights.pth"))

example_input = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)

mlmodel = ct.convert(
    traced_model,
    inputs=[ct.ImageType(name="image", shape=example_input.shape)],
    minimum_deployment_target=ct.target.iOS18,
    compute_units=ct.ComputeUnit.ALL,
)
mlmodel.save("MyModel.mlpackage")

The conversion happens once, in a development environment, against a target deployment iOS version (minimum_deployment_target). The output .mlpackage is what gets dropped into the Xcode project. The runtime app does not run coremltools.

Two practical traps in conversion. First, dynamic-shape inputs need explicit handling through ct.RangeDim, because Core ML’s static-shape default produces unhelpful errors when the production app feeds varying input sizes. Second, custom ops in PyTorch that don’t have a Core ML equivalent need either a Core ML custom layer (Swift code that runs the missing op) or a model-architecture change to remove the op before conversion. Both are well-documented [5].
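A hedged sketch of the flexible-shape fix, reusing the model class from the conversion example above; the input name and size bounds are assumptions to adapt:

import coremltools as ct
import torch

# Reusing the model class from the conversion example above.
traced_model = torch.jit.trace(MyTrainedModel().eval(),
                               torch.rand(1, 3, 224, 224))

# Height and width may vary between 224 and 1024 at runtime;
# batch and channel dimensions stay fixed.
flexible_shape = ct.Shape(shape=(
    1, 3,
    ct.RangeDim(lower_bound=224, upper_bound=1024),
    ct.RangeDim(lower_bound=224, upper_bound=1024),
))

mlmodel = ct.convert(
    traced_model,
    inputs=[ct.TensorType(name="image", shape=flexible_shape)],
    minimum_deployment_target=ct.target.iOS18,
)
mlmodel.save("MyModelFlexible.mlpackage")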

Latency Budgets That Actually Apply

Three latency budgets matter for shipping apps:

16 ms (60 fps live UI). A real-time camera filter, an AR scene that updates per-frame, a live audio analyzer. The budget includes everything: image preprocessing, model inference, postprocessing, UI update. Models that fit are typically small (MobileNetV3-class, a few million parameters) and run on the Neural Engine.

100 ms (interactive UI). A user takes an action and waits for the result: tap to identify, draw to recognize, dictate to transcribe. The budget is more forgiving and supports larger models. Sub-1B-parameter language models, small vision transformers, and most production-grade classifiers fit comfortably.

1 s+ (background or batch). Photo library indexing, document analysis, model warm-up on app launch. Larger models work, but the user expectation must be set with a progress indicator. Foundation Models’ on-device LLM lives here for larger context-window operations.

The budgets are guidelines, not hard limits. The right move is to measure on a target device using os_signpost or Instruments’ Core ML template [6] rather than to trust theoretical numbers from another machine.
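For a rough ballpark before reaching for Instruments, coremltools can time predictions on a development Mac. The sketch below assumes a tensor-input model and hypothetical file and input names; it is a sanity check, not a substitute for on-device profiling:

import time
import numpy as np
import coremltools as ct

mlmodel = ct.models.MLModel("MyModel.mlpackage")
sample = {"image": np.random.rand(1, 3, 224, 224).astype(np.float32)}

mlmodel.predict(sample)  # warm-up; the first call absorbs load cost

timings = []
for _ in range(50):
    start = time.perf_counter()
    mlmodel.predict(sample)
    timings.append(time.perf_counter() - start)

print(f"median latency: {1000 * sorted(timings)[25]:.2f} ms")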

Quantization: When Smaller Is Faster

Core ML supports several quantization levels [7]:

  • Float32 (full precision). The training default. Largest, most accurate, slowest.
  • Float16. Half-precision. Smaller and faster on GPU and NE; accuracy loss is usually negligible for well-conditioned models.
  • INT8. 8-bit integer quantization with calibration. Roughly 4x smaller than Float32, often 2-4x faster on NE. Accuracy loss varies; for vision models, sub-1% top-1 accuracy loss is achievable with quantization-aware training.
  • INT4 and below. Aggressive quantization that recent Core ML versions support for specific model architectures (LLMs, large vision models). Significant accuracy loss is the trade-off; the technique works best when paired with quantization-aware training.

Linear quantization runs through coremltools.optimize.coreml.linear_quantize_weights, which takes an optimization config whose global op config picks the quantization mode (linear_symmetric or linear) and a weight-size threshold below which weights stay in full precision. The pass runs against an existing .mlpackage and produces a new quantized package; both can ship side-by-side in the bundle, with the app picking which one to load based on device class.
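A minimal sketch, assuming coremltools 7 or later; the file names are hypothetical placeholders:

import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.models.MLModel("MyModel.mlpackage")

# Quantize weights symmetrically to 8 bits; ops with fewer than
# 512 weight elements stay in full precision.
op_config = cto.OpLinearQuantizerConfig(mode="linear_symmetric",
                                        weight_threshold=512)
config = cto.OptimizationConfig(global_config=op_config)

quantized = cto.linear_quantize_weights(mlmodel, config=config)
quantized.save("MyModel-Int8.mlpackage")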

The quantization decision is per-model: a small classifier may not benefit because its compute is already cheap; a large language model benefits enormously because its compute is dominated by matrix multiplies on the quantized weights. The right approach is to quantize, measure accuracy on a held-out test set, and ship if the accuracy hit is acceptable for the use case.
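The measurement step can live in the same conversion script. The sketch below compares top-1 accuracy for a classifier; the held-out loader and the "classLabel" output key are hypothetical and depend on how the model was converted:

import coremltools as ct

original = ct.models.MLModel("MyModel.mlpackage")
quantized = ct.models.MLModel("MyModel-Int8.mlpackage")

def top1_accuracy(model, samples):
    # samples: list of (input_dict, expected_label) pairs
    correct = sum(1 for inputs, label in samples
                  if model.predict(inputs)["classLabel"] == label)
    return correct / len(samples)

held_out = load_held_out_set()  # hypothetical loader for the held-out set
print("float32:", top1_accuracy(original, held_out))
print("int8:   ", top1_accuracy(quantized, held_out))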

Apple’s Built-In Models You Can Drop In

Apple ships several pre-trained Core ML models through the Core ML Models page [8]. Categories worth knowing:

  • Image classification: MobileNetV2, ResNet50, SqueezeNet variants, all bundled and ready to drop into a Vision framework VNCoreMLRequest.
  • Object detection: YOLOv3 and CenterNet variants.
  • Drawing classification: the MNIST handwritten-digit classifier.
  • Pose estimation: PoseNet for body pose (a baseline alternative to Vision’s VNDetectHumanBodyPoseRequest).
  • Semantic segmentation: DeepLabV3 for image segmentation.
  • Text recognition: ML-based OCR alternatives to Vision’s built-in text recognizer.

For most apps, Apple’s pre-trained models cover the perception primitives (classify, detect, segment) without requiring custom training. The Foundation Models on-device LLM (covered in the Foundation Models post) is the largest example: a multi-billion-parameter LLM that ships as a Core ML model behind a high-level Swift API, dispatched on the Neural Engine, available offline.

Model Encryption and App Store Considerations

A .mlpackage in the app bundle is readable by anyone who unpacks the IPA. For models that represent meaningful intellectual property, Apple supports model encryption through the “Encrypting a Model in Your App” workflow [9]: an encryption key gets generated through Xcode and managed through CloudKit, the model in the bundle is encrypted, and Core ML decrypts it at load time.

For most apps, encryption is overkill. A model trained on commodity ImageNet data is not a competitive differentiator; encrypting it adds operational complexity without protecting anything valuable. Reserve encryption for models that represent genuine training-data investment or competitive advantage.

On-Device Privacy: The Architectural Win

The privacy story is direct. Core ML inference happens entirely on the device. Input data (images, audio, text) does not leave the device. The model file is local; the inference is local; the result is local.

For apps in regulated industries (health, finance, education), the architectural fact eliminates a class of compliance work. There is no third-party data processor to add to a privacy policy. There is no model API endpoint to vet for security. There is no data-residency question because the data never moves.

The Privacy Manifest format [10] codifies the privacy story for App Store submission: an app that uses Core ML for on-device inference and nothing else can declare zero third-party data sharing for the inference path. The submission process is faster, the privacy review is shorter, and the user-facing privacy nutrition label is cleaner.

The Agent-Workflow Connection

Core ML pairs with three patterns the cluster has already covered:

Vision Framework’s VNCoreMLRequest. Custom Core ML models run through the Vision pipeline with automatic preprocessing. The pattern (covered in the Vision Framework post) is the right way to ship a custom image classifier or detector inside an iOS app.

Foundation Models on-device LLM. Apple Intelligence’s LLM is a Core ML model behind a high-level Swift API. The same dispatch (Neural Engine first), quantization (INT4 for the LLM weights), and latency-budget (sub-second for short generations) patterns apply. The post on Foundation Models covers the API; this post covers the underlying engine.

App Intents tools using local ML. An AppIntent that runs a local image classifier or text classifier returns structured results to Apple Intelligence without a network roundtrip. The combination is what makes “agentic Apple” actually private; the agent’s tools run locally because the framework supports it.

When Cloud Inference Is The Right Call

Core ML’s ceiling is the device’s compute. Three cases where cloud is correct:

Models too large to ship in a bundle. A 70B-parameter LLM does not fit in an app bundle. For workloads at that scale, cloud inference (or on-device-via-streaming weights, a different pattern) is the right tool.

Cross-device shared state during inference. Models that need to read or write a shared database during inference, such as recommendation systems running collaborative filtering against billions of records. Core ML’s purely-local execution model does not fit.

Rapid model iteration. A team that ships model updates daily benefits from server-side inference because rollouts do not require App Store review cycles. Core ML’s bundle-the-model-in-the-app pattern adds friction to model rev cadence; the trade-off is real.

The pattern: cloud wins on scale and iteration speed; Core ML wins on latency, cost, and privacy.

What This Pattern Means For iOS 26+ Apps

Three takeaways.

  1. Default to Core ML for any model that fits in the bundle and produces a per-call result the user can act on. Image classification, object detection, audio classification, gesture recognition, embedding generation, small-to-medium language tasks. The framework’s automatic dispatch and the Apple Silicon Neural Engine produce sub-millisecond to low-tens-of-milliseconds inference for free.

  2. Quantize aggressively when the accuracy hit is acceptable. INT8 is usually safe; INT4 is appropriate for large models where the size savings matter. Measure accuracy on a held-out set rather than trusting that quantization is universally safe.

  3. Pair with Vision and Foundation Models for full local pipelines. Core ML is the engine; Vision is the perception API on top of it; Foundation Models is the LLM on top of it. The cluster’s Vision post and Foundation Models post cover the higher-level surfaces.

The full Apple Ecosystem cluster: typed App Intents; MCP servers; the routing question; Foundation Models; the runtime vs tooling LLM distinction; three surfaces; the single source of truth pattern; Two MCP Servers; hooks for Apple development; Live Activities; the watchOS runtime; SwiftUI internals; RealityKit’s spatial mental model; SwiftData schema discipline; Liquid Glass patterns; multi-platform shipping; the platform matrix; Vision framework; Symbol Effects; what I refuse to write about. The hub is at the Apple Ecosystem Series. For broader iOS-with-AI-agents context, see the iOS Agent Development guide.

FAQ

How does Core ML decide between Neural Engine, GPU, and CPU?

Core ML examines each operation in the model graph and dispatches it to the fastest target that supports the operation. The Neural Engine handles supported operations (most matrix multiplies, convolutions, attention) at the lowest latency and power. The GPU handles ops the NE doesn’t support. The CPU handles the rest. The decision is per-operation, automatic, and faster than a hand-written heuristic.

Should I always use .computeUnits = .all?

Almost always. The framework’s automatic dispatch is well-tuned. Override to .cpuOnly when testing for output parity (the same model returns slightly different results on NE vs CPU due to floating-point rounding) or to .cpuAndGPU to free the Neural Engine for a concurrent task.

What’s the practical difference between .mlpackage and .mlmodel?

.mlpackage is the modern format introduced in Xcode 13. It is a package directory that stores the ML Program (mlprogram) model type, with weights held separately from the architecture and metadata, and it targets iOS 15 and later toolchains. .mlmodel is the legacy single-file format. Both still load through MLModel; new development should use .mlpackage.
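To confirm which model type a converted package actually contains, coremltools can inspect the underlying spec; the file name is a hypothetical placeholder:

import coremltools as ct

mlmodel = ct.models.MLModel("MyModel.mlpackage")
spec = mlmodel.get_spec()

# Prints "mlProgram" for .mlpackage-era conversions,
# "neuralNetwork" for legacy .mlmodel conversions.
print(spec.WhichOneof("Type"))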

How big can a Core ML model in an app bundle be?

There is no fixed limit, but App Store bundle sizes are capped at 4 GB for download and have practical limits for over-the-air install. Foundation Models’ on-device LLM is around 3 GB and gets distributed by the OS rather than the app bundle. For app-bundled models, sub-100 MB is comfortable; 100-500 MB is feasible with a launch-time loading strategy; 500 MB+ is best handled through BGProcessingTask background download or on-demand resources.

How do I know if quantization hurt my model’s accuracy?

Hold out a test set, run inference on the original Float32 model and the quantized model, compare metrics (top-1 accuracy for classifiers, F1 for detectors, perplexity for language models, BLEU for translation), and decide based on the application’s accuracy requirements. Quantization-aware training (training the model with quantization simulated in the loss) usually recovers most of the accuracy loss.

References


  1. Apple Developer Documentation: Core ML. Framework reference covering automatic dispatch behavior across compute units. 

  2. Apple Developer Documentation: MLModelConfiguration.computeUnits. Enum cases controlling which compute units the model may use. 

  3. Apple Developer: Apple silicon performance (WWDC 2020 introduction to Apple Silicon’s unified memory architecture). 

  4. Apple Developer Documentation: Core ML Model. .mlpackage and .mlmodel format reference. 

  5. coremltools documentation. Apple’s open-source Python package for converting trained models from PyTorch, TensorFlow, and ONNX to Core ML. 

  6. Apple Developer Documentation: Profiling Core ML models with Instruments. The Core ML Instruments template for per-layer latency and dispatch analysis. 

  7. coremltools Optimization. Quantization techniques and accuracy-preservation patterns supported by Core ML. 

  8. Apple Developer: Core ML Models. Apple’s gallery of pre-trained models ready to drop into iOS apps. 

  9. Apple Developer Documentation: Encrypting a Model in Your App. The CloudKit-backed encryption workflow for Core ML models. 

  10. Apple Developer Documentation: Privacy manifest files. The format for declaring an app’s data-collection and tracking behaviors. 

13 min read