Apple Foundation Models: The On-Device LLM Framework, Explained

The Foundation Models framework gives an app direct, free, offline access to the same on-device large language model that powers Apple Intelligence [1]. No API key, no per-token bill, no network round trip, no data leaving the device. For a class of features that used to mean a cloud LLM and a privacy review, the cost now rounds to zero. The trade is capability: the on-device model is small, the context window is finite, and the framework draws hard lines around what it will and will not do. Knowing those lines is the whole game.

This is the reference for the framework itself: the types you actually call, the one feature that makes it worth using, and the point where you should stop and reach for something bigger.

TL;DR

  • LanguageModelSession is the entry point. Create one, call respond(to:), get text back. Multi-turn context lives in the session; single-turn work gets a fresh session each time [2].
  • Guided generation is the reason to use this framework. Annotate a Swift type with @Generable and the model returns that type, populated and type-checked, instead of a string you have to parse [3].
  • The Tool protocol lets the model call your code mid-generation to fetch data or take an action, then fold the result back into its answer [4].
  • Check SystemLanguageModel.default.availability before you do anything. The model is absent on ineligible devices, with Apple Intelligence off, or while it downloads [5].
  • The context window is real and small. SystemLanguageModel.default.contextSize reports the token budget shared across prompt and response [6]. Plan for it, or the session throws.
  • Requires iOS 26 and an Apple-Intelligence-capable device. Below that floor, the framework does not exist.

What the framework is, and what it is not

Foundation Models is not a wrapper around a cloud endpoint. The model lives on the device, ships with the operating system, and runs against the Neural Engine. That single fact drives every design decision in the API and every decision you make using it.

What you get: text generation, summarization, classification, extraction, short-form rewriting, and structured output, all on-device and all free. What you do not get: a frontier model. Apple built the on-device model for focused language tasks inside an app, not open-ended reasoning, not long-document analysis, not world knowledge you can quiz. Apple says as much, and the framing matters because it sets expectations the API will otherwise let you violate [1].

The mental model that keeps you out of trouble: treat the on-device model as a fast, private, free intern who is excellent at shaping text and terrible at knowing facts. Hand it material and a clear task. Do not ask it questions it has no way to answer.

LanguageModelSession: the entry point

Every interaction starts with a session.

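import FoundationModels

// Placeholder input; in a real app this comes from your own data.
let reviewText = "Great battery life, but the strap broke after a week."

let session = LanguageModelSession()
let response = try await session.respond(to: "Summarize this review in one sentence: \(reviewText)")
print(response.content)   // the generated text, as a String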

The session holds conversation state. Each call to respond(to:) appends to the running transcript, so a session you keep around remembers what came before. For a chat feature, that is what you want. For independent one-shot tasks (summarize this, classify that), create a fresh session per call so stale context does not leak in and eat your token budget [2].

respond(to:) is async throws. It suspends while the model works and throws when the request exceeds the context window, when the model is unavailable, or when guardrails reject the content. Every one of those is a real branch you handle, not an edge case you ignore.
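Those three branches can be sketched as a do/catch around a single call. This is a minimal sketch, not Apple's prescribed pattern: it assumes the relevant cases on LanguageModelSession.GenerationError are named exceededContextWindowSize, guardrailViolation, and assetsUnavailable, and the Summarizer type and its printed messages are hypothetical.

```swift
import FoundationModels

// Hypothetical wrapper, for illustration only.
struct Summarizer {
    /// Returns a one-sentence summary, or nil when the model cannot produce one.
    func summarize(_ text: String) async -> String? {
        // One-shot task: a fresh session per call keeps old context out.
        let session = LanguageModelSession()
        do {
            let response = try await session.respond(
                to: "Summarize this review in one sentence: \(text)"
            )
            return response.content
        } catch let error as LanguageModelSession.GenerationError {
            // Assumed case names; check them against the SDK you build with.
            switch error {
            case .exceededContextWindowSize:
                // Prompt plus response did not fit. Shorten or chunk the input.
                print("Input too long for the context window.")
            case .guardrailViolation:
                // Guardrails rejected the prompt or the generated output.
                print("Content was rejected by guardrails.")
            case .assetsUnavailable:
                // Model assets are missing or not ready.
                print("Model is not available right now.")
            default:
                print("Generation failed: \(error)")
            }
            return nil
        } catch {
            // Anything that is not a GenerationError, such as cancellation.
            print("Unexpected error: \(error)")
            return nil
        }
    }
}
```

Returning nil keeps the decision about what to show the user in the calling code, where the product context lives.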

For a responsive UI, stream instead of waiting. streamResponse(to:) yields partial output as the model produces it, which turns a three-second stall into text that appears as it forms [7].
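A streaming loop might look like the sketch below. It assumes each element of the stream is a snapshot whose content property holds the text generated so far, which is how the shipping SDK appears to behave; early betas yielded the partial content directly, so verify against your toolchain. The streamSummary helper and its onUpdate callback are hypothetical.

```swift
import FoundationModels

// Hypothetical helper: streams a summary and hands each update to the caller.
func streamSummary(of text: String, onUpdate: (String) -> Void) async throws {
    let session = LanguageModelSession()
    let stream = session.streamResponse(to: "Summarize in two sentences: \(text)")

    for try await snapshot in stream {
        // Assumption: `content` is the text produced so far, so the caller
        // replaces what it is showing with the latest value.
        onUpdate(snapshot.content)
    }
}
```

In a SwiftUI view, onUpdate would typically assign to a @State string that a Text view displays.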

Guided generation: the feature that earns the framework

Here is the part worth the price of admission. Most LLM integrations spend a third of their code coaxing valid JSON out of a model and the other two-thirds defending against the times it fails anyway. Foundation Models deletes that work.

Annotate a Swift type with @Generable, ask the session to generate it, and the model returns an instance of that type, populated and type-safe [3]:

@Generable
struct Recipe {
    @Guide(description: "The dish name")
    let title: String

    @Guide(description: "Ingredients, each as 'quantity item'")
    let ingredients: [String]

    @Guide(description: "Total minutes, start to finish", .range(5...240))
    let minutes: Int
}

let session = LanguageModelSession()
let response = try await session.respond(
    to: "A weeknight pasta for two.",
    generating: Recipe.self
)
let recipe = response.content   // a Recipe, not a String

No parsing. No JSONDecoder. No retry loop for malformed output. The @Guide macro constrains individual fields: a description the model reads as instruction, and optional limits like a numeric range or a regular expression the output must match [8]. The framework does not ask the model nicely for a number between 5 and 240; it constrains decoding so the field cannot come back otherwise.
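Ranges are not the only constraint. The sketch below assumes two further guides, .anyOf for strings and .maximumCount for arrays, which I believe the framework provides but which the text above does not cover. The SupportTicket type and the triage function are hypothetical.

```swift
import FoundationModels

// Hypothetical output type for a support-triage feature.
@Generable
struct SupportTicket {
    // Assumed guide: restricts the string to one of the listed values.
    @Guide(description: "Urgency of the request", .anyOf(["low", "medium", "high"]))
    let priority: String

    // Assumed guide: caps the number of array elements.
    @Guide(description: "Short topic tags", .maximumCount(3))
    let tags: [String]

    @Guide(description: "One-sentence summary of the problem")
    let summary: String
}

// Hypothetical helper: turns a free-text message into a typed ticket.
func triage(_ message: String) async throws -> SupportTicket {
    let session = LanguageModelSession()
    let response = try await session.respond(
        to: "Triage this support message: \(message)",
        generating: SupportTicket.self
    )
    return response.content
}
```

The design point is the same as with Recipe: the constraints live on the type, so the calling code never validates the shape of what comes back.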

The discipline this enforces is the real value. You design the output type first, in Swift, with the compiler checking it. The model fills a contract you defined instead of returning prose you reverse-engineer. For extraction, form-filling, and any feature that turns language into data, guided generation is the difference between a demo and shipping code.

One control worth knowing: respond(to:generating:) defaults includeSchemaInPrompt to true, which injects your type’s shape into the prompt to bias the model toward it. Leave it on unless the model already knows the format from training or from earlier turns in the session; turning it off to save tokens on a format the model has not seen is how you get garbage back [9].
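Here is a sketch of the one case where turning it off is reasonable: a second turn in a session whose transcript already contains the format. The twoRecipes helper is hypothetical, and Recipe is trimmed to two fields to keep the sketch short.

```swift
import FoundationModels

@Generable
struct Recipe {
    @Guide(description: "The dish name")
    let title: String

    @Guide(description: "Total minutes, start to finish", .range(5...240))
    let minutes: Int
}

// Hypothetical helper: two recipes from one session.
func twoRecipes() async throws -> [Recipe] {
    let session = LanguageModelSession()

    // First turn: default behavior, the schema is included in the prompt.
    let first = try await session.respond(
        to: "A weeknight pasta for two.",
        generating: Recipe.self
    )

    // Second turn: the transcript already shows the format, so skip the schema.
    let second = try await session.respond(
        to: "Now a vegetarian version.",
        generating: Recipe.self,
        includeSchemaInPrompt: false
    )
    return [first.content, second.content]
}
```

Whether the saved tokens are worth it depends on the size of your type, so measure before making this the default.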

Tool calling: letting the model reach your code

Guided generation shapes what comes out. Tool calling changes what goes in. A tool is a piece of your code the model can invoke mid-generation to fetch information it does not have or perform an action, then continue its answer using the result [4].

A tool conforms to the Tool protocol: a name, a description the model reads to decide when to call it, a @Generable Arguments type, and a call(arguments:) method that does the work [4]:

struct FindContacts: Tool {
    let name = "findContacts"
    let description = "Find a specific number of contacts from the address book"

    @Generable
    struct Arguments {
        @Guide(description: "How many contacts to return", .range(1...10))
        let count: Int
    }

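    func call(arguments: Arguments) async throws -> [String] {
        // Placeholder data so the sketch compiles; a real tool would
        // query the Contacts framework here and return formatted names.
        let sample = ["Ada Lovelace", "Grace Hopper", "Alan Turing"]
        return Array(sample.prefix(arguments.count))
    }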
}

let session = LanguageModelSession(tools: [FindContacts()])
let response = try await session.respond(to: "Draft a dinner invite to three of my contacts.")

The flow: the model decides it needs contacts, calls your tool with a validated count, you return data, and the model writes the invite using real names. The arguments arrive type-checked through the same guided-generation machinery, so you never parse the model’s intent out of free text. The tool description is your only lever on when the model reaches for it, so write it like a function doc that another engineer (with no other context) has to read and use correctly.

This is also the seam where Foundation Models meets the rest of the agent story. A tool the on-device model calls and an App Intent Apple Intelligence calls are different surfaces with the same shape: a named, described, typed capability. Design the capability once and you can expose it through both.

Availability: the check you cannot skip

The model is not always there. It is absent on devices that do not support Apple Intelligence, when the user has it switched off, and during the window when the operating system is still downloading model assets. Ship code that assumes the model exists and it will crash, silently degrade, or hang for a population of your users you never tested on.

Check SystemLanguageModel.default.availability and branch on the reason [5]:

switch SystemLanguageModel.default.availability {
case .available:
    // Show the intelligence feature.
    break
case .unavailable(.deviceNotEligible):
    // Hide it. This device will never have the model.
    break
case .unavailable(.appleIntelligenceNotEnabled):
    // Prompt the user to turn on Apple Intelligence.
    break
case .unavailable(.modelNotReady):
    // Downloading or otherwise not ready yet. Try again later.
    break
case .unavailable(let other):
    // Unknown reason. Fail closed and log it.
    print("Foundation Models unavailable: \(other)")
}

The three reasons demand three different product responses, and conflating them is the most common way these features feel broken. deviceNotEligible is permanent: hide the feature, do not nag. appleIntelligenceNotEnabled is a setting the user controls: a one-time prompt is fair. modelNotReady is temporary: retry, do not show an error. Build the unavailable path with the same care as the happy path, because for a real slice of devices it is the only path.

When the model is available and you know a request is coming, prewarm() on the session warms the model so the first real response lands faster [10]. Worth it on a screen the user is about to act on, wasteful if you call it speculatively.
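A sketch of that timing in SwiftUI: prewarm when the screen appears, since the user is one tap away from a request. The ReviewSummaryView type, its layout, and the fallback string are hypothetical; the isAvailable check assumes the convenience boolean on SystemLanguageModel.default.

```swift
import FoundationModels
import SwiftUI

// Hypothetical screen where the user is about to request a summary.
struct ReviewSummaryView: View {
    let reviewText: String
    @State private var session = LanguageModelSession()
    @State private var summary = ""

    var body: some View {
        VStack {
            Text(summary)
            Button("Summarize") {
                Task {
                    // try? collapses every failure into the fallback text below.
                    let response = try? await session.respond(
                        to: "Summarize this review in one sentence: \(reviewText)"
                    )
                    summary = response?.content ?? "Summary unavailable."
                }
            }
        }
        .onAppear {
            // Warm the model only when it is actually there.
            if SystemLanguageModel.default.isAvailable {
                session.prewarm()
            }
        }
    }
}
```

The same session instance is prewarmed and then used for the request, which is the point: warming one session and responding from another would waste the work.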

The context window, and where it stops being enough

SystemLanguageModel.default.contextSize reports the token budget the model works inside, and that budget is shared: prompt plus response together must fit [6]. The number is small relative to a cloud model, and you feel it fast on real input. A long document, a full chat history, a fat tool result: any of them can blow the budget and make respond throw.

Two failure modes follow, and both are yours to prevent. First, the slow creep: a multi-turn session accumulates transcript until one more turn overflows. Manage it by starting fresh sessions for unrelated work and by keeping per-turn input lean. Second, the single oversized request: a 20-page PDF does not fit, full stop. Chunk it, summarize the chunks, then reason over the summaries (the map-reduce that LLM engineers know well), or accept that the task is the wrong shape for an on-device model.
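The chunk-then-summarize approach can be sketched in plain Swift. This is a rough sketch under stated assumptions: character count stands in for tokens, which is only an approximation; chunkByParagraph and summarizeLongText are hypothetical names; and the model call is injected as a closure so the splitting logic stays independent of the framework.

```swift
import Foundation

/// Splits text into chunks of at most `maxCharacters`, breaking on blank lines.
/// A single paragraph longer than the limit becomes its own oversized chunk.
func chunkByParagraph(_ text: String, maxCharacters: Int) -> [String] {
    var chunks: [String] = []
    var current = ""
    for paragraph in text.components(separatedBy: "\n\n") where !paragraph.isEmpty {
        // Start a new chunk when adding this paragraph would exceed the limit.
        if !current.isEmpty && current.count + paragraph.count + 2 > maxCharacters {
            chunks.append(current)
            current = ""
        }
        current += current.isEmpty ? paragraph : "\n\n" + paragraph
    }
    if !current.isEmpty { chunks.append(current) }
    return chunks
}

/// Map-reduce: summarize each chunk, then summarize the joined summaries.
/// `summarize` is supplied by the caller and performs the actual model call.
func summarizeLongText(
    _ text: String,
    maxCharacters: Int,
    summarize: (String) async throws -> String
) async throws -> String {
    let chunks = chunkByParagraph(text, maxCharacters: maxCharacters)
    // Short input: one pass is enough.
    if chunks.count <= 1 { return try await summarize(text) }

    var partials: [String] = []
    for chunk in chunks {
        partials.append(try await summarize(chunk))   // map step
    }
    return try await summarize(partials.joined(separator: "\n"))   // reduce step
}
```

With the framework, the closure would create a fresh LanguageModelSession per chunk and return the content of respond(to:), so no chunk inherits another chunk's transcript. If the joined summaries are themselves too long, the reduce step needs another round, which this sketch does not handle.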

The context window is the cleanest signal for the decision that actually matters with this framework: when to stay on-device and when to leave.

When not to use Foundation Models

The framework is free, private, and offline, which makes it tempting to reach for everywhere. Resist. Reach past it when:

  • You need real reasoning or breadth of world knowledge. The on-device model is small by design. Open-ended reasoning, code generation, and deep analysis belong to a frontier cloud model. Asking the on-device model for them produces confident, wrong answers.
  • The input does not fit the context window and chunking would destroy the meaning. Some tasks need to see everything at once.
  • You need a model you control: a specific checkpoint, a fine-tune, custom weights, deterministic versioning across OS updates. Apple ships and updates the model on its schedule, not yours.
  • You are below iOS 26 or on an ineligible device. The framework simply is not there, and the availability check will tell you so on every run.

For the on-device cases the framework does not cover (a custom model, your own weights, training on the device), the layer below is Core ML and Apple’s MLX. For the cases that genuinely need scale, a cloud LLM behind a privacy boundary is still the honest answer. Foundation Models is not a replacement for either. It is the right first reach for focused language work on text you already hold, and the wrong reach for everything else.

The skill this framework rewards is not prompt-craft. It is taste about scope: feeding the model tasks it is good at, designing @Generable types that capture exactly what you need, and recognizing the moment the work outgrows the device. Build with those instincts and the on-device model does a surprising amount of real work for free. Ignore them and you ship a feature that breaks for every user whose input ran one token too long.



  1. Apple Developer, “Foundation Models” framework overview. Apple describes the framework as access to the on-device model that powers Apple Intelligence, suited to focused language tasks such as text generation, summarization, classification, and structured output rather than open-ended reasoning or world knowledge. 

  2. Apple Developer, “LanguageModelSession” and “Generating content and performing tasks with Foundation Models”. A session holds multi-turn context; Apple’s guidance is to create a new session for each distinct single-turn interaction. 

  3. Apple Developer, “Generable” and “Prompting an on-device foundation model”. The @Generable macro lets the framework return a populated, type-checked Swift value rather than a string. 

  4. Apple Developer, “Tool” protocol. Defines protocol Tool<Arguments, Output>: Sendable with required name, description, and parameters: GenerationSchema, plus call(arguments:) async throws -> Output. The Arguments type conforms to ConvertibleFromGeneratedContent and is typically declared @Generable.

  5. Apple Developer, “SystemLanguageModel.Availability” and its UnavailableReason. Cases: .available and .unavailable(...) with reasons deviceNotEligible, appleIntelligenceNotEnabled, and modelNotReady. SystemLanguageModel.default.isAvailable is the convenience boolean. 

  6. Apple Developer, “SystemLanguageModel.contextSize”. An instance property (reached through SystemLanguageModel.default) documented as the maximum context size, representing the total tokens across input prompt and generated response. 

  7. Apple Developer, “LanguageModelSession.streamResponse(to:)”. Streams partial generated output as the model produces it, for incremental UI updates. 

  8. Apple Developer, “Guide(description:_:)”. A peer macro that attaches a natural-language description and optional constraints (numeric ranges, regular-expression guides) to a @Generable property. Requires iOS 26.0+. 

  9. Apple Developer, “respond(to:schema:includeSchemaInPrompt:options:)”. includeSchemaInPrompt defaults to true; Apple’s discussion recommends keeping the default unless the model already knows the expected format. 

  10. Apple Developer, “LanguageModelSession.prewarm()”. Asks the framework to load model resources ahead of a known upcoming request to reduce first-response latency. 

  11. Author’s related analysis: On-Device LLMs with Apple’s Foundation Models, Custom Adapters for Foundation Models, Foundation Models Use Cases, and Agentic Workflows on Foundation Models. The App Intents and tool-surface argument is developed in App Intents Are Apple’s New API to Your App
