Foundation Models Image Input in iOS 27

iOS 26 gave an app an on-device large language model that read text and wrote text. iOS 27 hands that same model a pair of eyes. At WWDC26, Apple confirmed the on-device system language model “is also gaining Vision capabilities, which unlocks entire new categories of applications,”1 and the way you reach those capabilities is almost anticlimactic: you put an image into the prompt next to the words, as an attachment, and ask your question. No separate vision pipeline, no model swap, no new session type. The prompt that used to carry a string now carries a picture too, and the model answers about both.

Why the small surface area is the headline: image input is not a sidecar API bolted onto Foundation Models. Apple describes it as “a natural extension of the existing prompt builders,”1 which means every concept you already learned in iOS 26 (sessions, guided generation, the Tool protocol) keeps working unchanged the moment a prompt goes multimodal.4 If you have not met LanguageModelSession yet, start with the Foundation Models framework explainer and come back.

TL;DR

  • The iOS 27 on-device model accepts image input. You insert an image attachment into your prompt alongside text, and the model answers questions about the image1,2.
  • Image attachments can be created from a variety of types: UIImage, NSImage, CGImage, Core Image types, CoreVideo pixel buffers, and file URLs1.
  • The model supports images in any size and aspect ratio, so you do not crop or pad to a shape, but larger images consume more tokens and incur more latency1.
  • Foundation Models gives the LLM broad versatility on an image; the Vision framework gives fixed, fast, fine-tuned analysis. Apple’s guidance is to combine them through tool calling rather than choose2.
  • The Private Cloud Compute server model also supports image input, with a 32K context window (versus 4K on-device), so multimodal prompts that carry text plus several images have room to breathe3.

What changed: the prompt grew a picture

Apple’s Erik introduces Vision capabilities on the on-device model, then shows the API: create a session, insert an image attachment into the prompt alongside text, and the model answers questions about the image.

In session 241, Apple’s framing is precise. The on-device model “is more intelligent; better at logic and tool calling,” and on top of that intelligence it now gains Vision.1 The demo asks the model about a photo of origami: “Simply insert an image attachment into your prompt, together with text. Now, the model can answer questions about the image.”1 What matters is the co-location: text and image live in the same prompt, the model reads the whole prompt, and the answer reasons over both.

The set of source types is wide enough that you rarely have to convert anything yourself. Apple lists six: image attachments “can be created from a variety of types including UIImage, NSImage, CGImage, Core Image types, CoreVideo Pixel Buffers, and file URLs.”1 A UIImage straight out of PhotosPicker, a CGImage you already rendered, a frame you pulled off the camera as a CVPixelBuffer, or a file on disk by URL all become valid input. The transcript names the input types but not the exact attachment initializer signatures, so treat the type list as the contract and let Xcode autocomplete fill in the call site once you are on the iOS 27 SDK.

One constraint you do not have to fight: shape. “The model supports images in any size and aspect ratio, so you don’t need to crop or pad to any particular shape.”1 A tall receipt, a wide panorama, and a square thumbnail are all acceptable as-is. The cost is the one you would expect from any token-budgeted model: “Arbitrary image sizes are allowed, but bear in mind that larger images will consume more tokens and incur more latency.”1 An image is not free context. It spends from the same budget your text does, which is the first design decision multimodal forces on you, and the reason context size (covered below) becomes load-bearing.

Because image input rides the existing prompt builder, the iOS 26 machinery survives intact. Guided generation still shapes the output into a @Generable type. The Tool protocol still lets the model call your code. Streaming still streams. The model gained a sense, not a new programming model.

The image-understanding tie-in: Foundation Models and Vision are not rivals

Megan Williams from the Vision team contrasts the two approaches: Foundation Models leverages LLMs that “can do almost anything you ask them,” while Vision uses a fixed set of computer vision APIs that are fine-tuned, fast, and often real-time.

Session 237, “What’s new in image understanding,” is where Apple draws the line between the two ways to analyze an image, and the distinction is the most useful thing in either talk for deciding what to build. The opening is a tell: the presenter’s agenda goes missing, she photographs her sticky notes, and asks “a large language model to generate an agenda. This is pretty easy to do with the Foundation Models framework. Thankfully, this year Foundation Models is supporting image inputs.”2 That is the whole pitch for the descriptive side of multimodal: caption images, suggest interior-decorating fixes from a photo of a room, generate a recipe from a picture of a fridge. The presenter’s verdict on where the LLM shines: “Models tend to do well with descriptive tasks.”2

Then comes the honest comparison. “The Foundation Models framework leverages large language models, which can do almost anything you ask them. By comparison, traditional image processing frameworks, like Vision, use a fixed set of computer vision APIs. Vision APIs are fine-tuned for specific tasks, which they do really well. And Vision is fast. Often fast enough to analyze video frames in real time.”2 Read that as a routing rule. Open-ended question about a still image, where you want language back? Foundation Models. Specific, well-defined task at video frame rates (face detection, pose, saliency, segmentation)? Vision. The LLM is a generalist that thinks; the Vision API is a specialist that runs.

Apple’s punchline is that you do not have to choose: “you don’t always have to choose between Vision and Foundation Models to analyze your images. There’s a way to leverage Vision’s expertise with Foundation Model’s versatility using tool calling.”2 iOS 27 tool calling now supports image arguments. When a model cannot identify something itself (the talk uses plant identification), it calls a tool, and “rather than passing the whole image as an argument, the model would instead pass a reference to the image.”2 That reference, an ImageReference, “needs to be a reference to an existing image from the current chat session,”2 which the tool resolves back into an attachment through the session’s history, ready for analysis. The control loop and the built-in OCRTool and BarcodeReaderTool are the subject of the companion piece, tool-calling control in iOS 27; here the point is narrower: image input and image-argument tools are two layers of the same multimodal stack, and they compose.
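
As a rough illustration, the routing rule and the tool-calling combination described above can be written down as a small decision function. This is a sketch only: `Analyzer`, `ImageTask`, and `route` are hypothetical names invented here, not Apple API, and the three Boolean flags are a deliberate simplification of the judgment the sessions describe.

```swift
/// Where an image-analysis request should go. Hypothetical type, not Apple API.
enum Analyzer: Equatable {
    case foundationModels    // open-ended question, language answer
    case vision              // fixed computer-vision task, possibly real time
    case visionThroughTool   // LLM session that calls a Vision-backed tool
}

/// A simplified description of the request. All fields are illustrative assumptions.
struct ImageTask {
    var wantsLanguageAnswer: Bool   // caption, recipe, agenda, explanation
    var needsRealTime: Bool         // must keep up with video frames
    var needsSpecialistSkill: Bool  // e.g. barcode reading, pose, segmentation
}

/// Applies the routing rule: Vision for real-time or fixed tasks,
/// Foundation Models for descriptive ones, and a tool call when both are needed.
func route(_ task: ImageTask) -> Analyzer {
    // Vision is the option session 237 describes as fast enough for video frames.
    if task.needsRealTime { return .vision }
    if task.wantsLanguageAnswer {
        return task.needsSpecialistSkill ? .visionThroughTool : .foundationModels
    }
    return .vision
}
```

Under these assumptions, a recipe from a fridge photo routes to `.foundationModels`, per-frame pose tracking routes to `.vision`, and a "what plant is this, and how do I care for it" question routes to `.visionThroughTool`.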

Multimodal on Private Cloud Compute: same prompt, bigger room

Louis confirms the on-device model “now has support for image input,” then shows the server model: an article-summarizer that takes “the text and images” of a markdown file, feeds them into a LanguageModelSession, and summarizes using the 32K context PCC offers.

Session 319 opens by confirming both halves of the multimodal story at once. The on-device model “now has support for image input, it’s better at instruction following and calling your custom tools,”3 and for the heavier cases there is a new server model on Private Cloud Compute. The reason multimodal and PCC belong in the same conversation is context size. Apple states the numbers plainly: “The on-device model offers 4k, and with PCC you get 32K.”3 An image spends tokens from that budget1, so a prompt that carries text plus several images is exactly the kind of payload that strains 4K and fits comfortably in 32K.
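
The budget reasoning above can be sketched as a small planning helper. Everything here is an assumption made for illustration: `ModelTarget`, `PromptEstimate`, and `smallestFittingTarget` are hypothetical names, "4k" and "32K" are read as 4,096 and 32,768 tokens, the response is assumed to share the same window as the prompt, and the per-image token figures are placeholders you would measure yourself, since the quoted sessions do not give a per-image cost.

```swift
/// The two models discussed in session 319. Hypothetical type, not Apple API.
enum ModelTarget: Equatable {
    case onDevice
    case privateCloudCompute

    /// Context window in tokens, assuming "4k" means 4,096 and "32K" means 32,768.
    var contextWindow: Int {
        switch self {
        case .onDevice: return 4_096
        case .privateCloudCompute: return 32_768
        }
    }
}

/// A rough token estimate for one multimodal prompt. All numbers are caller-supplied guesses.
struct PromptEstimate {
    var textTokens: Int
    var imageTokens: [Int]      // one estimate per attached image
    var responseReserve: Int    // room kept free for the model's answer

    var totalTokens: Int {
        textTokens + imageTokens.reduce(0, +) + responseReserve
    }
}

/// Returns the smallest model whose window fits the estimate, or nil if neither does.
func smallestFittingTarget(for estimate: PromptEstimate) -> ModelTarget? {
    let candidates: [ModelTarget] = [.onDevice, .privateCloudCompute]
    return candidates.first { estimate.totalTokens <= $0.contextWindow }
}
```

With these placeholder numbers, 600 tokens of text plus one image estimated at 1,000 tokens and a 500-token reserve totals 2,100 and stays on-device, while 1,500 tokens of text plus three images at 1,200 each and an 800-token reserve totals 5,900 and moves to the server model. A `nil` result means the prompt has to shrink, for example by dropping or downscaling images.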

The summarizer demo makes the fit concrete. “Here I have an app that summarizes an article using the PCC model. I can select a markdown file, and we take the text and images, feed that into a LanguageModelSession, and generate a summary. This works great with the large context size that PCC offers.”3 Text and images, one session, one prompt. The migration cost from on-device to server is one line: Apple shows that “by changing just 1 line of code, you can switch to the new server model on PCC,”3 because “the Foundation Models framework offers a unified Swift API, regardless of which model you’re talking to.”3 Guided generation with Generable and tool calling “works just the same with the PCC model, as it does with the on-device model.”3

PCC adds reasoning, which on-device does not have, and reasoning has a multimodal-relevant cost: “reasoning is extra text that the model generates. So it uses tokens. This counts towards your context size limit.”3 Pair deep reasoning with several full-resolution images in one prompt and you are spending the 32K budget from both ends. The deeper details (the three reasoning levels, the quotaUsage and isLimitReached daily-limit handling, the entitlement you apply for on the developer site) belong to the Private Cloud Compute deep dive; the multimodal takeaway is that the same image-bearing prompt runs on both models, and the server model exists for when the prompt outgrows the device.

Adopting image input

A short checklist that follows from the contracts above.

Start on-device, measure, then decide. Apple advises choosing the model “based on data, not just vibes,”3 and adds that “you may be surprised how well the on-device model performs at certain tasks, especially with the updated model this year.”3 A caption or a what-is-this-object query may never need PCC. Reach for the server model when the prompt carries multiple images or long text that overruns the 4K on-device window3.

Pick the cheapest source type for your pipeline. You can hand the model a UIImage, NSImage, CGImage, Core Image type, CVPixelBuffer, or file URL1. If a frame already exists as a pixel buffer from the camera or as a file on disk, pass it directly rather than round-tripping through UIImage.

Treat image resolution as a budget knob, not a quality dial. Any size and aspect ratio is legal1, so resist over-cropping for shape. But because larger images cost more tokens and more latency1, downscale a 48-megapixel photo before it enters a prompt when the task (read this sign, what room is this) does not need every pixel.
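
One way to turn that budget knob is to compute an aspect-preserving target size before resizing with whatever image type your pipeline already uses. The sketch below covers only the arithmetic. `PixelSize`, `downscaledSize`, and the `maxLongEdge` parameter are hypothetical names, and the right limit for a given task is something to measure, not a value Apple documents in the quoted sessions.

```swift
/// Pixel dimensions of an image. Hypothetical helper type.
struct PixelSize: Equatable {
    var width: Int
    var height: Int
}

/// Returns the size to downscale to so the longer edge is at most `maxLongEdge`,
/// preserving aspect ratio. Images already within the limit come back unchanged,
/// so this never upscales and never crops or pads.
func downscaledSize(for original: PixelSize, maxLongEdge: Int) -> PixelSize {
    let longEdge = max(original.width, original.height)
    guard maxLongEdge > 0, longEdge > maxLongEdge else { return original }

    let scale = Double(maxLongEdge) / Double(longEdge)
    // Round to the nearest pixel and clamp to 1 so extreme aspect ratios stay valid.
    let width = max(1, Int((Double(original.width) * scale).rounded()))
    let height = max(1, Int((Double(original.height) * scale).rounded()))
    return PixelSize(width: width, height: height)
}
```

For example, an 8064 × 6048 capture limited to a 2,016-pixel long edge becomes 2016 × 1512, one sixteenth of the original pixel count, while a 1000 × 500 thumbnail passes through untouched.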

Route by task, not by reflex. Descriptive, open-ended, language-out work goes to Foundation Models; fixed, fast, real-time computer-vision work goes to Vision; when you need both, call Vision from inside a Foundation Models tool2. The two frameworks are complementary layers, and the image-argument tool is the seam that joins them.

Keep the availability check. Image input rides the same model, and the model is “only available on Apple Intelligence devices.”3 Check the availability API and degrade gracefully where Apple Intelligence is absent3.
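
A minimal sketch of that check, assuming the `SystemLanguageModel.default.availability` API that shipped with iOS 26 still gates the model. `ImageInputSupport`, `imageInputSupport`, and `fallbackMessage` are hypothetical names, and the `#available` versions below are the iOS 26 generation; raise them to the release that ships image input once you build against that SDK. The `canImport` guard lets the file compile on SDKs without the framework.

```swift
#if canImport(FoundationModels)
import FoundationModels
#endif

/// Whether the app can offer image questions. Hypothetical type, not Apple API.
enum ImageInputSupport: Equatable {
    case ready
    case unavailable(reason: String)
}

/// Asks the system model whether it can be used on this device right now.
func imageInputSupport() -> ImageInputSupport {
    #if canImport(FoundationModels)
    if #available(iOS 26.0, macOS 26.0, visionOS 26.0, *) {
        switch SystemLanguageModel.default.availability {
        case .available:
            return .ready
        case .unavailable(let reason):
            // Covers cases such as an ineligible device or Apple Intelligence turned off.
            return .unavailable(reason: String(describing: reason))
        @unknown default:
            return .unavailable(reason: "unknown")
        }
    }
    return .unavailable(reason: "operating system predates Foundation Models")
    #else
    return .unavailable(reason: "FoundationModels is not part of this SDK")
    #endif
}

/// Text to show in place of the image feature, or nil when the feature is usable.
func fallbackMessage(for support: ImageInputSupport) -> String? {
    switch support {
    case .ready:
        return nil
    case .unavailable:
        return "Image questions need Apple Intelligence, which is not available on this device right now."
    }
}
```

Calling `imageInputSupport()` before showing the image-question entry point, and rendering `fallbackMessage(for:)` when it is non-nil, keeps the rest of the app usable where the model is absent.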

FAQ

How do I send an image to the Foundation Models model in iOS 27?

You insert an image attachment into your prompt alongside the text, using the existing prompt builder, then ask the model to respond. Apple describes the API as “a natural extension of the existing prompt builders”: create a session, “simply insert an image attachment into your prompt, together with text,” and “the model can answer questions about the image.”1 No separate vision pipeline and no new session type are involved.

What image types can I pass to Foundation Models?

Image attachments can be created from UIImage, NSImage, CGImage, Core Image types, CoreVideo pixel buffers, and file URLs1. The transcript enumerates those source types; it does not spell out every initializer signature, so let the iOS 27 SDK supply the exact call site.

Do I need to resize or crop images before sending them?

No. “The model supports images in any size and aspect ratio, so you don’t need to crop or pad to any particular shape.”1 The trade-off is cost, not legality: “larger images will consume more tokens and incur more latency,”1 so downscaling a very large photo is a budget decision when the task does not need full resolution.

When should I use Vision instead of Foundation Models for an image?

Use Vision for fixed, well-defined, speed-critical tasks. Apple notes Vision “use[s] a fixed set of computer vision APIs,” is “fine-tuned for specific tasks,” and is “often fast enough to analyze video frames in real time,” while Foundation Models “can do almost anything you ask” and excels at descriptive tasks2. When you want both, call a Vision-backed tool from a Foundation Models session through tool calling2.

Does image input work with the Private Cloud Compute server model?

Yes. Apple confirms the on-device model “now has support for image input,”3 and the PCC demo feeds “the text and images” of a document into a LanguageModelSession for summarization3. The same unified Swift API runs on both models, so the same image-bearing prompt works on-device or on the server with a one-line change. PCC’s 32K context (versus 4K on-device) gives multi-image prompts more room3.

The full Apple Ecosystem cluster: the Foundation Models framework explainer; the on-device LLM; tool-calling control in iOS 27; and the Private Cloud Compute deep dive. The hub is the Apple Ecosystem Series. For broader iOS-with-AI-agents context, see the iOS Agent Development guide.



  1. Apple, WWDC26 session 241, “What’s new in the Foundation Models framework.” developer.apple.com/videos/play/wwdc2026/241. Apple states the on-device model “is also gaining Vision capabilities,” describes the API as “a natural extension of the existing prompt builders” where you “simply insert an image attachment into your prompt, together with text,” lists the supported source types (UIImage, NSImage, CGImage, Core Image types, CoreVideo pixel buffers, and file URLs), and notes the model “supports images in any size and aspect ratio” while “larger images will consume more tokens and incur more latency.” 

  2. Apple, WWDC26 session 237, “What’s new in image understanding.” developer.apple.com/videos/play/wwdc2026/237. Apple states “this year Foundation Models is supporting image inputs,” contrasts the Foundation Models LLM (“can do almost anything you ask them,” strong at descriptive tasks) with the Vision framework (“a fixed set of computer vision APIs,” “fine-tuned for specific tasks,” “fast enough to analyze video frames in real time”), and shows tool calling supporting image arguments via an ImageReference to “an existing image from the current chat session” resolved through the session’s history. 

  3. Apple, WWDC26 session 319, “Build with the new Apple Foundation Model on Private Cloud Compute.” developer.apple.com/videos/play/wwdc2026/319. Apple confirms the on-device model “now has support for image input,” states “the on-device model offers 4k, and with PCC you get 32K,” shows switching to the PCC server model “by changing just 1 line of code” through “a unified Swift API,” demonstrates feeding “the text and images” of a document into a LanguageModelSession, advises choosing a model “based on data, not just vibes,” and notes reasoning “is extra text that the model generates” that “counts towards your context size limit.” 

  4. Apple Developer, “Foundation Models” framework documentation. Reference for LanguageModelSession, the prompt builder, guided generation via @Generable, and the Tool protocol that the iOS 27 image-input and image-argument features extend. 
