Apple's Vision Framework: What's Built In That Most Devs Reach for Cloud APIs For
Apple’s Vision framework, the one without the “OS” suffix, ships more than two dozen on-device computer vision operations. Most iOS developers default to OpenAI Vision API, Google Cloud Vision, or AWS Rekognition for tasks the framework performs in milliseconds on the device’s Neural Engine. The default reflects a bias more than an evaluation: cloud APIs feel “modern AI” and Vision feels “platform plumbing,” so the platform gets skipped. The bias misreads what the platform now contains.
Vision is the local-first CV framework. It runs on the Neural Engine when available, on the GPU otherwise, and on the CPU as a last resort. Inference happens in a few milliseconds for most operations. The framework costs nothing per call. The data never leaves the device. The API key does not exist because no API exists. For most computer vision work an iOS app does, this is the right tool.
TL;DR
- Apple Vision provides more than two dozen on-device CV operations: text recognition, face detection and landmarks, body and hand pose estimation, barcode reading, document segmentation, image embeddings, saliency, animal detection, contours, trajectories, optical flow, and a runner for any Core ML model.
- Each operation runs in milliseconds on the Neural Engine, costs nothing per call, requires no network, and produces no third-party telemetry.
- Cloud APIs win for one specific case: complex semantic reasoning about an image (a multimodal LLM understanding a chart, a meme, or a document’s intent). For pixel-level operations (find faces, read text, detect a hand), Vision wins on cost, latency, and privacy.
- The agent-workflow connection: Vision results feed App Intents and Foundation Models on-device LLM calls without a network roundtrip. The whole pipeline runs locally.
What Vision Actually Contains
Vision groups its operations as VNRequest types. A request gets created, configured with parameters, fed an image (or CVPixelBuffer, or CIImage, or CGImage, or URL), and run. Results come back as observations attached to the request. The categories below cover the framework’s territory as of iOS 26.
Text Recognition
VNRecognizeTextRequest performs OCR. The request supports recognitionLevel (.fast for live camera streams, .accurate for document scanning), language hints, custom word lists, and bounding-box confidence. The accurate path on iOS 18+ rivals commercial OCR APIs on receipts, signs, and printed documents; handwriting recognition is supported in many languages.
```swift
import Vision

// Recognize printed text in a CGImage; `image` is assumed to exist.
let request = VNRecognizeTextRequest { request, error in
    guard let observations = request.results as? [VNRecognizedTextObservation] else { return }
    // Take the top candidate for each detected line of text.
    let lines = observations.compactMap { $0.topCandidates(1).first?.string }
    print(lines.joined(separator: "\n"))
}
request.recognitionLevel = .accurate
request.usesLanguageCorrection = true
request.recognitionLanguages = ["en-US"]

let handler = VNImageRequestHandler(cgImage: image, options: [:])
try handler.perform([request])
```
The same operation through OpenAI Vision API costs roughly a fraction of a cent per call for low-detail mode and significantly more for high-detail mode, takes 1-3 seconds round-trip, and sends the image to OpenAI’s servers. Vision returns results in 100-300 ms locally, free, with no data exfiltration.
Face Detection and Landmarks
Three layers of face analysis ship in Vision:
- VNDetectFaceRectanglesRequest returns bounding boxes for every face in the frame.
- VNDetectFaceLandmarksRequest returns structured landmark regions per face (jawline, mouth, eyes, eyebrows, nose, pupils), each with multiple keypoints.
- VNDetectFaceCaptureQualityRequest returns a quality score the Camera app uses for selfie capture timing.
For most apps that need to find faces, crop to faces, blur faces, or count faces, the rectangles request is the right tool. For apps that animate something to a user’s face (filters, masks, tracking), landmarks plus pupil tracking are the right tool. None of this requires a model file or a network call.
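A minimal sketch of the rectangles request, assuming `image` is a CGImage already in hand:

```swift
import Vision

// Find face bounding boxes; `image` is assumed to be a CGImage.
let faceRequest = VNDetectFaceRectanglesRequest { request, error in
    guard let faces = request.results as? [VNFaceObservation] else { return }
    // Bounding boxes are normalized (0-1) with the origin at the bottom-left.
    for face in faces {
        print("Face at \(face.boundingBox), confidence \(face.confidence)")
    }
}
let handler = VNImageRequestHandler(cgImage: image, options: [:])
try handler.perform([faceRequest])
```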
Body and Hand Pose
VNDetectHumanBodyPoseRequest returns the 19 named joints in VNHumanBodyPoseObservation.JointName [4] (nose, neck, shoulders, elbows, wrists, hips, knees, ankles, ears, eyes, root) with 2D coordinates and per-joint confidence. VNDetectHumanBodyPose3DRequest extends the topology into 3D space on devices with a LiDAR Scanner. VNDetectHumanHandPoseRequest returns 21 hand landmarks at finger-joint resolution.
Body pose is what fitness apps use to count reps without a wearable, what AR apps use to attach virtual content to a user’s hands, and what posture apps use to evaluate form. Hand pose drives gesture recognition (the user holds up two fingers, the app sees two fingers). Both run at 60 fps on a recent iPhone’s Neural Engine. The usual alternatives are Google’s MediaPipe (an on-device library with its own integration cost) or proprietary fitness-tech APIs; Vision covers the same ground as a built-in framework.
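A hedged sketch of a pinch check built on hand pose, assuming `frame` is a CVPixelBuffer coming off an AVCaptureSession; the 0.05 threshold is an illustrative value, not a framework constant:

```swift
import Foundation
import Vision

// Detect one hand and check for a pinch (thumb tip close to index tip).
// `frame` is assumed to be a CVPixelBuffer from an AVCaptureSession.
let handRequest = VNDetectHumanHandPoseRequest()
handRequest.maximumHandCount = 1

let handler = VNImageRequestHandler(cvPixelBuffer: frame, options: [:])
try handler.perform([handRequest])

if let hand = handRequest.results?.first {
    let points = try hand.recognizedPoints(.all)
    if let thumb = points[.thumbTip], let index = points[.indexTip],
       thumb.confidence > 0.5, index.confidence > 0.5 {
        // Locations are normalized image coordinates; 0.05 is an illustrative threshold.
        let distance = hypot(thumb.location.x - index.location.x,
                             thumb.location.y - index.location.y)
        print("Pinching:", distance < 0.05)
    }
}
```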
Barcode and QR
VNDetectBarcodesRequest reads the symbologies most retail and inventory workflows need (QR, PDF417, Aztec, Code 128, Code 39, EAN-13, ITF14, Data Matrix, GS1 DataBar, and more) and returns the raw payload plus the bounding rectangle. The detection runs in milliseconds and works in low-light conditions Apple’s Camera app already validates.
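A short sketch, assuming `image` is a CGImage of the frame to scan; the symbology list here is illustrative:

```swift
import Vision

// Read QR codes and common retail symbologies from a CGImage; `image` is assumed to exist.
let barcodeRequest = VNDetectBarcodesRequest { request, error in
    guard let barcodes = request.results as? [VNBarcodeObservation] else { return }
    for barcode in barcodes {
        print(barcode.symbology.rawValue, barcode.payloadStringValue ?? "<binary payload>")
    }
}
// Restricting symbologies when the use case is known speeds up detection.
barcodeRequest.symbologies = [.qr, .ean13, .code128]
let handler = VNImageRequestHandler(cgImage: image, options: [:])
try handler.perform([barcodeRequest])
```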
Document Segmentation
VNDetectDocumentSegmentationRequest finds rectangular documents in a frame and returns their corner points, accounting for perspective. The request is what document-scanner apps use to crop and rectify the document into a flat image. Apple’s own VisionKit framework wraps the request plus a UI, but the underlying operation is callable directly when an app needs custom UI.
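A sketch of cropping and rectifying with Core Image, assuming `ciImage` is the full camera frame; the perspective-correction filter is one common way to flatten the detected quadrilateral, not the only one:

```swift
import CoreImage
import Vision

// Find the document's corners and rectify it with Core Image.
// `ciImage` is assumed to be the full camera frame as a CIImage.
let docRequest = VNDetectDocumentSegmentationRequest()
let handler = VNImageRequestHandler(ciImage: ciImage, options: [:])
try handler.perform([docRequest])

if let document = docRequest.results?.first {
    let width = Int(ciImage.extent.width), height = Int(ciImage.extent.height)
    // Corner points are normalized; convert them to pixel coordinates for Core Image.
    func pixelPoint(_ p: CGPoint) -> CIVector {
        CIVector(cgPoint: VNImagePointForNormalizedPoint(p, width, height))
    }
    let rectified = ciImage.applyingFilter("CIPerspectiveCorrection", parameters: [
        "inputTopLeft": pixelPoint(document.topLeft),
        "inputTopRight": pixelPoint(document.topRight),
        "inputBottomLeft": pixelPoint(document.bottomLeft),
        "inputBottomRight": pixelPoint(document.bottomRight),
    ])
    print("Rectified document:", rectified.extent)
}
```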
Saliency and Aesthetics
VNGenerateAttentionBasedSaliencyImageRequest returns a heatmap of where a viewer’s attention is most likely to focus in an image. VNGenerateObjectnessBasedSaliencyImageRequest returns a heatmap of where objects are. VNCalculateImageAestheticsScoresRequest, added as a public API in iOS 18 [1], returns aesthetic-quality scores, including a utility classification (memos, screenshots) and an aesthetic value. The scores are what Photos uses to surface “Memory” candidates and what feeds auto-cropping decisions.
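A short sketch of attention saliency, assuming `image` is a CGImage:

```swift
import CoreVideo
import Vision

// Attention-based saliency: where a viewer is likely to look first.
// `image` is assumed to be a CGImage.
let saliencyRequest = VNGenerateAttentionBasedSaliencyImageRequest()
let handler = VNImageRequestHandler(cgImage: image, options: [:])
try handler.perform([saliencyRequest])

if let observation = saliencyRequest.results?.first as? VNSaliencyImageObservation {
    // The heatmap itself is a small CVPixelBuffer; salientObjects are bounding boxes
    // of the hottest regions, useful for auto-cropping.
    print("Heatmap size:", CVPixelBufferGetWidth(observation.pixelBuffer),
          "x", CVPixelBufferGetHeight(observation.pixelBuffer))
    for region in observation.salientObjects ?? [] {
        print("Salient region:", region.boundingBox)
    }
}
```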
Image Classification and Embeddings
VNClassifyImageRequest returns top-N category labels for an image using a built-in classifier (over 1,000 categories from a model trained on web-scale data). VNGenerateImageFeaturePrintRequest returns a feature vector (the model’s embedding) suitable for image-similarity search.
Embeddings are how a Photos app, a recipe app’s “find similar dishes,” or a moodboard app’s deduplication-by-similarity actually works. The cloud equivalent is OpenAI CLIP embeddings or Google’s Vertex AI; Vision returns them locally for free.
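A sketch of similarity with feature prints, assuming `imageA` and `imageB` are CGImages; `featurePrint(for:)` is a helper defined here, not a framework API:

```swift
import Vision

// Compute feature prints for two images and compare them; smaller distance
// means more visually similar.
func featurePrint(for image: CGImage) throws -> VNFeaturePrintObservation? {
    let request = VNGenerateImageFeaturePrintRequest()
    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    return request.results?.first as? VNFeaturePrintObservation
}

if let printA = try featurePrint(for: imageA),
   let printB = try featurePrint(for: imageB) {
    var distance: Float = 0
    try printA.computeDistance(&distance, to: printB)
    print("Similarity distance:", distance)   // rank candidates by ascending distance
}
```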
Object Tracking and Trajectories
VNDetectTrajectoriesRequest tracks moving objects across frames and returns parabolic-trajectory fits (a thrown ball, a fired arrow). VNTrackObjectRequest follows an object across a video sequence, given an initial bounding box.
Trajectories are the underlying primitive for sports apps (tracking a baseball, a basketball, a tennis ball). The detection works on a live AVFoundation stream and returns results in real time.
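A hedged sketch of the stateful setup, assuming `pixelBuffer` arrives per frame from an AVCaptureVideoDataOutput delegate; the frame spacing, trajectory length, and confidence threshold are illustrative values:

```swift
import CoreMedia
import Vision

// Trajectory detection is stateful: create one request and feed it consecutive frames.
let trajectoryRequest = VNDetectTrajectoriesRequest(
    frameAnalysisSpacing: .zero,   // analyze every frame
    trajectoryLength: 10           // minimum points before a trajectory is reported
) { request, error in
    guard let trajectories = request.results as? [VNTrajectoryObservation] else { return }
    for trajectory in trajectories where trajectory.confidence > 0.9 {
        // Parabolic fit (y = ax^2 + bx + c) plus the detected and projected points.
        print(trajectory.equationCoefficients, trajectory.projectedPoints.count)
    }
}

// Inside the AVCaptureVideoDataOutput delegate, per frame:
let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .up, options: [:])
try handler.perform([trajectoryRequest])
```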
Custom Models via VNCoreMLRequest
VNCoreMLRequest runs any Core ML model through the Vision pipeline. The request handles preprocessing (image resize, color-space conversion, normalization) automatically based on the model’s input description. An app trains a custom classifier in Create ML (a handful of categories, a hundred sample images per category, ten minutes of training) or downloads a published model, drops the .mlpackage into the app bundle, and runs it through Vision with a few lines of code.
```swift
import CoreML
import Vision

// MyClassifier is the class Xcode generates from the .mlpackage in the bundle.
let model = try VNCoreMLModel(for: MyClassifier(configuration: .init()).model)
let request = VNCoreMLRequest(model: model) { request, error in
    guard let results = request.results as? [VNClassificationObservation],
          let top = results.first else { return }
    print(top.identifier, top.confidence)
}
let handler = VNImageRequestHandler(cgImage: image, options: [:])
try handler.perform([request])
```
The cloud equivalent for a custom classifier is hosting the model on a server, paying for inference compute, managing the API, and accepting the network latency. Vision turns it into a .mlpackage in the app bundle and a request handler.
Where Cloud APIs Actually Win
Vision’s territory is pixel-level operations: find this thing, classify this image, recognize this text. The framework does not provide complex semantic reasoning over an image’s meaning. Three cases where cloud APIs are the right call:
Multimodal LLM understanding. “What is this person doing in this image?” “Is this chart misleading?” “Translate this menu and tell me which items are vegetarian.” None of those are pixel-level questions. They require a large multimodal model to combine visual perception with world knowledge and language. Apple’s Foundation Models framework (the on-device LLM, covered in the Foundation Models post) is starting to handle some of this on-device, but for complex reasoning, GPT-4o, Claude Sonnet, or Gemini still win.
One-shot custom tasks without training data. Vision’s classification model is fixed; custom Core ML models require training data. A multimodal LLM can answer “is this a photo of a cat with a bow tie?” without seeing a single labeled training example. For prototyping or one-off tasks where collecting training data is too expensive, cloud LLMs are the right tool.
Document intelligence beyond OCR. Vision’s OCR returns text. A document-intelligence API (AWS Textract, Google Document AI, Azure Form Recognizer) returns structured fields: invoice number, date, line items, totals. The structuring is the value-add, not the OCR. For high-value document workflows, the cloud APIs are usually right; for “read this receipt and dump the text,” Vision is.
The pattern: cloud wins on reasoning and on highly-specialized vertical APIs; Vision wins on perception primitives.
Honest Latency and Cost Comparison
A representative inference pipeline running on iPhone 16 Pro (A18 Pro chip):
| Operation | Vision (on-device) | OpenAI Vision API | AWS Rekognition |
|---|---|---|---|
| OCR (1 page receipt) | 150-300 ms | 1-3 s round-trip + cost per image | 200-500 ms + cost per image |
| Face detection (1 frame) | 5-15 ms | 1-2 s + cost | 100-300 ms + cost |
| Body pose (live 60fps) | <16 ms | not real-time | not real-time |
| Image embedding | 20-40 ms | 200-500 ms + cost | not directly offered |
| Custom classifier | depends on model size | requires hosted model | requires hosted model |
The numbers above are derived from public Apple benchmarks and developer-reported measurements; the message is the order of magnitude, not the exact figure. Vision’s wins are in cost (zero per call), in tail latency (no network jitter), and in privacy (data never leaves the device).
The cost compounds when an app calls vision operations frequently. A photo-editing app processing 100 images per session costs on the order of dollars per session through cloud APIs and zero through Vision.
The Agent-Workflow Connection
Vision pairs cleanly with two cluster ideas already shipped:
App Intents tools for Apple Intelligence. When the app exposes a “Find faces in my photos” or “Read text from screenshot” capability through an AppIntent, the intent’s perform method runs Vision locally and returns a structured result. Apple Intelligence’s orchestrator can call the intent without sending the user’s photo to a server. The post on App Intents walks through the surface contract.
Foundation Models on-device LLM. A pipeline that needs both perception and reasoning runs Vision first (extract text, find faces, locate objects) and Foundation Models second (reason about what was found, generate a summary). Both stages run on-device. Total network calls: zero. The post on Foundation Models explains how to call the LLM; this post argues that Vision is what feeds it without a cloud roundtrip.
```swift
import Vision

// Stage 1: perception. OCR the receipt image on-device.
let textRequest = VNRecognizeTextRequest()
textRequest.recognitionLevel = .accurate
let handler = VNImageRequestHandler(cgImage: receiptImage, options: [:])
try handler.perform([textRequest])

let extractedText = (textRequest.results ?? [])
    .compactMap { $0.topCandidates(1).first?.string }
    .joined(separator: "\n")

// Stage 2: reasoning. Hand the extracted text to the on-device LLM.
// `foundationModel.generate` stands in for the Foundation Models call
// (a LanguageModelSession in the FoundationModels framework).
let llmResponse = await foundationModel.generate(
    "Summarize this receipt as JSON with merchant, total, and date fields:\n\(extractedText)"
)
```
The whole pipeline runs on the device. No API key. No network call. No third-party data exposure.
What Has Matured Over The Last Two Releases
Three additions worth naming, with conservative dating against Apple’s release notes [2]:
Aesthetics scoring as a public API (iOS 18). VNCalculateImageAestheticsScoresRequest returns scores including utility classification and aesthetic value, replacing what photo-curation apps previously had to approximate with custom Core ML models.
Improved multilingual OCR. VNRecognizeTextRequest has expanded its non-Latin script support across recent releases, narrowing the gap with cloud OCR services that historically had stronger multilingual coverage. Apple’s text recognition documentation lists current language support [3].
Document segmentation with VisionKit integration. VNDetectDocumentSegmentationRequest finds rectangular documents and returns corner points; VisionKit’s DataScannerViewController wraps the request with a designed UI for live document scanning.
The framework’s headline capabilities (face, text, pose, barcode, embeddings) have been mature for several iOS releases. The pattern: extend rather than reinvent.
Why Most Developers Skip Vision
Three reasons the framework gets skipped despite the case being clear:
Cloud-first habit. Most modern AI development happens against cloud APIs first. Developers know how to call OpenAI; the surface area of VNRecognizeTextRequest plus VNImageRequestHandler plus VNRecognizedTextObservation feels like more API to learn for what is, in practice, fewer lines of code.
Misjudgment of capability. Developers who haven’t checked the framework recently assume it covers OCR and barcodes only. The category list above runs to more than a dozen capabilities, several of which have no cloud-native equivalent and several of which match commercial APIs without the cost.
Prototype vs production divergence. Cloud APIs win in early prototyping (one curl command to get a result), and the prototype gets turned into the production pipeline without a re-evaluation. The right move is to prototype with whatever is fastest and re-evaluate the perception layer once the workflow is real.
The fix is not to refuse cloud APIs; the fix is to know what the platform contains so the choice is real.
What This Pattern Means For iOS 26+ Apps
Three takeaways.
- Default to Vision for perception primitives. Find faces, read text, detect barcodes, run pose estimation, get image embeddings. The framework runs in milliseconds on the Neural Engine, costs zero, and leaves no third-party data trail. For pixel-level CV operations, the framework is the right starting point.
- Use cloud APIs for reasoning, not perception. A multimodal LLM understanding image meaning, a vertical document-intelligence API extracting structured fields, a one-shot custom task without training data. Those are cloud’s territory; ceding them to cloud is correct.
- Pair Vision with Foundation Models for full on-device pipelines. Perception (Vision) feeds reasoning (on-device LLM). The pipeline runs locally end-to-end, with no API keys, no network jitter, and no telemetry leaving the device. The cluster’s Foundation Models post covers the LLM half; Vision is the input half.
The full Apple Ecosystem cluster: typed App Intents; MCP servers; the routing question; Foundation Models; the runtime vs tooling LLM distinction; three surfaces; the single source of truth pattern; Two MCP Servers; hooks for Apple development; Live Activities; the watchOS runtime; SwiftUI internals; RealityKit’s spatial mental model; SwiftData schema discipline; Liquid Glass patterns; multi-platform shipping; the platform matrix; what I refuse to write about. The hub is at the Apple Ecosystem Series. For the broader iOS-with-AI-agents context, see the iOS Agent Development guide.
FAQ
What’s the difference between Apple Vision and visionOS?
The Vision framework is the on-device computer vision API for iOS, macOS, and visionOS. visionOS is the operating system for Apple Vision Pro. The naming overlap is unfortunate. Vision (the framework) runs on every modern Apple device; visionOS (the OS) runs on Vision Pro hardware specifically.
When should I use Vision instead of OpenAI Vision API or Google Cloud Vision?
For pixel-level perception tasks (find faces, read text, detect objects, count items, estimate pose, generate image embeddings), Vision is almost always the right call. It runs in milliseconds, costs nothing per inference, and keeps user data on the device. Cloud APIs are right when the task requires complex semantic reasoning about an image’s meaning, or when a vertical document-intelligence API provides structured fields beyond text extraction.
Can I run my own Core ML model through Vision?
Yes. VNCoreMLRequest wraps any Core ML model and handles preprocessing automatically. Drop the .mlpackage file into the app bundle, instantiate the model, wrap it in a VNCoreMLModel, and run it through a request handler. The same handler can run multiple requests in parallel, including the built-in Vision requests and the custom Core ML model.
How does Vision dispatch work on Apple Silicon?
Vision (and the Core ML models it runs) dispatches automatically to the Neural Engine when available, falls back to the GPU when not, and to the CPU as a last resort. The framework chooses the fastest path for the device and the operation. For most modern iPhones (A12 Bionic and later), the Neural Engine handles the bulk of inference; the developer does not configure dispatch manually.
What’s been added recently?
The conservative summary, dated against Apple’s release notes: VNCalculateImageAestheticsScoresRequest was added as a public API in iOS 18; VNRecognizeTextRequest has expanded multilingual support across recent releases; VisionKit’s DataScannerViewController wraps document scanning in a designed UI. The headline capabilities (text, face, pose, barcodes, embeddings) have been mature for several iOS releases.
References
1. Apple Developer Documentation: VNCalculateImageAestheticsScoresRequest, introduced in iOS 18.0+.
2. Apple Developer Documentation: Vision framework, reference for available requests and platform availability.
3. Apple Developer Documentation: Recognizing Text in Images, supported recognition languages by API call.
4. Apple Developer Documentation: VNHumanBodyPoseObservation.JointName, enumerated joint names returned by 2D body-pose requests.