Meet Music Understanding: On-Device Audio Analysis

At WWDC 2026, Apple’s Final Cut Pro team shipped two features built on a single framework: a beat detection feature that reveals a song’s beat grid so editors can align cuts to bars and beats, and a montage feature on iPad that synchronizes clips to the music automatically.1 Both run on Music Understanding, a new framework that hands you a song’s musical intelligence (key, rhythm, structure, pace, instrument activity, and loudness) without asking you to know anything about signal processing or machine learning. It runs entirely on-device, so the audio you analyze stays private and the analysis works offline.1 This post walks through the framework as a build-along: the six analysis areas, how a MusicUnderstandingSession produces them, and the streaming-loudness AsyncSequence that makes audio-reactive visuals practical.

TL;DR

  • Music Understanding analyzes six areas of a song (key, rhythm, structure, pace, instrument activity, and loudness) on-device, with no signal-processing or machine-learning expertise required.1
  • You create a MusicUnderstandingSession from an AVAsset or a custom audio provider, then call analyze() for everything or analyze(for:) to target specific types and skip unnecessary computation.1
  • Results arrive in a SessionResult struct where each feature is an optional field; the general analyze() populates all of them, and the targeted analyze(for:) leaves the rest nil.1
  • Two time-aware types thread through the API: a TimedValue pairs a value with a CMTime, and a RangedValue pairs a value with a CMTimeRange.1
  • MusicUnderstandingSession also exposes a streaming loudness API that delivers values over an AsyncSequence for every 100ms of audio analyzed, which is the foundation for driving live audio-reactive animation.1

Why on-device musical intelligence matters

In the session video, Conner from Apple’s Computational Music Team enumerates the framework’s six analysis areas, starting at 1:39.

The pitch is narrow and honest: the framework “handles all the signal processing and model inference for you so you don’t need any expertise in signal processing or machine learning to use it.”1 That removes the part of audio analysis most app developers never wanted to own. Detecting a tempo, segmenting a song into a chorus and a verse, or measuring perceived loudness each used to mean either licensing a third-party engine or building a DSP pipeline by hand.

Running on-device changes the privacy math too. Because the framework “runs entirely on-device, the audio you analyze stays private and works offline.”1 A song never leaves the phone to get analyzed, and the analysis works on a plane with no signal. For a DJ app sorting a library by tempo, or a video editor matching cuts to beats, that combination of no network dependency and no audio leaving the device is the practical unlock.

Apple frames the six areas as the building blocks of a song. Rhythm is the pulse, driven by individual beats that build into bars; the number of beats in one minute is beats per minute, or bpm.1 Bars form phrases (musical sentences), phrases combine into segments, and segments build sections like a chorus, verse, intro, or bridge.1 Instruments such as a drum, bass, or vocals play at different times and intensities around a common set of notes called the key.1 A song may hold a steady bpm while different parts feel slower or faster, which Apple calls pace, and the song grows louder at some points than others.1 Those six concepts map one-to-one onto the framework’s result types.

The session: one object, two ways to ask

Apps interact with a MusicUnderstandingSession, initializing it “with either an AVAsset or a custom audio provider.”1 To run analysis, you call analyze and await the results. The default behavior is to analyze for all types, and Apple is explicit about the performance lever: “For the highest performance, you can specify which analysis types you are interested in to avoid unnecessary computations.”1 Computing only what you render is the difference between a responsive tool and one that grinds on every load.

The sample app, Music Understanding Lab, shows the file-based workflow end to end. A SwiftUI fileImporter selects a song and returns its URL, and that URL becomes an AVURLAsset. Apple flags one setting as load-bearing: set PreferPreciseDurationAndTimingKey (the AVURLAssetPreferPreciseDurationAndTimingKey entry in the asset’s options dictionary) to true “to ensure the most accurate results.”1 You then create the session from the asset, call analyze, and await the session results.
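The asset setup can be sketched as follows. AVURLAsset and AVURLAssetPreferPreciseDurationAndTimingKey are existing AVFoundation API; the helper name makePreciseAsset is illustrative. The session step is left as a comment because the passages quoted here do not spell out the initializer's exact signature.

```swift
import Foundation
#if canImport(AVFoundation)
import AVFoundation

/// Builds an asset that is asked to report precise duration and timing.
/// `makePreciseAsset` is an illustrative helper name, not framework API.
func makePreciseAsset(from url: URL) -> AVURLAsset {
    AVURLAsset(
        url: url,
        options: [AVURLAssetPreferPreciseDurationAndTimingKey: true]
    )
}

// Next step (exact signature assumed, not confirmed by the source):
// create a MusicUnderstandingSession from this asset, then await analyze().
#endif
```

The conditional import only keeps the sketch compilable on platforms without AVFoundation; in an app target you would import AVFoundation directly.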

Those results land in a SessionResult struct, where “every feature Music Understanding analyzes gets its own results field. These are all optionals.”1 The two entry points differ in what they populate. The general analyze() API makes all results available. The targeted analyze(for:) API returns only the results you asked for, and “the rest will be nil.”1 Optionality, then, is not an accident of API design; it is how the framework tells you which work it actually did.
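To make the optional-field contract concrete, here is a minimal sketch using a stand-in struct. SessionResultSketch, its field names, and the String payloads are all assumptions made for illustration; the real SessionResult carries the typed results described in the sections that follow.

```swift
import Foundation

// Stand-in for the framework's SessionResult. Field names and the String
// payloads are assumptions for this sketch, not the real declarations.
struct SessionResultSketch {
    var key: String? = nil
    var rhythm: String? = nil
    var structure: String? = nil
    var pace: String? = nil
    var instrumentActivity: String? = nil
    var loudness: String? = nil
}

/// Lists which analyses actually ran, judged by which fields are non-nil.
func populatedAnalyses(in result: SessionResultSketch) -> [String] {
    var names: [String] = []
    if result.key != nil { names.append("key") }
    if result.rhythm != nil { names.append("rhythm") }
    if result.structure != nil { names.append("structure") }
    if result.pace != nil { names.append("pace") }
    if result.instrumentActivity != nil { names.append("instrumentActivity") }
    if result.loudness != nil { names.append("loudness") }
    return names
}
```

A targeted request for rhythm and loudness would, under this model, report exactly those two names and nothing else, so UI code can branch on presence instead of assuming every field exists.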

Two types recur throughout the framework to attach time to a value. A TimedValue associates a value with a CMTime (a single instant), and a RangedValue associates a CMTimeRange with a value (a span).1 Almost every result below is expressed in one of those two shapes, so learning them once pays off across all six areas.
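The two shapes can be sketched with simplified stand-ins. The types below use plain seconds instead of CMTime and CMTimeRange so the example stays portable, and every name ending in Sketch is hypothetical. The lookup helper shows the most common operation on ranged values: finding which value is in effect at a given playhead time.

```swift
import Foundation

// Simplified stand-ins: the real types pair values with CMTime and
// CMTimeRange; plain seconds are used here to keep the sketch portable.
struct TimedValueSketch<Value> {
    let time: Double          // a single instant, in seconds
    let value: Value
}

struct RangedValueSketch<Value> {
    let start: Double         // range start, in seconds
    let duration: Double      // range length, in seconds
    let value: Value

    var end: Double { start + duration }

    /// Half-open containment: start is included, end is not.
    func contains(_ time: Double) -> Bool {
        time >= start && time < end
    }
}

/// Returns the value whose range covers `time`, or nil if none does.
func value<Value>(at time: Double, in ranges: [RangedValueSketch<Value>]) -> Value? {
    ranges.first(where: { $0.contains(time) })?.value
}
```

With half-open ranges, a time that falls exactly on a boundary belongs to the range that starts there, so adjacent ranges never both claim the same instant.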

Walking the six results

Key. For key analysis the framework returns a KeyResult struct, which “contains an array of ranges, mapping a KeySignature to a specific time range using a RangedValue.”1 A KeySignature holds a tonic and a mode. The tonic “can be any of the standard chromatic pitches” and represents the root note (like C or G) the song is built around; the mode “is either major or minor.”1 Because the result is an array of ranges rather than one value, the API accommodates songs that change key partway through.

Rhythm. Analyzing for rhythm yields a RhythmResult. The struct gives you “the timestamps for every beat and bar as arrays of CMTime,” plus the overall global tempo through beatsPerMinute.1 One detail matters for live UIs: beatsPerMinute is optional, “because if the framework hasn’t processed enough audio to find at least two beats, the bpm will be set to nil.”1 You need two beats to measure an interval, so the nil is the framework refusing to guess.
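The two-beat requirement is easy to see in a small sketch. The function below is an illustration of why a single beat cannot yield a tempo, not the framework's own estimator; it assumes beat times are given in seconds and sorted ascending, and both function names are hypothetical.

```swift
import Foundation

/// Estimates a global tempo from beat timestamps given in seconds.
/// Returns nil below two beats, since one beat defines no interval.
/// Assumes `beatTimes` is sorted ascending.
func estimatedBPM(fromBeatTimes beatTimes: [Double]) -> Double? {
    guard beatTimes.count >= 2,
          let first = beatTimes.first,
          let last = beatTimes.last,
          last > first else { return nil }
    let intervalCount = Double(beatTimes.count - 1)
    let averageInterval = (last - first) / intervalCount
    return 60.0 / averageInterval
}

/// A pending label keeps the UI honest while the tempo is still unknown.
func tempoLabel(for bpm: Double?) -> String {
    guard let bpm = bpm else { return "Detecting tempo" }
    return "\(Int(bpm.rounded())) BPM"
}
```

Beats half a second apart average out to 120 bpm, while a single timestamp returns nil and the label falls back to its pending state.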

Structure. Requesting structure analysis returns a StructureResult with three properties, “for sections, segments and phrases,” and for each you get an array of CMTimeRanges.1 The three levels nest: a section is made of one or more segments, and each segment is made of phrases.1 That hierarchy is what lets an editor snap a cut to a chorus boundary instead of an arbitrary timestamp.
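Snapping a cut to a section boundary could look like the sketch below. SectionRangeSketch is a seconds-based stand-in for CMTimeRange and the function name is hypothetical; the same approach would apply to segment or phrase ranges.

```swift
import Foundation

// Stand-in for a CMTimeRange, expressed in seconds for portability.
struct SectionRangeSketch {
    let start: Double
    let duration: Double
    var end: Double { start + duration }
}

/// Returns the section boundary (a start or an end) closest to `time`,
/// or nil when there are no sections to snap to.
func nearestSectionBoundary(to time: Double,
                            in sections: [SectionRangeSketch]) -> Double? {
    let boundaries = sections.flatMap { [$0.start, $0.end] }
    return boundaries.min(by: { abs($0 - time) < abs($1 - time) })
}
```

An editor would typically apply this only when the playhead is within some snapping tolerance of the returned boundary; that threshold is a UI decision left out here.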

Pace. Pace “tells you how fast the music feels to the listener,” with more energetic parts carrying a higher value than slower ones.1 Requesting it returns a PaceResult, a struct with “a single property containing an array of ranged values.”1 Pace is distinct from bpm: the tempo can hold steady while the felt energy rises and falls.

Instrument activity. Requesting instrument activity returns an InstrumentActivityResult with two properties, one for ranges and one for activity.1 The Ranges API “provides a dictionary, mapping each Instrument to” a per-instrument value (the transcript cuts off before naming that value’s type), and Apple frames ranges as the right choice when “you just want to know if an instrument is present or not.”1 The activity property carries more detail: it “maps an instrument to a TimedValue of Floats,” and those values “express how intensely an instrument is playing over time.”1 Apple calls the activity result “a great source to drive audio-reactive animations,” because a per-instant intensity per instrument is exactly what a visualizer wants to bind to.1
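As a rough sketch of binding a visualizer to activity data, the helper below samples each instrument's most recent intensity at a playhead time. The instrument cases, the seconds-based ActivitySample, and the dictionary-of-arrays layout are assumptions for illustration; the framework's own declarations are not reproduced here.

```swift
import Foundation

// Stand-ins: the instrument cases and the seconds-based sample type are
// assumptions for this sketch, not the framework's declarations.
enum InstrumentSketch: Hashable { case drums, bass, vocals }

struct ActivitySample {
    let time: Double      // seconds
    let intensity: Float  // how intensely the instrument is playing
}

/// Latest intensity at or before `time` for each instrument (0 if none yet).
/// Assumes each instrument's samples are sorted by time.
func intensities(at time: Double,
                 in activity: [InstrumentSketch: [ActivitySample]]) -> [InstrumentSketch: Float] {
    activity.mapValues { samples in
        samples.last(where: { $0.time <= time })?.intensity ?? 0
    }
}
```

Calling this once per animation frame gives one intensity per instrument, which can drive a bar, a glow, or a scale factor directly.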

Loudness. The framework measures loudness in Loudness Units relative to Full Scale (LUFS), “the industry standard for modeling how the human ear perceives volume.”1 Requesting loudness analysis produces a LoudnessResult struct that supports integrated, momentary, and shortTerm loudness.1 Integrated is a single value for the overall loudness of the audio. Momentary and shortTerm both provide time-stamped values every 100 milliseconds, but over different windows: momentary uses a 400-millisecond window and catches “short, sudden spikes in loudness,” while shortTerm uses a 3-second window for “a smoother view of the loudness trend over time.”1 The result also carries a peak value, the highest absolute audio level measured, expressed in decibels.1
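To see why the two windows behave differently, consider a sketch that averages energy over the most recent 100 ms blocks and converts it to a loudness value with the ITU-R BS.1770 style formula. This is an assumption-laden illustration, not the framework's implementation: the input energies are assumed to be K-weighted already, channel weighting is ignored, and all names are hypothetical.

```swift
import Foundation

/// Loudness over the most recent `windowBlocks` 100 ms blocks.
/// `blockEnergies` holds mean-square energy per 100 ms block and is assumed
/// to be K-weighted already; that filtering step is omitted here.
/// Returns nil until enough audio has arrived to fill the window.
func windowedLoudness(blockEnergies: [Double], windowBlocks: Int) -> Double? {
    guard windowBlocks > 0, blockEnergies.count >= windowBlocks else { return nil }
    let window = blockEnergies.suffix(windowBlocks)
    let meanEnergy = window.reduce(0, +) / Double(windowBlocks)
    guard meanEnergy > 0 else { return -Double.infinity }  // digital silence
    return -0.691 + 10 * log10(meanEnergy)
}

let momentaryBlocks = 4    // 400 ms window / 100 ms per block
let shortTermBlocks = 30   // 3 s window / 100 ms per block
```

A single loud block makes up a quarter of the momentary window but only a thirtieth of the short-term one, which is why the momentary reading jumps on a spike while the short-term reading barely moves.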

The streaming loudness AsyncSequence

The batch APIs above analyze a finished file. For live work, MusicUnderstandingSession “also provides a streaming API for loudness,” where “values are delivered via an AsyncSequence for every 100ms of audio analyzed by the framework.”1 A new loudness reading every 100ms is the cadence a real-time visualizer runs on, which is why this API, not the batch one, is the centerpiece for audio-reactive UI.

The usage pattern is two concurrent tasks. You initialize the session as before, then “set up two tasks: one to consume the loudness results as they are delivered, and another to begin the analysis.”1 One task awaits values off the sequence and pushes them into your animation; the other drives the analysis forward. The producer and the consumer run side by side instead of blocking each other.
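The two-task shape can be sketched with a plain AsyncStream standing in for the session's loudness sequence. Nothing below is framework API: the stream, the canned readings, and the barHeight mapping (including its -60 to 0 LUFS range) are assumptions chosen to show the pattern. AsyncStream.makeStream requires Swift 5.9 or later.

```swift
import Foundation

/// Maps a loudness reading in LUFS to a 0...1 bar height.
/// The -60...0 range is an arbitrary choice for this sketch.
func barHeight(forLUFS lufs: Double) -> Double {
    let quietest = -60.0
    return min(max((lufs - quietest) / -quietest, 0), 1)
}

/// Two-task pattern with an AsyncStream standing in for the session's
/// loudness sequence; the real streaming API's names are not shown here.
func runVisualizerSketch(readings: [Double]) async -> [Double] {
    let (stream, continuation) = AsyncStream.makeStream(of: Double.self)

    // Task 1: consume readings as they arrive and turn them into bar heights.
    let consumer = Task { () -> [Double] in
        var heights: [Double] = []
        for await lufs in stream {
            heights.append(barHeight(forLUFS: lufs))
        }
        return heights
    }

    // Task 2: drive the "analysis"; here it just replays canned readings.
    let producer = Task {
        for lufs in readings {
            continuation.yield(lufs)
        }
        continuation.finish()   // ends the consumer's for-await loop
    }

    await producer.value
    return await consumer.value
}
```

In an app, the consumer would publish each height to the UI as it arrives instead of collecting an array, but the division of labor is the same: one task iterates, the other produces.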

Feeding live audio in means supplying an AudioProvider. An AudioProvider “conforms to AsyncSequence and yields AVReadOnlyAudioPCMBuffer objects.”1 Apple calls out the termination contract explicitly: when the provider “has sent all audio buffers, it must send a final nil to signal completion.”1 Forget the trailing nil and the consuming task waits forever for audio that never ends. The provider being an AsyncSequence itself is the elegant part: your audio source and the framework’s loudness output speak the same async-iteration language end to end.
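A provider-shaped sequence might look like the sketch below. It yields plain [Float] chunks as a stand-in for audio buffers and does not reproduce the framework's actual AudioProvider requirements. It also assumes that the "final nil" corresponds to the iterator's end-of-sequence nil, which is how any AsyncSequence signals completion; the source does not confirm that reading.

```swift
import Foundation

/// Splits samples into fixed-size chunks; the last chunk may be shorter.
func chunked(_ samples: [Float], size: Int) -> [[Float]] {
    precondition(size > 0, "chunk size must be positive")
    return stride(from: 0, to: samples.count, by: size).map {
        Array(samples[$0..<min($0 + size, samples.count)])
    }
}

/// A provider-shaped AsyncSequence. It yields [Float] chunks as a stand-in
/// for audio buffers; the framework's protocol is not reproduced here.
struct ChunkProviderSketch: AsyncSequence {
    typealias Element = [Float]
    let chunks: [[Float]]

    struct AsyncIterator: AsyncIteratorProtocol {
        var remaining: ArraySlice<[Float]>

        mutating func next() async -> [Float]? {
            // Returning nil after the last chunk signals completion.
            remaining.popFirst()
        }
    }

    func makeAsyncIterator() -> AsyncIterator {
        AsyncIterator(remaining: chunks[...])
    }
}
```

Because popFirst returns nil once the slice is empty, a `for await` loop over this provider ends on its own after the last chunk, with no separate sentinel value to forget.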

Two more session capabilities round out the picture. Every Music Understanding result is codable, so exporting a full analysis takes one step: “just create a JSONEncoder and encode the session results.”1 And the sample app’s Video tile shows how the results compose: it “uses structure and pace to create a video synced to the music,” identifying section time ranges, then using each section’s pace (an events-per-minute rate, divided by 60 to get events per second) to decide how many clips fit in that range, with shorter, faster clips in energetic parts and longer, slower clips in calm ones.1
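The clip-count arithmetic for the Video tile can be sketched in a few lines. The function names, the rounding, and the one-clip minimum are assumptions for illustration; the source only states that pace is divided by 60 and applied to the section's time range. Inputs are assumed to be finite and non-negative.

```swift
import Foundation

/// Number of clips for one section: events per second times the section's
/// length in seconds. Rounding and the one-clip minimum are choices made
/// for this sketch. Assumes finite, non-negative inputs.
func clipCount(paceEventsPerMinute: Double, sectionDuration: Double) -> Int {
    let eventsPerSecond = paceEventsPerMinute / 60.0
    return max(1, Int((eventsPerSecond * sectionDuration).rounded()))
}

/// Equal-length clip duration for the section, in seconds.
func clipDuration(paceEventsPerMinute: Double, sectionDuration: Double) -> Double {
    sectionDuration / Double(clipCount(paceEventsPerMinute: paceEventsPerMinute,
                                       sectionDuration: sectionDuration))
}
```

For a 10-second section, a pace of 120 gives 20 clips of half a second each, while a pace of 30 gives 5 clips of two seconds each: faster pace, shorter clips.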

Key Takeaways

For audio and media app developers:

  • Start from analyze(for:), not analyze(): name only the analysis types you render so the framework skips the rest, since unrequested results come back nil anyway.1
  • Treat beatsPerMinute as genuinely optional in your UI; a nil means the framework has not yet seen two beats, so show a pending state rather than a fake tempo.1
  • Set PreferPreciseDurationAndTimingKey to true on the AVURLAsset before creating the session, because Apple ties accurate results to that flag.1

For real-time and visualizer work:

  • Build live audio-reactive animation on the loudness AsyncSequence (a value every 100ms) and on the instrument activity property, which maps each instrument to a TimedValue of intensity over time.1
  • Run a consumer task and an analysis task concurrently, and have your custom AudioProvider send a final nil after the last AVReadOnlyAudioPCMBuffer so the stream terminates cleanly.1

For catalog and tooling teams:

  • Sort or cluster a music library by key or tempo using KeyResult and RhythmResult, and persist analyses by encoding the codable SessionResult to JSON for reuse.1
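Persisting an analysis could look like the sketch below, which round-trips a stand-in Codable struct through JSON. AnalysisSnapshotSketch and its two fields are hypothetical; with the real framework you would encode the session results themselves, as the talk describes.

```swift
import Foundation

// Stand-in Codable payload; the real SessionResult's fields are not
// reproduced here.
struct AnalysisSnapshotSketch: Codable, Equatable {
    var beatsPerMinute: Double?
    var keyName: String?
}

/// Encodes a snapshot to JSON with sorted keys for stable, diffable output.
func exportJSON(_ snapshot: AnalysisSnapshotSketch) throws -> Data {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]
    return try encoder.encode(snapshot)
}

/// Decodes a previously exported snapshot.
func importJSON(_ data: Data) throws -> AnalysisSnapshotSketch {
    try JSONDecoder().decode(AnalysisSnapshotSketch.self, from: data)
}
```

Caching the encoded data alongside the track means a library only needs to be analyzed once; later launches can decode and sort without touching the audio.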

FAQ

What does Apple’s Music Understanding framework analyze?

It analyzes six areas of a song: key, rhythm, structure, pace, instrument activity, and loudness. Each maps to a result type (KeyResult, RhythmResult, StructureResult, PaceResult, InstrumentActivityResult, and LoudnessResult) returned inside a SessionResult. The framework handles the signal processing and model inference, so no DSP or machine-learning expertise is required.1

Does Music Understanding run on-device or in the cloud?

On-device. Apple states the framework “runs entirely on-device,” so the audio you analyze stays private and the analysis works offline, with no network dependency.1

How do I get only the analysis I need?

Call analyze(for:) instead of the general analyze(). The general call populates every field of the SessionResult; the targeted call returns only the types you requested and leaves the rest nil. Apple recommends specifying types “for the highest performance” to avoid unnecessary computation.1

What is the difference between TimedValue and RangedValue?

A TimedValue associates a value with a single CMTime instant, while a RangedValue associates a value with a CMTimeRange span. Both types appear throughout the framework: key signatures arrive as ranged values, for example, and per-instrument activity arrives as timed values.1

How do I build a live audio-reactive visualizer with it?

Use the streaming loudness API on MusicUnderstandingSession, which delivers values over an AsyncSequence for every 100ms of audio analyzed. Run two concurrent tasks (one consuming results, one driving analysis), and feed live audio through a custom AudioProvider that conforms to AsyncSequence, yields AVReadOnlyAudioPCMBuffer objects, and sends a final nil to signal completion.1


On-device audio analysis sits alongside the other media intelligence Apple shipped this year: see how on-device AI reaches Spotlight and media in iOS 27 and how the Speech framework compares with SFSpeechRecognizer for the audio-to-text side of the same problem. When you outgrow Apple’s built-in models entirely, running your own models on Apple silicon with Core AI is the next step. The full series hub is the Apple Ecosystem Series.

References


  1. Apple, WWDC 2026 session 253, Meet the Music Understanding framework. Source for the on-device, privacy, and offline framing; the Final Cut Pro beat-detection and iPad montage features; the six analysis areas (key, rhythm, structure, pace, instrument activity, and loudness) and the song-building-block definitions; the MusicUnderstandingSession initialized from an AVAsset or audio provider; analyze() versus analyze(for:) and the SessionResult of optional fields; the AVURLAsset and PreferPreciseDurationAndTimingKey setup via SwiftUI fileImporter; the TimedValue/CMTime and RangedValue/CMTimeRange types; the KeyResult/KeySignature (tonic and mode), RhythmResult/beatsPerMinute (optional below two beats), StructureResult (sections, segments, phrases), PaceResult, InstrumentActivityResult (ranges and activity, activity as a TimedValue of Floats), and LoudnessResult (LUFS, integrated/momentary/shortTerm windows, peak in decibels) types; the streaming loudness AsyncSequence delivering values every 100ms with two concurrent tasks; the AudioProvider conforming to AsyncSequence, yielding AVReadOnlyAudioPCMBuffer objects and sending a final nil; the codable results and JSONEncoder export; and the structure-and-pace Video tile algorithm. 
