Immersive Audio Model and Formats (IAMF) Architecture

The IAMF Architecture consists of several key components: the Audio Object (AO), which represents a specific element in the audio mix; the Channel Group (CG), which groups together related AOs for processing and delivery; and the Temporal Unit (TU), which contains one or more CGs for a given time period.
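
To make the relationship between these components concrete, here is a minimal Python sketch of the hierarchy. The class and field names are illustrative only and do not follow the normative IAMF OBU syntax.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class AudioObject:
    """One element of the mix, e.g. dialogue, music, or an effect."""
    name: str
    samples: np.ndarray        # PCM samples, shape (num_channels, num_frames)


@dataclass
class ChannelGroup:
    """Related AudioObjects that are processed and delivered together."""
    layout: str                                  # e.g. "5.1.2" or "7.1.4"
    objects: List[AudioObject] = field(default_factory=list)


@dataclass
class TemporalUnit:
    """All ChannelGroups covering one time interval of the presentation."""
    start_time: float                            # seconds
    duration: float                              # seconds
    channel_groups: List[ChannelGroup] = field(default_factory=list)
```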

The IAMF Architecture also includes various parameters that can be adjusted to optimize performance and quality, such as demixing information, recon gain values, and down-mix parameters. These parameters are used by the encoder to generate compressed audio data that can be transmitted over a network or stored on a media device for playback.
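
As a rough sketch, this side information might be grouped as shown below; the field names and value types are assumptions for illustration, not the bitstream syntax defined by the specification.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class MixParameters:
    """Side information carried alongside the compressed audio (illustrative)."""
    demixing_mode: int                  # selects which demixing rule the decoder applies
    recon_gains: Dict[str, float]       # per-channel gains used when reconstructing de-mixed channels
    downmix_gains: Dict[str, float]     # gains used when folding a larger layout into a smaller one
```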

For example, suppose we have two Channel Groups: one with a 7.1.4 layout (CL #i) and another with a 5.1.2 layout (CL #i-1). The de-mixed channels derived from CL #i-1 are Lss7, Rs7, Ltf2, Rtf2, Ltb4, and Rtb4. To create the immersive audio experience for CL #i, these de-mixed channels are combined, using their associated demixing parameters, into signals that can be played back on a 7.1.4 system or in virtual reality headsets.
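
The sketch below illustrates the idea of de-mixing under a simplified assumption: the encoder folded a channel into a downmix with a weighted sum, so the decoder can invert that sum to recover it. The linear rule and the placeholder values are assumptions for illustration; the normative demixing equations and parameter coding are defined by the IAMF specification.

```python
import numpy as np


def demix_channel(downmixed: np.ndarray, coded: np.ndarray, weight: float) -> np.ndarray:
    """Recover a channel that was folded into a downmix.

    Assumes the encoder used downmixed = coded + weight * missing, so the
    decoder can solve for the missing channel (an illustrative linear rule).
    """
    return (downmixed - coded) / weight


# Example: reconstruct one of the channels needed for 7.1.4 playback from a
# 5.1.2 base-layer channel, the matching higher-layer channel, and the
# transmitted demixing weight (values here are placeholders).
base_channel = np.zeros(960)     # one 5.1.2 channel from CL #i-1
layer_channel = np.zeros(960)    # the corresponding channel coded in CL #i
recovered = demix_channel(base_channel, layer_channel, weight=0.707)
```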

The recon gain values are used to adjust the volume of each signal based on its importance and relevance to the overall audio mix. For example, if we have a low-level background noise that is not critical to the scene, we can set a lower recon gain value for it to reduce its impact on the final output. On the other hand, if we have a crucial dialogue or sound effect, we can increase the recon gain value to ensure that it comes through clearly and distinctly in the mix.
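
As a sketch, a gain map like the one below can be applied to each reconstructed signal before mixing. The signal names and gain values are made up for illustration, and the bit-exact coding of recon gain in the IAMF bitstream is not reproduced here.

```python
import numpy as np


def apply_recon_gains(signals: dict, recon_gains: dict) -> dict:
    """Scale each reconstructed signal by its recon gain before mixing.

    Signals without an entry in recon_gains are passed through unchanged.
    """
    return {name: sig * recon_gains.get(name, 1.0) for name, sig in signals.items()}


# Example: de-emphasize background ambience, keep dialogue at full level.
signals = {"ambience": np.zeros(960), "dialogue": np.zeros(960)}
gains = {"ambience": 0.4, "dialogue": 1.0}
adjusted = apply_recon_gains(signals, gains)
```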

In terms of realism and lip-sync, VOCA (Voice Conversion Architecture) is an emerging technology for high-quality voice conversion: it transforms speech from a source speaker into the voice of a target speaker while keeping the result natural-sounding and the target speaker's identity intact. This can be particularly useful for creating immersive audio experiences where multiple characters are speaking simultaneously or in different languages.

VOCA works by first extracting the mel-spectrogram features from both the source and target speech signals, then aligning them using a time warping algorithm to ensure that they match in terms of duration and timing. The aligned features are then fed into a neural network for conversion, which generates a new set of mel-spectrograms based on the characteristics of the target speaker’s voice.
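
Here is a sketch of those first two stages using librosa, assuming we already have source and target waveforms at a known sample rate. The conversion network itself is model-specific and is only indicated by a comment.

```python
import numpy as np
import librosa


def extract_mel(wav: np.ndarray, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    """Log mel-spectrogram features, shape (n_mels, n_frames)."""
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)


def align_to_target(src_mel: np.ndarray, tgt_mel: np.ndarray) -> np.ndarray:
    """Warp the source features onto the target's frame timing with DTW."""
    _, path = librosa.sequence.dtw(X=src_mel, Y=tgt_mel, metric="euclidean")
    aligned = np.zeros_like(tgt_mel)
    for src_idx, tgt_idx in path[::-1]:          # path runs end-to-start
        aligned[:, tgt_idx] = src_mel[:, src_idx]
    return aligned


# The aligned source features would then be fed to the conversion network,
# which outputs mel-spectrograms with the target speaker's characteristics.
```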

The converted mel-spectrograms are then passed through a vocoder (voice coder) to reconstruct a high-quality speech signal that preserves the pitch, tone, and inflection of the original source material. This allows for more natural and realistic dialogue between characters, even when they speak simultaneously or in different languages.
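
As a sketch of this final stage, Griffin-Lim (via librosa) can stand in for the vocoder: it is far simpler than the neural vocoders used in practice, but it shows the interface of turning converted mel-spectrograms back into a waveform.

```python
import librosa


def mel_to_waveform(mel_db, sr: int = 16000):
    """Reconstruct a waveform from a log mel-spectrogram with Griffin-Lim."""
    mel_power = librosa.db_to_power(mel_db)                       # undo the dB scaling
    return librosa.feature.inverse.mel_to_audio(mel_power, sr=sr)


# converted_mel = ...  # output of the conversion network
# waveform = mel_to_waveform(converted_mel)
```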

Overall, VOCA provides a powerful tool for content creators to enhance the realism and lip-sync of their immersive audio experiences, while also preserving speaker identity and maintaining high levels of quality and fidelity.
