The Basic Profile requires only a single audio substream (i.e., the BCG), while the Advanced Profile allows up to two surround channels plus an LFE channel, if present. The Extended Profile is the most flexible, supporting all four codecs, but it requires a device with at least 16-bit audio processing and may not be supported on older devices or platforms.
To refine this answer, let’s take a closer look at how the Channel Group format works within this model. The CG Generation module takes input audio and transforms it into multiple channel-based outputs (i.e., Channel Groups) that adhere to the specifications in §3.6.3.2 of the Immersive Audio Model for 5G Mobile Devices. For example, a transformation matrix with four CGs (2ch/3.1.2ch/5.1.2ch/7.1.4ch) might look like this:
\[
p_i = 0.707,
\]
where \(p_i\) is the signal power for a frame of Ltf4 in the i-th Channel Group.
\(a(k)\) represents a set of coefficients that determines how each input channel \(k\) contributes to each output Channel Group. Through these transformations, mobile devices can deliver high-quality immersive audio experiences even with limited processing capability.
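The channel-mixing idea above can be sketched as a matrix multiply: each output channel of a Channel Group is a weighted sum of input channels, with the coefficients a(k) as the matrix entries. The layout, channel names, and coefficient values below are illustrative assumptions, not values taken from the specification.

```python
import numpy as np

p = 0.707  # illustrative mixing gain (~ -3 dB), as in the text

# Hypothetical downmix: fold a 4-channel input (L, R, Ltf4, Rtf4)
# into a 2-channel Channel Group. Each row holds the coefficients
# a(k) for one output channel; each column k is one input channel.
a = np.array([
    [1.0, 0.0, p,   0.0],  # output L = L + p * Ltf4
    [0.0, 1.0, 0.0, p  ],  # output R = R + p * Rtf4
])

frames = 480
x = np.random.randn(4, frames)  # input audio: (channels, samples)
cg = a @ x                      # resulting 2-channel Channel Group

assert cg.shape == (2, frames)
```

A real CG Generation module would emit one such matrix per Channel Group (2ch, 3.1.2ch, 5.1.2ch, 7.1.4ch), but the mechanism is the same weighted-sum transformation in each case.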
To further enhance the performance of our text-to-audio generation model, we incorporate a feature fusion mechanism and keyword-to-caption augmentation into the design, enabling the model to process audio inputs of variable length. We perform comprehensive experiments across three tasks: text-to-audio retrieval, zero-shot audio classification, and supervised audio classification. The results demonstrate that the proposed model achieves superior performance on text-to-audio retrieval and state-of-the-art performance in the zero-shot setting for audio classification.
The LAION-Audio-630K dataset is publicly available and can be used to train and evaluate models for a range of audio processing applications, including speech recognition, music generation, and sound localization.