Understanding IAMF Timing Model for Immersive Audio

The model consists of three main components: the Master Clock Generator (MCG), the Delay Line Generators (DLGs), and the Audio Stream Processors (ASPs).

The MCG generates a master clock signal that acts as the beat for the entire system. Each DLG applies a fixed or variable delay to its corresponding audio stream, based on the timing model parameters, so that all audio signals arrive at the listener's ears simultaneously or with minimal skew. Finally, the ASPs process and encode each audio stream before sending it to the DLGs; they can also perform additional functions such as mixing, equalization, and compression.
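As an illustrative sketch only (the function and parameter names below are not from the IAMF spec), a DLG's job can be modeled as prepending a per-stream delay, expressed in samples, to each audio stream so that all streams line up on the master clock's timeline:

```python
def apply_delays(streams, delays_samples):
    """Model each DLG as prepending silence to its stream.

    streams: list of sample lists; delays_samples: per-stream delay in samples.
    With the right per-stream delays, all signals align on a common timeline.
    """
    return [[0] * d + s for s, d in zip(streams, delays_samples)]

# Delay the second stream by two samples relative to the first.
aligned = apply_delays([[1, 2], [3, 4]], delays_samples=[0, 2])
```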
The syntax and semantics of IASequenceHeaderOBU and CodecConfigOBU are defined as follows:

// These two classes, IASequenceHeaderOBU and CodecConfigOBU, are used when
// processing audio streams in the ASPs.
// The IASequenceHeaderOBU class stores and processes sequence header information.
class IASequenceHeaderOBU {
  // Code for storing and processing sequence header information goes here.
}

// CodecConfigOBU class is used for storing and processing codec configuration information.
class CodecConfigOBU {
  // Code for storing and processing codec configuration information goes here.
}

The syntax structure of the [=ScalableChannelLayoutConfig()=] and [=ChannelAudioLayerConfig()=] classes is as follows:


// Class for configuring scalable channel layout
class ScalableChannelLayoutConfig() {
  // Number of layers in the layout, 3 bits
  unsigned int (3) num_layers;
  // Reserved bits, 5 bits
  unsigned int (5) reserved;
  // Loop through each layer and configure audio layer
  for (i = 1; i <= num_layers; i++) {
    // Create instance of ChannelAudioLayerConfig class for current layer
    ChannelAudioLayerConfig channel_audio_layer_config(i);
  }
}

// Class for configuring audio layer within a channel
class ChannelAudioLayerConfig(i) {
  // Loudspeaker layout for current layer, 4 bits
  unsigned int (4) loudspeaker_layout(i);
  // Flag for presence of output gain, 1 bit
  unsigned int (1) output_gain_is_present_flag(i);
  // Flag for presence of reconstruction gain, 1 bit
  unsigned int (1) recon_gain_is_present_flag(i);
  // Reserved bits, 2 bits
  unsigned int (2) reserved;
  // Number of substreams for current layer, 8 bits
  unsigned int (8) substream_count(i);
  // Number of coupled substreams for current layer, 8 bits
  unsigned int (8) coupled_substream_count(i);
  // Check if output gain is present
  if (output_gain_is_present_flag(i) == 1) {
    // Output gain flags, 6 bits
    unsigned int (6) output_gain_flags(i);
    // Reserved bits, 2 bits
    unsigned int (2) reserved;
    // Output gain value, 16 bits
    signed int (16) output_gain(i);
  }
}

The [=ScalableChannelLayoutConfig()=] class specifies the number of layers and their corresponding channel layouts. Each layer can have a different loudspeaker layout, which is specified by the [=loudspeaker_layout(i)=] field in the [=ChannelAudioLayerConfig()=] class. The output gain for each substream can also be adjusted using the [=output_gain(i)=] field if necessary.
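To make the field layout above concrete, here is a sketch of reading those fields from a byte buffer in Python, assuming MSB-first bit packing. The `BitReader` helper and the dictionary shape are illustrative assumptions, not part of the spec; this is not a conformant IAMF parser.

```python
class BitReader:
    """Read MSB-first bit fields from a bytes object."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current bit position

    def read(self, nbits):
        value = 0
        for _ in range(nbits):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


def parse_scalable_channel_layout_config(data):
    """Parse the fields in the order the syntax block above declares them."""
    r = BitReader(data)
    num_layers = r.read(3)
    r.read(5)  # reserved
    layers = []
    for _ in range(num_layers):
        layer = {
            "loudspeaker_layout": r.read(4),
            "output_gain_is_present_flag": r.read(1),
            "recon_gain_is_present_flag": r.read(1),
            "reserved": r.read(2),
            "substream_count": r.read(8),
            "coupled_substream_count": r.read(8),
        }
        if layer["output_gain_is_present_flag"] == 1:
            layer["output_gain_flags"] = r.read(6)
            r.read(2)  # reserved
            raw = r.read(16)
            # signed int (16): interpret as two's complement
            layer["output_gain"] = raw - (1 << 16) if raw & (1 << 15) else raw
        layers.append(layer)
    return {"num_layers": num_layers, "layers": layers}
```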

The fields of the IASequenceHeaderOBU are defined as follows:


// This class represents the IASequenceHeaderOBU, which is used to store information about a sequence of audio data.
class IASequenceHeaderOBU() {
  unsigned int (16) sequence_number; // This field stores the sequence number of the audio data.
  unsigned int (8) reserved; // This field is reserved for future use and is currently not used.
}

The [=sequence_number=] field identifies this audio sequence. It is used to ensure that audio sequences are played in order, without gaps or overlaps; its value is defined by the IAMF specification.
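Since both fields above are byte-aligned (16 bits plus 8 bits), they can be decoded with a one-line `struct` call. This sketch assumes big-endian byte order and that `payload` starts at the first field; neither assumption is stated in this section.

```python
import struct

def parse_ia_sequence_header(payload):
    """Decode the 16-bit sequence_number and 8-bit reserved fields."""
    sequence_number, reserved = struct.unpack(">HB", payload[:3])
    return {"sequence_number": sequence_number, "reserved": reserved}
```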
In terms of semantics, the Temporal Delimiter OBU identifies [=Temporal Units=]: time intervals during which a set of audio frames with similar content is played. Grouping frames this way allows immersive audio data to be transmitted and stored efficiently while keeping playback in order, without gaps or overlaps. The syntax structure of this OBU is as follows:


// This class represents a Temporal Delimiter OBU, which is used to divide audio frames into time intervals for efficient transmission and storage.

class TemporalDelimiterOBU {
  // Unique identifier of the current temporal unit, 16 bits
  unsigned int (16) temporal_unit_id;
}

The [=temporal_unit_id=] field specifies the ID of the current temporal unit. This value is used to ensure that all audio frames within a given temporal unit are played in order and without gaps or overlaps. The syntax structure for Audio Frame OBUs, which contain coded audio data for an [=Audio Substream=], remains unchanged:

// The AudioFrameOBU class contains coded audio data for an Audio Substream.
// It takes audio_substream_id_in_bitstream as a parameter.
class AudioFrameOBU(audio_substream_id_in_bitstream) {
  // An explicit substream ID is read only when one is signaled in the bitstream.
  if (audio_substream_id_in_bitstream) {
    // Read explicit_audio_substream_id as an leb128-coded integer.
    leb128() explicit_audio_substream_id;
  }
  // Coded audio data for the Audio Substream, 8 x coded_frame_size bits.
  unsigned int (8 x coded_frame_size) audio_frame;
}

The [=explicit_audio_substream_id=] field is present only when the OBU type is set to OBU_IA_Audio_Frame, and it indicates the ID of the current audio substream. When this field is not present (i.e., for other OBU types), the implicit [=audio_substream/audio_substream_id=] values are used instead.
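The `leb128()` read used in the syntax above denotes an unsigned LEB128 integer: the value is stored 7 bits per byte, least-significant group first, with the high bit of each byte set on every byte except the last. A minimal decoder sketch (the function name and return convention are my own):

```python
def read_leb128(data, pos=0):
    """Decode an unsigned LEB128 integer; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        # Accumulate the low 7 bits of each byte, least-significant first.
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:  # high bit clear marks the final byte
            return value, pos
        shift += 7
```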
Overall, the IAMF Timing Model provides a flexible and efficient framework for synchronizing multiple audio streams in immersive audio applications: Temporal Delimiters group frames with similar content, while the sequence and substream IDs ensure that all audio is played in order, without gaps or overlaps.
