EnCodec: A Highly Efficient and Robust Audio Compression Algorithm


It works by encoding an audio signal into compact sequences of discrete codebook indices (using residual vector quantization) and training a neural decoder to reconstruct high-quality audio from those indices. Because so few bits are needed per second of audio, the signal stays crisp and clear even when it is transmitted over a poor cellular connection.
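To make the codebook idea concrete, here is a toy sketch of residual vector quantization (RVQ), the mechanism EnCodec's quantizer is built on. The codebooks below are random placeholders purely for illustration, and the sizes are made up; the real model learns its codebooks during training:

# A toy sketch of residual vector quantization (RVQ).
# The codebooks are random here; EnCodec learns them during training.
import torch

n_q, codebook_size, dim = 4, 16, 8    # 4 quantizer stages, 16 entries each (made-up sizes)
codebooks = [torch.randn(codebook_size, dim) for _ in range(n_q)]

x = torch.randn(dim)                  # one latent frame to quantize
residual, codes = x.clone(), []
for cb in codebooks:
    idx = torch.cdist(residual.unsqueeze(0), cb).argmin()  # nearest codebook entry
    codes.append(idx.item())          # only this integer index is stored/transmitted
    residual = residual - cb[idx]     # the next stage quantizes what is left over

print(codes)  # e.g. [3, 11, 7, 0]: four small integers stand in for the whole frame

Each stage quantizes the error left by the previous one, which is why adding more codebooks (a higher bitrate) gives a more faithful reconstruction.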

To use EnCodec for encoding audio files through Hugging Face's Transformers library, you can follow this example (decoding is shown further below):

# Import the required libraries
import torch
from datasets import load_dataset, Audio
from transformers import EncodecModel, AutoProcessor

# Load the dataset (in this case, a small Librispeech test set)
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

# Load the pretrained 24 kHz EnCodec model and its processor from the Hugging Face Hub
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

# Resample the audio column to match EnCodec's expected input (24 kHz mono)
librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=processor.sampling_rate))

# Iterate over the dataset and encode each audio clip
for example in librispeech_dummy:
    audio = example["audio"]["array"]

    # The processor pads the waveform and returns model-ready tensors
    inputs = processor(raw_audio=audio, sampling_rate=processor.sampling_rate, return_tensors="pt")

    # Encode the audio into discrete codebook indices
    with torch.no_grad():
        encoder_outputs = model.encode(inputs["input_values"], inputs["padding_mask"])

    # audio_codes holds the integer codebook indices: several parallel
    # codebooks (the n_q axis), each with one index per latent time step
    codes = encoder_outputs.audio_codes

    # Print some information about the clip and its compressed form
    print(f"Original duration: {len(audio) / processor.sampling_rate:.2f} seconds")
    print(f"Code tensor shape: {tuple(codes.shape)}")

In this example, we first load a dataset (in this case, Librispeech), resample its audio column to EnCodec's 24 kHz mono format, and then iterate over each clip using a `for` loop. For each clip, the processor prepares the waveform, the model encodes it into discrete codebook indices, and we print the clip's duration along with the shape of the resulting code tensor.
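Since EnCodec is a codec, the codes can also be decoded back into a waveform. A minimal round-trip sketch, reusing the `model`, `inputs`, and `encoder_outputs` left over from the last loop iteration above:

# Decode the discrete codes back into a waveform (a round-trip check)
with torch.no_grad():
    reconstructed = model.decode(
        encoder_outputs.audio_codes,
        encoder_outputs.audio_scales,
        inputs["padding_mask"],
    )[0]

# The decoder returns a (batch, channels, samples) tensor at 24 kHz
print(reconstructed.shape)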

Note that this example assumes you have already installed Hugging Face's Transformers and Datasets libraries, along with PyTorch. If not, you can follow their respective installation instructions to get started.
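EnCodec also lets you trade reconstruction quality against bitrate. A minimal sketch, again reusing `model` and `inputs` from the example above; the exact bandwidth values supported by a checkpoint are listed in its config:

# Supported target bandwidths for this checkpoint, in kbps
print(model.config.target_bandwidths)  # e.g. [1.5, 3.0, 6.0, 12.0, 24.0]

# Encode the same audio at a low and a high target bandwidth
with torch.no_grad():
    low = model.encode(inputs["input_values"], inputs["padding_mask"], bandwidth=1.5)
    high = model.encode(inputs["input_values"], inputs["padding_mask"], bandwidth=24.0)

# A higher bandwidth uses more codebooks, so the code tensor grows
print(low.audio_codes.shape, high.audio_codes.shape)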
