NVIDIA TensorRT Model Optimizer: A Library to Quantize and Compress Deep Learning Models for Optimized Inference on GPUs



Let’s walk through how NVIDIA TensorRT Model Optimizer works, using examples and simplified language.

First, let’s say you have a deep learning model that needs to be optimized for faster inference on GPUs. You load this model into the Model Optimizer API, which is essentially a Python library for quantizing and compressing models so they run well with TensorRT. Then you choose a compression technique such as Post-Training Quantization (PTQ) or Sparsity.

Post-Training Quantization converts a trained model’s floating-point weights (and often activations) to lower-precision values such as INT8 or FP8 after training, usually with just a short calibration pass over representative data instead of any retraining. This can significantly reduce the size of your model and speed up inference while largely preserving accuracy. For example, a pre-trained Hugging Face model stored in FP16 shrinks to roughly half its size with INT8 weights and roughly a quarter with INT4, typically with only a small accuracy drop when a suitable recipe is used.
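As a minimal sketch of what this looks like in code, assuming the `nvidia-modelopt` package and its `mtq.quantize` API (the model, config choice, and calibration loop below are illustrative, not a full recipe):

```python
# Illustrative PTQ sketch with TensorRT Model Optimizer (package: nvidia-modelopt).
# The config name and calibration prompts are representative only; check the
# Model Optimizer docs for the recipe that matches your model and precision.
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def forward_loop(model):
    # Run a few representative batches so Model Optimizer can calibrate
    # activation ranges; a real recipe uses a proper calibration dataset.
    for prompt in ["Hello world", "TensorRT Model Optimizer example"]:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)

# Replace weights/activations with INT8 quantization calibrated by forward_loop;
# the quantized model can then be exported for deployment with TensorRT.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
```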

Sparsity, on the other hand, reduces the memory footprint by removing weights that contribute little to the output. With 2:4 structured sparsity, half of the weights in eligible layers are zeroed in a pattern that NVIDIA GPUs (Ampere and later) can skip with sparse Tensor Cores, so the model gets smaller and the remaining math can run faster. For example, a pre-trained NeMo model can have roughly half of its weights pruned this way, usually with only a modest accuracy impact that light fine-tuning can largely recover.
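Here is a minimal sketch of the idea behind 2:4 structured sparsity, written in plain PyTorch rather than the Model Optimizer API (which applies this kind of pattern for you, with smarter criteria than simple magnitude): in every group of four weights, keep the two largest in magnitude and zero the other two.

```python
# Conceptual illustration of 2:4 structured sparsity (not the Model Optimizer API):
# in each group of 4 weights, keep the 2 with the largest magnitude, zero the rest.
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    flat = weight.reshape(-1, 4)                       # group weights in fours
    # indices of the 2 smallest-magnitude weights in each group
    drop = flat.abs().topk(2, dim=1, largest=False).indices
    mask = torch.ones_like(flat)
    mask.scatter_(1, drop, 0.0)                        # zero the 2 smallest
    return (flat * mask).reshape(weight.shape)

w = torch.randn(8, 8)
w_sparse = prune_2_of_4(w)
# Every group of 4 now has at most 2 nonzero weights.
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=1).max() <= 2
```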

Both PTQ and sparsity can significantly reduce the size of your model while preserving most of its accuracy, which makes them a good fit for production environments where every millisecond (and every gigabyte of GPU memory) counts. And with support for popular pre-trained models and frameworks like Hugging Face, NeMo, Megatron-LM, Medusa, Diffusers, and more, there is a good chance an existing recipe already fits your needs.

So if you want to speed up your deep learning inference without giving up much accuracy, and cut memory usage at the same time, give the NVIDIA TensorRT Model Optimizer a try today! And don’t forget to check out the examples section for recipes and benchmarks.

To generate processed input data for inference with the TensorRT UNet sample, run the following commands:

1. Install the MedPy library with pip: `pip install medpy`
2. Create three directories named test_data_set_0, test_data_set_1, and test_data_set_2 inside a folder of your choice (e.g., code-samples/posts/TensorRT-introduction-updated).
3. Run the `prepareData.py` script once per data set to generate its input and output tensors:
For example, run `python prepareData.py --input_image your_image1 --input_tensor test_data_set_0/input_0.pb --output_tensor test_data_set_0/output_0.pb` to generate the input and output data for set 0 from an image named `your_image1`.
Repeat this step for each set (i.e., run a similar command with different image and directory arguments for sets 1 and 2).

That’s it: you now have the input data ready to perform inference!

Import the ONNX model into TensorRT, generate the engine, and perform inference using the following steps:

1. Change directory to `code-samples/posts/TensorRT-introduction-updated` (or wherever you saved your files).
2. Run the sample application with the trained model and input data passed as inputs by running this command: `./simpleOnnx path/to/unet/unet.onnx fp32 path/to/unet/test_data_set_0/input_0.pb` (replace “path/to” with your actual file paths).
This assumes that the ONNX model is named `unet.onnx`, and the input data for set 0 is stored in a directory called `test_data_set_0`. If you used different names or directories, adjust these arguments accordingly.
3. The sample application compares the output generated by TensorRT with the reference values available as ONNX .pb files in the same folder and summarizes the result at the prompt.
4. It can take a few seconds to import the UNet ONNX model and generate the engine. It also generates the output image in the portable gray map (PGM) format as `output.pgm`.
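The steps above drive the prebuilt simpleOnnx C++ sample. For reference, here is roughly what the import-and-build step looks like with the TensorRT Python API (a sketch assuming TensorRT 8.x; the paths and builder settings are placeholders):

```python
# Minimal sketch of importing an ONNX model and building a TensorRT engine
# with the Python API (the simpleOnnx sample does the equivalent in C++).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("path/to/unet/unet.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

config = builder.create_builder_config()
# Engine building can take a few seconds, as noted above.
serialized_engine = builder.build_serialized_network(network, config)

with open("unet.engine", "wb") as f:
    f.write(bytes(serialized_engine))
```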

Using the cudaStreamSynchronize function after calling launchInference ensures GPU computations complete before the results are accessed. The number of inputs and outputs, as well as the name, dimensions, and data type of each, can be queried using functions on the ICudaEngine class.
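The same pattern looks roughly like the sketch below in the TensorRT Python API, assuming a serialized engine file and the `cuda-python` package; the methods shown are from the newer I/O tensor API, and older releases expose equivalent binding methods:

```python
# Sketch: querying engine I/O and synchronizing the CUDA stream in Python.
# The C++ sample does the same with ICudaEngine methods and cudaStreamSynchronize.
import tensorrt as trt
from cuda import cudart  # pip install cuda-python

logger = trt.Logger(trt.Logger.WARNING)
with open("unet.engine", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# Query the number of I/O tensors and the name, mode, shape, and dtype of each.
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    print(name, engine.get_tensor_mode(name),
          engine.get_tensor_shape(name), engine.get_tensor_dtype(name))

# Inference is enqueued asynchronously on a CUDA stream (launchInference in the
# C++ sample); synchronize the stream before reading results back on the host.
_, stream = cudart.cudaStreamCreate()
# ... enqueue inference on `stream` with an execution context here ...
cudart.cudaStreamSynchronize(stream)
cudart.cudaStreamDestroy(stream)
```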

Batch your inputs
This application example expects a single input and returns output after performing inference on it. Real applications commonly batch inputs to achieve higher performance and efficiency: inputs that are identical in shape and size can be processed together, so each layer of the network operates on the whole batch at once and keeps the GPU better utilized.

Larger batches generally enable more efficient use of GPU resources. For example, batch sizes using multiples of 32 may be particularly fast and efficient in lower precision on Volta and Turing GPUs because TensorRT can use special kernels for matrix multiply and fully connected layers that leverage Tensor Cores.
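As a sketch of the input side of batching (assuming an engine built with a dynamic batch dimension; the shapes and tensor name below are illustrative):

```python
# Sketch: grouping identically shaped inputs into one batch for inference.
# A batch is just an extra leading dimension; the engine must have been built
# with a dynamic (or matching) batch size for this to work.
import numpy as np

# Three single inputs of identical shape, e.g. one-channel 256x256 images.
inputs = [np.random.rand(1, 256, 256).astype(np.float32) for _ in range(3)]

batch = np.ascontiguousarray(np.stack(inputs, axis=0))   # shape (3, 1, 256, 256)
print(batch.shape)

# With the TensorRT Python API you would then tell the execution context the
# actual batch size before enqueueing, e.g.:
#   context.set_input_shape(input_name, batch.shape)
# and copy `batch` to device memory in a single transfer.
```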

Pass the images to the application on the command line. For example, assuming the sample accepts one input tensor per data set, you could pass all three sets at once: `./simpleOnnx path/to/unet/unet.onnx fp32 path/to/unet/test_data_set_*/input_0.pb`.
