“Isn’t it easier just to use the pre-trained models provided by NVIDIA?” Well, let me tell ya something: sometimes you need a little more control over your inference pipeline than those pre-trained models can provide. Maybe you have custom data that needs to be fed into the model, or maybe you want to fine-tune the weights yourself. Whatever the reason may be, building and running the TensorRT samples with cuDLA standalone mode is a great way to get started.
So, let’s dive right in! To kick things off, make sure you have all of the necessary tools installed on your machine. You’ll need CUDA Toolkit 10.2 or later, TensorRT (it’s installed separately from the CUDA Toolkit), cuDNN v7.6.5 or later, and a platform that actually has a DLA core, such as a Jetson AGX Xavier or Orin module, since cuDLA drives the Deep Learning Accelerator rather than the GPU. If you haven’t already done so, grab the latest cuDLA samples from NVIDIA’s GitHub and extract them to your preferred location; the cuDLA library itself ships with the CUDA Toolkit on those platforms.
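Before going any further, it’s worth a quick sanity check that the Python side of the stack can actually see TensorRT and a CUDA device. Here’s a tiny check script, assuming the tensorrt and pycuda Python packages are installed (those package names are my assumption; use whatever bindings your setup provides):
# Quick sanity check: can Python see TensorRT and a CUDA device?
import tensorrt as trt
import pycuda.driver as cuda

cuda.init()                                   # initialize the CUDA driver API
print("TensorRT version:", trt.__version__)   # e.g. an 8.x release on recent JetPack versions
print("CUDA devices found:", cuda.Device.count())
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    print("  device {}: {}, {} MiB".format(i, dev.name(), dev.total_memory() // (1024 ** 2)))
If that prints a version and at least one device, you’re ready to move on.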
Next, let’s take a look at some sample code that shows the runtime side of the pipeline. Here’s an example script that loads an image, preprocesses it into the contiguous NCHW layout the engine expects, and then feeds the resulting tensor into a pre-built TensorRT engine for inference (the engine itself can be built to target the DLA; more on that below):
# Import necessary libraries
import time                    # measure inference latency
import numpy as np             # array operations on the host
from PIL import Image          # image loading and resizing
import tensorrt as trt         # TensorRT Python API
import pycuda.driver as cuda   # device memory allocation and copies
import pycuda.autoinit         # creates a CUDA context on import

# Load image from file and preprocess it on the host
img_path = 'input.jpg'                                        # path to the input image
width, height = 256, 256                                      # spatial size the engine expects
image = Image.open(img_path).convert('RGB').resize((width, height))
image = np.asarray(image, dtype=np.float32) / 255.0           # normalize to [0, 1]
image = np.ascontiguousarray(image.transpose(2, 0, 1)[None])  # HWC -> NCHW with batch size 1

# Load the pre-built model from a serialized TensorRT engine file
engine_path = 'model.trt'                                     # path to the serialized engine
logger = trt.Logger(trt.Logger.WARNING)                       # logger required by the runtime
runtime = trt.Runtime(logger)
with open(engine_path, 'rb') as f:                            # open engine file in binary mode
    engine = runtime.deserialize_cuda_engine(f.read())        # deserialize the engine
context = engine.create_execution_context()                   # execution context for inference

# Allocate host and device buffers for every binding (inputs and outputs)
bindings, buffers = [], []
for i in range(engine.num_bindings):
    shape = tuple(engine.get_binding_shape(i))                # binding shape, including batch dim
    dtype = trt.nptype(engine.get_binding_dtype(i))           # numpy dtype matching the binding
    host_buf = np.empty(shape, dtype=dtype)                   # host-side staging buffer
    device_buf = cuda.mem_alloc(host_buf.nbytes)              # matching device allocation
    bindings.append(int(device_buf))                          # execute_v2 wants raw device pointers
    buffers.append((host_buf, device_buf, engine.binding_is_input(i)))

# Copy the preprocessed image into the input binding
input_host, input_device, _ = next(b for b in buffers if b[2])
np.copyto(input_host, image.astype(input_host.dtype).reshape(input_host.shape))
cuda.memcpy_htod(input_device, input_host)

# Run inference and measure execution time
start = time.time()                                           # start measuring execution time
for _ in range(5):                                            # run inference 5 times
    context.execute_v2(bindings)                              # synchronous inference; runs on the DLA if the engine was built for it
end = time.time()                                             # end measuring execution time

# Copy the first output binding back to the host and report timing
output_host, output_device, _ = next(b for b in buffers if not b[2])
cuda.memcpy_dtoh(output_host, output_device)
print("Inference took {} ms over 5 runs".format(round((end - start) * 1000, 2)))
As you can see, this script loads an image from a file, preprocesses it on the host into the contiguous NCHW layout the engine expects, copies it into device memory, and then runs inference through a deserialized TensorRT engine. Once the timed runs are done, the output is copied back from device memory into a host array.
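One thing the script above takes for granted is that model.trt already exists. If you want that engine to target the DLA core instead of the GPU, you have to ask for it at build time. Here’s a minimal sketch of how that could look with the TensorRT Python builder API; the model.onnx file name, the FP16 choice, and which layers actually land on the DLA are all assumptions that depend on your network and TensorRT version:
# Minimal sketch: build a TensorRT engine that prefers the DLA core.
# Assumes an ONNX model named 'model.onnx'; adjust names and shapes for your network.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open('model.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise SystemExit('failed to parse model.onnx')

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)             # the DLA runs FP16 or INT8, not FP32
config.default_device_type = trt.DeviceType.DLA   # prefer the DLA for every layer
config.DLA_core = 0                               # which DLA core to use
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)     # let unsupported layers fall back to the GPU

serialized = builder.build_serialized_network(network, config)
with open('model.trt', 'wb') as f:
    f.write(serialized)
Depending on your TensorRT version, there’s also an EngineCapability.DLA_STANDALONE setting (and a trtexec --buildDLAStandalone flag) for producing a pure DLA loadable that the cuDLA C API can consume directly, but the TensorRT runtime path shown above is the simpler way to get going from Python.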
Now, let me tell ya something: this might not be the most efficient way to use cuDLA with TensorRT, but it does demonstrate how to build and run custom models with both pieces working together. And hey, sometimes that’s all you need!
Just remember to keep an open mind and be willing to experiment with different techniques until you find what works best for your specific use case!