To use the library, you first define your model and quantization configuration in a short Python script, specifying details such as the target data types (for example INT8 or FP8) and which parts of the model to quantize. You then call the Model Optimizer API to convert your model into an optimized format suitable for deployment on TensorRT-enabled NVIDIA GPUs.
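Concretely, with the nvidia-modelopt package that configuration is usually one of the predefined quantization recipes, optionally tweaked before it is handed to the API. A minimal sketch of the pattern, assuming a recent nvidia-modelopt release (the config keys and the "*classifier*" pattern below are illustrative and may differ between versions):
import copy
import modelopt.torch.quantization as mtq  # TensorRT Model Optimizer, PyTorch API

# Start from a predefined INT8 recipe and tweak it (illustrative customization)
config = copy.deepcopy(mtq.INT8_DEFAULT_CFG)
config["quant_cfg"]["*classifier*"] = {"enable": False}  # Assumption: skip quantizing the classification head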
For example, let’s say we have a pretrained BERT model, loaded through the Hugging Face Transformers library, that we want to set up for sentiment analysis and then optimize for deployment:
# Import necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import modelopt.torch.quantization as mtq  # TensorRT Model Optimizer (nvidia-modelopt)
# Load the pretrained BERT tokenizer from the Hugging Face Hub
model_config = {
    "model_name_or_path": "bert-base-uncased",
    "problem_type": "single_label_classification",
}
tokenizer = AutoTokenizer.from_pretrained(model_config["model_name_or_path"])
# Choose a quantization configuration for TensorRT Model Optimizer.
# INT8_DEFAULT_CFG applies per-channel, symmetric quantization to the weights;
# the exact defaults depend on the installed nvidia-modelopt version.
quantization_config = mtq.INT8_DEFAULT_CFG
model_kwargs = dict(
    num_labels=2,                               # Binary sentiment classification (negative/positive)
    problem_type=model_config["problem_type"],  # Single-label classification head and loss
    revision="main",                            # Checkpoint revision to load from the Hub
    torch_dtype="auto",                         # Use the data type stored in the checkpoint
    attn_implementation="eager",                # Standard attention; "sdpa" also works on recent versions
    device_map=None,                            # Or "auto" to let Accelerate place weights on available GPUs
)
# Create the model instance with the specified configuration
model = AutoModelForSequenceClassification.from_pretrained(
    model_config["model_name_or_path"], **model_kwargs
)
model.eval()
# Define an example input and label for calibration or inference
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1])  # Class index for single-label classification
# Convert the model to an optimized, quantized form with the Model Optimizer API.
# The forward_loop feeds representative data through the model so that
# calibration statistics can be collected before quantizers are inserted.
def forward_loop(m):
    with torch.no_grad():
        m(**inputs)

optimized_model = mtq.quantize(model, quantization_config, forward_loop)
In this example, we first load the pretrained BERT tokenizer and model from the Hugging Face Hub using the AutoTokenizer and AutoModelForSequenceClassification classes provided by the Transformers library. We then pick a quantization configuration for TensorRT Model Optimizer; the predefined INT8 configuration applies per-channel, symmetric quantization to the model weights and can be customized if different settings are needed.
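The INT8 recipe is only one of the predefined options. Assuming a recent nvidia-modelopt release (configuration names can vary between versions), lower-precision recipes are selected the same way:
# Other predefined quantization recipes (availability depends on the modelopt version)
fp8_config = mtq.FP8_DEFAULT_CFG   # FP8 weights and activations, for Hopper/Ada-class GPUs
int4_config = mtq.INT4_AWQ_CFG     # 4-bit weight-only quantization using the AWQ algorithm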
Next, we define a small forward_loop that runs representative inputs through the model and pass it, together with the model and the chosen configuration, to the Model Optimizer quantization API. The loop lets the optimizer collect calibration statistics before it inserts quantizers, producing a model that is ready to be compiled for TensorRT-enabled NVIDIA GPUs.
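In practice, the calibration loop should iterate over a small set of representative inputs rather than a single sentence. A minimal sketch (the sample sentences below are made up for illustration):
# A handful of representative inputs for calibration (illustrative samples)
calibration_texts = [
    "The movie was absolutely wonderful.",
    "I would not recommend this product to anyone.",
]

def forward_loop(m):
    # Run each calibration sample through the model so activation ranges can be observed
    with torch.no_grad():
        for text in calibration_texts:
            m(**tokenizer(text, return_tensors="pt"))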
Finally, calling mtq.quantize() returns the quantized model (the original model object is modified in place). This optimized model can then be exported and built into a TensorRT engine for inference or deployment on edge devices with low latency and a small memory footprint.
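A common route to an actual TensorRT engine is to export the quantized model to ONNX and compile it with trtexec. A rough sketch, assuming the quantized modules in your modelopt version export cleanly through the standard ONNX exporter (file names are placeholders):
# Export the quantized model to ONNX so TensorRT can consume it
torch.onnx.export(
    optimized_model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "bert_sentiment_int8.onnx",  # Placeholder output path
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    opset_version=17,
)
# The ONNX file can then be compiled into an engine, e.g. with
# `trtexec --onnx=bert_sentiment_int8.onnx --saveEngine=bert_sentiment_int8.engine`.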