TensorRT-LLM: A High-Performance Inference Server for Language Models

TensorRT is NVIDIA's library for optimizing deep learning models so they run as fast as possible on a GPU. And an "inference server" just means a service that can handle multiple requests at once: if a bunch of people send in questions at the same time, we want to answer all of them as quickly as possible.
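To make "multiple requests at once" concrete before any GPU machinery shows up, here is a minimal sketch using plain Hugging Face transformers (no TensorRT yet): several people's texts get tokenized into one padded batch and classified in a single forward pass. The model name and example sentences are just placeholders.

import torch
from transformers import AutoTokenizer, BertForSequenceClassification

# Placeholder checkpoint; any sequence-classification model works the same way
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model.eval()

# Three "requests" from three different users, answered in one batched call
requests = [
    "This movie was great.",
    "The service was terrible.",
    "It was fine, I guess.",
]
batch = tokenizer(requests, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits   # shape: (3, num_labels)
predictions = logits.argmax(dim=-1)  # one prediction per request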

So how does this work in practice? Say you have a big text document you want to analyze with a language model (like BERT or GPT-3). Instead of running the math on your CPU, which is painfully slow for models this size, we use TensorRT to optimize the model so it runs fast on your GPU, and then we put an inference server in front of it so it can handle many requests at once, whether that's lots of people asking questions or different parts of the same document being analyzed in parallel.

Here's a simplified sketch of how this might look in code. TensorRT consumes ONNX graphs rather than TensorFlow or PyTorch models directly, so we export BERT to ONNX first and then compile it with TensorRT's Python API (TensorRT 8.x shown here; the GPU buffer handling uses PyCUDA):

# Import necessary libraries
import numpy as np
import tensorrt as trt
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

# Load the language model and tokenizer (the PyTorch BERT checkpoint, which we
# will export to ONNX so TensorRT can consume it)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
model.eval()

# Load the text and preprocess it into fixed-length token IDs
text = "This is some sample text that we want to analyze."
tokens = tokenizer(text, padding='max_length', truncation=True, max_length=128, return_tensors="pt")

# Export the model to ONNX, then have TensorRT optimize it for the GPU
torch.onnx.export(
    model,
    (tokens['input_ids'], tokens['attention_mask']),
    'bert-base-uncased.onnx',
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    opset_version=14,
)

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open('bert-base-uncased.onnx', 'rb') as f:
    if not parser.parse(f.read()):  # Parse the ONNX graph into a TensorRT network
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GB of builder workspace

# Build the optimized engine (batch size 1, matching the shapes we exported with)
serialized_engine = builder.build_serialized_network(network, config)
with open('bert-base-uncased.engine', 'wb') as f:
    f.write(serialized_engine)  # Write the serialized engine to a binary file

# Load the optimized engine and run inference on the GPU
import pycuda.autoinit  # noqa: F401  (creates a CUDA context for us)
import pycuda.driver as cuda

runtime = trt.Runtime(logger)
with open('bert-base-uncased.engine', 'rb') as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()  # Handles our inference requests

# Host-side buffers with the shapes and dtypes the engine expects
input_ids = tokens['input_ids'].numpy().astype(np.int32)            # shape (1, 128)
attention_mask = tokens['attention_mask'].numpy().astype(np.int32)  # shape (1, 128)
logits = np.zeros((1, model.config.num_labels), dtype=np.float32)   # output buffer

# Device-side buffers: allocate GPU memory and copy the inputs over
d_input_ids = cuda.mem_alloc(input_ids.nbytes)
d_attention_mask = cuda.mem_alloc(attention_mask.nbytes)
d_logits = cuda.mem_alloc(logits.nbytes)
cuda.memcpy_htod(d_input_ids, input_ids)
cuda.memcpy_htod(d_attention_mask, attention_mask)

# Run inference; the bindings are device pointers in the engine's binding order
context.execute_v2([int(d_input_ids), int(d_attention_mask), int(d_logits)])
cuda.memcpy_dtoh(logits, d_logits)  # Copy the results back to the host

predictions = np.argmax(logits, axis=1)  # Highest-scoring class for our text

So basically what we've done here is take a BERT model, let TensorRT optimize the math, and run it on the GPU, which is much faster than doing the same work on a CPU for big documents or lots of requests. A compiled engine plus an execution context is exactly what an inference server builds on to handle multiple requests at once, and since BERT is one of the most widely used language models, this setup covers all kinds of text-analysis tasks.
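One caveat: the engine above was built for a fixed batch of one, so serving several requests in a single pass takes one more step. The sketch below shows the general idea using a TensorRT optimization profile; it assumes the ONNX export is redone with dynamic_axes on the batch dimension, that the input names match the ones used in the export above, and it uses the TensorRT 8.x binding API.

# Sketch: build the engine with a dynamic batch dimension so it can serve batches.
# Assumes torch.onnx.export was called with
#   dynamic_axes={'input_ids': {0: 'batch'}, 'attention_mask': {0: 'batch'}, 'logits': {0: 'batch'}}
profile = builder.create_optimization_profile()
profile.set_shape('input_ids', (1, 128), (8, 128), (32, 128))       # min / optimal / max shapes
profile.set_shape('attention_mask', (1, 128), (8, 128), (32, 128))
config.add_optimization_profile(profile)
serialized_engine = builder.build_serialized_network(network, config)

# At inference time, tell the execution context the actual batch size before
# binding device buffers sized for that batch, then call execute_v2 as before.
batch_size = 8
context.set_binding_shape(0, (batch_size, 128))  # binding 0 = input_ids
context.set_binding_shape(1, (batch_size, 128))  # binding 1 = attention_mask

With that in place, a server loop can collect incoming texts, tokenize them as one padded batch (just like the small transformers example near the top), and push the whole batch through the engine in a single call.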
