That’s where the Neuron SDK comes in: it compiles and optimizes those models so that they run efficiently on Inf1 without sacrificing accuracy.
Here’s an example: let’s say you want to deploy a Transformers question answering model from the Hugging Face library (which is popular among NLP researchers). Normally, you would load the model in Python and run inference on a CPU or GPU server. With the Neuron SDK, tracing and compiling the model for Inf1 comes down to a single extra call; the sketch below uses the PyTorch Neuron (torch-neuron) API, and the checkpoint name and example question are illustrative:
# Import the necessary libraries (torch-neuron provides torch.neuron.trace)
import torch
import torch_neuron
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Load a BERT question answering model fine-tuned on SQuAD from Hugging Face.
# return_dict=False makes the model return plain tuples, which tracing expects.
model_name = 'bert-large-uncased-whole-word-masking-finetuned-squad'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name, return_dict=False)
model.eval()

# Build example inputs with a fixed sequence length; Neuron compiles for static shapes.
question = 'What does the Neuron SDK do?'
context = 'The AWS Neuron SDK compiles deep learning models so they run efficiently on Inf1 instances.'
encoding = tokenizer(question, context, max_length=128, padding='max_length',
                     truncation=True, return_tensors='pt')
example_inputs = (encoding['input_ids'],
                  encoding['attention_mask'],
                  encoding['token_type_ids'])

# The one extra line: trace and compile the model for Inferentia.
model_neuron = torch.neuron.trace(model, example_inputs)

# Save the compiled model so it can be loaded later on an Inf1 instance.
model_neuron.save('bert_qa_neuron.pt')
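If you want to sanity-check the compiled model, you can feed it the same example inputs and decode the predicted answer span. This is a minimal sketch assuming the torch-neuron setup above; actually executing the compiled graph requires an Inf1 instance with the Neuron runtime installed:
# Run the compiled model; with return_dict=False it returns (start_logits, end_logits).
start_logits, end_logits = model_neuron(*example_inputs)

# Pick the most likely start and end token positions and decode them back to text.
answer_start = int(torch.argmax(start_logits))
answer_end = int(torch.argmax(end_logits)) + 1
answer_ids = encoding['input_ids'][0][answer_start:answer_end]
print(tokenizer.decode(answer_ids, skip_special_tokens=True))

# On the Inf1 instance itself, the saved artifact can be reloaded without recompiling:
# model_neuron = torch.jit.load('bert_qa_neuron.pt')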
That’s it! Now your Transformers model is compiled and ready for deployment on AWS Inf1 with the Neuron SDK. And if you want to learn more about which models can be converted out of the box, check out the Model Architecture Fit section in the Neuron documentation.
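If you’re not sure whether a particular architecture will compile cleanly, torch-neuron also ships an analysis helper that reports which operators in a model are supported on Inferentia. A rough sketch, reusing the uncompiled model and example inputs from above (the exact call and report format may differ across Neuron SDK versions):
# Print a report of which operators can run on NeuronCores and which would fall back to CPU.
torch.neuron.analyze_model(model, example_inputs=example_inputs)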