This can be time-consuming if you have long sequences or are running on slower hardware.
But what if we could save some of these intermediate calculations and reuse them later? That’s where the Key-Value Cache comes in. In self-attention, every token is projected into a query, a key, and a value. During autoregressive decoding, the keys and values of tokens that have already been processed never change, so instead of recomputing them from scratch at every step, the model stores them once and simply looks them up when generating each new token.
For example, let’s say you have the sequence “The quick brown fox jumps over the lazy dog.” and you want to generate the next word. Without a cache, the attention layers would recompute the keys and values for every word in the sequence at every step. With a Key-Value Cache, only the projections for the newest token have to be computed; the keys and values for all the earlier words are read back from the cache, so each step only adds the work for one new token instead of reprocessing the whole prefix.
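To make the idea concrete, here is a minimal, self-contained sketch of that lookup using plain jax and numpy. Everything in it (the hidden size, the random weight matrices, the attend helper) is made up purely for illustration and is not part of the Transformers API:
# A toy single-head attention step with a hand-rolled key-value cache.
# All shapes and weights here are invented for illustration only.
import jax
import jax.numpy as jnp
import numpy as np
d_model = 8  # toy hidden size
rng = np.random.default_rng(0)
# Random projections standing in for the trained query/key/value weights
W_q = jnp.asarray(rng.normal(size=(d_model, d_model)))
W_k = jnp.asarray(rng.normal(size=(d_model, d_model)))
W_v = jnp.asarray(rng.normal(size=(d_model, d_model)))
# Keys and values of tokens we have already processed live in the cache
cached_keys, cached_values = [], []
def attend(new_token_hidden):
    """One attention step for a single new token, reusing the cache."""
    # Only the new token gets fresh projections...
    q = new_token_hidden @ W_q
    cached_keys.append(new_token_hidden @ W_k)
    cached_values.append(new_token_hidden @ W_v)
    # ...the keys/values of earlier tokens are read straight from the cache
    K, V = jnp.stack(cached_keys), jnp.stack(cached_values)
    weights = jax.nn.softmax(q @ K.T / jnp.sqrt(d_model))
    return weights @ V
# Feed tokens one at a time: each step projects only one new token,
# no matter how long the sequence has grown
for hidden in rng.normal(size=(5, d_model)):
    out = attend(jnp.asarray(hidden))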
Here’s how you might do the same thing with a real model, using FlaxT5ForConditionalGeneration’s encode/decode methods and its built-in cache:
# Import necessary libraries
from transformers import AutoTokenizer, FlaxT5ForConditionalGeneration
import jax.numpy as jnp
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = FlaxT5ForConditionalGeneration.from_pretrained("t5-small")
# Define input text
text = "summarize: My friends are cool but they eat too many carbs."
# Tokenize input text and convert to numpy array
inputs = tokenizer(text, return_tensors="np")
# Encode input text using the model
encoder_outputs = model.encode(**inputs)
# Get the id of the token the decoder starts from
decoder_start_token_id = model.config.decoder_start_token_id
# Pre-allocate the key-value cache for up to max_length decoding steps
max_length = 32
past_key_values = model.init_cache(1, max_length, encoder_outputs)
# A static decoder attention mask over the full max_length; the causal mask handles the rest
decoder_attention_mask = jnp.ones((1, max_length), dtype="i4")
# Greedy decoding loop: feed one new token per step and reuse the cached keys/values
decoder_input_ids = jnp.array([[decoder_start_token_id]], dtype="i4")
generated_ids = []
for _ in range(max_length):
    outputs = model.decode(
        decoder_input_ids,
        encoder_outputs,
        decoder_attention_mask=decoder_attention_mask,
        past_key_values=past_key_values,
    )
    # The returned cache now also holds the keys/values of the token we just fed in
    past_key_values = outputs.past_key_values
    # Pick the highest-scoring next token (greedy decoding)
    next_token = int(jnp.argmax(outputs.logits[:, -1, :], axis=-1)[0])
    # Stop once the model emits the end-of-sequence token
    if next_token == model.config.eos_token_id:
        break
    generated_ids.append(next_token)
    # Only the new token is fed at the next step; earlier positions come from the cache
    decoder_input_ids = jnp.array([[next_token]], dtype="i4")
# Turn the generated ids back into text
print(tokenizer.decode(generated_ids, skip_special_tokens=True))
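In everyday use you rarely write this loop yourself: the library’s generate() method runs the same cached decoding for you, so the manual version above is mainly useful for seeing what happens under the hood. Assuming the same model and inputs as above, the equivalent call would look roughly like this:
# Built-in generation loop; the key-value cache is managed internally
summary_ids = model.generate(inputs["input_ids"], max_length=32).sequences
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))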
The Key-Value Cache is a simple but powerful optimization for speeding up decoding and generation in Transformer models. By caching the keys and values of tokens that have already been processed, we avoid redundant computation at every step while producing the same output.