Let’s say we have a large language model (LLM) that processes input text using self-attention, the standard mechanism for modeling the relationships between all tokens in a sequence.
In a naive self-attention implementation, every token in the input sequence is compared against every other token, so compute and memory grow quadratically with the sequence length. This becomes a serious bottleneck for LLMs when dealing with very long sequences or large batches.
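To make the quadratic cost concrete, here is a minimal sketch of vanilla scaled dot-product attention in PyTorch; the batch size, sequence length, and head dimension are illustrative assumptions, not values from any particular model.

```python
import torch

# Minimal sketch of vanilla self-attention (shapes are illustrative assumptions).
batch, seq_len, d_head = 1, 2048, 64
q = torch.randn(batch, seq_len, d_head)
k = torch.randn(batch, seq_len, d_head)
v = torch.randn(batch, seq_len, d_head)

# Every token attends to every other token: the score matrix has shape
# (batch, seq_len, seq_len), i.e. it grows quadratically with seq_len.
scores = q @ k.transpose(-2, -1) / d_head**0.5
weights = torch.softmax(scores, dim=-1)
out = weights @ v

print(scores.shape)  # torch.Size([1, 2048, 2048]) -- this is what blows up memory
```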
Flash Attention is an optimization that removes much of this overhead without changing the result: it splits the attention computation into blocks that fit in fast on-chip SRAM and fuses the softmax across blocks into a single kernel, so the full seq_len × seq_len attention matrix is never materialized in GPU memory. In practice this yields significant speedups (the FlashAttention paper reports up to 3x faster GPT-2 training) and reduces attention’s memory footprint from quadratic to linear in the sequence length.
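As a sketch of how this looks in user code, PyTorch 2.x exposes torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention-style fused kernel when a supported GPU and dtype are available (whether it does on a given setup is an assumption about your hardware and PyTorch build); the output matches the vanilla implementation above.

```python
import torch
import torch.nn.functional as F

# Sketch: scaled_dot_product_attention may dispatch to a FlashAttention-style
# fused kernel on supported GPUs; on CPU it falls back to the math backend.
batch, n_head, seq_len, d_head = 1, 16, 2048, 64
q = torch.randn(batch, n_head, seq_len, d_head)
k = torch.randn(batch, n_head, seq_len, d_head)
v = torch.randn(batch, n_head, seq_len, d_head)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 16, 2048, 64]); on a supported GPU the
                  # seq_len x seq_len score matrix is never materialized
```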
A separate but complementary optimization applies during auto-regressive decoding: the key-value cache. Suppose the prompt is 23 tokens long, so input_ids has shape torch.Size([1, 23]); after generating one token the sequence grows to torch.Size([1, 24]), then torch.Size([1, 25]), and so on. At each step, only the newly generated token’s query has to be compared against the cached key-value vectors of the previous tokens, instead of recomputing keys and values for the entire sequence from scratch.
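Below is a minimal pure-PyTorch sketch of this caching pattern; the single head and head dimension of 64 are assumptions chosen only to mirror the shapes in the example above.

```python
import torch

# Sketch of a key-value cache during auto-regressive decoding.
batch, d_head, n_prompt = 1, 64, 23
k_cache = torch.randn(batch, n_prompt, d_head)   # keys for the 23 prompt tokens
v_cache = torch.randn(batch, n_prompt, d_head)   # values for the 23 prompt tokens

for step in range(2):                            # decode two new tokens
    q_new = torch.randn(batch, 1, d_head)        # query for the single new token
    k_new = torch.randn(batch, 1, d_head)
    v_new = torch.randn(batch, 1, d_head)

    # Reuse cached keys/values instead of re-projecting the whole sequence.
    k_cache = torch.cat([k_cache, k_new], dim=1)
    v_cache = torch.cat([v_cache, v_new], dim=1)

    scores = q_new @ k_cache.transpose(-2, -1) / d_head**0.5
    out = torch.softmax(scores, dim=-1) @ v_cache

    print(k_cache.shape)  # torch.Size([1, 24, 64]) then torch.Size([1, 25, 64])
```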
This saves both time and memory without touching accuracy: Flash Attention computes exactly the same attention outputs as the standard implementation (up to floating-point rounding), and the key-value cache simply reuses values that would otherwise be recomputed identically. The gains come purely from faster, more memory-efficient execution, which in turn makes much longer input sequences practical.
For auto-regressive decoding, however, the key-value cache itself can become the memory bottleneck, and here Grouped-Query-Attention (GQA) is often a better option than Multi-Query-Attention (MQA) because it preserves more model capacity while still shrinking the cache. Instead of the single key-value projection head used in MQA, GQA uses n < n_head key-value projection heads; this less drastic reduction arguably degrades performance less relative to vanilla multi-head attention. Moreover, existing model checkpoints can be uptrained to a GQA architecture with as little as 5% of the original pre-training compute, which makes it practical to adapt existing LLMs to handle longer input sequences.
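Here is a minimal sketch of the grouped key-value idea; the head counts (32 query heads sharing 8 key-value heads) and other shapes are assumptions for illustration, not taken from any specific model.

```python
import torch
import torch.nn.functional as F

# Sketch of Grouped-Query-Attention: n_kv_head < n_head key-value heads are
# shared across groups of query heads (head counts are illustrative assumptions).
batch, seq_len, d_head = 1, 1024, 64
n_head, n_kv_head = 32, 8                             # 4 query heads share each KV head

q = torch.randn(batch, n_head, seq_len, d_head)
k = torch.randn(batch, n_kv_head, seq_len, d_head)    # KV cache is 4x smaller than MHA
v = torch.randn(batch, n_kv_head, seq_len, d_head)

# Repeat each KV head for its group of query heads, then attend as usual.
k = k.repeat_interleave(n_head // n_kv_head, dim=1)
v = v.repeat_interleave(n_head // n_kv_head, dim=1)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 32, 1024, 64])
```

With 8 instead of 32 key-value heads, the cache that has to be stored during decoding shrinks by 4x, while the query side keeps its full capacity; MQA is the extreme case n_kv_head = 1.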