Transformers pretrained on multimodal data offer a powerful tool for tasks whose inputs span several modalities, such as text paired with images or video. By training a single model to handle multiple modalities, we can improve accuracy, reduce overall computational cost, and generalize better to new tasks. Moreover, fine-tuning a strong pretrained model, supported by solid evaluation and annotation procedures for NLP systems, enables data-centric progress while cutting the compute cost, carbon footprint, time, and resources that training from scratch would require.
Transformers have been pretrained as encoder-only (e.g., BERT), encoder-decoder (e.g., T5), and decoder-only (e.g., the GPT series) models. Pretrained models may be adapted to perform different tasks with model update (e.g., fine-tuning) or without it (e.g., few-shot learning). The scalability of Transformers suggests that better performance benefits from larger models, more training data, and more training compute. Since Transformers were first designed and pretrained for text data, this section leans slightly towards natural language processing. Nonetheless, the models discussed above can often be found in more recent models across multiple modalities. For example, (i) Chinchilla was further extended to Flamingo, a visual language model for few-shot learning; (ii) GPT-2 and the vision Transformer encode text and images in CLIP, whose image and text embeddings were later adopted in the DALL-E 2 text-to-image system. Although there have been no systematic studies of Transformer scalability in multimodal pretraining yet, an all-Transformer text-to-image model called Parti shows the potential of scalability across modalities: a larger Parti is more capable of high-fidelity image generation and content-rich text understanding (Fig. 11.9.12).
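To make the distinction between the two adaptation modes concrete, here is a minimal PyTorch sketch. The names `model`, `data_loader`, and the assumption that the model maps token indices to next-token logits are hypothetical placeholders for illustration, not a specific library's API.

```python
import torch
from torch import nn

def fine_tune(model: nn.Module, data_loader, lr=1e-5, max_steps=100):
    """Adaptation *with* model update: gradients flow into the pretrained weights."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step, (inputs, targets) in enumerate(data_loader):
        if step >= max_steps:
            break
        logits = model(inputs)  # assumed shape: (batch, seq_len, vocab_size)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.shape[-1]), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def few_shot_prompt(examples, query):
    """Adaptation *without* model update: task demonstrations are placed in the
    prompt and the frozen pretrained model simply continues the sequence."""
    demos = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{demos}\nInput: {query}\nOutput:"
```

In the few-shot case the task is specified entirely through the prompt, so no gradient step touches the pretrained parameters.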
As for the limitation of decoder-only Transformers in sequence-to-sequence problems: because causal masking is applied to the entire sequence, including the input prefix, such models cannot attend bidirectionally over an input that is in fact fully available throughout target sequence prediction. Encoder-decoder models, whose encoder uses unmasked self-attention over the input, can therefore capture contextual information in the input sequence more directly. In practice this gap can be mitigated, for example by teacher-student distillation from an encoder-decoder model, or by cross-attention mechanisms that learn relationships across modalities (e.g., between text and images).
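The sketch below (PyTorch; a minimal illustration under the assumptions just stated, not a full model) builds the boolean attention masks implied by the two designs: a decoder-only model applies one causal mask to the concatenated input-and-target sequence, whereas an encoder-decoder model lets the encoder attend bidirectionally over the whole input.

```python
import torch

def decoder_only_mask(src_len, tgt_len):
    """Decoder-only: a single causal mask over the concatenated input + target,
    so even the fully known input tokens cannot attend to later input tokens."""
    total = src_len + tgt_len
    return torch.tril(torch.ones(total, total)).bool()

def encoder_decoder_masks(src_len, tgt_len):
    """Encoder-decoder: bidirectional self-attention over the input, causal
    self-attention over the target, and full cross-attention to the encoder."""
    enc_self = torch.ones(src_len, src_len).bool()              # full
    dec_self = torch.tril(torch.ones(tgt_len, tgt_len)).bool()  # causal
    cross = torch.ones(tgt_len, src_len).bool()                 # full
    return enc_self, dec_self, cross

# With a 4-token input and a 3-token target, the decoder-only mask forbids
# 6 of the 16 input-to-input attention pairs that the encoder would allow.
print(decoder_only_mask(4, 3)[:4, :4])   # lower-triangular input block
print(encoder_decoder_masks(4, 3)[0])    # all-True encoder block
```

Comparing the two printed blocks shows exactly which input-to-input connections the causal mask removes, which is the source of the limitation discussed above.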