First off, let’s break down what we mean by “multimodal” and “chain-of-thought.” Multimodal refers to working with more than one modality, such as text and images, whether on the input side, the output side, or both. Chain-of-thought reasoning is when a model generates intermediate steps between its inputs and outputs, essentially showing us how it arrived at its final answer.
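To make the chain-of-thought idea concrete, here is a minimal sketch of how the same question could be posed with and without a request for intermediate steps. The function name `build_prompt` and the exact cue wording are assumptions made for illustration; nothing here calls a real model.

```python
def build_prompt(question: str, chain_of_thought: bool = False) -> str:
    """Return a plain-text prompt, optionally asking for intermediate steps."""
    prompt = f"Question: {question}\n"
    if chain_of_thought:
        # Assumed cue: ask the model to write out its reasoning before answering.
        prompt += "Let's think step by step, then give the final answer.\n"
    else:
        # Direct prompt: the model is expected to answer straight away.
        prompt += "Answer:\n"
    return prompt


direct = build_prompt("Why is the sky blue?")
stepwise = build_prompt("Why is the sky blue?", chain_of_thought=True)
print(stepwise)
```

The only difference between the two prompts is the last line, which is the point: chain-of-thought is about asking for (and getting) the steps in between, not about changing the question.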
Now, let’s roll with some examples!
Imagine you ask your favorite language model (let’s call her Lila) to explain why the sky is blue. She might respond with something like this: “Well, first of all, we know that light from the sun enters our atmosphere and gets scattered by the gas molecules in the air. These molecules are mostly nitrogen and oxygen, and both are far smaller than the wavelengths of visible light. Particles that small scatter short wavelengths much more strongly than long ones, so blue light gets scattered across the sky far more than red light. This is why the sky appears blue to us.”
Lila’s response shows that she not only understands why the sky is blue but also has a chain-of-thought process for arriving at her answer. She first explains how light enters our atmosphere and gets scattered by molecules, then goes into more detail about why some colors are scattered more than others.
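The physics step in that chain of thought can be checked with a quick calculation. For particles much smaller than the wavelength of light, scattering strength scales as one over the wavelength to the fourth power (Rayleigh scattering). The sketch below compares two wavelengths under that rule; the function name and the representative wavelengths of 450 nm for blue and 700 nm for red are assumptions chosen for illustration.

```python
def rayleigh_relative_scattering(wavelength_nm: float, reference_nm: float) -> float:
    """How much more strongly `wavelength_nm` is scattered than `reference_nm`.

    Uses the Rayleigh rule that scattering strength is proportional to
    1 / wavelength**4 for particles much smaller than the wavelength.
    """
    if wavelength_nm <= 0 or reference_nm <= 0:
        raise ValueError("wavelengths must be positive")
    return (reference_nm / wavelength_nm) ** 4


BLUE_NM = 450.0  # assumed representative wavelength for blue light
RED_NM = 700.0   # assumed representative wavelength for red light

ratio = rayleigh_relative_scattering(BLUE_NM, RED_NM)
print(f"Blue is scattered about {ratio:.1f}x more strongly than red")
```

With these assumed wavelengths, blue comes out at roughly six times the scattering strength of red, which is the intermediate fact a good chain of thought should surface.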
But what if Lila’s response wasn’t quite as clear? Maybe she just said something like “The sky is blue because it has to do with the way light interacts with particles in our atmosphere.” While this answer might be technically correct, it doesn’t really help us understand why the sky appears blue.
That’s where multimodal chain-of-thought reasoning comes in! By working with multiple modalities (like text and images) rather than text alone, we can get a more detailed explanation of complex concepts like this one. For example, Lila might show us an image of the Earth’s atmosphere with different colors representing how much light is scattered at each altitude. She could then explain that blue light gets scattered more than other colors because it has a shorter wavelength, and the tiny air molecules it bounces off scatter short wavelengths most strongly.
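One way to picture a multimodal chain of thought is as a list of reasoning steps, where any step can carry an image alongside its text. The sketch below is purely illustrative: the class names `ReasoningStep` and `MultimodalAnswer`, the `render` helper, and the image file name are all hypothetical, and no real model or image is involved.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReasoningStep:
    """One intermediate step: some text, plus an optional image reference."""
    text: str
    image_path: Optional[str] = None  # hypothetical file name, not a real asset


@dataclass
class MultimodalAnswer:
    """A chain of steps followed by the final answer."""
    steps: List[ReasoningStep] = field(default_factory=list)
    final_answer: str = ""


def render(answer: MultimodalAnswer) -> str:
    """Flatten the chain into text, marking where each image would appear."""
    lines = []
    for number, step in enumerate(answer.steps, start=1):
        lines.append(f"Step {number}: {step.text}")
        if step.image_path is not None:
            lines.append(f"  [image: {step.image_path}]")
    lines.append(f"Answer: {answer.final_answer}")
    return "\n".join(lines)


lila = MultimodalAnswer(
    steps=[
        ReasoningStep(
            "Sunlight is scattered by molecules in the atmosphere.",
            image_path="atmosphere_scattering.png",
        ),
        ReasoningStep("Shorter wavelengths are scattered more strongly."),
    ],
    final_answer="The sky looks blue because blue light is scattered the most.",
)
rendered = render(lila)
print(rendered)
```

Keeping the image attached to the step it supports, rather than to the answer as a whole, mirrors how Lila would show the picture first and then explain what it means.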