Imagine you’re shown two images that are similar but not exactly the same (like a dog and its reflection in a mirror), and your job is to figure out whether they’re two versions of the same picture or two completely different pictures.
But instead of doing this by hand, we’re using computers to do it for us! And instead of just looking at one pair of images, we’re comparing lots of different pairs: dogs and their reflections, cats and their reflections, and so on.
The idea is that by training our computer to handle these “similar-but-not-exactly-the-same” pairs, it’s forced to learn some really useful features about images. For example, if the computer keeps getting confused between dogs and cats (because they do look pretty similar sometimes), then it hasn’t yet learned features that distinguish them from each other.
So how do we actually train our computer to play this game of “spot the difference”? First, we need to create pairs of images that are similar but not exactly the same. We can do this by taking an original image and applying some random transformations to it (like flipping, rotating, cropping, or blurring), which gives us a new “augmented” version of the image.
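To make that concrete, here’s a minimal sketch of what such an augmentation pipeline might look like using torchvision (the specific transforms and parameter values are illustrative choices, not the only possible ones):

```python
import torchvision.transforms as T
from PIL import Image

# Each call applies a *random* combination of transformations, so the
# same original image produces a different-looking view every time.
augment = T.Compose([
    T.RandomResizedCrop(224),           # random crop, resized to 224x224
    T.RandomHorizontalFlip(p=0.5),      # the "mirror reflection"
    T.ColorJitter(0.4, 0.4, 0.4, 0.1),  # random brightness/contrast/saturation/hue
    T.GaussianBlur(kernel_size=9),      # random blur
    T.ToTensor(),
])

original = Image.open("dog.jpg").convert("RGB")  # hypothetical file
view_a = augment(original)  # two independently augmented views
view_b = augment(original)  # of the same underlying image
```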
Then we feed these pairs of images into our computer. The nice part is that we don’t need humans to label anything: since we created the augmented versions ourselves, we already know which images belong together. The computer’s job is to figure out which pairs are two views of the same original image and which pairs come from different images.
But here’s where things get a little tricky: instead of just comparing the two images directly, we actually compare their “embeddings”. An embedding is like a fancy way of representing an image as a set of numbers (called features) that capture some important information about it. For example, one feature might be how much red there is in the image, while another feature might be how many edges there are.
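To sketch what this looks like in code: an embedding is just the vector a neural network spits out for an image. Here’s one illustrative way to build such an encoder from a standard ResNet backbone (the 128-dimensional embedding size is an arbitrary choice):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """Maps an image tensor to an embedding vector."""
    def __init__(self, embedding_dim: int = 128):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()   # drop the classification head
        self.backbone = backbone      # now outputs a 512-dim feature vector
        self.projection = nn.Linear(512, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.projection(self.backbone(x))

encoder = Encoder()
images = torch.randn(4, 3, 224, 224)  # a stand-in batch of 4 RGB images
embeddings = encoder(images)          # shape: (4, 128)
```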
So instead of comparing two images pixel by pixel, we’re actually comparing their embeddings to see whether they’re similar or not. Training then pulls the embeddings of matching pairs (a dog and its reflection) closer together and pushes the embeddings of non-matching pairs (a dog and somebody else’s cat) further apart, and it’s exactly this push-and-pull that forces the computer to learn useful features.
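“Similar” here usually means cosine similarity between the embedding vectors, and the classic objective built on top of it is the InfoNCE loss (the one SimCLR uses, give or take some details). Here’s a simplified sketch; the temperature value is arbitrary, and the full SimCLR version also symmetrizes the loss and uses within-view negatives:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_a, z_b, temperature=0.5):
    """Simplified contrastive loss over a batch.

    z_a[i] and z_b[i] are embeddings of two views of image i; every
    other image in the batch serves as a negative example.
    """
    z_a = F.normalize(z_a, dim=1)  # unit-length vectors, so the dot
    z_b = F.normalize(z_b, dim=1)  # product below is cosine similarity
    logits = z_a @ z_b.T / temperature   # (N, N) similarity matrix
    targets = torch.arange(z_a.size(0))  # row i should match column i
    return F.cross_entropy(logits, targets)

z_a = torch.randn(8, 128)  # embeddings of view A for 8 images
z_b = torch.randn(8, 128)  # embeddings of view B of the *same* 8 images
print(info_nce_loss(z_a, z_b))
```

Minimizing this loss makes each embedding most similar to its own partner and less similar to everyone else’s, which is exactly the push-and-pull described above.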
And that’s basically how contrastive learning works! It’s like playing “spot the difference” with computers and embeddings instead of humans and images, and the features learned along the way can be used to improve all sorts of other tasks (like image classification or object detection).
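For instance, a common way to reuse the learned features is a “linear probe”: freeze the pretrained encoder and train only a small linear classifier on top of its embeddings. A sketch, reusing the `Encoder` from above (the class count and learning rate are arbitrary):

```python
import torch
import torch.nn as nn

# Freeze the contrastively pretrained encoder from the earlier sketch.
for p in encoder.parameters():
    p.requires_grad = False

probe = nn.Linear(128, 10)  # e.g. 10 classes in a small labeled dataset
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 224, 224)   # stand-in labeled batch
labels = torch.randint(0, 10, (4,))

with torch.no_grad():          # the encoder stays fixed
    features = encoder(images)
optimizer.zero_grad()
loss = criterion(probe(features), labels)
loss.backward()
optimizer.step()
```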
That’s contrastive learning in a nutshell.