Well, in this case, the authors are combining two different techniques: transformers and convolutions. Transformers have recently been popular for tasks like language processing and image classification, but they’re not typically used for semantic segmentation (which is essentially labeling each pixel with a category). Convolutional neural networks (CNNs), on the other hand, are well suited to this task because they learn to extract local features from images.
So what do these authors propose? They want to combine the best of both worlds: use transformers for their ability to handle long-range dependencies and context, but also incorporate convolutions to make sure no important details are lost in the segmentation results. This is where the “bilateral awareness” part comes in: by adding a bilateral filter (which smooths out noise while preserving edges) to both the transformer and CNN layers, they can ensure that the model doesn’t over-smooth or under-segment certain areas of the image.
In terms of how it works in practice, let me give you an example: imagine we have a high-resolution urban scene image with lots of buildings, cars, and people. We want to segment this image into different categories like “building”, “car”, or “person”. Instead of using a traditional CNN that would apply the same set of filters at every location in the image (which can lead to blurring), we’re going to use our transformer-CNN hybrid model.
First, we feed this high-resolution input into the transformer layer, which learns to extract features from different parts of the image and passes them along to the next stage. These features are then fed through a series of convolutional layers that help the model better distinguish between categories (like “building” vs. “car”).
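To make that flow concrete, here’s a toy PyTorch sketch of the transformer-then-convolution pipeline. To be clear, this is not the paper’s actual architecture: `HybridSegmenter`, its layer sizes, and the patch size are all invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridSegmenter(nn.Module):
    """Toy transformer-plus-CNN segmenter (illustrative, not the paper's model)."""
    def __init__(self, in_ch=3, dim=64, num_classes=3, patch=8):
        super().__init__()
        # Patch embedding: a strided conv turns the image into a grid of tokens.
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        # One transformer encoder layer models long-range dependencies between patches.
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        # Convolutional refinement recovers local detail before per-pixel classification.
        self.refine = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, num_classes, kernel_size=1),
        )

    def forward(self, x):
        b, _, h, w = x.shape
        feats = self.embed(x)                      # (B, dim, H/p, W/p)
        gh, gw = feats.shape[2], feats.shape[3]
        tokens = feats.flatten(2).transpose(1, 2)  # (B, N, dim) token sequence
        tokens = self.encoder(tokens)              # global context via self-attention
        feats = tokens.transpose(1, 2).reshape(b, -1, gh, gw)
        # Upsample back to full resolution, then sharpen with convolutions.
        feats = F.interpolate(feats, size=(h, w), mode="bilinear", align_corners=False)
        return self.refine(feats)                  # (B, num_classes, H, W) logits

# Usage: a fake 3-class urban scene batch.
model = HybridSegmenter(num_classes=3)
logits = model(torch.randn(1, 3, 128, 128))
labels = logits.argmax(dim=1)  # per-pixel category ids
```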
But here’s where things get interesting: instead of just applying these filters uniformly at every location in the image, we use a bilateral filter to smooth out noise while preserving edges. This means that if two pixels are close together and have similar features (like a building and another part of the same building), they are more likely to receive the same “building” label than if they were far apart or had different features.
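And here’s what a bilateral filter actually does, as a naive NumPy sketch on a single-channel image. The function name and parameter values are mine, not the paper’s; in practice you would reach for an optimized routine like cv2.bilateralFilter, and per the idea described above the filtering would presumably operate on feature maps rather than raw pixels.

```python
import numpy as np

def bilateral_filter(img, radius=3, sigma_space=2.0, sigma_range=0.1):
    """Naive bilateral filter on a single-channel float image in [0, 1]."""
    h, w = img.shape
    out = np.zeros_like(img)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    # Spatial kernel: weight falls off with distance from the center pixel.
    spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_space**2))
    padded = np.pad(img, radius, mode="reflect")
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # Range kernel: neighbors with similar intensity get higher weight.
            rng = np.exp(-((patch - img[i, j]) ** 2) / (2 * sigma_range**2))
            weights = spatial * rng
            out[i, j] = (weights * patch).sum() / weights.sum()
    return out
```

The key design point is the product of the two Gaussians: the spatial kernel alone would be ordinary blurring, but multiplying by the range kernel shuts the smoothing off wherever intensities differ sharply, which is exactly how edges survive.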
So, in short: the transformers supply long-range context, the convolutions preserve fine detail, and the bilateral filtering on both paths keeps the model from over-smoothing or under-segmenting parts of the image.