Computer Vision News - January 2024

Training-Free Grounding Paper

Walid Bousselham is a PhD student at the University of Bonn under the supervision of Hilde Kuehne. He speaks to us about his new paper, which proposes a Grounding Everything Module (GEM) that adapts vision-language models not only to identify objects within an image but also to localize them.

Vision-language models such as CLIP and BLIP have proven powerful at zero-shot classification, excelling at recognizing objects within images. However, when it comes to localizing those objects, they fall short. GEM's primary objective is to extend these models to localization without disrupting the extensive vocabulary they have learned through pretraining on millions of images.

"One of the ways to do that is not to retrain the model but just to adapt the forward pass so that we're able to do localization without perturbing the weights," Walid explains. "We introduced a module called self-self attention that essentially enables the clustering of the internal features of CLIP so that we can query the features later with text."

GEM proves particularly useful for the popular computer vision task of semantic segmentation. Unlike classical methods restricted to a fixed set of localizable classes, GEM builds on CLIP and can segment and localize any object described in text. The technology could also extend to robotics, with robots precisely locating specific objects based on textual prompts. This versatility opens new possibilities for industries that rely on object localization.

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
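To make the quoted idea concrete, here is a minimal NumPy sketch of the two ingredients described in the interview: a self-self attention step, in which each projection attends to itself (q·qᵀ, k·kᵀ, v·vᵀ) rather than the usual q·kᵀ pattern so that similar patch features cluster together, and a text query that scores each patch against a text embedding. This is an illustrative simplification, not GEM's exact formulation; the function names and the averaging of the three branches are our own assumptions for the sketch, and the model weights are never modified.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_self_attention(q, k, v):
    """One self-self attention step over patch features.

    Each projection (q, k, v) attends to itself (q·qᵀ, k·kᵀ, v·vᵀ)
    instead of the usual cross q·kᵀ pattern, so similar patches
    reinforce each other and cluster. The three branches are averaged
    here for simplicity (an assumption of this sketch)."""
    d = q.shape[-1]
    out = np.zeros_like(v)
    for p in (q, k, v):
        attn = softmax(p @ p.T / np.sqrt(d))  # (n_patches, n_patches)
        out += attn @ v
    return out / 3.0

def text_query(patch_feats, text_emb):
    """Cosine similarity of each patch feature with a text embedding,
    yielding a per-patch localization score."""
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb)
    return p @ t  # (n_patches,) scores in [-1, 1]
```

In practice, the per-patch scores would be reshaped to the vision transformer's patch grid and upsampled to image resolution to obtain a segmentation heatmap for the queried text.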