Computer Vision News - December 2025

The model then generates new images that follow the demonstrated patterns, making image generation more intuitive and faithful to the intended target.

Another limitation she tackled concerns contrastive vision-language models like CLIP, whose training context window is limited to only 77 tokens of text. This becomes a bottleneck when working with long captions. Her proposed approach, TULIP, augments CLIP with relative positional encodings and distills knowledge from the original text encoder. This improves performance in long-caption retrieval, image generation, and any multimodal task that benefits from richer textual inputs; a minimal sketch of these two ingredients follows below.

Her final chapter turns to the generation of long captions, i.e., paragraphs, by considering the inherent diversity of data. This challenge is tackled in the context of radiology report generation, as these reports often reflect uncertainty and variability among experts. Her proposed Variational Topic Inference framework models this diversity by capturing sentence-level topics, leading to the generation of reports that are coherent and better aligned with the images.

Across all chapters, her PhD work shows that visual and textual context can meaningfully improve multimodal foundation models. As Ivona moves into her postdoctoral work, she aims to build models that leverage context while behaving reliably in real-world applications. More information about her work and publications is available here.
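To make the TULIP description above concrete, here is a minimal, illustrative PyTorch sketch of the two ingredients named in the article: a text encoder with relative positional encodings that is not bound to a fixed 77-token window, and a distillation loss pulling it toward a frozen short-context teacher. All class names, dimensions, and the cosine distillation objective are assumptions for illustration, not the actual TULIP implementation.

```python
# Illustrative sketch only: a long-context student text encoder with relative
# positional bias, distilled from a frozen 77-token CLIP-style teacher.
# Names, sizes, and the cosine loss are assumptions, not the TULIP code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelPosSelfAttention(nn.Module):
    """Single-head self-attention with a learned relative positional bias,
    so the encoder is not tied to one fixed absolute context length."""
    def __init__(self, dim, max_rel_dist=256):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.rel_bias = nn.Embedding(2 * max_rel_dist + 1, 1)  # one bias per distance
        self.max_rel_dist = max_rel_dist
        self.scale = dim ** -0.5

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale              # (B, T, T)
        pos = torch.arange(T, device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist,
                                                  self.max_rel_dist)
        attn = attn + self.rel_bias(rel + self.max_rel_dist).squeeze(-1)
        return self.proj(attn.softmax(dim=-1) @ v)


class LongTextEncoder(nn.Module):
    """Student text encoder: token embeddings plus relative-position attention,
    mean-pooled into a single caption embedding."""
    def __init__(self, vocab_size=49408, dim=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.attn = RelPosSelfAttention(dim)
        self.norm = nn.LayerNorm(dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, tokens):                                     # (B, T) int64
        h = self.tok(tokens)
        h = self.norm(h + self.attn(h))
        return self.out(h.mean(dim=1))                             # (B, dim)


def distillation_step(student, teacher_embed, long_tokens, optimizer):
    """One training step: the student reads the full long caption, while the
    frozen teacher only sees the first 77 tokens; the student embedding is
    pulled toward the teacher's via a cosine distillation loss."""
    student_emb = F.normalize(student(long_tokens), dim=-1)
    with torch.no_grad():
        teacher_emb = F.normalize(teacher_embed(long_tokens[:, :77]), dim=-1)
    loss = (1.0 - (student_emb * teacher_emb).sum(dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice such a distillation term would be combined with a standard contrastive image-text objective on long captions, so the student keeps the original CLIP alignment while gaining the longer context window.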
