WACV 2025 Daily - Monday

“…and reading these captions, does it still look unique?” Some of the more interesting vision work in this paper was really vision-language work, since much of it happened on the language side. They could take this synthetic data and train the video-language model itself on it to achieve better performance. Because of how the model was designed, they could keep the existing losses of the video-language model and simply sample from the synthetic text data to improve the alignment between the videos and the new text. Another interesting result is that this also improved retrieval on the original, non-synthetic paragraph text. This makes it possible to improve the way …

Figure 1. In real-world text-to-video retrieval, users could use diverse queries. Standard long-video datasets use only paragraph-style captions (“Existing”, “Full paragraph”), which does not allow for training or evaluation on a representative set of long-video descriptions. Practical applications also require the ability to handle complex, short, and partial descriptions of a long video. In this work, we introduce an approach to generate, evaluate, and train on such diverse video description data.
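The article does not show the training code, but as a rough sketch, the recipe described above (reuse the model's existing loss, sample synthetic captions per video) might look like the following, assuming a CLIP-style model with separate video and text encoders trained with a symmetric contrastive (InfoNCE) loss. `model.encode_video`, `model.encode_text`, and `synthetic_captions` are hypothetical names for illustration, not the authors' actual API.

```python
# Minimal sketch (not the authors' code): fine-tuning a CLIP-style
# video-language model on synthetic captions, reusing its existing
# symmetric contrastive loss. The model interface is a placeholder.
import random
import torch
import torch.nn.functional as F

def contrastive_step(model, videos, synthetic_captions, optimizer,
                     temperature=0.07):
    """One training step: pair each video with a randomly sampled
    synthetic caption and apply a symmetric InfoNCE loss."""
    # Each video may have many synthetic captions; sample one per step.
    texts = [random.choice(caps) for caps in synthetic_captions]

    video_emb = F.normalize(model.encode_video(videos), dim=-1)
    text_emb = F.normalize(model.encode_text(texts), dim=-1)

    # Cosine-similarity logits between all video/text pairs in the batch;
    # the matched pair sits on the diagonal.
    logits = video_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: video-to-text and text-to-video retrieval.
    loss = (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that in this setup only the training text changes; the encoders and the loss stay exactly as in the base model, which matches the article's point that the existing losses could be reused while sampling from the new synthetic descriptions.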
