WACV 2025 Daily - Monday

that these models can reason about videos and the types of captions that they were already familiar with.

With respect to ensembles, you could take a model zero-shot and use the different synthetic captions as different attempts to query the same video, instead of relying on a single paragraph caption. You could use the model to generate lots of captions, use all of those as queries, and take the median result as the retrieved video. That was often a more reliable way to find the video in question than relying on the original caption alone. In Matt’s words, the term for this is query expansion to get better multimodal alignment.

Matt is most proud that the team was able to put out work that hopefully contributes to a paradigm change in how we look at and treat long video, especially by paying attention to the unintentional biases that creep into the way current data collection processes treat the relationship between video and text.

“One thing that my work points to a little bit,” Matt concludes, “is just a level of controllability and reliability that I think is just very important for folks to understand as we use more and more synthetic data in training. I craft my prompts very carefully. I use humans to verify that those prompts are effective in preventing hallucinations. And I think that those are very, very important things for folks to keep in mind as we use more and more synthetic data, both for training and evaluation.”

Talk with Matt at Poster Session 4 today (Monday) from 11:15 to 13:00.

Figure 2. We can train with our synthetic data to boost performance. For contrastive finetuning for retrieval with video-caption pairs, we propose mixing our 10k text captions with ground truth captions. We compute standard contrastive loss, but each caption is sampled randomly from the 10k captions for a given video, according to a mixing ratio, $\eta$. This sort of finetuning boosts performance both on 10k text evaluation data and on the original evaluation data.

A Video is Worth 10,000 Words
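For readers who want to picture the mixing scheme in the Figure 2 caption, here is a minimal sketch in plain Python. It is not the team's code. It assumes that $\eta$ is the probability of swapping a video's ground truth caption for one of its synthetic captions at each training step, and that "standard contrastive loss" means a symmetric InfoNCE loss over a batch similarity matrix. The function names and the temperature value are hypothetical, chosen only for illustration.

```python
import math
import random


def sample_caption(ground_truth, synthetic, eta, rng):
    """Pick the caption paired with one video for this training step.

    Assumption: with probability eta we draw uniformly from the video's
    synthetic captions; otherwise we keep the ground truth caption.
    """
    if synthetic and rng.random() < eta:
        return rng.choice(synthetic)
    return ground_truth


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity between video i and caption j; matching
    pairs sit on the diagonal. The temperature is an assumed value.
    """
    n = len(sim)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            peak = max(logits)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
            total += log_z - logits[i]
        return total / n

    columns = [list(col) for col in zip(*sim)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy(sim) + cross_entropy(columns))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: each video has one ground truth and two synthetic captions.
    videos = [
        {"gt": "original caption A", "syn": ["short A", "simplified A"]},
        {"gt": "original caption B", "syn": ["short B", "simplified B"]},
    ]
    batch = [sample_caption(v["gt"], v["syn"], eta=0.5, rng=rng) for v in videos]
    print(batch)
    # A well-aligned similarity matrix should give a low loss.
    print(contrastive_loss([[1.0, 0.0], [0.0, 1.0]]))
```

In a real setup the similarity matrix would come from the video and text encoders being finetuned. Setting `eta=0.0` falls back to ordinary training on ground truth captions, and `eta=1.0` trains on synthetic captions only.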
