ICCV Daily 2021 - Tuesday

MDETR – Modulated Detection for End-to-End Multi-Modal Understanding

Aishwarya Kamath is a PhD student at New York University, under the supervision of Yann LeCun and Kyunghyun Cho. Her work, which has been accepted for an oral presentation, proposes a novel multi-modal approach to understanding images and text. She speaks to us ahead of her live Q&A session today.

The main idea of this work is that when we want to understand what is in an image, we should not be bound by what an object detector can find. In previous work, most object detectors are trained on a fixed label set. If there are 1,600 classes the object detector can find, the downstream multi-modal understanding reasons over those detected objects to answer open-ended questions like "what is on the table?" If a class is not in the detector's fixed label set, the detector cannot find it, and the downstream reasoning fails.
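The fixed-vocabulary failure mode described above can be sketched as a toy contrast. Everything here is a hypothetical illustration, not MDETR's actual method or API: the label set, the scene representation, and the word-matching "text-conditioned" detector are all invented for demonstration.

```python
# Assumed toy vocabulary for a classical, fixed-label-set detector.
FIXED_LABEL_SET = {"table", "chair", "cup"}

def fixed_label_detect(scene_objects):
    """A fixed-vocabulary detector can only report objects from its label set."""
    return [obj for obj in scene_objects if obj in FIXED_LABEL_SET]

def text_conditioned_detect(scene_objects, query):
    """Crude stand-in for text-modulated detection: detections are driven by
    the free-form query rather than a closed vocabulary (here, by naive
    substring matching -- purely illustrative)."""
    return [obj for obj in scene_objects if obj in query]

scene = ["table", "ukulele"]  # "ukulele" is outside the fixed label set
question = "what is the ukulele doing on the table?"

print(fixed_label_detect(scene))                    # the ukulele is missed
print(text_conditioned_detect(scene, question))     # both objects are grounded
```

The point of the contrast: any reasoning built on top of `fixed_label_detect` can never answer a question about the ukulele, because the object was dropped before reasoning even began, whereas conditioning detection on the query keeps open-ended classes in play.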