ICCV Daily 2021 - Tuesday

“To solve this, we propose a new method called modulated detection,” Aishwarya Kamath tells us. “The idea is that you can find anything in an image if you can describe it in text. In the paper, we have the example of a pink elephant, which is not something you ever see in the real world. The model can leverage the compositionality of language and is trained, end to end, to detect whatever you mention in the text. It can put together the fact that it knows what pink is and it knows what an elephant is to find things like a pink elephant. It can find many more things than would be possible with classification over a fixed label set.”

The end-to-end part is important because having the object detector in the loop makes detection a core part of the model: its features are trained in conjunction with the downstream tasks. Without this, the model could carry many features that are irrelevant to the task and would have to learn to ignore them. This paper proposes giving the model exactly what it needs. By detecting only the objects that are relevant to the query, it performs much better.

“The biggest takeaway for the computer vision community is that we push for no more fixed class labels and instead just use free-form language,” Aishwarya advises. “This is applicable to any pure computer vision task.”

One challenge for Aishwarya and her colleagues stems from the sheer size of these models: iteration cycles are few because each training run takes such a long time. “I think that was the most stressful part, because the results in the paper all depend on one large run of the model, which takes more than a week!” she reveals. “Everything has to be perfect before that week because you only have one shot at it. It takes a lot of compute, and a lot of money goes into it.”
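The compositional idea behind detecting "whatever you speak about in the text" can be illustrated with a small toy sketch. This is not the paper's actual model: the one-hot concept vectors, the hand-built region features, and the `detect` helper are all invented for illustration. It only shows how a query vector composed from known words ("pink", "elephant") can select a region depicting a combination never seen as a fixed class label:

```python
# Toy illustration of text-conditioned detection (NOT the paper's code).
# Each image region is scored against a free-form text query, so a novel
# composition like "pink elephant" can be found by combining concept
# vectors the model already knows.
import numpy as np

# Hypothetical embeddings for individual concepts (one-hot for clarity).
word_vecs = {
    "pink":     np.array([1.0, 0.0, 0.0, 0.0]),
    "gray":     np.array([0.0, 1.0, 0.0, 0.0]),
    "elephant": np.array([0.0, 0.0, 1.0, 0.0]),
    "car":      np.array([0.0, 0.0, 0.0, 1.0]),
}

def embed_query(text: str) -> np.ndarray:
    """Compose a query vector by mean-pooling word embeddings."""
    return np.mean([word_vecs[w] for w in text.split()], axis=0)

# Hypothetical region features: each region's vector averages the
# concepts it depicts (a stand-in for a learned image encoder).
regions = {
    "region_0": (word_vecs["gray"] + word_vecs["elephant"]) / 2,
    "region_1": (word_vecs["pink"] + word_vecs["elephant"]) / 2,
    "region_2": (word_vecs["pink"] + word_vecs["car"]) / 2,
}

def detect(text: str) -> str:
    """Return the region whose features best match the text query."""
    q = embed_query(text)
    scores = {name: float(q @ feat) for name, feat in regions.items()}
    return max(scores, key=scores.get)

print(detect("pink elephant"))  # -> region_1
```

In the real system the alignment between text and regions is learned end to end rather than hand-built, which is exactly why training takes a single long, expensive run, as described below.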
