9 DAILY CVPR Friday

boundaries, and make the final image look natural. At the same time, you have to repose the object to fit some kind of interaction. If users describe the interaction they want through text, the model needs to follow that text while also preserving the identity of the objects, which is a very hard task. It took a while, and it wasn’t easy to find a solution, but it is a brilliant one. Gemma found that a big part of it lay in the training data. She found different ways of obtaining this data: she combined four different sources and four different ways of obtaining the data, using segmentation models, VLMs, and grounding models. That made it possible to combine various sorts of data that excelled in separate aspects, offering different kinds of text prompts and different kinds of interactions. This, in turn, allowed the model to learn a more complete set of interactions and reposings.

Gemma Canet Tarrés