Moreover, how do you balance the text and the visual information? For that, Gemma added customization as an auxiliary task: during training, sometimes you don't provide the background, and the model has to generate it on its own. In those steps, the model can focus purely on balancing textual and visual information, which makes the final result more balanced.

"Honestly, this customization-as-an-auxiliary-task part was kind of by chance," Gemma admits. "We wanted to see if we could get our model to do both tasks: object compositing, but also customization. And we found that adding it actually helped the other task. Then we thought about it and realized that, yes, it actually makes sense, because it is an easier task and they are complementary!"

The model is based on Stable Diffusion 1.5, which is quite an old version now, but since the team were combining so many things, they decided to test their pipeline on a smaller model, even though it is not small. It can now be adapted to a bigger model that would provide better image quality. Still, the baseline is essentially a UNet-based diffusion model.

We asked Gemma for her thoughts on discovering that her paper had been accepted as a highlight. She candidly admits that she did not expect it, though she is very happy about it.

"I think people like the fact that it's a new task that we're doing," Gemma suggests. "No one had done multiple object compositing at the same time. And also, we're able to do compositing and customization…"
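For readers curious how the auxiliary-task trick described above might look in practice, here is a minimal sketch of background-conditioning dropout in a diffusion training step, assuming a PyTorch-style conditional UNet. All names, shapes, and the drop probability are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

BACKGROUND_DROP_PROB = 0.3  # assumed hyperparameter, not from the paper

def training_step(unet, noisy_latents, noise, timesteps,
                  text_emb, object_emb, bg_latents):
    """One hypothetical denoising step with background dropout.

    `unet` is assumed to be a conditional noise predictor (e.g. the
    Stable Diffusion 1.5 UNet); all argument names are illustrative.
    """
    # With some probability, withhold the background so the model must
    # generate it on its own -- customization as an auxiliary task.
    if torch.rand(()) < BACKGROUND_DROP_PROB:
        bg_latents = torch.zeros_like(bg_latents)

    # Condition on text and object references jointly; the (possibly
    # blanked) background is concatenated along the channel dimension.
    model_input = torch.cat([noisy_latents, bg_latents], dim=1)
    cond = torch.cat([text_emb, object_emb], dim=1)
    noise_pred = unet(model_input, timesteps, encoder_hidden_states=cond)

    # Standard diffusion objective: predict the noise that was added.
    return F.mse_loss(noise_pred, noise)
```

The pattern mirrors classifier-free guidance training, where conditioning is likewise dropped at random so the model learns to operate both with and without it.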