Oral & Award Candidate

LaRender: Training-Free Occlusion Control in Image Generation via Latent Rendering

Xiaohang Zhan is a Senior Research Scientist at Adobe and previously worked at Tencent. His paper, which has been shortlisted as a candidate for a Best Paper Award, introduces a new approach to controlling spatial relationships between objects in generated images. Ahead of his oral and poster presentations, Xiaohang tells us how the idea came about and what makes it different.

When Xiaohang began exploring how to control occlusion in image generation – deciding which object appears in front of another – he quickly identified the limits of current diffusion models. “Occlusion is a spatial relationship of objects rather than a semantic one,” he explains. “It’s not something that a text prompt can easily control.”

LaRender proposes a method for generating images with precise occlusion relationships, with no retraining or fine-tuning of the model required. “We designed a very novel method using the principle of 3D rendering to generate the image in latent space,” Xiaohang tells us. “We use rendering to let the model understand the spatial relationship of objects. In this way, we don’t introduce any extra parameters or training modules, so the whole framework is training-free. We observed very good quality and very accurate control of occlusion.”

The idea grew from a reluctance to rely on traditional data-driven methods. “When we consider controlling something in a model, we need to collect paired data,” he notes. “We manually annotate the relationships and use this paired data to fine-tune the model. That’s the typical way, but I think it’s a little bit boring. I wanted to find a way to perform this without any annotation, without any paired data, without tuning – and that’s hard.”
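The interview does not go into the mechanics, but one standard way to read “the principle of 3D rendering” applied in latent space is front-to-back alpha compositing, as used in volume rendering. The sketch below is purely illustrative and not taken from the paper: the function name latent_render and its inputs (per-object latent feature maps and opacity masks) are hypothetical, chosen only to show how a rendering-style accumulation rule can impose an occlusion order without any training.

    import torch

    def latent_render(object_latents, opacities):
        """Composite per-object latents front-to-back with a
        volume-rendering-style accumulation rule (illustrative sketch,
        not the paper's actual algorithm).

        object_latents: list of (C, H, W) tensors, ordered front
                        (index 0) to back -- the desired occlusion order.
        opacities:      list of (1, H, W) tensors in [0, 1], one per
                        object (e.g. derived from its layout mask).
        """
        rendered = torch.zeros_like(object_latents[0])
        # Transmittance: how much "light" is still unabsorbed at each pixel.
        transmittance = torch.ones_like(opacities[0])
        for latent, alpha in zip(object_latents, opacities):
            # Objects later in the list (farther back) contribute only
            # where nothing in front of them has absorbed the signal.
            rendered = rendered + transmittance * alpha * latent
            transmittance = transmittance * (1.0 - alpha)
        return rendered

Because the transmittance can only decrease as the loop moves back through the scene, objects earlier in the list always dominate wherever their opacity is high; this is precisely the kind of front-versus-behind relationship that, as Xiaohang notes, a text prompt cannot easily express.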