It was also a gradual and highly iterative process. The research was conducted largely in his spare time, while his primary work focused on multimodal understanding. “I just find this topic very interesting,” he says with a smile. “I was focused on it with one of my interns, but neither of us had a lot of time. That’s one of the reasons why it’s training-free – we didn’t have time!”

Early prototypes failed completely. “At first, we wanted to make a full 3D rendering inside the diffusion model,” he recalls. “But it’s really difficult to estimate a 3D shape inside a representation.” He kept simplifying – switching to an orthographic camera and using 2D latent features instead of 3D shapes, while keeping the 3D layout and spatial relationships. At last, it worked.

That experience taught Xiaohang a valuable lesson. “Sometimes you need to simplify,” he reflects. “Your first idea might seem appealing, but if it doesn’t work, you need to find the best trade-off among your ideas. You might need to sacrifice something and simplify it.”

LaRender adapts principles of volumetric rendering to the “latent” level of a diffusion model, allowing the system to combine features according to physical rules of occlusion and transmittance. The research shows that it outperforms text-to-image and layout-to-image methods on occlusion-related benchmarks.
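To give a flavour of what “combining features according to occlusion and transmittance” means, here is a minimal sketch of classical front-to-back volumetric compositing applied to 2D latent feature maps. It is not the paper’s implementation; the function and variable names (`composite_latents`, `opacities`) are illustrative, and per-object opacity maps with a fixed nearest-to-farthest ordering are assumed.

```python
import torch

def composite_latents(latents, opacities):
    """Blend per-object latent feature maps front-to-back using
    volumetric-rendering-style transmittance weights.

    latents:   (N, C, H, W) tensor, one latent map per object,
               ordered from nearest to farthest (an assumption here).
    opacities: (N, 1, H, W) tensor in [0, 1], how strongly each
               object occludes whatever lies behind it.
    """
    transmittance = torch.ones_like(opacities[0])  # light not yet blocked
    blended = torch.zeros_like(latents[0])
    for latent, alpha in zip(latents, opacities):
        weight = transmittance * alpha                 # visible contribution of this object
        blended = blended + weight * latent            # accumulate its latent features
        transmittance = transmittance * (1.0 - alpha)  # visibility left for objects behind
    return blended

# Toy example: two overlapping objects in a 64x64, 4-channel latent space.
front = torch.randn(1, 4, 64, 64)
back = torch.randn(1, 4, 64, 64)
alphas = torch.stack([torch.full((1, 64, 64), 0.8),
                      torch.full((1, 64, 64), 0.6)])
mixed = composite_latents(torch.cat([front, back]), alphas)
```

Because the objects are processed front to back, each one only contributes where the accumulated transmittance is still high, so nearer objects naturally attenuate the features of those behind them – the occlusion behaviour the method exploits at the latent level.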