already observed follow-up works, including AnySplat, which utilizes VGGT’s feature backbone to enable feed-forward Gaussian parameter prediction for novel view synthesis, and Spatial-MLLM, which combines VGGT’s backbone with other large vision models to establish a unified foundation model for 3D perception.

“In the future, we could see further trials on 4D tasks,” he envisions. “As we go from 2D to 3D, I think in probably two or three years, we’ll have something good in 4D. In 4D, people dance, run, and many scenes are dynamic!”

In conclusion, while Jianyuan’s model represents a significant step forward, he emphasizes that data-driven 3D vision is just the beginning. “As Rich Sutton said in 2019, general approaches that leverage computation will ultimately prove to be the most effective,” he reflects. “This ‘Bitter Lesson’ has attracted great attention in the 2D and NLP communities, and we believe it’s true for 3D as well. Feed-forward models will be the future of 3D vision.”

To learn more about Jianyuan’s work, visit Oral Session 2A: 3D Computer Vision (Karl Dean Ballroom) this afternoon from 13:00 to 14:30 [Oral 5] and Poster Session 2 (ExHall D) from 16:00 to 18:00 [Poster 86].