or 100 frames, but they still want the reconstruction results.” To address this, he implemented an alternating-attention mechanism, using frame-wise attention to enable the model to identify which tokens correspond to which input frame.

Jianyuan's research builds on several established computer vision techniques. Drawing inspiration from the success of 2D vision, it uses DINO, a 2D foundation model based on a vision transformer architecture. This allows the model to patchify the input images into multiple tokens, transforming the image information into a format that networks can process. Additionally, the model features a camera head that regresses the camera's extrinsic and intrinsic parameters. This simple transformer approach is informed by previous works in camera pose estimation, such as RelPose, PoseDiffusion, and VGGSfM. He also employs DPT, a computer vision network developed four years ago, to predict dense, pixel-wise outputs.

Now that we know which techniques Jianyuan has learned from, are there computer vision techniques that he thinks could benefit from his work? “Yes, neural rendering methods, such as 3D Gaussian or NeRF, because they need camera poses predicted from upstream methods such as ours,” he responds. “Also, our model can predict a high-level latent representation of the 3D properties, so recent large 3D VLM models could benefit from it.”

One potential real-world application of this work is online shopping, where customers often rely on 2D images of products. Using this model, retailers could offer 3D reconstructions of items, allowing customers to rotate and view products from all angles, and even create personal 3D avatars for a virtual fitting.

DAILY CVPR, Friday: Jianyuan Wang
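
For readers who want a feel for the alternating-attention idea mentioned in the article, here is a minimal toy sketch in plain Python. It is not Jianyuan's implementation. It assumes that frame-wise attention (tokens attend only within their own frame) alternates with a global attention step over all tokens from all frames, and it uses identity query/key/value projections to stay short. All function names are hypothetical, and the order of the two steps is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Toy single-head self-attention with identity Q/K/V projections.

    tokens: list of equal-length float vectors.
    """
    dim = len(tokens[0])
    output = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * value[i] for w, value in zip(weights, tokens))
             for i in range(dim)]
        )
    return output


def frame_attention(frames):
    """Frame-wise step: each frame's tokens attend only to that frame."""
    return [self_attention(frame) for frame in frames]


def global_attention(frames):
    """Assumed global step: all tokens from all frames attend to each other.

    Assumes every frame holds the same number of tokens.
    """
    tokens_per_frame = len(frames[0])
    flat = [token for frame in frames for token in frame]
    mixed = self_attention(flat)
    return [
        mixed[i * tokens_per_frame:(i + 1) * tokens_per_frame]
        for i in range(len(frames))
    ]


def alternating_attention(frames, num_blocks=2):
    """Alternate frame-wise and global attention (order is an assumption)."""
    for _ in range(num_blocks):
        frames = frame_attention(frames)
        frames = global_attention(frames)
    return frames


# Two frames, two tokens per frame, two features per token.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]
result = alternating_attention(frames)
```

Because the frame-wise step never mixes tokens across frames, the number of input frames can vary freely: the same functions work for one frame or a hundred, with only the global step growing in cost as frames are added.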