Cameras as Rays: Pose Estimation via Ray Diffusion

Authors: Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, Shubham Tulsiani

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our proposed methods, both regression and diffusion-based, demonstrate state-of-the-art performance on camera pose estimation on CO3D while generalizing to unseen object categories and in-the-wild captures.
Researcher Affiliation | Academia | Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, Shubham Tulsiani; Carnegie Mellon University
Pseudocode | No | The paper includes figures and equations but no explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Project Page: https://jasonyzhang.com/RayDiffusion. (This project page links to the GitHub repository https://github.com/jasonyzhang/ray-diffusion).
Open Datasets | Yes | Our method is trained and evaluated using CO3Dv2 (Reizenstein et al., 2021).
Dataset Splits | No | The paper mentions training on 41 categories, holding out 10 for generalization, and evaluating by randomly sampling N images from test sequences. However, it does not give percentages or sample counts for a train/validation/test split, so the split cannot be reproduced from the paper alone.
Hardware Specification | Yes | The ray regression and ray diffusion models take about 2 and 4 days respectively to train on 8 A6000 GPUs. All benchmarks are completed using a single Nvidia A6000 GPU.
Software Dependencies | No | The paper mentions using 'pre-trained, frozen DINOv2 (S/14)' and 'DiT with 16 transformer blocks' but does not specify versions for general software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | Following Lin et al. (2024), we place the world origin at the point closest to the optical axes of the training cameras, which represents a useful inductive bias for center-facing camera setups. We use a DiT (Peebles & Xie, 2023) with 16 transformer blocks as the architecture for both f_Regress (with t always set to 100) and f_Diffusion. We train our diffusion model with T=100 timesteps. For all experiments, we use the X0 predicted at T = 30.
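The Experiment Setup entry says the world origin is placed at the point closest to the optical axes of the training cameras, but it does not say how that point is computed. One common way to do this, assumed here and not confirmed by the source, is a least-squares solve that minimizes the summed squared distance to each axis. The sketch below uses only the Python standard library; the function names `closest_point_to_axes` and `solve3` and the input layout (camera centers plus viewing directions in world coordinates) are hypothetical and do not come from the paper or its repository.

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    n = 3
    for col in range(n):
        # Pick the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("axes are (nearly) parallel; closest point is not unique")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def closest_point_to_axes(centers, directions):
    """Least-squares point closest to a set of 3D lines (assumed formulation).

    Line i passes through centers[i] along directions[i]. With unit direction u_i,
    the minimizer p satisfies  sum_i (I - u_i u_i^T) p = sum_i (I - u_i u_i^T) c_i.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for c, d in zip(centers, directions):
        norm = sum(x * x for x in d) ** 0.5
        u = [x / norm for x in d]
        for r in range(3):
            for k in range(3):
                # Entry (r, k) of the projector onto the plane orthogonal to the axis.
                p = (1.0 if r == k else 0.0) - u[r] * u[k]
                A[r][k] += p
                b[r] += p * c[k]
    return solve3(A, b)


if __name__ == "__main__":
    # Three hypothetical cameras whose optical axes all pass through (1, 2, 3).
    centers = [(1.0, 2.0, 0.0), (0.0, 2.0, 3.0), (1.0, 0.0, 3.0)]
    directions = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    origin = closest_point_to_axes(centers, directions)
    # Re-center the cameras so the estimated point becomes the world origin.
    recentered = [tuple(ci - oi for ci, oi in zip(c, origin)) for c in centers]
    print(origin, recentered)
```

In this toy setup all three axes meet at a single point, so the solve recovers that intersection. With real, noisy cameras the axes generally do not intersect and the result is the least-squares compromise. The solve fails when all axes are parallel, which should not occur in the center-facing captures the entry describes.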