Cameras as Rays: Pose Estimation via Ray Diffusion
Authors: Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, Shubham Tulsiani
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proposed methods, both regression and diffusion-based, demonstrate state-of-the-art performance on camera pose estimation on CO3D while generalizing to unseen object categories and in-the-wild captures. |
| Researcher Affiliation | Academia | Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, Shubham Tulsiani — Carnegie Mellon University |
| Pseudocode | No | The paper includes figures and equations but no explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project Page: https://jasonyzhang.com/RayDiffusion. (This project page links to the GitHub repository https://github.com/jasonyzhang/ray-diffusion). |
| Open Datasets | Yes | Our method is trained and evaluated using CO3Dv2 (Reizenstein et al., 2021). |
| Dataset Splits | No | The paper mentions training on 41 categories and holding out 10 for generalization, and evaluating by randomly sampling N images from test sequences. However, it does not specify explicit train/validation/test splits (as percentages or sample counts) needed to reproduce the data partitioning. |
| Hardware Specification | Yes | The ray regression and ray diffusion models take about 2 and 4 days respectively to train on 8 A6000 GPUs. All benchmarks are completed using a single Nvidia A6000 GPU. |
| Software Dependencies | No | The paper mentions using 'pre-trained, frozen DINOv2 (S/14)' and 'DiT with 16 transformer blocks' but does not specify versions for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Following Lin et al. (2024), we place the world origin at the point closest to the optical axes of the training cameras, which represents a useful inductive bias for center-facing camera setups. We use a DiT (Peebles & Xie, 2023) with 16 transformer blocks as the architecture for both f_regress (with t always set to 100) and f_diffusion. We train our diffusion model with T=100 timesteps. For all experiments, we use the x0 predicted at timestep t = 30. |
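The early-stopping detail in the Experiment Setup row (train with T=100 diffusion timesteps, but return the denoiser's x0 prediction at t=30 instead of sampling all the way to t=0) can be sketched as a minimal DDPM-style ancestral sampler. This is an illustrative sketch, not the authors' code: the linear beta schedule, the toy `denoise_fn` interface, and the generic ray-parameter shape are all assumptions.

```python
import numpy as np

def ddpm_sample(denoise_fn, shape, T=100, t_stop=30, seed=0):
    """Minimal DDPM-style ancestral sampler (a sketch, not the paper's code).

    Runs reverse diffusion from t = T-1 down to t_stop and returns the
    x0 prediction at t_stop, mirroring the setup note that the x0
    predicted at t = 30 is used as the final estimate.
    """
    rng = np.random.default_rng(seed)
    # Linear beta schedule (an assumption; the paper does not specify it here).
    betas = np.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)  # start from pure noise
    for t in range(T - 1, t_stop - 1, -1):
        eps_hat = denoise_fn(x, t)  # model's predicted noise at step t
        # x0 estimate implied by the noise prediction.
        x0_hat = (x - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
        if t == t_stop:
            return x0_hat  # stop early: return predicted x0 at t_stop
        # Posterior mean for one ancestral step x_t -> x_{t-1}.
        mean = (np.sqrt(alphas[t]) * (1.0 - alpha_bars[t - 1]) * x
                + np.sqrt(alpha_bars[t - 1]) * betas[t] * x0_hat) / (1.0 - alpha_bars[t])
        x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

In the paper's setting `denoise_fn` would be the DiT-based f_diffusion operating on per-patch ray parameters; here `shape` stands in for a generic batch of such parameters.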