Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Cameras as Relative Positional Encoding
Authors: Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, Angjoo Kanazawa
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments are on three tasks, which span six datasets. We begin with a series of studies comparing camera conditioning techniques for feedforward novel view synthesis using RealEstate10K [12] and Objaverse [13]. Our results highlight the advantages of relative encodings, particularly PRoPE, compared to absolute ones. |
| Researcher Affiliation | Collaboration | Ruilong Li (1, 2), Brent Yi (1), Junchen Liu (1), Hang Gao (1), Yi Ma (1, 3), Angjoo Kanazawa (1). Affiliations: (1) UC Berkeley, (2) NVIDIA, (3) HKU. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks in the main text. The authors state in the NeurIPS Paper Checklist that pseudocode will be provided in supplemental material. |
| Open Source Code | Yes | Code is available on our project webpage (footnote 2: https://www.liruilong.cn/prope/). |
| Open Datasets | Yes | Our experiments are on three tasks, which span six datasets. We begin with a series of studies comparing camera conditioning techniques for feedforward novel view synthesis using RealEstate10K [12] and Objaverse [13]. Our results highlight the advantages of relative encodings, particularly PRoPE, compared to absolute ones. We then verify that these benefits extend to other settings: we demonstrate improvements when integrating PRoPE into UniMatch [14] for stereo depth estimation across three benchmarks, for a discriminative spatial cognition task using DL3DV [15], and when scaling to larger novel view synthesis models [7, 8]. |
| Dataset Splits | No | The paper mentions training and evaluating separately on the RealEstate10K and Objaverse datasets, and describes specific test-time scenarios (e.g., varying input views or intrinsics). However, it does not provide explicit train/validation/test splits (e.g., percentages, sample counts) for the datasets themselves. |
| Hardware Specification | No | Our models are trained on 2x GPUs with a total batch size of 4, as opposed to 512 in the original paper. This applies to all experiments except for the ones in Section 4.7. For the scaling-up experiments in Section 4.7, we use an LVSM model with 12 transformer blocks and keep all other configurations, including the MLP channel dimension (3072). These models are trained on 8x GPUs with a total batch size of 64. |
| Software Dependencies | No | The paper mentions adhering to the LVSM [8] implementation specifications but does not provide specific version numbers for software libraries, frameworks, or programming languages used in their own implementation. |
| Experiment Setup | Yes | All main experiments use identical input, output, and overall model sizes (~25M parameters); we also validate larger models in Section 4.7. More details are provided in Appendix A.1.1. ... We trained exclusively at 256×256 resolution and did not perform the additional fine-tuning at higher resolutions. Limited by academic-level resources, we use a smaller version of the LVSM model with 6 transformer blocks, and reduce the MLP channel dimension from 3072 to 1024. Our models are trained on 2x GPUs with a total batch size of 4... For the scaling-up experiments in Section 4.7, we use an LVSM model with 12 transformer blocks and keep all other configurations, including the MLP channel dimension (3072). These models are trained on 8x GPUs with a total batch size of 64. |
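
The two training setups quoted in the Experiment Setup row can be summarized side by side. The following is a minimal sketch, not the authors' configuration code: the `TrainSetup` class and its field names are hypothetical, and the per-GPU batch size is derived under the assumption that the total batch is split evenly across GPUs, which the paper excerpt does not state. GPU models are not given in the excerpt, so none are recorded here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSetup:
    """Hypothetical container; field names are illustrative, not from the paper's code."""
    name: str
    transformer_blocks: int
    mlp_channel_dim: int
    num_gpus: int
    total_batch_size: int

    def per_gpu_batch_size(self) -> int:
        # Assumption: the total batch is divided evenly across GPUs.
        if self.total_batch_size % self.num_gpus != 0:
            raise ValueError("total batch size is not divisible by the GPU count")
        return self.total_batch_size // self.num_gpus


# Values as quoted in the table above.
MAIN = TrainSetup("main experiments", transformer_blocks=6,
                  mlp_channel_dim=1024, num_gpus=2, total_batch_size=4)
SCALED = TrainSetup("scaling-up (Section 4.7)", transformer_blocks=12,
                    mlp_channel_dim=3072, num_gpus=8, total_batch_size=64)

for setup in (MAIN, SCALED):
    print(f"{setup.name}: {setup.transformer_blocks} blocks, "
          f"MLP dim {setup.mlp_channel_dim}, "
          f"{setup.num_gpus} GPUs x {setup.per_gpu_batch_size()} samples per GPU")
```

Under the even-split assumption, the main experiments process 2 samples per GPU and the scaling-up experiments 8 samples per GPU; both are far below the total batch size of 512 that the quoted text attributes to the original LVSM paper.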