Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Evaluating Robustness of Monocular Depth Estimation with Procedural Scene Perturbations

Authors: Jack Nugent, Siyang Wu, Zeyu Ma, Beining Han, Meenal Parakh, Abhishek Joshi, Lingjie Mei, Alexander Raistrick, Xinyuan Li, Jia Deng

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this work, we introduce PDE (Procedural Depth Evaluation), a new benchmark which enables systematic evaluation of robustness to changes in 3D scene content. PDE uses procedural generation to create 3D scenes that test robustness to various controlled perturbations, including object, camera, material and lighting changes. Our analysis yields interesting findings on what perturbations are challenging for state-of-the-art depth models, which we hope will inform further research.
Researcher Affiliation	Academia	Jack Nugent, Siyang Wu, Zeyu Ma, Beining Han, Meenal Parakh, Abhishek Joshi, Lingjie Mei, Alexander Raistrick, Xinyuan Li, Jia Deng Princeton University EMAIL
Pseudocode	No	The paper describes its methodology and evaluation metrics in narrative and mathematical forms within sections 3 and 3.2, but does not present any structured pseudocode or algorithm blocks.
Open Source Code	Yes	Code and data are available at https://github.com/princeton-vl/proc-depth-eval.
Open Datasets	Yes	Code and data are available at https://github.com/princeton-vl/proc-depth-eval. We construct our PDE (Procedural Depth Evaluation) dataset using scenes generated by both Infinigen Nature [27] and Infinigen Indoors [28].
Dataset Splits	No	The PDE dataset comprises 5 object categories and 38 distinct scenes, with each object category appearing in 8 scenes. In the next section, we will introduce 12 possible procedural perturbations, each with up to 60 different parameter settings. This results in a total of 13684 unique scene variations. The paper describes the composition of the dataset used for evaluation but does not specify explicit training/test/validation splits for its own benchmark.
Hardware Specification	Yes	We evaluate all models using jobs with 20GB of memory and either one NVIDIA RTX 3090 GPU or one NVIDIA RTX 2080 GPU.
Software Dependencies	No	The paper mentions using Infinigen [27, 28] for procedural generation, which is described as an open-source procedural generator, and notes that implemented perturbations are additional procedural generation code on top of this system. However, it does not specify version numbers for Python, Infinigen, or any other software libraries used in their evaluation framework.
Experiment Setup	Yes	We evaluate all models on our dataset consisting of 1280 × 720 images. We use the default inference procedure of each model (which may include image resizing) to output a depth map of the same resolution. We follow standard procedures for computing the scale and shift alignment as in Marigold [16] and MiDaS [29]. When evaluating on an object of interest, we compute alignment using only the depth values for the object. Additional details can be found in the section A of the appendix.
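
The scale and shift alignment mentioned in the Experiment Setup row can be illustrated with a minimal sketch. It assumes a closed-form least-squares fit in depth space over the masked object pixels; the paper's exact procedure (for example, whether alignment is done in depth or disparity space) is described in its appendix and may differ. The function names `fit_scale_shift`, `apply_alignment`, and `abs_rel` are hypothetical, and the absolute relative error is a common depth metric used here only for illustration, not necessarily the one the paper reports. Depth maps are flattened to plain Python sequences to keep the example dependency-free.

```python
from typing import List, Sequence, Tuple


def fit_scale_shift(pred: Sequence[float], gt: Sequence[float],
                    mask: Sequence[bool]) -> Tuple[float, float]:
    """Return (scale, shift) minimizing sum((scale * pred + shift - gt)^2)
    over the pixels where mask is True (assumed: the object of interest)."""
    pairs = [(p, g) for p, g, m in zip(pred, gt, mask) if m]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two masked pixels to fit scale and shift")
    mean_p = sum(p for p, _ in pairs) / n
    mean_g = sum(g for _, g in pairs) / n
    var_p = sum((p - mean_p) ** 2 for p, _ in pairs)
    if var_p == 0.0:
        raise ValueError("prediction is constant on the mask; scale is undefined")
    cov = sum((p - mean_p) * (g - mean_g) for p, g in pairs)
    scale = cov / var_p               # slope of the 1-D linear regression
    shift = mean_g - scale * mean_p   # intercept
    return scale, shift


def apply_alignment(pred: Sequence[float], scale: float, shift: float) -> List[float]:
    """Apply the fitted affine transform to every predicted depth value."""
    return [scale * p + shift for p in pred]


def abs_rel(aligned: Sequence[float], gt: Sequence[float],
            mask: Sequence[bool]) -> float:
    """Mean absolute relative error over masked pixels (illustrative metric)."""
    errs = [abs(a - g) / g for a, g, m in zip(aligned, gt, mask) if m]
    return sum(errs) / len(errs)


# Toy example: the prediction is off by an affine transform on the object,
# and the last (background) pixel is excluded by the mask.
pred = [1.0, 2.0, 3.0, 4.0, 10.0]
gt = [2.5, 4.5, 6.5, 8.5, 100.0]
mask = [True, True, True, True, False]

scale, shift = fit_scale_shift(pred, gt, mask)
aligned = apply_alignment(pred, scale, shift)
object_error = abs_rel(aligned, gt, mask)
```

Fitting on the object mask alone means background pixels, such as the last one in the toy data, have no influence on the recovered scale and shift, which matches the object-only alignment described above.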