Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers

Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Sida Peng, Xin Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments; 4.1 Experimental Setup; 4.2 Ablations and Analysis; 4.3 Zero-Shot Relative Depth Estimation; 4.4 Edge-Aware Point Cloud Evaluation
Researcher Affiliation | Collaboration | (1) Huazhong University of Science and Technology; (2) Xiaomi EV; (3) Zhejiang University
Pseudocode | No | The paper describes the generative formulation using Equations 1-3, and the model architecture details in text, but does not include a structured pseudocode or algorithm block.
Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The datasets used in our experiments are publicly available, ensuring accessibility of the data. Although the code is not provided at submission time, we plan to release the code, models, and detailed instructions to facilitate full reproducibility.
Open Datasets | Yes | We adopt Hypersim [50], a photorealistic synthetic dataset with accurate and clean 3D geometry, which contains approximately 54K samples, to train the 512×512 model. For the 1024×768 model, we additionally leverage four datasets, UrbanSyn [19] (7.5K), UnrealStereo4K [62] (8K), VKITTI [5] (25K), and TartanAir [71] (30K), to further enhance the model's generalization and robustness.
Dataset Splits | No | Training datasets. Our objective is to estimate pixel-perfect depth maps, which, when converted to point clouds, are free of flying pixels and geometric artifacts. To achieve this, it is essential to train on datasets with high-quality ground truth point clouds. We adopt Hypersim [50], a photorealistic synthetic dataset with accurate and clean 3D geometry, which contains approximately 54K samples, to train the 512×512 model. For the 1024×768 model, we additionally leverage four datasets, UrbanSyn [19] (7.5K), UnrealStereo4K [62] (8K), VKITTI [5] (25K), and TartanAir [71] (30K), to further enhance the model's generalization and robustness. [...] To evaluate on the official test split of the Hypersim [50] dataset, which provides high-quality ground-truth point clouds and is not used during training.
Hardware Specification | Yes | We train all models on 8 NVIDIA GPUs with a per-GPU batch size of 4, using the AdamW optimizer with a constant learning rate of 1×10⁻⁴.
Software Dependencies | No | We train all models on 8 NVIDIA GPUs with a per-GPU batch size of 4, using the AdamW optimizer with a constant learning rate of 1×10⁻⁴. The training loss is the MSE loss between the predicted and true velocity, as shown in Equation 3, and the gradient matching loss, which is adopted from [82].
Experiment Setup | Yes | In our implementation, we use a total of N = 24 DiT blocks, each operating at a hidden dimension of D = 1024. The first 12 blocks are standard DiT blocks with a patch size of 16, corresponding to (H/16)×(W/16) tokens for an input of size H×W. After the 12th block, we employ an MLP layer to expand the hidden dimension by a factor of 4, followed by reshaping to obtain (H/8)×(W/8) tokens. The remaining 12 SP-DiT blocks then further process these (H/8)×(W/8) tokens. Finally, we employ an MLP layer followed by a reshaping operation to transform the processed tokens into an H×W depth map. [...] We train all models on 8 NVIDIA GPUs with a per-GPU batch size of 4, using the AdamW optimizer with a constant learning rate of 1×10⁻⁴. The training loss is the MSE loss between the predicted and true velocity, as shown in Equation 3, and the gradient matching loss, which is adopted from [82].
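
The token counts and batch size quoted in the Experiment Setup row can be checked with a little arithmetic. The sketch below is not the authors' code (none was released at submission time); the function names are hypothetical, and it assumes that the reshape after the 12th block splits each 4D-wide token into four D-wide tokens and that no gradient accumulation is used.

```python
# Illustrative arithmetic only; names below are hypothetical, not from the paper.
HIDDEN_DIM = 1024    # D, as quoted in the Experiment Setup row
COARSE_PATCH = 16    # patch size of the first 12 DiT blocks
FINE_PATCH = 8       # effective patch size seen by the 12 SP-DiT blocks
EXPANSION = 4        # MLP expands the hidden dimension by this factor


def token_count(height, width, patch):
    """Number of tokens when an image is cut into patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by the patch size")
    return (height // patch) * (width // patch)


def stage_shapes(height, width):
    """(tokens, channels) at the three stages described in the text."""
    coarse = token_count(height, width, COARSE_PATCH)
    fine = token_count(height, width, FINE_PATCH)
    # Assumption: each expanded token (4*D channels) is reshaped into
    # four D-channel tokens, so the total number of values is unchanged.
    assert coarse * EXPANSION == fine
    return {
        "coarse": (coarse, HIDDEN_DIM),
        "expanded": (coarse, EXPANSION * HIDDEN_DIM),
        "fine": (fine, HIDDEN_DIM),
    }


def effective_batch_size(num_gpus, per_gpu_batch):
    """Global batch size, assuming no gradient accumulation."""
    return num_gpus * per_gpu_batch


if __name__ == "__main__":
    print(stage_shapes(512, 512))
    print(stage_shapes(1024, 768))
    print(effective_batch_size(8, 4))
```

Under these assumptions, a 512×512 input yields 1024 coarse tokens and 4096 fine tokens, a 1024×768 input yields 3072 and 12288, and 8 GPUs at 4 samples each give a global batch of 32.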