Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers
Authors: Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Sida Peng, Xin Yang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments 4.1 Experimental Setup 4.2 Ablations and Analysis 4.3 Zero-Shot Relative Depth Estimation 4.4 Edge-Aware Point Cloud Evaluation |
| Researcher Affiliation | Collaboration | 1Huazhong University of Science and Technology 2Xiaomi EV 3Zhejiang University |
| Pseudocode | No | The paper describes the generative formulation using equations 1-3, and the model architecture details in text, but does not include a structured pseudocode or algorithm block. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The datasets used in our experiments are publicly available, ensuring accessibility of the data. Although the code is not provided at submission time, we plan to release the code, models, and detailed instructions to facilitate full reproducibility. |
| Open Datasets | Yes | We adopt Hypersim [50], a photorealistic synthetic dataset with accurate and clean 3D geometry, which contains approximately 54K samples, to train the 512×512 model. For the 1024×768 model, we additionally leverage four datasets, UrbanSyn [19] (7.5K), UnrealStereo4K [62] (8K), VKITTI [5] (25K), and TartanAir [71] (30K), to further enhance the model's generalization and robustness. |
| Dataset Splits | No | Training datasets. Our objective is to estimate pixel-perfect depth maps, which, when converted to point clouds, are free of flying pixels and geometric artifacts. To achieve this, it is essential to train on datasets with high-quality ground truth point clouds. We adopt Hypersim [50], a photorealistic synthetic dataset with accurate and clean 3D geometry, which contains approximately 54K samples, to train the 512×512 model. For the 1024×768 model, we additionally leverage four datasets, UrbanSyn [19] (7.5K), UnrealStereo4K [62] (8K), VKITTI [5] (25K), and TartanAir [71] (30K), to further enhance the model's generalization and robustness. [...] To evaluate on the official test split of the Hypersim [50] dataset, which provides high-quality ground-truth point clouds and is not used during training. |
| Hardware Specification | Yes | We train all models on 8 NVIDIA GPUs with a per-GPU batch size of 4, using the AdamW optimizer with a constant learning rate of 1 × 10⁻⁴. |
| Software Dependencies | No | We train all models on 8 NVIDIA GPUs with a per-GPU batch size of 4, using the AdamW optimizer with a constant learning rate of 1 × 10⁻⁴. The training loss is the MSE loss between the predicted and true velocity, as shown in Equation 3, and the gradient matching loss, which is adopted from [82]. |
| Experiment Setup | Yes | In our implementation, we use a total of N = 24 DiT blocks, each operating at a hidden dimension of D = 1024. The first 12 blocks are standard DiT blocks with a patch size of 16, corresponding to (H/16) × (W/16) tokens for an input of size H × W. After the 12th block, we employ an MLP layer to expand the hidden dimension by a factor of 4, followed by reshaping to obtain (H/8) × (W/8) tokens. The remaining 12 SP-DiT blocks then further process these (H/8) × (W/8) tokens. Finally, we employ an MLP layer followed by a reshaping operation to transform the processed tokens into an H × W depth map. [...] We train all models on 8 NVIDIA GPUs with a per-GPU batch size of 4, using the AdamW optimizer with a constant learning rate of 1 × 10⁻⁴. The training loss is the MSE loss between the predicted and true velocity, as shown in Equation 3, and the gradient matching loss, which is adopted from [82]. |
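
The token counts and batch size quoted in the Experiment Setup row follow from simple arithmetic, which the sketch below works through for the two training resolutions. It is an illustration only, not the authors' code: the names `TokenLayout`, `token_layouts`, and `effective_batch_size` are hypothetical. It also rests on two assumptions that the quoted text does not state outright: that reshaping the 4×-expanded tokens yields four finer tokens of the original width (1024), and that no gradient accumulation is used.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenLayout:
    """Token grid size and per-token width at one stage of the model."""
    grid_h: int
    grid_w: int
    dim: int

    @property
    def num_tokens(self) -> int:
        return self.grid_h * self.grid_w


def token_layouts(height, width, patch=16, dim=1024, expand=4):
    """Return (coarse, fine) layouts for an input of size height x width.

    coarse: the first 12 DiT blocks, (H/16) x (W/16) tokens.
    fine:   the last 12 SP-DiT blocks, (H/8) x (W/8) tokens.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by the patch size")
    scale = math.isqrt(expand)  # 4x wider tokens -> 2x finer grid per axis
    if scale * scale != expand:
        raise ValueError("expand must be a perfect square")
    coarse = TokenLayout(height // patch, width // patch, dim)
    # Assumption: the expanded width is split evenly across the finer tokens.
    fine = TokenLayout(
        coarse.grid_h * scale,
        coarse.grid_w * scale,
        dim * expand // (scale * scale),
    )
    return coarse, fine


def effective_batch_size(num_gpus=8, per_gpu=4):
    """Global batch size, assuming no gradient accumulation."""
    return num_gpus * per_gpu


if __name__ == "__main__":
    # Assumption: 1024x768 is read as width x height.
    for h, w in [(512, 512), (768, 1024)]:
        coarse, fine = token_layouts(h, w)
        print(f"{w}x{h}: {coarse.num_tokens} coarse tokens, "
              f"{fine.num_tokens} fine tokens of width {fine.dim}")
    print("effective batch size:", effective_batch_size())
```

Under these assumptions the 512×512 model processes 1,024 tokens in the first 12 blocks and 4,096 in the last 12, and the 8-GPU setup with a per-GPU batch size of 4 gives a global batch of 32.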