Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Jasmine: Harnessing Diffusion Prior for Self-supervised Depth Estimation
Authors: Jiyuan Wang, Chunyu Lin, Cheng Guan, Lang Nie, Jing He, Haodong Li, Kang Liao, Yao Zhao
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that Jasmine achieves SoTA performance on the KITTI benchmark and exhibits superior zero-shot generalization across multiple datasets. Project page and code are available at here. |
| Researcher Affiliation | Academia | Jiyuan Wang¹, Chunyu Lin¹, Cheng Guan¹, Lang Nie⁴, Jing He³, Haodong Li³, Kang Liao², Yao Zhao¹; ¹BJTU, ²NTU, ³HKUST, ⁴CQUPT; Corresponding Author |
| Pseudocode | Yes | An overview of the whole framework is shown in Fig. 2 and its training pseudocode is shown in Algorithm 1. |
| Open Source Code | Yes | Project page and code are available at here. |
| Open Datasets | Yes | KITTI[13]: Following the previous work[15], we mainly conduct our experiments on the widely used KITTI dataset. Hypersim[38]: This photorealistic synthetic dataset (461 indoor scenes) contributes approximately 28k samples from its official training split for mix-batch image reconstruction. DrivingStereo[67]: Contains 500 images per weather condition (fog, cloudy, rainy, sunny) for zero-shot testing. CityScape[7]: Evaluated on 1,525 test images with dynamic vehicle-rich urban scenes, using ground truth from [55]. MIR Analysis Datasets. ETH3D[41]: We resize this high-resolution (6048 × 4032) dataset to 4K resolution, then randomly cropped to 1024 × 320 per iteration (898 total samples). Virtual KITTI[4] is a synthetic street scene dataset. |
| Dataset Splits | Yes | KITTI[13]: Following the previous work[15], we mainly conduct our experiments on the widely used KITTI dataset. We employ Zhou's split[86] containing 39,810 training and 4,424 validation samples after removing static frames. The evaluation uses 697 Eigen raw test images with metrics from [15], applying 80m ground truth clipping and Eigen crop preprocessing[8]. |
| Hardware Specification | Yes | All experiments are conducted on 8 NVIDIA A800 GPUs with a total batch size of 32, training for a total of 25k training steps, requiring around 1 day. MACs and Runtime are measured on a image with 1024 × 320 resolution on RTX 4090. |
| Software Dependencies | Yes | We implement the proposed Jasmine using Accelerate[16] and PyTorch[35] with Stable Diffusion v2[39] as the backbone. |
| Experiment Setup | Yes | The loss weights specified in Sec. 3 are empirically configured as: ηa = 8e-3, ηt1 = 0.6, ηt2 = 0.9, ηp1 = 0.85, ηp2 = 0.15, ηstep = max(1, 30 (stepnow/stepmax)). Training uses the AdamW optimizer[30] with a base learning rate of 3e-5. All experiments are conducted on 8 NVIDIA A800 GPUs with a total batch size of 32, training for a total of 25k training steps, requiring around 1 day. Following [15], we also employed standard data augmentation techniques (horizontal flips, random brightness, contrast, saturation, and hue jitter). |
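
As a reading aid, the setup values quoted in the table can be gathered into a small configuration sketch. This is not the authors' code: the class name `TrainConfig`, the helper `random_crop_box`, the even split of the batch across GPUs (no gradient accumulation), and the example image size are all assumptions made for illustration. Only the numeric values (8 GPUs, total batch size 32, 25k steps, learning rate 3e-5, 1024 × 320 crops) come from the quoted text.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the values quoted in the table."""
    num_gpus: int = 8             # 8 NVIDIA A800 GPUs
    total_batch_size: int = 32    # total batch size across all GPUs
    max_steps: int = 25_000       # 25k training steps
    base_lr: float = 3e-5         # AdamW base learning rate
    crop_width: int = 1024        # crop resolution used for ETH3D
    crop_height: int = 320

    def per_gpu_batch_size(self) -> int:
        # Assumption: the batch is split evenly across GPUs with no
        # gradient accumulation; the source does not state this.
        if self.total_batch_size % self.num_gpus != 0:
            raise ValueError("total batch size must divide evenly across GPUs")
        return self.total_batch_size // self.num_gpus


def random_crop_box(img_w, img_h, crop_w, crop_h, rng):
    """Return a (left, top, right, bottom) box of size crop_w x crop_h
    placed uniformly at random inside an img_w x img_h image."""
    if crop_w > img_w or crop_h > img_h:
        raise ValueError("crop is larger than the image")
    left = rng.randint(0, img_w - crop_w)  # randint is inclusive on both ends
    top = rng.randint(0, img_h - crop_h)
    return left, top, left + crop_w, top + crop_h


cfg = TrainConfig()
# The 3840 x 2160 image size below is a placeholder, not a value from the paper.
box = random_crop_box(3840, 2160, cfg.crop_width, cfg.crop_height, random.Random(0))
```

Under the even-split assumption, `cfg.per_gpu_batch_size()` gives 4 samples per GPU, and `box` always spans exactly 1024 × 320 pixels inside the placeholder image.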