Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
3D Visual Illusion Depth Estimation
Authors: Chengtang Yao, Zhidan Liu, Jiaxi Zeng, Lidong Yu, Yuwei Wu, Yunde Jia
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that SOTA monocular, binocular, and multi-view depth estimation approaches are all fooled by various 3D visual illusions, while our method achieves SOTA performance. In this paper, we present a 3D-Visual-Illusion dataset to investigate the impact of 3D visual illusions on depth estimation. |
| Researcher Affiliation | Collaboration | 1 Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology, China; 2 Guangdong Provincial Key Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University, Shenzhen, China; 3 NVIDIA; 4 NEOLIX |
| Pseudocode | Yes | As illustrated in Algorithm 1, we randomly sample three points to define a candidate plane at each iteration of RANSAC. As shown in Algorithm 2, we generate a right-view image $\hat{I}_R$ from a given left-view image $I_L$ and disparity map $D$. As shown in Algorithm 3, the process begins by upsampling the depth map and scaling the intrinsic matrix accordingly. |
| Open Source Code | No | We do not provide open access to the data and code at this time, but can publish part of them at the rebuttal stage if the reviewers need it. The complete data and code will be published after the paper is accepted. |
| Open Datasets | Yes | We first pre-train the models on the Scene Flow dataset [28], and then fine-tune them on the virtual 3D-Visual-Illusion data. The fine-tuned models are evaluated on both the real-world 3D-Visual-Illusion data and the Booster training set [32]. In addition to illusion scenes, we also present the performance of our model in the Middlebury dataset. |
| Dataset Splits | Yes | The dataset comprises nearly 3,000 scenarios, with over 200,000 frames for training and 617 frames for testing. The model is initially trained on the Scene Flow dataset, then fine-tuned on the 3D-Visual-Illusion training set, and finally evaluated on the 3D-Illusion test set and the Booster training set. |
| Hardware Specification | Yes | All experiments were conducted on a single NVIDIA H100 GPU with an input resolution of 1920×1080. We use LoRA [15] to fine-tune the last layer of QwenVL2-7B and the Q/V projection layer of FLUX on 4 H100 with a batch size of 6 on each GPU. |
| Software Dependencies | No | For dataset construction, we use Qwen2-VL-72B [1, 40] to perform initial screening... We then built a Flask [37]-based web tool... The video generation is achieved via Sora [30] and Kling [21], and a small part of the data is generated from HunyuanVideo [20]. We then use InstantSplat [6], DUSt3R [41], and GS [18]... As for our VLM-driven monocular-stereo fusion network, we benefit from the vision and language foundation model and use DepthAnything V2 [45, 46] as a pre-trained monocular model, QwenVL2-7B [40, 1] as pre-trained VLM, and FLUX [22] as a diffusion model. |
| Experiment Setup | Yes | The supervised loss function consists of two main components: one ($L_d$) for the disparity maps and the other ($L_c$) for the confidence map: $L = L_d + w L_c$ (12), where $w$ is a manually set weighting factor for balancing the confidence map loss. For disparity supervision, we use the L1 loss... For confidence map supervision, we adopt the Focal Loss, where the ground-truth for confidence is derived based on the disparity difference between the final stereo prediction $D_s^T$ and the ground-truth... We use LoRA [15] to fine-tune the last layer of QwenVL2-7B and the Q/V projection layer of FLUX on 4 H100 with a batch size of 6 on each GPU. The entire training takes almost 20 days. |
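
The Pseudocode row quotes the paper as sampling three points per RANSAC iteration to define a candidate plane. The following is a minimal, generic sketch of that idea in plain Python, not the paper's Algorithm 1: the function names, the inlier threshold, the iteration count, and the fixed seed are all assumptions chosen for illustration.

```python
import random


def plane_from_points(p, q, r):
    """Return (unit_normal, d) with n.x + d = 0, or None if the points are collinear."""
    u = [q[i] - p[i] for i in range(3)]
    v = [r[i] - p[i] for i in range(3)]
    # Cross product u x v gives the plane normal.
    n = [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]
    norm = sum(c * c for c in n) ** 0.5
    if norm < 1e-12:
        return None  # degenerate sample: the three points lie on one line
    n = [c / norm for c in n]
    d = -sum(n[i] * p[i] for i in range(3))
    return n, d


def ransac_plane(points, iterations=200, threshold=0.01, seed=0):
    """Keep the candidate plane with the most inliers (hypothetical defaults)."""
    rng = random.Random(seed)
    best_plane, best_inliers = None, []
    for _ in range(iterations):
        p, q, r = rng.sample(points, 3)  # three random points define a candidate
        plane = plane_from_points(p, q, r)
        if plane is None:
            continue
        n, d = plane
        inliers = [
            pt for pt in points
            if abs(sum(n[i] * pt[i] for i in range(3)) + d) <= threshold
        ]
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = plane, inliers
    return best_plane, best_inliers
```

A caller would pass a list of `(x, y, z)` tuples and read back the plane parameters together with the points that support it. A production version would typically refit the plane to all inliers by least squares after the loop.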
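
The Experiment Setup row describes a loss of the form L = L_d + w L_c, with an L1 term on disparity and a Focal Loss term on confidence. The sketch below shows one way such a combination could be computed on flat lists of values. It is an illustration under stated assumptions, not the paper's implementation: the threshold `tau`, the rule that a pixel counts as confident when its disparity error is below `tau`, and the focal-loss parameters `gamma` and `alpha` are all hypothetical, since the quoted text elides those details.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error between predicted and ground-truth disparities."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss, averaged over elements (gamma and alpha are assumed values)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p
        a_t = alpha if y == 1 else 1.0 - alpha
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


def confidence_labels(stereo_pred, gt, tau=1.0):
    """Assumed labeling rule: 1 where the stereo disparity error is below tau, else 0."""
    return [1 if abs(p - t) < tau else 0 for p, t in zip(stereo_pred, gt)]


def total_loss(disp_pred, disp_gt, conf_probs, w=1.0, tau=1.0):
    """L = L_d + w * L_c, with L_d as L1 and L_c as focal loss."""
    labels = confidence_labels(disp_pred, disp_gt, tau)
    return l1_loss(disp_pred, disp_gt) + w * focal_loss(conf_probs, labels)
```

Setting `w=0` reduces the total to the L1 term alone, which makes the role of the weighting factor easy to check in isolation.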