Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Fin3R: Fine-tuning Feed-forward 3D Reconstruction Models via Monocular Knowledge Distillation
Authors: Weining Ren, Hongjun Wang, Xiao Tan, Kai Han
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our method on a wide range of models, including DUSt3R, MASt3R, CUT3R, and VGGT. The fine-tuned models consistently deliver sharper boundaries, recover complex structures, and achieve higher geometric accuracy in both single- and multi-view settings, while adding only the tiny LoRA weights, which leave test-time memory and latency virtually unchanged. ... We evaluate our approach across three settings: single-view, two-view, and multi-view. In the single-view setting, we focus on monocular depth estimation. The two-view configuration evaluates relative pose estimation... In the multi-view setting, we perform multi-view depth estimation, pointmap estimation, and pose estimation. ... Table 1: Quantitative results for monocular depth estimation. ... Table 7: Ablation Study for Distillation Module. |
| Researcher Affiliation | Collaboration | Weining Ren¹, Hongjun Wang¹, Xiao Tan², Kai Han¹. ¹Visual AI Lab, The University of Hong Kong; ²Department of Computer Vision Technology (VIS), Baidu Inc. |
| Pseudocode | No | The paper describes the methodology and training process in detail using natural language and mathematical equations, such as the loss functions and the weight re-normalization formula. However, it does not include any explicitly labeled pseudocode blocks or algorithm boxes. |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We will open-source the code and the model after the paper is accepted. |
| Open Datasets | Yes | During each epoch, we sample 20,000 images from SA-1B [26], 1,000 from Hypersim [46], and 1,000 from TartanAir [69]. ... Table 1: Quantitative results for monocular depth estimation. Datasets: NYUv2, KITTI, ETH3D, iBims-1, DDAD, DIODE, HAMMER (metrics: Rel, δ1) ... Table 2: Relative Camera Pose Evaluation on the ScanNet dataset [10]. ... Table 3: Quantitative Results for Multiview Pose Estimation on RealEstate10k [88]. ... Table 4: Results for Video Depth Estimation. Datasets: ETH3D [49], T&T [27], KITTI [58], Sintel [6], Bonn [40] ... Table 5: Pointmap Regression on 7-Scenes [52] and NRGBD [2] Datasets. |
| Dataset Splits | Yes | During each epoch, we sample 20,000 images from SA-1B [26], 1,000 from Hypersim [46], and 1,000 from TartanAir [69]. Training runs for 10 epochs... In the single-view setting, we focus on monocular depth estimation. The two-view configuration evaluates relative pose estimation... In the multi-view setting, we perform multi-view depth estimation... For the ablation study, we replace the SA-1B dataset [26] with a mixed dataset composed of MegaDepth [30], CO3Dv2 [45], ARKitScenes [5], ScanNet++ [81], ScanNet [10], Virtual KITTIv2 [7], BlendedMVS [79], and StaticThings3D [51]. Each dataset is equally weighted, providing coverage that is comparable to the DUSt3R training set. |
| Hardware Specification | Yes | Training runs for 10 epochs on four NVIDIA L20 GPUs over a single day. |
| Software Dependencies | No | The paper mentions using LoRA adapters, but does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used for implementation. |
| Experiment Setup | Yes | Implementation Details. We use MoGe [66] as the teacher model for pseudo-label generation. ... For DUSt3R [68], we use 2-view data with distillation supervision applied exclusively to the view-1 pointmap head for distillation loss. In contrast, CUT3R [65] and VGGT [61] utilize 2-8 views, with supervision on either the self-view head or the depth head. ... In all experiments, we set both the rank and alpha of LoRA to 8. ... Training is performed at a resolution of 512 width... The model is fine-tuned for 10 epochs. The learning rate is initialized at 1e-4 with a one-epoch warm-up phase and is gradually decayed to a minimum of 1e-6. A batch size of 2 per GPU is used, and gradients are accumulated over 8 iterations to achieve an effective batch size of 64. ... CUT3R is trained at a resolution of 512 width, while VGGT is trained at a resolution of 518 width. The model is fine-tuned for 10 epochs with an initial learning rate of 1e-4, which is warmed up for one epoch and then gradually decayed to a minimum of 1e-6. Additionally, the sequence length is dynamically selected between 2 and 8, with the product of batch size and sequence length fixed at 8. The accumulation iteration is changed accordingly to ensure an effective total batch size of 64. |
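
The batch-size arithmetic and learning-rate schedule quoted in the Experiment Setup row can be sanity-checked with a short sketch. This is not the authors' code: the function names are hypothetical, the GPU count of four is taken from the Hardware Specification row, and the schedule shape (linear warm-up from zero, then cosine decay) is an assumption, since the paper only says the rate is "gradually decayed". The multi-view helper reflects one reading of the quoted text and only handles sequence lengths that divide 8 evenly.

```python
import math

BASE_LR = 1e-4      # initial learning rate (quoted in the setup row)
MIN_LR = 1e-6       # minimum learning rate (quoted in the setup row)
EPOCHS = 10         # total fine-tuning epochs (quoted)
WARMUP_EPOCHS = 1   # one-epoch warm-up (quoted)
NUM_GPUS = 4        # four NVIDIA L20 GPUs (Hardware Specification row)


def effective_batch_size(per_gpu_batch: int, num_gpus: int, accum_steps: int) -> int:
    """Samples contributing to one optimizer step under gradient accumulation."""
    return per_gpu_batch * num_gpus * accum_steps


def multiview_accum_steps(seq_len: int, views_per_gpu: int = 8, target: int = 64) -> int:
    """Accumulation steps that keep the effective batch at `target` sequences.

    Assumption: per-GPU batch = views_per_gpu / seq_len, and seq_len divides
    views_per_gpu. How the authors treat other lengths is not stated.
    """
    if views_per_gpu % seq_len != 0:
        raise ValueError("this sketch only covers sequence lengths dividing 8")
    per_gpu_batch = views_per_gpu // seq_len
    return target // (per_gpu_batch * NUM_GPUS)


def lr_at(epoch: float) -> float:
    """Learning rate at a (possibly fractional) epoch.

    Assumption: linear warm-up from 0, then cosine decay to MIN_LR.
    """
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * epoch / WARMUP_EPOCHS
    progress = min((epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS), 1.0)
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(effective_batch_size(per_gpu_batch=2, num_gpus=NUM_GPUS, accum_steps=8))
    for length in (2, 4, 8):
        print(length, multiview_accum_steps(length))
    for e in (0.0, 0.5, 1.0, 5.5, 10.0):
        print(e, lr_at(e))
```

For the two-view configuration, a per-GPU batch of 2 on four GPUs with 8 accumulation steps gives 2 × 4 × 8 = 64, which matches the quoted effective batch size.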