Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory
Authors: Yuqi Wu, Wenzhao Zheng, Jie Zhou, Jiwen Lu
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use various 3D/4D tasks (dense 3D reconstruction, monocular depth estimation, video depth estimation, and camera pose estimation) to evaluate our method. We choose DUSt3R [64], MASt3R [29], MonST3R [75], Spann3R [59], and CUT3R [63] as our primary baselines. Table 1: Quantitative 3D reconstruction results on 7-scenes and NRGBD datasets. Table 2: Monocular Depth Evaluation on NYU-v2 (static), Sintel, Bonn, and KITTI datasets. Table 3: Video Depth Evaluation. We compare scale-invariant depth (per-sequence alignment) and metric depth (no alignment) results on Sintel, Bonn, and KITTI datasets. Table 4: Camera Pose Estimation Evaluation on ScanNet, Sintel, and TUM-dynamics datasets. |
| Researcher Affiliation | Academia | Yuqi Wu¹, Wenzhao Zheng¹, Jie Zhou¹, Jiwen Lu¹,²; ¹Department of Automation, Tsinghua University; ²Beijing National Research Center for Information Science and Technology |
| Pseudocode | No | The paper describes the methodology using prose and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: https://github.com/YkiWu/Point3R. |
| Open Datasets | Yes | During training, we use a combination of 14 datasets, including ARKitScenes [5], ScanNet [12], ScanNet++ [74], CO3Dv2 [44], WildRGBD [70], OmniObject3D [69], HyperSim (a subset of it) [45], BlendedMVS [73], MegaDepth [30], Waymo [55], Virtual KITTI2 [7], PointOdyssey [79], Spring [35], and MVS-Synth [24]. These datasets exhibit highly diverse characteristics, encompassing both indoor and outdoor, static and dynamic, as well as real-world and synthetic scenes. |
| Dataset Splits | Yes | We train the model by sampling 5 frames per sequence in the first stage. The input here is 224 × 224 resolution. Then we use input with different aspect ratios (set the maximum side to 512) in the second stage, following CUT3R [63]. And finally, we freeze the encoder and fine-tune other parts on 8-frame sequences. We use inputs with minimal overlap [63]: 3 to 5 frames per scene for the 7-scenes datasets and 2 to 4 frames per scene for the NRGBD dataset. |
| Hardware Specification | Yes | We train our model on 8 A800 NVIDIA GPUs for 15 days, which is a relatively low cost. |
| Software Dependencies | No | The paper mentions software components like "ViT-Large [15, 64] image encoder, ViT-Base interaction decoders [64, 66], and DPT [43] heads" and "AdamW optimizer [32]" but does not specify their version numbers or the versions of underlying frameworks like PyTorch or TensorFlow. |
| Experiment Setup | Yes | We initialize our ViT-Large [15, 64] image encoder, ViT-Base interaction decoders [64, 66], and DPT [43] heads with pre-trained weights from DUSt3R [64]. Our memory encoder is composed of a light-weight ViT (6 blocks) and a 2-layer MLP. Each memory feature has a dimensionality of 768. We use the AdamW optimizer [32] and the learning rate warms up to a maximum value of 5e-5 and decreases according to a cosine schedule. We train our model on 8 A800 NVIDIA GPUs for 15 days, which is a relatively low cost. |
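
The Experiment Setup row quotes a learning rate that warms up to a maximum of 5e-5 and then decreases on a cosine schedule. The sketch below illustrates one common form of such a schedule. It is not the authors' implementation: the linear warmup shape, the decay floor of zero, the function name `lr_at_step`, and the step counts in the usage lines are all assumptions, since the quoted text specifies only the peak value and the cosine decay.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, max_lr=5e-5, min_lr=0.0):
    """Learning rate at a given optimizer step.

    Assumed shape (not stated in the paper excerpt): linear warmup to
    max_lr over warmup_steps, then cosine decay to min_lr at total_steps.
    Only max_lr=5e-5 comes from the quoted text.
    """
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical step counts, chosen only to show the shape of the curve.
for s in (0, 99, 550, 1000):
    print(s, lr_at_step(s, total_steps=1000, warmup_steps=100))
```

With these hypothetical settings the rate peaks at 5e-5 when warmup ends, passes through half the peak midway through the decay phase, and reaches the floor at the final step.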