Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Puzzles: Unbounded Video-Depth Augmentation for Scalable End-to-End 3D Reconstruction

Authors: Jiahao Ma, Lei Wang, Miaomiao Liu, David Ahmedt-Aristizabal, Chuong Nguyen

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments show that integrating Puzzles into existing video-based 3D reconstruction pipelines consistently boosts performance, all without modifying the underlying network architecture. Notably, models trained on only 10% of the original data, augmented with Puzzles, still achieve accuracy comparable to those trained on the full dataset.
Researcher Affiliation	Academia	Jiahao Ma1,3, Lei Wang2,3, Miaomiao Liu1, David Ahmedt-Aristizabal3, Chuong Nguyen1,3 1Australian National University, 2Griffith University, 3Data61/CSIRO
Pseudocode	No	The paper describes methods and processes but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code	Yes	The official source code for our method is available on GitHub: https://github.com/Jiahao-Ma/puzzles-code.
Open Datasets	Yes	We train on a blended corpus comprising ScanNet-v2 [12], ARKitScenes [45], a selected Habitat subset [46], and the in-the-wild/object dataset BlendedMVS [47], totaling approximately 14 million images. For evaluations in Sections 4.2 and 4.3, we draw uniform subsamples of varying size to study the scaling behavior of Puzzles. To assess the impact of our augmentation, we evaluate on three unseen datasets: 7Scenes [48], NRGBD [49] and DTU [50] using Accuracy (Acc) and Completion (Comp) metrics from [9].
Dataset Splits	Yes	Notably, models trained with Puzzles on just 1/10 of the data match or exceed the performance of full-data baselines. Beyond 1/10, Puzzles consistently outperforms training on the entire dataset without augmentation. The experiment illustrated in the middle of Figure 7.B compares Puzzles augmentation (black) to the baseline without augmentation (blue), over data fractions ranging from 1/20 to full.
Hardware Specification	Yes	Training was conducted on eight H100 GPUs for 120 epochs.
Software Dependencies	No	The paper mentions applying a "standard PyTorch data augmentation pipeline" but does not specify version numbers for PyTorch or any other software libraries.
Experiment Setup	Yes	We adopt three representative video-based 3R-series methods, Spann3R [9], SLAM3R [10], and Fast3R [11], as baselines. Since each was originally trained on a distinct dataset, we preserve their published training protocols and architectures, and retrain them on a unified dataset both with and without our Puzzles data augmentation. ... Using Spann3R as an example, we follow the official training setup and train for 120 epochs on the same dataset with the same 14M training samples. The only change is the application of Puzzles augmentation.