Monocular Scene Reconstruction with 3D SDF Transformers

Authors: Weihao Yuan, Xiaodong Gu, Heng Li, Zilong Dong, Siyu Zhu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments on multiple datasets show that this 3D transformer network generates a more accurate and complete reconstruction, outperforming previous methods by a large margin. Remarkably, mesh accuracy is improved by 41.8% and mesh completeness by 25.3% on the ScanNet dataset. (The metric definitions are sketched in code after the table.)
Researcher Affiliation | Industry | Weihao Yuan, Xiaodong Gu, Heng Li, Zilong Dong, Siyu Zhu. Alibaba Group. {qianmu.ywh, dadong.gxd, baoshu.lh, list.dzl, siting.zsy}@alibaba-inc.com
Pseudocode | No | The paper describes the architecture and its components in text and figures (e.g., Figure 1, Figure 3), but it does not include explicit pseudocode or algorithm blocks. (A runnable toy sketch of the described window attention follows the table.)
Open Source Code | Yes | Project Page: https://weihaosky.github.io/former3d
Open Datasets | Yes | ScanNet (Dai et al., 2017) is a large-scale indoor dataset composed of 1613 RGB-D videos of 806 indoor scenes. ... TUM-RGBD (Sturm et al., 2012) and ICL-NUIM (Handa et al., 2014) are also datasets composed of RGB-D videos, but with a small number of scenes.
Dataset Splits | No | The paper states 'We follow the official train/test split, where there are 1513 scans used for training and 100 scans used for testing.' Train and test splits are mentioned explicitly, but no separate validation split is described. (A split-loading sketch follows the table.)
Hardware Specification | Yes | Our work is implemented in Pytorch and trained on Nvidia V100 GPUs. ... The runtime analysis is presented in Table 5. For a fair comparison to previous methods, the time is tested on a chunk of size 1.5 × 1.5 × 1.5 m³ with an Nvidia RTX 3090 GPU. (Grid-size arithmetic for this chunk follows the table.)
Software Dependencies | No | The paper states 'Our work is implemented in Pytorch' but does not give a version number for PyTorch or any other software dependency.
Experiment Setup | Yes | The network is optimized with the Adam optimizer (β1 = 0.9, β2 = 0.999) with a learning rate of 1 × 10⁻⁴. For a fair comparison with previous methods, the voxel size of the fine level is set to 4 cm, and the TSDF truncation distance is set to triple the voxel size; the voxel sizes of the medium and coarse levels are thus 8 cm and 16 cm, respectively. To balance efficiency and receptive field, the window size of the sparse window attention is set to 10. ... The view limit is set to 20 in training, meaning twenty images are input to the network per iteration, while the limit for testing is set to 150. (A configuration sketch follows the table.)
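For reference, the "accuracy" and "completeness" mesh metrics cited in the Research Type row are, in the evaluation protocol commonly used for ScanNet reconstruction benchmarks, mean nearest-neighbour distances between point clouds sampled from the predicted and ground-truth meshes. A minimal sketch under that assumption; this is not the paper's exact evaluation script, and the function name is ours:

```python
import numpy as np
from scipy.spatial import cKDTree

def mesh_accuracy_completeness(pred_pts, gt_pts):
    """Chamfer-style mesh metrics (common ScanNet protocol, assumed here).

    accuracy:     mean distance from predicted points to their nearest GT point
    completeness: mean distance from GT points to their nearest predicted point
    Lower is better for both. Inputs are (N, 3) arrays of points sampled from
    the predicted and ground-truth meshes.
    """
    acc = cKDTree(gt_pts).query(pred_pts)[0].mean()
    comp = cKDTree(pred_pts).query(gt_pts)[0].mean()
    return acc, comp

# Toy usage with random point clouds standing in for sampled mesh vertices.
acc, comp = mesh_accuracy_completeness(np.random.rand(5000, 3),
                                       np.random.rand(5000, 3))
print(f"accuracy={acc:.4f} m, completeness={comp:.4f} m")
```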
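Since the paper gives no pseudocode, here is a runnable toy sketch of the component it describes in text: self-attention restricted to non-overlapping 3D windows over a voxel grid (window size 10, matching the reported setting). This dense stand-in ignores the sparsity of the actual model, and the class and variable names are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class WindowAttention3D(nn.Module):
    """Dense toy stand-in for the paper's sparse window attention: voxels are
    grouped into non-overlapping 3D windows and self-attention runs within
    each window. The real model attends only over occupied (sparse) voxels."""

    def __init__(self, dim, window=10, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, D, H, W, C), with D, H, W divisible by window
        B, D, H, W, C = x.shape
        w = self.window
        # Partition the volume into (w, w, w) windows; flatten each to a token sequence.
        x = x.view(B, D // w, w, H // w, w, W // w, w, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, w ** 3, C)
        x, _ = self.attn(x, x, x)  # self-attention inside each window
        # Reverse the partition back to the original voxel layout.
        x = x.view(B, D // w, H // w, W // w, w, w, w, C)
        x = x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, D, H, W, C)
        return x

# Example on a small dense grid (window size 10 as in the paper's setting).
feat = torch.randn(1, 20, 20, 20, 32)
out = WindowAttention3D(dim=32, window=10, heads=4)(feat)
print(out.shape)  # torch.Size([1, 20, 20, 20, 32])
```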
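On the split question: the reported 1513/100 counts are consistent with one common convention in ScanNet reconstruction work, merging the official train (1201 scenes) and validation (312 scenes) lists for training and keeping the 100 test scenes for evaluation. A hedged sketch assuming the public ScanNet v2 split files; the merge is our assumption, not something the paper states:

```python
from pathlib import Path

def load_scannet_split(split_dir):
    """Assemble train/test scene lists from the official ScanNet v2 split
    files (file names from the public ScanNet benchmark). Merging train and
    val to reach 1513 training scenes is an assumption that matches the
    paper's reported counts."""
    read = lambda name: Path(split_dir, name).read_text().split()
    train = read("scannetv2_train.txt") + read("scannetv2_val.txt")  # 1201 + 312
    test = read("scannetv2_test.txt")                                # 100
    assert len(train) == 1513 and len(test) == 100
    return train, test
```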
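For a sense of scale in the runtime benchmark, the 1.5 m chunk maps onto modest voxel grids at each pyramid level. Simple arithmetic from the voxel sizes quoted in the setup row; rounding the grid dimension is our assumption:

```python
# Voxel counts for a 1.5 m × 1.5 m × 1.5 m chunk at each pyramid level.
chunk_m = 1.5
for level, voxel_m in [("coarse", 0.16), ("medium", 0.08), ("fine", 0.04)]:
    n = round(chunk_m / voxel_m)  # e.g. 1.5 / 0.04 = 37.5 -> ~38 at the fine level
    print(f"{level:>6}: ~{n}^3 = {n ** 3:,} voxels")
```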
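Finally, the quoted hyperparameters translate directly into code. A minimal sketch, assuming a placeholder model (the real SDF transformer is not reproduced here, and the config key names are ours):

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for the SDF transformer network

# Optimizer exactly as reported: Adam with betas (0.9, 0.999) and lr 1e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# Remaining reported settings, gathered into a config dict.
config = dict(
    voxel_size_fine=0.04,    # 4 cm fine level
    voxel_size_medium=0.08,  # 8 cm medium level
    voxel_size_coarse=0.16,  # 16 cm coarse level
    tsdf_trunc=3 * 0.04,     # truncation distance = 3x the fine voxel size
    attn_window=10,          # sparse window attention window size
    views_train=20,          # images fed to the network per training iteration
    views_test=150,          # image limit at test time
)
```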