Cross-modal Learning for Image-Guided Point Cloud Shape Completion

Authors: Emanuele Aiello, Diego Valsesia, Enrico Magli

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments show significant improvements over state-of-the-art supervised methods for both unimodal and multimodal completion." and Section 4, "Experimental results".
Researcher Affiliation | Academia | Emanuele Aiello, Politecnico di Torino, Italy (emanuele.aiello@polito.it); Diego Valsesia, Politecnico di Torino, Italy (diego.valsesia@polito.it); Enrico Magli, Politecnico di Torino, Italy (enrico.magli@polito.it)
Pseudocode | No | The paper provides an "Architecture overview" diagram (Figure 1) but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Code of the project: https://github.com/diegovalsesia/XMFnet
Open Datasets | Yes | "All the experiments are conducted on the ShapeNet-ViPC [12] dataset. The dataset contains 38,328 objects from 13 categories; for each object it comprises 24 partial point clouds with occlusions generated under 24 viewpoints, using the same settings as ShapeNetRendering [35]."
Dataset Splits | No | "We used 31,650 objects from eight categories, with 80% of them for training and 20% for testing." The split percentages are stated, but the exact object-level split is not specified. (A split sketch follows this table.)
Hardware Specification | Yes | "The proposed framework is implemented in PyTorch and trained on an Nvidia V100 GPU."
Software Dependencies | No | "The proposed framework is implemented in PyTorch and trained on an Nvidia V100 GPU. The differentiable renderer has been implemented with PyTorch3D [37]." Specific version numbers for PyTorch or PyTorch3D are not mentioned.
Experiment Setup | Yes | "The partial point cloud is downsampled by farthest point sampling to N1 = 1024 points and concatenated to the output of the decoder, which produces N1 = 1024 points, leading to a completed point cloud with 2048 points. The decoder has K = 8 branches, each of them producing M = 128 points. The point cloud encoder employs EdgeConv and SAGPooling layers; the EdgeConv layers select k = 20 nearest neighbors, while the two pooling layers use k = 16 and k = 6 nearest neighbors, respectively. The original point cloud is overall downsampled by a factor of 16, resulting in NX = 128 points with FX = 256 features. The image encoder is built with a ResNet18 [36] backbone; it extracts NI = 14 × 14 = 196 pixels with FI = 256 features. The multihead attention has 4 attention heads, with embedding size F = 256. In the LI loss we use λ = 0.15. The mask factor for the edge detector has been set to ε = 0.4. The differentiable renderer has been implemented with PyTorch3D [37]. The rendered silhouettes have size H × W = 224 × 224, the same size as the input views in our experiments. We adopt radius ρ = 0.025 in point rasterization. Class-specific training is performed for all models, using the Adam optimizer [38] for roughly 200 epochs with a batch size of 128. The learning rate is initialized to 0.001 and reduced by a factor of 10 at epochs 25 and 125." (Sketches of the training recipe and the renderer setup follow this table.)
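Since the paper gives only the 80%/20% percentages for the dataset split, the following is a minimal sketch of how an object-level split could be reproduced in PyTorch. The stand-in dataset and the seed are assumptions, not the authors' actual split.

    import torch
    from torch.utils.data import TensorDataset, random_split

    # Stand-in for the ShapeNet-ViPC loader (the real one is in the XMFnet repo):
    # 31,650 objects from eight categories, as quoted above.
    full_set = TensorDataset(torch.arange(31650))

    n_train = int(0.8 * len(full_set))  # 80% for training -> 25,320 objects
    train_set, test_set = random_split(
        full_set,
        [n_train, len(full_set) - n_train],  # 20% for testing -> 6,330 objects
        generator=torch.Generator().manual_seed(0),  # seed is an assumption
    )

The optimization recipe in the Experiment Setup row maps directly onto standard PyTorch components. Below is a sketch, with a dummy model and loss standing in for XMFnet and its actual objectives:

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    model = nn.Linear(3, 3)  # placeholder for XMFnet (see the linked repository)
    loader = DataLoader(TensorDataset(torch.randn(256, 3)), batch_size=128)

    # Adam, lr = 0.001, reduced by a factor of 10 at epochs 25 and 125.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[25, 125], gamma=0.1)

    for epoch in range(200):  # "roughly 200 epochs"
        for (x,) in loader:
            loss = model(x).pow(2).mean()  # placeholder loss, not the paper's objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()

Finally, a differentiable silhouette renderer of the kind described can be assembled from PyTorch3D's point-rendering components. The paper specifies only the 224 × 224 image size and the ρ = 0.025 rasterization radius, so the camera model and points_per_pixel below are assumptions:

    import torch
    from pytorch3d.renderer import (
        AlphaCompositor, FoVPerspectiveCameras,
        PointsRasterizationSettings, PointsRasterizer, PointsRenderer,
    )
    from pytorch3d.structures import Pointclouds

    raster_settings = PointsRasterizationSettings(
        image_size=224,       # matches the 224 x 224 input views
        radius=0.025,         # rasterization radius from the paper
        points_per_pixel=10,  # assumption: not specified in the paper
    )
    renderer = PointsRenderer(
        rasterizer=PointsRasterizer(
            cameras=FoVPerspectiveCameras(),  # per-view camera params are assumptions
            raster_settings=raster_settings,
        ),
        compositor=AlphaCompositor(),
    )

    pts = torch.randn(1, 2048, 3)    # a completed cloud with 2048 points
    feats = torch.ones(1, 2048, 1)   # constant features -> silhouette-like image
    silhouette = renderer(Pointclouds(points=pts, features=feats))  # (1, 224, 224, 1)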