Cross-modal Learning for Image-Guided Point Cloud Shape Completion
Authors: Emanuele Aiello, Diego Valsesia, Enrico Magli
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show significant improvements over state-of-the-art supervised methods for both unimodal and multimodal completion. (Section 4: Experimental results) |
| Researcher Affiliation | Academia | Emanuele Aiello, Politecnico di Torino, Italy (emanuele.aiello@polito.it); Diego Valsesia, Politecnico di Torino, Italy (diego.valsesia@polito.it); Enrico Magli, Politecnico di Torino, Italy (enrico.magli@polito.it) |
| Pseudocode | No | The paper provides an 'Architecture overview' diagram (Figure 1) but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code of the project: https://github.com/diegovalsesia/XMFnet |
| Open Datasets | Yes | All the experiments are conducted on the ShapeNet-ViPC [12] dataset. The dataset contains 38,328 objects from 13 categories; for each object it comprises 24 partial point clouds with occlusions generated under 24 viewpoints, using the same settings as ShapeNet Rendering [35]. |
| Dataset Splits | No | we used 31,650 objects from eight categories, with 80% of them for training and 20% for testing. |
| Hardware Specification | Yes | The proposed framework is implemented in PyTorch and trained on an Nvidia V100 GPU. |
| Software Dependencies | No | The proposed framework is implemented in PyTorch and trained on an Nvidia V100 GPU. The differentiable renderer has been implemented with PyTorch3D [37]. Specific version numbers for PyTorch or PyTorch3D are not mentioned. |
| Experiment Setup | Yes | The partial point cloud is downsampled by farthest point sampling to N1 = 1024 points and concatenated to the output of the decoder, which produces N1 = 1024 points, leading to a completed point cloud with 2048 points. The decoder has K = 8 branches, each of them producing M = 128 points. The point cloud encoder employs EdgeConv and SAGPooling layers; the EdgeConv layers select k = 20 nearest neighbors, while the two pooling layers use k = 16 and k = 6 nearest neighbors, respectively. The original point cloud is overall downsampled by a factor of 16, resulting in N_X = 128 points with F_X = 256 features. The image encoder is built with a ResNet18 [36] backbone; it extracts N_I = 14 × 14 = 196 pixels with F_I = 256 features. The multi-head attention has 4 attention heads, with embedding size F = 256. In the L_I loss we use λ = 0.15. The mask factor for the edge detector has been set to ε = 0.4. The differentiable renderer has been implemented with PyTorch3D [37]. The rendered silhouettes have size H × W = 224 × 224, the same size as the input views in our experiments. We adopt radius ρ = 0.025 in point rasterization. Class-specific training is performed for all models, using the Adam optimizer [38] for roughly 200 epochs with a batch size of 128. The learning rate is initialized to 0.001 and reduced by a factor of 10 at epochs 25 and 125. |
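
The input/output geometry described in the setup row can be illustrated with a short sketch. This is a minimal illustration, not the authors' code: it assumes PyTorch3D's `sample_farthest_points` for the farthest-point-sampling step and uses random tensors as stand-ins for the partial scan and the decoder output.

```python
import torch
from pytorch3d.ops import sample_farthest_points

partial = torch.rand(1, 3000, 3)        # stand-in for a partial scan (any size >= 1024)
# Farthest point sampling down to N1 = 1024 points.
partial_fps, _ = sample_farthest_points(partial, K=1024)
decoder_out = torch.rand(1, 1024, 3)    # stand-in for the decoder's K=8 branches x M=128 points
# Concatenating the preserved input with the generated points gives the 2048-point completion.
completed = torch.cat([partial_fps, decoder_out], dim=1)
assert completed.shape == (1, 2048, 3)
```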
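The differentiable silhouette renderer can likewise be approximated with PyTorch3D's point rasterization API using the reported settings (224 × 224 output, point radius ρ = 0.025). The camera placement and `points_per_pixel` below are assumptions, since the paper does not specify them; the repository linked above contains the actual implementation.

```python
import torch
from pytorch3d.structures import Pointclouds
from pytorch3d.renderer import (
    FoVPerspectiveCameras,
    PointsRasterizationSettings,
    PointsRasterizer,
    PointsRenderer,
    AlphaCompositor,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Rasterization settings from the paper: 224x224 silhouettes, point radius 0.025.
raster_settings = PointsRasterizationSettings(
    image_size=224,
    radius=0.025,
    points_per_pixel=10,  # assumption: not reported in the paper
)
# Assumption: a default camera pushed back along z; the real cameras follow the 24 dataset viewpoints.
cameras = FoVPerspectiveCameras(device=device, T=torch.tensor([[0.0, 0.0, 2.0]], device=device))
renderer = PointsRenderer(
    rasterizer=PointsRasterizer(cameras=cameras, raster_settings=raster_settings),
    compositor=AlphaCompositor(),
)

# Constant white per-point features turn the rendering into a soft silhouette.
points = torch.rand(1, 2048, 3, device=device) - 0.5   # stand-in completed cloud
features = torch.ones(1, 2048, 1, device=device)
silhouette = renderer(Pointclouds(points=points, features=features))[..., 0]  # (1, 224, 224)
```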
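Finally, the reported optimizer schedule (Adam, learning rate 0.001 decayed by 10× at epochs 25 and 125, batch size 128, roughly 200 epochs) maps onto a standard PyTorch training skeleton. `XMFNet`, `train_loader`, and `completion_loss` are placeholder names for this sketch, not the repository's actual identifiers.

```python
import torch

model = XMFNet().cuda()                      # placeholder name for the paper's network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Reduce the learning rate by a factor of 10 at epochs 25 and 125, as reported.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[25, 125], gamma=0.1)

for epoch in range(200):                     # "roughly 200 epochs"
    for partial, view, gt in train_loader:   # class-specific loader, batch size 128
        optimizer.zero_grad()
        pred = model(partial.cuda(), view.cuda())
        loss = completion_loss(pred, gt.cuda())  # full objective: see the paper and linked repo
        loss.backward()
        optimizer.step()
    scheduler.step()
```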