Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds

Authors: Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluated on two benchmark datasets, ScanRefer and ReferIt3D, our proposed SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in CIDEr@0.5IoU, respectively.
Researcher Affiliation | Academia | Heng Wang, Chaoyi Zhang, Jianhui Yu and Weidong Cai, School of Computer Science, University of Sydney, Australia. {heng.wang, chaoyi.zhang, jianhui.yu, tom.cai}@sydney.edu.au
Pseudocode | No | The paper describes its methods in text and uses architectural diagrams (e.g., Figures 2, 5, and 6) but does not provide pseudocode or algorithm blocks.
Open Source Code | Yes | Our project page with source code and supplementary files is available at https://SpaCap3D.github.io/.
Open Datasets | Yes | We evaluate our proposed method on ScanRefer [Chen et al., 2020] and Nr3D from ReferIt3D [Achlioptas et al., 2020], both of which provide free-form human descriptions for objects in ScanNet [Dai et al., 2017].
Dataset Splits | Yes | Same as Scan2Cap [Chen et al., 2021], for ScanRefer/Nr3D, we train on 36,665/32,919 captions for 7,875/4,664 objects from 562/511 scenes and evaluate on 9,508/8,584 descriptions for 2,068/1,214 objects from 141/130 scenes. The model is checked every 2,000 iterations and saved when it reaches the best CIDEr@0.5IoU on the val split.
Hardware Specification | Yes | All experiments were trained on a single GeForce RTX 2080Ti GPU with a batch size of 8 samples for 50 epochs.
Software Dependencies | No | The paper mentions using PyTorch and ADAM but does not provide specific version numbers for these dependencies, which would be needed for exact reproduction.
Experiment Setup | Yes | We set the number of encoder and decoder blocks n to 6 and the number of heads in multi-head attention to 8. The dimensionality of the input and output of each layer is 128, except for the inner layer of the feed-forward networks, which is 2048. We train end-to-end with ADAM [Kingma and Ba, 2015] at a learning rate of 1 × 10⁻³, a weight decay factor of 1 × 10⁻⁵, and the same data augmentation as Scan2Cap, with a batch size of 8 samples for 50 epochs.
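
The reported hyperparameters translate directly into a short PyTorch configuration. The sketch below is illustrative only: it stands in a vanilla nn.Transformer for the paper's spatiality-guided architecture, and the evaluation helper, checkpoint filename, and iteration count are hypothetical placeholders, not the authors' code (available at https://SpaCap3D.github.io/).

```python
import torch
import torch.nn as nn

# Hyperparameters as stated in the paper.
D_MODEL  = 128   # input/output dimensionality of each layer
FFN_DIM  = 2048  # inner-layer size of the feed-forward networks
N_BLOCKS = 6     # number of encoder and decoder blocks
N_HEADS  = 8     # heads in multi-head attention

# Stand-in for the paper's spatiality-guided transformer; the actual model
# additionally encodes spatial relations between detected objects.
model = nn.Transformer(
    d_model=D_MODEL,
    nhead=N_HEADS,
    num_encoder_layers=N_BLOCKS,
    num_decoder_layers=N_BLOCKS,
    dim_feedforward=FFN_DIM,
)

# ADAM with the reported learning rate and weight decay.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

def eval_cider_at_0_5_iou(model: nn.Module) -> float:
    """Placeholder for the paper's CIDEr@0.5IoU evaluation on the val split."""
    return 0.0

# Checkpoint cadence described in the paper: evaluate every 2,000 iterations
# and keep the weights that achieve the best val CIDEr@0.5IoU.
TOTAL_ITERATIONS = 10_000  # placeholder; the paper trains 50 epochs at batch size 8
best_cider = float("-inf")
for iteration in range(1, TOTAL_ITERATIONS + 1):
    # ... forward pass, captioning loss, optimizer.step() omitted ...
    if iteration % 2000 == 0:
        cider = eval_cider_at_0_5_iou(model)
        if cider > best_cider:
            best_cider = cider
            torch.save(model.state_dict(), "best_spacap3d.pth")
```

Note that d_model = 128 divides evenly by the 8 attention heads (16 dimensions per head), which is required for multi-head attention to be well-defined.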