Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds
Authors: Heng Wang, Chaoyi Zhang, Jianhui Yu, Weidong Cai
IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluated on two benchmark datasets, ScanRefer and ReferIt3D, our proposed SpaCap3D outperforms the baseline method Scan2Cap by 4.94% and 9.61% in CIDEr@0.5IoU, respectively. |
| Researcher Affiliation | Academia | Heng Wang, Chaoyi Zhang, Jianhui Yu and Weidong Cai, School of Computer Science, University of Sydney, Australia {heng.wang, chaoyi.zhang, jianhui.yu, tom.cai}@sydney.edu.au |
| Pseudocode | No | The paper describes its methods in text and uses architectural diagrams (e.g., Figure 2, 5, 6) but does not provide pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our project page with source code and supplementary files is available at https://SpaCap3D.github.io/. |
| Open Datasets | Yes | We evaluate our proposed method on ScanRefer [Chen et al., 2020] and Nr3D from ReferIt3D [Achlioptas et al., 2020], both of which provide free-form human descriptions for objects in ScanNet [Dai et al., 2017]. |
| Dataset Splits | Yes | Same as Scan2Cap [Chen et al., 2021], for ScanRefer/Nr3D, we train on 36,665/32,919 captions for 7,875/4,664 objects from 562/511 scenes and evaluate on 9,508/8,584 descriptions for 2,068/1,214 objects from 141/130 scenes. The model is checked and saved when it reaches the best CIDEr@0.5IoU on the val split, evaluated every 2000 iterations. |
| Hardware Specification | Yes | All experiments were trained on a single GeForce RTX 2080Ti GPU with a batch size of 8 samples for 50 epochs. |
| Software Dependencies | No | The paper mentions using PyTorch and ADAM but does not provide version numbers for these software dependencies, which would be needed for a fully reproducible setup. |
| Experiment Setup | Yes | We set the number of encoder and decoder blocks n as 6 and the number of heads in multi-head attentions as 8. The dimensionality of input and output of each layer is 128, except that of the inner layer of the feed-forward networks, which is 2048. We train end-to-end with ADAM [Kingma and Ba, 2015] with a learning rate of 1e-3, a weight decay factor of 1e-5, and the same data augmentation as Scan2Cap, using a batch size of 8 samples for 50 epochs. (See the configuration sketch after this table.) |
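
Below is a minimal sketch of the reported training configuration, assuming PyTorch (the paper does not state versions). The `nn.Transformer` module is only a stand-in for SpaCap3D's encoder-decoder: the paper's spatiality-guided attention is not reproduced here, and the names are illustrative.

```python
import torch
import torch.nn as nn

# Hyperparameters quoted in the Experiment Setup row above.
NUM_BLOCKS = 6        # encoder and decoder blocks (n)
NUM_HEADS = 8         # heads in multi-head attention
D_MODEL = 128         # input/output dimensionality of each layer
D_FFN = 2048          # inner dimensionality of the feed-forward networks
BATCH_SIZE = 8
EPOCHS = 50
LR = 1e-3
WEIGHT_DECAY = 1e-5

# Stand-in encoder-decoder with the stated shape parameters; the real
# SpaCap3D model replaces vanilla attention with spatiality-guided attention.
model = nn.Transformer(
    d_model=D_MODEL,
    nhead=NUM_HEADS,
    num_encoder_layers=NUM_BLOCKS,
    num_decoder_layers=NUM_BLOCKS,
    dim_feedforward=D_FFN,
    batch_first=True,
)

# ADAM optimizer with the reported learning rate and weight decay.
optimizer = torch.optim.Adam(
    model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY
)
```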