CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding

Authors: Eslam Mohamed BAKR, Mohamed Ayman Mohamed, Mahmoud Ahmed, Habib Slim, Mohamed Elhoseiny

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our approach through comprehensive experiments on the Nr3D, Sr3D, and ScanRefer benchmarks and show consistent performance gains compared to existing methods without requiring manually annotated data.
Researcher Affiliation | Academia | Eslam Mohamed Bakr, Mohamed Ayman, Mahmoud Ahmed, Habib Slim, Mohamed Elhoseiny; King Abdullah University of Science and Technology (KAUST); {eslam.abdelrahman, mohamed.mohamed.2, mahmoud.ahmed, habib.slim, mohamed.elhoseiny}@kaust.edu.sa
Pseudocode | Yes | We summarize our localization method in Algorithm 1.
Open Source Code | Yes | The code is available at github.com/eslambakr/CoT3DVG.
Open Datasets | Yes | Datasets. To probe the effectiveness of our proposed framework, CoT3DRef, we conduct evaluations on three 3D visual-grounding benchmarks, namely Nr3D, Sr3D (Achlioptas et al., 2020) and ScanRefer (Chen et al., 2020).
Dataset Splits | Yes | As shown in Table 1, on the challenging setup, where we assume access to only 10% of the training data while testing on the entire test set, the parallel variant boosts the performance by 4% and 6.5% over the vanilla MVT using Nr3D and Sr3D, respectively (row b). In contrast, our CoT3DRef framework surpasses the vanilla MVT by 10% and 16.4% using Nr3D and Sr3D, respectively (row c).
Hardware Specification | Yes | We used the PyTorch framework and a single NVIDIA A6000 GPU for training.
Software Dependencies | No | The paper mentions the "PyTorch framework" but does not specify a version number. Other software, such as BERT, GPT-3.5, and the Adam optimizer, is mentioned without specific versions.
Experiment Setup | Yes | The numbers of heads used are 7 and 16 for the Pathway module and the CoT decoder, respectively. The number of proposals L and the maximum sentence length W are 52 and 24, respectively. ... The maximum number M of objects in the sentence, which is the output sequence length of our CoT decoder, is 8 and 3 for Nr3D and Sr3D, respectively. Following previous works ... we randomly sample 1024 points for each proposal, set the hidden dimension d to 768, and train the model for 100 epochs from scratch using the weight initialization strategy described in (He et al., 2015). The initial learning rate is set to 10^-4 and decreases by a factor of 0.65 every ten epochs. The Adam optimizer (Kingma & Ba, 2014) and a mini-batch size of 24 per GPU are used for training all the models. We set the loss weights as follows: λV = 5, λT = 0.5, λref = 5, and λdist = 1.
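
The Experiment Setup row above maps directly onto a concrete training configuration. Below is a minimal PyTorch sketch of the reported optimizer, learning-rate schedule, and loss weighting; the function names and the individual loss terms (l_v, l_t, l_ref, l_dist) are illustrative placeholders rather than the authors' code, and only the numeric values (initial learning rate 1e-4, decay factor 0.65 every ten epochs, λV = 5, λT = 0.5, λref = 5, λdist = 1) come from the paper.

```python
import torch
from torch import nn, optim


def build_optimizer(model: nn.Module):
    """Optimizer and schedule as reported: Adam with an initial learning
    rate of 1e-4, decayed by a factor of 0.65 every ten epochs."""
    optimizer = optim.Adam(model.parameters(), lr=1e-4)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.65)
    return optimizer, scheduler


# Reported loss weights (lambda_V, lambda_T, lambda_ref, lambda_dist).
LAMBDA_V, LAMBDA_T, LAMBDA_REF, LAMBDA_DIST = 5.0, 0.5, 5.0, 1.0


def total_loss(l_v: torch.Tensor, l_t: torch.Tensor,
               l_ref: torch.Tensor, l_dist: torch.Tensor) -> torch.Tensor:
    """Weighted sum of the four loss terms; the term names here are
    placeholders for the paper's visual, text, referring, and distance
    losses, combined with the reported weights."""
    return (LAMBDA_V * l_v + LAMBDA_T * l_t
            + LAMBDA_REF * l_ref + LAMBDA_DIST * l_dist)
```

Under the reported setup (mini-batch size of 24 per GPU, 100 epochs), scheduler.step() would be called once per epoch so that the learning rate follows the stated decay schedule.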