CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding
Authors: Eslam Mohamed Bakr, Mohamed Ayman Mohamed, Mahmoud Ahmed, Habib Slim, Mohamed Elhoseiny
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach through comprehensive experiments on the Nr3D, Sr3D, and ScanRefer benchmarks and show consistent performance gains compared to existing methods without requiring manually annotated data. |
| Researcher Affiliation | Academia | Eslam Mohamed Bakr, Mohamed Ayman, Mahmoud Ahmed, Habib Slim, Mohamed Elhoseiny; King Abdullah University of Science and Technology (KAUST); {eslam.abdelrahman, mohamed.mohamed.2, mahmoud.ahmed, habib.slim, mohamed.elhoseiny}@kaust.edu.sa |
| Pseudocode | Yes | We summarize our localization method in Algorithm 1. |
| Open Source Code | Yes | The code is available at github.com/eslambakr/CoT3D_VG. |
| Open Datasets | Yes | Datasets. To probe the effectiveness of our proposed framework, CoT3DRef, we conduct evaluations on three 3D visual-grounding benchmarks, namely Nr3D, Sr3D (Achlioptas et al., 2020) and ScanRefer (Chen et al., 2020). |
| Dataset Splits | Yes | As shown in Table 1, in the challenging setup, where we assume access to only 10% of the training data while testing on the entire test set, the parallel variant boosts performance by 4% and 6.5% over the vanilla MVT on Nr3D and Sr3D, respectively (row b). In contrast, our CoT3DRef framework surpasses the vanilla MVT by 10% and 16.4% on Nr3D and Sr3D, respectively (row c). |
| Hardware Specification | Yes | We used the PyTorch framework and a single NVIDIA A6000 GPU for training. |
| Software Dependencies | No | The paper mentions the "PyTorch framework" but does not specify a version number. Other software, such as BERT, GPT-3.5, and the Adam optimizer, is mentioned without specific versions. |
| Experiment Setup | Yes | The number of heads used is 7 and 16 for the Pathway module and CoT decoder, respectively. The number of proposals L and the maximum sentence length W are 52 and 24, respectively. ... The maximum number M of objects in the sentence, i.e., the output sequence length of our CoT decoder, is 8 and 3 for Nr3D and Sr3D, respectively. Following previous works... we randomly sample 1024 points for each proposal, set the hidden dimension d to 768, and train the model for 100 epochs from scratch using the weight initialization strategy described in (He et al., 2015). The initial learning rate is set to 10^-4 and decays by a factor of 0.65 every ten epochs. The Adam optimizer (Kingma & Ba, 2014) and a mini-batch size of 24 per GPU are used for training all the models. We set the loss weights as follows: λV = 5, λT = 0.5, λref = 5, and λdist = 1. |
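
To make the "Experiment Setup" row concrete, the following is a minimal PyTorch sketch of the quoted optimizer, learning-rate schedule, and loss weighting. It is a sketch under stated assumptions, not the authors' training code: the `model` placeholder and the four loss-term arguments of `total_loss` are hypothetical stand-ins for the paper's actual components.

```python
# Minimal sketch of the reported training configuration (not the authors' code).
# Assumptions: `model` is a placeholder module, and the four loss-term
# arguments are hypothetical stand-ins for the paper's weighted loss terms.
import torch

model = torch.nn.Linear(768, 768)  # placeholder; hidden dimension d = 768

# Adam optimizer with the reported initial learning rate of 10^-4.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Learning rate decays by a factor of 0.65 every ten epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.65)

# Reported loss weights: λV = 5, λT = 0.5, λref = 5, λdist = 1.
LAMBDA_V, LAMBDA_T, LAMBDA_REF, LAMBDA_DIST = 5.0, 0.5, 5.0, 1.0

def total_loss(loss_v, loss_t, loss_ref, loss_dist):
    """Weighted sum of the four loss terms, per the reported weights."""
    return (LAMBDA_V * loss_v + LAMBDA_T * loss_t
            + LAMBDA_REF * loss_ref + LAMBDA_DIST * loss_dist)

for epoch in range(100):  # trained for 100 epochs from scratch
    # ... forward/backward passes over mini-batches of size 24 per GPU ...
    scheduler.step()
```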
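
Similarly, the 10% data-efficiency setup quoted in the "Dataset Splits" row amounts to randomly subsampling the training indices while keeping the full test set; the placeholder dataset below is an assumption, not the benchmarks' actual loading code.

```python
# Hedged sketch of a 10% training split (not the authors' sampling code).
# Assumption: `full_train` is a placeholder for the Nr3D/Sr3D training data.
import torch
from torch.utils.data import Subset, TensorDataset

full_train = TensorDataset(torch.randn(1000, 8))  # placeholder dataset

# Randomly keep 10% of the training indices; evaluation uses the full test set.
generator = torch.Generator().manual_seed(0)
perm = torch.randperm(len(full_train), generator=generator)
keep = perm[: int(0.1 * len(full_train))]
train_10pct = Subset(full_train, keep.tolist())
```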