Exploiting Contextual Objects and Relations for 3D Visual Grounding
Authors: Li Yang, Chunfeng Yuan, Ziqi Zhang, Zhongang Qi, Yan Xu, Wei Liu, Ying Shan, Bing Li, Weiping Yang, Peng Li, Yan Wang, Weiming Hu
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our model on the challenging Nr3D, Sr3D, and ScanRefer datasets and demonstrate state-of-the-art performance. |
| Researcher Affiliation | Collaboration | 1. State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA; 2. ARC Lab, Tencent PCG; 3. The Chinese University of Hong Kong; 4. Education Management Information Center, Ministry of Education; 5. Alibaba Group; 6. Zhejiang Linkheer Science And Technology Co., Ltd.; 7. School of Artificial Intelligence, University of Chinese Academy of Sciences; 8. School of Information Science and Technology, ShanghaiTech University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code will be public at https://github.com/yangli18/CORE-3DVG. |
| Open Datasets | Yes | Nr3D [6] is built on the 3D indoor scene dataset ScanNet [35]. It contains 41,503 human-annotated text descriptions, covering 76 object categories and 707 indoor scenes. Sr3D [6] contains 83,572 descriptions that are automatically generated using specific templates. ScanRefer [7] provides 51,583 text descriptions of 11,046 objects in 800 3D scenes from ScanNet. |
| Dataset Splits | Yes | Nr3D is divided into Easy and Hard subsets depending on whether there are objects that share the same category as the target in the scene. ... Sr3D dataset is divided into multiple subsets for evaluation. ... ScanRefer ... The official division takes 36,665 samples as the training set and 9,508 as the test set. According to whether the target object category is unique in the scene, the dataset is divided into "Unique" and "Multiple" subsets. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. It only mentions general training without hardware specifics. |
| Software Dependencies | No | The paper mentions software components like 'AdamW optimizer', 'RoBERTa', and 'PointNet++ network' but does not provide specific version numbers for these or for general software dependencies like programming languages or deep learning frameworks. |
| Experiment Setup | Yes | We utilize the AdamW optimizer [36] to train our model with a batch size of 24. For visual feature encoding, we utilize the PointNet++ network [37] with an initial learning rate of 10⁻³. The rest of the model has an initial learning rate of 10⁻⁴, and the weight decay value is set to 5×10⁻⁴. ... We train our model for 120 epochs on the Nr3D dataset, 60 epochs on the Sr3D dataset, and 100 epochs on the ScanRefer dataset. The hyper-parameter in Equation 6 is set to 0.4, and we set the weight hyper-parameters λgiou = 1 and λL1 = 5 for the Lbox loss. The numbers of decoder layers for the text-guided object detection, relation matching, and target identification networks are set to Nd = 3, Nr = 2, and Ng = 3, respectively. |
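The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. This is purely illustrative: the key names below are assumptions, not identifiers from the authors' released code.

```python
# Hypothetical training configuration assembled from the paper's reported
# settings (key names are invented for illustration).
config = {
    "optimizer": "AdamW",
    "batch_size": 24,
    "lr_visual_encoder": 1e-3,   # initial LR for the PointNet++ visual encoder
    "lr_rest": 1e-4,             # initial LR for the rest of the model
    "weight_decay": 5e-4,
    "epochs": {"Nr3D": 120, "Sr3D": 60, "ScanRefer": 100},
    "loss_weights": {"lambda_giou": 1.0, "lambda_L1": 5.0},  # weights in the Lbox loss
    "decoder_layers": {"N_d": 3, "N_r": 2, "N_g": 3},  # detection / relation / target decoders
}
```

In frameworks such as PyTorch, the two learning rates would typically be realized as separate optimizer parameter groups sharing a single weight-decay value.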