Cross-Modal Match for Language Conditioned 3D Object Grounding
Authors: Yachao Zhang, Runze Hu, Ronghui Li, Yanyun Qu, Yuan Xie, Xiu Li
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method in mainstream evaluation settings on three datasets, and the results demonstrate the effectiveness of the proposed method. |
| Researcher Affiliation | Academia | 1 Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China; 2 School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China; 3 School of Informatics, Xiamen University, Xiamen 361000, China; 4 School of Computer Science and Technology, East China Normal University, Shanghai 200062, China |
| Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | We leverage three recently released datasets, i.e., Nr3D (Achlioptas et al. 2020), Sr3D (Achlioptas et al. 2020) and ScanRefer (Chen, Chang, and Nießner 2020) built on the 3D scenes of ScanNet (Dai et al. 2017) to evaluate performance. We follow the official split for training and validation. |
| Dataset Splits | Yes | We follow the official split for training and validation. For the Nr3D and Sr3D datasets, two additional evaluation splits are introduced: 1) by the number of distractors (more distractors indicate more difficulty), sentences are split into an easy subset (at most 2 distractors) and a hard subset (more than 2 distractors); 2) by whether the sentence requires a specific viewpoint to ground the referred object, the data can also be partitioned into view-dependent and view-independent subsets. |
| Hardware Specification | Yes | It is trained and evaluated on one NVIDIA RTX 3090 GPU with 24GB RAM. |
| Software Dependencies | Yes | We implement our model by using PyTorch based on Python 3.8. |
| Experiment Setup | Yes | We set batch size as 128, and learning rate as 0.0005 with a warm-up of 5,000 iterations and cosine decay scheduling. Our model is trained for 100 epochs using the Adam optimizer. We directly set αa = 1 and αb = 1. We set the grid w of BEV as 0.5m. |
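The Nr3D/Sr3D evaluation subsets described above follow simple rules, which can be sketched as a small classifier. The distractor threshold comes from the paper; the `view_dependent` flag is assumed to be supplied by the dataset annotations:

```python
def assign_subsets(num_distractors, view_dependent):
    """Classify a referring sentence into the two Nr3D/Sr3D evaluation splits.

    Per the paper: at most 2 distractors -> 'easy', more than 2 -> 'hard'.
    The view-dependence flag is assumed to come from dataset annotations.
    """
    difficulty = "easy" if num_distractors <= 2 else "hard"
    view = "view-dependent" if view_dependent else "view-independent"
    return difficulty, view
```

A sentence with three distractors that requires a specific viewpoint would land in the hard, view-dependent subsets.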
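The learning-rate schedule in the experiment setup (base rate 0.0005, 5,000-iteration warm-up, cosine decay) can be sketched as a plain function. The base rate and warm-up length are taken from the paper; the total step count, the linear shape of the warm-up, and the decay-to-zero floor are assumptions for illustration:

```python
import math

def lr_at(step, base_lr=5e-4, warmup_steps=5000, total_steps=100_000):
    """Learning rate with linear warm-up followed by cosine decay.

    base_lr and warmup_steps follow the paper; total_steps and the
    zero floor at the end of training are illustrative assumptions.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warm-up phase.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

The same shape can be handed to `torch.optim.lr_scheduler.LambdaLR` as a multiplier on the optimizer's base rate.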