Cross-Modal Match for Language Conditioned 3D Object Grounding

Authors: Yachao Zhang, Runze Hu, Ronghui Li, Yanyun Qu, Yuan Xie, Xiu Li

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method in mainstream evaluation settings on three datasets, and the results demonstrate the effectiveness of the proposed method.
Researcher Affiliation | Academia | (1) Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China; (2) School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China; (3) School of Informatics, Xiamen University, Xiamen 361000, China; (4) School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | Yes | We leverage three recently released datasets, i.e., Nr3D (Achlioptas et al. 2020), Sr3D (Achlioptas et al. 2020), and ScanRefer (Chen, Chang, and Nießner 2020), built on the 3D scenes of ScanNet (Dai et al. 2017), to evaluate performance. We follow the official split for training and validation.
Dataset Splits | Yes | We follow the official split for training and validation. Additional validation subsets are also used: for the Nr3D and Sr3D datasets, two further splits are introduced at evaluation time. (1) By the number of distractors (more distractors indicate greater difficulty), sentences are split into an easy subset (at most 2 distractors) and a hard subset (more than 2 distractors). (2) By whether the sentence requires a specific viewpoint to ground the referred object, the data is partitioned into view-dependent and view-independent subsets.
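The two evaluation splits above can be sketched as simple partitions. This is a minimal illustration, not the paper's released tooling; the per-sample fields `num_distractors` and `view_dependent` are assumed names for the properties described in the report.

```python
# Hypothetical sketch of the Nr3D/Sr3D evaluation splits.
# Field names (`num_distractors`, `view_dependent`) are assumptions,
# not taken from any released data format.

def split_by_difficulty(samples):
    """Easy: at most 2 distractors; hard: more than 2 distractors."""
    easy = [s for s in samples if s["num_distractors"] <= 2]
    hard = [s for s in samples if s["num_distractors"] > 2]
    return easy, hard

def split_by_view(samples):
    """View-dependent vs. view-independent sentences."""
    dep = [s for s in samples if s["view_dependent"]]
    indep = [s for s in samples if not s["view_dependent"]]
    return dep, indep

# Toy examples for illustration only.
samples = [
    {"num_distractors": 1, "view_dependent": False},
    {"num_distractors": 4, "view_dependent": True},
    {"num_distractors": 2, "view_dependent": True},
]
easy, hard = split_by_difficulty(samples)
dep, indep = split_by_view(samples)
```

Note that the two partitions are independent: every sentence belongs to exactly one difficulty subset and exactly one view subset.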
Hardware Specification | Yes | It is trained and evaluated on one NVIDIA RTX 3090 GPU with 24GB RAM.
Software Dependencies | Yes | We implement our model using PyTorch based on Python 3.8.
Experiment Setup | Yes | We set the batch size to 128 and the learning rate to 0.0005, with a warm-up of 5,000 iterations and cosine decay scheduling. Our model is trained for 100 epochs using the Adam optimizer. We directly set αa = 1 and αb = 1. We set the BEV grid size w to 0.5m.
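The reported learning-rate schedule (base LR 0.0005, 5,000-iteration warm-up, cosine decay) can be sketched as a standalone function. This is a hedged sketch: the paper does not specify the warm-up shape or the decay floor, so linear warm-up and decay to zero are assumptions here, and `total_iters` is a placeholder for the full training length.

```python
import math

# Assumed schedule shape: linear warm-up for 5,000 iterations to the
# reported base LR of 5e-4, then cosine decay to zero over the remaining
# iterations. The warm-up curve and zero floor are assumptions.
BASE_LR = 5e-4
WARMUP_ITERS = 5_000

def lr_at(step, total_iters):
    """Learning rate at a given optimizer step."""
    if step < WARMUP_ITERS:
        return BASE_LR * step / WARMUP_ITERS  # linear warm-up
    progress = (step - WARMUP_ITERS) / (total_iters - WARMUP_ITERS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

In a PyTorch training loop this would typically be applied per step by setting `param_group["lr"]` on an `Adam` optimizer, or expressed as a `LambdaLR` multiplier.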