Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
Authors: Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform ablation studies showing advantages of our approach. We also demonstrate our model to significantly outperform the state of the art on the challenging Nr3D, Sr3D and ScanRefer 3D object grounding datasets. |
| Researcher Affiliation | Academia | Shizhe Chen1, Pierre-Louis Guhur1, Makarand Tapaswi2, Cordelia Schmid1, Ivan Laptev1 1Inria, École normale supérieure, CNRS, PSL Research University, 2IIIT Hyderabad |
| Pseudocode | No | The paper describes the model architecture and training process in detail through prose and diagrams (e.g., Figure 2 and Figure 3), but it does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and models are available on the project webpage [22]. Project webpage. https://cshizhe.github.io/projects/vil3dref.html. |
| Open Datasets | Yes | Nr3D dataset [8] contains 37,842 human-written sentences that refer to annotated objects in the 3D indoor scene dataset ScanNet [25]. |
| Dataset Splits | Yes | Nr3D dataset... It includes 641 scenes with 511 (resp. 130) scenes for training (resp. validation). Sr3D dataset... It has 1,018 training scenes and 255 validation scenes from ScanNet and 83,570 sentences in total. ScanRefer dataset... We follow the official split and use 36,665 and 9,508 samples for training and validation respectively. |
| Hardware Specification | Yes | All models are trained on a single NVIDIA RTX A6000 GPU. |
| Software Dependencies | No | The paper mentions using a pre-trained BERT model [39], PointNet++ [40], GloVe word vectors [41], and the AdamW algorithm [42], but it does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For the model architecture, we set the dimension d = 768 and use 12 heads for all the transformer layers. The text encoding module is a three-layer transformer initialized from BERT [39], and the multimodal fusion module contains four layers. The object encoding module PointNet++ [40] samples 1024 points for all the objects. The hyper-parameters in the loss function Eq (8) are set to λa = 1 and λh = 0.02. We train the model with a batch size of 128 and a learning rate of 0.0005 with warm-up of 5000 iterations and cosine decay scheduling. We train for 50 epochs for the teacher model and 100 epochs for the student model on Nr3D and ScanRefer datasets. |
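The learning-rate schedule quoted above (base rate 0.0005, 5000-iteration linear warm-up, cosine decay) can be sketched as a plain Python function. This is a minimal illustration of that standard schedule, not code from the authors' released repository; `total_steps` is a placeholder, since the paper reports epochs rather than total iterations.

```python
import math

def lr_at_step(step, base_lr=5e-4, warmup_steps=5000, total_steps=100_000):
    """Linear warm-up to base_lr, then cosine decay to zero.

    Matches the schedule described in the experiment setup:
    lr 0.0005 with a 5000-iteration warm-up and cosine decay.
    total_steps is an assumed value for illustration.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warm-up phase.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, `lr_at_step(0)` is 0, `lr_at_step(5000)` is the full base rate, and the rate approaches 0 as `step` nears `total_steps`. In practice such a function is typically wrapped in a framework scheduler (e.g. a PyTorch `LambdaLR`) around the AdamW optimizer the paper cites.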