3D-Aware Visual Question Answering about Parts, Poses and Occlusions
Authors: Xingrui Wang, Wufei Ma, Zhuowan Li, Adam Kortylewski, Alan L. Yuille
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show our model PO3D-VQA outperforms existing methods significantly, but a notable performance gap remains compared to 2D VQA benchmarks, indicating that 3D-aware VQA remains an important open research area. |
| Researcher Affiliation | Academia | 1. Johns Hopkins University, 2. Max Planck Institute for Informatics, 3. University of Freiburg |
| Pseudocode | No | The paper describes its methods in prose and with diagrams, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/XingruiWang/3D-Aware-VQA. |
| Open Datasets | No | We introduce the task of 3D-aware VQA, which focuses on challenging questions that require a compositional reasoning over the 3D structure of visual scenes. First, we introduce Super-CLEVR-3D, a compositional reasoning dataset that contains questions about object parts, their 3D poses, and occlusions. |
| Dataset Splits | Yes | The dataset splits follow the Super-CLEVR dataset, where we have 20k images for training, 5k for validation, and 5k for testing. |
| Hardware Specification | Yes | We train the 6D pose estimator (including the contrastive feature backbone and the neural mesh models for each of the 5 classes) for 15k iterations with batch size 15, which takes around 2 hours on an NVIDIA RTX A5000 for each class. |
| Software Dependencies | No | The paper mentions using a ResNet-50 for the attribute classifier, but does not provide specific software library names with version numbers (e.g., TensorFlow, PyTorch) or other dependencies. |
| Experiment Setup | Yes | We train the 6D pose estimator (including the contrastive feature backbone and the neural mesh models for each of the 5 classes) for 15k iterations with batch size 15, which takes around 2 hours on an NVIDIA RTX A5000 for each class. The attribute classifier, a ResNet-50, is shared for objects and parts. It is trained for 100 epochs with batch size 64. During inference of the 6D pose estimator, we assume theta is 0. During 3D NMS filtering, we choose the radius r as 2, and we also filter the object proposals with a threshold of 15 on the score map. |
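The hyperparameters quoted in the experiment-setup row can be collected into a single configuration sketch. This is a hedged illustration only: the dictionary names and the helper function are hypothetical and not taken from the authors' released code.

```python
# Hyperparameters as reported in the paper's experiment setup.
# All names below are illustrative, not from the authors' codebase.
POSE_ESTIMATOR_CFG = {
    "iterations": 15_000,       # training iterations per object class
    "batch_size": 15,
    "num_classes": 5,           # one neural mesh model per class
    "gpu": "NVIDIA RTX A5000",  # ~2 hours of training per class
}

ATTRIBUTE_CLASSIFIER_CFG = {
    "backbone": "ResNet-50",    # shared between objects and parts
    "epochs": 100,
    "batch_size": 64,
}

INFERENCE_CFG = {
    "theta": 0,                 # assumed fixed during 6D pose inference
    "nms_radius": 2,            # radius r for 3D NMS filtering
    "score_threshold": 15,      # proposal filter on the score map
}

def total_pose_training_iterations(cfg: dict) -> int:
    """Total 6D-pose-estimator iterations across all classes (illustrative)."""
    return cfg["iterations"] * cfg["num_classes"]
```

For example, `total_pose_training_iterations(POSE_ESTIMATOR_CFG)` gives 75,000 iterations across the 5 classes, consistent with roughly 2 hours of training per class.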