3D-Aware Visual Question Answering about Parts, Poses and Occlusions
Authors: Xingrui Wang, Wufei Ma, Zhuowan Li, Adam Kortylewski, Alan L. Yuille
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show our model PO3D-VQA outperforms existing methods significantly, but a notable performance gap remains compared to 2D VQA benchmarks, indicating that 3D-aware VQA remains an important open research area. |
| Researcher Affiliation | Academia | 1. Johns Hopkins University, 2. Max Planck Institute for Informatics, 3. University of Freiburg |
| Pseudocode | No | The paper describes its methods in prose and with diagrams, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/XingruiWang/3D-Aware-VQA. |
| Open Datasets | No | We introduce the task of 3D-aware VQA, which focuses on challenging questions that require a compositional reasoning over the 3D structure of visual scenes. First, we introduce Super-CLEVR-3D, a compositional reasoning dataset that contains questions about object parts, their 3D poses, and occlusions. |
| Dataset Splits | Yes | The dataset splits follow the Super-CLEVR dataset, where we have 20k images for training, 5k for validation, and 5k for testing. |
| Hardware Specification | Yes | We train the 6D pose estimator (including the contrastive feature backbone and the neural mesh models for each of the 5 classes) for 15k iterations with batch size 15, which takes around 2 hours on an NVIDIA RTX A5000 for each class. |
| Software Dependencies | No | The paper mentions using a ResNet-50 for the attribute classifier, but does not provide specific software library names with version numbers (e.g., TensorFlow, PyTorch) or other dependencies. |
| Experiment Setup | Yes | We train the 6D pose estimator (including the contrastive feature backbone and the neural mesh models for each of the 5 classes) for 15k iterations with batch size 15, which takes around 2 hours on an NVIDIA RTX A5000 for each class. The attribute classifier, a ResNet-50, is shared for objects and parts. It is trained for 100 epochs with batch size 64. During inference of the 6D pose estimator, we assume theta is 0. During 3D NMS filtering, we choose the radius r as 2, and we also filter the object proposals with a threshold of 15 on the score map. |
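The hyperparameters quoted in the experiment-setup row can be collected into a single configuration sketch. This is a hedged illustration only: the dictionary names and the helper function are hypothetical and not taken from the authors' released code.

```python
# Hyperparameters as reported in the paper's experiment setup.
# All names below are illustrative, not from the authors' codebase.
POSE_ESTIMATOR_CFG = {
    "iterations": 15_000,       # training iterations per object class
    "batch_size": 15,
    "num_classes": 5,           # one neural mesh model per class
    "gpu": "NVIDIA RTX A5000",  # ~2 hours of training per class
}

ATTRIBUTE_CLASSIFIER_CFG = {
    "backbone": "ResNet-50",    # shared between objects and parts
    "epochs": 100,
    "batch_size": 64,
}

INFERENCE_CFG = {
    "theta": 0,                 # assumed fixed during 6D pose inference
    "nms_radius": 2,            # radius r for 3D NMS filtering
    "score_threshold": 15,      # proposal filter on the score map
}

def total_pose_training_iterations(cfg: dict) -> int:
    """Total 6D-pose-estimator iterations across all classes (illustrative)."""
    return cfg["iterations"] * cfg["num_classes"]
```

For example, `total_pose_training_iterations(POSE_ESTIMATOR_CFG)` gives 75,000 iterations across the 5 classes, consistent with roughly 2 hours of training per class.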