EQA-MX: Embodied Question Answering using Multimodal Expression
Authors: Md Mofijul Islam, Alexi Gladstone, Riashat Islam, Tariq Iqbal
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experimental results suggest that VQ-Fusion can improve the performance of existing visual-language models by up to 13% across EQA tasks. In this section, we present experimental analyses on our EQA-MX dataset to evaluate the impact of VQ-Fusion in VL models for EQA tasks. |
| Researcher Affiliation | Collaboration | Md Mofijul Islam, Alexi Gladstone, Riashat Islam, Tariq Iqbal (University of Virginia; Amazon GenAI; McGill University; Mila - Quebec AI Institute) |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Source code of VQ-Fusion, benchmark models, dataset processing, and dataset analyses: https://bit.ly/eqa-repo |
| Open Datasets | Yes | We have developed a novel large-scale dataset, EQA-MX, with over 8 million diverse embodied QA data samples involving multimodal expressions from multiple visual and verbal perspectives. EQA-MX dataset (162 GB): https://bit.ly/eqa-mx-dataset |
| Dataset Splits | Yes | The training, validation, and test set splits for each of these tasks are shown in Table 2 (EQA-MX dataset splits for 8 EQA tasks). Train: 1060k each for EP, OG, POG, OC, and OAQ; 218k for OAC; 785k for PG; 349k for RG. Valid: 126k each for EP, OG, POG, OC, and OAQ; 27k for OAC; 93k for PG; 41k for RG. Test: 126k each for EP, OG, POG, OC, and OAQ; 28k for OAC; 93k for PG; 42k for RG. |
| Hardware Specification | Yes | Lastly, all models are trained on distributed GPU clusters, where each node contains 8 A100 GPUs. |
| Software Dependencies | Yes | We developed all the models using the PyTorch (version 1.12.1+cu113) (Paszke et al., 2019) and PyTorch-Lightning (version 1.7.1) (Falcon, 2019) deep learning frameworks. We also used the Hugging Face library (version 4.21.1) for pre-trained models (BERT (Devlin et al., 2018), ViT (Dosovitskiy et al., 2020), VisualBERT (Li et al., 2019), Dual Encoder, ViLT (Kim et al., 2021), and CLIP (Radford et al., 2021)). (See the loading sketch below the table.) |
| Experiment Setup | Yes | For the Dual-Encoder and CLIP models, we used an embedding size of 512, and for VisualBERT and ViLT, we used an embedding size of 768. We train models using the Adam optimizer with weight decay regularization (Loshchilov & Hutter, 2017) and cosine annealing warm restarts, with an initial learning rate of 3e-4, cycle length (T0) of 4, and cycle multiplier (Tmult) of 2. We used a batch size of 128 and trained models for 8 epochs. We used the same fixed random seed (33) for all the experiments to ensure reproducibility. (See the training-setup sketch below the table.) |
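
To make the dependency list concrete, here is a minimal sketch (not the authors' code) of loading a few of the listed pre-trained backbones through the Hugging Face library, consistent with the reported version 4.21.1. The specific checkpoint identifiers are assumptions, since the paper references the models only via footnotes.

```python
# Hedged sketch: loading pre-trained backbones named in the paper via the
# Hugging Face transformers library (paper reports version 4.21.1).
# The checkpoint identifiers below are assumptions, not taken from the paper.
from transformers import BertModel, ViTModel, ViltModel, CLIPModel

bert = BertModel.from_pretrained("bert-base-uncased")                # verbal encoder
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")  # visual encoder
vilt = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")            # vision-language model
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")     # vision-language model
```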
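The reported training setup (Adam with weight decay, cosine annealing warm restarts with T0=4 and Tmult=2, learning rate 3e-4, batch size 128, 8 epochs, seed 33, and 8 A100 GPUs per node) can be reconstructed roughly as follows with PyTorch 1.12 / PyTorch-Lightning 1.7 APIs. The toy `LitModel` and the random dataset are hypothetical stand-ins for the paper's VL models and the EQA-MX data; this is an illustrative sketch, not the released code.

```python
# Hedged sketch of the reported training configuration (not the authors' code).
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

pl.seed_everything(33)  # fixed random seed reported for all experiments


class LitModel(pl.LightningModule):
    def __init__(self, dim=512):  # 512 for Dual-Encoder/CLIP, 768 for VisualBERT/ViLT
        super().__init__()
        self.net = torch.nn.Linear(dim, 8)  # placeholder for a VL backbone + task head

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.net(x), y)

    def configure_optimizers(self):
        # Adam with decoupled weight decay and cosine annealing warm restarts
        opt = torch.optim.AdamW(self.parameters(), lr=3e-4)
        sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=4, T_mult=2)
        return [opt], [sched]


# Hypothetical random data standing in for EQA-MX samples
data = TensorDataset(torch.randn(1024, 512), torch.randint(0, 8, (1024,)))
loader = DataLoader(data, batch_size=128, shuffle=True)  # batch size 128 as reported

# 8 GPUs per node with distributed data parallel, 8 epochs as reported
trainer = pl.Trainer(max_epochs=8, accelerator="gpu", devices=8, strategy="ddp")
trainer.fit(LitModel(), loader)
```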