EQA-MX: Embodied Question Answering using Multimodal Expression
Authors: Md Mofijul Islam, Alexi Gladstone, Riashat Islam, Tariq Iqbal
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experimental results suggest that VQ-Fusion can improve the performance of existing visual-language models by up to 13% across EQA tasks. In this section, we present experimental analyses on our EQA-MX dataset to evaluate the impact of VQ-Fusion in VL models for EQA tasks. |
| Researcher Affiliation | Collaboration | Md Mofijul Islam, Alexi Gladstone, Riashat Islam, Tariq Iqbal (University of Virginia; Amazon GenAI; McGill University; Mila - Quebec AI Institute) |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Source code of VQ-Fusion, benchmark models, dataset processing, and dataset analyses: https://bit.ly/eqa-repo |
| Open Datasets | Yes | We have developed a novel large-scale dataset, EQA-MX, with over 8 million diverse embodied QA data samples involving multimodal expressions from multiple visual and verbal perspectives. EQA-MX dataset (162 GB): https://bit.ly/eqa-mx-dataset |
| Dataset Splits | Yes | The training, validation, and test set splits for each of these tasks are shown in Table 2 (EQA-MX dataset splits for 8 EQA tasks). Train: 1060k each for EP, OG, POG, OC, and OAQ; 218k for OAC; 785k for PG; 349k for RG. Valid: 126k each for EP, OG, POG, OC, and OAQ; 27k for OAC; 93k for PG; 41k for RG. Test: 126k each for EP, OG, POG, OC, and OAQ; 28k for OAC; 93k for PG; 42k for RG. |
| Hardware Specification | Yes | Lastly, all models are trained on distributed GPU clusters, where each node contains 8 A100 GPUs. |
| Software Dependencies | Yes | We developed all the models using the PyTorch (version 1.12.1+cu113) (Paszke et al., 2019) and PyTorch-Lightning (version 1.7.1) (Falcon, 2019) deep learning frameworks. We also used the Hugging Face library (version 4.21.1) for pre-trained models (BERT (Devlin et al., 2018), ViT (Dosovitskiy et al., 2020), VisualBERT (Li et al., 2019), Dual Encoder, ViLT (Kim et al., 2021), and CLIP (Radford et al., 2021)). (See the loading sketch below the table.) |
| Experiment Setup | Yes | For the Dual-Encoder and CLIP models, we used an embedding size of 512, and for VisualBERT and ViLT, we used an embedding size of 768. We train models using the Adam optimizer with weight decay regularization (Loshchilov & Hutter, 2017) and cosine annealing warm restarts, with an initial learning rate of 3e-4, cycle length (T0) of 4, and cycle multiplier (Tmult) of 2. We used a batch size of 128 and trained models for 8 epochs. We used the same fixed random seed (33) for all the experiments to ensure reproducibility. (See the training-setup sketch below the table.) |
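
To make the dependency list concrete, here is a minimal sketch (not the authors' code) of loading a few of the listed pre-trained backbones through the Hugging Face library, consistent with the reported version 4.21.1. The specific checkpoint identifiers are assumptions, since the paper references the models only via footnotes.

```python
# Hedged sketch: loading pre-trained backbones named in the paper via the
# Hugging Face transformers library (paper reports version 4.21.1).
# The checkpoint identifiers below are assumptions, not taken from the paper.
from transformers import BertModel, ViTModel, ViltModel, CLIPModel

bert = BertModel.from_pretrained("bert-base-uncased")                # verbal encoder
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")  # visual encoder
vilt = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")            # vision-language model
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")     # vision-language model
```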
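The reported training setup (Adam with weight decay, cosine annealing warm restarts with T0=4 and Tmult=2, learning rate 3e-4, batch size 128, 8 epochs, seed 33, and 8 A100 GPUs per node) can be reconstructed roughly as follows with PyTorch 1.12 / PyTorch-Lightning 1.7 APIs. The toy `LitModel` and the random dataset are hypothetical stand-ins for the paper's VL models and the EQA-MX data; this is an illustrative sketch, not the released code.

```python
# Hedged sketch of the reported training configuration (not the authors' code).
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

pl.seed_everything(33)  # fixed random seed reported for all experiments


class LitModel(pl.LightningModule):
    def __init__(self, dim=512):  # 512 for Dual-Encoder/CLIP, 768 for VisualBERT/ViLT
        super().__init__()
        self.net = torch.nn.Linear(dim, 8)  # placeholder for a VL backbone + task head

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.net(x), y)

    def configure_optimizers(self):
        # Adam with decoupled weight decay and cosine annealing warm restarts
        opt = torch.optim.AdamW(self.parameters(), lr=3e-4)
        sched = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=4, T_mult=2)
        return [opt], [sched]


# Hypothetical random data standing in for EQA-MX samples
data = TensorDataset(torch.randn(1024, 512), torch.randint(0, 8, (1024,)))
loader = DataLoader(data, batch_size=128, shuffle=True)  # batch size 128 as reported

# 8 GPUs per node with distributed data parallel, 8 epochs as reported
trainer = pl.Trainer(max_epochs=8, accelerator="gpu", devices=8, strategy="ddp")
trainer.fit(LitModel(), loader)
```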