MultiModalQA: Complex Question Answering over Text, Tables and Images

Authors: Alon Talmor, Ori Yoran, Amnon Catav, Dan Lahav, Yizhong Wang, Akari Asai, Gabriel Ilharco, Hannaneh Hajishirzi, Jonathan Berant

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We create 29,918 questions through this procedure, and empirically demonstrate the necessity of a multi-modal multi-hop approach to solve our task: our multi-hop model, ImplicitDecomp, achieves an average F1 of 51.7 over cross-modal questions, substantially outperforming a strong baseline that achieves 38.2 F1, but still lags significantly behind human performance, which is at 90.1 F1."
Researcher Affiliation | Collaboration | The Allen Institute for AI, Tel-Aviv University, University of Washington
Pseudocode | No | The paper describes the model components and their interaction in prose, but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | "Our dataset and code are available at https://allenai.github.io/multimodalqa."
Open Datasets | Yes | "Our dataset and code are available at https://allenai.github.io/multimodalqa."
Dataset Splits | Yes | "We split the dataset into 23,817 training, 2,441 development (dev.), and 3,660 test set examples."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions pre-trained models such as RoBERTa-large and ViLBERT-MT, but does not provide version numbers for any software dependencies, libraries, or frameworks used for implementation or experimentation.
Experiment Setup | No | The paper describes the training strategy and loss functions (e.g., cross-entropy loss) but does not report specific experimental setup details such as hyperparameter values (learning rate, batch size, number of epochs), optimizer settings, or detailed training schedules.
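The Research Type row above quotes average F1 scores (51.7 for ImplicitDecomp versus 38.2 for the baseline and 90.1 for humans). For reference, the following is a minimal sketch of the SQuAD-style token-overlap F1 commonly used in question-answering evaluation; it is an assumption that the paper's numbers follow this scheme, and the official MultiModalQA evaluation additionally handles list-valued answers, which this sketch ignores.

```python
from collections import Counter
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted answer string and a gold answer string."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; otherwise zero overlap.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("the Eiffel Tower", "Eiffel Tower"))  # 1.0 after normalization
```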
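The Dataset Splits row reports 23,817 / 2,441 / 3,660 examples. The sketch below counts examples in the released question files to check those numbers; the gzipped-JSONL file names are assumptions based on the project release at https://allenai.github.io/multimodalqa and may differ from the actual distribution.

```python
import gzip

# Assumed file names for the released question sets (hypothetical; verify
# against the files actually published on the MultiModalQA project page).
SPLIT_FILES = {
    "train": "MMQA_train.jsonl.gz",
    "dev": "MMQA_dev.jsonl.gz",
    "test": "MMQA_test.jsonl.gz",
}

# Split sizes reported in the paper.
EXPECTED = {"train": 23817, "dev": 2441, "test": 3660}


def count_examples(path):
    """Count one JSON object per non-empty line in a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


if __name__ == "__main__":
    for split, path in SPLIT_FILES.items():
        n = count_examples(path)
        status = "OK" if n == EXPECTED[split] else f"expected {EXPECTED[split]}"
        print(f"{split}: {n} examples ({status})")
```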