LOVA3: Learning to Visual Question Answering, Asking and Assessment

Authors: Henry Hengyuan Zhao, Pan Zhou, Difei Gao, Zechen Bai, Mike Zheng Shou

NeurIPS 2024

Reproducibility assessment (each variable is listed with its result and the supporting LLM response):

Research Type: Experimental
LLM Response: To validate this hypothesis, we train MLLMs using the LOVA3 framework and evaluate them on a range of multimodal datasets and benchmarks. Our results demonstrate consistent performance gains, underscoring the critical role of these additional tasks in fostering comprehensive intelligence in MLLMs.

Researcher Affiliation: Academia
LLM Response: Henry Hengyuan Zhao¹, Pan Zhou², Difei Gao¹, Zechen Bai¹, Mike Zheng Shou¹ (¹Show Lab, National University of Singapore; ²Singapore Management University)

Pseudocode: No
LLM Response: The paper does not contain structured pseudocode or algorithm blocks.

Open Source Code: Yes
LLM Response: The code is available at https://github.com/showlab/LOVA3.

Open Datasets: Yes
LLM Response: (1) VQAv2 [26] and GQA [30] are two large-scale annotated VQA datasets comprising 430K and 943K instances. (2) VizWiz [27] is a challenging dataset comprising 8,000 instances in the test-dev set. (3) ScienceQA [50] is a benchmark comprising 21K multimodal multiple-choice questions covering diverse science topics. (4) POPE [39] is a benchmark for evaluating object hallucination in MLLMs. (5) MME [19], SEED-Bench [36], MMBench [45], LLaVA-Bench [43], and MM-Vet [92] are five prominent multimodal benchmarks designed to evaluate various capabilities of MLLMs, including object existence, color recognition, counting, OCR, etc.

Dataset Splits: Yes
LLM Response: Our approach involves the creation of two supplementary training tasks, GenQA and EvalQA, aimed at fostering the skills of asking and assessing questions in the context of images. To develop the questioning ability, we compile a comprehensive set of multimodal foundational tasks. For assessment, we introduce a new benchmark called EvalQABench, comprising 64,000 training samples (split evenly between positive and negative samples) and 5,000 validation and testing samples.
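
To make the reported split sizes concrete, the snippet below is a minimal sketch in Python. The dictionary name and layout are illustrative assumptions; only the counts come from the quoted text, and the 5,000 validation/testing samples are kept as a single figure because the quote does not break them down further.

```python
# Minimal sketch of the EvalQABench split sizes quoted above.
# The dictionary name and layout are illustrative assumptions; only the
# counts come from the paper's description (64,000 training samples split
# evenly between positive and negative, plus 5,000 validation/testing samples).
EVALQABENCH_SPLITS = {
    "train": {"positive": 32_000, "negative": 32_000},  # 64K total, evenly split
    "val_and_test": 5_000,  # not broken down further in the quoted text
}

assert sum(EVALQABENCH_SPLITS["train"].values()) == 64_000
```
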
Hardware Specification: Yes
LLM Response: The training process takes 24.5 hours on an 8× NVIDIA A100 (40GB) GPU setup.

Software Dependencies: No
LLM Response: The paper mentions models such as LLaVA-1.5, Vicuna-7B, Phi-1.5, Llama 2, Fuyu-8B, and CLIP, as well as the AdamW optimizer. However, it does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA, which are necessary for full reproducibility.

Experiment Setup: Yes
LLM Response: The model is trained for one epoch across three tasks: VQA, GenQA, and EvalQA. Specifically, we employ the AdamW [46] optimizer with a learning rate of 2×10⁻⁵ and a total batch size of 128. The hyperparameters of LOVA3 are aligned with those of LLaVA-1.5 to ensure a fair comparison, as illustrated in Table 10. (Table 10 details: batch size 128; learning rate 2e-5; learning rate schedule: cosine decay; learning rate warmup ratio 0.03; weight decay 0; epochs: 1; optimizer: AdamW; DeepSpeed stage 3.)
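
For reference, here is a minimal sketch that expresses the Table 10 hyperparameters as HuggingFace `TrainingArguments` (LOVA3 builds on LLaVA-1.5, which uses a HuggingFace-style trainer). The per-device batch size, output directory, and DeepSpeed config filename are assumptions, not values taken from the released code; the remaining values are the ones quoted above.

```python
# Minimal sketch mapping the Table 10 hyperparameters onto HuggingFace
# TrainingArguments. The per-device batch size (16 x 8 GPUs = 128 total),
# output directory, and DeepSpeed config filename are assumptions; the
# remaining values are the ones reported in the paper.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./lova3-checkpoints",   # hypothetical output path
    num_train_epochs=1,                 # one epoch over VQA, GenQA, and EvalQA
    per_device_train_batch_size=16,     # assumed: 16 per GPU x 8 GPUs = 128 total
    learning_rate=2e-5,
    lr_scheduler_type="cosine",         # cosine decay schedule
    warmup_ratio=0.03,
    weight_decay=0.0,
    optim="adamw_torch",                # AdamW optimizer
    # deepspeed="zero3.json",           # DeepSpeed ZeRO stage 3 (uncomment with a valid config file)
)
```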