ConVQG: Contrastive Visual Question Generation with Multimodal Guidance

Authors: Li Mi, Syrielle Montariol, Javiera Castillo Navarro, Xianjie Dai, Antoine Bosselut, Devis Tuia

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on both knowledge-aware and standard VQG benchmarks demonstrate that ConVQG outperforms state-of-the-art methods and generates image-grounded, text-guided, and knowledge-rich questions.
Researcher Affiliation | Academia | Li Mi*, Syrielle Montariol*, Javiera Castillo-Navarro*, Xianjie Dai, Antoine Bosselut, and Devis Tuia; EPFL, Switzerland; {li.mi, antoine.bosselut, devis.tuia}@epfl.ch
Pseudocode | No | The paper describes the model architecture and method through text and diagrams (Figure 2), but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using publicly available BLIP models and thanks others for providing baseline codes, but there is no explicit statement or link indicating that the authors' own ConVQG source code is openly available.
Open Datasets | Yes | We evaluate our VQG method on three public datasets: a knowledge-aware benchmark (K-VQG) and two standard VQG benchmarks (VQA 2.0 and VQG COCO). K-VQG (Uehara and Harada 2023) is a knowledge-aware VQG dataset. It is a large-scale, humanly annotated dataset, where image-grounded questions are tied to structured knowledge (knowledge triplets). Each sample consists of an image, a question, an answer, and a knowledge triplet (an illustrative sample layout is sketched after this table). K-VQG contains 13K images and 16K (question, answer) pairs, related to 6K knowledge triplets. VQA 2.0 (Goyal et al. 2017), with more than 1M (image, question, answer) triplets, is the largest and most commonly used dataset for VQG evaluation. Images come from the COCO dataset (Lin et al. 2014), and three (question, answer) pairs were collected per image. VQG COCO (Mostafazadeh et al. 2016) was created to generate natural and engaging questions for images.
Dataset Splits | Yes | VQG COCO (Mostafazadeh et al. 2016) was created to generate natural and engaging questions for images. It contains 2500 training images, 1250 validation images, and 1250 testing images.
Hardware Specification | Yes | Training was done on six NVIDIA A100-SXM4-40GB GPUs with a batch size of 24 each (VQA 2.0 dataset) and four NVIDIA V100-SXM2-32GB GPUs with a batch size of 16 each (K-VQG dataset, VQG-COCO dataset).
Software Dependencies | No | The paper mentions several software components, such as BLIP, BERT, ViT, sentence-BERT, and pycocoevalcap, but it does not specify exact version numbers for these or other software dependencies.
Experiment Setup | Yes | The image encoder is a ViT-B/16, i.e., a ViT architecture with 12 attention heads, 12 hidden layers, and images divided into 16×16 patches. The text encoder and the question decoder are BERT-base models, i.e., transformer encoders with 12 attention heads and 12 hidden layers. The number of epochs varies depending on the dataset (10 for VQA 2.0, 5 for K-VQG, 5 for VQG-COCO). The starting learning rate is 2e-5 with a weight decay of 0.05. In the proposed ConVQG method, there are three core parameters: α (Eq. (5)) and β (Eq. (6)), both balancing the different parts of the loss, and the margin m (Eq. (3) and (4)). A hedged training-setup sketch follows after this table.
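
For concreteness, the K-VQG sample structure quoted in the Open Datasets row (an image, a question, an answer, and a knowledge triplet) could be laid out as in the following minimal sketch. Every field name and value here is illustrative, not taken from the dataset files or the authors' code.

```python
# Hypothetical layout of one K-VQG sample as described in the Open Datasets row.
sample = {
    "image": "example_image.jpg",                    # COCO-style image file (illustrative name)
    "question": "What is this appliance used for?",  # human-annotated question (illustrative)
    "answer": "keeping food cold",                   # answer string (illustrative)
    "knowledge_triplet": ("refrigerator", "UsedFor", "keeping food cold"),  # structured knowledge (illustrative)
}
```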
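
The Experiment Setup row names two loss-balancing weights (α, β), a margin m, a starting learning rate of 2e-5, and a weight decay of 0.05, but the paper provides no code. Below is a minimal PyTorch-style sketch of how a margin-based contrastive term and a weighted loss combination of that shape could be written; the exact form of Eqs. (3)-(6) is assumed rather than reproduced from the paper, all names are illustrative, and the AdamW optimizer is an assumption chosen only to match the quoted learning rate and weight decay.

```python
import torch
import torch.nn.functional as F

def margin_contrastive(anchor, positive, negative, m=0.2):
    """Triplet-style margin loss over cosine similarities (assumed shape of
    the margin terms the quote attributes to Eq. (3)/(4); m is the margin)."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return torch.clamp(neg_sim - pos_sim + m, min=0.0).mean()

def combined_loss(gen_loss, img_term, txt_term, alpha=0.5, beta=0.5):
    """Assumed weighted combination of the generation loss with image- and
    text-side contrastive terms, in the spirit of Eq. (5)/(6); the default
    alpha/beta values are placeholders, not the paper's settings."""
    return gen_loss + alpha * img_term + beta * txt_term

# Hypothetical optimizer configuration matching the quoted hyperparameters;
# the excerpt does not state which optimizer the authors actually used.
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.05)
```

A full training loop would additionally apply the dataset-specific settings quoted above: 10 epochs for VQA 2.0, 5 for K-VQG and VQG-COCO, and per-GPU batch sizes of 24 or 16 depending on the dataset.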