Learning to Generate Visual Questions with Noisy Supervision

Authors: Kai Shen, Lingfei Wu, Siliang Tang, Yueting Zhuang, Zhen He, Zhuoye Ding, Yun Xiao, Bo Long

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on two benchmark datasets show that our proposed method outperforms the state-of-the-art approaches by a large margin on a variety of metrics, including both automatic machine metrics and human evaluation.
Researcher Affiliation | Collaboration | Kai Shen, Lingfei Wu, Siliang Tang, Yueting Zhuang, Zhen He, Zhuoye Ding, Yun Xiao, and Bo Long. Zhejiang University; JD.COM. shenkai@zju.edu.cn, lwu@email.wm.edu, {siliang,yzhuang}@zju.edu.cn, {bjhezhen,dingzhuoye,xiaoyun1,bo.long}@jd.com
Pseudocode | No | The paper provides mathematical formulations and descriptions of modules but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code and data for our model are provided for research purposes: DH-GAN for VQG GitHub repo.
Open Datasets | Yes | We conduct the experiments on the VQA2.0 [3] and COCO-QA [39] datasets.
Dataset Splits | Yes | After pre-processing, VQA2.0 has 278707/135584 examples and COCO-QA has 58979/29017 examples for the training/validation splits, respectively. Since the test splits for these two datasets are not publicly available, we divide the validation set into a 10% validation split and a 90% test split. (A minimal split sketch follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies | No | The paper mentions software tools like 'pre-trained Faster-RCNN' but does not specify version numbers for any key software components or libraries.
Experiment Setup | Yes | For text data, we truncate the questions longer than 20 words and build the vocabulary on the words with at least 3 occurrences. We train it with the cross-entropy loss function denoted as Llm. The loss function of the generator derived from Eq. 11 can be written as: Lrl = [R({I, Q, A}) − R({I, Q̂, A})][log P(Q|I, A, V) + β log P(V|I, A)], where β is the hyper-parameter, log P(Q|I, A, V) is the question generation loss with target question Q in Sec. 2.2.2, and log P(V|I, A) is the visual hints prediction loss given target visual hints V in Sec. 2.2.1. Practically, we find that it is unstable to update the generator by minimizing the loss Lrl. Thus we combine both the teacher-forcing loss Lsup in Eq. 6 and the reinforcement loss as: LG = γLrl + (1 − γ)Lsup, where γ is a scaling factor controlling the trade-off between teacher-forcing loss and RL loss. (A sketch of this combined loss is given below the table.)
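The Experiment Setup row combines a reinforcement term with a teacher-forcing term via LG = γLrl + (1 − γ)Lsup. Below is a minimal PyTorch-style sketch of that combination, assuming the reward difference and the two per-example loss terms are already computed; the function name, tensor shapes, and default hyper-parameter values are illustrative assumptions, not the authors' implementation.

```python
import torch

def combined_generator_loss(reward_target, reward_generated,
                            q_gen_loss, vh_loss, sup_loss,
                            beta=1.0, gamma=0.5):
    """Sketch of L_G = gamma * L_rl + (1 - gamma) * L_sup.

    reward_target / reward_generated: R({I, Q, A}) and R({I, Q_hat, A}),
        one reward per example (shape [batch]).
    q_gen_loss: the question-generation term log P(Q | I, A, V) per example.
    vh_loss:    the visual-hints prediction term log P(V | I, A) per example.
    sup_loss:   the teacher-forcing loss L_sup (scalar).
    beta, gamma: hyper-parameters; the default values here are assumptions.
    """
    # The reward difference acts as a non-differentiable scaling factor.
    advantage = (reward_target - reward_generated).detach()
    # L_rl = [R({I,Q,A}) - R({I,Q_hat,A})] * [log P(Q|I,A,V) + beta * log P(V|I,A)]
    rl_loss = (advantage * (q_gen_loss + beta * vh_loss)).mean()
    # L_G = gamma * L_rl + (1 - gamma) * L_sup
    return gamma * rl_loss + (1.0 - gamma) * sup_loss

# Illustrative call with random tensors (batch of 4):
loss = combined_generator_loss(torch.rand(4), torch.rand(4),
                               torch.rand(4), torch.rand(4),
                               torch.tensor(1.2))
```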
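The Dataset Splits row further partitions each official validation set into a 10% validation split and a 90% test split. A minimal sketch of such a partition, assuming a seeded random shuffle (the paper does not state how the partition is drawn); the function name is hypothetical.

```python
import random

def split_validation(example_ids, val_fraction=0.1, seed=0):
    """Split the official validation set into a 10% validation split
    and a 90% test split. The shuffle and seed are assumptions."""
    ids = list(example_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_fraction)
    return ids[:n_val], ids[n_val:]

# Illustrative call with the reported VQA2.0 validation size (135584 examples):
val_ids, test_ids = split_validation(range(135584))
print(len(val_ids), len(test_ids))  # 13558 122026
```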