Learning to Generate Visual Questions with Noisy Supervision
Authors: Kai Shen, Lingfei Wu, Siliang Tang, Yueting Zhuang, Zhen He, Zhuoye Ding, Yun Xiao, Bo Long
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on two benchmark datasets show that our proposed method outperforms the state-of-the-art approaches by a large margin on a variety of metrics, including both automatic machine metrics and human evaluation. |
| Researcher Affiliation | Collaboration | Kai Shen, Lingfei Wu, Siliang Tang, Yueting Zhuang, Zhen He, Zhuoye Ding, Yun Xiao, and Bo Long. Zhejiang University; JD.COM. shenkai@zju.edu.cn, lwu@email.wm.edu, {siliang,yzhuang}@zju.edu.cn, {bjhezhen,dingzhuoye,xiaoyun1,bo.long}@jd.com |
| Pseudocode | No | The paper provides mathematical formulations and descriptions of modules but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and data for our model are provided for research purposes: DH-GAN for VQG Github Repo. |
| Open Datasets | Yes | We conduct the experiments on the VQA2.0 [3] and COCO-QA [39] datasets. |
| Dataset Splits | Yes | After pre-processing, VQA2.0 has 278707/135584 examples and COCO-QA has 58979/29017 examples for the training/validation splits, respectively. Since the test splits for these two datasets are not open to the public, we divide the validation set into a 10% validation split and a 90% test split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software tools like 'pre-trained Faster-RCNN' but does not specify version numbers for any key software components or libraries. |
| Experiment Setup | Yes | For text data, we truncate questions longer than 20 words and build the vocabulary from words with at least 3 occurrences. We train it with the cross-entropy loss function denoted as $\mathcal{L}_{lm}$. The loss function of the generator derived from Eq. 11 can be written as: $\mathcal{L}_{rl} = [R(\{I, Q, A\}) - R(\{I, \hat{Q}, A\})][-\log P(Q \mid I, A, V) - \beta \log P(V \mid I, A)]$, where $\beta$ is a hyper-parameter, $-\log P(Q \mid I, A, V)$ is the question generation loss with target question $Q$ in Sec. 2.2.2, and $-\log P(V \mid I, A)$ is the visual hints prediction loss given target visual hints $V$ in Sec. 2.2.1. Practically, we find it unstable to update the generator by minimizing the loss $\mathcal{L}_{rl}$ alone. Thus we combine both the teacher-forcing loss $\mathcal{L}_{sup}$ in Eq. 6 and the reinforcement loss as: $\mathcal{L}_G = \gamma \mathcal{L}_{rl} + (1 - \gamma)\mathcal{L}_{sup}$, where $\gamma$ is a scaling factor controlling the trade-off between the teacher-forcing loss and the RL loss. (A code sketch of this combined objective follows the table.) |
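
The combined objective quoted in the Experiment Setup row maps directly onto a few lines of training code. Below is a minimal PyTorch-style sketch of the generator update under the stated formulas; the function name `generator_loss` and all tensor names are illustrative placeholders rather than the authors' released implementation, and the `beta`/`gamma` defaults are assumptions, since the excerpt does not report the actual hyper-parameter values.

```python
import torch

def generator_loss(reward_real: torch.Tensor,
                   reward_fake: torch.Tensor,
                   nll_question: torch.Tensor,
                   nll_hints: torch.Tensor,
                   l_sup: torch.Tensor,
                   beta: float = 1.0,
                   gamma: float = 0.5) -> torch.Tensor:
    """Sketch of L_G = gamma * L_rl + (1 - gamma) * L_sup.

    reward_real:  R({I, Q, A}), discriminator reward for the ground-truth triple
    reward_fake:  R({I, Q_hat, A}), reward for the generated triple (the baseline)
    nll_question: -log P(Q | I, A, V), question-generation cross-entropy (Sec. 2.2.2)
    nll_hints:    -log P(V | I, A), visual-hint prediction loss (Sec. 2.2.1)
    l_sup:        teacher-forcing loss (Eq. 6)
    beta, gamma:  hyper-parameters; the default values here are assumptions
    """
    # Baseline-subtracted reward; detached so gradients flow only
    # through the generator's log-likelihood terms, not the rewards.
    advantage = (reward_real - reward_fake).detach()
    # L_rl from Eq. 11: advantage-weighted generation and hint losses.
    l_rl = advantage * (nll_question + beta * nll_hints)
    # Trade off the unstable RL loss against the stable teacher-forcing loss.
    return gamma * l_rl + (1.0 - gamma) * l_sup


# Toy usage with scalar stand-ins for the batch-level losses.
reward_real = torch.tensor(0.9)
reward_fake = torch.tensor(0.6)
nll_question = torch.tensor(2.3, requires_grad=True)
nll_hints = torch.tensor(0.8, requires_grad=True)
l_sup = torch.tensor(2.1, requires_grad=True)
loss = generator_loss(reward_real, reward_fake, nll_question, nll_hints, l_sup)
loss.backward()
```

Detaching the advantage mirrors standard self-critical training: the reward difference acts as a fixed per-example weight, so a ground-truth triple that the discriminator scores above the generated one pushes the generator harder toward reproducing it.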