VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning

Authors: Xiaowei Hu, Xi Yin, Kevin Lin, Lei Zhang, Jianfeng Gao, Lijuan Wang, Zicheng Liu

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of VIVO by fine-tuning the pre-trained model for image captioning. In addition, we perform an analysis of the visual-text alignment inferred by our model. The results show that our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects. Our single model has achieved new state-of-the-art results on nocaps and surpassed the human CIDEr score.
Researcher Affiliation | Collaboration | Xiaowei Hu, Xi Yin, Kevin Lin, Lei Zhang, Jianfeng Gao, Lijuan Wang, Zicheng Liu; Microsoft Corporation; {xiaowh, keli, lijuanw, leizhang, jfgao, zliu}@microsoft.com, yinxi.whu@gmail.com
Pseudocode | No | The paper describes the model architecture and training steps in textual form and through diagrams, but it does not include a formal pseudocode or algorithm block (a hedged reconstruction of the pre-training objective is sketched after this table).
Open Source Code | No | The paper discusses future work on leveraging more data, but it does not include a code-release statement or a link to an open-source implementation of the described method.
Open Datasets | Yes | We use the Open Images V5 challenge training set, which has 1.7M images, for VIVO pre-training. ... In the fine-tuning stage, the training data is the COCO training set of 118K images, each with 5 captions. We evaluate our model on the validation and test sets of nocaps, which consist of 4.5K and 10.6K images from the Open Images validation and test sets, respectively.
Dataset Splits | Yes | We evaluate our model on the validation and test sets of nocaps, which consist of 4.5K and 10.6K images from the Open Images validation and test sets, respectively. (The dataset sizes are collected in a summary sketch after this table.)
Hardware Specification | No | The paper describes the software components and experimental setup, but it does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper mentions using specific models such as Up-Down (Anderson et al. 2018) and BERT-base (Devlin et al. 2018), but it does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | In VIVO pre-training, we use a maximum of 50 image regions and 15 tag tokens per image. The model is trained for 160K iterations (about 100 epochs) with a batch size of 1024 and a learning rate of 5e-5. In fine-tuning, we set the maximum caption length to 40 and the maximum tag length to 30. The model is trained for 30 epochs with a batch size of 256 and a learning rate of 5e-5, optimized using the cross-entropy loss. To further boost the performance, we perform the SCST optimization (Rennie et al. 2017) with a learning rate of 2e-6 for 5 epochs. During inference, we use greedy decoding to generate image captions with a maximum length of 20. (These hyperparameters are transcribed into a configuration sketch after this table.)
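
Since the paper provides no pseudocode, the following is a minimal sketch of what the VIVO pre-training objective looks like based on the paper's textual description: masked tag prediction on image-tag pairs, with a Hungarian matching loss because the image-level tags form an unordered bag. The function name hungarian_tag_loss, the tensor shapes, and the use of scipy's linear_sum_assignment are our assumptions, not the authors' code.

    # Hypothetical reconstruction (not the authors' code) of masked tag prediction
    # with Hungarian matching: tag order is irrelevant, so each masked position is
    # matched one-to-one to the target tag it predicts best before the loss is computed.
    import torch
    import torch.nn.functional as F
    from scipy.optimize import linear_sum_assignment

    def hungarian_tag_loss(masked_logits, masked_tag_ids):
        # masked_logits: (M, V) vocabulary scores at the M masked tag positions
        # masked_tag_ids: (M,) ids of the masked tags, treated as an unordered bag
        log_probs = F.log_softmax(masked_logits, dim=-1)                  # (M, V)
        cost = -log_probs[:, masked_tag_ids]                              # (M, M) position-to-tag cost
        rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())   # optimal one-to-one matching
        rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
        return F.cross_entropy(masked_logits[rows], masked_tag_ids[cols])

    # Toy usage: 3 masked positions over a 10-token tag vocabulary.
    logits = torch.randn(3, 10, requires_grad=True)
    targets = torch.tensor([2, 7, 5])
    hungarian_tag_loss(logits, targets).backward()

In the full model, masked_logits would come from a multi-layer Transformer over concatenated image region features and tag tokens; only the loss computation is shown here.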
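
For quick reference, the dataset sizes quoted in the Open Datasets and Dataset Splits rows can be collected into a single summary. Only the figures come from the paper; the dictionary layout and key names below are ours.

    # Dataset sizes as quoted from the paper; grouping and key names are ours.
    DATASETS = {
        "vivo_pretraining": {"source": "Open Images V5 challenge training set", "images": 1_700_000},
        "caption_finetuning": {"source": "COCO training set", "images": 118_000, "captions_per_image": 5},
        "evaluation": {
            "nocaps_val":  {"images": 4_500,  "source": "Open Images validation set"},
            "nocaps_test": {"images": 10_600, "source": "Open Images test set"},
        },
    }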
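
Likewise, the hyperparameters listed in the Experiment Setup row can be transcribed into a configuration sketch. The values come from the paper; the grouping and key names are ours.

    # Hyperparameters transcribed from the paper's experiment setup; grouping and key names are ours.
    VIVO_PRETRAINING = {
        "max_image_regions": 50,
        "max_tag_tokens": 15,
        "iterations": 160_000,        # about 100 epochs
        "batch_size": 1024,
        "learning_rate": 5e-5,
    }
    CAPTION_FINETUNING = {
        "max_caption_length": 40,
        "max_tag_length": 30,
        "epochs": 30,
        "batch_size": 256,
        "learning_rate": 5e-5,
        "loss": "cross-entropy",
        "scst": {"learning_rate": 2e-6, "epochs": 5},   # SCST optimization (Rennie et al. 2017)
    }
    INFERENCE = {"decoding": "greedy", "max_caption_length": 20}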