VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning

Authors: Xiaowei Hu, Xi Yin, Kevin Lin, Lei Zhang, Jianfeng Gao, Lijuan Wang, Zicheng Liu

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of VIVO by fine-tuning the pre-trained model for image captioning. In addition, we perform an analysis of the visual-text alignment inferred by our model. The results show that our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects. Our single model has achieved new state-of-the-art results on nocaps and surpassed the human CIDEr score.
Researcher Affiliation | Collaboration | Xiaowei Hu, Xi Yin, Kevin Lin, Lei Zhang, Jianfeng Gao, Lijuan Wang, Zicheng Liu; Microsoft Corporation; {xiaowh, keli, lijuanw, leizhang, jfgao, zliu}@microsoft.com, yinxi.whu@gmail.com
Pseudocode | No | The paper describes the model architecture and training steps in textual form and through diagrams, but it does not include a formal pseudocode or algorithm block (a hedged reconstruction of the pre-training objective is sketched after this table).
Open Source Code | No | The paper discusses future work on leveraging more data, but it does not include a code-release statement or a link to an open-source implementation of the described method.
Open Datasets | Yes | We use the Open Images V5 challenge training set, which has 1.7M images, for VIVO pre-training. ... In the fine-tuning stage, the training data is the COCO training set of 118K images, each with 5 captions. We evaluate our model on the validation and test sets of nocaps, which consist of 4.5K and 10.6K images from the Open Images validation and test sets, respectively.
Dataset Splits | Yes | We evaluate our model on the validation and test sets of nocaps, which consist of 4.5K and 10.6K images from the Open Images validation and test sets, respectively. (The dataset sizes are collected in a summary sketch after this table.)
Hardware Specification | No | The paper describes the software components and experimental setup, but it does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper mentions using specific models such as Up-Down (Anderson et al. 2018) and BERT-base (Devlin et al. 2018), but it does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | In VIVO pre-training, we use a maximum of 50 image regions and 15 tag tokens per image. The model is trained for 160K iterations (about 100 epochs) with a batch size of 1024 and a learning rate of 5e-5. In fine-tuning, we set the maximum caption length to 40 and the maximum tag length to 30. The model is trained for 30 epochs with a batch size of 256 and a learning rate of 5e-5, optimized using the cross-entropy loss. To further boost the performance, we perform the SCST optimization (Rennie et al. 2017) with a learning rate of 2e-6 for 5 epochs. During inference, we use greedy decoding to generate image captions with a maximum length of 20. (These hyperparameters are transcribed into a configuration sketch after this table.)
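
Since the paper provides no pseudocode, the following is a minimal sketch of what the VIVO pre-training objective looks like based on the paper's textual description: masked tag prediction on image-tag pairs, with a Hungarian matching loss because the image-level tags form an unordered bag. The function name hungarian_tag_loss, the tensor shapes, and the use of scipy's linear_sum_assignment are our assumptions, not the authors' code.

    # Hypothetical reconstruction (not the authors' code) of masked tag prediction
    # with Hungarian matching: tag order is irrelevant, so each masked position is
    # matched one-to-one to the target tag it predicts best before the loss is computed.
    import torch
    import torch.nn.functional as F
    from scipy.optimize import linear_sum_assignment

    def hungarian_tag_loss(masked_logits, masked_tag_ids):
        # masked_logits: (M, V) vocabulary scores at the M masked tag positions
        # masked_tag_ids: (M,) ids of the masked tags, treated as an unordered bag
        log_probs = F.log_softmax(masked_logits, dim=-1)                  # (M, V)
        cost = -log_probs[:, masked_tag_ids]                              # (M, M) position-to-tag cost
        rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())   # optimal one-to-one matching
        rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
        return F.cross_entropy(masked_logits[rows], masked_tag_ids[cols])

    # Toy usage: 3 masked positions over a 10-token tag vocabulary.
    logits = torch.randn(3, 10, requires_grad=True)
    targets = torch.tensor([2, 7, 5])
    hungarian_tag_loss(logits, targets).backward()

In the full model, masked_logits would come from a multi-layer Transformer over concatenated image region features and tag tokens; only the loss computation is shown here.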
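
For quick reference, the dataset sizes quoted in the Open Datasets and Dataset Splits rows can be collected into a single summary. Only the figures come from the paper; the dictionary layout and key names below are ours.

    # Dataset sizes as quoted from the paper; grouping and key names are ours.
    DATASETS = {
        "vivo_pretraining": {"source": "Open Images V5 challenge training set", "images": 1_700_000},
        "caption_finetuning": {"source": "COCO training set", "images": 118_000, "captions_per_image": 5},
        "evaluation": {
            "nocaps_val":  {"images": 4_500,  "source": "Open Images validation set"},
            "nocaps_test": {"images": 10_600, "source": "Open Images test set"},
        },
    }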
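
Likewise, the hyperparameters listed in the Experiment Setup row can be transcribed into a configuration sketch. The values come from the paper; the grouping and key names are ours.

    # Hyperparameters transcribed from the paper's experiment setup; grouping and key names are ours.
    VIVO_PRETRAINING = {
        "max_image_regions": 50,
        "max_tag_tokens": 15,
        "iterations": 160_000,        # about 100 epochs
        "batch_size": 1024,
        "learning_rate": 5e-5,
    }
    CAPTION_FINETUNING = {
        "max_caption_length": 40,
        "max_tag_length": 30,
        "epochs": 30,
        "batch_size": 256,
        "learning_rate": 5e-5,
        "loss": "cross-entropy",
        "scst": {"learning_rate": 2e-6, "epochs": 5},   # SCST optimization (Rennie et al. 2017)
    }
    INFERENCE = {"decoding": "greedy", "max_caption_length": 20}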