Memory-Augmented Image Captioning

Authors: Zhengcong Fei

AAAI 2021, pp. 1317-1324

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To better measure its effects, we conduct an extensive empirical evaluation on the MS COCO benchmark (Chen et al. 2015).
Researcher Affiliation | Academia | 1 Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; 2 University of Chinese Academy of Sciences, Beijing 100049, China
Pseudocode | Yes | Algorithm 1: Memory Augmented Image Caption Generation
Open Source Code | No | The paper mentions using FAISS, an open-source library, but does not provide access to its own implementation code.
Open Datasets | Yes | We utilize the most popular image captioning dataset MSCOCO (Chen et al. 2015) to evaluate the performance of our proposed method.
Dataset Splits | Yes | This split contains 113,287 images for training and 5,000 respectively for validation and test.
Hardware Specification | Yes | Once the keys are saved, for the MS COCO dataset, building the cache with 328M entries takes roughly one hour on a single 1080Ti GPU.
Software Dependencies | No | The paper mentions using FAISS, an open-source library, and pre-trained Faster-RCNN, but does not specify version numbers for any software dependencies.
Experiment Setup | Yes | Following (Anderson et al. 2018), the keys used for knowledge retrieval are the 1024-dimensional representations copied from context vectors. We perform a single forward pass over the total training set with the trained captioning model, in order to create the keys and values. A FAISS index is then created using 1.5M randomly sampled keys to learn 2K cluster centroids, and keys are quantized to 64-bytes. During inference, we query the memory with k = 512 most similar entries, and the index looks up 32 cluster centroids while searching for the next word candidates. The temperature T is set to 100 and the balancing parameter λ is selected based on the CIDEr score on the validation set.
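
The Experiment Setup row above describes a FAISS memory that is built offline and queried at each decoding step. Below is a minimal sketch of what that pipeline might look like, assuming FAISS's IndexIVFPQ for the 2K-centroid, 64-byte quantized index and a kNN-LM-style interpolation between the captioning model's word distribution and the retrieved distribution. The array sizes, vocabulary size, the memory_probs/combined_probs helpers, and the exact combination rule are illustrative assumptions, not the authors' released code.

    # Minimal sketch, not the authors' implementation. Assumes FAISS's IndexIVFPQ
    # and a kNN-LM-style interpolation; array sizes and helper names are
    # illustrative stand-ins for the quantities quoted in the table above.
    import numpy as np
    import faiss

    d = 1024        # key dimensionality: context vectors from the captioning model
    nlist = 2048    # roughly the 2K cluster centroids mentioned in the paper
    m = 64          # 64-byte PQ codes (64 sub-quantizers, 8 bits each)

    # --- Offline: build the memory index ---
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)

    # The paper trains on 1.5M randomly sampled keys; small random data stands in here.
    train_keys = np.random.rand(100_000, d).astype("float32")
    index.train(train_keys)

    # Keys and values come from one forward pass over the training set:
    # keys are context vectors, values are the ground-truth next-word ids.
    keys = np.random.rand(100_000, d).astype("float32")      # stand-in keys
    values = np.random.randint(0, 10_000, size=len(keys))    # stand-in word ids
    index.add(keys)

    # --- Inference: query the memory at each decoding step ---
    index.nprobe = 32   # look up 32 cluster centroids per query
    k = 512             # retrieve the 512 most similar entries
    T = 100.0           # temperature over retrieval distances

    def memory_probs(query, vocab_size):
        """Convert the k retrieved (distance, value) pairs into a vocabulary distribution."""
        dists, ids = index.search(query.reshape(1, -1).astype("float32"), k)
        weights = np.exp(-dists[0] / T)
        weights /= weights.sum()
        p_mem = np.zeros(vocab_size)
        for w, i in zip(weights, ids[0]):
            if i != -1:                  # FAISS pads with -1 when fewer than k hits exist
                p_mem[values[i]] += w
        return p_mem

    def combined_probs(p_model, query, lam=0.2):
        """Mix model and memory distributions; lambda is tuned on validation CIDEr."""
        return (1.0 - lam) * p_model + lam * memory_probs(query, len(p_model))

An IVFPQ index of this kind keeps a cache on the scale of the quoted 328M entries compact (64 bytes per stored key), while nprobe = 32 trades a small recall loss for much faster lookups than exhaustive search.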