Memory-Augmented Image Captioning
Authors: Zhengcong Fei (pp. 1317-1324)
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To better measure its effects, we conduct an extensive empirical evaluation on the MS COCO benchmark (Chen et al. 2015). |
| Researcher Affiliation | Academia | (1) Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; (2) University of Chinese Academy of Sciences, Beijing 100049, China |
| Pseudocode | Yes | Algorithm 1: Memory Augmented Image Caption Generation |
| Open Source Code | No | The paper mentions using FAISS, an open-source library, but does not provide access to its own implementation code. |
| Open Datasets | Yes | We utilize the most popular image captioning dataset MSCOCO (Chen et al. 2015) to evaluate the performance of our proposed method. |
| Dataset Splits | Yes | This split contains 113,287 images for training and 5,000 each for validation and test. |
| Hardware Specification | Yes | Once the keys are saved, for the MS COCO dataset, building the cache with 328M entries takes roughly one hour on a single 1080Ti GPU. |
| Software Dependencies | No | The paper mentions using FAISS, an open-source library, and pre-trained Faster-RCNN, but does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | Following (Anderson et al. 2018), the keys used for knowledge retrieval are the 1024-dimensional representations copied from context vectors. We perform a single forward pass over the total training set with the trained captioning model, in order to create the keys and values. A FAISS index is then created using 1.5M randomly sampled keys to learn 2K cluster centroids, and keys are quantized to 64 bytes. During inference, we query the memory with k = 512 most similar entries, and the index looks up 32 cluster centroids while searching for the next word candidates. The temperature T is set to 100 and the balancing parameter λ is selected based on the CIDEr score on the validation set. |
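
The experiment setup above describes retrieving the k nearest memory entries and mixing them with the captioning model's prediction through a temperature T and a balancing parameter λ. The table does not give the exact formula, so the sketch below assumes a common nearest-neighbour formulation: a softmax over negative squared distances divided by T, followed by a linear interpolation weighted by λ. The function names `knn_distribution` and `interpolate` are hypothetical, and a brute-force search stands in for the FAISS index to keep the example dependency-free.

```python
import math
from collections import defaultdict


def knn_distribution(query, keys, values, k, temperature):
    """Turn the k nearest memory entries into a distribution over next words.

    Assumption: each neighbour is weighted by exp(-distance / temperature),
    and weights of neighbours that store the same word are summed.
    """
    # Squared L2 distance from the query to every stored key (brute force).
    dists = [sum((q - x) ** 2 for q, x in zip(query, key)) for key in keys]
    nearest = sorted(range(len(keys)), key=lambda i: dists[i])[:k]

    # Subtract the smallest distance before exponentiating for stability.
    d_min = dists[nearest[0]]
    weights = defaultdict(float)
    for i in nearest:
        weights[values[i]] += math.exp(-(dists[i] - d_min) / temperature)

    total = sum(weights.values())
    return {word: w / total for word, w in weights.items()}


def interpolate(p_model, p_knn, lam):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_model."""
    vocab = set(p_model) | set(p_knn)
    return {
        word: lam * p_knn.get(word, 0.0) + (1.0 - lam) * p_model.get(word, 0.0)
        for word in vocab
    }


# Toy memory: three 2-dimensional keys, each paired with the word that followed.
keys = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
values = ["dog", "dog", "cat"]

p_knn = knn_distribution([0.0, 0.0], keys, values, k=2, temperature=100.0)
p_final = interpolate({"dog": 0.5, "cat": 0.5}, p_knn, lam=0.4)
```

In the reported setup the keys are 1024-dimensional, k is 512, and T is 100. λ is chosen on the validation set, so the value 0.4 here is only a placeholder.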