Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Memory-Augmented Image Captioning
Authors: Zhengcong Fei
Pages: 1317-1324
AAAI 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To better measure its effects, we conduct an extensive empirical evaluation on the MS COCO benchmark (Chen et al. 2015). |
| Researcher Affiliation | Academia | (1) Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; (2) University of Chinese Academy of Sciences, Beijing 100049, China |
| Pseudocode | Yes | Algorithm 1: Memory Augmented Image Caption Generation |
| Open Source Code | No | The paper mentions using FAISS, an open-source library, but does not provide access to its own implementation code. |
| Open Datasets | Yes | We utilize the most popular image captioning dataset MSCOCO (Chen et al. 2015) to evaluate the performance of our proposed method. |
| Dataset Splits | Yes | This split contains 113,287 images for training and 5,000 respectively for validation and test. |
| Hardware Specification | Yes | Once the keys are saved, for the MS COCO dataset, building the cache with 328M entries takes roughly one hour on a single 1080Ti GPU. |
| Software Dependencies | No | The paper mentions using FAISS, an open-source library, and pre-trained Faster-RCNN, but does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | Following (Anderson et al. 2018), the keys used for knowledge retrieval are the 1024-dimensional representations copied from context vectors. We perform a single forward pass over the total training set with the trained captioning model, in order to create the keys and values. A FAISS index is then created using 1.5M randomly sampled keys to learn 2K cluster centroids, and keys are quantized to 64 bytes. During inference, we query the memory with k = 512 most similar entries, and the index looks up 32 cluster centroids while searching for the next word candidates. The temperature T is set to 100 and the balancing parameter λ is selected based on the CIDEr score on the validation set. |
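
The Experiment Setup row describes retrieving the k most similar memory entries, applying a temperature T, and balancing the result against the captioning model with λ. The sketch below illustrates one plausible reading of that step. It is an assumption modeled on nearest-neighbour language models, not the paper's confirmed formulation: the function names (`knn_memory_distribution`, `interpolate`), the squared-L2 distance, and the linear mixing rule are all hypothetical, and a brute-force NumPy search stands in for the quantized FAISS index the paper reports using.

```python
import numpy as np


def knn_memory_distribution(query, keys, values, vocab_size, k=512, temperature=100.0):
    """Turn the k nearest memory entries into a distribution over next words.

    Assumed layout (hypothetical): keys is (N, d) context vectors,
    values is (N,) integer ids of the word that followed each context.
    """
    # Brute-force squared L2 distance to every stored key.
    # (The paper uses a FAISS index instead; this is a stand-in.)
    dists = np.sum((keys - query) ** 2, axis=1)
    k = min(k, len(keys))
    nearest = np.argpartition(dists, k - 1)[:k]

    # Softmax over negative distances, flattened by the temperature.
    logits = -dists[nearest] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    # Neighbours that share a next-word id pool their weight.
    p_mem = np.zeros(vocab_size)
    np.add.at(p_mem, values[nearest], weights)
    return p_mem


def interpolate(p_model, p_mem, lam):
    """Assumed mixing rule: lam weights the memory, (1 - lam) the model."""
    return lam * p_mem + (1.0 - lam) * p_model


# Toy memory with three entries in a 2-d key space and a 4-word vocabulary.
keys = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 20.0]])
values = np.array([1, 2, 1])
query = np.array([0.0, 0.0])

p_mem = knn_memory_distribution(query, keys, values, vocab_size=4, k=2, temperature=100.0)
p_model = np.full(4, 0.25)
p_final = interpolate(p_model, p_mem, lam=0.5)
```

With T = 100 the nearest neighbours receive fairly even weights, so a larger temperature spreads probability across more retrieved words. Under this reading, λ would be chosen by sweeping candidate values and keeping the one with the best validation CIDEr, as the row states.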