Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning

Authors: Riccardo Del Chiaro, Bartłomiej Twardowski, Andrew Bagdanov, Joost van de Weijer

NeurIPS 2020 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "We apply our approaches to the incremental image captioning problem on two new continual learning benchmarks we define using the MS-COCO and Flickr30K datasets. Our results demonstrate that RATT is able to sequentially learn five captioning tasks while incurring no forgetting of previously learned ones."
Researcher Affiliation: Academia. Riccardo Del Chiaro (MICC, University of Florence, Florence 50134, FI, Italy) EMAIL; Bartłomiej Twardowski (CVC, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain) EMAIL; Andrew D. Bagdanov (MICC, University of Florence, Florence 50134, FI, Italy) EMAIL; Joost van de Weijer (CVC, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain) EMAIL
Pseudocode: No. The paper presents mathematical equations and procedures, such as LSTM definitions and attention-mask applications, but does not include a distinct block of pseudocode or a clearly labeled algorithm.
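For context, the attention-mask mechanism mentioned in the response follows the HAT-style gating that RATT builds on: a learned per-task embedding is passed through a scaled sigmoid to produce a near-binary mask over units. A minimal sketch (function names and toy inputs are illustrative, not the authors' code):

```python
import math

def task_mask(task_embedding, s):
    """Per-unit sigmoid gate; as the scale s grows, the mask approaches binary {0, 1}."""
    return [1.0 / (1.0 + math.exp(-s * e)) for e in task_embedding]

def apply_mask(hidden, mask):
    """Gate hidden activations element-wise with the task mask."""
    return [h * m for h, m in zip(hidden, mask)]

# Toy example: positive embeddings open the gate, negative ones close it.
mask = task_mask([2.0, -2.0, 0.5], s=50.0)
gated = apply_mask([1.0, 1.0, 1.0], mask)
```

With a large scale the gate saturates, which is what lets masks act as (approximately) binary unit allocations per task.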
Open Source Code: Yes. "Code for experiments available here: https://github.com/delchiaro/RATT"
Open Datasets: Yes. "We applied all techniques on the Flickr30K [31] and MS-COCO [20] captioning datasets (...)" [31] Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. IJCV, 123(1):74–93, 2017. [20] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
Dataset Splits: Yes. "Task: transport — Train: 14,266, Valid: 3,431, Test: 3,431, Vocab (words): 3,116 (...) Models were trained for a fixed number of epochs and the best model according to BLEU-4 performance on the validation set was chosen for each task."
Hardware Specification: Yes. "We thank NVIDIA Corporation for donating the Titan XP GPU that was used to conduct the experiments."
Software Dependencies: No. The paper mentions using "PyTorch" and the "Adam [16] optimizer" but does not specify version numbers for any software components.
Experiment Setup: Yes. "All experiments were conducted using PyTorch, networks were trained using the Adam [16] optimizer, and all hyperparameters were tuned over validation sets. Batch size, learning rate, and max-decode length for evaluation were set, respectively, to 128, 4e-4, and 26 for MS-COCO, and 32, 1e-4, and 40 for Flickr30k. (...) We apply s = 1/s_max + (s_max − 1/s_max) · (b − 1)/(B − 1), where b is the batch index and B is the total number of batches in the epoch. We used s_max = 2000 and s_max = 400 for experiments on Flickr30k and MS-COCO, respectively."
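The annealing schedule quoted above can be written directly as a function. This is a sketch of the standard HAT-style schedule the formula describes (the function name is illustrative; b is the 1-based batch index and B the number of batches per epoch):

```python
def anneal_s(b, B, s_max):
    """Anneal the gate scale s from 1/s_max at the first batch to s_max at the last."""
    return 1.0 / s_max + (s_max - 1.0 / s_max) * (b - 1) / (B - 1)

# Over an epoch the scale sweeps from soft gating toward nearly binary gating.
start = anneal_s(1, 100, 400.0)   # 1/400 = 0.0025
end = anneal_s(100, 100, 400.0)   # 400.0
```

The sweep means early batches keep the sigmoid gates soft (so embeddings receive gradients) while late batches harden them toward the binary masks used at inference.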