Reinforcing an Image Caption Generator Using Off-Line Human Feedback

Authors: Paul Hongsuck Seo, Piyush Sharma, Tomer Levinboim, Bohyung Han, Radu Soricut (pp. 2693-2700)

AAAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical evidence indicates that the proposed method learns to generalize the human raters' judgments to a previously unseen set of images, as judged by a different set of human judges, and additionally on a different, multidimensional side-by-side human evaluation procedure.
Researcher Affiliation | Collaboration | Paul Hongsuck Seo (1,3), Piyush Sharma (2), Tomer Levinboim (2), Bohyung Han (3), Radu Soricut (2). Affiliations: 1 Computer Vision Lab., POSTECH, Korea; 2 Google Research, USA; 3 Computer Vision Lab., ECE & ASRI, Seoul National University, Korea.
Pseudocode | No | The paper describes algorithms and methods in text and mathematical formulas but does not include structured pseudocode or algorithm blocks (e.g., a clearly labeled 'Algorithm 1').
Open Source Code | No | The paper does not contain any explicit statements about releasing code or links to a code repository.
Open Datasets | Yes | In the experiments, we use Conceptual Captions (Sharma et al. 2018), a large-scale captioning dataset that consists of images crawled from the Internet, with captions derived from corresponding Alt-text labels on the webpages. The training and validation splits have approximately 3.3M and 16K samples, respectively. In our experiments, we use the Caption-Quality dataset (Levinboim et al. 2019)... The dataset is divided into training, validation and test splits containing approximately 130K, 7K and 7K rated captions, respectively. To evaluate our models, we run human evaluation studies on the T2 test dataset used in the CVPR 2019 Conceptual Captions Challenge... The dataset contains 1K images sampled from the Open Images Dataset (Kuznetsova et al. 2018).
Dataset Splits | Yes | The training and validation splits have approximately 3.3M and 16K samples, respectively. The dataset is divided into training, validation and test splits containing approximately 130K, 7K and 7K rated captions, respectively.
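
For quick reference, the reported split sizes can be collected into a small configuration sketch. The dictionary below is illustrative only: the keys (conceptual_captions, caption_quality, t2_test) are our own labels rather than identifiers used in the paper, and the counts are the approximate figures quoted above.

# Approximate split sizes reported in the paper (illustrative summary only;
# the dictionary keys are hypothetical labels, not official dataset names).
DATASET_SPLITS = {
    "conceptual_captions": {"train": 3_300_000, "validation": 16_000},
    "caption_quality": {"train": 130_000, "validation": 7_000, "test": 7_000},
    "t2_test": {"images": 1_000},  # CVPR 2019 Conceptual Captions Challenge T2 set
}

if __name__ == "__main__":
    for name, splits in DATASET_SPLITS.items():
        print(name, splits)
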
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory) used for running the experiments. It mentions using the "Google Cloud Vision API", which implies cloud resources, but no specifications are given.
Software Dependencies | No | The paper mentions several software components and frameworks such as the 'Adam optimizer (Kingma and Ba 2014)', 'Transformer Network (Vaswani et al. 2017)', 'BERT encoder (Devlin et al. 2018)', and 'faster-RCNN (Ren et al. 2015)', but it does not specify version numbers for these or other ancillary software dependencies.
Experiment Setup | Yes | We train Baseline using the Adam optimizer (Kingma and Ba 2014) on the training split of the Conceptual dataset for 3M iterations with the batch size of 4,096 and the learning rate of 3.2 × 10⁻⁵. The learning rate is warmed up for 20 epochs and exponentially decayed by a factor of 0.95 every 25 epochs. Baseline+(t) are obtained by fine-tuning Baseline on the merged dataset for 1M iterations, with the learning rate of 3.2 × 10⁻⁷ and the same decaying factor. For On PG... we reduce the batch size for training this model by a 0.25 factor; the value of b in Eq. (2) is set to the moving average of the rating estimates. During Off PG training, for each batch, we sample half of the examples from the Conceptual dataset and the other half from Caption-Quality dataset; b is set to the average of the ratings in the dataset.
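
To make the quoted setup concrete, the following is a minimal pure-Python sketch of two pieces described above: the learning-rate schedule (warmup followed by exponential decay) and an off-line policy-gradient term whose baseline b is the average rating. This is not the authors' code: the linear warmup shape, the function names, and the per-batch computation of b (the quoted setup states that for Off PG, b is the average rating over the whole Caption-Quality dataset) are our assumptions.

def learning_rate(epoch, base_lr=3.2e-5, warmup_epochs=20,
                  decay_factor=0.95, decay_every=25):
    # Schedule as quoted: warm up for 20 epochs, then decay exponentially
    # by a factor of 0.95 every 25 epochs. The linear warmup shape is an
    # assumption; the paper only states that warmup is used.
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * decay_factor ** ((epoch - warmup_epochs) // decay_every)


def offline_pg_loss(log_probs, ratings):
    # REINFORCE-style off-line policy-gradient term: each rated caption's
    # sequence log-likelihood is weighted by (rating - b). Here b is computed
    # from the ratings passed in, for self-containedness; per the quoted
    # setup, Off PG sets b to the dataset-wide average rating.
    b = sum(ratings) / len(ratings)
    return -sum(lp * (r - b) for lp, r in zip(log_probs, ratings)) / len(ratings)


if __name__ == "__main__":
    # Illustrative values only (not taken from the paper).
    print(learning_rate(0), learning_rate(19), learning_rate(120))
    print(offline_pg_loss(log_probs=[-12.3, -8.7, -15.1], ratings=[0.8, 0.2, 0.6]))

In an actual implementation the sequence log-likelihoods would come from the caption model itself, and the resulting loss would be minimized with Adam at the learning rates quoted above.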