Teaching Machines to Describe Images with Natural Language Feedback

Authors: Huan Ling, Sanja Fidler

NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose a hierarchical phrase-based RNN as our image captioning model, and design a feedback network that provides reward to the learner by conditioning on the human-provided feedback. We show that by exploiting descriptive feedback on new images our model learns to perform better than when given human written captions on these images.
Researcher Affiliation | Academia | Huan Ling¹, Sanja Fidler¹,² (¹University of Toronto, ²Vector Institute); {linghuan,fidler}@cs.toronto.edu
Pseudocode | No | The paper describes the model computationally with equations and function definitions, but does not include a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Our code and data will be released (http://www.cs.toronto.edu/~linghuan/feedbackImageCaption/) to facilitate more human-like training of captioning models.
Open Datasets | Yes | To train our hierarchical model, we first process MS-COCO image caption data [20] using the Stanford Core NLP toolkit [23].
Dataset Splits | Yes | We use 82K images for training, 2K for validation, and 4K for testing. In particular, we randomly chose 2K val and 4K test images from the official validation split.
Hardware Specification | No | The paper thanks NVIDIA for their donation of the GPUs in the acknowledgments, but does not specify the exact GPU models, CPU, or other hardware components used for experiments.
Software Dependencies | No | The paper mentions tools like the Stanford Core NLP toolkit and the ADAM optimizer but does not specify version numbers for any software dependencies.
Experiment Setup | Yes | We use the ADAM optimizer [9] with learning rate 0.001. We use Adam with learning rate 1e-6 and batch size 50. As in [29], we follow an annealing schedule. We first optimize the cross entropy loss for the first K epochs, then for the following t = 1, ..., T epochs, we use cross entropy loss for the first (P - floor(t/m)) phrases (where P denotes the number of phrases), and the policy gradient algorithm for the remaining floor(t/m) phrases. We choose m = 5.
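
Below is a minimal Python sketch of the annealing schedule quoted in the Experiment Setup row, assuming a caption decomposed into P phrases and two user-supplied loss callables (xe_loss for cross entropy, pg_loss for policy gradient); the function and variable names are illustrative and not taken from the authors' released code.

    def schedule_phrase_losses(phrases, epoch, K, xe_loss, pg_loss, m=5):
        """Choose a per-phrase loss following the quoted annealing schedule.

        Epochs 0..K-1 use cross entropy for every phrase. For the t-th epoch
        after warm-up (t = 1, ..., T), the first P - floor(t/m) phrases keep
        cross entropy and the last floor(t/m) phrases switch to policy gradient.
        """
        P = len(phrases)
        if epoch < K:
            return [xe_loss(p) for p in phrases]
        t = epoch - K + 1              # t = 1, ..., T after the warm-up epochs
        n_pg = min(t // m, P)          # floor(t/m), capped at the phrase count
        n_xe = P - n_pg
        return [xe_loss(p) if i < n_xe else pg_loss(p)
                for i, p in enumerate(phrases)]

    # Toy usage with stand-in losses; real ones would score the RNN's phrase
    # outputs against the reference caption or the feedback-network reward.
    phrases = ["a man", "riding a horse", "on the beach"]
    losses = schedule_phrase_losses(phrases, epoch=12, K=5,
                                    xe_loss=lambda p: ("XE", p),
                                    pg_loss=lambda p: ("PG", p))
    print(losses)  # with t = 8 and m = 5, the last floor(8/5) = 1 phrase uses PG

With m = 5, one additional phrase is handed over to the policy gradient objective every five epochs, so training anneals gradually from fully supervised decoding to reinforcement learning on all phrases.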