Teaching Machines to Describe Images with Natural Language Feedback
Authors: Huan Ling, Sanja Fidler
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a hierarchical phrase-based RNN as our image captioning model, and design a feedback network that provides reward to the learner by conditioning on the human-provided feedback. We show that by exploiting descriptive feedback on new images our model learns to perform better than when given human written captions on these images. |
| Researcher Affiliation | Academia | Huan Ling1, Sanja Fidler1,2 University of Toronto1, Vector Institute2 {linghuan,fidler}@cs.toronto.edu |
| Pseudocode | No | The paper describes the model computationally with equations and function definitions, but does not include a formally labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Our code and data will be released (http://www.cs.toronto.edu/~linghuan/feedbackImageCaption/) to facilitate more human-like training of captioning models. |
| Open Datasets | Yes | To train our hierarchical model, we first process MS-COCO image caption data [20] using the Stanford Core NLP toolkit [23]. |
| Dataset Splits | Yes | We use 82K images for training, 2K for validation, and 4K for testing. In particular, we randomly chose 2K val and 4K test images from the official validation split. |
| Hardware Specification | No | The paper mentions 'NVIDIA for their donation of the GPUs' in the acknowledgment section, but does not specify the exact GPU models, CPU, or other hardware components used for experiments. |
| Software Dependencies | No | The paper mentions tools like 'Stanford Core NLP toolkit' and 'ADAM optimizer' but does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | We use the ADAM optimizer [9] with learning rate 0.001. We use Adam with learning rate 1e-6 and batch size 50. As in [29], we follow an annealing schedule. We first optimize the cross entropy loss for the first K epochs, then for the following t = 1, ..., T epochs, we use cross entropy loss for the first P − floor(t/m) phrases (where P denotes the number of phrases), and the policy gradient algorithm for the remaining floor(t/m) phrases. We choose m = 5. |
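The annealing schedule quoted in the Experiment Setup row can be made concrete with a small sketch. This is a hypothetical helper (not from the paper's code release) that, for a given epoch `t` of the annealing phase, returns which loss each phrase is trained with, assuming `P` phrases and the paper's `m = 5`:

```python
import math

def losses_for_epoch(t, P, m=5):
    """Return the per-phrase loss type for annealing epoch t (1-indexed).

    Per the schedule: the first P - floor(t/m) phrases use cross entropy
    ("XE"); the remaining floor(t/m) phrases use policy gradient ("PG").
    Hypothetical illustration of the quoted schedule, not the authors' code.
    """
    k = min(math.floor(t / m), P)  # phrases switched to policy gradient so far
    return ["XE"] * (P - k) + ["PG"] * k

# With m = 5, one more phrase switches to policy gradient every 5 epochs,
# starting from the last phrase and moving toward the first.
```

For example, with `P = 5` all phrases use cross entropy at epoch 1, the final phrase switches to policy gradient by epoch 5, and all phrases use policy gradient once `t >= 25`.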