Partially-Supervised Image Captioning

Authors: Peter Anderson, Stephen Gould, Mark Johnson

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Applying our approach to an existing neural captioning model, we achieve state of the art results on the novel object captioning task using the COCO dataset. We further show that we can train a captioning model to describe new visual concepts from the Open Images dataset while maintaining competitive COCO evaluation scores.
Researcher Affiliation | Academia | Peter Anderson, Macquarie University, Sydney, Australia, p.anderson@mq.edu.au [...] Stephen Gould, Australian National University, Canberra, Australia, stephen.gould@anu.edu.au [...] Mark Johnson, Macquarie University, Sydney, Australia, mark.johnson@mq.edu.au [...] Now at Georgia Tech (peter.anderson@gatech.edu)
Pseudocode | Yes | Algorithm 1 Beam search decoding [...] Algorithm 2 Constrained beam search decoding [13] (a hedged decoding sketch appears after this table)
Open Source Code | Yes | To encourage future work, we have released our code and trained models via the project website [2]. [Footnote 2: www.panderson.me/constrained-beam-search]
Open Datasets | Yes | We use the COCO 2014 captions dataset [52] containing 83K training images and 41K validation images, each labeled with five human-annotated captions. [...] object annotation labels for 25 additional animal classes from the Open Images V4 dataset [14].
Dataset Splits | Yes | We use the splits proposed by Hendricks et al. [21] for novel object captioning, in which all images with captions that mention one of eight selected objects (including synonyms and plural forms) are removed from the caption training set, which is reduced to 70K images. The original COCO validation set is split 50% for validation and 50% for testing. (A filtering sketch appears after this table.)
Hardware Specification | Yes | Training (after initialization) takes around 8 hours using two Titan X GPUs.
Software Dependencies | No | The paper mentions software components such as the 'Faster R-CNN object detector', 'ResNet-101 CNN', 'Long Short-Term Memory (LSTM) network', and 'GloVe', but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | When training on image labels, we use the online version of our proposed training algorithm, constructing each minibatch of 100 with an equal number of complete and partially-specified training examples. We use SGD with an initial learning rate of 0.001, decayed to zero over 5K iterations, with a lower learning rate for the pre-trained word embeddings. In beam search and constrained beam search decoding we use a beam size of 5. (A training-setup sketch appears after this table.)
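
For readers unfamiliar with the decoding routines named in the Pseudocode row, the following is a minimal Python sketch of constrained beam search in the spirit of the paper's Algorithm 2, not the authors' released code. `step_fn` is a hypothetical stand-in for the captioning model: given a partial token sequence, it returns a mapping from candidate next tokens to log-probabilities. The real algorithm tracks constraints with a finite-state machine; this sketch uses the simpler state "set of constraint words emitted so far".

def constrained_beam_search(step_fn, constraints, beam_size=5, max_len=20,
                            eos="</s>"):
    """Best-scoring sequence whose tokens include every word in `constraints`.

    `step_fn(tokens) -> {token: log_prob}` is a hypothetical model interface.
    """
    # One beam per subset of already-satisfied constraints.
    beams = {frozenset(): [((), 0.0)]}  # state -> [(tokens, log_prob), ...]
    for _ in range(max_len):
        candidates = {}
        for state, hyps in beams.items():
            for tokens, score in hyps:
                if tokens and tokens[-1] == eos:
                    # Finished hypotheses are carried forward unchanged.
                    candidates.setdefault(state, []).append((tokens, score))
                    continue
                for tok, logp in step_fn(tokens).items():
                    new_state = state | ({tok} & constraints)
                    candidates.setdefault(new_state, []).append(
                        (tokens + (tok,), score + logp))
        # Prune to the top `beam_size` hypotheses within each constraint state.
        beams = {s: sorted(c, key=lambda h: -h[1])[:beam_size]
                 for s, c in candidates.items()}
    # Only hypotheses that satisfied all constraints are valid outputs.
    complete = beams.get(frozenset(constraints), [])
    return max(complete, key=lambda h: h[1], default=None)

With `constraints=set()` this reduces to ordinary beam search (Algorithm 1); grouping hypotheses by constraint state is what prevents the beam from discarding lower-scoring candidates that are still making progress toward mentioning the required words.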
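
The held-out split described in the Dataset Splits row can be reproduced with a caption filter along the following lines. This is a sketch, not the authors' script: the held-out object words and synonym/plural lists shown are illustrative and should be checked against Hendricks et al. [21], and the input file is assumed to follow the COCO captions JSON schema.

import json
import re

# Illustrative word forms for the eight held-out objects (an assumption;
# verify against the exact lists used by Hendricks et al. [21]).
HELD_OUT_WORDS = [
    "bottle", "bottles", "bus", "buses", "couch", "couches", "sofa", "sofas",
    "microwave", "microwaves", "pizza", "pizzas", "racket", "rackets",
    "racquet", "racquets", "suitcase", "suitcases", "luggage",
    "zebra", "zebras",
]
PATTERN = re.compile(r"\b(" + "|".join(HELD_OUT_WORDS) + r")\b", re.IGNORECASE)

def heldout_image_ids(captions_json_path):
    """Ids of images with at least one caption mentioning a held-out object."""
    with open(captions_json_path) as f:
        annotations = json.load(f)["annotations"]
    return {a["image_id"] for a in annotations if PATTERN.search(a["caption"])}

# The caption training set is the COCO 2014 training set minus these ids,
# which is the filtering that reduces it to roughly 70K images.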
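
The Experiment Setup row translates naturally into a short training-configuration sketch. PyTorch is an assumption here (the paper does not name its framework), and `model.embeddings`, `model.other_parameters()`, and the embedding learning rate of 1e-4 are hypothetical placeholders; only the batch composition, the 0.001 base rate, and the 5K-iteration decay come from the quoted text.

import random
import torch

def build_optimizer(model, base_lr=1e-3, embed_lr=1e-4, total_iters=5000):
    """SGD with a lower rate for pre-trained embeddings and linear decay.

    `embed_lr` is a placeholder: the paper says only that the rate for the
    pre-trained word embeddings is lower than the base rate.
    """
    opt = torch.optim.SGD([
        {"params": model.embeddings.parameters(), "lr": embed_lr},
        {"params": model.other_parameters(), "lr": base_lr},
    ])
    # Decay every group's learning rate linearly to zero over `total_iters`.
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lambda it: max(0.0, 1.0 - it / total_iters))
    return opt, sched

def mixed_batches(complete, partial, batch_size=100):
    """Minibatches with equal numbers of complete and partially-specified
    training examples, as in the quoted setup."""
    half = batch_size // 2
    random.shuffle(complete)
    random.shuffle(partial)
    for i in range(0, min(len(complete), len(partial)) - half + 1, half):
        yield complete[i:i + half] + partial[i:i + half]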