Partially-Supervised Image Captioning
Authors: Peter Anderson, Stephen Gould, Mark Johnson
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applying our approach to an existing neural captioning model, we achieve state of the art results on the novel object captioning task using the COCO dataset. We further show that we can train a captioning model to describe new visual concepts from the Open Images dataset while maintaining competitive COCO evaluation scores. |
| Researcher Affiliation | Academia | Peter Anderson Macquarie University Sydney, Australia p.anderson@mq.edu.au [...] Stephen Gould Australian National University Canberra, Australia stephen.gould@anu.edu.au [...] Mark Johnson Macquarie University Sydney, Australia mark.johnson@mq.edu.au [...] Now at Georgia Tech (peter.anderson@gatech.edu) |
| Pseudocode | Yes | Algorithm 1 Beam search decoding [...] Algorithm 2 Constrained beam search decoding [13] |
| Open Source Code | Yes | To encourage future work, we have released our code and trained models via the project website. [Footnote 2]: www.panderson.me/constrained-beam-search |
| Open Datasets | Yes | We use the COCO 2014 captions dataset [52] containing 83K training images and 41K validation images, each labeled with five human-annotated captions. [...] object annotation labels for 25 additional animal classes from the Open Images V4 dataset [14]. |
| Dataset Splits | Yes | We use the splits proposed by Hendricks et al. [21] for novel object captioning, in which all images with captions that mention one of eight selected objects (including synonyms and plural forms) are removed from the caption training set, which is reduced to 70K images. The original COCO validation set is split 50% for validation and 50% for testing. |
| Hardware Specification | Yes | Training (after initialization) takes around 8 hours using two Titan X GPUs. |
| Software Dependencies | No | The paper mentions software such as a 'Faster R-CNN object detector', a 'ResNet-101 CNN', a 'Long Short-Term Memory (LSTM) network', and 'GloVe' embeddings, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | When training on image labels, we use the online version of our proposed training algorithm, constructing each minibatch of 100 with an equal number of complete and partially-specified training examples. We use SGD with an initial learning rate of 0.001, decayed to zero over 5K iterations, with a lower learning rate for the pre-trained word embeddings. In beam search and constrained beam search decoding we use a beam size of 5. |
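The pseudocode row above notes that the paper gives Algorithm 1 (beam search decoding) and Algorithm 2 (constrained beam search decoding) only as pseudocode. As a rough illustration of what unconstrained beam search decoding does, here is a minimal toy sketch; it is not the paper's implementation, and `step_fn`, the token names, and the pruning details are assumptions. The constrained variant (Algorithm 2) additionally tracks which decoding constraints each hypothesis has satisfied, which this sketch omits.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_size=5, max_len=10):
    """Minimal beam search decoder (toy sketch, not the paper's code).

    step_fn(sequence) -> dict mapping each candidate next token to its
    probability. Returns the highest log-probability sequence that ends
    in end_token (or the best partial sequence after max_len steps).
    """
    # Each beam entry is (cumulative log-probability, token sequence).
    beams = [(0.0, [start_token])]
    completed = []
    for _ in range(max_len):
        # Expand every surviving hypothesis by one token.
        candidates = []
        for log_prob, seq in beams:
            for token, prob in step_fn(seq).items():
                candidates.append((log_prob + math.log(prob), seq + [token]))
        # Keep the top-scoring extensions; finished hypotheses leave the beam.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates:
            if cand[1][-1] == end_token:
                completed.append(cand)
            else:
                beams.append(cand)
            if len(beams) == beam_size:
                break
        if not beams:
            break
    pool = completed or beams
    return max(pool, key=lambda c: c[0])[1]
```

With a toy `step_fn` where the greedy first choice leads to a low-probability continuation, a beam size of 1 returns the greedy path while a larger beam recovers the higher-probability sequence, which is the point of using a beam size of 5 at decode time.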