Adaptively Aligned Image Captioning via Adaptive Attention Time

Authors: Lun Huang, Wenmin Wang, Yaxian Xia, Jie Chen

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our proposed method on the popular MS COCO dataset [16]. MS COCO dataset contains 123,287 images labeled with at least 5 captions, including 82,783 for training and 40,504 for validation. MS COCO also provides 40,775 images as the test set for online evaluation. We use the Karpathy data split [13] for the performance comparisons, where 5,000 images are used for validation, 5,000 images for testing, and the rest for training.
Researcher Affiliation | Academia | School of Electronic and Computer Engineering, Peking University; Peng Cheng Laboratory; Macau University of Science and Technology
Pseudocode | No | The paper provides mathematical equations and model diagrams (Figure 1) but does not contain a dedicated pseudocode or algorithm block.
Open Source Code | Yes | Code is available at https://github.com/husthuaan/AAT.
Open Datasets | Yes | We evaluate our proposed method on the popular MS COCO dataset [16]. MS COCO dataset contains 123,287 images labeled with at least 5 captions, including 82,783 for training and 40,504 for validation.
Dataset Splits | Yes | We use the Karpathy data split [13] for the performance comparisons, where 5,000 images are used for validation, 5,000 images for testing, and the rest for training. (A split-loading sketch follows the table.)
Hardware Specification | No | The paper mentions using a pre-trained Faster-RCNN [20] model to extract features but does not provide specific details on the hardware used for training or evaluating their proposed model.
Software Dependencies | No | The paper mentions the use of the 'ADAM [14] optimizer' and 'LSTM layers' but does not provide version numbers for any software dependencies or libraries (e.g., Python, PyTorch, or TensorFlow).
Experiment Setup | Yes | We train our model under cross-entropy loss for 20 epochs with a minibatch size of 10, and ADAM [14] optimizer is used with a learning rate initialized with 1e-4 and annealed by 0.8 every 2 epochs. We increase the probability of feeding back a sample of the word posterior by 0.05 every 3 epochs [4]. Then we use self-critical sequence training (SCST) [21] to optimize the CIDEr-D score with REINFORCE for another 20 epochs with an initial learning rate of 1e-5 and annealed by 0.5 when the CIDEr-D score on the validation split has not improved for some training steps. (A schedule sketch follows the table.)
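
For the "Dataset Splits" row above: the Karpathy split is commonly distributed as a single JSON file that tags every MS COCO image with a split name. The sketch below is a minimal illustration, not the authors' code, of how those tags are conventionally turned into the 113,287 / 5,000 / 5,000 train/val/test partition; the file path and field names are assumptions based on Karpathy's standard release (dataset_coco.json), not details given in the paper or this report.

```python
import json
from collections import defaultdict

# Hypothetical path; neither the paper nor this report names the split file.
KARPATHY_SPLIT_JSON = "data/dataset_coco.json"

def load_karpathy_splits(path=KARPATHY_SPLIT_JSON):
    """Partition MS COCO images by the 'split' tag of the Karpathy split file.

    Each image is tagged 'train', 'restval', 'val', or 'test'; the 'restval'
    images are conventionally folded into training, which yields roughly
    113,287 train, 5,000 validation, and 5,000 test images.
    """
    with open(path) as f:
        data = json.load(f)

    splits = defaultdict(list)
    for img in data["images"]:
        split = "train" if img["split"] == "restval" else img["split"]
        splits[split].append(img["filename"])
    return splits

if __name__ == "__main__":
    splits = load_karpathy_splits()
    for name in ("train", "val", "test"):
        print(name, len(splits[name]))  # expected: ~113287 / 5000 / 5000
```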
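
For the "Experiment Setup" row: the quoted schedule reduces to a few scalar rules. The standalone sketch below reproduces only those rules (cross-entropy learning-rate annealing, the scheduled-sampling probability, and the SCST-phase halving); the model, data pipeline, and REINFORCE reward computation are outside the scope of this report, and the starting value and cap of the scheduled-sampling probability are assumptions, since the quote gives only the increment.

```python
# Minimal sketch of the two-phase schedule quoted under "Experiment Setup".
# Only the scalar schedules are shown; model and data handling are omitted.

def xe_learning_rate(epoch: int) -> float:
    """Cross-entropy phase: lr starts at 1e-4, annealed by 0.8 every 2 epochs."""
    return 1e-4 * (0.8 ** (epoch // 2))

def scheduled_sampling_prob(epoch: int) -> float:
    """Probability of feeding back a sampled word: +0.05 every 3 epochs.

    Assumes a start of 0.0 and applies no cap, since the report states only
    the increment.
    """
    return 0.05 * (epoch // 3)

def scst_learning_rate(current_lr: float, cider_improved: bool) -> float:
    """SCST phase: lr starts at 1e-5, halved when CIDEr-D stops improving."""
    return current_lr if cider_improved else current_lr * 0.5

if __name__ == "__main__":
    # 20 cross-entropy epochs with mini-batch size 10, per the quoted setup.
    for epoch in range(20):
        print(f"XE epoch {epoch:2d}: lr={xe_learning_rate(epoch):.2e}, "
              f"ss_prob={scheduled_sampling_prob(epoch):.2f}")
```

In a PyTorch training loop, the SCST-phase rule maps naturally onto torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.5) stepped with the validation CIDEr-D score, though the paper does not say which scheduler implementation was used.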