Human Consensus-Oriented Image Captioning

Authors: Ziwei Wang, Zi Huang, Yadan Luo

IJCAI 2020

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments are conducted on the MS-COCO image captioning dataset, demonstrating that the proposed human consensus-oriented training method can significantly improve training efficiency and model effectiveness. |
| Researcher Affiliation | Academia | Ziwei Wang, Zi Huang and Yadan Luo, School of Information Technology and Electrical Engineering, The University of Queensland, Australia (ziwei.wang@uq.edu.au, huang@itee.uq.edu.au, lyadanluol@gmail.com). |
| Pseudocode | No | The paper describes its methods verbally and with equations but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states only that "the details will be released in the GitHub repository." |
| Open Datasets | Yes | The proposed HCO model is evaluated on the MS-COCO [Lin et al., 2014] image captioning dataset following the Karpathy split [Karpathy and Fei-Fei, 2017]. |
| Dataset Splits | Yes | The train, validation and test splits contain 113,287, 5,000 and 5,000 images, respectively. |
| Hardware Specification | No | The paper mentions Faster-RCNN and a ResNet-101 convolutional neural network for feature extraction, but does not specify the hardware (GPUs, CPUs, etc.) used to train or run the models. |
| Software Dependencies | No | The paper mentions Adam for optimisation and the CNN-LSTM and Transformer architectures, but does not give version numbers for any software libraries (e.g., PyTorch, TensorFlow, scikit-learn) or programming languages. |
| Experiment Setup | Yes | The model is optimised using Adam with a learning rate of 5e-4. CNN-LSTM: the language decoder is a 1-layer LSTM with a 1024-d hidden state, and the attention module uses an encoding size of 1024. Transformer: the embedding dimension is 512, the positional encoding size is 2048, and the number of attention layers is 6. |
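The reported split sizes can be checked against the Karpathy split metadata. Below is a minimal sketch, assuming the split is available as the commonly distributed `dataset_coco.json` file (a filename not given in the paper) and that the 113,287-image training set corresponds to the usual `train` + `restval` convention.

```python
import json
from collections import Counter

# Assumed filename: the Karpathy split metadata is commonly distributed
# as a single JSON file named "dataset_coco.json".
with open("dataset_coco.json") as f:
    data = json.load(f)

# Each image record carries a "split" field: train / restval / val / test.
counts = Counter(img["split"] for img in data["images"])

# By convention, the 113,287-image training set is train + restval.
print("train:", counts["train"] + counts["restval"])  # expected 113287
print("val:  ", counts["val"])                        # expected 5000
print("test: ", counts["test"])                       # expected 5000
```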
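The hyperparameters in the Experiment Setup row map directly onto standard building blocks. The following is a minimal PyTorch sketch, not the authors' code: the LSTM input size and the Transformer head count are assumptions, since the paper does not state them, and the 2048-d value is read as the position-wise feed-forward size.

```python
import torch
import torch.nn as nn

# Learning rate as reported in the Experiment Setup row.
LR = 5e-4

# CNN-LSTM variant: 1-layer LSTM decoder with a 1024-d hidden state;
# the attention module also uses a 1024-d encoding. The input size
# (word-embedding dimension) is an assumption, not stated in the paper.
lstm_decoder = nn.LSTM(input_size=1024, hidden_size=1024, num_layers=1,
                       batch_first=True)

# Transformer variant: 512-d embeddings, 2048-d feed-forward layer,
# 6 attention layers. The number of heads (8) is an assumption.
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8,
                                           dim_feedforward=2048)
transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

# Both variants are optimised with Adam at the reported learning rate.
optimizer = torch.optim.Adam(transformer_decoder.parameters(), lr=LR)
```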