TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning

Authors: Zhihao Fan, Zhongyu Wei, Siyuan Wang, Ruize Wang, Zejun Li, Haijun Shan, Xuanjing Huang

IJCAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on MS COCO show the effectiveness of our approach compared to some state-of-the-art models.
Researcher Affiliation | Academia | 1Fudan University, 2Zhejiang Lab, 3Research Institute of Intelligent and Complex Systems, Fudan University, China
Pseudocode | No | No structured pseudocode or algorithm blocks were found.
Open Source Code | No | The paper does not provide concrete access to its own source code. It only refers to external GitHub repositories for COCO Caption evaluation and Scene Graph building, which are not the authors' implementation of TCIC.
Open Datasets | Yes | We evaluate our proposed model on MS COCO [Lin et al., 2014].
Dataset Splits | Yes | We split the dataset following [Karpathy and Fei-Fei, 2015] with 113,287 images in the training set and 5,000 images in the validation and test sets respectively.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. Only general information about training steps is given.
Software Dependencies | No | The paper mentions the use of Faster-RCNN and the Adam optimizer, but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | Our encoder has 3 layers and the decoder has 1 layer, the hidden dimension is 1024, the number of attention heads is 8, and the inner dimension of the feed-forward network is 2,048. The number of parameters in our model is 23.2M. The dropout rate here is 0.3. We first train our proposed model with cross-entropy with 0.2 label smoothing, (λ1, λ2) = (0.5, 10.0), for 10k update steps with 1k warm-up steps, and then train it with reinforcement learning for 40 epochs (40k update steps); K in Eq. (16) is 5. We use a linear-decay learning rate scheduler with 4k warm-up steps; the learning rates for cross-entropy and reinforcement learning are 1e-3 and 8e-5, respectively. The optimizer of our model is Adam [Kingma and Ba, 2014] with (0.9, 0.999). The maximal region numbers per batch are 32,768 and 4,096. During decoding, the beam search size is 3 and the length penalty is 0.1.
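Since the authors do not release code, the reported experiment setup can be restated as a structured configuration. The following is a minimal sketch; all key names (e.g., `encoder_layers`, `rl_lr`) are hypothetical and chosen only to mirror the hyperparameters quoted above, not taken from any actual TCIC implementation.

```python
# Hypothetical configuration dict restating the hyperparameters reported
# in the TCIC paper (no official code exists; key names are invented).
config = {
    # Architecture
    "encoder_layers": 3,
    "decoder_layers": 1,
    "hidden_dim": 1024,
    "attention_heads": 8,
    "ffn_inner_dim": 2048,
    "dropout": 0.3,
    # Cross-entropy pretraining stage
    "label_smoothing": 0.2,
    "loss_weights": {"lambda1": 0.5, "lambda2": 10.0},
    "xent_update_steps": 10_000,
    "xent_warmup_steps": 1_000,
    "xent_lr": 1e-3,
    # Reinforcement-learning fine-tuning stage
    "rl_epochs": 40,
    "rl_update_steps": 40_000,
    "rl_lr": 8e-5,
    "sample_size_K": 5,  # K in Eq. (16) of the paper
    # Shared optimization settings
    "optimizer": {"name": "adam", "betas": (0.9, 0.999)},
    "lr_scheduler": {"type": "linear_decay", "warmup_steps": 4_000},
    # Decoding
    "beam_size": 3,
    "length_penalty": 0.1,
}
```

A reproduction attempt would still need details the paper omits (hardware, library versions, exact region-feature extraction settings), which is consistent with the "No" entries for Hardware Specification and Software Dependencies above.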