MAT: A Multimodal Attentive Translator for Image Captioning
Authors: Chang Liu, Fuchun Sun, Changhu Wang, Feng Wang, Alan Yuille
IJCAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the popular MS COCO benchmark validate the proposed approach; following the dataset splits of previous work, the proposed model surpasses state-of-the-art methods on all metrics. |
| Researcher Affiliation | Collaboration | (1) Department of Computer Science, Tsinghua University; (2) Toutiao AI Lab; (3) Department of Electronic Engineering, UESTC; (4) Cognitive Science & Computer Science, Johns Hopkins University |
| Pseudocode | No | The paper describes the model and its components using equations and diagrams, but does not include structured pseudocode or an algorithm block. |
| Open Source Code | No | The paper does not state that its own source code for the proposed methodology is publicly available. It only refers to publicly available evaluation toolkits (MS COCO evaluation toolkit and SPICE evaluation tool). |
| Open Datasets | Yes | MS COCO [Lin et al., 2014] contains 82,783 training, 40,504 validation, and 40,775 testing images, whose ground-truth annotations are withheld on the MS COCO evaluation server. |
| Dataset Splits | Yes | MS COCO [Lin et al., 2014] contains 82,783 training, 40,504 validation, and 40,775 testing images. To compare with previous methods, we follow the split of previous work [Karpathy and Fei-Fei, 2015; Xu et al., 2015]: 5,000 images for validation and 5,000 images for testing, both drawn from the 40,504-image validation set (a split sketch appears after the table). |
| Hardware Specification | Yes | On a machine with a Titan X (Maxwell) GPU, the training process takes about 12 hours. |
| Software Dependencies | No | The paper mentions using the R-FCN and ResNet-101 architectures but does not specify any software dependencies (e.g., libraries or frameworks) with version numbers. |
| Experiment Setup | Yes | The hidden state size is set to 512. To handle variable-length source and target sequences in batch training, a bucket-and-padding method is used: sequences are split into buckets according to source and target length and zero-padded to the bucket length. Specifically, training uses four buckets, {(2, 10), (4, 15), (6, 20), (8, 30)}. The network is trained with SGD with a batch size of 64. The learning rate is set to 0.1 and halved when the training loss stops decreasing. To avoid overfitting, dropout of 0.5 is applied to all layers, and training is stopped early based on the 5,000-image validation split. Decoding uses beam search with a beam size of 20, which iteratively keeps the best b candidates when generating the next word. (Sketches of the bucketing scheme, training schedule, and beam search follow the table.) |
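The dataset-split row follows the widely used Karpathy split: 5,000 validation and 5,000 test images carved out of the official 40,504-image MS COCO validation set. Below is a minimal sketch of how such a split can be constructed; the `karpathy_style_split` helper, the shuffle, and the seed are illustrative assumptions, not details from the paper.

```python
import random

def karpathy_style_split(val_image_ids, seed=123):
    """Hold out 5,000 images for validation and 5,000 for testing from the
    official MS COCO validation set; the remainder is commonly folded back
    into training. The seed and shuffling are assumptions made here so the
    sketch is reproducible, not details from the paper."""
    ids = list(val_image_ids)
    random.Random(seed).shuffle(ids)
    test_ids = ids[:5000]
    dev_ids = ids[5000:10000]
    extra_train_ids = ids[10000:]  # often merged with the 82,783 training images
    return dev_ids, test_ids, extra_train_ids

# Usage with placeholder ids standing in for the 40,504 validation images.
dev, test, extra = karpathy_style_split(range(40504))
assert len(dev) == len(test) == 5000
```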
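The bucket-and-padding scheme in the setup row groups (source, target) pairs by length and zero-pads each side to its bucket size so batches have uniform shape. A minimal sketch follows; the bucket sizes {(2, 10), (4, 15), (6, 20), (8, 30)} are the paper's, while `PAD_ID`, the helper names, and the toy token ids are hypothetical.

```python
# Bucket sizes (max_source_len, max_target_len) from the experiment setup.
BUCKETS = [(2, 10), (4, 15), (6, 20), (8, 30)]
PAD_ID = 0  # assumed padding token id

def assign_bucket(src_len, tgt_len):
    """Return the index of the smallest bucket that fits both lengths."""
    for i, (s, t) in enumerate(BUCKETS):
        if src_len <= s and tgt_len <= t:
            return i
    return None  # too long for every bucket; such pairs would be skipped

def pad_to(seq, length):
    """Zero-pad a token-id sequence to the given bucket length."""
    return seq + [PAD_ID] * (length - len(seq))

# Usage: group (source, target) pairs by bucket, padding each to bucket size.
pairs = [([3, 7], [5, 9, 2]), ([3, 7, 1, 4, 8, 6], [5] * 18)]
batches = {i: [] for i in range(len(BUCKETS))}
for src, tgt in pairs:
    b = assign_bucket(len(src), len(tgt))
    if b is not None:
        s_len, t_len = BUCKETS[b]
        batches[b].append((pad_to(src, s_len), pad_to(tgt, t_len)))
```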
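The optimization schedule in the same row (SGD, batch size 64, learning rate 0.1 halved when the training loss plateaus, dropout 0.5, early stopping on the validation split) can be sketched as the control flow below. `train_one_epoch` and `evaluate` are hypothetical stand-ins that return synthetic losses so the loop is runnable, and the patience value is an assumption; only the schedule itself comes from the paper.

```python
import random

def train_one_epoch(lr, batch_size=64, dropout=0.5):
    return random.uniform(0.5, 2.0)  # synthetic training loss (stand-in)

def evaluate():
    return random.uniform(0.5, 2.0)  # synthetic validation loss (stand-in)

lr = 0.1                      # initial learning rate from the paper
prev_train = float("inf")
best_val, bad_epochs, patience = float("inf"), 0, 3  # patience is assumed
for epoch in range(50):
    train_loss = train_one_epoch(lr)
    if train_loss >= prev_train:
        lr *= 0.5             # halve when training loss stops decreasing
    prev_train = train_loss
    val_loss = evaluate()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break             # early stop on the 5,000-image validation split
```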
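Finally, the decoding step uses beam search with a beam size of 20, iteratively keeping the best b candidates while generating the next word. Here is a compact, generic sketch; the `step_fn` interface and the toy bigram table in the usage example are assumptions, since the paper does not describe its decoding API.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_size=20, max_len=30):
    """Generic beam search. step_fn(prefix) returns (token, log_prob) pairs
    for possible continuations; this interface is an assumption. Each step
    keeps the best `beam_size` partial sequences by cumulative log-prob."""
    beams = [([start_token], 0.0)]
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in step_fn(seq):
                candidates.append((seq + [token], score + logp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (completed if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    pool = completed or beams
    return max(pool, key=lambda c: c[1])[0]

# Toy usage: a fixed bigram table over a 4-token vocabulary (illustrative).
TABLE = {0: [(1, math.log(0.6)), (2, math.log(0.4))],
         1: [(2, math.log(0.7)), (3, math.log(0.3))],
         2: [(3, math.log(1.0))]}
print(beam_search(lambda seq: TABLE.get(seq[-1], []), start_token=0, end_token=3))
```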