MAT: A Multimodal Attentive Translator for Image Captioning
Authors: Chang Liu, Fuchun Sun, Changhu Wang, Feng Wang, Alan Yuille
IJCAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on the popular MS COCO benchmark validate the proposed approach; following the dataset splits of previous work, the proposed model surpasses state-of-the-art methods on all metrics. |
| Researcher Affiliation | Collaboration | (1) Department of Computer Science, Tsinghua University; (2) Toutiao AI Lab; (3) Department of Electronic Engineering, UESTC; (4) Cognitive Science & Computer Science, Johns Hopkins University |
| Pseudocode | No | The paper describes the model and its components using equations and diagrams, but does not include structured pseudocode or an algorithm block. |
| Open Source Code | No | The paper does not state that its own source code for the proposed methodology is publicly available. It only refers to publicly available evaluation toolkits (MS COCO evaluation toolkit and SPICE evaluation tool). |
| Open Datasets | Yes | MS COCO [Lin et al., 2014] contains 82,783 training, 40,504 validation, and 40,775 testing images, whose ground-truth annotations are withheld on the MS COCO evaluation server. |
| Dataset Splits | Yes | MS COCO [Lin et al., 2014] contains 82,783 training, 40,504 validation, and 40,775 testing images. To compare with previous methods, we follow the split of previous work [Karpathy and Fei-Fei, 2015; Xu et al., 2015]: 5,000 images for validation and 5,000 images for testing, both drawn from the 40,504-image validation set (a split sketch appears after the table). |
| Hardware Specification | Yes | On a machine with a Titan X (Maxwell) GPU, the training process takes about 12 hours. |
| Software Dependencies | No | The paper mentions using the R-FCN and ResNet-101 architectures but does not specify any software dependencies (e.g., libraries or frameworks) with version numbers. |
| Experiment Setup | Yes | The hidden state size is set to 512. To handle variable-length source and target sequences in batch training, a bucket-and-padding method is used: sequences are split into buckets according to source and target length and zero-padded to the bucket length. Specifically, training uses four buckets, {(2, 10), (4, 15), (6, 20), (8, 30)}. The network is trained with SGD with a batch size of 64. The learning rate is set to 0.1 and halved when the training loss stops decreasing. To avoid overfitting, dropout of 0.5 is applied to all layers, and training is stopped early based on the 5,000-image validation split. Decoding uses beam search with a beam size of 20, which iteratively keeps the best b candidates when generating the next word. (Sketches of the bucketing scheme, training schedule, and beam search follow the table.) |
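The dataset-split row follows the widely used Karpathy split: 5,000 validation and 5,000 test images carved out of the official 40,504-image MS COCO validation set. Below is a minimal sketch of how such a split can be constructed; the `karpathy_style_split` helper, the shuffle, and the seed are illustrative assumptions, not details from the paper.

```python
import random

def karpathy_style_split(val_image_ids, seed=123):
    """Hold out 5,000 images for validation and 5,000 for testing from the
    official MS COCO validation set; the remainder is commonly folded back
    into training. The seed and shuffling are assumptions made here so the
    sketch is reproducible, not details from the paper."""
    ids = list(val_image_ids)
    random.Random(seed).shuffle(ids)
    test_ids = ids[:5000]
    dev_ids = ids[5000:10000]
    extra_train_ids = ids[10000:]  # often merged with the 82,783 training images
    return dev_ids, test_ids, extra_train_ids

# Usage with placeholder ids standing in for the 40,504 validation images.
dev, test, extra = karpathy_style_split(range(40504))
assert len(dev) == len(test) == 5000
```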
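The bucket-and-padding scheme in the setup row groups (source, target) pairs by length and zero-pads each side to its bucket size so batches have uniform shape. A minimal sketch follows; the bucket sizes {(2, 10), (4, 15), (6, 20), (8, 30)} are the paper's, while `PAD_ID`, the helper names, and the toy token ids are hypothetical.

```python
# Bucket sizes (max_source_len, max_target_len) from the experiment setup.
BUCKETS = [(2, 10), (4, 15), (6, 20), (8, 30)]
PAD_ID = 0  # assumed padding token id

def assign_bucket(src_len, tgt_len):
    """Return the index of the smallest bucket that fits both lengths."""
    for i, (s, t) in enumerate(BUCKETS):
        if src_len <= s and tgt_len <= t:
            return i
    return None  # too long for every bucket; such pairs would be skipped

def pad_to(seq, length):
    """Zero-pad a token-id sequence to the given bucket length."""
    return seq + [PAD_ID] * (length - len(seq))

# Usage: group (source, target) pairs by bucket, padding each to bucket size.
pairs = [([3, 7], [5, 9, 2]), ([3, 7, 1, 4, 8, 6], [5] * 18)]
batches = {i: [] for i in range(len(BUCKETS))}
for src, tgt in pairs:
    b = assign_bucket(len(src), len(tgt))
    if b is not None:
        s_len, t_len = BUCKETS[b]
        batches[b].append((pad_to(src, s_len), pad_to(tgt, t_len)))
```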
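The optimization schedule in the same row (SGD, batch size 64, learning rate 0.1 halved when the training loss plateaus, dropout 0.5, early stopping on the validation split) can be sketched as the control flow below. `train_one_epoch` and `evaluate` are hypothetical stand-ins that return synthetic losses so the loop is runnable, and the patience value is an assumption; only the schedule itself comes from the paper.

```python
import random

def train_one_epoch(lr, batch_size=64, dropout=0.5):
    return random.uniform(0.5, 2.0)  # synthetic training loss (stand-in)

def evaluate():
    return random.uniform(0.5, 2.0)  # synthetic validation loss (stand-in)

lr = 0.1                      # initial learning rate from the paper
prev_train = float("inf")
best_val, bad_epochs, patience = float("inf"), 0, 3  # patience is assumed
for epoch in range(50):
    train_loss = train_one_epoch(lr)
    if train_loss >= prev_train:
        lr *= 0.5             # halve when training loss stops decreasing
    prev_train = train_loss
    val_loss = evaluate()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break             # early stop on the 5,000-image validation split
```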
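Finally, the decoding step uses beam search with a beam size of 20, iteratively keeping the best b candidates while generating the next word. Here is a compact, generic sketch; the `step_fn` interface and the toy bigram table in the usage example are assumptions, since the paper does not describe its decoding API.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_size=20, max_len=30):
    """Generic beam search. step_fn(prefix) returns (token, log_prob) pairs
    for possible continuations; this interface is an assumption. Each step
    keeps the best `beam_size` partial sequences by cumulative log-prob."""
    beams = [([start_token], 0.0)]
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in step_fn(seq):
                candidates.append((seq + [token], score + logp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (completed if seq[-1] == end_token else beams).append((seq, score))
        if not beams:
            break
    pool = completed or beams
    return max(pool, key=lambda c: c[1])[0]

# Toy usage: a fixed bigram table over a 4-token vocabulary (illustrative).
TABLE = {0: [(1, math.log(0.6)), (2, math.log(0.4))],
         1: [(2, math.log(0.7)), (3, math.log(0.3))],
         2: [(3, math.log(1.0))]}
print(beam_search(lambda seq: TABLE.get(seq[-1], []), start_token=0, end_token=3))
```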