Visual Agreement Regularized Training for Multi-Modal Machine Translation

Authors: Pengcheng Yang, Boxing Chen, Pei Zhang, Xu Sun (pp. 9418-9425)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The results show that our approaches can outperform competitive baselines by a large margin on the Multi30k dataset. Further analysis demonstrates that the proposed regularized training can effectively improve the agreement of attention on the image, leading to better use of visual information.
Researcher Affiliation | Collaboration | Pengcheng Yang (1,2), Boxing Chen (3), Pei Zhang (3), Xu Sun (1,2); (1) Center for Data Science, Peking University; (2) MOE Key Lab of Computational Linguistics, School of EECS, Peking University; (3) Alibaba DAMO Academy, Hangzhou, China
Pseudocode | No | The paper describes its methods and equations but does not present them in a structured pseudocode or algorithm block (an illustrative sketch of the visual-agreement regularizer is given below this table).
Open Source Code | No | The paper does not include an unambiguous statement that the authors are releasing their source code for the work described, nor does it provide a direct link to a repository containing their implementation.
Open Datasets | Yes | Following previous work (Calixto, Liu, and Campbell 2017), we evaluate both our approach and all baselines on the Multi30K dataset (Elliott, Frank, and Specia 2016), which contains 29,000 instances for training and 1,014 for development. We use test-2017 for evaluation, which consists of 1,000 testing instances.
Dataset Splits | Yes | Following previous work (Calixto, Liu, and Campbell 2017), we evaluate both our approach and all baselines on the Multi30K dataset (Elliott, Frank, and Specia 2016), which contains 29,000 instances for training and 1,014 for development. We use test-2017 for evaluation, which consists of 1,000 testing instances.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions the 'Moses SMT Toolkit', 'fast_align', and 'Faster R-CNN' as tools but does not specify their versions, nor the versions of other key software components such as the programming language or deep learning framework.
Experiment Setup | Yes | For each image, we consistently keep the 36 highest-probability objects. λ1 in Eq. (19) is set to 0.2 and 0.5 for EN→DE and EN→FR translation, respectively. λ2 in Eq. (20) is set to 0.2 and 0.1 for DE→EN and FR→EN translation, respectively. For both source and target language, we limit the vocabulary size to 10,000. The size of word embedding is set to 512 and embeddings are learned from scratch. An extra linear layer is utilized to project all visual features into 512 dimensions. For the Seq2Seq version of our approach, the textual encoder and decoder are each a 2-layer LSTM with hidden size 512. We set the textual encoder to be bidirectional. For the Transformer version of our approach, we set the hidden size of the multi-head attention layer to 512 and the hidden size of the feed-forward layer to 2,048. The number of heads in multi-head attention is set to 8, while a Transformer layer consists of 6 blocks. We adopt the Adam optimization method with an initial learning rate of 0.0003 for training, and the learning rate is halved after each epoch. We also make use of dropout to avoid over-fitting. (These settings are collected into a configuration sketch below this table.)
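
Since the paper provides no pseudocode, the following is only a minimal Python sketch of what a visual attention-agreement regularizer could look like. It assumes that the forward (source→target) and backward (target→source) models each yield an attention distribution over the same set of image regions, already aggregated per region (e.g. averaged over decoding steps). The function names `agreement_loss` and `regularized_loss`, the squared-distance penalty, and the `lambda_reg` weight are illustrative assumptions, not the authors' exact Eq. (19)/(20).

```python
import torch


def agreement_loss(attn_forward: torch.Tensor,
                   attn_backward: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between two attention distributions over the
    same image regions.

    Both tensors are assumed to have shape (batch, num_regions) and to be
    normalized already; the squared-distance penalty is an illustrative
    choice, not necessarily the divergence used in the paper.
    """
    return ((attn_forward - attn_backward) ** 2).sum(dim=-1).mean()


def regularized_loss(translation_loss: torch.Tensor,
                     attn_forward: torch.Tensor,
                     attn_backward: torch.Tensor,
                     lambda_reg: float = 0.2) -> torch.Tensor:
    """Training objective: translation cross-entropy plus a weighted
    agreement term; lambda_reg plays the role of the lambda1/lambda2
    coefficients reported in the paper."""
    return translation_loss + lambda_reg * agreement_loss(attn_forward,
                                                          attn_backward)
```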
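
For orientation, the hyperparameters and dataset sizes quoted above can be collected into a single configuration sketch. The dictionary below and the optimizer/scheduler calls are a hypothetical PyTorch reconstruction assuming a standard Adam setup; `model` is a placeholder module, not the authors' implementation (no code was released).

```python
import torch

# Hyperparameters quoted from the paper's experiment setup.
CONFIG = {
    "num_image_objects": 36,        # highest-probability objects kept per image
    "lambda1": {"en-de": 0.2, "en-fr": 0.5},   # Eq. (19)
    "lambda2": {"de-en": 0.2, "fr-en": 0.1},   # Eq. (20)
    "vocab_size": 10_000,           # both source and target
    "embedding_dim": 512,           # embeddings learned from scratch
    "visual_feature_dim": 512,      # visual features projected by an extra linear layer
    # Seq2Seq variant
    "rnn_layers": 2,                # bidirectional LSTM encoder, LSTM decoder
    "rnn_hidden": 512,
    # Transformer variant
    "d_model": 512,
    "ffn_hidden": 2048,
    "num_heads": 8,
    "num_blocks": 6,
    # Optimization
    "learning_rate": 3e-4,          # Adam, halved after each epoch
    "lr_decay_per_epoch": 0.5,
    # Multi30K splits quoted in the Open Datasets / Dataset Splits rows
    "train_size": 29_000,
    "dev_size": 1_014,
    "test_size": 1_000,             # test-2017
}

# Illustrative optimizer/scheduler setup matching the quoted schedule;
# the Linear layer is only a stand-in for the Seq2Seq or Transformer model.
model = torch.nn.Linear(CONFIG["d_model"], CONFIG["vocab_size"])
optimizer = torch.optim.Adam(model.parameters(), lr=CONFIG["learning_rate"])
scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer, gamma=CONFIG["lr_decay_per_epoch"]
)  # call scheduler.step() once per epoch to halve the learning rate
```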