MULE: Multimodal Universal Language Embedding

Authors: Donghyun Kim, Kuniaki Saito, Kate Saenko, Stan Sclaroff, Bryan Plummer

AAAI 2020, pp. 11254-11261

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of our embeddings on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 20.2% on a single language compared to prior work, with the most significant gains seen on languages with relatively few annotations. (A sketch of the mean-recall metric follows the table.)
Researcher Affiliation | Academia | Boston University {donhk, keisaito, saenko, sclaroff, bplum}@bu.edu
Pseudocode | No | The paper describes the training process and loss functions but does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is publicly available. (Footnote 1: http://cs-people.bu.edu/donhk/research/MULE.html)
Open Datasets | Yes | Multi30K (Elliott et al. 2016; 2017; Barrault et al. 2018): the Multi30K dataset augments Flickr30K (Young et al. 2014) with image descriptions in German, French, and Czech. [...] MSCOCO (Lin et al. 2014): MSCOCO is a large-scale dataset which contains 123,287 images, each paired with 5 English sentences. [...] Miyazaki and Shimizu (2016) released the YJ Captions 26K dataset, which contains about 26K MSCOCO images, each paired with 5 independent Japanese descriptions. Li et al. (2019) provide 22,218 independent Chinese image descriptions for 20,341 images in MSCOCO.
Dataset Splits | Yes | We use the dataset's provided splits, which use 29K/1K/1K images for training/test/validation. [...] We randomly selected 1K images for the testing and validation sets from the images which contain descriptions across all three languages, for a total of 2K images, and used the rest for training. (A split-selection sketch follows the table.)
Hardware Specification | No | The paper does not specify the hardware used for its experiments (no GPU/CPU models, processor speeds, or memory amounts).
Software Dependencies | No | The paper mentions various models and frameworks (e.g., fastText, BERT, MUSE, ResNet, LSTM) but does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | Following Wang et al. (2019), we enumerate all positive and negative pairs in a minibatch and use the top K most violated constraints, where K = 10 in our experiments. [...] We also kept the image representation fixed, and only the two fully connected layers after the CNN in Fig. 2 were trained. [...] Finally, our overall objective function is to find: $\hat{\theta} = \arg\min_{\theta} \; \lambda_1 L_{LM} - \lambda_2 L_{LC} + \lambda_3 L_{triplet}$, $\hat{W}_{lc} = \arg\min_{W_{lc}} \; \lambda_2 L_{LC}$ (4), where $\theta$ includes all parameters in our network except for the language classifier, $W_{lc}$ contains the parameters of the language classifier, and the $\lambda$ terms weight each loss. (A training-step sketch follows the table.)
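
For reference, the "mean recall" cited in the Research Type row is the standard retrieval metric: the average of Recall@{1, 5, 10} over both directions, image-to-sentence and sentence-to-image. Below is a minimal NumPy sketch, assuming a precomputed image-by-sentence similarity matrix with five captions per image; the function and variable names are illustrative, not from the paper.

    import numpy as np

    def recall_at_k(ranks, k):
        """Fraction of queries whose first correct match ranks in the top k."""
        return float(np.mean(ranks < k))

    def mean_recall(sim, captions_per_image=5):
        """sim: (n_images, n_images * captions_per_image) similarities,
        where sentence j belongs to image j // captions_per_image."""
        n_images, n_sents = sim.shape

        # Image -> sentence: rank of the best-ranked ground-truth caption.
        i2s_ranks = []
        order = np.argsort(-sim, axis=1)              # best sentence first
        for i in range(n_images):
            gt = np.arange(i * captions_per_image, (i + 1) * captions_per_image)
            pos = np.where(np.isin(order[i], gt))[0]  # positions of GT captions
            i2s_ranks.append(pos.min())
        i2s_ranks = np.array(i2s_ranks)

        # Sentence -> image: rank of the single ground-truth image.
        s2i_ranks = []
        order = np.argsort(-sim.T, axis=1)            # best image first
        for j in range(n_sents):
            s2i_ranks.append(np.where(order[j] == j // captions_per_image)[0][0])
        s2i_ranks = np.array(s2i_ranks)

        recalls = [recall_at_k(r, k) for r in (i2s_ranks, s2i_ranks)
                   for k in (1, 5, 10)]
        return 100.0 * np.mean(recalls)               # mR as a percentage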
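The split construction quoted in the Dataset Splits row (1K test and 1K validation MSCOCO images drawn only from images captioned in all three languages, the rest for training) amounts to the following sketch. The caption-index structure and names are assumptions for illustration, not from the paper.

    import random

    def make_splits(all_images, en_ids, ja_ids, zh_ids, seed=0):
        """Randomly pick 1K test + 1K val images from those captioned in
        English, Japanese, and Chinese; everything else is training data."""
        eligible = sorted(set(en_ids) & set(ja_ids) & set(zh_ids))
        rng = random.Random(seed)
        picked = rng.sample(eligible, 2000)
        test, val = set(picked[:1000]), set(picked[1000:])
        train = [im for im in all_images if im not in test and im not in val]
        return train, sorted(test), sorted(val)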
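The Experiment Setup row combines two pieces: a triplet loss restricted to the K = 10 most-violated constraints in each minibatch, and the adversarial split of Eq. (4), where the language classifier is updated separately from the rest of the network. Below is a minimal PyTorch sketch of both, assuming a batch-by-batch similarity matrix with matching pairs on the diagonal; the model/classifier interfaces, margin value, and loss weights are illustrative assumptions, not the authors' code.

    import torch

    def topk_triplet_loss(sim, margin=0.05, k=10):
        """Triplet loss over the K most-violated constraints in a minibatch.
        sim: (B, B) image-sentence similarities; sim[i, i] are the positives."""
        pos = sim.diag()
        # Hinge violations for every negative pair, in both directions.
        cost_s = (margin + sim - pos.view(-1, 1)).clamp(min=0)  # image anchors
        cost_i = (margin + sim - pos.view(1, -1)).clamp(min=0)  # sentence anchors
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        violations = torch.cat([cost_s[~mask], cost_i[~mask]])
        return violations.topk(min(k, violations.numel())).values.mean()

    # Alternating update matching Eq. (4): the classifier minimizes
    # lambda2 * L_LC over its own parameters W_lc, while the rest of the
    # network minimizes lambda1 * L_LM - lambda2 * L_LC + lambda3 * L_triplet
    # (i.e., it tries to fool the language classifier).
    def train_step(batch, model, classifier, opt_model, opt_clf,
                   lambdas=(1.0, 1.0, 1.0)):
        l1, l2, l3 = lambdas

        # Step 1: update the language classifier alone (embeddings detached).
        opt_clf.zero_grad()
        l_lc = classifier.loss(model.embed(batch).detach(), batch.lang_labels)
        (l2 * l_lc).backward()
        opt_clf.step()

        # Step 2: update all other parameters against the classifier.
        opt_model.zero_grad()
        emb = model.embed(batch)
        loss = (l1 * model.lm_loss(batch)
                - l2 * classifier.loss(emb, batch.lang_labels)
                + l3 * topk_triplet_loss(model.similarity(batch)))
        loss.backward()
        opt_model.step()

Detaching the embeddings in step 1 keeps the classifier update from changing the embedding network, which is what separating theta from W_lc in Eq. (4) requires.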