MULE: Multimodal Universal Language Embedding
Authors: Donghyun Kim, Kuniaki Saito, Kate Saenko, Stan Sclaroff, Bryan Plummer
AAAI 2020, pp. 11254-11261 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our embeddings on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 20.2% on a single language compared to prior work, with the most significant gains seen on languages with relatively few annotations. |
| Researcher Affiliation | Academia | Boston University {donhk, keisaito, saenko, sclaroff, bplum}@bu.edu |
| Pseudocode | No | The paper describes the training process and loss functions but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is publicly available1. 1http://cs-people.bu.edu/donhk/research/MULE.html |
| Open Datasets | Yes | Datasets Multi30K (Elliott et al. 2016; 2017; Barrault et al. 2018). The Multi30K dataset augments Flickr30K (Young et al. 2014) with image descriptions in German, French, and Czech. [...] MSCOCO (Lin et al. 2014). MSCOCO is a large-scale dataset which contains 123,287 images and each image is paired with 5 English sentences. [...] (Miyazaki and Shimizu 2016) released the YJ Captions 26K dataset which contains about 26K images in MSCOCO where each image is paired with 5 independent Japanese descriptions. (Li et al. 2019) provides 22,218 independent Chinese image descriptions for 20,341 images in MSCOCO. |
| Dataset Splits | Yes | We use the dataset's provided splits, which use 29K/1K/1K images for training/test/validation. [...] We randomly selected 1K images for the testing and validation sets from the images which contain descriptions across all three languages, for a total of 2K images, and used the rest for training. (A hedged sketch of this split procedure follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions various models and frameworks (e.g., fastText, BERT, MUSE, ResNet, LSTM) but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | Following (Wang et al. 2019), we enumerate all positive and negative pairs in a minibatch and use the top K most violated constraints, where K = 10 in our experiments. [...] We also kept the image representation fixed, and only the two fully connected layers after the CNN in Fig. 2 were trained. [...] Finally, our overall objective function is to find: $\hat{\theta} = \arg\min_{\theta} \lambda_1 \mathcal{L}_{LM} - \lambda_2 \mathcal{L}_{LC} + \lambda_3 \mathcal{L}_{triplet}$, $\hat{W}_{lc} = \arg\min_{W_{lc}} \lambda_2 \mathcal{L}_{LC}$ (Eq. 4), where $\theta$ includes all parameters in our network except for the language classifier, $W_{lc}$ contains the parameters of the language classifier, and $\lambda$ determines the weight on each loss. (A hedged sketch of the top-K triplet mining follows the table.) |
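The MSCOCO multilingual split quoted in the Dataset Splits row (1K test and 1K validation images drawn only from images that have captions in all three languages, with the remainder used for training) can be approximated with a few lines of Python. This is a minimal sketch, not the authors' released code: the function name, the dictionary layout, and the fixed seed are illustrative assumptions.

```python
import random

def make_mscoco_splits(caption_ids_by_lang, val_size=1000, test_size=1000, seed=0):
    """Approximate the split described in the paper: eval images are drawn only
    from images captioned in all three languages (English, Japanese, Chinese);
    everything else goes to training.

    `caption_ids_by_lang` maps language code -> set of image ids with captions.
    """
    all_ids = set.union(*caption_ids_by_lang.values())
    # Only images described in every language are eligible for val/test.
    eligible = sorted(set.intersection(*caption_ids_by_lang.values()))

    rng = random.Random(seed)
    rng.shuffle(eligible)
    test_ids = set(eligible[:test_size])
    val_ids = set(eligible[test_size:test_size + val_size])
    train_ids = all_ids - test_ids - val_ids
    return train_ids, val_ids, test_ids
```

The paper does not publish which images were randomly selected, so this sketch only mirrors the stated procedure; an exact reproduction would need the authors' released split files.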
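The Experiment Setup row quotes the top-K most violated constraints used for the bidirectional triplet loss (K = 10, following Wang et al. 2019) together with the overall objective of Eq. 4. Below is a minimal PyTorch-style sketch of one plausible reading of that mining scheme; the function name, the margin value, and the choice to pool both retrieval directions before taking the top K are assumptions, not details taken from the paper or its released code.

```python
import torch
import torch.nn.functional as F

def topk_triplet_loss(img_emb, sent_emb, margin=0.3, k=10):
    """Bidirectional triplet loss over all pairs in a minibatch, keeping only
    the top-K most violated constraints. The i-th image matches the i-th
    sentence; the margin value here is illustrative, not from the paper."""
    img_emb = F.normalize(img_emb, dim=1)
    sent_emb = F.normalize(sent_emb, dim=1)
    sim = img_emb @ sent_emb.t()                 # (B, B) cosine similarities
    pos = sim.diag().view(-1, 1)                 # similarity of matching pairs

    # Margin violations for image->sentence and sentence->image retrieval.
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    viol_i2s = (margin + sim - pos).clamp(min=0)[off_diag]
    viol_s2i = (margin + sim - pos.t()).clamp(min=0)[off_diag]
    violations = torch.cat([viol_i2s, viol_s2i])

    k = min(k, violations.numel())
    return violations.topk(k).values.mean()

# Usage on random features, just to show the expected shapes.
imgs, sents = torch.randn(32, 512), torch.randn(32, 512)
loss_triplet = topk_triplet_loss(imgs, sents, margin=0.3, k=10)
```

In the full objective of Eq. 4 this term would be weighted by λ3 and combined with the language-model loss L_LM and, adversarially, the language-classifier loss L_LC, with the classifier parameters W_lc updated by a separate minimization of λ2 L_LC.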