MULE: Multimodal Universal Language Embedding
Authors: Donghyun Kim, Kuniaki Saito, Kate Saenko, Stan Sclaroff, Bryan Plummer
AAAI 2020, pp. 11254-11261 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our embeddings on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 20.2% on a single language compared to prior work, with the most significant gains seen on languages with relatively few annotations. |
| Researcher Affiliation | Academia | Boston University {donhk, keisaito, saenko, sclaroff, bplum}@bu.edu |
| Pseudocode | No | The paper describes the training process and loss functions but does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is publicly available1. 1http://cs-people.bu.edu/donhk/research/MULE.html |
| Open Datasets | Yes | Datasets Multi30K (Elliott et al. 2016; 2017; Barrault et al. 2018). The Multi30K dataset augments Flickr30K (Young et al. 2014) with image descriptions in German, French, and Czech. [...] MSCOCO (Lin et al. 2014). MSCOCO is a large-scale dataset which contains 123,287 images and each image is paired with 5 English sentences. [...] (Miyazaki and Shimizu 2016) released the YJ Captions 26K dataset which contains about 26K images in MSCOCO where each image is paired with 5 independent Japanese descriptions. (Li et al. 2019) provides 22,218 independent Chinese image descriptions for 20,341 images in MSCOCO. |
| Dataset Splits | Yes | We use the dataset's provided splits, which use 29K/1K/1K images for training/test/validation. [...] We randomly selected 1K images for the testing and validation sets from the images which contain descriptions across all three languages, for a total of 2K images, and used the rest for training. (A hedged sketch of this split procedure follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions various models and frameworks (e.g., fastText, BERT, MUSE, ResNet, LSTM) but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | Following (Wang et al. 2019), we enumerate all positive and negative pairs in a minibatch and use the top K most violated constraints, where K = 10 in our experiments. [...] We also kept the image representation fixed, and only the two fully connected layers after the CNN in Fig. 2 were trained. [...] Finally, our overall objective function is to find: $\hat{\theta} = \arg\min_{\theta} \lambda_1 \mathcal{L}_{LM} - \lambda_2 \mathcal{L}_{LC} + \lambda_3 \mathcal{L}_{triplet}$, $\hat{W}_{lc} = \arg\min_{W_{lc}} \lambda_2 \mathcal{L}_{LC}$ (Eq. 4), where $\theta$ includes all parameters in our network except for the language classifier, $W_{lc}$ contains the parameters of the language classifier, and $\lambda$ determines the weight on each loss. (A hedged sketch of the top-K triplet mining follows the table.) |
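The MSCOCO multilingual split quoted in the Dataset Splits row (1K test and 1K validation images drawn only from images that have captions in all three languages, with the remainder used for training) can be approximated with a few lines of Python. This is a minimal sketch, not the authors' released code: the function name, the dictionary layout, and the fixed seed are illustrative assumptions.

```python
import random

def make_mscoco_splits(caption_ids_by_lang, val_size=1000, test_size=1000, seed=0):
    """Approximate the split described in the paper: eval images are drawn only
    from images captioned in all three languages (English, Japanese, Chinese);
    everything else goes to training.

    `caption_ids_by_lang` maps language code -> set of image ids with captions.
    """
    all_ids = set.union(*caption_ids_by_lang.values())
    # Only images described in every language are eligible for val/test.
    eligible = sorted(set.intersection(*caption_ids_by_lang.values()))

    rng = random.Random(seed)
    rng.shuffle(eligible)
    test_ids = set(eligible[:test_size])
    val_ids = set(eligible[test_size:test_size + val_size])
    train_ids = all_ids - test_ids - val_ids
    return train_ids, val_ids, test_ids
```

The paper does not publish which images were randomly selected, so this sketch only mirrors the stated procedure; an exact reproduction would need the authors' released split files.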
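The Experiment Setup row quotes the top-K most violated constraints used for the bidirectional triplet loss (K = 10, following Wang et al. 2019) together with the overall objective of Eq. 4. Below is a minimal PyTorch-style sketch of one plausible reading of that mining scheme; the function name, the margin value, and the choice to pool both retrieval directions before taking the top K are assumptions, not details taken from the paper or its released code.

```python
import torch
import torch.nn.functional as F

def topk_triplet_loss(img_emb, sent_emb, margin=0.3, k=10):
    """Bidirectional triplet loss over all pairs in a minibatch, keeping only
    the top-K most violated constraints. The i-th image matches the i-th
    sentence; the margin value here is illustrative, not from the paper."""
    img_emb = F.normalize(img_emb, dim=1)
    sent_emb = F.normalize(sent_emb, dim=1)
    sim = img_emb @ sent_emb.t()                 # (B, B) cosine similarities
    pos = sim.diag().view(-1, 1)                 # similarity of matching pairs

    # Margin violations for image->sentence and sentence->image retrieval.
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    viol_i2s = (margin + sim - pos).clamp(min=0)[off_diag]
    viol_s2i = (margin + sim - pos.t()).clamp(min=0)[off_diag]
    violations = torch.cat([viol_i2s, viol_s2i])

    k = min(k, violations.numel())
    return violations.topk(k).values.mean()

# Usage on random features, just to show the expected shapes.
imgs, sents = torch.randn(32, 512), torch.randn(32, 512)
loss_triplet = topk_triplet_loss(imgs, sents, margin=0.3, k=10)
```

In the full objective of Eq. 4 this term would be weighted by λ3 and combined with the language-model loss L_LM and, adversarially, the language-classifier loss L_LC, with the classifier parameters W_lc updated by a separate minimization of λ2 L_LC.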