Multi-Lingual Acquisition on Multimodal Pre-training for Cross-modal Retrieval

Authors: Liang Zhang, Anwen Hu, Qin Jin

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments, 4.1 Dataset Description, 4.2 Implementation Details, 4.3 Evaluation on Multilingual Image-Text Retrieval, 4.4 Evaluation on Multilingual Video-Text Retrieval, 4.5 Ablation Studies
Researcher Affiliation | Academia | (1) School of Information, Renmin University of China; (2) Key Laboratory of Data Engineering and Knowledge Engineering (MOE), Renmin University of China
Pseudocode | No | The paper describes the model architecture and training strategy textually and with equations, but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Please check the supplementary material for the code and model.
Open Datasets | Yes | We train our model with the Conceptual Captions (CC) dataset [34] and two translation-enhanced versions of CC [44, 4]. We use Multi30K [10], MSCOCO [6, 25, 40] and XTD [1] for multilingual image-text retrieval evaluation, and MSRVTT [39, 16] for multilingual video-text retrieval evaluation. ... Dataset released at https://github.com/zmykevin/UC2, under the MIT license. Released at https://github.com/FreddeFrallan/Multilingual-CLIP, under the MIT license.
Dataset Splits | Yes | We use the standard train, dev and test splits defined by Young et al. [41]. We follow the standard train, dev and test splits for English and Japanese as in [20]. We follow the standard train/dev splits in [39], and evaluate on the 1K test split as described in [42].
Hardware Specification | Yes | The whole training process takes about 12 hours to converge on 1 Nvidia V100 GPU.
Software Dependencies | No | The paper mentions specific models and optimizers (e.g., Adam optimizer [23], M-BERT [8]), but does not provide specific version numbers for software dependencies such as programming languages or deep learning frameworks.
Experiment Setup | Yes | The hidden dimension of the language acquirers is set to 256, ... optimize multiple language acquirers iteratively with a batch size of 128. The NLT stage performs 117,150 steps with a learning rate of 1e-4, and the LE stage performs 11,715 steps with a learning rate of 3e-6. The temperature τ is set to 0.01. For both stages, we use the Adam optimizer [23] with a linear warm-up for the first 10% of steps.
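
The Experiment Setup row quotes enough hyper-parameters to sketch the two-stage optimisation schedule. The Python snippet below is a minimal illustration assembled only from those quoted numbers, not the authors' released code; the function names (`make_optimizer`, `linear_warmup`) and the choice to hold the learning rate constant after warm-up are assumptions of this sketch.

```python
# Minimal sketch, assuming PyTorch; model and data code are placeholders.
import torch

STAGES = {
    # stage name: (total optimisation steps, peak learning rate), as quoted above
    "NLT": (117_150, 1e-4),
    "LE": (11_715, 3e-6),
}
BATCH_SIZE = 128           # batch size while optimising language acquirers iteratively
ACQUIRER_HIDDEN_DIM = 256  # hidden dimension of each language acquirer
TEMPERATURE = 0.01         # temperature tau reported for the contrastive objective


def linear_warmup(step: int, total_steps: int, warmup_frac: float = 0.1) -> float:
    """LR multiplier: linear warm-up over the first 10% of steps.

    The paper only specifies the warm-up; keeping the rate constant
    afterwards is an assumption of this sketch.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    return min(1.0, (step + 1) / warmup_steps)


def make_optimizer(params, stage: str):
    """Build the Adam optimizer and warm-up scheduler for one training stage."""
    total_steps, lr = STAGES[stage]
    optimizer = torch.optim.Adam(params, lr=lr)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda s: linear_warmup(s, total_steps)
    )
    return optimizer, scheduler, total_steps
```

Under these assumptions, a reproducer would build one optimizer/scheduler pair per stage (first NLT, then LE) and call `scheduler.step()` after every optimizer update.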