Multi-Lingual Acquisition on Multimodal Pre-training for Cross-modal Retrieval
Authors: Liang Zhang, Anwen Hu, Qin Jin
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4 "Experiments" with subsections 4.1 Dataset Description, 4.2 Implementation Details, 4.3 Evaluation on Multilingual Image-Text Retrieval, 4.4 Evaluation on Multilingual Video-Text Retrieval, and 4.5 Ablation Studies. |
| Researcher Affiliation | Academia | (1) School of Information, Renmin University of China; (2) Key Laboratory of Data Engineering and Knowledge Engineering (MOE), Renmin University of China |
| Pseudocode | No | The paper describes the model architecture and training strategy textually and with equations, but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Please check the supplementary material for the code and model. |
| Open Datasets | Yes | We train our model with the Conceptual Captions (CC) dataset [34] and two translation-enhanced versions of the CC [44, 4]. We use Multi30K [10], MSCOCO [6, 25, 40] and XTD [1] for multilingual image-text retrieval evaluation, and MSRVTT [39, 16] for multilingual video-text retrieval evaluation. ... Dataset released at https://github.com/zmykevin/UC2, under MIT license. Released at https://github.com/FreddeFrallan/Multilingual-CLIP, under MIT license. |
| Dataset Splits | Yes | We use the standard train, dev and test splits defined by Young et al. [41]. We follow the standard train, dev and test splits for English and Japanese as in [20]. We follow the standard train/dev splits in [39], and evaluate on the 1K test split as described in [42]. |
| Hardware Specification | Yes | The whole training process takes about 12 hours to converge on 1 Nvidia V100 GPU. |
| Software Dependencies | No | The paper mentions specific models and optimizers (e.g., Adam optimizer [23], M-BERT [8]), but does not provide specific version numbers for software dependencies such as programming languages or deep learning frameworks. |
| Experiment Setup | Yes | The hidden dimension of the language acquirers is set to 256, ... optimize multiple language acquirers iteratively with a batch size of 128. The NLT stage performs 117,150 steps with a learning rate of 1e-4, and the LE stage performs 11,715 steps with a learning rate of 3e-6. The temperature τ is set to 0.01. For both stages, we use the Adam optimizer [23] with a linear warm-up for the first 10% of steps. |
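The Experiment Setup row above fixes the key hyperparameters (hidden dimension 256, batch size 128, NLT stage of 117,150 steps at lr 1e-4, LE stage of 11,715 steps at lr 3e-6, temperature 0.01, Adam with linear warm-up over the first 10% of steps). The sketch below illustrates how that two-stage schedule could be wired up; it is a hypothetical reconstruction, not the authors' code. The `DummyAcquirer` module, the InfoNCE-style `contrastive_loss`, and the random training data are placeholders for the paper's language acquirers, objective, and datasets.

```python
# Hypothetical sketch of the reported training schedule; only the numeric
# hyperparameters come from the paper's Experiment Setup description.
import torch
from torch import nn
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

HIDDEN_DIM = 256       # hidden dimension of each language acquirer
BATCH_SIZE = 128
TAU = 0.01             # temperature of the contrastive objective

STAGES = {
    "NLT": {"steps": 117_150, "lr": 1e-4},  # Native Language Transfer stage
    "LE":  {"steps": 11_715,  "lr": 3e-6},  # Language Exposure stage
}

class DummyAcquirer(nn.Module):
    """Placeholder for a language acquirer; the real architecture is described in the paper."""
    def __init__(self, dim: int = HIDDEN_DIM):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

def contrastive_loss(a: torch.Tensor, b: torch.Tensor, tau: float = TAU) -> torch.Tensor:
    """InfoNCE-style loss with temperature tau (stand-in for the paper's objective)."""
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0))
    return nn.functional.cross_entropy(logits, targets)

def run_stage(model: nn.Module, stage: str) -> None:
    cfg = STAGES[stage]
    warmup_steps = int(0.10 * cfg["steps"])  # linear warm-up for the first 10% of steps
    optimizer = Adam(model.parameters(), lr=cfg["lr"])
    scheduler = LambdaLR(optimizer, lambda s: min(1.0, (s + 1) / max(1, warmup_steps)))
    for _ in range(cfg["steps"]):
        # Random features stand in for paired multimodal / multilingual embeddings.
        src = torch.randn(BATCH_SIZE, HIDDEN_DIM)
        tgt = torch.randn(BATCH_SIZE, HIDDEN_DIM)
        loss = contrastive_loss(model(src), tgt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

if __name__ == "__main__":
    acquirer = DummyAcquirer()
    run_stage(acquirer, "NLT")  # first stage: higher learning rate, more steps
    run_stage(acquirer, "LE")   # second stage: low learning rate fine-tuning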
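```

The warm-up is implemented as a multiplicative factor on the base learning rate via `LambdaLR`, rising linearly to 1.0 over the first 10% of each stage's steps and held constant afterwards, which matches the linear warm-up described in the setup.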