Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Multi-Lingual Acquisition on Multimodal Pre-training for Cross-modal Retrieval
Authors: Liang Zhang, Anwen Hu, Qin Jin
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments, 4.1 Dataset Description, 4.2 Implementation Details, 4.3 Evaluation on Multilingual Image-Text Retrieval, 4.4 Evaluation on Multilingual Video-Text Retrieval, 4.5 Ablation Studies |
| Researcher Affiliation | Academia | (1) School of Information, Renmin University of China; (2) Key Laboratory of Data Engineering and Knowledge Engineering (MOE), Renmin University of China |
| Pseudocode | No | The paper describes the model architecture and training strategy textually and with equations, but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Please check the supplementary material for the code and model. |
| Open Datasets | Yes | We train our model with the Conceptual Captions (CC) dataset [34] and two translation enhanced versions of the CC [44, 4]. We use Multi30K [10], MSCOCO [6, 25, 40] and XTD [1] for multilingual image-text retrieval evaluation, and MSRVTT [39, 16] for multilingual video-text retrieval evaluation. ... Dataset released at https://github.com/zmykevin/UC2, under MIT license. Released at https://github.com/FreddeFrallan/Multilingual-CLIP, under MIT license. |
| Dataset Splits | Yes | We use the standard train, dev and test splits defined by Young et al. [41]. We follow the standard train, dev and test splits for English and Japanese as in [20]. We follow the standard train/dev splits in [39], and evaluate on the 1K test split as described in [42]. |
| Hardware Specification | Yes | The whole training process takes about 12 hours to converge on 1 Nvidia V100 GPU. |
| Software Dependencies | No | The paper mentions specific models and optimizers (e.g., Adam optimizer [23], M-BERT [8]), but does not provide specific version numbers for software dependencies such as programming languages or deep learning frameworks. |
| Experiment Setup | Yes | The hidden dimension of the language acquirers is set to 256, ... optimize multiple language acquirers iteratively with a batch size of 128. The NLT stage performs 117,150 steps with a learning rate of 1e-4, and the LE stage performs 11,715 steps with a learning rate of 3e-6. The temperature τ is set to 0.01. For both stages, we use the Adam optimizer [23] with a linear warm-up for the first 10% of steps. |
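
The experiment setup quoted in the table can be written down as a small configuration sketch. The step counts, learning rates, batch size, and the 10% linear warm-up come from the quoted text; the class name `StageConfig`, its fields, and the constant learning rate after warm-up are assumptions made for illustration, since the quoted excerpt does not say how the rate behaves once warm-up ends. This is not the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for one training stage (names are illustrative)."""
    name: str
    total_steps: int
    peak_lr: float
    warmup_percent: int = 10   # "linear warm-up for the first 10% of steps"
    batch_size: int = 128      # batch size quoted in the table

    @property
    def warmup_steps(self) -> int:
        # Integer arithmetic avoids floating-point rounding surprises.
        return self.total_steps * self.warmup_percent // 100

    def lr_at(self, step: int) -> float:
        """Learning rate at a 0-indexed step.

        Linear ramp up to peak_lr during warm-up. Holding peak_lr afterwards
        is an assumption; the quoted text does not specify a decay schedule.
        """
        if step < self.warmup_steps:
            return self.peak_lr * (step + 1) / self.warmup_steps
        return self.peak_lr


# Values quoted in the "Experiment Setup" row.
NLT = StageConfig(name="NLT", total_steps=117_150, peak_lr=1e-4)
LE = StageConfig(name="LE", total_steps=11_715, peak_lr=3e-6)
TEMPERATURE = 0.01        # contrastive temperature tau
ACQUIRER_HIDDEN_DIM = 256  # hidden dimension of the language acquirers

if __name__ == "__main__":
    for stage in (NLT, LE):
        print(stage.name, stage.warmup_steps, stage.lr_at(0), stage.lr_at(stage.total_steps - 1))
```

Under these assumptions the NLT stage warms up over 11,715 steps and the LE stage over 1,171 steps (10% of 11,715, rounded down).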