Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval
Authors: Ning Wu, Yaobo Liang, Houxing Ren, Linjun Shou, Nan Duan, Ming Gong, Daxin Jiang
IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that model collapse and information leakage can easily occur during contrastive training of language models, but a language-specific memory bank and an asymmetric batch normalization operation play essential roles in preventing collapse and information leakage, respectively. On the multilingual sentence retrieval task Tatoeba, our model achieves new SOTA results among methods that do not use bilingual data. Our model also shows larger gains on Tatoeba when transferring between non-English pairs. On two multi-lingual query-passage retrieval tasks, XOR Retrieve and Mr.TYDI, our model achieves SOTA results in both the zero-shot and supervised settings, even among pretraining models that use bilingual data. |
| Researcher Affiliation | Industry | Ning Wu¹, Yaobo Liang², Houxing Ren¹, Linjun Shou¹, Nan Duan², Ming Gong¹ and Daxin Jiang¹; ¹Microsoft STCA, ²Microsoft Research Asia; {wuning, yalia, v-houxingren, lisho, nanduan, migon, djiang}@microsoft.com |
| Pseudocode | Yes | Algorithm 1 The training algorithm for the contrastive context prediction task. |
| Open Source Code | No | The paper does not provide an explicit statement of code release or a link to the source code for the described methodology. |
| Open Datasets | Yes | To better evaluate the performance on massive languages, we adopt the Tatoeba corpus introduced by [Artetxe and Schwenk, 2019]. It consists of 1,000 English-centric sentence pairs for 112 languages... We adopt the XOR-QA [Asai et al., 2020] and Mr. TYDI [Zhang et al., 2021] datasets to evaluate our method in the two settings. Both datasets are constructed from TYDI, a question answering dataset covering eleven typologically diverse languages. |
| Dataset Splits | No | The paper mentions training on Natural Questions data, testing on the XOR-QA and Mr. TYDI datasets, and evaluating on Tatoeba. Although these datasets have standard splits, the paper does not explicitly state the train/validation/test splits used for its model training and evaluation (e.g., percentages, sample counts, or references to the standard splits). |
| Hardware Specification | Yes | This costs 7 days on 16 V100 GPUs for the CCP model. We train the model on 8 NVIDIA Tesla V100 GPUs (with 32 GB RAM). |
| Software Dependencies | No | The paper mentions the Adam and AdamW optimizers but does not specify version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, CUDA). |
| Experiment Setup | Yes | Our CCP model has 1024 hidden units, 16 attention heads and 24 layers in the encoder. ... We first initialize the CCP model with XLM-R [Conneau et al., 2020], and then run continued pre-training with an accumulated batch size of 2,048 (via gradient accumulation) and a memory bank of 32,768. ... We use the Adam optimizer with a linear warm-up and set the learning rate to 3e-5. We train the model on 8 NVIDIA Tesla V100 GPUs (with 32 GB RAM). We use the AdamW optimizer with a learning rate of 1e-5. The model is trained for up to 20 epochs with a mini-batch size of 48. The remaining hyper-parameters are the same as in DPR [Karpukhin et al., 2020]. |
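
The pseudocode row above refers to Algorithm 1, the training procedure for the contrastive context prediction (CCP) task, and the research-type row credits a language-specific memory bank and an asymmetric batch normalization operation with preventing collapse and information leakage. The PyTorch sketch below shows what one such training step could look like; the class and function names (`CCPHead`, `ccp_step`), the choice to apply batch normalization only on the anchor side, and the FIFO memory-bank update are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of one contrastive context prediction (CCP) training step.
# The asymmetric placement of batch normalization and the memory-bank update
# rule are assumptions made for illustration.
import torch
import torch.nn.functional as F
from torch import nn


class CCPHead(nn.Module):
    """Projection head; batch normalization is applied only on the anchor side
    (an assumed reading of the paper's 'asymmetric batch normalization')."""

    def __init__(self, hidden: int = 1024):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)
        self.bn = nn.BatchNorm1d(hidden, affine=False)

    def forward(self, cls_vec: torch.Tensor, use_bn: bool) -> torch.Tensor:
        z = self.proj(cls_vec)
        return self.bn(z) if use_bn else z


def ccp_step(anchor_cls, context_cls, head, memory_bank, temperature=0.05):
    """One InfoNCE step: sentences from the same document context are positives;
    in-batch contexts plus a language-specific memory bank supply negatives."""
    q = F.normalize(head(anchor_cls, use_bn=True), dim=-1)    # anchor side (with BN)
    k = F.normalize(head(context_cls, use_bn=False), dim=-1)  # context side (no BN)
    candidates = torch.cat([k, memory_bank], dim=0)           # (B + M, H)
    logits = q @ candidates.t() / temperature                 # (B, B + M)
    labels = torch.arange(q.size(0), device=q.device)         # positives on the diagonal
    loss = F.cross_entropy(logits, labels)
    # FIFO update of the same-language memory bank with detached, normalized keys.
    memory_bank = torch.cat([k.detach(), memory_bank], dim=0)[: memory_bank.size(0)]
    return loss, memory_bank


# Usage with random stand-ins for XLM-R [CLS] vectors of a single language.
head = CCPHead(hidden=1024)
bank = F.normalize(torch.randn(32768, 1024), dim=-1)
loss, bank = ccp_step(torch.randn(48, 1024), torch.randn(48, 1024), head, bank)
```

Under this reading, normalizing only one side keeps the two encodings from matching each other through shared batch statistics, which is one plausible way an asymmetric operation could block information leakage; consult the paper for the exact placement.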
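
For quick reference, the hyper-parameters quoted in the experiment-setup and hardware rows can also be collected into a plain configuration; the grouping and key names below are editorial, while the values are transcribed from the paper.

```python
# Hyper-parameters transcribed from the table above; key names are illustrative.
CCP_PRETRAINING = {
    "encoder": {"hidden_units": 1024, "attention_heads": 16, "layers": 24},
    "initialization": "XLM-R",
    "batch_size": 2048,              # reached via gradient accumulation
    "memory_bank_size": 32768,
    "optimizer": "Adam",
    "learning_rate": 3e-5,           # with linear warm-up
    "hardware": "16x V100 for about 7 days",
}

RETRIEVAL_FINETUNING = {
    "optimizer": "AdamW",
    "learning_rate": 1e-5,
    "epochs": 20,
    "mini_batch_size": 48,
    "hardware": "8x NVIDIA Tesla V100 (32 GB)",
    "other": "remaining hyper-parameters follow DPR (Karpukhin et al., 2020)",
}
```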