ConTextual Masked Auto-Encoder for Dense Passage Retrieval
Authors: Xing Wu, Guangyuan Ma, Meng Lin, Zijia Lin, Zhongyuan Wang, Songlin Hu
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on large-scale passage retrieval benchmarks and show considerable improvements over strong baselines, demonstrating the high efficiency of CoT-MAE. |
| Researcher Affiliation | Collaboration | Xing Wu (1,2,3)*, Guangyuan Ma (1,2)*, Meng Lin (1,2), Zijia Lin (3), Zhongyuan Wang (3), Songlin Hu (1,2). 1: Institute of Information Engineering, Chinese Academy of Sciences; 2: School of Cyber Security, University of Chinese Academy of Sciences; 3: Kuaishou Technology |
| Pseudocode | No | The paper describes the methodology using text and mathematical formulas but does not provide pseudocode or an algorithm block. |
| Open Source Code | Yes | Our code is available at https://github.com/caskcsg/ir/tree/main/cotmae. |
| Open Datasets | Yes | We fine-tune the pre-trained CoT-MAE on MS-MARCO passage ranking (Nguyen et al. 2016), Natural Questions (Kwiatkowski et al. 2019) and TREC Deep Learning (DL) Track 2020 (Craswell et al. 2020) tasks for evaluation. Following coCondenser (Gao and Callan 2021b), we use the MS-MARCO corpus released in (Qu et al. 2020); following RocketQA (Qu et al. 2020), we use the NQ version created by DPR (Karpukhin et al. 2020). |
| Dataset Splits | No | The paper mentions using a widely adopted evaluation pipeline (Tevatron) and details pre-training steps and batch sizes, but does not explicitly describe train/validation/test splits, their percentages, or sample counts. It refers to specific versions of the datasets but not to how they were partitioned, which limits reproducibility of the splits. |
| Hardware Specification | Yes | We train for 4 days with a global batch size of 1024 on 8 Tesla A100 GPUs. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies; it only mentions general tools such as NLTK and implicitly relies on frameworks like BERT and PyTorch. |
| Experiment Setup | Yes | We pre-train up to 1200k steps using the AdamW optimizer, with a learning rate of 1e-4, and a linear schedule with warmup ratio 0.1. We train for 4 days with a global batch size of 1024 on 8 Tesla A100 GPUs. For fine-tuning, the similarity of a query-passage pair $\langle q, p \rangle$ is defined as an inner product: $s(q, p) = f_q(q) \cdot f_p(p)$. Query and passage encoders are fine-tuned on the retrieval task's training corpus with a contrastive loss: $\mathcal{L} = -\log \frac{\exp(s(q, p^+))}{\exp(s(q, p^+)) + \sum_{l} \exp(s(q, p_l^-))}$ (see the sketch below the table). |
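
The fine-tuning objective quoted in the last row is a standard contrastive loss over an inner-product similarity. Below is a minimal PyTorch sketch of that loss, assuming in-batch negatives and plain dense encoder outputs; the function name `contrastive_loss` and the use of `F.cross_entropy` are illustrative choices, not taken from the paper's released code.

```python
# Minimal sketch of the contrastive fine-tuning loss described above.
# Assumptions (not from the paper): negatives are the other passages in the
# batch, encoders return plain dense vectors, and no temperature scaling is used.
import torch
import torch.nn.functional as F


def contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """q_emb: [B, d] query embeddings; p_emb: [B, d] passage embeddings,
    where p_emb[i] is the positive passage for q_emb[i]."""
    # s(q, p) = f_q(q) . f_p(p): inner-product similarity for every pair in the batch
    scores = q_emb @ p_emb.T                                   # [B, B]
    # Row i's positive is column i; all other columns act as negatives p_l^-
    targets = torch.arange(q_emb.size(0), device=q_emb.device)
    # Cross-entropy gives -log( exp(s(q, p+)) / sum_j exp(s(q, p_j)) ),
    # i.e. the contrastive loss L from the table above
    return F.cross_entropy(scores, targets)


# Usage with hypothetical encoders f_q and f_p (e.g. BERT [CLS] vectors):
# loss = contrastive_loss(f_q(queries), f_p(passages))
```

Using `F.cross_entropy` over the full score matrix is the common way to implement this objective efficiently; the paper's actual training code (linked in the Open Source Code row) should be consulted for the exact negative-sampling strategy.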