LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval

Authors: Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On the ad-hoc retrieval benchmark, MS-Marco, LexMAE achieves 42.6% MRR@10 with 45.8 QPS for the passage dataset and 44.4% MRR@100 with 134.8 QPS for the document dataset, on a CPU machine. LexMAE also shows state-of-the-art zero-shot transfer capability on the BEIR benchmark with 12 datasets.
Researcher Affiliation | Industry | Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang; Microsoft. {shentao,xigeng,chotao,caxu,xiaolhu,binxjia,linjya,djiang}@microsoft.com
Pseudocode | No | The paper contains an illustration (Figure 1) and mathematical formulas (e.g., Equation 16), but no clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | We released our codes and models at https://github.com/taoshen58/LexMAE.
Open Datasets | Yes | Following Formal et al. (2021a), we first employ the widely-used passage retrieval dataset, MS-Marco (Nguyen et al., 2016)... Besides, we evaluate the zero-shot transferability of our model on the BEIR benchmark (Thakur et al., 2021).
Dataset Splits | Yes | We pre-train on the MS-Marco collection (Nguyen et al., 2016)... We report MRR@10 (M@10) and Recall@1/50/100/1K for MS-Marco Dev (passage)... In the first stage, we sample negatives for each query q within the top-K1 document candidates from the BM25 retrieval system... Then, we sample the hard negatives N(hn1) for each query q within the top-K2 candidates based on the relevance scores... Lastly, we further sample hard negatives N(hn2) for each query q within the top-K3 candidates by the 2nd-stage retriever. (A sketch of this staged negative sampling appears after the table.)
Hardware Specification | Yes | The pre-training is completed on 8 A100 GPUs within 14h. In contrast to Wang et al. (2022), who use 4 GPUs for fine-tuning, we limited all fine-tuning experiments to one A100 GPU.
Software Dependencies | No | The paper mentions initialization from BERT-base (Devlin et al., 2019) and the use of Anserini (Yang et al., 2017), but does not provide specific version numbers for the software dependencies or libraries used in the experiments.
Experiment Setup | Yes | The batch size is 2048, the max length is 144, the learning rate is 3 × 10^-4, the number of training steps is 80k, the masking percentage (α%) of the encoder is 30%, and that of the decoder ((α + β)%) is 50%. Meanwhile, the random seed is always 42... For fine-tuning, the learning rate is set to 2 × 10^-5 following Shen et al. (2022), and the number of training epochs is set to 3... The batch size (w.r.t. the number of queries) is set to 24 with 1 positive and 15 negative documents... λ1 = 0.002, λ2 = 0.008, λ3 = 0.008. (These values are collected into a configuration sketch after the table.)
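
The staged hard-negative sampling quoted in the Dataset Splits row follows one pattern at every stage: sample negatives for a query from the top-K candidates returned by the previous-stage retriever while excluding known positives. The sketch below is a minimal illustration of that pattern only; the names `sample_negatives`, `bm25_topk`, `stage1_topk`, `stage2_topk`, and the K1/K2/K3 placeholders are assumptions for illustration, not the authors' implementation.

```python
import random

def sample_negatives(query_id, positives, ranked_candidates, top_k, num_negatives=15):
    """Sample negatives for one query from the top-`top_k` ranked candidates,
    excluding documents known to be relevant (positives)."""
    pool = [doc_id for doc_id in ranked_candidates[query_id][:top_k]
            if doc_id not in positives[query_id]]
    return random.sample(pool, min(num_negatives, len(pool)))

# Stage 1: negatives drawn from the top-K1 BM25 candidates.
#   neg_bm25 = sample_negatives(q, positives, bm25_topk, K1)
# Stage 2: hard negatives N(hn1) drawn from the top-K2 candidates ranked by
#   the 1st-stage retriever's relevance scores.
#   neg_hn1 = sample_negatives(q, positives, stage1_topk, K2)
# Stage 3: hard negatives N(hn2) drawn from the top-K3 candidates of the
#   2nd-stage retriever.
#   neg_hn2 = sample_negatives(q, positives, stage2_topk, K3)
```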
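
For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected in one place. The dictionaries below are a sketch for readability: the key names are assumptions, while the values are copied from the quoted setup.

```python
# Pre-training hyperparameters quoted from the paper (key names are illustrative).
PRETRAIN_CONFIG = {
    "batch_size": 2048,
    "max_seq_length": 144,
    "learning_rate": 3e-4,
    "train_steps": 80_000,
    "encoder_mask_ratio": 0.30,   # alpha%: masking percentage of the encoder
    "decoder_mask_ratio": 0.50,   # (alpha + beta)%: masking percentage of the decoder
    "seed": 42,
}

# Fine-tuning hyperparameters quoted from the paper.
FINETUNE_CONFIG = {
    "learning_rate": 2e-5,
    "epochs": 3,
    "queries_per_batch": 24,
    "positives_per_query": 1,
    "negatives_per_query": 15,
    "lambda_1": 0.002,
    "lambda_2": 0.008,
    "lambda_3": 0.008,
}
```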