Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT
Authors: Jon Saad-Falcon, Daniel Y. Fu, Simran Arora, Neel Guha, Christopher Ré
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we validate the M2-BERT retrieval encoder on LoCo V1, finding that it outperforms competitive Transformer-based models by at least 22.2 points, despite containing 90× fewer parameters. |
| Researcher Affiliation | Academia | Stanford University, Computer Science, Stanford, CA. |
| Pseudocode | No | The paper describes algorithms (MNRL, OPL, PL) using mathematical notation and descriptive text, but it does not include structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The M2-BERT code and LoCo V1 datasets are publicly available on GitHub and Hugging Face, respectively. |
| Open Datasets | Yes | The M2-BERT code and LoCo V1 datasets are publicly available on GitHub and Hugging Face, respectively. |
| Dataset Splits | Yes | For training evaluation, we use the C4 validation set with an MLM probability of 0.15. |
| Hardware Specification | Yes | For all our efficiency experiments, we run each of the models on a single A100 GPU with 80GB of memory, running CUDA 11.7, Python 3.10, and PyTorch 1.13.1 (Paszke et al., 2019). |
| Software Dependencies | Yes | For all our efficiency experiments, we run each of the models on a single A100 GPU with 80GB of memory, running CUDA 11.7, Python 3.10, and PyTorch 1.13.1 (Paszke et al., 2019). |
| Experiment Setup | Yes | For pretraining the M2-BERT encoders, we use the C4, Wikipedia, and Bookcorpus datasets for training examples. For our dataset split, we sample each dataset equally (e.g. 33% each). For our example length ratio, we selected 0.3 variable length examples (e.g. short examples) and 0.7 maximum concatenated examples (e.g. long examples). We utilize the masked-language modeling (MLM) pretraining objective with an MLM probability of 0.3 to prepare the encoders for downstream language modeling. For training evaluation, we use the C4 validation set with an MLM probability of 0.15. For our scheduler, we use linear decay with warmup, where warmup is 0.06 of the total training duration. For our optimizer, we use a learning rate of 5.0e-4 with an epsilon of 1e-06, betas of 0.9 and 0.98, and a weight decay of 1e-5. For fine-tuning the M2-BERT encoders, we use the Sentence Transformers library (Reimers & Gurevych, 2019). For all M2-BERT configurations, we use a learning rate of 5e-6, a true batch size of 32, 1 epoch of fine-tuning, a maximum gradient norm of 1.0, and a ratio of 32 negative passages per query-positive passage pair. (See the configuration sketch below the table.) |
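The fine-tuning recipe quoted in the Experiment Setup row maps closely onto the Sentence Transformers training API. The sketch below is a minimal reconstruction under stated assumptions, not the authors' released code: the checkpoint path and the `load_query_passage_pairs()` helper are hypothetical placeholders, and the warmup fraction is borrowed from the pretraining recipe since the fine-tuning warmup is not quoted.

```python
# Minimal sketch of the quoted fine-tuning setup using the
# Sentence Transformers library (Reimers & Gurevych, 2019).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical checkpoint path; substitute the released M2-BERT retrieval weights.
model = SentenceTransformer("path/to/m2-bert-retrieval-encoder")

# Each InputExample pairs a query with a positive passage; with an in-batch
# negatives loss, the other passages in the batch act as negatives.
train_examples = [
    InputExample(texts=[query, positive_passage])
    for query, positive_passage in load_query_passage_pairs()  # hypothetical loader
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Multiple Negatives Ranking Loss (MNRL), one of the losses named in the paper.
train_loss = losses.MultipleNegativesRankingLoss(model)

# Hyperparameters from the Experiment Setup row:
# lr 5e-6, true batch size 32, 1 epoch, max gradient norm 1.0.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    optimizer_params={"lr": 5e-6},
    max_grad_norm=1.0,
    scheduler="warmuplinear",
    # Assumption: 6% warmup carried over from the pretraining schedule.
    warmup_steps=int(0.06 * len(train_dataloader)),
    output_path="m2-bert-loco-finetuned",
)
```

With MNRL and a batch size of 32, each query effectively sees the other in-batch passages as negatives, which is roughly consistent with the quoted 32 negatives per query-positive pair; the paper's exact negative-sampling scheme is not reconstructed here.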