Benchmarking and Building Long-Context Retrieval Models with LoCo and M2-BERT
Authors: Jon Saad-Falcon, Daniel Y. Fu, Simran Arora, Neel Guha, Christopher Ré
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we validate the M2-BERT retrieval encoder on LoCo V1, finding that it outperforms competitive Transformer-based models by at least 22.2 points, despite containing 90× fewer parameters. |
| Researcher Affiliation | Academia | Stanford University, Computer Science, Stanford, CA. |
| Pseudocode | No | The paper describes algorithms (MNRL, OPL, PL) using mathematical notation and descriptive text, but it does not include structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The M2-BERT code and LoCo V1 datasets are publicly available on GitHub and Hugging Face, respectively. |
| Open Datasets | Yes | The M2-BERT code and LoCo V1 datasets are publicly available on GitHub and Hugging Face, respectively. |
| Dataset Splits | Yes | For training evaluation, we use the C4 validation set with an MLM probability of 0.15. |
| Hardware Specification | Yes | For all our efficiency experiments, we run each of the models on a single A100 GPU with 80GB of memory, running CUDA 11.7, Python 3.10, and PyTorch 1.13.1 (Paszke et al., 2019). |
| Software Dependencies | Yes | For all our efficiency experiments, we run each of the models on a single A100 GPU with 80GB of memory, running CUDA 11.7, Python 3.10, and PyTorch 1.13.1 (Paszke et al., 2019). |
| Experiment Setup | Yes | For pretraining the M2-BERT encoders, we use the C4, Wikipedia, and Bookcorpus datasets for training examples. For our dataset split, we sample each dataset equally (e.g. 33% each). For our example length ratio, we selected 0.3 variable length examples (e.g. short examples) and 0.7 maximum concatenated examples (e.g. long examples). We utilize the masked-language modeling (MLM) pretraining objective with an MLM probability of 0.3 to prepare the encoders for downstream language modeling. For training evaluation, we use the C4 validation set with an MLM probability of 0.15. For our scheduler, we use linear decay with warmup, where warmup is 0.06 of the total training duration. For our optimizer, we use a learning rate of 5.0e-4 with an epsilon of 1e-06, betas of 0.9 and 0.98, and a weight decay of 1e-5. For fine-tuning the M2-BERT encoders, we use the Sentence Transformers library (Reimers & Gurevych, 2019). For all M2-BERT configurations, we use a learning rate of 5e-6, a true batch size of 32, 1 epoch of fine-tuning, a maximum gradient norm of 1.0, and a ratio of 32 negative passages per query-positive passage pair. (See the configuration sketch below the table.) |
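The fine-tuning recipe quoted in the Experiment Setup row maps closely onto the Sentence Transformers training API. The sketch below is a minimal reconstruction under stated assumptions, not the authors' released code: the checkpoint path and the `load_query_passage_pairs()` helper are hypothetical placeholders, and the warmup fraction is borrowed from the pretraining recipe since the fine-tuning warmup is not quoted.

```python
# Minimal sketch of the quoted fine-tuning setup using the
# Sentence Transformers library (Reimers & Gurevych, 2019).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical checkpoint path; substitute the released M2-BERT retrieval weights.
model = SentenceTransformer("path/to/m2-bert-retrieval-encoder")

# Each InputExample pairs a query with a positive passage; with an in-batch
# negatives loss, the other passages in the batch act as negatives.
train_examples = [
    InputExample(texts=[query, positive_passage])
    for query, positive_passage in load_query_passage_pairs()  # hypothetical loader
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# Multiple Negatives Ranking Loss (MNRL), one of the losses named in the paper.
train_loss = losses.MultipleNegativesRankingLoss(model)

# Hyperparameters from the Experiment Setup row:
# lr 5e-6, true batch size 32, 1 epoch, max gradient norm 1.0.
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    optimizer_params={"lr": 5e-6},
    max_grad_norm=1.0,
    scheduler="warmuplinear",
    # Assumption: 6% warmup carried over from the pretraining schedule.
    warmup_steps=int(0.06 * len(train_dataloader)),
    output_path="m2-bert-loco-finetuned",
)
```

With MNRL and a batch size of 32, each query effectively sees the other in-batch passages as negatives, which is roughly consistent with the quoted 32 negatives per query-positive pair; the paper's exact negative-sampling scheme is not reconstructed here.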