Efficient Training of Retrieval Models using Negative Cache
Authors: Erik Lindgren, Sashank Reddi, Ruiqi Guo, Sanjiv Kumar
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we experimentally validate our approach and show that it is efficient and compares favorably with more complex, state-of-the-art approaches. |
| Researcher Affiliation | Industry | Erik M. Lindgren Google Research, New York erikml@google.com Sashank Reddi Google Research, New York sashank@google.com Ruiqi Guo Google Research, New York guorq@google.com Sanjiv Kumar Google Research, New York sanjivk@google.com |
| Pseudocode | Yes | Algorithm 1 Cached Gumbel-Max Gradient Descent (a hedged sketch of the Gumbel-Max sampling step appears after this table) |
| Open Source Code | No | The paper states: 'Model obtained from https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4.', but this refers to a third-party pretrained model, not to an implementation of the authors' proposed method. No statement of or link to their own code is provided. |
| Open Datasets | Yes | We analyse the performance of our approach on the MS MARCO passage retrieval task [3] and the TREC 2019 passage retrieval task [8]. |
| Dataset Splits | No | The paper mentions training for '250,000 steps' and evaluating on the 'development set of MS MARCO passage retrieval task' but does not provide explicit numerical percentages or counts for training, validation, or test splits. |
| Hardware Specification | Yes | Our experiments use 8 V2 Cloud TPUs. Each replica on the TPU has 8GB memory, for a total of 64GB memory. |
| Software Dependencies | No | The paper mentions starting with a 'pretrained BERT-base model' and using the 'Adam optimizer' but does not specify versions for programming languages, libraries, or other software components used in their implementation. |
| Experiment Setup | Yes | We use a global batch size of 8, the Adam optimizer [23], and we train for 250,000 steps. We normalize the output embeddings and have a trainable parameter β scale the scores. We use a learning rate of 1 × 10⁻⁵ for all experiments except when we train with a cache of 2 million elements, where we use a learning rate of 5 × 10⁻⁵. (This scoring setup is sketched after the table.) |
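
The pseudocode row names Algorithm 1, Cached Gumbel-Max Gradient Descent. As a point of reference only, the sketch below shows the generic Gumbel-Max trick for sampling a negative from a cache of document embeddings in proportion to their softmax probabilities; the cache size, embedding dimension, function names, and the β value are illustrative assumptions, not the authors' implementation (which is not released).

```python
import numpy as np

def gumbel_max_sample(scores, rng):
    """Sample index i with probability softmax(scores)[i] via the Gumbel-Max trick."""
    gumbel = -np.log(-np.log(rng.uniform(size=scores.shape)))
    return int(np.argmax(scores + gumbel))

# Illustrative use: draw one negative from a cache of (possibly stale) document embeddings.
rng = np.random.default_rng(0)
cache = rng.normal(size=(1000, 768))
cache /= np.linalg.norm(cache, axis=1, keepdims=True)  # L2-normalize cached embeddings
query = rng.normal(size=768)
query /= np.linalg.norm(query)                         # L2-normalize the query embedding
beta = 20.0                                            # score scale (trainable in the paper's setup)
scores = beta * cache @ query                          # query similarity to every cached item
negative_idx = gumbel_max_sample(scores, rng)          # index of the sampled negative
```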
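
The experiment-setup row describes scoring with L2-normalized embeddings, a trainable scale β, and Adam at a learning rate of 1 × 10⁻⁵. A minimal sketch of that scoring and optimizer configuration, assuming a generic TensorFlow dual-encoder, is given below; the variable names and the initial β value are assumptions rather than details from the paper.

```python
import tensorflow as tf

beta = tf.Variable(20.0, name="score_scale")              # trainable scale; initial value is an assumption
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)  # 5e-5 reported for the 2-million-element cache runs

def score(query_emb, doc_emb):
    """Scaled dot product of L2-normalized query and document embeddings."""
    q = tf.nn.l2_normalize(query_emb, axis=-1)
    d = tf.nn.l2_normalize(doc_emb, axis=-1)
    return beta * tf.matmul(q, d, transpose_b=True)
```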