Efficient Training of Retrieval Models using Negative Cache

Authors: Erik Lindgren, Sashank Reddi, Ruiqi Guo, Sanjiv Kumar

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we experimentally validate our approach and show that it is efficient and compares favorably with more complex, state-of-the-art approaches.
Researcher Affiliation | Industry | Erik M. Lindgren, Google Research, New York (erikml@google.com); Sashank Reddi, Google Research, New York (sashank@google.com); Ruiqi Guo, Google Research, New York (guorq@google.com); Sanjiv Kumar, Google Research, New York (sanjivk@google.com)
Pseudocode | Yes | Algorithm 1 Cached Gumbel-Max Gradient Descent (see the sketch below the table)
Open Source Code | No | The paper states 'Model obtained from https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4.', which refers to a third-party pretrained model, not to the authors' implementation of their proposed method. There is no explicit statement about, or link to, their code.
Open Datasets | Yes | We analyse the performance of our approach on the MS MARCO passage retrieval task [3] and the TREC 2019 passage retrieval task [8].
Dataset Splits | No | The paper mentions training for '250,000 steps' and evaluating on the 'development set of MS MARCO passage retrieval task', but does not provide explicit percentages or counts for training, validation, or test splits.
Hardware Specification | Yes | Our experiments use 8 V2 Cloud TPUs. Each replica on the TPU has 8GB memory, for a total of 64GB memory.
Software Dependencies | No | The paper mentions starting with a 'pretrained BERT-base model' and using the 'Adam optimizer', but does not specify versions for the programming languages, libraries, or other software components used in the implementation.
Experiment Setup | Yes | We use a global batch size of 8, the Adam optimizer [23], and we train for 250,000 steps. We normalize the output embeddings and have a trainable parameter β scale the scores. We use a learning rate of 1 × 10^-5 for all experiments except when we train with a cache of 2 million elements, where we use a learning rate of 5 × 10^-5.
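The pseudocode row above refers to the paper's Algorithm 1 (Cached Gumbel-Max Gradient Descent), which is not reproduced on this page. Below is a minimal illustrative sketch, not the authors' implementation, of two mechanisms quoted in the table: scoring with L2-normalized embeddings scaled by a trainable parameter β, and Gumbel-max sampling of a negative document from a cache of embeddings. The cache size, the β value, and the helper names are placeholder assumptions; only the 768-dimensional BERT-base embedding size comes from the paper's model reference.

```python
import numpy as np

def scaled_scores(query_emb, cached_doc_embs, beta):
    """Dot-product scores between L2-normalized embeddings, scaled by beta.

    The paper normalizes output embeddings and scales scores with a trainable
    parameter beta; here beta is a plain float purely for illustration.
    """
    q = query_emb / np.linalg.norm(query_emb)
    d = cached_doc_embs / np.linalg.norm(cached_doc_embs, axis=1, keepdims=True)
    return beta * (d @ q)

def gumbel_max_sample(scores, rng):
    """Sample index i with probability softmax(scores)[i] via the Gumbel-max
    trick: add independent Gumbel(0, 1) noise to each score and take argmax."""
    return int(np.argmax(scores + rng.gumbel(size=scores.shape)))

# Toy usage with placeholder sizes: sample one negative from a cache of
# document embeddings for a single query embedding.
rng = np.random.default_rng(0)
cache = rng.normal(size=(1024, 768))   # cached document embeddings (placeholder cache size)
query = rng.normal(size=768)           # query embedding
negative_idx = gumbel_max_sample(scaled_scores(query, cache, beta=20.0), rng)
```

In the paper, negatives sampled this way from a streaming cache feed an approximation of the full softmax cross-entropy gradient; the sketch above shows only the sampling step, under the stated assumptions.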