Efficient Training of Retrieval Models using Negative Cache
Authors: Erik Lindgren, Sashank Reddi, Ruiqi Guo, Sanjiv Kumar
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we experimentally validate our approach and show that it is efficient and compares favorably with more complex, state-of-the-art approaches. |
| Researcher Affiliation | Industry | Erik M. Lindgren Google Research, New York erikml@google.com Sashank Reddi Google Research, New York sashank@google.com Ruiqi Guo Google Research, New York guorq@google.com Sanjiv Kumar Google Research, New York sanjivk@google.com |
| Pseudocode | Yes | Algorithm 1 Cached Gumbel-Max Gradient Descent (a hedged sketch of the Gumbel-Max sampling step appears after this table) |
| Open Source Code | No | The paper states: 'Model obtained from https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4.', but this refers to a third-party pretrained model, not to an implementation of the authors' proposed method. No statement of or link to their own code is provided. |
| Open Datasets | Yes | We analyse the performance of our approach on the MS MARCO passage retrieval task [3] and the TREC 2019 passage retrieval task [8]. |
| Dataset Splits | No | The paper mentions training for '250,000 steps' and evaluating on the 'development set of MS MARCO passage retrieval task' but does not provide explicit numerical percentages or counts for training, validation, or test splits. |
| Hardware Specification | Yes | Our experiments use 8 V2 Cloud TPUs. Each replica on the TPU has 8GB memory, for a total of 64GB memory. |
| Software Dependencies | No | The paper mentions starting with a 'pretrained BERT-base model' and using the 'Adam optimizer' but does not specify versions for programming languages, libraries, or other software components used in their implementation. |
| Experiment Setup | Yes | We use a global batch size of 8, the Adam optimizer [23], and we train for 250,000 steps. We normalize the output embeddings and have a trainable parameter β scale the scores. We use a learning rate of 1 × 10⁻⁵ for all experiments except when we train with a cache of 2 million elements, where we use a learning rate of 5 × 10⁻⁵. (This scoring setup is sketched after the table.) |
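
The pseudocode row names Algorithm 1, Cached Gumbel-Max Gradient Descent. As a point of reference only, the sketch below shows the generic Gumbel-Max trick for sampling a negative from a cache of document embeddings in proportion to their softmax probabilities; the cache size, embedding dimension, function names, and the β value are illustrative assumptions, not the authors' implementation (which is not released).

```python
import numpy as np

def gumbel_max_sample(scores, rng):
    """Sample index i with probability softmax(scores)[i] via the Gumbel-Max trick."""
    gumbel = -np.log(-np.log(rng.uniform(size=scores.shape)))
    return int(np.argmax(scores + gumbel))

# Illustrative use: draw one negative from a cache of (possibly stale) document embeddings.
rng = np.random.default_rng(0)
cache = rng.normal(size=(1000, 768))
cache /= np.linalg.norm(cache, axis=1, keepdims=True)  # L2-normalize cached embeddings
query = rng.normal(size=768)
query /= np.linalg.norm(query)                         # L2-normalize the query embedding
beta = 20.0                                            # score scale (trainable in the paper's setup)
scores = beta * cache @ query                          # query similarity to every cached item
negative_idx = gumbel_max_sample(scores, rng)          # index of the sampled negative
```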
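
The experiment-setup row describes scoring with L2-normalized embeddings, a trainable scale β, and Adam at a learning rate of 1 × 10⁻⁵. A minimal sketch of that scoring and optimizer configuration, assuming a generic TensorFlow dual-encoder, is given below; the variable names and the initial β value are assumptions rather than details from the paper.

```python
import tensorflow as tf

beta = tf.Variable(20.0, name="score_scale")              # trainable scale; initial value is an assumption
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)  # 5e-5 reported for the 2-million-element cache runs

def score(query_emb, doc_emb):
    """Scaled dot product of L2-normalized query and document embeddings."""
    q = tf.nn.l2_normalize(query_emb, axis=-1)
    d = tf.nn.l2_normalize(doc_emb, axis=-1)
    return beta * tf.matmul(q, d, transpose_b=True)
```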