In defense of dual-encoders for neural ranking

Authors: Aditya Menon, Sadeep Jayasumana, Ankit Singh Rawat, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, we establish theoretically that with a sufficiently large encoder size, DE models can capture a broad class of scores without cross-attention. Second, we show that on real-world problems, the gap between CA and DE models may be due to the latter overfitting to the training set. To mitigate this, we propose a distillation strategy that focuses on preserving the ordering amongst documents, and confirm its efficacy on neural re-ranking benchmarks.
Researcher Affiliation | Industry | Google Research, New York, USA.
Pseudocode | No | The paper does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We present results on MSMARCO-Passage (Nguyen et al., 2016) and Natural Questions (NQ) (Kwiatkowski et al., 2019).
Dataset Splits | Yes | We train a series of BERT-based CA and DE models on the (small) triplets training set, employing 6-layer BERT models... For each model, we compute the mean reciprocal rank (MRR)@10 (Radev et al., 2002) on the provided train and dev set. (We shall refer to the dev set as the test set for simplicity.)
Hardware Specification | No | The paper mentions using "BERT-based CA and DE models" and "6-layer BERT models" but does not specify any hardware details such as GPU/CPU models, memory, or cloud instance types.
Software Dependencies | No | The paper mentions using "transformer encoders initialised with the standard pre-trained BERT model checkpoints" but does not provide specific versions for any software components, libraries, or programming languages.
Experiment Setup | Yes | We optimise all methods for a maximum of 3 × 10^5 steps using Adam with weight decay, with a batch size of 128 and a learning rate of 2.8 × 10^-5 (i.e., a 4× scaling of the choices in Hofstätter et al. (2020a)). For all models, at the output layer we apply dropout at rate 0.1 and layer normalisation. We use a sequence length of 30 for queries, and 200 for passages.
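
The Research Type row quotes the paper's proposal of a distillation strategy that preserves the ordering amongst documents when transferring from a cross-attention (CA) teacher to a dual-encoder (DE) student. The exact objective is not given in the quoted text; the sketch below shows one common way to realise an order-preserving distillation, via a temperature-softened softmax over each query's candidate documents. Function and argument names are illustrative, not the paper's.

```python
# Illustrative sketch only -- not necessarily the paper's exact loss.
# Matching softened softmax distributions over a query's candidate documents
# transfers the teacher's relative ordering rather than its raw score values
# (softmax is invariant to shifting all of a query's scores by a constant).
import torch
import torch.nn.functional as F

def ordering_distillation_loss(student_scores: torch.Tensor,
                               teacher_scores: torch.Tensor,
                               temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between per-query score distributions.

    Args:
        student_scores: [num_queries, num_docs] scores from the DE student.
        teacher_scores: [num_queries, num_docs] scores from the CA teacher.
    """
    log_p_student = F.log_softmax(student_scores / temperature, dim=-1)
    p_teacher = F.softmax(teacher_scores / temperature, dim=-1)
    # "batchmean" averages the per-query KL divergence over queries.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```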
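
The Dataset Splits row reports evaluation via MRR@10 on the provided dev set. For reference, a minimal sketch of that metric: the reciprocal rank of the first relevant passage within the top 10 results, averaged over queries. Variable names here are assumptions.

```python
# Minimal sketch of MRR@10; names are illustrative.
def mrr_at_10(rankings, relevant_sets):
    """rankings: per-query lists of ranked passage ids.
    relevant_sets: per-query sets of relevant passage ids."""
    total = 0.0
    for ranked_ids, relevant in zip(rankings, relevant_sets):
        for rank, passage_id in enumerate(ranked_ids[:10], start=1):
            if passage_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)
```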
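
The Experiment Setup row's quoted hyperparameters map onto roughly the following training configuration. Field names are assumptions for illustration; only the values come from the quoted text.

```python
from dataclasses import dataclass

# Field names are illustrative; values are taken from the quoted Experiment
# Setup row (the optimiser is described only as "Adam with weight decay").
@dataclass
class TrainConfig:
    max_steps: int = 300_000        # maximum of 3 x 10^5 optimisation steps
    batch_size: int = 128
    learning_rate: float = 2.8e-5   # 4x the choice in Hofstätter et al. (2020a)
    output_dropout: float = 0.1     # dropout + layer normalisation at the output layer
    query_seq_len: int = 30
    passage_seq_len: int = 200
```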