In defense of dual-encoders for neural ranking

Authors: Aditya Menon, Sadeep Jayasumana, Ankit Singh Rawat, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, we establish theoretically that with a sufficiently large encoder size, DE models can capture a broad class of scores without cross-attention. Second, we show that on real-world problems, the gap between CA and DE models may be due to the latter overfitting to the training set. To mitigate this, we propose a distillation strategy that focuses on preserving the ordering amongst documents, and confirm its efficacy on neural re-ranking benchmarks.
Researcher Affiliation | Industry | Google Research, New York, USA.
Pseudocode | No | The paper does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We present results on MSMARCO-Passage (Nguyen et al., 2016) and Natural Questions (NQ) (Kwiatkowski et al., 2019).
Dataset Splits | Yes | We train a series of BERT-based CA and DE models on the (small) triplets training set, employing 6-layer BERT models... For each model, we compute the mean reciprocal rank (MRR)@10 (Radev et al., 2002) on the provided train and dev set. (We shall refer to the dev set as the test set for simplicity.)
Hardware Specification | No | The paper mentions using "BERT-based CA and DE models" and "6-layer BERT models" but does not specify any hardware details such as GPU/CPU models, memory, or cloud instance types.
Software Dependencies | No | The paper mentions using "transformer encoders initialised with the standard pre-trained BERT model checkpoints" but does not provide specific versions for any software components, libraries, or programming languages.
Experiment Setup | Yes | We optimise all methods for a maximum of 3 × 10^5 steps using Adam with weight decay, with a batch size of 128 and a learning rate of 2.8 × 10^-5 (i.e., a 4× scaling of the choices in Hofstätter et al. (2020a)). For all models, at the output layer we apply dropout at rate 0.1 and layer normalisation. We use a sequence length of 30 for queries, and 200 for passages.
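
The Research Type row quotes the paper's proposal of a distillation strategy that preserves the ordering amongst documents when transferring from a cross-attention (CA) teacher to a dual-encoder (DE) student. The exact objective is not given in the quoted text; the sketch below shows one common way to realise an order-preserving distillation, via a temperature-softened softmax over each query's candidate documents. Function and argument names are illustrative, not the paper's.

```python
# Illustrative sketch only -- not necessarily the paper's exact loss.
# Matching softened softmax distributions over a query's candidate documents
# transfers the teacher's relative ordering rather than its raw score values
# (softmax is invariant to shifting all of a query's scores by a constant).
import torch
import torch.nn.functional as F

def ordering_distillation_loss(student_scores: torch.Tensor,
                               teacher_scores: torch.Tensor,
                               temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between per-query score distributions.

    Args:
        student_scores: [num_queries, num_docs] scores from the DE student.
        teacher_scores: [num_queries, num_docs] scores from the CA teacher.
    """
    log_p_student = F.log_softmax(student_scores / temperature, dim=-1)
    p_teacher = F.softmax(teacher_scores / temperature, dim=-1)
    # "batchmean" averages the per-query KL divergence over queries.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```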
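
The Dataset Splits row reports evaluation via MRR@10 on the provided dev set. For reference, a minimal sketch of that metric: the reciprocal rank of the first relevant passage within the top 10 results, averaged over queries. Variable names here are assumptions.

```python
# Minimal sketch of MRR@10; names are illustrative.
def mrr_at_10(rankings, relevant_sets):
    """rankings: per-query lists of ranked passage ids.
    relevant_sets: per-query sets of relevant passage ids."""
    total = 0.0
    for ranked_ids, relevant in zip(rankings, relevant_sets):
        for rank, passage_id in enumerate(ranked_ids[:10], start=1):
            if passage_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)
```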
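
The Experiment Setup row's quoted hyperparameters map onto roughly the following training configuration. Field names are assumptions for illustration; only the values come from the quoted text.

```python
from dataclasses import dataclass

# Field names are illustrative; values are taken from the quoted Experiment
# Setup row (the optimiser is described only as "Adam with weight decay").
@dataclass
class TrainConfig:
    max_steps: int = 300_000        # maximum of 3 x 10^5 optimisation steps
    batch_size: int = 128
    learning_rate: float = 2.8e-5   # 4x the choice in Hofstätter et al. (2020a)
    output_dropout: float = 0.1     # dropout + layer normalisation at the output layer
    query_seq_len: int = 30
    passage_seq_len: int = 200
```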