In defense of dual-encoders for neural ranking
Authors: Aditya Menon, Sadeep Jayasumana, Ankit Singh Rawat, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we establish theoretically that with a sufficiently large encoder size, DE models can capture a broad class of scores without cross-attention. Second, we show that on real-world problems, the gap between CA and DE models may be due to the latter overfitting to the training set. To mitigate this, we propose a distillation strategy that focuses on preserving the ordering amongst documents, and confirm its efficacy on neural re-ranking benchmarks. |
| Researcher Affiliation | Industry | Google Research, New York, USA. |
| Pseudocode | No | The paper does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We present results on MSMARCO-Passage (Nguyen et al., 2016) and Natural Questions (NQ) (Kwiatkowski et al., 2019). |
| Dataset Splits | Yes | We train a series of BERT-based CA and DE models on the (small) triplets training set, employing 6-layer BERT models... For each model, we compute the mean reciprocal rank (MRR)@10 (Radev et al., 2002) on the provided train and dev set. (We shall refer to the dev set as the test set for simplicity.) |
| Hardware Specification | No | The paper mentions using "BERT-based CA and DE models" and "6-layer BERT models" but does not specify any hardware details such as GPU/CPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions using "transformer encoders initialised with the standard pre-trained BERT model checkpoints" but does not provide specific versions for any software components, libraries, or programming languages. |
| Experiment Setup | Yes | We optimise all methods for a maximum of 3 × 10^5 steps using Adam with weight decay, with a batch size of 128 and a learning rate of 2.8 × 10^-5 (i.e., a 4× scaling of the choices in Hofstätter et al. (2020a)). For all models, at the output layer we apply dropout at rate 0.1 and layer normalisation. We use a sequence length of 30 for queries, and 200 for passages. |
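The Research Type row summarises the paper's proposal: a distillation strategy that transfers a cross-attention (CA) teacher's scores to a dual-encoder (DE) student while preserving the ordering amongst documents. The sketch below is a minimal, hedged illustration of that idea, not the paper's exact objective: a DE scorer based on inner products, plus a listwise KL distillation loss that matches the teacher's per-query softmax distribution over candidates (and hence emphasises relative ordering). All names, dimensions, and the temperature parameter are illustrative assumptions.

```python
# Hedged sketch: dual-encoder scoring + an order-preserving (listwise) distillation loss.
# This is one plausible instantiation of "preserving the ordering amongst documents",
# not a reproduction of the paper's exact formulation.

import torch
import torch.nn.functional as F


def dual_encoder_scores(query_emb: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    """Score candidates by inner product: query_emb [d], doc_embs [n, d] -> scores [n]."""
    return doc_embs @ query_emb


def order_preserving_distillation_loss(
    student_scores: torch.Tensor,   # [batch, n_candidates] from the DE student
    teacher_scores: torch.Tensor,   # [batch, n_candidates] from the CA teacher
    temperature: float = 1.0,       # assumption: temperature is not taken from the paper
) -> torch.Tensor:
    """KL divergence between teacher and student softmax distributions over candidates.

    Matching the per-query distribution (rather than raw score values) emphasises the
    relative ordering of documents, which is what re-ranking metrics reward.
    """
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")


if __name__ == "__main__":
    torch.manual_seed(0)
    student = torch.randn(4, 8)   # 4 queries, 8 candidate passages each
    teacher = torch.randn(4, 8)
    print(order_preserving_distillation_loss(student, teacher).item())
```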
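The Dataset Splits row quotes the evaluation metric, MRR@10. A self-contained sketch of that metric follows: for each query, take the reciprocal rank of the first relevant passage within the top 10 results (0 if none appears), then average over queries. The function name and input format are assumptions for illustration.

```python
# Minimal MRR@10 sketch (illustrative helper, not the paper's evaluation code).

from typing import Sequence, Set


def mrr_at_k(ranked_ids: Sequence[Sequence[str]],
             relevant_ids: Sequence[Set[str]],
             k: int = 10) -> float:
    """ranked_ids[i] is the ranked passage list for query i; relevant_ids[i] is its relevant set."""
    total = 0.0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        for rank, pid in enumerate(ranking[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break   # only the first relevant hit counts
    return total / len(ranked_ids)


# Example: first query's relevant passage is at rank 2, second query's at rank 1.
print(mrr_at_k([["p3", "p7", "p1"], ["p5", "p2"]], [{"p7"}, {"p5"}]))  # (0.5 + 1.0) / 2 = 0.75
```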
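The Experiment Setup row lists concrete hyperparameters. The sketch below collects them into a plain configuration dict and a PyTorch AdamW optimiser; the model is a placeholder, and the weight-decay coefficient is an assumption (the paper says "Adam with weight decay" but the quoted text does not report the decay value).

```python
# Hedged training-configuration sketch based on the values quoted above.

import torch
from torch.optim import AdamW

CONFIG = {
    "max_steps": 300_000,        # 3 x 10^5 optimisation steps
    "batch_size": 128,
    "learning_rate": 2.8e-5,     # 4x the rate used by Hofstätter et al. (2020a)
    "dropout_rate": 0.1,         # applied at the output layer, with layer normalisation
    "query_seq_len": 30,
    "passage_seq_len": 200,
    "weight_decay": 0.01,        # assumption: coefficient not reported in the quoted text
}


def build_optimizer(model: torch.nn.Module) -> AdamW:
    """Adam with decoupled weight decay, using the quoted learning rate."""
    return AdamW(model.parameters(),
                 lr=CONFIG["learning_rate"],
                 weight_decay=CONFIG["weight_decay"])
```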