TriSampler: A Better Negative Sampling Principle for Dense Retrieval

Authors: Zhen Yang, Zhou Shao, Yuxiao Dong, Jie Tang

AAAI 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental evaluations show that TriSampler consistently attains superior retrieval performance across a diverse set of representative retrieval models. ... Experiments
Researcher Affiliation Academia Department of Computer Science and Technology, Tsinghua University, Beijing, China
Pseudocode Yes Algorithm 1: Algorithm of TriSampler
Open Source Code No The paper does not provide any statement or link indicating that the source code for the methodology is openly available.
Open Datasets Yes We conduct experiments on the first retrieval stage of four benchmarks: three passage retrieval datasets, MS MARCO passage (MS Pas) (Nguyen et al. 2016), Natural Questions (NQ) (Kwiatkowski et al. 2019), and TriviaQA (TQA) (Joshi et al. 2017), and one document retrieval dataset, MS MARCO document (MS Doc) (Nguyen et al. 2016).
Dataset Splits Yes
Datasets  Training  Dev     Test    Documents
NQ        58,880    8,757   3,610   21,015,324
TQA       60,413    8,837   11,313  21,015,324
MS Pas    502,939   6,980   --      8,841,823
MS Doc    367,013   5,193   --      3,213,835
Hardware Specification Yes We implement TriSampler based on the SOTA dense retrieval model AR2 (Zhang et al. 2021) and run all experiments on 8 NVIDIA Tesla A100 GPUs.
Software Dependencies No The paper mentions 'ERNIE-2.0-base' and 'Faiss' but does not specify their version numbers or any other software dependencies with version information.
Experiment Setup Yes In our experiments, the ratio of positive to negative pairs is set to 1 : 15, the inner product is leveraged to estimate the relevance score, and Faiss (Johnson, Douze, and Jégou 2019) is adopted for efficient similarity search. We utilize the top-200 passages for the NQ and TQA datasets and the top-400 documents for the MS Pas and MS Doc datasets as negative candidates.
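As an illustration of the setup quoted above, the following is a minimal sketch of inner-product candidate retrieval with Faiss followed by sampling at the stated 1 : 15 positive-to-negative ratio. The corpus and query sizes, the random vectors (standing in for ERNIE-2.0-base encodings), and the sample_negatives helper are illustrative assumptions, not the paper's code.

```python
import numpy as np
import faiss  # similarity search library cited in the paper

# Hypothetical sizes; the paper's corpora hold millions of documents.
dim, n_docs, n_queries = 768, 10_000, 32

rng = np.random.default_rng(0)
doc_embs = rng.standard_normal((n_docs, dim)).astype("float32")
query_embs = rng.standard_normal((n_queries, dim)).astype("float32")

# Inner product is used as the relevance score, so build an IP index.
index = faiss.IndexFlatIP(dim)
index.add(doc_embs)

# Top-200 candidates per query (the paper uses top-400 for MS Pas/MS Doc).
top_k = 200
scores, cand_ids = index.search(query_embs, top_k)

# Draw 15 negatives per positive (1 : 15 ratio), skipping any candidate
# that happens to be the labeled positive. Uniform sampling here is an
# assumption for illustration only.
def sample_negatives(cands, positive_id, n_neg=15, rng=rng):
    pool = cands[cands != positive_id]
    return rng.choice(pool, size=n_neg, replace=False)

negatives = sample_negatives(cand_ids[0], positive_id=cand_ids[0][0])
print(negatives.shape)  # (15,)
```

The paper's Algorithm 1 replaces the uniform draw above with TriSampler's own selection principle; the sketch only reproduces the candidate-pool sizes, scoring function, and sampling ratio quoted in the setup.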