Generative Retrieval Meets Multi-Graded Relevance

Authors: Yubao Tang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Wei Chen, Xueqi Cheng

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR2.
Researcher Affiliation | Academia | CAS Key Lab of Network Data Science and Technology, ICT, CAS; University of Chinese Academy of Sciences; University of Amsterdam
Pseudocode | No | The paper presents mathematical formulas and figures, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code | Yes | The NeurIPS checklist states: 'Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Section 5'.
Open Datasets | Yes | We select three widely-used multi-graded relevance datasets: Gov2 [18], ClueWeb09-B [19] and Robust04 [82]... Furthermore, we consider two binary relevance datasets: MS MARCO Document Ranking [57] and Natural Questions (NQ320K) [38].
Dataset Splits | Yes | The value of |r| is tuned on the validation set to optimize the trade-off between relevance and distinctness.
Hardware Specification | Yes | We train GR2 on eight NVIDIA Tesla A100 80GB GPUs.
Software Dependencies | Yes | GR2 and the reproduced baselines are implemented with PyTorch 1.9.0 and Hugging Face transformers 4.16.2.
Experiment Setup | Yes | For hyperparameters, we use the Adam optimizer with a linear warm-up over the first 10% of steps. The learning rate is 5e-5, label smoothing is 0.1, weight decay is 0.01, sequence length of documents is 512, max training steps are 50K, and batch size is 60. We train GR2 on eight NVIDIA Tesla A100 80GB GPUs. For more details, please see Appendix F.
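
For readers who want to mirror this setup, the quoted hyperparameters map directly onto a Hugging Face TrainingArguments object under the pinned environment above (PyTorch 1.9.0, transformers 4.16.2). The sketch below is an illustrative reconstruction, not the authors' released configuration; the output directory and the per-GPU split of the reported batch size of 60 are assumptions.

```python
# Illustrative reconstruction of the reported GR2 training hyperparameters,
# assuming the pinned environment: pip install torch==1.9.0 transformers==4.16.2
# This is a sketch, not the authors' released configuration.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./gr2_checkpoints",   # assumed path, not stated in the paper
    max_steps=50_000,                 # "max training steps are 50K"
    learning_rate=5e-5,               # reported learning rate
    weight_decay=0.01,                # reported weight decay
    label_smoothing_factor=0.1,       # reported label smoothing
    lr_scheduler_type="linear",       # linear warm-up then linear decay (Trainer default)
    warmup_ratio=0.1,                 # warm-up over the first 10% of steps
    per_device_train_batch_size=8,    # the paper reports a total batch size of 60 on
                                      # 8 GPUs; the per-device split is not specified
)
# Notes: the Trainer's default AdamW approximates the reported Adam + weight decay;
# the 512-token document length is applied at tokenization time, not here.
```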