Learning to Tokenize for Generative Retrieval

Authors: Weiwei Sun, Lingyong Yan, Zheng Chen, Shuaiqiang Wang, Haichao Zhu, Pengjie Ren, Zhumin Chen, Dawei Yin, Maarten de Rijke, Zhaochun Ren

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on the NQ320K, MS MARCO, and BEIR datasets.
Researcher Affiliation | Collaboration | Shandong University, China; Baidu Inc., China; University of Amsterdam, The Netherlands; Leiden University, The Netherlands
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code of this work is available at www.github.com/sunnweiwei/GenRet.
Open Datasets | Yes | We conduct experiments on three well-known document retrieval benchmark datasets, NQ320K [15, 37], MS MARCO [4, 46], and BEIR [38].
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits, percentages, or absolute sample counts in the main text. It mentions 'training data' and 'test sets' but does not specify the splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | Yes | The proposed models and the reproduced baselines are implemented with PyTorch 1.7.1 and Hugging Face transformers 4.22.2.
Experiment Setup | Yes | We utilize the T5-Base model [27] as the base Transformer and initialize a new codebook embedding E_t for each time step. We set the number of clusters to K = 512 for all datasets, with the length of the docid M depending on the number of documents. In the docid re-assignment, the hyper-parameter ϵ is set to 1.0, and the Sinkhorn-Knopp algorithm is run for 100 iterations. We optimize the model with AdamW, a learning rate of 5e-4, and a batch size of 256, for up to 500k steps per time step. We weight the reconstruction losses by a factor of 0.1 to balance their scale.
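The reported setup can be summarized in a short sketch of the Sinkhorn-Knopp docid re-assignment together with the quoted hyper-parameters. This is a minimal illustration assuming a PyTorch environment; the function name sinkhorn_assign, the scores tensor, and the config dictionary are illustrative and are not taken from the released GenRet code, although the numeric values match those quoted above.

```python
import torch

def sinkhorn_assign(scores: torch.Tensor, epsilon: float = 1.0, n_iters: int = 100) -> torch.Tensor:
    """Balanced soft assignment of N documents to K codebook entries via the
    Sinkhorn-Knopp algorithm (epsilon = 1.0 and 100 iterations, as reported).
    `scores` is assumed to hold document-to-codebook similarities of shape (N, K)."""
    Q = torch.exp(scores / epsilon)           # positive kernel, shape (N, K)
    Q = Q / Q.sum()                           # normalize to a joint distribution
    N, K = Q.shape
    for _ in range(n_iters):
        # make each column sum to 1/K: every cluster receives an equal share
        Q = Q / (Q.sum(dim=0, keepdim=True) * K)
        # make each row sum to 1/N: every document distributes one unit of mass
        Q = Q / (Q.sum(dim=1, keepdim=True) * N)
    return Q * N                              # rows now sum to 1: soft assignment

# Hyper-parameters as quoted in the Experiment Setup row; the dict is only an
# illustrative container, not the paper's actual configuration file.
config = {
    "base_model": "t5-base",
    "codebook_size": 512,                 # K clusters per time step
    "optimizer": "AdamW",
    "learning_rate": 5e-4,
    "batch_size": 256,
    "max_steps_per_timestep": 500_000,
    "reconstruction_loss_weight": 0.1,
    "sinkhorn_epsilon": 1.0,
    "sinkhorn_iterations": 100,
}
```

Taking the argmax over each row of the returned matrix would then give a discrete codebook index (one docid token) per document for that time step, which is the usual way such a balanced assignment is discretized.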