Learning to Tokenize for Generative Retrieval
Authors: Weiwei Sun, Lingyong Yan, Zheng Chen, Shuaiqiang Wang, Haichao Zhu, Pengjie Ren, Zhumin Chen, Dawei Yin, Maarten de Rijke, Zhaochun Ren
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on the NQ320K, MS MARCO, and BEIR datasets. |
| Researcher Affiliation | Collaboration | 1 Shandong University, China; 2 Baidu Inc., China; 3 University of Amsterdam, The Netherlands; 4 Leiden University, The Netherlands |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code of this work is available at www.github.com/sunnweiwei/GenRet. |
| Open Datasets | Yes | We conduct experiments on three well-known document retrieval benchmark datasets, NQ320K [15, 37], MS MARCO [4, 46], and BEIR [38]. |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits, percentages, or absolute sample counts for each split in the main text. It mentions 'training data' and 'test sets' but not the specific splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | Yes | The proposed models and the reproduced baselines are implemented with PyTorch 1.7.1 and Hugging Face transformers 4.22.2. |
| Experiment Setup | Yes | We utilize the T5-Base model [27] as the base Transformer and initialize a new codebook embedding E_t for each time step. We set the number of clusters to be K = 512 for all datasets, with the length of the docid M being dependent on the number of documents present. In the docid re-assignment, the hyper-parameter ϵ is set to 1.0, and the Sinkhorn-Knopp algorithm is executed for 100 iterations. We optimize the model using AdamW and set the learning rate to 5e-4. The batch size is 256, and the model is optimized for up to 500k steps for each timestep. We add a factor of 0.1 to the reconstruction losses to balance the scale. |
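
The reported setup can be condensed into a short configuration sketch. The snippet below is an illustrative reconstruction based only on the hyper-parameters quoted above, not the authors' released code (see www.github.com/sunnweiwei/GenRet); the `config` dictionary and the `sinkhorn_knopp` helper are hypothetical names, and the Sinkhorn-Knopp routine follows the standard balanced-assignment formulation, which may differ in detail from the paper's implementation.

```python
# Illustrative sketch of the reported GenRet training setup.
# All names and structure are assumptions, not taken from the released code.
import torch
from transformers import T5ForConditionalGeneration

config = {
    "base_model": "t5-base",            # T5-Base as the base Transformer
    "num_clusters": 512,                # codebook size K (per timestep)
    "sinkhorn_epsilon": 1.0,            # epsilon for docid re-assignment
    "sinkhorn_iterations": 100,
    "learning_rate": 5e-4,              # optimized with AdamW
    "batch_size": 256,
    "max_steps_per_timestep": 500_000,
    "reconstruction_loss_weight": 0.1,  # scale factor on reconstruction losses
}

model = T5ForConditionalGeneration.from_pretrained(config["base_model"])
optimizer = torch.optim.AdamW(model.parameters(), lr=config["learning_rate"])


def sinkhorn_knopp(scores: torch.Tensor,
                   epsilon: float = config["sinkhorn_epsilon"],
                   n_iters: int = config["sinkhorn_iterations"]) -> torch.Tensor:
    """Balanced document-to-cluster assignment via Sinkhorn-Knopp.

    `scores` is a (num_docs, K) matrix of document/codebook similarities; the
    result is a (num_docs, K) soft assignment whose cluster usage is roughly
    uniform, which is the role the docid re-assignment step plays in the paper.
    """
    Q = torch.exp(scores / epsilon).T   # (K, num_docs)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)  # rows: equal mass per cluster
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)  # columns: unit mass per document
        Q /= B
    return (Q * B).T  # each row sums to 1: per-document assignment distribution
```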