Knowledge Distillation for High Dimensional Search Index
Authors: Zepu Lu, Jin Chen, Defu Lian, Zaixi Zhang, Yong Ge, Enhong Chen
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results demonstrate that KDindex outperforms existing learnable quantization-based indexes and is 40× lighter than the state-of-the-art non-exhaustive methods while achieving comparable recall quality. |
| Researcher Affiliation | Academia | (1) School of Computer Science and Technology, University of Science and Technology of China; (2) State Key Laboratory of Cognitive Intelligence, Hefei, Anhui, China; (3) University of Electronic Science and Technology of China; (4) University of Arizona |
| Pseudocode | Yes | Algorithm 1: Posting List Balance (a hedged reconstruction is sketched after the table) |
| Open Source Code | No | The paper does not include an unambiguous statement that the authors' own source code for the described methodology is being released, nor does it provide a direct link to a code repository for KDindex. |
| Open Datasets | Yes | Four large-scale retrieval benchmarks, including SIFT1M and GIST1M from the ANN datasets [2], and MS MARCO Doc and MS MARCO Passage from the TREC 2019 Deep Learning Track [9], are used to validate the effectiveness of the proposed KDindex. (A loader for the `.fvecs` format used by SIFT1M/GIST1M is sketched after the table.) |
| Dataset Splits | Yes | Document Retrieval consists of 3.2M documents, 0.36M training queries, and 5K development queries. Passage Retrieval has a corpus of 8.8M passages, 0.8M training queries, and 0.1M development queries. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory specifications, or detailed cloud/cluster resource configurations used for running the experiments. |
| Software Dependencies | No | The paper mentions 'The baselines are implemented based on the Faiss ANNS library [26]' but does not provide specific version numbers for Faiss or any other software dependency, which a reproducible description requires. (A representative Faiss baseline is sketched after the table.) |
| Experiment Setup | Yes | Each vector is quantized by B = 8 codebooks, each of which contains W = 256 codewords by default. The centroids are trained with a learning rate of 0.01 and optimized by the Adam [28] optimizer. The batch size is set to 64. (A training sketch under this configuration follows the table.) |
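The Pseudocode row refers to Algorithm 1 (Posting List Balance), which the paper gives only as pseudocode. The sketch below is a hypothetical reconstruction of the general idea, assuming a greedy scheme that caps each posting list and spills overflow vectors to their next-nearest centroid; it illustrates posting-list balancing in an IVF-style index, not the authors' exact algorithm.

```python
import numpy as np

def balance_posting_lists(x, centroids, max_size):
    """Greedy posting-list balancing (hypothetical sketch, not the paper's
    Algorithm 1): assign each vector to its nearest centroid, spilling to
    the next-nearest centroid whenever a list has reached the size cap."""
    # Squared L2 distance from every vector to every centroid: shape (n, k).
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    order = np.argsort(d2, axis=1)               # centroids by preference
    lists = [[] for _ in range(len(centroids))]
    for i in range(len(x)):
        for c in order[i]:                       # try nearest first
            if len(lists[c]) < max_size:         # respect the size cap
                lists[c].append(i)
                break
    return lists
```

Setting `max_size` slightly above `len(x) / len(centroids)` bounds the largest list, trading a small increase in assignment distance for more uniform list lengths.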
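For the Open Datasets row: SIFT1M and GIST1M are distributed in the standard `.fvecs` format, where each record is a little-endian int32 dimension followed by that many float32 components. A minimal reader is sketched below; the file paths are assumptions about the public archive layout.

```python
import numpy as np

def read_fvecs(path):
    """Read an .fvecs file (the SIFT1M/GIST1M layout): each record is an
    int32 dimension d followed by d float32 vector components."""
    raw = np.fromfile(path, dtype=np.float32)
    d = raw[:1].view(np.int32)[0]           # dimension from the first record
    return raw.reshape(-1, d + 1)[:, 1:]    # drop the leading length column

# Hypothetical paths; the actual archive layout may differ.
base = read_fvecs("sift/sift_base.fvecs")       # 1M database vectors
queries = read_fvecs("sift/sift_query.fvecs")   # query vectors
```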
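For the Software Dependencies row: the paper names the Faiss ANNS library without a version. As a point of reference, a typical quantization-based baseline of the kind the paper compares against can be assembled with the standard Faiss API as below; all parameter values are illustrative, not the paper's.

```python
import faiss
import numpy as np

d, nlist = 128, 1024     # vector dimension and number of posting lists
M, nbits = 8, 8          # 8 sub-quantizers with 2^8 = 256 codewords each,
                         # mirroring the paper's B = 8, W = 256 setting
xb = np.random.rand(100_000, d).astype("float32")   # stand-in database

quantizer = faiss.IndexFlatL2(d)                  # coarse centroid index
index = faiss.IndexIVFPQ(quantizer, d, nlist, M, nbits)
index.train(xb)                                   # learn centroids and codebooks
index.add(xb)
index.nprobe = 16                                 # posting lists probed per query
D, I = index.search(xb[:5], 10)                   # distances and top-10 ids
```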
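For the Experiment Setup row: below is a minimal PyTorch sketch of training B = 8 learnable codebooks of W = 256 codewords with Adam at learning rate 0.01 and batch size 64, as the row reports. The soft-assignment reconstruction loss is a stand-in; the paper's actual distillation objective is not reproduced here.

```python
import torch

B, W, d = 8, 256, 128                    # codebooks, codewords, vector dim
sub = d // B                             # sub-vector dimension per codebook
codebooks = torch.nn.Parameter(torch.randn(B, W, sub))
opt = torch.optim.Adam([codebooks], lr=0.01)     # optimizer/lr from the paper

def quantize(x, tau=1.0):
    """Differentiable PQ: softly assign each sub-vector to codewords."""
    parts = x.view(-1, B, sub)                            # (N, B, sub)
    d2 = ((parts.unsqueeze(2) - codebooks) ** 2).sum(-1)  # (N, B, W)
    w = torch.softmax(-d2 / tau, dim=-1)                  # soft assignments
    return torch.einsum("nbw,bws->nbs", w, codebooks).reshape(-1, d)

x = torch.randn(10_000, d)               # stand-in training vectors
for step in range(100):
    batch = x[torch.randint(len(x), (64,))]            # batch size 64
    loss = ((quantize(batch) - batch) ** 2).mean()     # stand-in loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```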