Knowledge Distillation for High Dimensional Search Index

Authors: Zepu Lu, Jin Chen, Defu Lian, Zaixi Zhang, Yong Ge, Enhong Chen

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results demonstrate that KDindex outperforms existing learnable quantization-based indexes and is 40× lighter than the state-of-the-art non-exhaustive methods while achieving comparable recall quality.
Researcher Affiliation | Academia | (1) School of Computer Science and Technology, University of Science and Technology of China; (2) State Key Laboratory of Cognitive Intelligence, Hefei, Anhui, China; (3) University of Electronic Science and Technology of China; (4) University of Arizona
Pseudocode | Yes | Algorithm 1: Posting List Balance
Open Source Code | No | The paper does not include an unambiguous statement that the authors' own source code for the described methodology is being released, nor does it provide a direct link to a code repository for KDindex.
Open Datasets | Yes | Four large-scale retrieval benchmarks, including SIFT1M, GIST1M from ANN datasets [2], MS MARCO Doc and MS MARCO Passage from the TREC 2019 Deep Learning Track [9], are used to validate the effectiveness of the proposed KDindex.
Dataset Splits | Yes | Document Retrieval consists of 3.2M documents, 0.36M training queries, and 5K development queries. Passage Retrieval has a corpus of 8.8M passages, 0.8M training queries, and 0.1M development queries.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory specifications, or detailed cloud/cluster resource configurations used for running the experiments.
Software Dependencies | No | The paper mentions 'The baselines are implemented based on the Faiss ANNS library [26]' but does not provide version numbers for Faiss or any other software dependency, which would be needed to reproduce the experimental environment.
Experiment Setup | Yes | Each vector is quantized by B = 8 codebooks, each of which contains W = 256 codewords by default. The centroids are trained with a learning rate of 0.01 and optimized by the Adam [28] optimizer. The batch size is set to 64.
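The report lists "Algorithm 1: Posting List Balance" as the paper's pseudocode. The sketch below is only a generic balancing heuristic, not a reconstruction of that algorithm: each vector is assigned to its nearest first-level centroid whose posting list still has capacity, so no list grows disproportionately. The function name `balanced_assign` and the capacity rule are illustrative assumptions.

```python
import numpy as np

def balanced_assign(vectors, centroids, max_size=None):
    """Assign each vector to the nearest centroid whose posting list is not full.

    Generic balancing heuristic only -- NOT a reconstruction of the paper's
    Algorithm 1. With the default cap, list sizes stay within ceil(n / k).
    """
    n, k = len(vectors), len(centroids)
    if max_size is None:
        max_size = int(np.ceil(n / k))
    # squared Euclidean distance from every vector to every centroid
    dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    order = np.argsort(dists, axis=1)          # each vector's centroid preference
    sizes = np.zeros(k, dtype=int)
    assign = np.empty(n, dtype=int)
    for i in range(n):
        for c in order[i]:
            if sizes[c] < max_size:
                assign[i], sizes[c] = c, sizes[c] + 1
                break
    return assign, sizes
```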
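SIFT1M and GIST1M from the ANN benchmarks are distributed in the `.fvecs` format, where each record stores an int32 dimension followed by that many float32 components. A minimal reader, assuming NumPy and a local copy of the files, could look like this; the file path in the usage line is illustrative.

```python
import numpy as np

def read_fvecs(path):
    """Read a .fvecs file: each record stores an int32 dimension d,
    then d float32 components (the layout used by SIFT1M / GIST1M)."""
    raw = np.fromfile(path, dtype=np.float32)
    d = raw[:1].view(np.int32)[0]              # dimension taken from the first record
    return raw.reshape(-1, d + 1)[:, 1:].copy()

# base = read_fvecs("sift/sift_base.fvecs")    # illustrative path; 1M x 128 vectors for SIFT1M
```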
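Because the baselines are built on Faiss but no version is reported, anyone re-running them should pin and log the Faiss version themselves. Below is a hedged sketch of an IVF-PQ baseline with 8 sub-quantizers of 256 codewords each (mirroring B = 8, W = 256); the `nlist`, `nprobe`, and data shapes are placeholders, not values taken from the paper.

```python
import numpy as np
import faiss                                   # e.g. pip install faiss-cpu (pin and record the exact version)

print("faiss version:", faiss.__version__)     # log it, since the paper does not state it

d, nlist, m, nbits = 128, 1024, 8, 8           # 8 sub-quantizers x 2^8 = 256 codewords (cf. B = 8, W = 256)
quantizer = faiss.IndexFlatL2(d)               # coarse quantizer over the first-level centroids
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

xb = np.random.rand(100_000, d).astype("float32")   # stand-in for the real base vectors
index.train(xb)
index.add(xb)
index.nprobe = 16                              # posting lists scanned per query (placeholder)
D, I = index.search(xb[:5], 10)                # distances and ids of the top-10 neighbors
```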
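The quoted setup (B = 8 codebooks of W = 256 codewords, Adam with learning rate 0.01, batch size 64) can be mirrored in a short PyTorch sketch. The reconstruction loss below is only a stand-in; the paper's distillation objective and architecture are not reproduced here, and the dimensionality d = 128 is assumed from SIFT1M.

```python
import torch

B, W, d = 8, 256, 128                       # B codebooks x W codewords; d = 128 assumed (SIFT1M)
sub = d // B                                # per-codebook sub-vector dimensionality

# Learnable codebooks; the paper's distillation objective is not reproduced here,
# so a plain reconstruction loss stands in as a placeholder.
codebooks = torch.nn.Parameter(0.01 * torch.randn(B, W, sub))
opt = torch.optim.Adam([codebooks], lr=0.01)   # optimizer and learning rate quoted from the paper

def reconstruct(x):
    """Hard-assign each sub-vector to its nearest codeword and rebuild the vector."""
    parts = x.view(x.size(0), B, sub).transpose(0, 1)        # (B, batch, sub)
    codes = torch.cdist(parts, codebooks).argmin(-1)          # (B, batch) nearest codeword ids
    rec = torch.stack([codebooks[b][codes[b]] for b in range(B)], dim=1)
    return rec.reshape(x.size(0), d)                          # (batch, d)

x = torch.randn(64, d)                      # batch size 64, as quoted
opt.zero_grad()
loss = ((reconstruct(x) - x) ** 2).sum(-1).mean()
loss.backward()
opt.step()
```

Note that hard nearest-codeword assignment only back-propagates into the selected codewords; the actual method may rely on a soft or straight-through assignment, which is not shown here.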