Random Cycle Coding: Lossless Compression of Cluster Assignments via Bits-Back Coding

Authors: Daniel Severo, Ashish Khisti, Alireza Makhzani

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show RCC can save up to 2 bytes per element when applied to vector databases, and removes the need for assigning integer ids to identify vectors, translating to savings of up to 70% in vector database systems for similarity search applications. (Section 5, Experiments; a per-element savings estimate is sketched after the table.)
Researcher Affiliation | Academia | University of Toronto, Department of Electrical and Computer Engineering. {d.severo@mail, akhisti@, a.makhzani@}.utoronto.ca
Pseudocode | Yes | Algorithm 1: Pseudo-code for encoding with RCC. Algorithm 2: Pseudo-code for decoding with RCC. (An illustrative bits-back round-trip sketch follows the table.)
Open Source Code | Yes | Python code for encoding and decoding is given in Appendix A.
Open Datasets | Yes | We experimented with applying ROC and RCC to FAISS [Johnson et al., 2019] databases of varying size. Results are shown in Table 1. Scalar quantization [Cover, 1999, Lloyd, 1982] was used to partition the set of vectors into disjoint clusters. This results in clusters of approximately the same number of elements, which is the worst case for the savings of both ROC and RCC. RCC achieves optimal savings for all combinations of datasets, number of elements, and clusters. ROC-2 has similar performance to RCC but requires significantly more compute, as shown in Figure 2. The total savings will depend on the cluster sizes, the number of bytes used to encode each element (i.e., FAISS vector/embedding), as well as the size of id numbers in the database. Cluster sizes are often set to √n, resulting in log|Π| = √n · log((√n − 1)!) [Johnson et al., 2019]. A vast literature exists on encoding methods for vectors [Chen et al., 2010, Martinez et al., 2016, Babenko and Lempitsky, 2014, Jegou et al., 2010, Huijben et al., 2024], with typical values ranging from 8 to 16 bytes for Big ANN and 4 to 8 for SIFT1M. (The worst-case check after the table illustrates the equal-size setting.)
Dataset Splits | No | The paper describes applying the method to existing datasets (SIFT1M, Big ANN) but does not specify train/validation/test splits for its own experimental evaluation, as it's a compression algorithm, not a machine learning model being trained.
Hardware Specification | No | The paper includes 'wall-time plots' in the experiments section, which indicate execution time, but it does not specify concrete hardware details such as exact CPU or GPU models, processor types, or memory used for the experiments.
Software Dependencies | No | The paper provides pseudocode and mentions standard algorithms (e.g., ANS, ROC) and tools (FAISS), but it does not specify the version numbers of any specific software libraries or dependencies used for the experiments.
Experiment Setup | Yes | Figure 2: Median encoding plus decoding times, across 100 runs, for Random Order Coding (ROC) [Severo et al., 2023] and our method Random Cycle Coding (RCC). The number of elements n increases from left to right across plots. Clusters are fixed to have roughly the same size, n/k, mirroring vector database applications discussed in Section 5.3. (A minimal timing-harness sketch follows the table.)
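
To put the "up to 2 bytes per element" figure in context, here is a back-of-envelope estimate; it is a minimal sketch, assuming equal clusters of size √n and the savings formula quoted above, Σᵢ log₂((nᵢ − 1)!) bits. Per-element savings grow slowly with n:

```python
import math

def rcc_bytes_saved_per_element(n):
    # Equal clusters of size s = sqrt(n): total savings = s * log2((s - 1)!)
    # bits, using the identity math.lgamma(s) = ln((s - 1)!).
    s = math.isqrt(n)
    total_bits = s * math.lgamma(s) / math.log(2)
    return total_bits / n / 8

for n in (10**6, 10**8, 10**10):
    print(f"n = {n:>14,}: ~{rcc_bytes_saved_per_element(n):.2f} bytes/element")
```

This yields roughly 1 byte per element at a million vectors and approaches 2 bytes at very large scale; removing per-vector integer ids saves additional bytes on top, which is where the larger database-level percentages come from.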
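Algorithms 1 and 2 are given in the paper as pseudocode; the following is a minimal illustrative sketch of the bits-back mechanism they describe, not the authors' implementation. It models the ANS state as an unbounded integer stack, pops uniform indices to select one of the (|C| − 1)! cyclic orderings of each cluster, and pushes them back at decode time. In a real system the permutation would be conveyed implicitly by the storage order of the vectors rather than returned as an explicit successor map.

```python
# Toy ANS-style stack over an unbounded integer state (illustrative only).
def pop_uniform(state, m):
    """Bits-back 'borrow': decode a uniform symbol in [0, m) from the state."""
    return state // m, state % m

def push_uniform(state, m, x):
    """Inverse of pop_uniform: encode x in [0, m) back onto the state."""
    return state * m + x

def encode_clusters(state, clusters):
    """Spend popped bits to choose one of the (|C|-1)! cyclic orders of each
    cluster; return the new state and the permutation as a successor map."""
    succ = {}
    for cluster in sorted(clusters, key=min):
        head, *rest = sorted(cluster)
        prev = head
        while rest:
            state, j = pop_uniform(state, len(rest))
            nxt = rest.pop(j)
            succ[prev] = nxt
            prev = nxt
        succ[prev] = head  # close the cycle
    return state, succ

def decode_clusters(state, succ):
    """Recover clusters by tracing cycles, then push the borrowed bits back."""
    clusters, seen = [], set()
    for v in succ:
        if v in seen:
            continue
        cyc, u = [v], succ[v]
        while u != v:
            cyc.append(u)
            u = succ[u]
        seen.update(cyc)
        clusters.append(set(cyc))
    pops = []  # replay the encoder's choices, then repay in reverse (LIFO)
    for cluster in sorted(clusters, key=min):
        head, *rest = sorted(cluster)
        u = succ[head]
        while rest:
            j = rest.index(u)
            pops.append((len(rest), j))
            rest.pop(j)
            u = succ[u]
    for m, j in reversed(pops):
        state = push_uniform(state, m, j)
    return state, clusters

# Round trip: the pre-existing payload in the state is recovered exactly.
state0 = (1 << 64) + 12345
state1, perm = encode_clusters(state0, [{0, 3, 5}, {1, 4}, {2}])
state2, clusters = decode_clusters(state1, perm)
assert state2 == state0
assert sorted(map(sorted, clusters)) == [[0, 3, 5], [1, 4], [2]]
```

The net cost of the scheme is what it takes to transmit the permutation minus the pushed-back bits, which is where the Σᵢ log((nᵢ − 1)!) savings come from.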
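As a quick numeric check of the worst-case claim in the Open Datasets row, the sketch below (assumptions: the savings formula Σᵢ log₂((nᵢ − 1)!), integer cluster sizes) compares an equal split against a skewed split of the same n and k. Since log Γ is convex, the equal split minimizes the savings:

```python
import math

def savings_bits(sizes):
    """log2|Π| = sum_i log2((n_i - 1)!), via math.lgamma(s) = ln((s - 1)!)."""
    return sum(math.lgamma(s) for s in sizes) / math.log(2)

n, k = 10**6, 1000                  # k = sqrt(n), the FAISS-style setting
equal = [n // k] * k                # all clusters the same size: worst case
skewed = [n // (2 * k)] * (k - 1) + [n - (k - 1) * (n // (2 * k))]

print(f"equal : {savings_bits(equal) / n:.2f} bits/element")
print(f"skewed: {savings_bits(skewed) / n:.2f} bits/element")
```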
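For timings in the spirit of Figure 2 (median of 100 runs, k clusters of roughly n/k elements), a harness along these lines could be used; `encode_clusters` and `decode_clusters` refer to the toy sketch above and are hypothetical stand-ins for real RCC or ROC coders:

```python
import time
from statistics import median

def median_roundtrip_time(encode, decode, n, k, runs=100):
    """Median encode+decode wall time over `runs` on k clusters of size n/k."""
    clusters = [set(range(i, n, k)) for i in range(k)]  # equal-size partition
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        state, perm = encode(1 << 64, clusters)
        decode(state, perm)
        times.append(time.perf_counter() - t0)
    return median(times)

# e.g., with the toy coders above:
# median_roundtrip_time(encode_clusters, decode_clusters, n=10_000, k=100)
```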