Clustering the Sketch: Dynamic Compression for Embedding Tables

Authors: Henry Tsang, Thomas Ahle

NeurIPS 2023

Reproducibility assessment (variable, result, and supporting LLM response):
Research Type: Experimental
"Experimentally CCE achieves the best of both worlds: the high compression rate of codebook-based quantization, but dynamically like hashing-based methods, so it can be used during training." "Our primary experimental finding, illustrated in Table 1 and Figure 4a, indicates that CCE enables training a model with Binary Cross Entropy matching a full-table baseline, using only half the parameters required by the next best compression method."
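For context, here is a minimal PyTorch sketch of the hashing-based compositional embedding that CCE starts from, before any clustering: each ID is hashed into several small tables and the retrieved rows are summed. The class name, hashing scheme, and default values are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HashedCompositionalEmbedding(nn.Module):
    """Hashing-based compositional embedding (the baseline CCE builds on):
    each ID is hashed into `num_hashes` small tables and the rows summed.
    Illustrative sketch only."""

    def __init__(self, num_rows: int, dim: int, num_hashes: int = 4, seed: int = 0):
        super().__init__()
        self.num_rows = num_rows
        self.tables = nn.ModuleList(
            nn.Embedding(num_rows, dim) for _ in range(num_hashes)
        )
        g = torch.Generator().manual_seed(seed)
        # Random odd multipliers acting as cheap multiplicative hashes.
        salts = torch.randint(1, 2**31 - 1, (num_hashes,), generator=g) | 1
        self.register_buffer("salts", salts)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch,) int64 tensor of raw categorical IDs.
        out = 0
        for salt, table in zip(self.salts, self.tables):
            out = out + table((ids * salt) % self.num_rows)
        return out
```

CCE's contribution is to periodically replace these random hash assignments with k-means cluster assignments during training, which is what recovers codebook-quality compression while remaining trainable end to end.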
Researcher Affiliation: Industry
Henry Ling-Hei Tsang (Meta) henrylhtsang@meta.com; Thomas Dybdahl Ahle (Meta, Normal Computing) thomas@ahle.dk
Pseudocode: Yes
Algorithm 1 (Dense CCE for Least Squares), Algorithm 2 (Sparse CCE for Least Squares), and Algorithm 3 (Clustered Compositional Embeddings with c columns and 2k rows); a sketch of the clustering step behind Algorithm 3 follows below.
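The heart of Algorithm 3 is a periodic clustering step: after some training, the vectors currently assigned to each ID are clustered, the table is replaced by the centroids, and each ID's hash is replaced by its nearest-centroid assignment. A minimal sketch of that step using scikit-learn's KMeans, with hypothetical names (`cce_cluster_step`, `table`, `assign`); consult the paper and repository for the exact per-column schedule.

```python
import numpy as np
from sklearn.cluster import KMeans

def cce_cluster_step(table: np.ndarray, assign: np.ndarray, k: int):
    """One clustering step in the spirit of Algorithm 3 (hypothetical helper).

    table:  (rows, dim) currently trained embedding table
    assign: (num_ids,)  row index currently hashed to each ID
    Returns a new (k, dim) table of centroids and each ID's new assignment.
    """
    vectors = table[assign]                        # current vector per ID
    km = KMeans(n_clusters=k, n_init=10).fit(vectors)
    new_table = km.cluster_centers_.astype(table.dtype)
    new_assign = km.labels_.astype(np.int64)
    return new_table, new_assign
```

In the paper this step is applied per column of the representation; see Algorithm 3 for the full schedule.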
Open Source Code: Yes
"An implementation of our methods and related work is available at github.com/thomasahle/cce."
Open Datasets: Yes
"We used two public click log datasets from Criteo: the Kaggle and Terabyte datasets."
Dataset Splits: Yes
"For both the Kaggle and Terabyte datasets, we partitioned the data from the final day into validation and test sets. We measure the performance of the model in BCE every 50,000 batches (around one-sixth of one epoch) using the validation set."
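As a rough illustration of that evaluation cadence, a validation-BCE harness might look like the following; the model, loader, and device names are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

EVAL_EVERY = 50_000  # batches; roughly one-sixth of an epoch on Kaggle

@torch.no_grad()
def validation_bce(model, val_loader, device="cuda"):
    """Mean binary cross-entropy over the validation set (illustrative)."""
    model.eval()
    total_loss, total_count = 0.0, 0
    for features, labels in val_loader:
        logits = model(features.to(device))
        total_loss += F.binary_cross_entropy_with_logits(
            logits, labels.to(device).float(), reduction="sum"
        ).item()
        total_count += labels.numel()
    model.train()
    return total_loss / total_count

# In the training loop:
#     if step % EVAL_EVERY == 0:
#         print(step, validation_bce(model, val_loader))
```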
Hardware Specification: Yes
"We ran the Kaggle dataset experiments on a single A100 GPU. For the Terabyte dataset experiments, we ran them on two A100 GPUs using model parallelism."
Software Dependencies: No
The paper mentions software such as PyTorch, FAISS K-means, and Scikit-learn but does not provide version numbers for these components.
Experiment Setup: Yes
"In our experiments, we adhered to the setup from the open-source Deep Learning Recommendation Model (DLRM) by Naumov et al. [2019], including the choice of optimizer (SGD) and learning rate. For the K-means from FAISS, we use max_points_per_centroid=256 and niter=50."
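For reference, the reported K-means configuration maps onto FAISS's Python API as below; the dimension, centroid count, and data here are placeholders.

```python
import numpy as np
import faiss

d, k = 64, 4096                                   # dim and centroid count: example values
x = np.random.rand(100_000, d).astype("float32")  # placeholder vectors to cluster

# niter and max_points_per_centroid as stated in the paper's setup;
# FAISS subsamples so each centroid is trained on at most 256 points.
kmeans = faiss.Kmeans(d, k, niter=50, max_points_per_centroid=256)
kmeans.train(x)

# Nearest-centroid assignment for each vector.
_, assign = kmeans.index.search(x, 1)
```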