Clustering the Sketch: Dynamic Compression for Embedding Tables
Authors: Henry Tsang, Thomas Ahle
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally CCE achieves the best of both worlds: the high compression rate of codebook-based quantization, but dynamically, like hashing-based methods, so it can be used during training. Our primary experimental finding, illustrated in Table 1 and Figure 4a, indicates that CCE enables training a model whose Binary Cross Entropy matches a full-table baseline, using only half the parameters required by the next best compression method. |
| Researcher Affiliation | Industry | Henry Ling-Hei Tsang, Meta, henrylhtsang@meta.com; Thomas Dybdahl Ahle, Meta / Normal Computing, thomas@ahle.dk |
| Pseudocode | Yes | Algorithm 1: Dense CCE for Least Squares; Algorithm 2: Sparse CCE for Least Squares; Algorithm 3: Clustered Compositional Embeddings with c columns and 2k rows. An illustrative lookup sketch appears below the table. |
| Open Source Code | Yes | An implementation of our methods and related work is available at github.com/thomasahle/cce. |
| Open Datasets | Yes | We used two public click log datasets from Criteo: the Kaggle and Terabyte datasets. |
| Dataset Splits | Yes | For both the Kaggle and Terabyte datasets, we partitioned the data from the final day into validation and test sets. We measure the performance of the model in BCE every 50,000 batches (around one-sixth of one epoch) using the validation set. A sketch of this evaluation cadence appears below the table. |
| Hardware Specification | Yes | We ran the Kaggle dataset experiments on a single A100 GPU. For the Terabyte dataset experiments, we ran them on two A100 GPUs using model parallelism. |
| Software Dependencies | No | The paper mentions software like PyTorch, FAISS K-means, and Scikit-learn but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | In our experiments, we adhered to the setup from the open-source Deep Learning Recommendation Model (DLRM) by Naumov et al. [2019], including the choice of optimizer (SGD) and learning rate. For the K-means from FAISS, we use max_points_per_centroid=256 and niter=50. A sketch of this K-means configuration appears below the table. |
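
The pseudocode row references Algorithm 3 (Clustered Compositional Embeddings). As a rough illustration of the hashing-based lookup that CCE builds on, the sketch below sums rows drawn from several small embedding tables indexed by independent hash functions. The class name, table sizes, and hash scheme are illustrative assumptions, not the authors' implementation; CCE additionally re-clusters the tables during training (not shown). The real code is at github.com/thomasahle/cce.

```python
import torch
import torch.nn as nn

class HashedCompositionalEmbedding(nn.Module):
    """Illustrative sketch: each id is hashed into several small tables and
    the selected rows are summed, replacing one huge embedding table."""

    def __init__(self, num_tables: int = 4, rows: int = 2 ** 14, dim: int = 16, seed: int = 0):
        super().__init__()
        self.tables = nn.ModuleList(nn.Embedding(rows, dim) for _ in range(num_tables))
        g = torch.Generator().manual_seed(seed)
        # Random odd-ish multipliers act as cheap, independent hash functions.
        self.register_buffer(
            "multipliers", torch.randint(1, 2 ** 31 - 1, (num_tables,), generator=g)
        )
        self.rows = rows

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (batch,) int64 categorical ids
        out = 0
        for table, m in zip(self.tables, self.multipliers):
            idx = (ids * m) % self.rows  # hash each id into this table's rows
            out = out + table(idx)
        return out

# Usage: 4 tables of 16,384 rows stand in for one table with millions of rows.
emb = HashedCompositionalEmbedding()
vectors = emb(torch.tensor([3, 1_000_003, 42]))  # shape (3, 16)
```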
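
The dataset-splits row states that validation BCE is measured every 50,000 batches. A minimal sketch of that cadence is shown below; the model, data loaders, and `evaluate_bce` helper are hypothetical names introduced only for illustration.

```python
import torch
import torch.nn.functional as F

EVAL_EVERY = 50_000  # batches, as stated in the paper (about one-sixth of an epoch)

def evaluate_bce(model, val_loader):
    """Average binary cross-entropy over the validation set (hypothetical helper)."""
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for features, labels in val_loader:
            logits = model(features)
            total += F.binary_cross_entropy_with_logits(logits, labels, reduction="sum").item()
            n += labels.numel()
    model.train()
    return total / n

# Inside a training loop (model, optimizer, train_loader, val_loader assumed to exist):
# for step, (features, labels) in enumerate(train_loader, 1):
#     loss = F.binary_cross_entropy_with_logits(model(features), labels)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
#     if step % EVAL_EVERY == 0:
#         print(f"step {step}: val BCE = {evaluate_bce(model, val_loader):.4f}")
```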
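
The experiment-setup row quotes the two FAISS K-means parameters used in the paper, max_points_per_centroid=256 and niter=50. The snippet below shows one way those parameters can be passed to `faiss.Kmeans`; the data shape and number of centroids are assumptions for illustration.

```python
import numpy as np
import faiss

# Illustrative data: 100k embedding vectors of dimension 16 (shapes are assumptions).
x = np.random.rand(100_000, 16).astype("float32")
k = 1024  # number of centroids (illustrative)

# Parameters quoted in the paper: niter=50, max_points_per_centroid=256.
kmeans = faiss.Kmeans(d=x.shape[1], k=k, niter=50, max_points_per_centroid=256)
kmeans.train(x)

centroids = kmeans.centroids                 # (k, 16) cluster centers
_, assignments = kmeans.index.search(x, 1)   # nearest centroid per vector
```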