Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models
Authors: Dongwon Jo, Taesu Kim, Yulhwa Kim, Jae-Joon Kim
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results reveal that BinaryMoS surpasses conventional binarization techniques in various natural language processing tasks and even outperforms 2-bit quantization methods, all while maintaining a similar model size to static binarization techniques. |
| Researcher Affiliation | Collaboration | 1 Seoul National University 2 SqueezeBits Inc. 3 Sungkyunkwan University {dongwonjo, kimjaejoon}@snu.ac.kr {taesu.kim}@squeezebits.com {yulhwakim}@skku.edu |
| Pseudocode | No | The paper describes the operations using mathematical equations but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code (e.g., a specific repository link or an explicit statement of code release) within its text. |
| Open Datasets | Yes | We measure language modeling capabilities of these models by evaluating their perplexity on the WikiText2 [24] and C4 [25] datasets. |
| Dataset Splits | No | The paper mentions using a 'mixed dataset composed of the WikiText2 training dataset and a selected partition from the C4 training dataset', but it does not specify explicit train/validation/test splits (e.g., percentages or sample counts) needed to reproduce the data partitioning for training, nor does it mention a dedicated validation set. |
| Hardware Specification | Yes | All training sessions are conducted on NVIDIA A100 GPUs. All experiments are conducted on NVIDIA A6000 GPUs. |
| Software Dependencies | No | The paper mentions using the 'AdamW [18] optimizer' but does not specify any software dependencies with their version numbers (e.g., PyTorch version, Python version, specific library versions). |
| Experiment Setup | Yes | The training is conducted over three epochs using the AdamW [18] optimizer, with hyperparameters set to β1 = 0.9, β2 = 0.999, and zero weight decay. We implement a cosine decay learning rate scheduler, preceded by a warm-up phase constituting 0.03 of the total training duration. For the training of BinaryMoS, we empirically set α = 10. We empirically find that using four scaling experts each for the input and output dimensions provides the optimal compromise between increasing model size and improving accuracy. (A minimal sketch of this training configuration follows the table.) |
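
Below is a minimal sketch of the training configuration reported in the Experiment Setup row (AdamW with β1 = 0.9, β2 = 0.999, zero weight decay, cosine decay with a 0.03 warm-up ratio, three epochs). It assumes a PyTorch causal language model and a `train_loader` dataloader, which are placeholders rather than artifacts from the paper; the peak learning rate is also an assumed value, since it is not quoted in this row. The BinaryMoS-specific pieces (α = 10, four scaling experts) are not reproduced here.

```python
# Hypothetical training-loop sketch; `model` and `train_loader` must be supplied by the user.
import torch
from transformers import get_cosine_schedule_with_warmup

EPOCHS = 3            # "training is conducted over three epochs"
WARMUP_RATIO = 0.03   # warm-up phase is 0.03 of the total training duration

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,                 # assumed value; not specified in the quoted text
    betas=(0.9, 0.999),      # β1 = 0.9, β2 = 0.999 as reported
    weight_decay=0.0,        # zero weight decay as reported
)

total_steps = EPOCHS * len(train_loader)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(WARMUP_RATIO * total_steps),
    num_training_steps=total_steps,
)

for epoch in range(EPOCHS):
    for batch in train_loader:
        loss = model(**batch).loss   # assumes a HuggingFace-style causal LM interface
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```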