Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Authors: Dongwon Jo, Taesu Kim, Yulhwa Kim, Jae-Joon Kim

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experimental results reveal that BinaryMoS surpasses conventional binarization techniques in various natural language processing tasks and even outperforms 2-bit quantization methods, all while maintaining similar model size to static binarization techniques."
Researcher Affiliation | Collaboration | 1 Seoul National University, 2 SqueezeBits Inc., 3 Sungkyunkwan University; {dongwonjo, kimjaejoon}@snu.ac.kr, {taesu.kim}@squeezebits.com, {yulhwakim}@skku.edu
Pseudocode | No | The paper describes the operations using mathematical equations but does not provide structured pseudocode or algorithm blocks. (An illustrative sketch of a token-adaptive binarized layer is given after this table.)
Open Source Code | No | The paper does not provide concrete access to source code (e.g., a specific repository link or an explicit statement of code release) within its text.
Open Datasets | Yes | "We measure language modeling capabilities of these models by evaluating their perplexity on the WikiText2 [24] and C4 [25] datasets." (A perplexity-measurement sketch is given after this table.)
Dataset Splits | No | The paper mentions using a "mixed dataset composed of the WikiText2 training dataset and a selected partition from the C4 training dataset", but it does not specify explicit train/validation/test splits (e.g., percentages or sample counts) needed to reproduce the data partitioning, nor does it explicitly mention a dedicated validation split.
Hardware Specification | Yes | "All training sessions are conducted on NVIDIA A100 GPUs." "All experiments are conducted on NVIDIA A6000 GPUs."
Software Dependencies | No | The paper mentions using the "AdamW [18] optimizer" but does not specify any software dependencies with their version numbers (e.g., PyTorch version, Python version, specific library versions).
Experiment Setup | Yes | "The training is conducted over three epochs using the AdamW [18] optimizer, with hyperparameters set to β1 = 0.9, β2 = 0.999, and zero weight decay. We implement a cosine decay learning rate scheduler, preceded by a warm-up phase constituting 0.03 of the total training duration. For the training of BinaryMoS, we empirically set α = 10. We empirically find that using four scaling experts each for the input and output dimensions provides the optimal compromise between increasing model size and improving accuracy." (A sketch of this optimizer and scheduler configuration is given after this table.)
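
The Pseudocode and Experiment Setup rows indicate that BinaryMoS is described only through equations, with four scaling experts each for the input and output dimensions. For illustration only, a token-adaptive binarized linear layer along those lines might look like the PyTorch sketch below; the router design, the sign-based binarization, and all shapes and names here are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenAdaptiveBinaryLinear(nn.Module):
    """Illustrative sketch only (not the paper's code): a binarized linear layer
    whose input/output scaling factors are a per-token mixture of scaling experts."""

    def __init__(self, in_features: int, out_features: int, num_experts: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # One learned scale vector per expert for the input and output dimensions.
        self.in_scale_experts = nn.Parameter(torch.ones(num_experts, in_features))
        self.out_scale_experts = nn.Parameter(torch.ones(num_experts, out_features))
        # Router maps each token's hidden state to mixing weights over the experts.
        self.router = nn.Linear(in_features, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_features)
        gate = F.softmax(self.router(x), dim=-1)       # (batch, seq, num_experts)
        in_scale = gate @ self.in_scale_experts        # (batch, seq, in_features)
        out_scale = gate @ self.out_scale_experts      # (batch, seq, out_features)
        # Sign-binarized weights; real training would need a straight-through
        # estimator for the gradient, omitted in this forward-only sketch.
        w_bin = torch.sign(self.weight)
        return F.linear(x * in_scale, w_bin) * out_scale
```

For example, TokenAdaptiveBinaryLinear(4096, 4096) maps a (batch, seq, 4096) activation to the same shape; the only per-token work added on top of the binarized matrix multiply is the small router projection and the two expert mixtures.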
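
The Open Datasets row quotes a perplexity evaluation on WikiText2 and C4. A minimal sketch of such a measurement for WikiText2 (C4 would be handled analogously) is given below, assuming a Hugging Face causal language model; the window length, the non-overlapping stride, and the choice of checkpoint are assumptions, not details taken from the paper.

```python
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer


@torch.no_grad()
def wikitext2_perplexity(model_name: str, seq_len: int = 2048) -> float:
    """Perplexity of a causal LM on the WikiText2 test split (illustrative sketch)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()

    # Concatenate the test split into one long token stream.
    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

    nlls = []
    for start in range(0, ids.size(1) - seq_len, seq_len):
        chunk = ids[:, start : start + seq_len]
        loss = model(chunk, labels=chunk).loss   # mean token NLL over this chunk
        nlls.append(loss.float() * seq_len)
    return math.exp((torch.stack(nlls).sum() / (len(nlls) * seq_len)).item())
```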
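
The Experiment Setup row specifies AdamW with β1 = 0.9, β2 = 0.999, zero weight decay, and a cosine-decay learning rate schedule preceded by a warm-up covering 0.03 of training. A minimal PyTorch sketch of that configuration follows; the peak learning rate, the total step count, and the linear warm-up shape are assumptions not stated in the excerpt.

```python
import math

import torch


def build_optimizer_and_scheduler(model, peak_lr: float, total_steps: int,
                                  warmup_ratio: float = 0.03):
    """AdamW plus a warm-up-then-cosine-decay schedule with the reported hyperparameters."""
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=peak_lr, betas=(0.9, 0.999), weight_decay=0.0
    )
    warmup_steps = int(warmup_ratio * total_steps)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)               # linear warm-up (assumed shape)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))    # cosine decay to zero

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

Calling scheduler.step() after every optimizer step walks the learning rate through the warm-up and cosine phases over the three training epochs described above.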