Intriguing Properties of Quantization at Scale
Authors: Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Zhen Stephen Gou, Phil Blunsom, Ahmet Üstün, Sara Hooker
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We successfully quantize models ranging in size from 410M to 52B with minimal degradation in performance. We conduct a controlled large-scale study: at 6B, we maintain the same architecture and vary key optimization choices such as weight decay, gradient clipping, dropout, and precision of training representation. We present results with optimal hyper-parameters across models varying from 410 million to 52 billion parameters, with each experiment variant trained from random initialization. Evaluation: we evaluate each model variant on COPA (test and dev set) (Wang et al., 2019), HellaSwag (Zellers et al., 2019), PIQA (validation) (Bisk et al., 2020), StoryCloze (Mostafazadeh et al., 2016), Winogrande (Sakaguchi et al., 2019), Paralex (Fader et al., 2013), and LAMBADA (Paperno et al., 2016). (A minimal quantization sketch follows the table.) |
| Researcher Affiliation | Collaboration | Arash Ahmadian (Cohere For AI) arash@cohere.com; Saurabh Dash (Cohere) saurabh@cohere.com; Hongyu Chen (Cohere) charlie@cohere.com; Bharat Venkitesh (Cohere) bharat@cohere.com; Stephen Gou (Cohere) stephen@cohere.com; Phil Blunsom (Cohere) phil@cohere.com; Ahmet Üstün (Cohere For AI) ahmet@cohere.com; Sara Hooker (Cohere For AI) sarahooker@cohere.com. Equal contribution. Also affiliated with the University of Toronto & the Vector Institute for Artificial Intelligence. |
| Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks. It describes processes and equations in paragraph form, but not structured as an algorithm. |
| Open Source Code | No | The paper refers to third-party tools and frameworks like 'llama.cpp' and 'FAX (Yoo et al., 2022) framework' but does not provide any statement or link indicating that the authors' own implementation code for the described methodology is publicly available or open-sourced. |
| Open Datasets | Yes | We pre-train models using a mixture of datasets from Common Crawl and C4 (Raffel et al., 2020) with AdamW (Loshchilov & Hutter, 2019) optimizer and a batch size of 256. |
| Dataset Splits | No | The paper states that evaluation was done on various benchmarks, including 'COPA (test and dev set)' and 'PIQA (validation)'. While these contain validation-like sets, the paper does not specify the train/validation/test splits for their primary pre-training datasets (Common Crawl and C4 mixture) which would be necessary for reproduction. |
| Hardware Specification | Yes | We use TPU-v4 chips (Jouppi et al., 2017) to train, and Nvidia A100 GPUs to evaluate our models. |
| Software Dependencies | No | The paper mentions software components like the 'SentencePiece (Kudo & Richardson, 2018) tokenizer' and the 'FAX (Yoo et al., 2022) framework' but does not provide specific version numbers for these or other software dependencies, which would be required for a reproducible setup. |
| Experiment Setup | Yes | We pre-train models using a mixture of datasets from Common Crawl and C4 (Raffel et al., 2020) with AdamW (Loshchilov & Hutter, 2019) optimizer and a batch size of 256. We use a cosine learning rate scheduler with 1500 warm-up steps. We use GeLU activations (Hendrycks & Gimpel, 2016). Table 1 (optimization choices explored for pre-training in our controlled setup): Weight decay: 0.001, 0.01, 0.1; Gradient clipping: None, 1; Dropout: 0, 0.1, 0.4, 0.8; Half-precision: bf16, fp16. (A hedged configuration sketch follows the table.) |
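The Experiment Setup row above quotes a batch size of 256, the AdamW optimizer, a cosine learning-rate schedule with 1,500 warm-up steps, GeLU activations, and the Table 1 sweep over weight decay, gradient clipping, dropout, and half-precision format. The paper itself trains with the FAX (JAX) framework on TPU-v4 chips; the sketch below is only a hedged PyTorch approximation of that configuration, not the authors' code. The peak learning rate, total step count, and the HF-style model interface are placeholder assumptions not stated in the quoted text.

```python
# Hedged sketch of the quoted pre-training configuration (PyTorch, not the
# authors' FAX/JAX code). PEAK_LR, TOTAL_STEPS, and the model interface are
# placeholder assumptions.
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Table 1 sweep: each experiment variant fixes one value per row and is
# trained from random initialization.
SWEEP = {
    "weight_decay": [0.001, 0.01, 0.1],
    "gradient_clipping": [None, 1.0],
    "dropout": [0.0, 0.1, 0.4, 0.8],
    "half_precision": [torch.bfloat16, torch.float16],
}

BATCH_SIZE = 256       # stated in the quoted setup
WARMUP_STEPS = 1_500   # stated in the quoted setup
TOTAL_STEPS = 100_000  # placeholder; not given in the quoted text
PEAK_LR = 1e-4         # placeholder; not given in the quoted text


def cosine_with_warmup(step: int) -> float:
    """Linear warm-up for WARMUP_STEPS, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


def make_optimizer(model: torch.nn.Module, weight_decay: float):
    """AdamW plus the cosine-with-warmup schedule described in the quote."""
    optimizer = AdamW(model.parameters(), lr=PEAK_LR, weight_decay=weight_decay)
    scheduler = LambdaLR(optimizer, lr_lambda=cosine_with_warmup)
    return optimizer, scheduler


def training_step(model, batch, optimizer, scheduler, grad_clip):
    """One optimization step; `model(**batch).loss` assumes an HF-style LM."""
    loss = model(**batch).loss
    loss.backward()
    if grad_clip is not None:  # Table 1: gradient clipping is None or 1
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()
```

One Table 1 variant would, for example, call `make_optimizer(model, weight_decay=0.1)` and pass `grad_clip=1.0` to `training_step`, corresponding to a single cell of the sweep grid.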
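The Research Type row above quotes that models from 410M to 52B parameters are quantized "with minimal degradation in performance." As a purely illustrative sketch, the NumPy snippet below implements generic per-row absmax int8 weight quantization and dequantization; it is not the paper's specific quantization scheme or tooling, and the function names and final error metric are illustrative choices.

```python
# Generic per-row absmax int8 weight quantization (illustrative only; not the
# paper's exact quantization scheme).
import numpy as np


def quantize_int8(weight: np.ndarray):
    """Quantize a 2-D float weight matrix row-wise to int8 with absmax scaling."""
    scale = np.abs(weight).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(weight / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float matrix from int8 values and per-row scales."""
    return q.astype(np.float32) * scale


if __name__ == "__main__":
    w = np.random.randn(4, 8).astype(np.float32)
    q, s = quantize_int8(w)
    w_hat = dequantize_int8(q, s)
    # "Degradation" here is summarized as the mean absolute round-trip error.
    print("mean abs quantization error:", np.abs(w - w_hat).mean())
```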