Intriguing Properties of Quantization at Scale
Authors: Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Zhen Stephen Gou, Phil Blunsom, Ahmet Üstün, Sara Hooker
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We successfully quantize models ranging in size from 410M to 52B with minimal degradation in performance. We conduct a controlled large-scale study: at 6B, we maintain the same architecture and vary key optimization choices such as weight decay, gradient clipping, dropout, and precision of training representation. We present results with optimal hyper-parameters across models varying from 410 million to 52 billion parameters, with each experiment variant trained from random initialization. Evaluation: we evaluate each model variant on COPA (test and dev set) (Wang et al., 2019), HellaSwag (Zellers et al., 2019), PIQA (validation) (Bisk et al., 2020), StoryCloze (Mostafazadeh et al., 2016), Winogrande (Sakaguchi et al., 2019), Paralex (Fader et al., 2013), and LAMBADA (Paperno et al., 2016). (A minimal quantization sketch follows the table.) |
| Researcher Affiliation | Collaboration | Arash Ahmadian (Cohere For AI) arash@cohere.com; Saurabh Dash (Cohere) saurabh@cohere.com; Hongyu Chen (Cohere) charlie@cohere.com; Bharat Venkitesh (Cohere) bharat@cohere.com; Stephen Gou (Cohere) stephen@cohere.com; Phil Blunsom (Cohere) phil@cohere.com; Ahmet Üstün (Cohere For AI) ahmet@cohere.com; Sara Hooker (Cohere For AI) sarahooker@cohere.com. Equal contribution. Also affiliated with the University of Toronto & the Vector Institute for Artificial Intelligence. |
| Pseudocode | No | The paper does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks. It describes processes and equations in paragraph form, but not structured as an algorithm. |
| Open Source Code | No | The paper refers to third-party tools and frameworks like 'llama.cpp' and 'FAX (Yoo et al., 2022) framework' but does not provide any statement or link indicating that the authors' own implementation code for the described methodology is publicly available or open-sourced. |
| Open Datasets | Yes | We pre-train models using a mixture of datasets from Common Crawl and C4 (Raffel et al., 2020) with AdamW (Loshchilov & Hutter, 2019) optimizer and a batch size of 256. |
| Dataset Splits | No | The paper states that evaluation was done on various benchmarks, including 'COPA (test and dev set)' and 'PIQA (validation)'. While these contain validation-like sets, the paper does not specify the train/validation/test splits for their primary pre-training datasets (Common Crawl and C4 mixture) which would be necessary for reproduction. |
| Hardware Specification | Yes | We use TPU-v4 chips (Jouppi et al., 2017) to train, and Nvidia A100 GPUs to evaluate our models. |
| Software Dependencies | No | The paper mentions software components like the 'SentencePiece (Kudo & Richardson, 2018) tokenizer' and the 'FAX (Yoo et al., 2022) framework' but does not provide specific version numbers for these or other software dependencies, which would be required for a reproducible setup. |
| Experiment Setup | Yes | We pre-train models using a mixture of datasets from Common Crawl and C4 (Raffel et al., 2020) with AdamW (Loshchilov & Hutter, 2019) optimizer and a batch size of 256. We use a cosine learning rate scheduler with 1500 warm-up steps. We use GeLU activations (Hendrycks & Gimpel, 2016). Table 1 (optimization choices explored for pre-training in our controlled setup): Weight decay: 0.001, 0.01, 0.1; Gradient clipping: None, 1; Dropout: 0, 0.1, 0.4, 0.8; Half-precision: bf16, fp16. (A hedged configuration sketch follows the table.) |
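The Experiment Setup row above quotes a batch size of 256, the AdamW optimizer, a cosine learning-rate schedule with 1,500 warm-up steps, GeLU activations, and the Table 1 sweep over weight decay, gradient clipping, dropout, and half-precision format. The paper itself trains with the FAX (JAX) framework on TPU-v4 chips; the sketch below is only a hedged PyTorch approximation of that configuration, not the authors' code. The peak learning rate, total step count, and the HF-style model interface are placeholder assumptions not stated in the quoted text.

```python
# Hedged sketch of the quoted pre-training configuration (PyTorch, not the
# authors' FAX/JAX code). PEAK_LR, TOTAL_STEPS, and the model interface are
# placeholder assumptions.
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Table 1 sweep: each experiment variant fixes one value per row and is
# trained from random initialization.
SWEEP = {
    "weight_decay": [0.001, 0.01, 0.1],
    "gradient_clipping": [None, 1.0],
    "dropout": [0.0, 0.1, 0.4, 0.8],
    "half_precision": [torch.bfloat16, torch.float16],
}

BATCH_SIZE = 256       # stated in the quoted setup
WARMUP_STEPS = 1_500   # stated in the quoted setup
TOTAL_STEPS = 100_000  # placeholder; not given in the quoted text
PEAK_LR = 1e-4         # placeholder; not given in the quoted text


def cosine_with_warmup(step: int) -> float:
    """Linear warm-up for WARMUP_STEPS, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


def make_optimizer(model: torch.nn.Module, weight_decay: float):
    """AdamW plus the cosine-with-warmup schedule described in the quote."""
    optimizer = AdamW(model.parameters(), lr=PEAK_LR, weight_decay=weight_decay)
    scheduler = LambdaLR(optimizer, lr_lambda=cosine_with_warmup)
    return optimizer, scheduler


def training_step(model, batch, optimizer, scheduler, grad_clip):
    """One optimization step; `model(**batch).loss` assumes an HF-style LM."""
    loss = model(**batch).loss
    loss.backward()
    if grad_clip is not None:  # Table 1: gradient clipping is None or 1
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    return loss.item()
```

One Table 1 variant would, for example, call `make_optimizer(model, weight_decay=0.1)` and pass `grad_clip=1.0` to `training_step`, corresponding to a single cell of the sweep grid.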
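The Research Type row above quotes that models from 410M to 52B parameters are quantized "with minimal degradation in performance." As a purely illustrative sketch, the NumPy snippet below implements generic per-row absmax int8 weight quantization and dequantization; it is not the paper's specific quantization scheme or tooling, and the function names and final error metric are illustrative choices.

```python
# Generic per-row absmax int8 weight quantization (illustrative only; not the
# paper's exact quantization scheme).
import numpy as np


def quantize_int8(weight: np.ndarray):
    """Quantize a 2-D float weight matrix row-wise to int8 with absmax scaling."""
    scale = np.abs(weight).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(weight / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float matrix from int8 values and per-row scales."""
    return q.astype(np.float32) * scale


if __name__ == "__main__":
    w = np.random.randn(4, 8).astype(np.float32)
    q, s = quantize_int8(w)
    w_hat = dequantize_int8(q, s)
    # "Degradation" here is summarized as the mean absolute round-trip error.
    print("mean abs quantization error:", np.abs(w - w_hat).mean())
```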