Rethinking Optimization and Architecture for Tiny Language Models
Authors: Yehui Tang, Kai Han, Fangcheng Liu, Yunsheng Ni, Yuchuan Tian, Zheyuan Bai, Yi-Qi Hu, Sichao Liu, Shangling Jui, Yunhe Wang
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, based on a tiny language model with 1B parameters, we carefully design a series of empirical studies to analyze the effect of each component. Three perspectives are mainly discussed, i.e., neural architecture, parameter initialization, and optimization strategy. Several design formulas are empirically proved especially effective for tiny language models, including tokenizer compression, architecture tweaking, parameter inheritance and multiple-round training. Then we train PanGu-π-1B Pro and PanGu-π-1.5B Pro on 1.6T multilingual corpora, following the established formulas. Experimental results demonstrate that the improved optimization and architecture yield a notable average improvement of 8.87 on benchmark evaluation sets for PanGu-π-1B Pro. |
| Researcher Affiliation | Collaboration | 1Huawei Noah's Ark Lab, 2Peking University, 3Consumer Business Group, Huawei, 4Huawei Kirin Solution. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/YuchuanTian/RethinkTinyLM. |
| Open Datasets | Yes | The pre-training data, which consists of 1.6T tokens, is gathered from diverse sources on the Internet, covering English and Chinese corpora at roughly a 1:1 ratio. The 48k tokenizer used is built by byte-pair encoding (BPE, Shibata et al. (1999)) with SentencePiece (Kudo & Richardson, 2018) on our data (see the tokenizer-training sketch after the table). MobileLLaMA-1.4B and MobileLLaMA-2.7B were trained from scratch on the RedPajama dataset (Computer, 2023). |
| Dataset Splits | Yes | The models constructed with different strategies are compared on ARC Easy (Clark et al., 2018), HellaSwag (Zellers et al., 2019) and C3 (Sun et al., 2020). We use OpenCompass (Contributors, 2023) to evaluate on an extensive suite of downstream tasks, covering examination, knowledge, reasoning, and understanding abilities for a comprehensive comparison. C-Eval (Huang et al., 2023) is a Chinese benchmark to evaluate the knowledge and reasoning abilities. CMMLU (Li et al., 2023a) covers 67 topics including science, engineering, and humanities. MMLU (Hendrycks et al., 2021) proposes an English benchmark for measuring LLMs' multitask accuracy by covering 57 tasks including mathematics, history, computer science, and law. AGIEval (Zhong et al., 2023) is a benchmark specifically designed to evaluate the general abilities in tasks pertinent to human cognition and problem-solving. BoolQ (Clark et al., 2019) is a reading comprehension dataset to evaluate the difficult entailment-like inference ability of LLMs. AX-b (Wang et al., 2020) is a broad-coverage diagnostic task and PIQA (Bisk et al., 2020) is a physical interaction question-answering task. EPRSTMT (Xu et al., 2021) is a binary sentiment analysis dataset based on product reviews. XSum (Narayan et al., 2018) is a summarization task collected from the British Broadcasting Corporation and C3 (Sun et al., 2020) contains 13,369 documents and their associated 19,577 multiple-choice questions. |
| Hardware Specification | Yes | The speed is tested on a single NVIDIA V100 GPU with batch size 20 using FP16. We use the Huawei Ascend 910 card to train and evaluate the proposed PanGu-π Pro. |
| Software Dependencies | No | The paper mentions software components such as the AdamW optimizer and SentencePiece, but does not specify their version numbers (e.g., 'Python 3.8, PyTorch 1.9, and CUDA 11.1'). |
| Experiment Setup | Yes | Our models are trained using the AdamW optimizer (Loshchilov & Hutter, 2017) with β1 = 0.9, β2 = 0.95, utilizing cosine learning rate decay (Loshchilov & Hutter, 2016) with an initial learning rate of 2 × 10⁻⁴. The total batch size for the training process is 2M. A minimal optimizer/scheduler sketch follows the table. |
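
For the tokenizer noted in the Open Datasets row, the paper reports a 48k-vocabulary BPE tokenizer built with SentencePiece. The sketch below is an illustrative reconstruction, not the authors' script: the corpus path, output prefix, and character-coverage setting are assumptions; only the BPE model type and the roughly 48k vocabulary size come from the paper.

```python
# Minimal sketch: training a ~48k BPE tokenizer with SentencePiece.
# The corpus path, output prefix, and character_coverage are assumptions
# for illustration; the BPE model type and 48k vocab come from the paper.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",            # hypothetical plain-text corpus (one sentence per line)
    model_prefix="pangu_pi_48k",   # hypothetical output prefix -> pangu_pi_48k.model / .vocab
    vocab_size=48_000,             # 48k tokenizer reported in the paper
    model_type="bpe",              # byte-pair encoding
    character_coverage=0.9995,     # assumption: common setting for mixed English/Chinese corpora
)

# Load the trained tokenizer and encode a sample sentence.
sp = spm.SentencePieceProcessor(model_file="pangu_pi_48k.model")
print(sp.encode("Tiny language models benefit from a compressed tokenizer.", out_type=str))
```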
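
For the Experiment Setup row, the following is a minimal PyTorch sketch of the reported optimizer configuration (AdamW with β1 = 0.9, β2 = 0.95, cosine learning rate decay, initial learning rate 2 × 10⁻⁴). The placeholder model, step count, and dummy loss are assumptions for illustration only, not the paper's training pipeline.

```python
# Minimal sketch: AdamW with cosine learning-rate decay as reported in the paper.
# The toy model, total step count, and dummy loss are assumptions for illustration.
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(1024, 1024)      # placeholder standing in for the 1B-parameter language model
optimizer = AdamW(
    model.parameters(),
    lr=2e-4,                       # initial learning rate from the paper
    betas=(0.9, 0.95),             # β1, β2 from the paper
)
total_steps = 1_000                # assumption; the paper trains on 1.6T tokens with a 2M batch size
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    optimizer.zero_grad()
    loss = model(torch.randn(8, 1024)).pow(2).mean()   # dummy loss standing in for the LM loss
    loss.backward()
    optimizer.step()
    scheduler.step()               # cosine decay of the learning rate each step
```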