GLM-130B: An Open Bilingual Pre-trained Model
Authors: Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, Jie Tang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks... It also consistently and significantly outperforms ERNIE TITAN 3.0 260B... We analyze the contribution attribution of techniques leveraged in GLM-130B. A series of ablation studies have been presented in the paper... |
| Researcher Affiliation | Collaboration | Tsinghua University Zhipu.AI |
| Pseudocode | No | The paper does not contain a clearly labeled pseudocode block or algorithm figure. Methodological steps are described in narrative text. |
| Open Source Code | Yes | The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/. |
| Open Datasets | Yes | The pre-training data includes 1.2T Pile (train split) (Gao et al., 2020) English, 1.0T Chinese Wudao Corpora (Yuan et al., 2021), and 250G Chinese corpora (including online forums, encyclopedia, and QA) we crawl from the web, which form a balanced composition of English and Chinese contents. |
| Dataset Splits | Yes | For the 5-shot MMLU (Hendrycks et al., 2021) tasks, it is better than GPT-3 175B (+0.9%) and BLOOM-176B (+12.7%). |
| Hardware Specification | Yes | pre-trained over 400 billion tokens on a cluster of 96 NVIDIA DGX-A100 (8×40G) GPU nodes between May 6 and July 3, 2022. |
| Software Dependencies | No | The paper mentions software like PyTorch, Huggingface, Faster Transformer, icetk, sentencepiece, Jinja, and cuBLAS but does not provide specific version numbers for these software dependencies, which is required for reproducibility. |
| Experiment Setup | Yes | We warm-up the batch size from 192 to 4224 over the first 2.5% samples. We use AdamW (Loshchilov & Hutter, 2019) as our optimizer with β1 and β2 set to 0.9 and 0.95, and a weight decay value of 0.1. We warm up the learning rate from 10⁻⁷ to 8×10⁻⁵ over the first 0.5% samples, then decay it by a 10× cosine schedule. We use a dropout rate of 0.1 and clip gradients using a clipping value of 1.0 (Cf. Table 11 for the full configurations). |
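
To make the Experiment Setup row concrete, here is a minimal PyTorch sketch of the quoted hyperparameters: AdamW with β1=0.9, β2=0.95 and weight decay 0.1; learning-rate warmup from 10⁻⁷ to a peak of 8×10⁻⁵ over the first 0.5% of samples followed by a 10× cosine decay; batch-size warmup from 192 to 4224 over the first 2.5% of samples; and gradient clipping at 1.0. The placeholder model, `total_samples`, and the helper functions are illustrative assumptions, not the authors' training code.

```python
# Sketch of the GLM-130B optimization hyperparameters reported in the paper.
# The model and total_samples are hypothetical stand-ins for illustration only.
import math
import torch

model = torch.nn.Linear(512, 512)   # placeholder for the 130B-parameter model
total_samples = 1_000_000           # hypothetical total number of training samples

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=8e-5,                        # peak learning rate reported in the paper
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

def lr_at(sample_idx: int) -> float:
    """Warm up from 1e-7 to 8e-5 over the first 0.5% of samples,
    then cosine-decay the rate by a factor of 10 (down to 8e-6)."""
    peak, floor, start = 8e-5, 8e-6, 1e-7
    warmup_end = int(0.005 * total_samples)
    if sample_idx < warmup_end:
        return start + (peak - start) * sample_idx / warmup_end
    progress = (sample_idx - warmup_end) / max(1, total_samples - warmup_end)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

def batch_size_at(sample_idx: int) -> int:
    """Warm up the global batch size from 192 to 4224 over the first 2.5% of samples."""
    ramp_end = int(0.025 * total_samples)
    if sample_idx >= ramp_end:
        return 4224
    return 192 + int((4224 - 192) * sample_idx / ramp_end)

# Inside the training loop (sketch): apply the scheduled LR, then clip gradients at 1.0.
for group in optimizer.param_groups:
    group["lr"] = lr_at(sample_idx=0)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```

Driving the schedule by writing directly to `optimizer.param_groups` each step, as at the end of the sketch, is one common way to combine a custom warmup with cosine decay without PyTorch's built-in schedulers; the paper's actual distributed training stack is not reproduced here.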