Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
Authors: Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, Tengyu Ma
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared to Adam in the number of steps, total compute, and wall-clock time, achieving the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time. |
| Researcher Affiliation | Academia | Stanford University {hliu99, zhiyuanli, dlwh, pliang, tengyuma}@cs.stanford.edu |
| Pseudocode | Yes | See Algorithm 3 for the pseudo-code. (See the update-rule sketch below the table.) |
| Open Source Code | No | The paper references the third-party codebases it builds on (nanoGPT, levanter) but does not provide a link to open-source code for the Sophia method described in the paper. |
| Open Datasets | Yes | We train autoregressive models on OpenWebText (Gokaslan & Cohen, 2019) and the Pile (Gao et al., 2020) from scratch. |
| Dataset Splits | Yes | We use the train and validation split from nanoGPT. The training set contains 9B tokens, and the validation set contains 4.4M tokens. |
| Hardware Specification | Yes | 125M and 355M models are trained on A5000 GPUs, while the 770M models are trained on A100 GPUs. We use a TPU v3-128 slice to train 1.5B and 6.6B GPT NeoX. |
| Software Dependencies | No | The paper mentions using 'PyTorch' and 'JAX' but does not provide specific version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | We use batch size 480 for GPT-2 and 2048 for GPT NeoX. We use a cosine LR schedule with the final LR equal to 0.05 times the peak LR and a fixed 2k steps of LR warm-up... We use standard gradient clipping (by norm) with threshold 1.0. For Sophia, we use β1 = 0.96, β2 = 0.99, ϵ = 1e-12 and update the diagonal Hessian every 10 steps. (See the configuration sketch below the table.) |
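
The Sophia hyperparameters quoted in the Experiment Setup row (β1 = 0.96, β2 = 0.99, ϵ = 1e-12, diagonal Hessian refreshed every 10 steps) feed into the clipped, Hessian-preconditioned update specified by Algorithm 3 of the paper. Below is a minimal PyTorch-style sketch of that update rule, not the authors' implementation; the function name `sophia_step`, the `state` dictionary, the `hessian_diag_estimate` argument, and the scale parameter `rho` (the paper's γ, whose value is not given in the table) are illustrative assumptions.

```python
# Minimal sketch of a Sophia-style update step, assuming the hyperparameters
# quoted above; this is an illustration, not the authors' released code.
import torch

@torch.no_grad()
def sophia_step(params, grads, state, *, lr, rho, beta1=0.96, beta2=0.99,
                eps=1e-12, weight_decay=0.0, k=10, step=0,
                hessian_diag_estimate=None):
    """Apply one update to a list of parameter tensors.

    `rho` stands in for the paper's gamma; the table above does not state its
    value, so it is left as a required argument.
    """
    for i, (p, g) in enumerate(zip(params, grads)):
        m = state.setdefault(("m", i), torch.zeros_like(p))
        h = state.setdefault(("h", i), torch.zeros_like(p))

        # Exponential moving average of the gradient.
        m.mul_(beta1).add_(g, alpha=1 - beta1)

        # Refresh the EMA of the diagonal Hessian estimate every k steps.
        # The paper obtains the estimate with a Hutchinson or
        # Gauss-Newton-Bartlett estimator; here it is assumed precomputed.
        if hessian_diag_estimate is not None and step % k == 0:
            h.mul_(beta2).add_(hessian_diag_estimate[i], alpha=1 - beta2)

        # Decoupled weight decay.
        if weight_decay > 0.0:
            p.mul_(1.0 - lr * weight_decay)

        # Preconditioned step, clipped element-wise:
        #   theta <- theta - lr * clip(m / max(rho * h, eps), 1)
        update = (m / torch.clamp(rho * h, min=eps)).clamp_(-1.0, 1.0)
        p.add_(update, alpha=-lr)
```

The element-wise clipping is what bounds the worst-case step size when the Hessian estimate is small or stale; the gradient clipping by norm at threshold 1.0 mentioned in the setup row would be applied to `grads` before calling this function.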
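
The same row also pins down the learning-rate schedule: cosine decay to a final LR of 0.05 × the peak LR after a fixed 2k steps of warm-up. The sketch below assumes linear warm-up (the warm-up shape is not stated in the table), and `peak_lr` and `total_steps` are placeholders that the table does not specify.

```python
import math

def lr_at_step(step, *, peak_lr, total_steps, warmup_steps=2000, final_ratio=0.05):
    """Cosine LR schedule decaying to final_ratio * peak_lr after warm-up."""
    final_lr = final_ratio * peak_lr
    if step < warmup_steps:
        # Warm-up shape is an assumption; the table only says "2k steps of LR warm-up".
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return final_lr + (peak_lr - final_lr) * cosine
```

At `step = total_steps` the cosine term vanishes and the function returns exactly `0.05 * peak_lr`, matching the final-LR ratio stated in the setup row.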