Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Authors: Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, Tengyu Ma

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock time, reaching the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time. |
| Researcher Affiliation | Academia | Stanford University. {hliu99, zhiyuanli, dlwh, pliang, tengyuma}@cs.stanford.edu |
| Pseudocode | Yes | See Algorithm 3 for the pseudo-code. A hedged sketch of the update rule is given below this table. |
| Open Source Code | No | The paper references the third-party codebases it builds on (nanoGPT, levanter) but does not provide concrete access to open-source code for the Sophia method itself. |
| Open Datasets | Yes | We train autoregressive models on OpenWebText (Gokaslan & Cohen, 2019) and the Pile (Gao et al., 2020) from scratch. |
| Dataset Splits | Yes | We use the train and validation split from nanoGPT. The training set contains 9B tokens, and the validation set contains 4.4M tokens. |
| Hardware Specification | Yes | 125M and 355M models are trained on A5000 GPUs, while the 770M models are trained on A100 GPUs. We use a TPU v3-128 slice to train the 1.5B and 6.6B GPT NeoX models. |
| Software Dependencies | No | The paper mentions using PyTorch and JAX but does not provide specific version numbers for these or any other libraries. |
| Experiment Setup | Yes | We use batch size 480 for GPT-2 and 2048 for GPT NeoX. We use a cosine LR schedule with the final LR equal to 0.05 times the peak LR and a fixed 2k steps of LR warm-up... We use standard gradient clipping (by norm) with threshold 1.0. For Sophia, we use β1 = 0.96, β2 = 0.99, ϵ = 1e-12, and update the diagonal Hessian every 10 steps. A schedule sketch follows the optimizer sketch below. |
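
The Pseudocode row refers to Algorithm 3 in the paper. The snippet below is a minimal NumPy sketch of that style of update: an EMA of gradients, a periodically refreshed EMA of a stochastic diagonal-Hessian estimate, and an elementwise clip of the preconditioned step. It is demonstrated on a toy quadratic so the Hessian-vector product is exact; the toy problem, the learning rate, and the value of gamma are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of a Sophia-style clipped, diagonally preconditioned update
# on a toy quadratic. Only beta1, beta2, eps, and the 10-step Hessian refresh
# come from the table above; lr, gamma, and the toy loss are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: L(theta) = 0.5 * sum(diag_A * theta**2); gradient is diag_A * theta
# and the Hessian is diag(diag_A), so Hessian-vector products are exact here.
diag_A = np.array([100.0, 1.0, 0.01])
theta = np.array([1.0, 1.0, 1.0])

beta1, beta2, eps = 0.96, 0.99, 1e-12
lr, gamma, k = 0.05, 0.05, 10         # k: refresh the Hessian estimate every k steps

m = np.zeros_like(theta)              # EMA of gradients
h = np.zeros_like(theta)              # EMA of diagonal-Hessian estimates

def loss(x):
    return 0.5 * np.sum(diag_A * x**2)

print("initial loss:", loss(theta))
for t in range(300):
    grad = diag_A * theta                     # exact gradient of the toy loss
    m = beta1 * m + (1 - beta1) * grad

    if t % k == 0:
        # Hutchinson-style estimate of the Hessian diagonal: u * (H u), u ~ N(0, I).
        u = rng.standard_normal(theta.shape)
        h_hat = u * (diag_A * u)              # H u is exact for this quadratic
        h = beta2 * h + (1 - beta2) * h_hat

    # Clipped, preconditioned step: elementwise clip of m / max(gamma * h, eps) to [-1, 1].
    theta = theta - lr * np.clip(m / np.maximum(gamma * h, eps), -1.0, 1.0)

print("final loss:", loss(theta))
```

In the full method, the diagonal estimate would come from a stochastic estimator computed on a minibatch every k steps rather than from an exact Hessian, and the update would be applied per parameter tensor inside a standard training loop.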
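
The Experiment Setup row describes the learning-rate schedule only in words. Below is a small sketch of a cosine schedule with a fixed 2k-step linear warm-up whose final LR is 0.05 times the peak LR; the peak LR and total step count used here are placeholders, not values taken from the table.

```python
# Sketch of the described LR schedule: linear warm-up for 2k steps, then cosine
# decay down to 0.05 * peak LR. peak_lr and total_steps below are placeholders.
import math

def lr_at_step(step, peak_lr, total_steps, warmup_steps=2000, final_ratio=0.05):
    """Return the learning rate for a given optimizer step."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # linear warm-up
    # Cosine decay from peak_lr down to final_ratio * peak_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)

# Example with an assumed peak LR of 6e-4 and 100k total steps.
for s in (0, 1000, 2000, 50000, 100000):
    print(s, f"{lr_at_step(s, peak_lr=6e-4, total_steps=100000):.2e}")
```

The gradient clipping mentioned in the same row (by global norm, threshold 1.0) would typically be applied separately at each step (e.g., torch.nn.utils.clip_grad_norm_ in PyTorch), independently of the schedule.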