Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Authors: Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, Tengyu Ma

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock time, reaching the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time. |
| Researcher Affiliation | Academia | Stanford University. {hliu99, zhiyuanli, dlwh, pliang, tengyuma}@cs.stanford.edu |
| Pseudocode | Yes | See Algorithm 3 for the pseudo-code. A hedged sketch of the update rule is given below this table. |
| Open Source Code | No | The paper references the third-party codebases it builds on (nanoGPT, levanter) but does not provide concrete access to open-source code for the Sophia method itself. |
| Open Datasets | Yes | We train autoregressive models on OpenWebText (Gokaslan & Cohen, 2019) and the Pile (Gao et al., 2020) from scratch. |
| Dataset Splits | Yes | We use the train and validation split from nanoGPT. The training set contains 9B tokens, and the validation set contains 4.4M tokens. |
| Hardware Specification | Yes | 125M and 355M models are trained on A5000 GPUs, while the 770M models are trained on A100 GPUs. We use a TPU v3-128 slice to train the 1.5B and 6.6B GPT NeoX models. |
| Software Dependencies | No | The paper mentions using PyTorch and JAX but does not provide specific version numbers for these or any other libraries. |
| Experiment Setup | Yes | We use batch size 480 for GPT-2 and 2048 for GPT NeoX. We use a cosine LR schedule with the final LR equal to 0.05 times the peak LR and a fixed 2k steps of LR warm-up... We use standard gradient clipping (by norm) with threshold 1.0. For Sophia, we use β1 = 0.96, β2 = 0.99, ϵ = 1e-12, and update the diagonal Hessian every 10 steps. A schedule sketch follows the optimizer sketch below. |
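
The Pseudocode row refers to Algorithm 3 in the paper. The snippet below is a minimal NumPy sketch of that style of update: an EMA of gradients, a periodically refreshed EMA of a stochastic diagonal-Hessian estimate, and an elementwise clip of the preconditioned step. It is demonstrated on a toy quadratic so the Hessian-vector product is exact; the toy problem, the learning rate, and the value of gamma are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of a Sophia-style clipped, diagonally preconditioned update
# on a toy quadratic. Only beta1, beta2, eps, and the 10-step Hessian refresh
# come from the table above; lr, gamma, and the toy loss are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: L(theta) = 0.5 * sum(diag_A * theta**2); gradient is diag_A * theta
# and the Hessian is diag(diag_A), so Hessian-vector products are exact here.
diag_A = np.array([100.0, 1.0, 0.01])
theta = np.array([1.0, 1.0, 1.0])

beta1, beta2, eps = 0.96, 0.99, 1e-12
lr, gamma, k = 0.05, 0.05, 10         # k: refresh the Hessian estimate every k steps

m = np.zeros_like(theta)              # EMA of gradients
h = np.zeros_like(theta)              # EMA of diagonal-Hessian estimates

def loss(x):
    return 0.5 * np.sum(diag_A * x**2)

print("initial loss:", loss(theta))
for t in range(300):
    grad = diag_A * theta                     # exact gradient of the toy loss
    m = beta1 * m + (1 - beta1) * grad

    if t % k == 0:
        # Hutchinson-style estimate of the Hessian diagonal: u * (H u), u ~ N(0, I).
        u = rng.standard_normal(theta.shape)
        h_hat = u * (diag_A * u)              # H u is exact for this quadratic
        h = beta2 * h + (1 - beta2) * h_hat

    # Clipped, preconditioned step: elementwise clip of m / max(gamma * h, eps) to [-1, 1].
    theta = theta - lr * np.clip(m / np.maximum(gamma * h, eps), -1.0, 1.0)

print("final loss:", loss(theta))
```

In the full method, the diagonal estimate would come from a stochastic estimator computed on a minibatch every k steps rather than from an exact Hessian, and the update would be applied per parameter tensor inside a standard training loop.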
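
The Experiment Setup row describes the learning-rate schedule only in words. Below is a small sketch of a cosine schedule with a fixed 2k-step linear warm-up whose final LR is 0.05 times the peak LR; the peak LR and total step count used here are placeholders, not values taken from the table.

```python
# Sketch of the described LR schedule: linear warm-up for 2k steps, then cosine
# decay down to 0.05 * peak LR. peak_lr and total_steps below are placeholders.
import math

def lr_at_step(step, peak_lr, total_steps, warmup_steps=2000, final_ratio=0.05):
    """Return the learning rate for a given optimizer step."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # linear warm-up
    # Cosine decay from peak_lr down to final_ratio * peak_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)

# Example with an assumed peak LR of 6e-4 and 100k total steps.
for s in (0, 1000, 2000, 50000, 100000):
    print(s, f"{lr_at_step(s, peak_lr=6e-4, total_steps=100000):.2e}")
```

The gradient clipping mentioned in the same row (by global norm, threshold 1.0) would typically be applied separately at each step (e.g., torch.nn.utils.clip_grad_norm_ in PyTorch), independently of the schedule.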