Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
Authors: Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, Tengyu Ma
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On language modeling with GPT models of sizes ranging from 125M to 1.5B, Sophia achieves a 2x speed-up compared to Adam in the number of steps, total compute, and wall-clock time, achieving the same perplexity with 50% fewer steps, less total compute, and reduced wall-clock time. |
| Researcher Affiliation | Academia | Stanford University EMAIL |
| Pseudocode | Yes | See Algorithm 3 for the pseudo-code. |
| Open Source Code | No | The paper references third-party codebases used (nanoGPT, levanter) but does not provide concrete access to the open-source code for the Sophia methodology described in this paper. |
| Open Datasets | Yes | We train autoregressive models on OpenWebText (Gokaslan & Cohen, 2019) and the Pile (Gao et al., 2020) from scratch. |
| Dataset Splits | Yes | We use the train and validation split from nanoGPT. The training set contains 9B tokens, and the validation set contains 4.4M tokens. |
| Hardware Specification | Yes | 125M and 355M models are trained on A5000 GPUs, while the 770M models are trained on A100 GPUs. We use a TPU v3-128 slice to train 1.5B and 6.6B GPT NeoX. |
| Software Dependencies | No | The paper mentions using 'PyTorch' and 'JAX' but does not provide specific version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | We use batch size 480 for GPT-2 and 2048 for GPT NeoX. We use cosine LR schedule with the final LR equaling 0.05 times the peak LR with a fixed 2k steps of LR warm-up... We use standard gradient clipping (by norm) threshold 1.0. For Sophia, we use β1 = 0.96, β2 = 0.99, ϵ = 1e-12 and update diagonal Hessian every 10 steps. |
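
The learning-rate schedule and gradient clipping quoted in the Experiment Setup row can be sketched roughly as follows. This is an illustrative reading of that row, not the authors' code: the function names, the linear shape of the warm-up, and the example `peak_lr` and `total_steps` values are assumptions, since the excerpt states only the 2k warm-up steps, the 0.05 final-to-peak ratio, and the clipping threshold of 1.0.

```python
import math


def lr_at_step(step, peak_lr, total_steps, warmup_steps=2000, final_ratio=0.05):
    """Warm-up followed by cosine decay down to final_ratio * peak_lr."""
    if step < warmup_steps:
        # Assumption: linear warm-up. The source gives only the warm-up length.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    min_lr = final_ratio * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]


if __name__ == "__main__":
    # Hypothetical values for illustration only; they are not taken from the paper.
    peak_lr, total_steps = 6e-4, 100_000
    for step in (0, 1999, 2000, 50_000, 100_000):
        print(step, lr_at_step(step, peak_lr, total_steps))
    print(clip_by_global_norm([3.0, 4.0]))
```

In a PyTorch training loop, the clipping step would normally be handled by `torch.nn.utils.clip_grad_norm_` and not by a hand-written helper; the plain-Python version above is only meant to make the arithmetic explicit.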