Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
ASGO: Adaptive Structured Gradient Optimization
Authors: Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, Tong Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also discuss practical modifications of ASGO and empirically verify ASGO's effectiveness on language model tasks. Code is available at https://github.com/infinity-stars/ASGO. ... We further empirically validate the effectiveness and efficiency of this implementation on language model tasks, demonstrating the algorithm's great potential in real applications. ... 7 Empirical Results We empirically evaluated the effectiveness of ASGO (Algorithm 2) and DASGO (Algorithm 4) on pretraining and finetuning tasks for Large Language models (LLMs). |
| Researcher Affiliation | Collaboration | 1Rice University 2University of Illinois Urbana-Champaign 3Meta Platforms, Inc. 4Columbia University |
| Pseudocode | Yes | Algorithm 1 ASGO (Adaptive Structured Gradient Optimization) ... Algorithm 2 A Practical Implementation of ASGO (for a layer $W_\ell \in \mathbb{R}^{m \times n}$ such that $m \le n$) ... Algorithm 3 Sqrt Inverse Newton Schulz ... Algorithm 4 Implementation of DASGO (Diagonal Adaptive Structured Gradient Optimization) |
| Open Source Code | Yes | Code is available at https://github.com/infinity-stars/ASGO. |
| Open Datasets | Yes | We adopted the configuration of the GPT2 model as described by [Karpathy, 2022]... We fine-tuned the GPT2-Large (774M) model [Radford et al., 2019] using the WikiText-2 dataset. ... The FIM objective modifies the training process to enable the model to learn to infill text by rearranging document spans, following the setting in [Bavarian et al., 2022]. ... on the Shakespeare character-level dataset. |
| Dataset Splits | Yes | We trained the model for 2400 steps using a batch size of 64, 4 H100 GPUs, a sequence length of 512, and 8 gradient accumulation steps. This setup corresponds to a total token budget of: 512 × 64 × 4 × 8 × 2400 ≈ 2.5 Billion tokens. ... The final training and validation losses after 2400 steps are summarized in Table 1, and the full training loss dynamics are depicted in Figure 1. ... We fine-tuned GPT2-Large using the standard Causal Language Modeling (CLM) loss... we conducted five experimental runs with different random seeds under the same hyperparameter settings. Table 4 presents the average perplexity results after fine-tuning for 2 epochs. ... For the finetuning experiments on WikiText-2... The optimal learning rate for each optimizer and fine-tuning objective combination was determined by selecting the learning rate that yielded the lowest validation perplexity on the WikiText-2 validation set after the full 2 epochs of fine-tuning. |
| Hardware Specification | Yes | All experiments were conducted using NVIDIA V100s SXM2 and NVIDIA GH200 GPUs. Specifically, for the larger-scale pretraining of GPT2, we utilized a configuration of four GH200 GPUs, while other experiments were performed on a single V100 GPU. |
| Software Dependencies | No | The paper does not explicitly provide specific version numbers for software dependencies such as PyTorch, Python, or CUDA. It mentions the use of the Hugging Face Transformers library and refers to a PyTorch implementation in citations, but without version details. |
| Experiment Setup | Yes | We trained the model for 2400 steps using a batch size of 64, 4 H100 GPUs, a sequence length of 512, and 8 gradient accumulation steps. ... We used both the Polar Express (PE) and Newton Schulz (NS) algorithms to compute the inverse square root of Vt in ASGO. ... we carefully tuned the learning rates and β2 values (where applicable) for all optimizers. ... a learning rate schedule consisting of 240 linear warm-up steps followed by a cosine decay... Table 6: Optimal hyperparameter selection for pretraining GPT-2. Optimizer Learning Rate β1 β2 Weight Decay Damping ϵ Update Freq. (τ) |
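
The pretraining figures quoted in the table can be checked with a few lines of Python. The sketch below is illustrative only and is not taken from the ASGO repository: the function names, the `peak_lr` argument, and the choice to decay to a floor of `min_lr = 0.0` at step 2400 are assumptions, since the excerpts state only "240 linear warm-up steps followed by a cosine decay" and give no learning rate values.

```python
import math


def token_budget(seq_len, batch_size, num_gpus, grad_accum, steps):
    """Total tokens processed: the product quoted in the Dataset Splits row."""
    return seq_len * batch_size * num_gpus * grad_accum * steps


def lr_at_step(step, peak_lr, warmup_steps=240, total_steps=2400, min_lr=0.0):
    """Linear warm-up followed by cosine decay.

    Assumptions (not stated in the excerpts): warm-up reaches peak_lr on the
    last warm-up step, and the cosine phase decays to min_lr at total_steps.
    """
    if step < warmup_steps:
        # Linear ramp: 1/warmup_steps of the peak at step 0, full peak at the end.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    tokens = token_budget(seq_len=512, batch_size=64, num_gpus=4,
                          grad_accum=8, steps=2400)
    print(f"{tokens:,} tokens")  # 2,516,582,400, i.e. about 2.5 billion
    # peak_lr=1.0 is a placeholder so the output reads as a fraction of the peak.
    for s in (0, 239, 240, 1320, 2400):
        print(s, round(lr_at_step(s, peak_lr=1.0), 4))
```

With a placeholder `peak_lr` of 1.0, the printed values read as fractions of the peak rate, so the schedule shape can be inspected independently of any optimizer-specific setting from Table 6.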