Why are Adaptive Methods Good for Attention Models?

Authors: Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, Suvrit Sra

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is one cause of SGD's poor performance. We provide a positive answer to the above question by performing both theoretical and empirical studies of the convergence of optimization methods under heavy-tailed noise.
Researcher Affiliation | Collaboration | Jingzhao Zhang (MIT, jzhzhang@mit.edu); Sai Praneeth Karimireddy (EPFL, sai.karimireddy@epfl.ch); Andreas Veit (Google Research, aveit@google.com); Seungyeon Kim (Google Research, seungyeonk@google.com); Sashank Reddi (Google Research, sashank@google.com); Sanjiv Kumar (Google Research, sanjivk@google.com); Suvrit Sra (MIT, suvrit@mit.edu)
Pseudocode | Yes | Algorithm 1 ACClip (an illustrative sketch of such an update is given after this table)
Open Source Code | No | The paper adapts a third-party GitHub repository ('https://github.com/kimiyoung/transformer-xl/tree/master/pytorch') for some experiments, but it does not provide an explicit statement or link for the source code of their proposed ACClip algorithm.
Open Datasets | Yes | We first investigate the distribution of the gradient noise norm ∥g − ∇f(x)∥ in the aforementioned neural network models, where g is the stochastic gradient computed from a minibatch sample. In particular, we focus on noise distributions while training two popular deep learning models, BERT and ResNet. Note that BERT and ResNet are typically trained with Adam and SGD (with momentum) respectively, and can thus provide insights about the difference between these optimizers. We train a 6-layer Transformer-XL model [8] on the PTB dataset as a proof of concept. We now evaluate the empirical performance of our proposed ACClip algorithm on BERT pre-training as well as fine-tuning using the SQuAD v1.1 dataset. (A sketch of how this noise norm can be measured follows the table.)
Dataset Splits | Yes | The learning rates and hyperparameters for each method have been extensively tuned to provide the best performance on the validation set. We again follow the procedure outlined in [9] and present the results on the Dev set in Table 3.
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications, and it does not mention any specific hardware setup used for running the experiments.
Software Dependencies | No | The paper mentions software like 'pytorch' in a GitHub link and refers to the 'Adam optimizer' and 'BERT paper', implying the use of their respective frameworks (likely TensorFlow), but it does not specify version numbers for any software dependencies.
Experiment Setup | Yes | For ACClip, we set τ = 1, learning rate = 1e-4, β1 = 0.9, β2 = 0.99, ϵ = 1e-5 and weight decay = 1e-5. (These values are reused in the optimizer sketch below.)
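
To make the gradient noise norm ∥g − ∇f(x)∥ quoted in the Open Datasets row concrete, here is a minimal PyTorch sketch of one way to estimate it per minibatch. This is not code from the paper: the helper names (`flat_grad`, `gradient_noise_norms`) are hypothetical, and the full-batch gradient ∇f(x) is approximated by the average of the sampled minibatch gradients.

```python
import torch

def flat_grad(model, loss):
    """Gradient of `loss` w.r.t. all trainable parameters, flattened into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def gradient_noise_norms(model, loss_fn, batches):
    """Return ||g_i - g_bar|| for each minibatch gradient g_i, where g_bar
    (the mean of the sampled minibatch gradients) stands in for grad f(x)."""
    per_batch = []
    for inputs, targets in batches:
        loss = loss_fn(model(inputs), targets)
        per_batch.append(flat_grad(model, loss))
    g_bar = torch.stack(per_batch).mean(dim=0)  # proxy for the full-batch gradient
    return [(g - g_bar).norm().item() for g in per_batch]
```

A histogram of the returned norms is the kind of plot one would inspect for heavy tails when comparing models such as BERT and ResNet.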
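
For the Pseudocode row, the following is an illustrative sketch of an adaptive coordinate-wise clipping update in the spirit of ACClip, not a reproduction of the paper's exact Algorithm 1: it assumes an Adam-style first-moment estimate m, an exponential moving average b of |g| that sets the clipping scale, and a coordinate-wise clipping factor min(τ·b / (|m| + ϵ), 1); bias correction is omitted. The class name `ACClipSketch` and the precise update rule are assumptions for illustration only.

```python
import torch

class ACClipSketch:
    """Illustrative adaptive-clipping update; NOT the paper's exact Algorithm 1."""
    def __init__(self, params, lr=1e-4, betas=(0.9, 0.99), tau=1.0,
                 eps=1e-5, weight_decay=1e-5):
        self.params = [p for p in params if p.requires_grad]
        self.lr, (self.b1, self.b2) = lr, betas
        self.tau, self.eps, self.wd = tau, eps, weight_decay
        self.m = [torch.zeros_like(p) for p in self.params]  # EMA of gradients (momentum)
        self.b = [torch.zeros_like(p) for p in self.params]  # EMA of |g|: adaptive clipping scale

    @torch.no_grad()
    def step(self):
        for p, m, b in zip(self.params, self.m, self.b):
            if p.grad is None:
                continue
            g = p.grad
            m.mul_(self.b1).add_(g, alpha=1.0 - self.b1)        # first moment
            b.mul_(self.b2).add_(g.abs(), alpha=1.0 - self.b2)  # clipping scale
            # Coordinate-wise clipping: shrink momentum entries that exceed tau * b.
            clip = torch.clamp(self.tau * b / (m.abs() + self.eps), max=1.0)
            p.add_(clip * m + self.wd * p, alpha=-self.lr)
```

With the hyperparameters from the Experiment Setup row, this sketch would be instantiated as `ACClipSketch(model.parameters(), lr=1e-4, betas=(0.9, 0.99), tau=1.0, eps=1e-5, weight_decay=1e-5)` for some `model`.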