Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails

Authors: Ashok Cutkosky, Harsh Mehta

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To answer this question, we conducted an experimental study on the BERT pretraining task. We chose to experiment on BERT since it has been shown to have heavy tails empirically [43] and the common practice is already to clip the gradients. Figure 3 plots the masked language model accuracy of the BERT-Large model when trained with Algorithm 1.
Researcher Affiliation | Collaboration | Ashok Cutkosky (Boston University), Harsh Mehta (Google Research)
Pseudocode | Yes | Algorithm 1: Normalized SGD with Clipping and Momentum
Open Source Code | No | The paper does not explicitly state that its source code is being released, nor does it provide any links to a code repository.
Open Datasets | Yes | To answer this question, we conducted an experimental study on the BERT pretraining task.
Dataset Splits | No | The paper mentions "eval accuracy" and discusses training schedules, but does not provide specific percentages or sample counts for training, validation, or test splits, nor does it describe how the data was partitioned into these sets.
Hardware Specification | Yes | All our experiments were conducted using the Tensorflow framework on TPUv3 architecture.
Software Dependencies | No | The paper mentions the "Tensorflow framework" but does not provide specific version numbers for TensorFlow or any other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | When using Algorithm 1, we employ a base learning rate η0 of 0.3 and a batch size of 512. We arrived at this learning rate by performing a grid search over η0 ∈ {1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001} and choosing the value that attains the best eval accuracy. To obtain the optimal base learning rate of 0.3, we again ran a grid search over η0 ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. Our warmup baseline employs the standard practice of linear warm-up for 3125 steps and polynomial decay η_t = η0(1 − t/T) for the rest of the steps.
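The pseudocode row above names Algorithm 1, "Normalized SGD with Clipping and Momentum". As a rough illustration of that family of update rules (a sketch, not the paper's exact algorithm), the loop below clips each stochastic gradient, folds it into a momentum buffer, and takes a fixed-size step in the normalized momentum direction. The momentum coefficient `beta`, clipping threshold `tau`, and the helper `clip_by_norm` are all assumptions for illustration.

```python
import numpy as np

def clip_by_norm(g, tau):
    """Scale g so its Euclidean norm is at most tau (hypothetical helper)."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)

def normalized_sgd_clip_momentum(grad_fn, x0, lr=0.3, beta=0.9, tau=1.0, steps=100):
    """Sketch of normalized SGD with clipped gradients and momentum:
        m_t = beta * m_{t-1} + (1 - beta) * clip(g_t, tau)
        x_{t+1} = x_t - lr * m_t / ||m_t||
    grad_fn(x) should return a (stochastic) gradient at x.
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)
        m = beta * m + (1 - beta) * clip_by_norm(g, tau)
        # Normalize the momentum so every step has length lr,
        # regardless of the (possibly heavy-tailed) gradient magnitude.
        x = x - lr * m / (np.linalg.norm(m) + 1e-12)
    return x
```

Because the step length is fixed at `lr`, the iterates approach a minimizer and then oscillate within a radius on the order of the learning rate, which is why the paper pairs such updates with a decaying schedule.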
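The experiment-setup quote describes linear warm-up for 3125 steps followed by polynomial (degree-1) decay η_t = η0(1 − t/T). A minimal sketch of such a schedule, assuming a hypothetical total step count `total_steps` for the horizon T (the paper's actual training length is not stated in the quote):

```python
def lr_schedule(step, base_lr=0.3, warmup_steps=3125, total_steps=31250):
    """Linear warm-up to base_lr, then linear decay eta_t = eta0 * (1 - t/T).
    total_steps (T) is an assumed placeholder value."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr over the warm-up phase.
        return base_lr * step / warmup_steps
    # Degree-1 polynomial decay toward 0 at step T.
    return base_lr * (1 - step / total_steps)
```

Note that with this literal reading the rate steps down slightly at the warm-up boundary (from η0 to η0(1 − warmup_steps/T)); implementations often instead decay over the remaining T − warmup_steps steps.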