High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails

Authors: Ashok Cutkosky, Harsh Mehta

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To answer this question, we conducted an experimental study on the BERT pretraining task. We chose to experiment on BERT since it has been shown to have heavy tails empirically [43] and the common practice is already to clip the gradients. Figure 3 plots the masked language model accuracy of the BERT-Large model when trained with Algorithm 1.
Researcher Affiliation | Collaboration | Ashok Cutkosky (Boston University, ashok@cutkosky.com); Harsh Mehta (Google Research, harshm@google.com)
Pseudocode | Yes | Algorithm 1: Normalized SGD with Clipping and Momentum. A code sketch of this update appears after the table.
Open Source Code | No | The paper does not explicitly state that its source code is being released, nor does it provide any links to a code repository.
Open Datasets | Yes | To answer this question, we conducted an experimental study on the BERT pretraining task.
Dataset Splits | No | The paper mentions "eval accuracy" and discusses training schedules but does not provide specific percentages or sample counts for training, validation, or test dataset splits, nor does it describe how the data was partitioned into these sets.
Hardware Specification | Yes | All our experiments were conducted using the Tensorflow framework on TPUv3 architecture.
Software Dependencies | No | The paper mentions the "Tensorflow framework" but does not provide specific version numbers for TensorFlow or any other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | When using Algorithm 1, we employ a base learning rate η0 of 0.3 and a batch size of 512. We came up with this learning rate by performing a grid search over η0 ∈ [1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001] and choosing the one which attains the best eval accuracy. To obtain the optimal base learning rate of 0.3, we again ran a grid search over η0 ∈ [0.1, 0.2, 0.3, 0.4, 0.5]. Our warmup baseline employs the standard practice of linear warm-up for 3125 steps and polynomial decay of ηt = η0 (1 − t/T) for the rest of the steps. A sketch of this schedule also appears after the table.
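
Below is a minimal NumPy sketch of the kind of update the "Normalized SGD with Clipping and Momentum" row refers to: clip each stochastic gradient, fold it into a momentum average, and take a step of (nearly) fixed length in the momentum direction. The function names, the clipping threshold tau, the momentum parameter beta, and the toy usage are illustrative assumptions rather than the paper's Algorithm 1 verbatim; only the base learning rate 0.3 comes from the quoted experiment setup, and the exact clipping rule and order of operations in the paper may differ.

```python
import numpy as np

def clip_by_norm(g, tau):
    """Rescale g so that its L2 norm is at most tau."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)

def normalized_sgd_clip_momentum(x0, stochastic_grad, steps,
                                 eta=0.3, beta=0.9, tau=1.0, eps=1e-12):
    """Normalized SGD with clipped gradients and momentum (illustrative sketch).

    stochastic_grad(x) returns a possibly heavy-tailed gradient estimate at x.
    beta, tau, and eps are placeholder hyperparameters, not values from the paper;
    eta defaults to the base learning rate 0.3 quoted in the experiment setup.
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = clip_by_norm(stochastic_grad(x), tau)    # clip the raw stochastic gradient
        m = beta * m + (1.0 - beta) * g              # momentum as an exponential moving average
        x = x - eta * m / (np.linalg.norm(m) + eps)  # normalized step of length (roughly) eta
    return x

# Toy usage: quadratic objective with heavy-tailed (Cauchy) gradient noise.
rng = np.random.default_rng(0)
noisy_grad = lambda x: 2.0 * x + rng.standard_cauchy(size=x.shape)
x_final = normalized_sgd_clip_momentum(np.ones(10), noisy_grad, steps=2000, eta=0.05)
```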
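
The warm-up baseline in the experiment-setup row combines linear warm-up for 3125 steps with the decay ηt = η0 (1 − t/T). A small sketch of that schedule, assuming the step index counts from the start of training and the decay is applied only after warm-up (the quoted text does not pin down the exact indexing), could look like this:

```python
def baseline_lr(step, total_steps, base_lr=0.3, warmup_steps=3125):
    """Warm-up baseline schedule: linear warm-up to base_lr over warmup_steps,
    then eta_t = base_lr * (1 - t / total_steps) for the remaining steps.
    The indexing conventions (step counted from 0, decay driven by the global
    step) are assumptions, not taken from the paper.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # linear ramp from ~0 up to base_lr
    return base_lr * (1.0 - step / total_steps)      # degree-1 polynomial decay to 0

# e.g. the learning rate halfway through a hypothetical 100k-step run:
print(baseline_lr(step=50_000, total_steps=100_000))  # 0.15
```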