High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails

Authors: Ashok Cutkosky, Harsh Mehta

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To answer this question, we conducted an experimental study on the BERT pretraining task. We chose to experiment on BERT since it has been shown to have heavy tails empirically [43] and the common practice is already to clip the gradients. Figure 3 plots the masked language model accuracy of the BERT-Large model when trained with Algorithm 1.
Researcher Affiliation | Collaboration | Ashok Cutkosky (Boston University, ashok@cutkosky.com); Harsh Mehta (Google Research, harshm@google.com)
Pseudocode | Yes | Algorithm 1: Normalized SGD with Clipping and Momentum. A code sketch of this update appears after the table.
Open Source Code | No | The paper does not explicitly state that its source code is being released, nor does it provide any links to a code repository.
Open Datasets | Yes | To answer this question, we conducted an experimental study on the BERT pretraining task.
Dataset Splits | No | The paper mentions "eval accuracy" and discusses training schedules but does not provide specific percentages or sample counts for training, validation, or test dataset splits, nor does it describe how the data was partitioned into these sets.
Hardware Specification | Yes | All our experiments were conducted using the Tensorflow framework on TPUv3 architecture.
Software Dependencies | No | The paper mentions the "Tensorflow framework" but does not provide specific version numbers for TensorFlow or any other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | When using Algorithm 1, we employ a base learning rate η0 of 0.3 and a batch size of 512. We came up with this learning rate by performing a grid search over η0 ∈ [1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001] and choosing the one which attains the best eval accuracy. To obtain the optimal base learning rate of 0.3, we again ran a grid search over η0 ∈ [0.1, 0.2, 0.3, 0.4, 0.5]. Our warmup baseline employs the standard practice of linear warm-up for 3125 steps and polynomial decay of ηt = η0 (1 − t/T) for the rest of the steps. A sketch of this schedule also appears after the table.
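
Below is a minimal NumPy sketch of the kind of update the "Normalized SGD with Clipping and Momentum" row refers to: clip each stochastic gradient, fold it into a momentum average, and take a step of (nearly) fixed length in the momentum direction. The function names, the clipping threshold tau, the momentum parameter beta, and the toy usage are illustrative assumptions rather than the paper's Algorithm 1 verbatim; only the base learning rate 0.3 comes from the quoted experiment setup, and the exact clipping rule and order of operations in the paper may differ.

```python
import numpy as np

def clip_by_norm(g, tau):
    """Rescale g so that its L2 norm is at most tau."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)

def normalized_sgd_clip_momentum(x0, stochastic_grad, steps,
                                 eta=0.3, beta=0.9, tau=1.0, eps=1e-12):
    """Normalized SGD with clipped gradients and momentum (illustrative sketch).

    stochastic_grad(x) returns a possibly heavy-tailed gradient estimate at x.
    beta, tau, and eps are placeholder hyperparameters, not values from the paper;
    eta defaults to the base learning rate 0.3 quoted in the experiment setup.
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = clip_by_norm(stochastic_grad(x), tau)    # clip the raw stochastic gradient
        m = beta * m + (1.0 - beta) * g              # momentum as an exponential moving average
        x = x - eta * m / (np.linalg.norm(m) + eps)  # normalized step of length (roughly) eta
    return x

# Toy usage: quadratic objective with heavy-tailed (Cauchy) gradient noise.
rng = np.random.default_rng(0)
noisy_grad = lambda x: 2.0 * x + rng.standard_cauchy(size=x.shape)
x_final = normalized_sgd_clip_momentum(np.ones(10), noisy_grad, steps=2000, eta=0.05)
```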
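
The warm-up baseline in the experiment-setup row combines linear warm-up for 3125 steps with the decay ηt = η0 (1 − t/T). A small sketch of that schedule, assuming the step index counts from the start of training and the decay is applied only after warm-up (the quoted text does not pin down the exact indexing), could look like this:

```python
def baseline_lr(step, total_steps, base_lr=0.3, warmup_steps=3125):
    """Warm-up baseline schedule: linear warm-up to base_lr over warmup_steps,
    then eta_t = base_lr * (1 - t / total_steps) for the remaining steps.
    The indexing conventions (step counted from 0, decay driven by the global
    step) are assumptions, not taken from the paper.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # linear ramp from ~0 up to base_lr
    return base_lr * (1.0 - step / total_steps)      # degree-1 polynomial decay to 0

# e.g. the learning rate halfway through a hypothetical 100k-step run:
print(baseline_lr(step=50_000, total_steps=100_000))  # 0.15
```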