High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails
Authors: Ashok Cutkosky, Harsh Mehta
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To answer this question, we conducted an experimental study on the BERT pretraining task. We chose to experiment on BERT since it has been shown to have heavy tails empirically [43] and the common practice is already to clip the gradients. Figure 3 plots the masked language model accuracy of the BERT-Large model when trained with Algorithm 1. |
| Researcher Affiliation | Collaboration | Ashok Cutkosky (Boston University, ashok@cutkosky.com); Harsh Mehta (Google Research, harshm@google.com) |
| Pseudocode | Yes | Algorithm 1: Normalized SGD with Clipping and Momentum (an illustrative sketch of an update of this form appears below the table) |
| Open Source Code | No | The paper does not explicitly state that its source code is being released, nor does it provide any links to a code repository. |
| Open Datasets | Yes | To answer this question, we conducted an experimental study on the BERT pretraining task. |
| Dataset Splits | No | The paper mentions "eval accuracy" and discusses training schedules but does not provide specific percentages or sample counts for training, validation, or test dataset splits. It does not describe how the data was partitioned into these sets. |
| Hardware Specification | Yes | All our experiments were conducted using the Tensorflow framework on TPUv3 architecture. |
| Software Dependencies | No | The paper mentions the "Tensorflow framework" but does not provide specific version numbers for TensorFlow or any other software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | When using Algorithm 1, we employ a base learning rate η0 of 0.3 and a batch size of 512. We came up with this learning rate by performing a grid search over η0 ∈ {1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001} and choosing the one which attains the best eval accuracy. To obtain the optimal base learning rate of 0.3, we again ran a grid search over η0 ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. Our warmup baseline employs the standard practice of linear warm-up for 3125 steps and polynomial decay ηt = η0(1 − t/T) for the rest of the steps. (This schedule is sketched below the table.) |
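
The "Pseudocode" row refers to Algorithm 1, "Normalized SGD with Clipping and Momentum." The paper's exact pseudocode is not reproduced in the table; as a hedged illustration of what an update of that name typically looks like (clip the stochastic gradient, fold it into a momentum average, then step along the normalized momentum), here is a minimal NumPy sketch. The function names, the `grad_fn` interface, and the default values of `beta`, `tau`, and `lr` are placeholders, not the paper's choices.

```python
import numpy as np

def clip_by_norm(g, tau):
    """Scale g so that its L2 norm is at most tau."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)

def normalized_sgd_clip_momentum(x0, grad_fn, steps, lr, beta=0.9, tau=1.0, eps=1e-12):
    """Illustrative 'normalized SGD with clipping and momentum' loop.

    grad_fn(x, t) should return a stochastic gradient at x; beta, tau, and lr
    are placeholder hyperparameters, not the values used in the paper.
    """
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for t in range(steps):
        g = clip_by_norm(grad_fn(x, t), tau)        # clip the stochastic gradient
        m = beta * m + (1.0 - beta) * g             # momentum average of clipped gradients
        x = x - lr * m / (np.linalg.norm(m) + eps)  # step along the normalized momentum
    return x
```

Normalizing the step removes the dependence on the raw gradient magnitude, while clipping limits the influence of heavy-tailed gradient noise, which is the setting the paper targets.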
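
The "Experiment Setup" row describes the learning-rate schedule only in prose: base learning rate 0.3, linear warm-up for 3125 steps, then polynomial decay ηt = η0(1 − t/T). A minimal sketch of such a schedule follows, assuming T is the total number of training steps and reading the quoted decay as the degree-one polynomial; the function name and defaults are illustrative, not taken from the paper's code.

```python
def learning_rate(step, total_steps, base_lr=0.3, warmup_steps=3125):
    """Linear warm-up for `warmup_steps` steps, then the quoted decay eta_t = eta_0 * (1 - t/T)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps             # linear warm-up from 0 to base_lr
    return base_lr * max(0.0, 1.0 - step / total_steps)  # degree-one polynomial decay to 0 at t = T
```

With `base_lr=0.3` and, say, `total_steps=100_000` (an assumed value), the rate ramps linearly to roughly 0.3 by step 3125 and then decays linearly to 0 at the final step.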