Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails
Authors: Ashok Cutkosky, Harsh Mehta
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To answer this question, we conducted an experimental study on the BERT pretraining task. We chose to experiment on BERT since it has been shown to have heavy tails empirically [43] and the common practice is already to clip the gradients. Figure 3 plots the masked language model accuracy of BERT-Large model when trained with Algorithm 1. |
| Researcher Affiliation | Collaboration | Ashok Cutkosky, Boston University; Harsh Mehta, Google Research |
| Pseudocode | Yes | Algorithm 1 Normalized SGD with Clipping and Momentum |
| Open Source Code | No | The paper does not explicitly state that its source code is being released, nor does it provide any links to a code repository. |
| Open Datasets | Yes | To answer this question, we conducted an experimental study on the BERT pretraining task. |
| Dataset Splits | No | The paper mentions "eval accuracy" and discusses training schedules but does not provide specific percentages or sample counts for training, validation, or test dataset splits. It does not describe how the data was partitioned into these sets. |
| Hardware Specification | Yes | All our experiments were conducted using the Tensorflow framework on TPUv3 architecture. |
| Software Dependencies | No | The paper mentions the "Tensorflow framework" but does not provide specific version numbers for TensorFlow or any other software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | When using Algorithm 1, we employ a base learning rate η0 of 0.3 and a batch size of 512. We came up with this learning rate by performing a grid search over η0 ∈ {1.0, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001} and choosing the one which attains the best eval accuracy. To obtain the optimal base learning rate of 0.3, we again ran a grid search over η0 ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. Our warmup baseline employs the standard practice of linear warm-up for 3125 steps and polynomial decay of ηt = η0(1 − t/T) for the rest of the steps. |
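The table quotes "Algorithm 1: Normalized SGD with Clipping and Momentum" as the paper's pseudocode. A minimal sketch of that idea is below: clip each stochastic gradient to a maximum norm, accumulate momentum over the clipped gradients, and take a step of fixed length in the momentum direction. The function names, the clipping threshold `tau`, and all default hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def clip(g, tau):
    """Scale g so its Euclidean norm is at most tau."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)

def normalized_sgd_clip_momentum(grad_fn, x0, lr=0.3, beta=0.9, tau=1.0, steps=100):
    """Sketch of normalized SGD with gradient clipping and momentum.

    grad_fn: returns a (stochastic) gradient at the current iterate.
    Each update moves a fixed distance lr in the momentum direction.
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = clip(grad_fn(x), tau)                      # clip the raw gradient
        m = beta * m + (1 - beta) * g                  # momentum over clipped gradients
        x = x - lr * m / (np.linalg.norm(m) + 1e-12)   # normalized (unit-length) step
    return x
```

On a simple quadratic this drives the iterate toward the minimizer, with the fixed step length controlling the final oscillation radius; in the paper the same structure is applied to BERT pretraining with η0 = 0.3.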
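The experiment-setup row describes the baseline schedule: linear warm-up for 3125 steps followed by polynomial decay ηt = η0(1 − t/T). A small sketch of such a schedule is shown below; the total step count `total_steps` and the degree-1 decay are assumptions for illustration, and the warm-up/decay pieces are not guaranteed to match the paper's exact implementation.

```python
def lr_schedule(step, base_lr=0.3, warmup_steps=3125, total_steps=31250):
    """Linear warm-up to base_lr, then linear (degree-1 polynomial) decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps      # ramp up linearly from 0
    return base_lr * (1 - step / total_steps)     # decay: eta_t = eta_0 * (1 - t/T)
```

For example, the rate is 0 at step 0, ramps toward 0.3 over the warm-up window, and reaches 0 again at `total_steps`.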