Robust Training of Neural Networks Using Scale Invariant Architectures
Authors: Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank Reddi, Sanjiv Kumar
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Sections 4.1 and 4.2, we prove the convergence rate to an approximate first-order stationary point for GD and SGD for scale-invariant loss. ... In our empirical analysis in Section 5, we demonstrate that SIBERT trained using simple SGD can achieve performance comparable to standard BERT trained by ADAM. Furthermore, we also verify our theoretical claims. |
| Researcher Affiliation | Collaboration | 1 Princeton University (work done while interning at Google Research New York), 2 Google Research New York, 3 Google DeepMind New York. |
| Pseudocode | Yes | Algorithm 1: C-Clipped SGD + WD (a hedged sketch of this update appears below the table). |
| Open Source Code | No | No explicit statement or link indicating the release of source code for the methodology described in this paper. |
| Open Datasets | Yes | Next, we compare the downstream performance on three benchmark datasets (SQuAD v1.1 (Rajpurkar et al., 2016), SQuAD v2.0 (Rajpurkar et al., 2018) and MNLI (Williams et al., 2018)). |
| Dataset Splits | No | The paper mentions pretraining and finetuning on benchmark datasets, but does not explicitly state the train/validation/test dataset splits (percentages or sample counts) used. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running experiments are provided. |
| Software Dependencies | No | The paper mentions optimizers like SGD, Weight Decay, and LAMB, but does not specify software versions for these or any other dependencies. |
| Experiment Setup | Yes | For SIBERT, the scale invariant portion is trained using SGD+WD with a piecewise constant LR schedule and WD of 1e-2. We use LAMB optimizer for the non-scale invariant parts. The initial LR for SGD is 8e-4 without warmup and is divided by 10 at step 600k and 900k. Default training is for 1M steps. For LAMB we use a linear decay schedule with initial learning rate 8e-4 and a linear warmup of 10k steps. (A sketch of these schedules appears below the table.) |
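
The Pseudocode row names Algorithm 1 (C-Clipped SGD + WD) without reproducing it. Below is a minimal sketch of a clipped-SGD-with-weight-decay update of that general form, assuming global-norm clipping at a threshold C and weight decay folded into the same step; the function name, the epsilon guard, and the exact placement of the decay term are assumptions, not the paper's verbatim algorithm.

```python
import numpy as np

def c_clipped_sgd_wd_step(w, grad, lr, wd, clip_c):
    """One step of C-clipped SGD with weight decay (sketch, not the paper's exact Algorithm 1).

    Assumes: global-norm clipping of the stochastic gradient at threshold
    `clip_c`, and weight decay `wd` applied in the same update.
    """
    g_norm = np.linalg.norm(grad)
    # Scale the gradient so its norm is at most clip_c (no-op if already smaller).
    clipped = grad * min(1.0, clip_c / (g_norm + 1e-12))
    # Gradient step plus weight-decay shrinkage: w <- (1 - lr*wd) w - lr * clipped_grad.
    return w - lr * (clipped + wd * w)

# Toy usage: a few steps on a random weight vector with a random gradient.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
for _ in range(3):
    g = rng.normal(size=8)
    w = c_clipped_sgd_wd_step(w, g, lr=8e-4, wd=1e-2, clip_c=1.0)
```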
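
The learning-rate schedules quoted in the Experiment Setup row can be written down directly. The sketch below encodes the piecewise-constant SGD schedule (8e-4, no warmup, divided by 10 at steps 600k and 900k) and the LAMB schedule (linear warmup over 10k steps to 8e-4, then linear decay over the 1M-step run). The function names, and the assumption that the LAMB schedule decays to zero at step 1M, are ours rather than stated in the paper.

```python
def sgd_lr(step, base_lr=8e-4, drops=(600_000, 900_000), factor=10.0):
    """Piecewise-constant schedule: divide base_lr by `factor` at each drop step."""
    lr = base_lr
    for boundary in drops:
        if step >= boundary:
            lr /= factor
    return lr

def lamb_lr(step, peak_lr=8e-4, warmup=10_000, total=1_000_000):
    """Linear warmup to peak_lr over `warmup` steps, then linear decay (assumed to 0 at `total`)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total - step) / (total - warmup))

# Example: LR values at a few milestones of the 1M-step run.
for s in (0, 10_000, 599_999, 600_000, 900_000, 1_000_000):
    print(s, sgd_lr(s), lamb_lr(s))
```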