Robust Training of Neural Networks Using Scale Invariant Architectures
Authors: Zhiyuan Li, Srinadh Bhojanapalli, Manzil Zaheer, Sashank Reddi, Sanjiv Kumar
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Sections 4.1 and 4.2, we prove the convergence rate to an approximate first-order stationary point for GD and SGD for scale-invariant loss. ... In our empirical analysis in Section 5, we demonstrate that SIBERT trained using simple SGD can achieve performance comparable to standard BERT trained by ADAM. Furthermore, we also verify our theoretical claims. |
| Researcher Affiliation | Collaboration | 1 Princeton University (work done while interning at Google Research New York), 2 Google Research New York, 3 Google DeepMind New York. |
| Pseudocode | Yes | Algorithm 1: C-Clipped SGD + WD (a hedged sketch of this update appears below the table). |
| Open Source Code | No | No explicit statement or link indicating the release of source code for the methodology described in this paper. |
| Open Datasets | Yes | Next, we compare the downstream performance on three benchmark datasets (SQuAD v1.1 (Rajpurkar et al., 2016), SQuAD v2.0 (Rajpurkar et al., 2018) and MNLI (Williams et al., 2018)). |
| Dataset Splits | No | The paper mentions pretraining and finetuning on benchmark datasets, but does not explicitly state the train/validation/test dataset splits (percentages or sample counts) used. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running experiments are provided. |
| Software Dependencies | No | The paper mentions optimizers like SGD, Weight Decay, and LAMB, but does not specify software versions for these or any other dependencies. |
| Experiment Setup | Yes | For SIBERT, the scale invariant portion is trained using SGD+WD with a piecewise constant LR schedule and WD of 1e-2. We use LAMB optimizer for the non-scale invariant parts. The initial LR for SGD is 8e-4 without warmup and is divided by 10 at step 600k and 900k. Default training is for 1M steps. For LAMB we use a linear decay schedule with initial learning rate 8e-4 and a linear warmup of 10k steps. (A sketch of these schedules appears below the table.) |
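
The Pseudocode row names Algorithm 1 (C-Clipped SGD + WD) without reproducing it. Below is a minimal sketch of a clipped-SGD-with-weight-decay update of that general form, assuming global-norm clipping at a threshold C and weight decay folded into the same step; the function name, the epsilon guard, and the exact placement of the decay term are assumptions, not the paper's verbatim algorithm.

```python
import numpy as np

def c_clipped_sgd_wd_step(w, grad, lr, wd, clip_c):
    """One step of C-clipped SGD with weight decay (sketch, not the paper's exact Algorithm 1).

    Assumes: global-norm clipping of the stochastic gradient at threshold
    `clip_c`, and weight decay `wd` applied in the same update.
    """
    g_norm = np.linalg.norm(grad)
    # Scale the gradient so its norm is at most clip_c (no-op if already smaller).
    clipped = grad * min(1.0, clip_c / (g_norm + 1e-12))
    # Gradient step plus weight-decay shrinkage: w <- (1 - lr*wd) w - lr * clipped_grad.
    return w - lr * (clipped + wd * w)

# Toy usage: a few steps on a random weight vector with a random gradient.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
for _ in range(3):
    g = rng.normal(size=8)
    w = c_clipped_sgd_wd_step(w, g, lr=8e-4, wd=1e-2, clip_c=1.0)
```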
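
The learning-rate schedules quoted in the Experiment Setup row can be written down directly. The sketch below encodes the piecewise-constant SGD schedule (8e-4, no warmup, divided by 10 at steps 600k and 900k) and the LAMB schedule (linear warmup over 10k steps to 8e-4, then linear decay over the 1M-step run). The function names, and the assumption that the LAMB schedule decays to zero at step 1M, are ours rather than stated in the paper.

```python
def sgd_lr(step, base_lr=8e-4, drops=(600_000, 900_000), factor=10.0):
    """Piecewise-constant schedule: divide base_lr by `factor` at each drop step."""
    lr = base_lr
    for boundary in drops:
        if step >= boundary:
            lr /= factor
    return lr

def lamb_lr(step, peak_lr=8e-4, warmup=10_000, total=1_000_000):
    """Linear warmup to peak_lr over `warmup` steps, then linear decay (assumed to 0 at `total`)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * max(0.0, (total - step) / (total - warmup))

# Example: LR values at a few milestones of the 1M-step run.
for s in (0, 10_000, 599_999, 600_000, 900_000, 1_000_000):
    print(s, sgd_lr(s), lamb_lr(s))
```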