Understanding the Training Speedup from Sampling with Approximate Losses
Authors: Rudrajit Das, Xi Chen, Bertram Ieong, Parikshit Bansal, Sujay Sanghavi
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SIFT on the task of training a 110M-parameter, 12-layer BERT-base model, and show significant gains (in terms of training hours and number of backpropagation steps) over vanilla training without any optimized implementation. For example, to reach 64% validation accuracy, SIFT with exit at the first layer takes 43 hours compared to 57 hours for vanilla training. (A hedged sketch of this early-exit selection idea follows the table.) |
| Researcher Affiliation | Collaboration | UT Austin and Amazon. Correspondence to: Rudrajit Das <rdas@utexas.edu>, Xi Chen <xichex@amazon.com>, Sujay Sanghavi <sanghavi@mail.utexas.edu>. |
| Pseudocode | Yes | The paper includes '4.1. Greedy SGD (GSGD) Algorithm' which describes the algorithm steps using equations (3) and (4), functioning as an algorithm block. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | We train on BookCorpus (Zhu et al., 2015) and English Wikipedia, which are two diverse and extensive standard corpora. ...Here we consider training a slightly modified version of ResNet-50 on CIFAR-100 and Food-101 (Bossard et al., 2014). |
| Dataset Splits | No | The paper mentions 'The validation set for assessing the model's performance is derived from the development partition of the training corpus' and uses standard datasets like CIFAR-100 and Food-101, but it does not explicitly state the percentages or sample counts, or cite predefined validation splits, needed to reproduce the data partitioning. |
| Hardware Specification | Yes | Our experiments were conducted on AWS p4d.24xlarge instances (8 NVIDIA A100 Tensor core GPUs). |
| Software Dependencies | No | The paper mentions the 'bert-base-uncased tokenizer from the Hugging Face model repository' and PyTorch, but does not provide specific version numbers for these software components, which are necessary for reproducible dependency information. |
| Experiment Setup | Yes | For AdamW, we used the following hyper-parameter values: learning rate = 1e-4, ℓ2 weight decay = 0.01, β1 = 0.9 and β2 = 0.999. The learning rate warmup was over the first 0.2% of total steps, followed by linear decay. We used the GELU activation and a dropout probability of 0.1 on all the layers. (A minimal configuration sketch follows the table.) |
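
The selection mechanism referenced in the Research Type row (sampling by approximate, early-exit losses) is not spelled out in this table. As a rough illustration only, the sketch below shows one way such selection could look: a hypothetical `early_exit_head` attached after the first layer produces approximate per-sample losses, and the highest-loss samples are kept for the full backward pass. The function name, the `keep_fraction` knob, and the head itself are assumptions for illustration, not the authors' exact SIFT/GSGD procedure.

```python
import torch
import torch.nn.functional as F

def select_by_approx_loss(first_layer_features, early_exit_head, labels, keep_fraction=0.5):
    """Rank a batch by approximate (early-exit) loss and keep the highest-loss
    fraction for the full forward/backward pass.

    `early_exit_head` is a hypothetical lightweight classifier attached after
    the first layer; `keep_fraction` is an illustrative knob, not a value
    taken from the paper.
    """
    with torch.no_grad():
        approx_logits = early_exit_head(first_layer_features)                    # cheap early-exit predictions
        approx_loss = F.cross_entropy(approx_logits, labels, reduction="none")   # per-sample approximate losses
    k = max(1, int(keep_fraction * labels.size(0)))
    return torch.topk(approx_loss, k).indices                                    # indices of the hardest samples
```

In a training loop, the returned indices would be used to subsample the batch before the full forward/backward pass, which is where the reported savings in backpropagation steps would come from.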
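
For the Experiment Setup row, a minimal PyTorch sketch of the reported optimizer configuration (AdamW with learning rate 1e-4, weight decay 0.01, β1 = 0.9, β2 = 0.999, warmup over the first 0.2% of total steps followed by linear decay) is given below. The placeholder model and the total step count are assumptions; the paper's actual training loop is not reproduced here.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Placeholder for the 110M-parameter, 12-layer BERT-base model (assumed here).
model = torch.nn.Linear(768, 768)

total_steps = 1_000_000                   # assumed value, not stated in this table
warmup_steps = int(0.002 * total_steps)   # warmup over the first 0.2% of total steps

# Hyper-parameters as reported in the Experiment Setup row.
optimizer = AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)

def linear_warmup_then_decay(step: int) -> float:
    """Scale factor on the peak LR: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_then_decay)

# Inside the training loop: optimizer.step() followed by scheduler.step().
```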