Understanding the Training Speedup from Sampling with Approximate Losses
Authors: Rudrajit Das, Xi Chen, Bertram Ieong, Parikshit Bansal, Sujay Sanghavi
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SIFT on the task of training a 110M-parameter, 12-layer BERT-base model, and show significant gains (in terms of training hours and number of backpropagation steps) over vanilla training without any optimized implementation. For example, to reach 64% validation accuracy, SIFT with exit at the first layer takes 43 hours compared to 57 hours for vanilla training. (A hedged sketch of this early-exit selection idea follows the table.) |
| Researcher Affiliation | Collaboration | UT Austin and Amazon. Correspondence to: Rudrajit Das <rdas@utexas.edu>, Xi Chen <xichex@amazon.com>, Sujay Sanghavi <sanghavi@mail.utexas.edu>. |
| Pseudocode | Yes | The paper includes '4.1. Greedy SGD (GSGD) Algorithm' which describes the algorithm steps using equations (3) and (4), functioning as an algorithm block. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | We train on BookCorpus (Zhu et al., 2015) and English Wikipedia, which are two diverse and extensive standard corpora. ...Here we consider training a slightly modified version of ResNet-50 on CIFAR-100 and Food-101 (Bossard et al., 2014). |
| Dataset Splits | No | The paper mentions 'The validation set for assessing the model's performance is derived from the development partition of the training corpus' and uses standard datasets like CIFAR-100 and Food-101, but it does not explicitly state the percentages or sample counts, or cite predefined validation splits, needed to reproduce the data partitioning. |
| Hardware Specification | Yes | Our experiments were conducted on AWS p4d.24xlarge instances (8 NVIDIA A100 Tensor core GPUs). |
| Software Dependencies | No | The paper mentions the 'bert-base-uncased tokenizer from the Hugging Face model repository' and PyTorch, but does not provide specific version numbers for these software components, which are necessary for reproducible dependency information. |
| Experiment Setup | Yes | For AdamW, we used the following hyper-parameter values: learning rate = 1e-4, ℓ2 weight decay = 0.01, β1 = 0.9 and β2 = 0.999. The learning rate warmup was over the first 0.2% of total steps, followed by linear decay. We used the GELU activation and a dropout probability of 0.1 on all the layers. (A minimal configuration sketch follows the table.) |
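
The selection mechanism referenced in the Research Type row (sampling by approximate, early-exit losses) is not spelled out in this table. As a rough illustration only, the sketch below shows one way such selection could look: a hypothetical `early_exit_head` attached after the first layer produces approximate per-sample losses, and the highest-loss samples are kept for the full backward pass. The function name, the `keep_fraction` knob, and the head itself are assumptions for illustration, not the authors' exact SIFT/GSGD procedure.

```python
import torch
import torch.nn.functional as F

def select_by_approx_loss(first_layer_features, early_exit_head, labels, keep_fraction=0.5):
    """Rank a batch by approximate (early-exit) loss and keep the highest-loss
    fraction for the full forward/backward pass.

    `early_exit_head` is a hypothetical lightweight classifier attached after
    the first layer; `keep_fraction` is an illustrative knob, not a value
    taken from the paper.
    """
    with torch.no_grad():
        approx_logits = early_exit_head(first_layer_features)                    # cheap early-exit predictions
        approx_loss = F.cross_entropy(approx_logits, labels, reduction="none")   # per-sample approximate losses
    k = max(1, int(keep_fraction * labels.size(0)))
    return torch.topk(approx_loss, k).indices                                    # indices of the hardest samples
```

In a training loop, the returned indices would be used to subsample the batch before the full forward/backward pass, which is where the reported savings in backpropagation steps would come from.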
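
For the Experiment Setup row, a minimal PyTorch sketch of the reported optimizer configuration (AdamW with learning rate 1e-4, weight decay 0.01, β1 = 0.9, β2 = 0.999, warmup over the first 0.2% of total steps followed by linear decay) is given below. The placeholder model and the total step count are assumptions; the paper's actual training loop is not reproduced here.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Placeholder for the 110M-parameter, 12-layer BERT-base model (assumed here).
model = torch.nn.Linear(768, 768)

total_steps = 1_000_000                   # assumed value, not stated in this table
warmup_steps = int(0.002 * total_steps)   # warmup over the first 0.2% of total steps

# Hyper-parameters as reported in the Experiment Setup row.
optimizer = AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)

def linear_warmup_then_decay(step: int) -> float:
    """Scale factor on the peak LR: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_then_decay)

# Inside the training loop: optimizer.step() followed by scheduler.step().
```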