Learning to Grow Pretrained Models for Efficient Transformer Training
Authors: Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris, David Daniel Cox, Zhangyang Wang, Yoon Kim
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% computational cost of training from scratch, while also consistently outperforming strong baselines that also reuse smaller pretrained models to initialize larger models. |
| Researcher Affiliation | Collaboration | 1University of Texas at Austin, 2MIT-IBM Watson AI Lab, 3Columbia University, 4MIT |
| Pseudocode | Yes | Algorithm 1: A forward pass of LiGO with transformer. (A hedged code sketch of such a growth operator is provided below the table.) |
| Open Source Code | No | The paper provides a project page URL (https://vita-group.github.io/LiGO/), which is a high-level project overview page and not a direct link to a source-code repository for the methodology. |
| Open Datasets | Yes | We follow Tan & Bansal (2020) and use the English Wikipedia corpus for training BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). We use the public C4 (Raffel et al., 2020) dataset for training GPT2 (Radford et al., 2019). We use ImageNet (Deng et al., 2009) for training vision transformers. We use GLUE (Wang et al., 2018), SQuAD v1.1 (Rajpurkar et al., 2016), and SQuAD v2.0 (Rajpurkar et al., 2018) for evaluating pretrained BERT models. |
| Dataset Splits | Yes | We follow Tan & Bansal (2020) and use the English Wikipedia corpus for training BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). We use the public C4 (Raffel et al., 2020) dataset for training GPT2 (Radford et al., 2019). We use ImageNet (Deng et al., 2009) for training vision transformers. We use GLUE (Wang et al., 2018), SQuAD v1.1 (Rajpurkar et al., 2016), and SQuAD v2.0 (Rajpurkar et al., 2018) for evaluating pretrained BERT models. |
| Hardware Specification | No | The paper mentions utilizing 'the computational resources on the AiMOS Supercomputer' but does not provide specific hardware details such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | No | The paper references specific model codebases (e.g., the DeiT official codebase, pre-training code by Shen et al. (2022)) but does not list specific software versions for libraries, frameworks, or languages used in the experiments. |
| Experiment Setup | Yes | We always use 100 gradient steps to learn the LiGO for all models... We train both BERT and RoBERTa models for 400K steps with a warmup of 10K steps... For BERT, we use a batch size of 256 and a learning rate of 2e-4, while we use a batch size of 1024 and a learning rate of 8e-4 for training RoBERTa models. Following Shen et al. (2022), we train GPT2 models with a batch size of 384 and sequence length of 1024. ...We train all our vision transformers for 300 epochs with a batch size of 1024. (These reported values are collected into a configuration sketch below the table.) |
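
The pseudocode row above cites Algorithm 1 (a forward pass of LiGO with a transformer) but does not reproduce it, so the following is a minimal sketch of a LiGO-style linear growth operator: learned width-expansion matrices widen each weight matrix of the small pretrained model, and a learned depth-expansion matrix mixes the widened layers into the larger model's layers. The class and parameter names (`LinearGrowthOperator`, `A`, `B`, `D`) and all dimensions are illustrative assumptions, not identifiers from the authors' code; per the Experiment Setup row, these operator parameters are fit with roughly 100 gradient steps before the expanded weights are used to initialize the larger model.

```python
import torch
import torch.nn as nn


class LinearGrowthOperator(nn.Module):
    """Expands a small model's weights into a larger model's weights via
    learned width-expansion matrices (A, B) and a depth-expansion matrix D.
    This is a simplified, hypothetical sketch: it treats each layer as a
    single square weight matrix, unlike the full per-module operator in
    the paper."""

    def __init__(self, d_small, d_large, n_small_layers, n_large_layers):
        super().__init__()
        # Width expansion: W_large[l] = A @ W_small[l] @ B.T
        self.A = nn.Parameter(torch.eye(d_large, d_small))
        self.B = nn.Parameter(torch.eye(d_large, d_small))
        # Depth expansion: each target layer is a learned linear
        # combination of the width-expanded source layers.
        self.D = nn.Parameter(torch.zeros(n_large_layers, n_small_layers))
        with torch.no_grad():
            for k in range(n_large_layers):
                self.D[k, k % n_small_layers] = 1.0  # stacking-style init

    def forward(self, small_weights):
        # small_weights: (n_small_layers, d_small, d_small)
        widened = torch.einsum("pi,lij,qj->lpq", self.A, small_weights, self.B)
        # Mix the widened layers into the larger model's layers:
        # output shape (n_large_layers, d_large, d_large)
        return torch.einsum("kl,lpq->kpq", self.D, widened)


# Example: grow 6 layers of width 512 into 12 layers of width 768.
grow = LinearGrowthOperator(d_small=512, d_large=768,
                            n_small_layers=6, n_large_layers=12)
small_weights = torch.randn(6, 512, 512)  # stand-in for pretrained weights
large_init = grow(small_weights)          # shape: (12, 768, 768)
```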
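
The Experiment Setup row quotes the reported hyperparameters in prose; the sketch below simply collects them into a single Python configuration object for readability. The field names and nesting are this report's own, and settings the excerpt does not mention (optimizer, weight decay, learning-rate schedule) are deliberately omitted rather than guessed.

```python
# Reported training hyperparameters, restated as a plain Python dict.
# Only values quoted in the Experiment Setup row are included.
TRAIN_CONFIG = {
    "ligo_fitting_steps": 100,  # gradient steps used to learn LiGO (all models)
    "bert": {
        "train_steps": 400_000,
        "warmup_steps": 10_000,
        "batch_size": 256,
        "learning_rate": 2e-4,
    },
    "roberta": {
        "train_steps": 400_000,
        "warmup_steps": 10_000,
        "batch_size": 1024,
        "learning_rate": 8e-4,
    },
    "gpt2": {
        "batch_size": 384,
        "sequence_length": 1024,
    },
    "vision_transformer": {
        "epochs": 300,
        "batch_size": 1024,
    },
}
```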