Learning to Grow Pretrained Models for Efficient Transformer Training

Authors: Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris, David Daniel Cox, Zhangyang Wang, Yoon Kim

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% computational cost of training from scratch, while also consistently outperforming strong baselines that also reuse smaller pretrained models to initialize larger models.
Researcher Affiliation | Collaboration | University of Texas at Austin, MIT-IBM Watson AI Lab, Columbia University, MIT
Pseudocode | Yes | Algorithm 1: A forward pass of LiGO with transformer. (An illustrative sketch of a linear growth operator follows the table.)
Open Source Code | No | The paper provides a project page URL (https://vita-group.github.io/LiGO/), which is a high-level project overview page and not a direct link to a source-code repository for the methodology.
Open Datasets | Yes | We follow Tan & Bansal (2020) and use the English Wikipedia corpus for training BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). We use the public C4 (Raffel et al., 2020) dataset for training GPT2 (Radford et al., 2019). We use ImageNet (Deng et al., 2009) for training vision transformers. We use GLUE (Wang et al., 2018), SQuADv1.1 (Rajpurkar et al., 2016), and SQuADv2.0 (Rajpurkar et al., 2018) for evaluating pretrained BERT models. (A hedged data-loading sketch also follows the table.)
Dataset Splits | Yes | We follow Tan & Bansal (2020) and use the English Wikipedia corpus for training BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). We use the public C4 (Raffel et al., 2020) dataset for training GPT2 (Radford et al., 2019). We use ImageNet (Deng et al., 2009) for training vision transformers. We use GLUE (Wang et al., 2018), SQuADv1.1 (Rajpurkar et al., 2016), and SQuADv2.0 (Rajpurkar et al., 2018) for evaluating pretrained BERT models.
Hardware Specification | No | The paper mentions utilizing 'the computational resources on the AiMOS Supercomputer' but does not provide specific hardware details such as GPU models, CPU models, or memory specifications.
Software Dependencies | No | The paper references specific model codebases (e.g., the DeiT official codebase, pre-training code by Shen et al. (2022)) but does not list specific software versions for libraries, frameworks, or languages used in the experiments.
Experiment Setup | Yes | We always use 100 gradient steps to learn the LiGO for all models... We train both BERT and RoBERTa models for 400K steps with a warmup of 10K steps... For BERT, we use a batch size of 256 and a learning rate of 2e-4, while we use a batch size of 1024 and a learning rate of 8e-4 for training RoBERTa models. Following Shen et al. (2022), we train GPT2 models with a batch size of 384 and sequence length of 1024. ...We train all our vision transformers for 300 epochs with a batch size of 1024. (These hyperparameters are collected into a reference configuration after the table.)
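
As the Pseudocode row notes, the paper's Algorithm 1 describes a forward pass of LiGO with a transformer. The snippet below is only a minimal sketch of the general idea of a learned linear growth operator: small-model weight matrices are expanded in width by learned factor matrices and in depth by a learned mixing of layers. The class name, shapes, initialization, and the single square-weight case are illustrative assumptions; this is not the authors' factorization or a reproduction of Algorithm 1.

```python
# Minimal, illustrative sketch of a LiGO-style linear growth operator.
# All names and shapes are assumptions for a single family of square weights,
# not the paper's implementation, which grows every transformer parameter.
import torch
import torch.nn as nn


class LinearGrowthOperator(nn.Module):
    """Maps a small model's weight matrices to a larger model's weights.

    Width growth: each small weight W_l (d_small x d_small) is expanded to
    d_large x d_large via learned factors A_l and B_l.
    Depth growth: each large-model layer is a learned linear combination of
    the width-expanded small-model layers.
    """

    def __init__(self, d_small, d_large, n_small_layers, n_large_layers):
        super().__init__()
        # Width-expansion factors, one pair per small-model layer.
        self.A = nn.Parameter(torch.randn(n_small_layers, d_large, d_small) * 0.02)
        self.B = nn.Parameter(torch.randn(n_small_layers, d_large, d_small) * 0.02)
        # Depth-expansion mixing coefficients (identity-like at initialization).
        self.depth_mix = nn.Parameter(torch.eye(n_large_layers, n_small_layers))

    def forward(self, small_weights):
        # small_weights: (n_small_layers, d_small, d_small)
        # Width expansion: W_wide[l] = A[l] @ W[l] @ B[l]^T
        wide = torch.einsum('lij,ljk,lmk->lim', self.A, small_weights, self.B)
        # Depth expansion: each large layer mixes the width-expanded layers.
        large = torch.einsum('pl,lim->pim', self.depth_mix, wide)
        return large  # (n_large_layers, d_large, d_large)


# Example: grow 6 layers of width 512 into 12 layers of width 768.
op = LinearGrowthOperator(d_small=512, d_large=768, n_small_layers=6, n_large_layers=12)
small = torch.randn(6, 512, 512)
print(op(small).shape)  # torch.Size([12, 768, 768])
```

In a LiGO-style procedure, the large transformer's forward pass consumes the weights produced by such an operator, and the operator's parameters are fitted for a small number of steps (the report quotes 100 gradient steps) before the expanded weights serve as the initialization for standard training.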
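The Open Datasets and Dataset Splits rows list only public corpora. The paper does not state which data tooling was used; purely as an assumption, the sketch below shows one way these datasets could be obtained from the Hugging Face Hub. The dataset identifiers and configuration names are hub conventions assumed here, not details taken from the paper.

```python
# Hypothetical data-loading sketch using the Hugging Face `datasets` library.
# The paper does not specify its data pipeline; identifiers below are assumptions.
from datasets import load_dataset

# English Wikipedia for BERT/RoBERTa pretraining (a preprocessed hub snapshot).
wikipedia = load_dataset("wikipedia", "20220301.en", split="train")

# C4 for GPT2 pretraining (streaming avoids downloading the full corpus).
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

# GLUE and SQuAD for downstream evaluation of pretrained BERT models.
glue_sst2 = load_dataset("glue", "sst2")  # one of the GLUE tasks
squad_v1 = load_dataset("squad")          # SQuAD v1.1
squad_v2 = load_dataset("squad_v2")       # SQuAD v2.0

# ImageNet-1k (used for the vision transformers) is gated on the hub and
# requires accepting its license before it can be downloaded.
```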
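For convenience, the hyperparameters quoted in the Experiment Setup row are gathered into a single reference configuration below. The dictionary layout and field names are editorial; only the values come from the quoted setup.

```python
# Reported training hyperparameters gathered for quick reference.
# Field names are illustrative; values come from the quoted experiment setup.
TRAINING_SETUP = {
    "ligo_fitting_steps": 100,  # gradient steps used to learn the LiGO operator
    "bert": {
        "total_steps": 400_000,
        "warmup_steps": 10_000,
        "batch_size": 256,
        "learning_rate": 2e-4,
    },
    "roberta": {
        "total_steps": 400_000,
        "warmup_steps": 10_000,
        "batch_size": 1024,
        "learning_rate": 8e-4,
    },
    "gpt2": {
        "batch_size": 384,
        "sequence_length": 1024,
    },
    "vision_transformer": {
        "epochs": 300,
        "batch_size": 1024,
    },
}
```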