LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models

Authors: Mojan Javaheripi, Gustavo de Rosa, Subhabrata Mukherjee, Shital Shah, Tomasz Religa, Caio Cesar Teodoro Mendes, Sebastien Bubeck, Farinaz Koushanfar, Debadeepta Dey

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate LTS on diverse devices from ARM CPUs to NVIDIA GPUs and two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of 16-layer GPT-2 and Transformer-XL can be achieved with up to 1.5×, 2.5× faster runtime and 1.2×, 2.0× lower peak memory utilization.
Researcher Affiliation | Collaboration | Mojan Javaheripi¹, Gustavo H. de Rosa², Subhabrata Mukherjee², Shital Shah², Tomasz L. Religa³, Caio C.T. Mendes², Sebastien Bubeck², Farinaz Koushanfar¹, Debadeepta Dey². ¹University of California San Diego, ²Microsoft Research, ³Microsoft
Pseudocode | Yes | Algorithm 1: LTS's training-free NAS
Open Source Code | Yes | Code available at https://github.com/microsoft/archai/tree/neurips_lts/archai/nlp
Open Datasets | Yes | To corroborate the effectiveness of our proxy, we train over 2900 Transformers on three large language modeling benchmark datasets, namely, WikiText-103 [27], One Billion Word [7], and Pile [17].
Dataset Splits | No | The paper states that it used 'WikiText-103 [27], One Billion Word [7], and Pile [17]' and refers to 'validation perplexity'. It also notes, 'Please refer to Appendix B for information about the benchmarked datasets, along with details of our training and evaluation setup...'. However, the provided text does not explicitly detail the training, validation, or test splits (e.g., percentages or sample counts) for these datasets; a sketch of the standard bundled splits follows the table.
Hardware Specification | Yes | We evaluate LTS on diverse devices from ARM CPUs to NVIDIA GPUs and two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL.
Software Dependencies | No | The paper mentions using PyTorch and Hugging Face resources, and refers to Appendix B for details on the training and evaluation setup. However, the provided text does not explicitly list specific software dependencies with their version numbers (e.g., Python version, library versions like PyTorch 1.x, or CUDA version).
Experiment Setup | No | The paper states: 'Please refer to Appendix B for information about the benchmarked datasets, along with details of our training and evaluation setup, hyperparameter optimization, and evolutionary search algorithm.' While these details are deferred to the appendix, concrete setup information, such as specific hyperparameter values or training configurations, is not present in the main body of the paper.
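As a point of reference for the missing split details, the sketch below loads the predefined train/validation/test splits that ship with WikiText-103 via the Hugging Face `datasets` library and prints their sizes. This is an illustrative assumption rather than the authors' documented setup: the paper does not confirm that these bundled splits, or the `wikitext` / `wikitext-103-raw-v1` Hub identifiers used here, are what LTS was trained and evaluated on.

```python
# Illustrative sketch only (not the authors' documented setup): assumes the
# Hugging Face `datasets` library and the public "wikitext" /
# "wikitext-103-raw-v1" Hub identifiers. WikiText-103 ships with predefined
# train/validation/test splits, which is likely what the paper's "validation
# perplexity" refers to, but the paper does not state this explicitly.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")

# Print the size of each bundled split.
for split_name, split in wikitext.items():
    print(f"{split_name}: {len(split):,} text rows")
```

One Billion Word and Pile also distribute their own held-out portions, but their hosting and naming on the Hub vary, so they are omitted from this sketch.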