MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

Authors: Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments are conducted on 32 A100 GPUs, with each GPU having a batch size of 32. We performed exploratory experiments with 120k iterations on 0.25T tokens. Subsequently, the top models reported in Table 3 and Table 4 are trained with 480k iterations on 1T tokens. We evaluate the pre-trained model on zero-shot common sense reasoning tasks, including ARC-easy, ARC-challenge (Clark et al., 2018), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), OBQA (Mihaylov et al., 2018), WinoGrande (Sakaguchi et al., 2021), as well as question answering and reading comprehension tasks using TQA (Joshi et al., 2017) and the RACE dataset (Lai et al., 2017).
Researcher Affiliation | Industry | Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra; all affiliated with Meta.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper discusses open-source models used for comparison and evaluation but does not provide concrete access to the source code for the MobileLLM methodology developed in this paper.
Open Datasets | Yes | We evaluate the pre-trained model on zero-shot common sense reasoning tasks, including ARC-easy, ARC-challenge (Clark et al., 2018), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), OBQA (Mihaylov et al., 2018), WinoGrande (Sakaguchi et al., 2021), as well as question answering and reading comprehension tasks using TQA (Joshi et al., 2017) and the RACE dataset (Lai et al., 2017). (A hedged zero-shot evaluation sketch follows the table.)
Dataset Splits | No | The paper states that models are 'trained with 480k iterations on 1T tokens' and that it 'evaluate[s] the pre-trained model on zero-shot common sense reasoning tasks', but it does not provide details of the training, validation, and test splits for the 1T-token training data, nor for the zero-shot task datasets beyond their use as evaluation sets.
Hardware Specification | Yes | Our experiments are conducted on 32 A100 GPUs, with each GPU having a batch size of 32. (A back-of-the-envelope check of the implied global batch follows the table.)
Software Dependencies | No | The paper mentions using the 'Adam optimizer' and the 'PyTorch edge team', and refers to 'ExecuTorch' and the 'Metal Performance Shaders (MPS) backend' for profiling, along with 'iOS 17.2.1' for the iPhone, but it does not provide specific version numbers for software dependencies relevant to the core model training or inference setup (e.g., PyTorch version, CUDA version).
Experiment Setup | Yes | We train MobileLLM from scratch using the Adam optimizer (Kingma & Ba, 2014) with a weight decay of 0.1. The experiments are conducted using 32 A100 GPUs, with a batch size of 32 on each GPU. The initial learning rate is set to 2e-3 and follows a cosine learning-rate decay strategy. We perform quick exploration experiments with 120k iterations on 0.25T tokens and train the best models reported in Tables 3 and 4 with 480k iterations on 1T tokens. (A hedged training-configuration sketch follows the table.)
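
To make the Experiment Setup row concrete, here is a minimal PyTorch sketch of the quoted hyperparameters (Adam, weight decay 0.1, initial learning rate 2e-3, cosine decay over the 480k-iteration full run). The tiny stand-in model, the absence of warmup, and the choice of coupled versus decoupled (AdamW-style) weight decay are assumptions; the text quoted above does not settle them.

```python
# Minimal sketch of the quoted training hyperparameters, assuming PyTorch.
# The stand-in model, dummy loss, and lack of warmup are illustrative assumptions.
import torch

model = torch.nn.Linear(512, 512)  # placeholder for the MobileLLM architecture

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=2e-3,           # initial learning rate, as reported
    weight_decay=0.1,  # weight decay, as reported
)
# Cosine learning-rate decay over the 480k iterations of the full 1T-token run.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=480_000)

def training_step(batch):
    """One illustrative optimizer step; the real data and loss pipeline is omitted."""
    loss = model(batch).pow(2).mean()  # dummy loss for the sketch
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

training_step(torch.randn(32, 512))  # per-GPU batch size of 32, as reported
```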
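The Hardware Specification and Research Type rows together imply a global batch of 1024 sequences per optimizer step; combined with 1T tokens over 480k iterations, that works out to roughly 2k tokens per sequence. The sequence length is an inference from the reported numbers, not a value quoted in the table; the arithmetic is sketched below.

```python
# Back-of-the-envelope check of the reported scale; inputs are quoted figures,
# and the inferred ~2k sequence length is an assumption derived from them.
gpus = 32
per_gpu_batch = 32
global_batch = gpus * per_gpu_batch               # 1024 sequences per optimizer step

total_tokens = 1.0e12                             # 1T tokens for the full training run
iterations = 480_000                              # 480k iterations
tokens_per_step = total_tokens / iterations       # ~2.08M tokens per step
implied_seq_len = tokens_per_step / global_batch  # ~2035, i.e. roughly a 2k context

print(f"global_batch={global_batch}, tokens_per_step={tokens_per_step:.3e}, "
      f"implied_seq_len={implied_seq_len:.0f}")
```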
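For the Open Datasets row, a zero-shot evaluation over the listed benchmarks can be approximated with the EleutherAI lm-evaluation-harness. The sketch below assumes current harness task identifiers and a placeholder Hugging Face checkpoint name; the paper does not state which evaluation harness, task names, or released checkpoints were used.

```python
# Hedged zero-shot evaluation sketch using lm-evaluation-harness (lm-eval >= 0.4).
# Task identifiers and the checkpoint name are assumptions, not quoted from the paper.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=facebook/MobileLLM-350M",  # placeholder checkpoint name
    tasks=[
        "arc_easy", "arc_challenge", "boolq", "piqa", "social_iqa",
        "hellaswag", "openbookqa", "winogrande", "triviaqa", "race",
    ],
    num_fewshot=0,  # zero-shot protocol, matching the paper's evaluation setup
)

for task_name, metrics in results["results"].items():
    print(task_name, metrics)
```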