MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
Authors: Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments are conducted on 32 A100 GPUs, with each GPU having a batch size of 32. We performed exploratory experiments with 120k iterations on 0.25T tokens. Subsequently, the top models reported in Table 3 and Table 4 are trained with 480k iterations on 1T tokens. We evaluate the pre-trained model on zero-shot common sense reasoning tasks, including ARC-easy, ARC-challenge (Clark et al., 2018), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), OBQA (Mihaylov et al., 2018), WinoGrande (Sakaguchi et al., 2021), as well as question answering and reading comprehension tasks using TQA (Joshi et al., 2017) and the RACE dataset (Lai et al., 2017). |
| Researcher Affiliation | Industry | Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra; all authors are affiliated with Meta. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper discusses open-source models used for comparison and evaluation, but it does not provide a link to or release of the source code for the MobileLLM models developed in this paper. |
| Open Datasets | Yes | We evaluate the pre-trained model on zero-shot common sense reasoning tasks, including ARC-easy, ARC-challenge (Clark et al., 2018), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), OBQA (Mihaylov et al., 2018), WinoGrande (Sakaguchi et al., 2021), as well as question answering and reading comprehension tasks using TQA (Joshi et al., 2017) and the RACE dataset (Lai et al., 2017). A hedged likelihood-scoring sketch for these zero-shot tasks follows the table. |
| Dataset Splits | No | The paper states that models are 'trained with 480k iterations on 1T tokens' and that the authors 'evaluate the pre-trained model on zero-shot common sense reasoning tasks', but it does not specify training, validation, and test splits for the 1T-token pretraining data, nor for the zero-shot benchmark datasets beyond their use as evaluation sets. |
| Hardware Specification | Yes | Our experiments are conducted on 32 A100 GPUs, with each GPU having a batch size of 32. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer' and the 'PyTorch edge team', and refers to 'ExecuTorch' and the 'Metal Performance Shaders (MPS) backend' for profiling, along with 'iOS 17.2.1' for the iPhone, but it does not provide specific version numbers for software dependencies relevant to the core model training or inference setup (e.g., PyTorch version, CUDA version). |
| Experiment Setup | Yes | We train MobileLLM from scratch using the Adam optimizer (Kingma & Ba, 2014) with a weight decay of 0.1. The experiments are conducted using 32 A100 GPUs, with a batch size of 32 on each GPU. The initial learning rate is set to 2e-3 and follows a cosine learning-rate decay strategy. We perform quick exploration experiments with 120k iterations on 0.25T tokens and train the best models reported in Tables 3 and 4 with 480k iterations on 1T tokens. A hedged configuration sketch based on these values follows the table. |
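The Experiment Setup row pins down the core training hyperparameters: Adam with weight decay 0.1, a 2e-3 peak learning rate with cosine decay, and 480k iterations at a global batch of 32 GPUs x 32 sequences per GPU. The snippet below is a minimal sketch of that configuration; the tiny stand-in model, random batch, and shortened loop are assumptions standing in for the MobileLLM architecture and the 1T-token pretraining corpus, neither of which is released with the paper.

```python
# Minimal sketch of the training configuration quoted in the "Experiment Setup"
# row: Adam with weight decay 0.1, a 2e-3 peak learning rate with cosine decay,
# and 480k full-run iterations at a global batch of 32 GPUs x 32 sequences.
# The stand-in model and random batch are placeholders, not MobileLLM.
import torch

PEAK_LR = 2e-3            # initial learning rate reported in the paper
WEIGHT_DECAY = 0.1        # weight decay reported in the paper
TOTAL_ITERS = 480_000     # full runs (~1T tokens); exploration runs used 120k (~0.25T tokens)
GPUS, BATCH_PER_GPU = 32, 32   # global batch = 1024 sequences across 32 A100 GPUs

model = torch.nn.Linear(128, 128)  # placeholder for the sub-billion-parameter LM
optim = torch.optim.Adam(model.parameters(), lr=PEAK_LR, weight_decay=WEIGHT_DECAY)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(optim, T_max=TOTAL_ITERS)

for step in range(10):                    # schematic loop; real runs take TOTAL_ITERS steps
    x = torch.randn(BATCH_PER_GPU, 128)   # stand-in for one per-GPU batch
    loss = model(x).pow(2).mean()         # stand-in for next-token cross-entropy
    optim.zero_grad()
    loss.backward()
    optim.step()
    sched.step()                          # step the cosine decay once per iteration
```

The 32-GPU data parallelism itself is not shown; in practice the per-GPU batch of 32 would be wrapped in DistributedDataParallel or an equivalent, with the scheduler stepped once per global iteration.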
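The Open Datasets row lists the zero-shot benchmarks (ARC-easy/challenge, BoolQ, PIQA, SIQA, HellaSwag, OBQA, WinoGrande) plus TQA and RACE. Zero-shot evaluation on such benchmarks is typically likelihood-based multiple choice: each answer option is appended to the question, and the option to which the model assigns the highest log-probability is selected. The sketch below illustrates that scoring pattern, assuming a placeholder Hugging Face checkpoint and a toy ARC-style question, since the paper does not release a MobileLLM checkpoint or an evaluation harness.

```python
# Hedged sketch of likelihood-based zero-shot multiple-choice scoring, the usual
# protocol for benchmarks like ARC, BoolQ, PIQA, SIQA, HellaSwag, OBQA, and WinoGrande.
# MODEL_NAME and the example question are placeholders, not artifacts from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-125m"  # placeholder sub-billion-parameter checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

def option_loglik(context: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to `option` given `context`."""
    # Approximate the context/option boundary by tokenizing the context alone.
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    return sum(log_probs[p, full_ids[0, p + 1]].item()
               for p in range(ctx_len - 1, full_ids.shape[1] - 1))

# Toy ARC-style item; a real run iterates over the public benchmark files
# and reports accuracy per task.
question = "Question: Which gas do plants absorb from the air?\nAnswer:"
options = [" Oxygen", " Carbon dioxide", " Nitrogen", " Helium"]
scores = [option_loglik(question, o) for o in options]
print(options[scores.index(max(scores))])
```

Ranking continuations by summed log-probability, rather than relying on free-form generation, is what makes these tasks usable in a strict zero-shot setting; harnesses such as EleutherAI's lm-evaluation-harness implement the same idea with per-task prompt templates and length normalization.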