AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment
Authors: Yonggan Fu, Zhongzhi Yu, Junwei Li, Jiayi Qian, Yongan Zhang, Xiangchi Yuan, Dachuan Shi, Roman Yakunin, Yingyan (Celine) Lin
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments validate that AmoebaLLM not only sets new standards in LLM adaptability but also successfully delivers subnets that achieve state-of-the-art trade-offs between accuracy and efficiency. |
| Researcher Affiliation | Academia | Yonggan Fu, Zhongzhi Yu, Junwei Li, Jiayi Qian, Yongan Zhang, Xiangchi Yuan, Dachuan Shi, Roman Yakunin, Yingyan (Celine) Lin, Georgia Institute of Technology, {yonggan.fu, celine.lin}@gatech.edu |
| Pseudocode | No | The paper describes methodologies in text and figures but does not include structured pseudocode or algorithm blocks with explicit labels like 'Algorithm' or 'Pseudocode'. |
| Open Source Code | Yes | Our code is available at https://github.com/GATECH-EIC/AmoebaLLM. |
| Open Datasets | Yes | Following [7, 9], we adopt 50K samples from Alpaca [40] for our one-for-all fine-tuning as well as for fine-tuning all baselines. For both our method and the baselines, we adopt a constant learning rate of 2e-4 with an AdamW optimizer and a LoRA rank of 64, and fine-tune for 10K iterations. |
| Dataset Splits | Yes | During each fine-tuning iteration, we employ the sandwich sampling [11, 13, 14] to sample K subnets {T_i}_{i=1}^{K} with different layer/width remaining ratios, including the largest/smallest ones and K − 2 random ones from our design space. Detailed layer/width configurations of sampled subnets can be obtained from the strategies derived in Sec. 3.2. We fine-tune our SMoL adapter as detailed in Sec. 3.3 by accumulating the gradients from all sampled subnets using in-place distillation, where only the loss of the largest subnet T_1 is calculated using ground truth, while those of other subnets {T_i}_{i=2}^{K} use distillation from the largest one [11] (see the training-loop sketch after the table). |
| Hardware Specification | Yes | We profile these workloads using (1) two devices, including an NVIDIA A5000 consumer-level GPU and an NVIDIA Jetson Orin NX edge GPU; |
| Software Dependencies | No | The paper mentions 'TensorRT-LLM [19], MLC-LLM [20], and vanilla PyTorch [21]' as deployment flows but does not specify their version numbers or other software dependencies with versions. |
| Experiment Setup | Yes | For both our method and the baselines, we adopt a constant learning rate of 2e-4 with an AdamW optimizer and a LoRA rank of 64, and fine-tune for 10K iterations. It takes 40 GPU hours on an NVIDIA A5000 GPU for our one-for-all fine-tuning (see the configuration sketch after the table). |
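
The Dataset Splits row quotes the sandwich-sampling, in-place-distillation fine-tuning loop. Below is a minimal PyTorch sketch of that loop under stated assumptions: the subnet selector `apply_subnet`, the three sampling helpers, and the distillation temperature are hypothetical placeholders rather than AmoebaLLM's actual API, and the model is assumed to expose a Hugging Face-style causal-LM interface.

```python
# Minimal sketch of sandwich sampling with in-place distillation, assuming a
# hypothetical `apply_subnet` selector and Hugging Face-style model outputs.
import torch
import torch.nn.functional as F


def one_for_all_step(model, batch, optimizer,
                     sample_largest, sample_smallest, sample_random,
                     K=4, temperature=1.0):
    """One fine-tuning iteration that accumulates gradients from K subnets."""
    optimizer.zero_grad()

    # Sandwich sampling: the largest subnet, the smallest subnet,
    # and K - 2 randomly sampled layer/width configurations.
    configs = [sample_largest(), sample_smallest()]
    configs += [sample_random() for _ in range(K - 2)]

    # Largest subnet: trained with the ground-truth loss; its logits are
    # detached and reused as the teacher for all smaller subnets.
    model.apply_subnet(configs[0])          # hypothetical subnet selector
    teacher_logits = model(batch["input_ids"]).logits
    gt_loss = F.cross_entropy(
        teacher_logits.view(-1, teacher_logits.size(-1)),
        batch["labels"].view(-1),
        ignore_index=-100,
    )
    gt_loss.backward()
    teacher_probs = F.softmax(teacher_logits.detach() / temperature, dim=-1)

    # In-place distillation: every other subnet mimics the largest subnet.
    for cfg in configs[1:]:
        model.apply_subnet(cfg)
        student_logits = model(batch["input_ids"]).logits
        kd_loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            teacher_probs,
            reduction="batchmean",
        )
        kd_loss.backward()                  # gradients accumulate across subnets

    optimizer.step()
```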
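
The Experiment Setup row quotes the fine-tuning recipe: AdamW with a constant learning rate of 2e-4, LoRA rank 64, and 10K iterations. A rough configuration sketch using Hugging Face PEFT is shown below; the base checkpoint, LoRA alpha, and target modules are illustrative assumptions not specified in the quoted text.

```python
# Rough setup for the quoted recipe (AdamW, constant lr 2e-4, LoRA rank 64,
# 10K iterations); checkpoint name, lora_alpha, and target_modules are assumed.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed base model
lora_cfg = LoraConfig(
    r=64,                                 # LoRA rank 64, as quoted
    lora_alpha=16,                        # assumed; not stated in the quoted text
    target_modules=["q_proj", "v_proj"],  # assumed target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

# Constant learning rate of 2e-4 with AdamW over the trainable (LoRA) parameters.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4
)
num_iterations = 10_000  # 10K fine-tuning iterations, as quoted
```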