AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment

Authors: Yonggan Fu, Zhongzhi Yu, Junwei Li, Jiayi Qian, Yongan Zhang, Xiangchi Yuan, Dachuan Shi, Roman Yakunin, Yingyan (Celine) Lin

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate that AmoebaLLM not only sets new standards in LLM adaptability but also successfully delivers subnets that achieve state-of-the-art trade-offs between accuracy and efficiency.
Researcher Affiliation | Academia | Yonggan Fu, Zhongzhi Yu, Junwei Li, Jiayi Qian, Yongan Zhang, Xiangchi Yuan, Dachuan Shi, Roman Yakunin, Yingyan (Celine) Lin. Georgia Institute of Technology. {yonggan.fu, celine.lin}@gatech.edu
Pseudocode | No | The paper describes methodologies in text and figures but does not include structured pseudocode or algorithm blocks with explicit labels such as 'Algorithm' or 'Pseudocode'.
Open Source Code | Yes | Our code is available at https://github.com/GATECH-EIC/AmoebaLLM.
Open Datasets | Yes | Following [7, 9], we adopt 50K samples from Alpaca [40] for our one-for-all fine-tuning as well as for fine-tuning all baselines. For both our method and the baselines, we adopt a constant learning rate of 2e-4 with an AdamW optimizer and a LoRA rank of 64, and fine-tune for 10K iterations.
Dataset Splits | Yes | During each fine-tuning iteration, we employ the sandwich sampling [11, 13, 14] to sample K subnets {T_i}_{i=1}^K with different layer/width remaining ratios, including the largest/smallest ones and K-2 random ones from our design space. Detailed layer/width configurations of sampled subnets can be obtained from the strategies derived in Sec. 3.2. We fine-tune our SMoL adapter as detailed in Sec. 3.3 by accumulating the gradients from all sampled subnets using in-place distillation, where only the loss of the largest subnet T_1 is calculated using ground truth, while those of other subnets {T_i}_{i=2}^K use distillation from the largest one [11]. (A code sketch of this training step is given after the table.)
Hardware Specification | Yes | We profile these workloads using (1) two devices, including an NVIDIA A5000 consumer-level GPU and an NVIDIA Jetson Orin NX edge GPU;
Software Dependencies | No | The paper mentions 'TensorRT-LLM [19], MLC-LLM [20], and vanilla PyTorch [21]' as deployment flows but does not specify their version numbers or other software dependencies with versions.
Experiment Setup | Yes | For both our method and the baselines, we adopt a constant learning rate of 2e-4 with an AdamW optimizer and a LoRA rank of 64, and fine-tune for 10K iterations. It takes 40 GPU hours on an NVIDIA A5000 GPU for our one-for-all fine-tuning. (A minimal configuration sketch follows the table.)
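
To make the reported fine-tuning setup concrete, below is a minimal configuration sketch assuming a Hugging Face Transformers + PEFT stack. The learning rate, optimizer, LoRA rank, and iteration count follow the quoted text; the base model name, LoRA alpha, and target modules are illustrative assumptions not stated in the excerpt.

```python
# Minimal sketch of the reported fine-tuning configuration:
# constant LR 2e-4, AdamW, LoRA rank 64, 10K iterations on 50K Alpaca samples.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model is an assumption; the excerpt does not name it.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=64,                                  # LoRA rank 64, as reported
    lora_alpha=16,                         # assumed; not stated in the excerpt
    target_modules=["q_proj", "v_proj"],   # assumed projection layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)  # constant LR, no scheduler
NUM_ITERATIONS = 10_000
```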
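
The sandwich sampling and in-place distillation procedure quoted under "Dataset Splits" can be summarized as the following training-step sketch. Helpers such as `activate_subnet` and `sample_random_config`, and the value of K, are hypothetical placeholders for AmoebaLLM's subnet-selection machinery; only the loss structure (ground-truth loss for the largest subnet, distillation from it for the others, gradient accumulation over all sampled subnets) follows the paper.

```python
# One fine-tuning iteration with sandwich sampling and in-place distillation (sketch).
import torch
import torch.nn.functional as F

K = 4  # number of subnets sampled per iteration (value assumed)

def sandwich_step(model, optimizer, batch, largest_cfg, smallest_cfg,
                  sample_random_config, activate_subnet):
    optimizer.zero_grad()

    # Sandwich rule: always include the largest and smallest subnets,
    # plus K-2 randomly sampled layer/width configurations.
    configs = [largest_cfg, smallest_cfg] + [sample_random_config() for _ in range(K - 2)]

    # The largest subnet is trained against the ground-truth labels.
    activate_subnet(model, largest_cfg)
    teacher_logits = model(**batch).logits
    loss = F.cross_entropy(
        teacher_logits.view(-1, teacher_logits.size(-1)),
        batch["labels"].view(-1),
        ignore_index=-100,
    )
    loss.backward()
    teacher_probs = teacher_logits.detach().softmax(dim=-1)

    # The remaining subnets are distilled from the largest one (in-place distillation);
    # gradients from all sampled subnets accumulate before a single optimizer step.
    for cfg in configs[1:]:
        activate_subnet(model, cfg)
        student_logits = model(**batch).logits
        kd_loss = F.kl_div(
            student_logits.log_softmax(dim=-1), teacher_probs, reduction="batchmean"
        )
        kd_loss.backward()

    optimizer.step()
```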