Search for Efficient Large Language Models

Authors: Xuan Shen, Pu Zhao, Yifan Gong, Zhenglun Kong, Zheng Zhan, Yushu Wu, Ming Lin, Chao Wu, Xue Lin, Yanzhi Wang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Compared with SOTA training-free structured pruning works that can generate smaller networks, our method demonstrates superior performance across standard benchmarks. Furthermore, our generated subnets can directly reduce the usage of GPU memory and achieve inference acceleration. Code: https://github.com/shawnricecake/search-llm
Researcher Affiliation | Collaboration | Xuan Shen¹, Pu Zhao¹, Yifan Gong¹, Zhenglun Kong², Zheng Zhan¹, Yushu Wu¹, Ming Lin³, Chao Wu¹, Xue Lin¹, Yanzhi Wang¹; ¹Northeastern University, ²Harvard University, ³Oracle
Pseudocode | Yes | Algorithm 1: Mask Mutation
Input: S, Pm, γ, α, η
Pr ← Random(0, 1)
if Inheriting_Ratio(S) == γ and Pr > Pm then
    Output: S
end
N ← len(S), iter ← 0
Idx1 ← {S == 1}, Idx2 ← ∅
while len(Idx1 ∩ Idx2) < α·N and iter < η do
    Idx2 ← Random_Subset({0, 1, ..., N−1} | γ)
    iter ← iter + 1
end
S′ ← 0_N; S′[Idx2] ← 1
Output: S′ if iter < η else S
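For readers who want to trace the logic, the following is a minimal Python sketch of the quoted Mask Mutation pseudocode. The NumPy mask representation, the rng handling, and implementing the Inheriting_Ratio check via the mask mean are assumptions made for illustration; the overlap threshold α·N is kept exactly as quoted.

    import numpy as np

    def mask_mutation(S, Pm, gamma, alpha=0.8, eta=1000, rng=None):
        # S: 1-D binary mask (0/1 numpy array); Pm: mutation probability;
        # gamma: inheriting ratio (fraction of ones); alpha: similarity ratio;
        # eta: maximum number of resampling iterations.
        rng = rng or np.random.default_rng()
        N = len(S)
        # With probability 1 - Pm, keep the parent mask unchanged when its
        # ratio of ones already matches gamma (the Inheriting_Ratio check,
        # approximated here by the mask mean).
        if np.isclose(S.mean(), gamma) and rng.random() > Pm:
            return S
        idx1 = set(np.flatnonzero(S == 1).tolist())   # indices kept by the parent
        idx2, it = set(), 0
        # Resample an index subset of ratio gamma until the overlap with the
        # parent reaches alpha * N (threshold kept as quoted) or eta is hit.
        while len(idx1 & idx2) < alpha * N and it < eta:
            idx2 = set(rng.choice(N, size=int(gamma * N), replace=False).tolist())
            it += 1
        S_new = np.zeros(N, dtype=S.dtype)
        S_new[list(idx2)] = 1
        return S_new if it < eta else S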
Open Source Code | Yes | Code: https://github.com/shawnricecake/search-llm
Open Datasets | Yes | We compare the perplexity of the models on the WikiText2 [48] and PTB [49] datasets with a sequence length of 2048. We also compare the zero-shot accuracy on common reasoning zero-shot classification datasets including BoolQ [50], PIQA [51], HellaSwag [52], WinoGrande [53], ARC-easy [54], ARC-challenge [54], and OpenbookQA [55].
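As a concrete reference for that evaluation protocol, here is a hedged sketch of non-overlapping 2048-token perplexity evaluation on WikiText-2 using Hugging Face transformers/datasets; the checkpoint path is a placeholder, and the windowing scheme follows common LLM-Pruner-style pipelines rather than the paper's exact script.

    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def wikitext2_perplexity(model_path, seq_len=2048, device="cuda"):
        # model_path is a placeholder for whichever (pruned) checkpoint is evaluated.
        tok = AutoTokenizer.from_pretrained(model_path)
        model = AutoModelForCausalLM.from_pretrained(
            model_path, torch_dtype=torch.float16
        ).to(device).eval()
        test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
        ids = tok("\n\n".join(test["text"]), return_tensors="pt").input_ids.to(device)
        nlls = []
        # Non-overlapping 2048-token windows; each forward pass returns the mean token NLL.
        for i in range(0, ids.size(1) - seq_len + 1, seq_len):
            chunk = ids[:, i : i + seq_len]
            with torch.no_grad():
                nlls.append(model(chunk, labels=chunk).loss)
        return torch.exp(torch.stack(nlls).mean()).item()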
Dataset Splits | No | The paper refers to using "128 calibration samples" from the training split of WikiText2 for the reformation process. However, it does not provide explicit train/validation/test splits (e.g., percentages or specific counts for each partition) for the datasets used in the main evaluation (WikiText2, PTB, BoolQ, etc.). It mentions using the "same pipeline as LLM-Pruner [8]" for evaluation, but this external reference does not constitute an explicit split description within the paper itself.
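For the one split-related detail the paper does give (128 calibration samples from the WikiText-2 training split), the sketch below shows how such samples are typically drawn; the random-window strategy and the 2048 sequence length are assumptions, not the paper's documented procedure.

    import random
    import torch
    from datasets import load_dataset
    from transformers import AutoTokenizer

    def wikitext2_calibration(tokenizer_path, n_samples=128, seq_len=2048, seed=0):
        # Tokenize the WikiText-2 training split and cut out random token windows.
        tok = AutoTokenizer.from_pretrained(tokenizer_path)
        train = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
        ids = tok("\n\n".join(train["text"]), return_tensors="pt").input_ids
        rng = random.Random(seed)
        samples = []
        for _ in range(n_samples):
            start = rng.randint(0, ids.size(1) - seq_len - 1)
            samples.append(ids[:, start : start + seq_len])
        return torch.cat(samples, dim=0)    # shape: (n_samples, seq_len)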
Hardware Specification | Yes | We run the evolutionary search on NVIDIA A100 40GB GPUs. Specifically, to explore the subnets of LLaMA-7B, we complete the search on one GPU in around 5 hours.
Software Dependencies | No | The paper does not specify the versions of any software dependencies (e.g., specific Python libraries like PyTorch, TensorFlow, or CUDA versions) used for the experiments.
Experiment Setup | Yes | For the evolutionary search, we adopt the following hyper-parameters: the population size (N), the number of mutations (N_m), and the number of crossovers (N_c) are set to 100, 50, and 30, respectively. In each generation, the top 10 subnets are selected as parental candidates to produce offspring networks through mutation and crossover. The remaining subnets in the population are generated by mutation with larger randomness (i.e., the same as the initial mutation). The initial mutation probabilities (P_m^0 and P_s^0) are set at 0.6 and 0.3 to promote variability early in the search process. For subsequent generations, the mutation probabilities (P_m and P_s) are adjusted to 0.3 and 0.1, while the probability for depth (P_d) is maintained at 0.1. The similarity ratio α and the maximum iteration count η are set at 0.8 and 1000 in mask mutation. The total number of evolution epochs is 50. For the reformation, we adopt ρ = 1.0 and an iteration number of 30.
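To show how the quoted hyper-parameters fit together, here is a minimal sketch of the evolutionary search loop under those settings; random_subnet, evaluate, mutate, and crossover are hypothetical callables standing in for the paper's fitness metric and its mask mutation/crossover operators.

    import random

    # Hyper-parameters as quoted above.
    POP_SIZE, N_MUT, N_CROSS, TOP_K, EPOCHS = 100, 50, 30, 10, 50
    PM_INIT, PS_INIT = 0.6, 0.3     # initial (high-randomness) mutation probabilities
    PM, PS, PD = 0.3, 0.1, 0.1      # ongoing mutation and depth probabilities

    def evolutionary_search(random_subnet, evaluate, mutate, crossover):
        # Assumed interfaces: random_subnet() -> subnet, evaluate(subnet) -> fitness,
        # mutate(subnet, pm, ps, pd) -> subnet, crossover(a, b) -> subnet.
        population = [random_subnet() for _ in range(POP_SIZE)]
        for _ in range(EPOCHS):
            ranked = sorted(population, key=evaluate, reverse=True)
            parents = ranked[:TOP_K]                      # top-10 parental candidates
            offspring = [mutate(random.choice(parents), PM, PS, PD) for _ in range(N_MUT)]
            offspring += [crossover(*random.sample(parents, 2)) for _ in range(N_CROSS)]
            # The rest of the population is refilled with initial-style,
            # higher-randomness mutations.
            offspring += [mutate(random.choice(parents), PM_INIT, PS_INIT, PD)
                          for _ in range(POP_SIZE - TOP_K - N_MUT - N_CROSS)]
            population = parents + offspring
        return max(population, key=evaluate)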