Knowledge Fusion of Large Language Models
Authors: Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, Shuming Shi
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach using three popular LLMs with different architectures (Llama-2, MPT, and OpenLLaMA) across various benchmarks and tasks. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation. |
| Researcher Affiliation | Collaboration | 1School of Computer Science and Engineering, Sun Yat-sen University, China 2Tencent AI Lab |
| Pseudocode | Yes | Algorithm 1 FUSELLM for LLMs Fusion (a hedged sketch of the combined training objective is given after this table). |
| Open Source Code | Yes | Our code, model weights, and data are public at https://github.com/fanqiwan/FuseLLM. |
| Open Datasets | Yes | We have chosen MiniPile, a meticulously curated dataset resulting from a thorough clustering and filtering process. MiniPile comprises approximately 1 million documents across 22 domains and 1.8 billion tokens, constituting less than 0.1% of the 2 trillion training tokens of Llama-2. More dataset details can be found in Appendix B. MiniPile is curated from The Pile (Gao et al., 2020)... |
| Dataset Splits | No | The paper describes the “Dataset for continual training” (MiniPile) but does not specify how this dataset is split into training, validation, and test subsets for their own model development. Evaluations are performed on external benchmarks. |
| Hardware Specification | Yes | We train the target LLM of Llama-2 7B using a batch size of 128 and a maximum length of 2048 on a single node equipped with 8 NVIDIA A100 GPUs, each with 40GB of memory. |
| Software Dependencies | No | Our training framework is implemented based on the Huggingface Transformers (Wolf et al., 2020) and accelerated with Flash Attention (Dao et al., 2022). (No specific version numbers are provided for these software components). |
| Experiment Setup | Yes | We train the target LLM of Llama-2 7B using a batch size of 128 and a maximum length of 2048... We empirically set the combination weight λ in Eq. 5 to 0.9. The training consists of only a single epoch... Our model is optimized using the AdamW optimizer with β1 = 0.9 and β2 = 0.95, with gradient clipping set to 1.0 and weight decay to 0.1. A cosine learning rate schedule is employed, with a maximum learning rate of 1e-5 and a warmup ratio of 0.008. (A hypothetical wiring of these hyperparameters is sketched after this table.) |
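
Algorithm 1 (FUSELLM) trains the target model with a combined objective: the causal language-modeling loss plus a fusion loss that aligns the target's token-level distribution with the distribution fused from the source LLMs, traded off by the combination weight λ from Eq. 5. The snippet below is a minimal PyTorch sketch of that combined loss, not the authors' released implementation; it assumes the source-LLM distributions have already been token-aligned and fused offline, that λ weights the causal-LM term (consistent with λ = 0.9), and the names `fusellm_loss` and `fused_probs` are illustrative.

```python
import torch
import torch.nn.functional as F


def fusellm_loss(student_logits, fused_probs, labels, lam=0.9, ignore_index=-100):
    """FuseLLM-style combined objective: lam * CLM loss + (1 - lam) * fusion loss.

    student_logits: (batch, seq_len, vocab) logits from the target LLM.
    fused_probs:    (batch, seq_len, vocab) token-level distribution fused offline
                    from the source LLMs, assumed aligned to the target vocabulary.
    labels:         (batch, seq_len) gold next-token ids (already shifted), with
                    ignore_index marking padded positions.
    """
    vocab = student_logits.size(-1)

    # Standard causal language-modeling cross-entropy on the gold tokens.
    clm_loss = F.cross_entropy(
        student_logits.view(-1, vocab), labels.view(-1), ignore_index=ignore_index
    )

    # Fusion loss: cross-entropy of the target's distribution against the fused
    # distribution, averaged over non-padded positions.
    log_probs = F.log_softmax(student_logits, dim=-1)
    mask = (labels != ignore_index).unsqueeze(-1).float()
    fusion_loss = -(fused_probs * log_probs * mask).sum() / mask.sum().clamp(min=1.0)

    return lam * clm_loss + (1.0 - lam) * fusion_loss
```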
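
The reported training hyperparameters map directly onto standard PyTorch and Hugging Face Transformers components. The following is a hypothetical wiring of those values, not the authors' training script; `model` and `num_training_steps` are stand-in placeholders.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholders so the snippet runs stand-alone; in practice `model` is the
# Llama-2 7B target LLM and `num_training_steps` covers one epoch of MiniPile
# at batch size 128 and a maximum sequence length of 2048.
model = torch.nn.Linear(16, 16)
num_training_steps = 1000

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,             # maximum learning rate from the paper
    betas=(0.9, 0.95),   # beta1 / beta2 as reported
    weight_decay=0.1,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.008 * num_training_steps),  # warmup ratio 0.008
    num_training_steps=num_training_steps,
)

# During each training step, gradients are clipped to 1.0 before optimizer.step().
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```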