Knowledge Fusion of Large Language Models
Authors: Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, Shuming Shi
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach using three popular LLMs with different architectures (Llama-2, MPT, and OpenLLaMA) across various benchmarks and tasks. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation. |
| Researcher Affiliation | Collaboration | 1School of Computer Science and Engineering, Sun Yat-sen University, China 2Tencent AI Lab |
| Pseudocode | Yes | Algorithm 1 FUSELLM for LLMs Fusion (a hedged sketch of the combined training objective is given after this table). |
| Open Source Code | Yes | Our code, model weights, and data are public at https://github.com/fanqiwan/FuseLLM. |
| Open Datasets | Yes | We have chosen MiniPile, a meticulously curated dataset resulting from a thorough clustering and filtering process. MiniPile comprises approximately 1 million documents across 22 domains and 1.8 billion tokens, constituting less than 0.1% of the 2 trillion training tokens of Llama-2. More dataset details can be found in Appendix B. MiniPile is curated from The Pile (Gao et al., 2020)... |
| Dataset Splits | No | The paper describes the “Dataset for continual training” (MiniPile) but does not specify how this dataset is split into training, validation, and test subsets for their own model development. Evaluations are performed on external benchmarks. |
| Hardware Specification | Yes | We train the target LLM of Llama-2 7B using a batch size of 128 and a maximum length of 2048 on a single node equipped with 8 NVIDIA A100 GPUs, each with 40GB of memory. |
| Software Dependencies | No | Our training framework is implemented based on the Huggingface Transformers (Wolf et al., 2020) and accelerated with Flash Attention (Dao et al., 2022). (No specific version numbers are provided for these software components). |
| Experiment Setup | Yes | We train the target LLM of Llama-2 7B using a batch size of 128 and a maximum length of 2048... We empirically set the combination weight λ in Eq. 5 to 0.9. The training consists of only a single epoch... Our model is optimized using the AdamW optimizer with β1 = 0.9 and β2 = 0.95, with gradient clipping set to 1.0 and weight decay to 0.1. A cosine learning rate schedule is employed, with a maximum learning rate of 1e-5 and a warmup ratio of 0.008. (A hypothetical wiring of these hyperparameters is sketched after this table.) |
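
Algorithm 1 (FUSELLM) trains the target model with a combined objective: the causal language-modeling loss plus a fusion loss that aligns the target's token-level distribution with the distribution fused from the source LLMs, traded off by the combination weight λ from Eq. 5. The snippet below is a minimal PyTorch sketch of that combined loss, not the authors' released implementation; it assumes the source-LLM distributions have already been token-aligned and fused offline, that λ weights the causal-LM term (consistent with λ = 0.9), and the names `fusellm_loss` and `fused_probs` are illustrative.

```python
import torch
import torch.nn.functional as F


def fusellm_loss(student_logits, fused_probs, labels, lam=0.9, ignore_index=-100):
    """FuseLLM-style combined objective: lam * CLM loss + (1 - lam) * fusion loss.

    student_logits: (batch, seq_len, vocab) logits from the target LLM.
    fused_probs:    (batch, seq_len, vocab) token-level distribution fused offline
                    from the source LLMs, assumed aligned to the target vocabulary.
    labels:         (batch, seq_len) gold next-token ids (already shifted), with
                    ignore_index marking padded positions.
    """
    vocab = student_logits.size(-1)

    # Standard causal language-modeling cross-entropy on the gold tokens.
    clm_loss = F.cross_entropy(
        student_logits.view(-1, vocab), labels.view(-1), ignore_index=ignore_index
    )

    # Fusion loss: cross-entropy of the target's distribution against the fused
    # distribution, averaged over non-padded positions.
    log_probs = F.log_softmax(student_logits, dim=-1)
    mask = (labels != ignore_index).unsqueeze(-1).float()
    fusion_loss = -(fused_probs * log_probs * mask).sum() / mask.sum().clamp(min=1.0)

    return lam * clm_loss + (1.0 - lam) * fusion_loss
```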
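
The reported training hyperparameters map directly onto standard PyTorch and Hugging Face Transformers components. The following is a hypothetical wiring of those values, not the authors' training script; `model` and `num_training_steps` are stand-in placeholders.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholders so the snippet runs stand-alone; in practice `model` is the
# Llama-2 7B target LLM and `num_training_steps` covers one epoch of MiniPile
# at batch size 128 and a maximum sequence length of 2048.
model = torch.nn.Linear(16, 16)
num_training_steps = 1000

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,             # maximum learning rate from the paper
    betas=(0.9, 0.95),   # beta1 / beta2 as reported
    weight_decay=0.1,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.008 * num_training_steps),  # warmup ratio 0.008
    num_training_steps=num_training_steps,
)

# During each training step, gradients are clipped to 1.0 before optimizer.step().
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```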