Cross-model Control: Improving Multiple Large Language Models in One-time Training

Authors: Jiayi Wu, Hao Sun, Hengyi Cai, Lixin Su, Shuaiqiang Wang, Dawei Yin, Xiang Li, Ming Gao

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have conducted extensive experiments on instruction tuning and unlearning tasks, demonstrating the effectiveness of CMC.
Researcher Affiliation | Collaboration | Jiayi Wu (1), Hao Sun (2), Hengyi Cai (3), Lixin Su (4), Shuaiqiang Wang (4), Dawei Yin (4), Xiang Li (1), Ming Gao (1,5,6); (1) School of Data Science and Engineering, East China Normal University; (2) Peking University; (3) Chinese Academy of Sciences; (4) Baidu Inc.; (5) KLATASDS-MOE, School of Statistics, East China Normal University; (6) Guizhou Zhuwen ECNU Data Power Institute
Pseudocode | No | The paper describes methods verbally and with figures, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/wujwyi/CMC.
Open Datasets | Yes | "We utilized the GPT4-Alpaca dataset (Peng et al., 2023) to train our delta model. This dataset consists of 52k instruction-following examples, with instructions sourced from Stanford Alpaca data (Taori et al., 2023) and responses generated by GPT-4." and "We utilized the TOFU benchmark (Maini et al., 2024) to evaluate our approach." (A loading sketch for these datasets follows the table.)
Dataset Splits | No | The paper mentions training and evaluation datasets (GPT4-Alpaca, TOFU, AlpacaEval) but does not specify train/validation/test splits; it only indicates training on one dataset and evaluating on another.
Hardware Specification | Yes | Our experiments were conducted on a server equipped with 512GB of memory and 4 Nvidia A100 40G GPUs.
Software Dependencies | No | The paper mentions using the Llama architecture but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | During the training of the delta model, we set the learning rate to 2e-4, batch size to 64, and trained for 4 epochs. For LoRA fine-tuning, for both LLAMA2-7B and MISTRAL-7B models, we set r to 140 and alpha to 280, while for LLAMA2-13B, r is set to 90 and alpha to 180. For all models, the learning rate is 2e-5, batch size is 32, LoRA target modules are [q_proj, k_proj, v_proj], and the training lasted for 2 epochs. When training the expert models for Proxy-tuning, the learning rate was set to 2e-4, batch size to 64, and trained for 16 epochs. In training the delta model [for unlearning], we set the learning rate to 1e-4, batch size to 16, and trained for 20 epochs. For LoRA fine-tuning [for unlearning], across all models, r was set to 32, alpha to 64, with a learning rate of 1e-4, batch size of 16, and training spanned 5 epochs. (A fine-tuning configuration sketch follows the table.)
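
To make the Open Datasets row concrete, here is a hedged loading sketch. The Hugging Face Hub identifiers ("vicgalle/alpaca-gpt4", "locuslab/TOFU") and the "forget10" configuration are assumptions about commonly used public mirrors, not paths taken from the paper or its repository.

```python
# Hypothetical loading of the two public datasets named in the paper; the Hub
# identifiers and config name below are assumptions, not taken from the paper.
from datasets import load_dataset

# GPT4-Alpaca: 52k instruction-following examples (Alpaca instructions, GPT-4
# responses), used to train the delta model for instruction tuning.
gpt4_alpaca = load_dataset("vicgalle/alpaca-gpt4", split="train")

# TOFU benchmark (Maini et al., 2024), used to evaluate the unlearning task.
tofu_forget = load_dataset("locuslab/TOFU", "forget10", split="train")

print(gpt4_alpaca[0]["instruction"])  # Alpaca-style schema: instruction/input/output
print(tofu_forget[0]["question"])     # TOFU stores question/answer pairs
```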
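The instruction-tuning hyperparameters reported in the Experiment Setup row (r=140, alpha=280, learning rate 2e-5, batch size 32, 2 epochs for the 7B models) can be expressed as a minimal configuration sketch. The model identifier, target-module names, and the peft/transformers wiring are illustrative assumptions; this is not the authors' released training script.

```python
# Minimal sketch of the reported LoRA fine-tuning setup for the 7B models;
# model ID, target modules, and Trainer wiring are assumptions for illustration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=140,                # paper reports r=90 for LLAMA2-13B
    lora_alpha=280,       # paper reports alpha=180 for LLAMA2-13B
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

training_args = TrainingArguments(
    output_dir="cmc-lora-instruction-tuning",  # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=2,
)
# training_args would then be passed to a Trainer together with the GPT4-Alpaca data.
```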