Cross-model Control: Improving Multiple Large Language Models in One-time Training
Authors: Jiayi Wu, Hao Sun, Hengyi Cai, Lixin Su, Shuaiqiang Wang, Dawei Yin, Xiang Li, Ming Gao
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We have conducted extensive experiments on instruction tuning and unlearning tasks, demonstrating the effectiveness of CMC. |
| Researcher Affiliation | Collaboration | Jiayi Wu (1), Hao Sun (2), Hengyi Cai (3), Lixin Su (4), Shuaiqiang Wang (4), Dawei Yin (4), Xiang Li (1), Ming Gao (1,5,6); affiliations: (1) School of Data Science and Engineering, East China Normal University; (2) Peking University; (3) Chinese Academy of Sciences; (4) Baidu Inc; (5) KLATASDS-MOE, School of Statistics, East China Normal University; (6) Guizhou Zhuwen ECNU Data Power Institute |
| Pseudocode | No | The paper describes methods verbally and with figures, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/wujwyi/CMC. |
| Open Datasets | Yes | We utilized the GPT4-Alpaca dataset (Peng et al., 2023) to train our delta model. This dataset consists of 52k instruction-following examples, with instructions sourced from Stanford Alpaca data (Taori et al., 2023) and responses generated by GPT-4. We utilized the TOFU benchmark (Maini et al., 2024) to evaluate our approach. |
| Dataset Splits | No | The paper mentions training and evaluation datasets (GPT4-Alpaca, TOFU, AlpacaEval) but does not specify explicit train/validation/test splits; it only implies training on one dataset and evaluating on another. |
| Hardware Specification | Yes | Our experiments were conducted on a server equipped with 512GB of memory and 4 Nvidia A100 40G GPUs. |
| Software Dependencies | No | The paper mentions using the Llama architecture but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | During the training of the delta model, we set the learning rate to 2e-4, batch size to 64, and trained for 4 epochs. For LoRA fine-tuning, for both LLAMA2-7B and MISTRAL-7B models, we set r to 140 and alpha to 280, while for LLAMA2-13B, r is set to 90 and alpha to 180. For all models, the learning rate is 2e-5, batch size is 32, the LoRA targets are [q_proj,k_proj,k_proj], and the training lasted for 2 epochs. When training the expert models for Proxy-tuning, the learning rate was set to 2e-4, batch size to 64, and trained for 16 epochs. In training the delta model [for unlearning], we set the learning rate to 1e-4, batch size to 16, and trained for 20 epochs. For LoRA fine-tuning [for unlearning], across all models, r was set to 32, alpha to 64, with a learning rate of 1e-4, batch size of 16, and training spanned 5 epochs. |
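
To make the reported instruction-tuning hyperparameters concrete, the snippet below is a minimal sketch of the corresponding LoRA configuration using the Hugging Face transformers and peft libraries. This is not the authors' released code (see the CMC repository linked above for that): the model checkpoint name, output path, and per-device batch size are illustrative assumptions, and the paper's target-module list [q_proj,k_proj,k_proj] is interpreted here as [q_proj,k_proj,v_proj].

```python
# Hedged sketch of the reported LoRA fine-tuning setup (instruction tuning),
# using the Hugging Face transformers + peft stack. Not the authors' code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any LLaMA2-7B checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA hyperparameters as reported for LLAMA2-7B / MISTRAL-7B: r=140, alpha=280.
lora_config = LoraConfig(
    r=140,
    lora_alpha=280,
    # Paper lists [q_proj,k_proj,k_proj]; v_proj here is an assumption.
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Optimization settings as reported: learning rate 2e-5, batch size 32, 2 epochs.
training_args = TrainingArguments(
    output_dir="lora-instruction-tuning",  # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=8,  # 8 x 4 A100 GPUs = effective batch size 32 (assumed split)
    num_train_epochs=2,
    bf16=True,
)
# These arguments would then be passed to a Trainer (or SFT-style trainer)
# together with the GPT4-Alpaca instruction data.
```

For the unlearning setup, the same skeleton applies with the reported values swapped in: r=32, alpha=64, learning rate 1e-4, batch size 16, and 5 epochs.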