Adversarial Moment-Matching Distillation of Large Language Models

Author: Chen Jia

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Results from both task-agnostic instruction-following experiments and task-specific experiments demonstrate the effectiveness of our method and achieve new state-of-the-art performance. Empirically, we evaluate our approach on both the instruction-following dataset and three task-specific datasets for text summarization, machine translation, and commonsense reasoning." |
| Researcher Affiliation | Industry | Chen Jia, SI-TECH Information Technology, jiachenwestlake@gmail.com |
| Pseudocode | Yes | Algorithm 1: Adversarial training procedure (Page 5) |
| Open Source Code | Yes | "The code and implementation are released at https://github.com/jiachenwestlake/MMKD." |
| Open Datasets | Yes | "We construct the training data from databricks-dolly-15k [8], where we randomly select 15K samples for training and equally split 500 samples for validation and testing." We also add the Open Web Text [13] corpus. For the text summarization task, we follow Ko et al. [21] and use the SAMSum [12] dataset; for machine translation, we follow Ko et al. [21] and use the IWSLT 17 (en-de) [5] dataset; for commonsense reasoning, we use the StrategyQA dataset [11]. |
| Dataset Splits | Yes | "We construct the training data from databricks-dolly-15k [8], where we randomly select 15K samples for training and equally split 500 samples for validation and testing." |
| Hardware Specification | Yes | "We use NVIDIA A40 GPUs with 40GB RAM to conduct all the experiments." (Appendix B.1) |
| Software Dependencies | No | The paper does not explicitly list software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x). |
| Experiment Setup | Yes | "More details about the experimental setup refer to Appendix B." (Tables 3 and 4 in Appendix B list detailed hyperparameters such as max step size, inner step size, batch size, and learning rate.) |
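The dataset-splits row above describes a shuffled random split of databricks-dolly-15k with held-out validation and test portions. A minimal sketch of such a split is given below; the random seed, the helper name `split_dataset`, the placeholder records, and the reading of "500 samples" as 500 each for validation and testing are assumptions, not taken from the paper.

```python
import random

def split_dataset(examples, n_val=500, n_test=500, seed=0):
    """Shuffle a list of examples and slice it into train/val/test.

    Hypothetical helper mirroring the paper's described split of
    databricks-dolly-15k; seed and slice order are assumptions.
    """
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test

# Toy usage with placeholder records standing in for dolly-15k rows.
data = [{"instruction": f"q{i}", "response": f"a{i}"} for i in range(15011)]
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))  # → 14011 500 500
```

The shuffle-then-slice pattern guarantees the three portions are disjoint, and fixing the seed keeps the split reproducible across runs.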