Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Adversarial Moment-Matching Distillation of Large Language Models
Authors: Chen Jia
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Results from both task-agnostic instruction-following experiments and task-specific experiments demonstrate the effectiveness of our method and achieve new state-of-the-art performance. Empirically, we evaluate our approach on both the instruction-following dataset and three task-specific datasets for text summarization, machine translation, and commonsense reasoning. |
| Researcher Affiliation | Industry | Chen Jia SI-TECH Information Technology EMAIL |
| Pseudocode | Yes | Algorithm 1: Adversarial training procedure (Page 5) |
| Open Source Code | Yes | The code and implementation are released at https://github.com/jiachenwestlake/MMKD. |
| Open Datasets | Yes | We construct the training data from databricks-dolly-15k [8], where we randomly select 15K samples for training and equally split 500 samples for validation and testing. We also add the Open Web Text [13] corpus. For the text summarization task, we follow Ko et al. [21] to conduct experiments on the SAMSum [12] dataset. For the machine translation tasks, we follow Ko et al. [21] to conduct experiments on the IWSLT 17 (en-de) [5] dataset. For the commonsense reasoning task, we conduct experiments on the Strategy QA dataset [11]. |
| Dataset Splits | Yes | We construct the training data from databricks-dolly-15k [8], where we randomly select 15K samples for training and equally split 500 samples for validation and testing. |
| Hardware Specification | Yes | We use NVIDIA A40 GPUs with 40GB RAM to conduct all the experiments. (Appendix B.1) |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, etc.). |
| Experiment Setup | Yes | More details on experimental setup refer to Appendix B. (Tables 3 and 4 in Appendix B list detailed hyperparameters such as Max. Step Size, Inner Step Size, Batch Size, Learning Rate, etc.) |