Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Token-Level Self-Play with Importance-Aware Guidance for Large Language Models

Authors: Tue Le, Hoang Tran, Quyen Tran, Linh Ngo, Mehrtash Harandi, Trung Le

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on diverse benchmarks and settings demonstrate that SWIFT consistently surpasses both existing alignment approaches and conventional knowledge distillation methods. We evaluate SWIFT through extensive experiments across diverse settings, highlighting several key findings.
Researcher Affiliation | Academia | Tue Le, Hanoi University of Science and Technology, Hanoi, Vietnam; Hoang Tran Vuong, Hanoi University of Science and Technology, Hanoi, Vietnam; Quyen Tran, Rutgers University, New Jersey, US; Linh Ngo Van, Hanoi University of Science and Technology, Hanoi, Vietnam; Mehrtash Harandi, Monash University, Clayton, VIC 3800, Australia; Trung Le, Monash University, Clayton, VIC 3800, Australia
Pseudocode | Yes | Pseudo-code for our method can be found in Appendix B.4 in the supplementary material. Algorithm 1: Self-Play Weighted Fine-Tuning (SWIFT). Algorithm 2: Teacher-Guided Token Importance Estimation.
Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We released our code and datasets.
Open Datasets | Yes | For the Alignment setting, we use Qwen1.5-1.8B [26] as the base model. As the teacher model, we adopt Zephyr-7B-SFT-Full [27], which is based on Mistral-7B [28] and further fine-tuned on the Ultrachat200k dataset provided by Hugging Face. Ultrachat200k is a curated 200k subset of the UltraChat corpus [45]... For the Knowledge Distillation setting... Four datasets are selected for evaluation: DATABRICKS-DOLLY-15K (Dolly) [33], ALPACA (Alpaca) [34], S-NI (S-NI) [35], and DIALOGSUM (Dialogsum) [36].
Dataset Splits | Yes | In contrast, we construct separate training, validation, and testing splits for each domain, allowing for a more targeted evaluation of knowledge distillation within the same domain. The details of the datasets are provided in Table 5 below. Table 5: Dataset Statistics (Train, Validation, Test columns with counts)
Hardware Specification | Yes | All experiments are conducted on 2 NVIDIA RTX 4090 GPUs. To further quantify the computational overhead of SWIFT, we report the GPU-hours required for each stage of the SWIFT pipeline on the 50k Ultrachat subset using a single NVIDIA H100 GPU.
Software Dependencies | No | To reduce training costs and memory consumption, we employ DeepSpeed ZeRO-3 [53] and FlashAttention-2 [54] throughout all training iterations.
Experiment Setup | Yes | To reduce training costs and memory consumption, we employ DeepSpeed ZeRO-3 [53] and FlashAttention-2 [54] throughout all training iterations. Models are trained using the RMSProp optimizer [55] without weight decay, following standard practice for LLM alignment fine-tuning. We set the global batch size to 2, use bfloat16 precision, and apply a 10% linear warmup at the start of each iteration. The peak learning rate is set to 5e-7 for iterations 0 and 1, and 1e-7 for iterations 2 and 3 as training approaches convergence. Each iteration is trained for 2 epochs with a maximum sequence length of 2048 tokens. For token importance estimation as defined in Equation 14 in the main paper, we set µ = 1, with lower and upper clipping bounds L = 0.5 and U = 1.5, respectively. The hyperparameter k is fixed to 1.
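
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration, which may help when checking a reimplementation against the reported values. The following is a minimal Python sketch, not the authors' code. The names `SwiftSetup`, `peak_lr`, `lr_at_step`, and `clip_importance` are hypothetical. Holding the learning rate constant after warmup is an assumption, since the quoted text only specifies the warmup and the peak. Equation 14 is not reproduced on this page, so the sketch only clips an already computed importance weight to [L, U] and stores µ and k without using them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SwiftSetup:
    """Values quoted in the Experiment Setup row (field names are hypothetical)."""
    global_batch_size: int = 2
    precision: str = "bfloat16"
    warmup_ratio: float = 0.10      # 10% linear warmup per iteration
    epochs_per_iteration: int = 2
    max_seq_len: int = 2048
    mu: float = 1.0                 # role defined by Equation 14, not reproduced here
    k: int = 1                      # role defined in the paper, not reproduced here
    clip_lower: float = 0.5         # L
    clip_upper: float = 1.5         # U


def peak_lr(iteration: int) -> float:
    """Peak learning rate: 5e-7 for iterations 0 and 1, 1e-7 for iterations 2 and 3."""
    if iteration in (0, 1):
        return 5e-7
    if iteration in (2, 3):
        return 1e-7
    raise ValueError("the quoted setup only covers iterations 0 to 3")


def lr_at_step(step: int, total_steps: int, iteration: int,
               cfg: SwiftSetup = SwiftSetup()) -> float:
    """Linear warmup to the peak over the first 10% of steps.

    Assumption: the rate stays at the peak after warmup; the quoted text
    does not say whether a decay schedule follows.
    """
    warmup_steps = max(1, round(cfg.warmup_ratio * total_steps))
    peak = peak_lr(iteration)
    if step < warmup_steps:
        return peak * ((step + 1) / warmup_steps)
    return peak


def clip_importance(weight: float, cfg: SwiftSetup = SwiftSetup()) -> float:
    """Clip a precomputed token importance weight to [L, U]."""
    return min(max(weight, cfg.clip_lower), cfg.clip_upper)
```

For example, with 100 optimizer steps in iteration 0, `lr_at_step` rises linearly over the first 10 steps and then stays at 5e-7, while `clip_importance` maps any weight below 0.5 or above 1.5 onto the nearest bound.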