Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Whose Instructions Count? Resolving Preference Bias in Instruction Fine-Tuning

Authors: Jiayu Zhang, Changbang Li, Yinan Peng, Weihao Luo, Peilai Yu, Xuan Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	On five Super/GLUE tasks and the ALPACA set plus six preference-shifted variants, DCPC boosts accuracy/F1-EM by 4.0–6.7 points and gpt-score by +0.7, while cutting inter-seed variance up to 35% on LLaMA-2 13B and Mistral-7B, setting a new state of the art for robust instruction tuning.
Researcher Affiliation	Collaboration	(1) Airon Technology CO., LTD; (2) Peking University; (3) University of Pennsylvania; (4) Hengxin Technology Ltd.; (5) Donghua University; (6) Ludwig Maximilian University of Munich; (7) Carnegie Mellon University
Pseudocode	No	The paper describes methods and processes through textual explanations and mathematical equations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	Answer: [No] Justification: Due to institutional restrictions and proprietary considerations, the data and code used in this study are not publicly available at this time. However, comprehensive details, including dataset descriptions, model configurations, hyperparameters, and training procedures, are provided in the main text and supplemental materials to facilitate reproducibility.
Open Datasets	Yes	We evaluate the performance of DCPC framework using a variety of datasets that involve subjective labeling or human preference discrepancies: (a) three tasks from SuperGLUE benchmark (BoolQ, COPA, and ReCoRD) [Wang et al., 2019]. (b) two tasks from GLUE benchmark (SST-2 and RTE) [Wang, 2018]. (c) Alpaca Dataset [Taori et al., 2023b].
Dataset Splits	Yes	For SST-2, RTE, BoolQ, and COPA, we measure performance based on the accuracy of the model's predictions (denoted as acc), which reflects the proportion of correct answers compared to ground truth labels. For ReCoRD, we calculate both the F1 score and the exact match (EM) score. The final evaluation metric for ReCoRD is the average of these two scores (denoted as f1-em). For the Alpaca dataset and its modified versions, we leverage GPT-4o as an evaluator to assign a quantitative score to each response, based on coherence, completeness, and adherence to the task instructions. The average score provided by GPT-4o on a scale from 1 to 10 (denoted as gpt-score) is used as the primary performance metric for instruction-tuning tasks.
Hardware Specification	Yes	All experiments are conducted using NVIDIA A100.
Software Dependencies	No	We fine-tune the LLaMA-2 7B and 13B models using the Hugging Face Transformers library.
Experiment Setup	Yes	The hyperparameters of the DCPC framework are set as follows: (a) the length of the prefix embeddings m is fixed at 16, (b) the meta-matrix M in the Preference Correction Module (PCM) is configured with dimensions m × d, where d = 4096 for LLaMA-2 7B and d = 5120 for LLaMA-2 13B, corresponding to the hidden dimension of each model. (c) The cross-layer alignment similarity threshold τcos is set to 0.85, and the ambiguity loss threshold τambiguity is set to 0.3. We fine-tune the LLaMA-2 7B and 13B models using the Hugging Face Transformers library. The maximum sequence length is set to 2048 tokens for both models, and training runs for up to 10 epochs. The batch size is 16 for smaller datasets (e.g., SST-2 and RTE) and 64 for larger datasets (e.g., ReCoRD and BoolQ). We employ the AdamW optimizer with an initial learning rate of 1 × 10⁻⁴, utilizing a linear learning rate decay and a warm-up phase covering 6% of the training steps. Evaluation is performed on the development set every 200 steps, and early stopping is applied if no improvement is observed after 10 evaluations. The best checkpoint based on the development set is used for final testing.
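
Since the code for this paper is not public, the following is only a minimal Python sketch of the learning-rate schedule and early-stopping rule quoted in the Experiment Setup row, not the authors' implementation. The function and class names are hypothetical. Two details are assumptions the quoted text does not state: the warm-up starts from a learning rate of zero, and the linear decay ends at zero on the final step (the same shape as `get_linear_schedule_with_warmup` in Hugging Face Transformers).

```python
from dataclasses import dataclass

BASE_LR = 1e-4          # initial learning rate quoted in the setup
WARMUP_FRACTION = 0.06  # warm-up covers 6% of the training steps
EVAL_EVERY = 200        # dev-set evaluation interval, in steps
PATIENCE = 10           # evaluations without improvement before stopping


def linear_warmup_decay_lr(step: int, total_steps: int) -> float:
    """Learning rate at a 0-indexed step.

    Assumption: warm-up rises linearly from 0 to BASE_LR, then the rate
    decays linearly to 0 at total_steps.
    """
    warmup_steps = max(1, round(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / max(1, total_steps - warmup_steps)


@dataclass
class EarlyStopper:
    """Tracks the best dev score and signals when patience runs out."""

    patience: int = PATIENCE
    best_score: float = float("-inf")
    best_step: int = -1
    stale_evals: int = 0

    def update(self, step: int, dev_score: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if dev_score > self.best_score:
            # New best checkpoint: remember it and reset the counter.
            self.best_score = dev_score
            self.best_step = step
            self.stale_evals = 0
        else:
            self.stale_evals += 1
        return self.stale_evals >= self.patience
```

In a training loop, `update` would be called once every `EVAL_EVERY` steps with the development-set score, and `best_step` identifies the checkpoint to reload for final testing.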