Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning

Authors: Jiaru Zou, Yikun Ban, Zihao Li, Yunzhe Qi, Ruizhong Qiu, Ling Yang, Jingrui He

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on 12 benchmarks spanning commonsense, arithmetic, and recommendation tasks demonstrate that Transformer Copilot consistently improves performance by up to 34.5%, while introducing marginal computational overhead to Pilot models and exhibiting strong scalability and transferability.
Researcher Affiliation	Academia	(1) University of Illinois Urbana-Champaign, (2) Princeton University
Pseudocode	Yes	Algorithm 1: Transformer Copilot (Training Paradigm); Algorithm 2: Inference Paradigm
Open Source Code	Yes	Code: https://github.com/jiaruzouu/TransformerCopilot
Open Datasets	Yes	To comprehensively evaluate T-Copilot, we utilize a broad suite of reasoning and generation tasks: (i) Commonsense reasoning: PIQA [10], HellaSwag [93], WinoGrande [68], BoolQ [18], SIQA [70], and OpenBookQA (OBQA) [55]; (ii) Arithmetic reasoning: AQuA [48], GSM8K [19], MAWPS [42], and SVAMP [60]; and (iii) Downstream Recommendation: Beauty [30] and LastFM [67]. Detailed dataset descriptions are provided in Appendix D.
Dataset Splits	Yes	Each dataset's individual test set is used for evaluation. Both fine-tuning and testing data instances utilize zero-shot input prompts.
Hardware Specification	Yes	All experiments are conducted on NVIDIA A100 GPUs.
Software Dependencies	No	We modify the generate in Hugging Face Transformers [22] to perform token-level logits fusion and rectified next-token generation during inference.
Experiment Setup	Yes	We use the AdamW optimizer and Cosine learning rate scheduler for both Pilot and Copilot models. Table 5: Hyperparameter configuration of Transformer Copilot for LLaMA-3 and Qwen-2.5 series models on the Commonsense Reasoning Tasks. Table 6: Hyperparameter configuration of Transformer Copilot for LLaMA-3 and Qwen-2.5 series models on the Arithmetic Reasoning Tasks. Table 7: Hyperparameter configuration of Transformer Copilot for LLaMA-3.2-1B, LLaMA-3.2-3B, and LLaMA-3.1-8B on the Downstream Recommendation Tasks. Table 8: Hyperparameter configuration of Transformer Copilot for FLAN-T5-small/base/large on the Commonsense Reasoning Tasks. Table 9: Hyperparameter configuration of Transformer Copilot for FLAN-T5-small/base/large on the Arithmetic Reasoning Tasks. Table 10: Hyperparameter configuration of Transformer Copilot for T5-small/base on the Downstream Recommendation Tasks.
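
Illustrative note: the Software Dependencies row mentions "token-level logits fusion and rectified next-token generation" without saying what that involves. The sketch below is one possible reading, not the authors' implementation. It assumes the fusion is a weighted addition of a Copilot correction to the Pilot logits, followed by greedy selection. The function names, the additive rule, and the weight `lam` are all hypothetical and are not taken from the paper or from Hugging Face Transformers.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(pilot_logits, copilot_correction, lam=1.0):
    """Hypothetical fusion rule: pilot + lam * correction, per vocabulary entry.

    The additive form and the weight `lam` are assumptions for illustration.
    """
    if len(pilot_logits) != len(copilot_correction):
        raise ValueError("vocabulary sizes must match")
    return [p + lam * c for p, c in zip(pilot_logits, copilot_correction)]


def greedy_next_token(pilot_logits, copilot_correction, lam=1.0):
    """Pick the next token id from the fused ("rectified") distribution."""
    probs = softmax(fuse_logits(pilot_logits, copilot_correction, lam))
    return max(range(len(probs)), key=probs.__getitem__)


if __name__ == "__main__":
    pilot = [2.0, 1.5, 0.1]        # toy logits over a 3-token vocabulary
    correction = [-1.0, 1.0, 0.0]  # toy correction signal
    print(greedy_next_token(pilot, correction, lam=0.0))  # Pilot alone
    print(greedy_next_token(pilot, correction, lam=1.0))  # with correction
```

Under this reading, setting `lam` to 0 recovers the Pilot's own prediction, so the correction strength can be varied on its own. For how the fusion is actually wired into generation, consult the linked repository.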