Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models
Authors: Luohe Shi, Yao Yao, Zuchao Li, Lefei Zhang, Hai Zhao
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental evaluations on various LLMs using different benchmarks demonstrate that RTD establishes a new paradigm for augmenting models to downstream tasks. Furthermore, our method exhibits strong orthogonality with traditional methods, allowing for concurrent usage. |
| Researcher Affiliation | Academia | National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, 430072, P. R. China; Department of Computer Science and Engineering, Shanghai Jiao Tong University |
| Pseudocode | No | The paper does not include a section or figure explicitly labeled 'Pseudocode' or 'Algorithm', nor does it present any structured code-like blocks. |
| Open Source Code | Yes | Our code can be found at https://github.com/ShiLuohe/ReferenceTrustableDecoding |
| Open Datasets | Yes | Testing benchmarks are: Massive Multitask Language Understanding (MMLU) [15], AI2 Reasoning Challenge (ARC, both Easy (E) and Challenge (C) parts) [4], Reasoning about Physical Commonsense in Natural Language (PIQA) [5], Open Book Question Answering (OBQA) [30], and Massive Multitask Language Understanding in Chinese (CMMLU) [25]. [...] To generate reference datastores, LLMs are shown to the questions and options in the training split of the benchmarks and we store the attention output. |
| Dataset Splits | No | The paper mentions 'training split' and 'test set' explicitly but does not specify a distinct validation set with quantitative details such as percentages or sample counts for the main experiments. Appendix D mentions 'Max Seq. Len. 4096' for LoRA tuning, but does not specify a validation split. |
| Hardware Specification | Yes | All testing are done on a server with 8*A100 80G SXM. For models with less than 15B parameters, 2 of 8 GPUs are used. For models with more than 15B parameters, 4 of 8 GPUs are used. |
| Software Dependencies | No | All testing are carried out under Hugging Face Transformers library [43]. While the software is mentioned, a specific version number for the 'Hugging Face Transformers library' or any other key software component is not provided. |
| Experiment Setup | Yes | If not tuned, we set k = 1024, s L = 19, 828, λ = 1 and T = 750 by default. [...] The hyperparameters of LoRA are in Appendix D. Table 11: LoRA Hyper-parameters: Batch Size 4, Epochs 2, Max Seq. Len. 4096, LoRA Target {Q, K, V, O, Up, Down, Gate}_proj, LoRA Rank 16, LoRA α 32, LoRA dropout 0.01, Learning Rate 1e-5, Optimizer AdamW, Adam RMS ϵ 2e-4, Adam β (0.9, 0.999), Adam Weight Decay 0.01, Scheduler Constant LR. |
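
For readers who want to reuse the reported settings, the LoRA hyperparameters and the GPU allocation rule quoted in the table can be collected into a small configuration object. This is a minimal sketch, not the authors' code: the class name `LoraSetup`, the helper `gpus_for_model`, all field names, and the lowercase module names (`q_proj`, etc.) are assumptions made for illustration. The paper does not say how many GPUs are used for a model with exactly 15B parameters, so the sketch arbitrarily groups that case with the larger models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSetup:
    """LoRA settings as quoted from Table 11; field names are hypothetical."""

    batch_size: int = 4
    epochs: int = 2
    max_seq_len: int = 4096
    # The table lists {Q, K, V, O, Up, Down, Gate}_proj; the lowercase
    # spellings below are an assumption about the module names.
    target_modules: tuple = (
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "down_proj", "gate_proj",
    )
    rank: int = 16
    alpha: int = 32
    dropout: float = 0.01
    learning_rate: float = 1e-5
    optimizer: str = "AdamW"
    adam_eps: float = 2e-4
    adam_betas: tuple = (0.9, 0.999)
    weight_decay: float = 0.01
    scheduler: str = "constant"

    @property
    def scaling(self) -> float:
        # Standard LoRA scaling factor: alpha divided by rank.
        return self.alpha / self.rank


def gpus_for_model(params_billions: float) -> int:
    """GPUs used per the hardware row: 2 below 15B parameters, 4 above.

    The exactly-15B case is not specified in the source; treating it
    like the larger models is an assumption.
    """
    return 2 if params_billions < 15 else 4


if __name__ == "__main__":
    cfg = LoraSetup()
    print(cfg)
    print("LoRA scaling:", cfg.scaling)
    print("GPUs for a 7B model:", gpus_for_model(7))
```

Keeping the values in a frozen dataclass makes the reported setup easy to log alongside results; it also makes explicit that software versions, which the paper does not report, would still need to be recorded separately.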