Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Distribution-Aligned Decoding for Efficient LLM Task Adaptation

Authors: Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Yong Dai, Sam Kwong, Yuguang Fang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Across three tasks and nine benchmarks, SVDecode paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 percentage points and open-ended truthfulness by 2 percentage points, with similar gains (1-2 percentage points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter.
Researcher Affiliation	Academia	1 Hong Kong JC STEM Lab of Smart City, 2 City University of Hong Kong, 3 University of Sussex, 4 Huazhong University of Science and Technology, 5 Fudan University, 6 Lingnan University
Pseudocode	Yes	Algorithm 1: Task-Aware Steering Vector for LLM Decoding; Algorithm 2: Computing the Global Steering Constant µ
Open Source Code	Yes	Justification: We have provided the code in the supplemental material. Once the paper is accepted, we will release the code.
Open Datasets	Yes	1. Multiple-Choice Tasks. For multiple-choice and open-ended generation tasks, we evaluate on the TruthfulQA dataset [21]... 3. Commonsense Reasoning Tasks. For commonsense reasoning tasks, we leverage eight datasets including BoolQ [22], PIQA [23], SIQA [24], HellaSwag [25], WinoGrande [26], ARC-easy [27], ARC-challenge [27] and OBQA [28]
Dataset Splits	Yes	"Split D_task into training set D_train and calibration set D_calib." "Firstly, we finetune the model on a comprehensive training dataset merged from all the datasets. Then, we evaluate the method on each task's test set."
Hardware Specification	No	The paper mentions "training a LLaMA-7B model demands at least 58 GB of memory [9], which is beyond the capacity of consumer-grade hardware like the NVIDIA RTX 4090 with 24GB", but this is an example of hardware capacity in general, not explicitly stating that their experiments were run on this or any other specific hardware.
Software Dependencies	No	The paper does not provide specific version numbers for any software dependencies, libraries, or programming languages used in the implementation of the experiments.
Experiment Setup	Yes	Table 6: Implementation Details of SVDecode. Warm-start Steps (Epochs): 1; α in Confidence-aware Constraint: 0.1; λ in Confidence-aware Constraint: -inf; Default Decoding Strategy: Greedy Search. Table 7: Hyperparameters for PEFT Methods (columns: LoRA, IA3, Prompt, P-T). LoRA Rank: 8; LoRA α: 16; LoRA Dropout: 0.1; Num Virtual Tokens: 20; Prefix Projection: False; Encoder Hidden Size: 128; Encoder Num Layers: 2; Learning Rate: 5e-5; Epochs: 1; Train Batch Size: 1; Eval Batch Size: 2; Max Seq Length: 512; FP16: True
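
For anyone attempting a rerun, the values quoted from Tables 6 and 7 could be collected into one configuration object. The sketch below is only an illustration: the class and field names are hypothetical, and the grouping of rows (LoRA-specific, prompt-based, shared) is an assumption inferred from the parameter names, since the per-method columns of Table 7 did not survive extraction.

```python
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class SVDecodeSettings:
    # Values quoted from Table 6; field names are hypothetical.
    warm_start_epochs: int = 1
    alpha: float = 0.1               # alpha in the confidence-aware constraint
    lam: float = float("-inf")       # lambda in the confidence-aware constraint
    decoding_strategy: str = "greedy"


@dataclass(frozen=True)
class LoraSettings:
    # Assumed to apply to the LoRA column only.
    rank: int = 8
    alpha: int = 16
    dropout: float = 0.1


@dataclass(frozen=True)
class PromptSettings:
    # Assumed to apply to the prompt-based methods (Prompt, P-T).
    num_virtual_tokens: int = 20
    prefix_projection: bool = False
    encoder_hidden_size: int = 128
    encoder_num_layers: int = 2


@dataclass(frozen=True)
class TrainingSettings:
    # Assumed to be shared across all four PEFT methods.
    learning_rate: float = 5e-5
    epochs: int = 1
    train_batch_size: int = 1
    eval_batch_size: int = 2
    max_seq_length: int = 512
    fp16: bool = True


@dataclass(frozen=True)
class ExperimentConfig:
    svdecode: SVDecodeSettings = field(default_factory=SVDecodeSettings)
    lora: LoraSettings = field(default_factory=LoraSettings)
    prompt: PromptSettings = field(default_factory=PromptSettings)
    training: TrainingSettings = field(default_factory=TrainingSettings)


config = ExperimentConfig()
config_dict = asdict(config)  # nested plain dict, convenient for logging
```

Keeping the reported values in a frozen dataclass makes it harder to change a setting by accident between runs. Because the paper reports no software versions or hardware, those would still need to be recorded separately.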