Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Distribution-Aligned Decoding for Efficient LLM Task Adaptation
Authors: Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Yong Dai, Sam Kwong, Yuguang Fang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across three tasks and nine benchmarks, SVDecode paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 percentage points and open-ended truthfulness by 2 percentage points, with similar gains (1-2 percentage points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. |
| Researcher Affiliation | Academia | (1) Hong Kong JC STEM Lab of Smart City, (2) City University of Hong Kong, (3) University of Sussex, (4) Huazhong University of Science and Technology, (5) Fudan University, (6) Lingnan University |
| Pseudocode | Yes | Algorithm 1: Task-Aware Steering Vector for LLM Decoding; Algorithm 2: Computing the Global Steering Constant µ |
| Open Source Code | Yes | Justification: We have provided the code in the supplemental material. Once the paper is accepted, we will release the code. |
| Open Datasets | Yes | 1. Multiple-Choice Tasks. For multiple-choice and open-ended generation tasks, we evaluate on the TruthfulQA dataset [21]... 3. Commonsense Reasoning Tasks. For commonsense reasoning tasks, we leverage eight datasets including BoolQ [22], PIQA [23], SIQA [24], HellaSwag [25], WinoGrande [26], ARC-easy [27], ARC-challenge [27] and OBQA [28] |
| Dataset Splits | Yes | Split D_task into training set D_train and calibration set D_calib... Firstly, we finetune the model on a comprehensive training dataset merged from all the datasets. Then, we evaluate the method on each task's test set. |
| Hardware Specification | No | The paper mentions "training a LLaMA-7B model demands at least 58 GB of memory [9], which is beyond the capacity of consumer-grade hardware like the NVIDIA RTX 4090 with 24GB", but this illustrates hardware limits in general; it does not state that the experiments were run on this or any other specific hardware. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or programming languages used in the implementation of the experiments. |
| Experiment Setup | Yes | Table 6: Implementation Details of SVDecode. Warm-start Steps (Epochs): 1; α in Confidence-aware Constraint: 0.1; λ in Confidence-aware Constraint: -inf; Default Decoding Strategy: Greedy Search. Table 7: Hyperparameters for PEFT Methods (columns: LoRA, IA3, Prompt, P-T). LoRA Rank: 8; LoRA α: 16; LoRA Dropout: 0.1; Num Virtual Tokens: 20; Prefix Projection: False; Encoder Hidden Size: 128; Encoder Num Layers: 2; Learning Rate: 5e-5; Epochs: 1; Train Batch Size: 1; Eval Batch Size: 2; Max Seq Length: 512; FP16: True |
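
The settings quoted in the Experiment Setup row can be collected into a small configuration object when attempting a reproduction. The sketch below is illustrative only: the class and field names are hypothetical and not taken from the authors' code, and the PEFT values are kept flat because the extract does not preserve which column (LoRA, IA3, Prompt, P-T) each value belongs to. The `lora_scaling` helper assumes the common LoRA convention of scaling the update by α divided by the rank.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class SVDecodeSettings:
    """Values quoted from Table 6; field names are hypothetical."""
    warm_start_epochs: int = 1
    alpha: float = 0.1            # α in the confidence-aware constraint
    lam: float = float("-inf")    # λ in the confidence-aware constraint
    decoding: str = "greedy"      # default decoding strategy


@dataclass(frozen=True)
class PEFTSettings:
    """Values quoted from Table 7, kept flat (per-method columns unknown)."""
    lora_rank: int = 8
    lora_alpha: int = 16
    lora_dropout: float = 0.1
    num_virtual_tokens: int = 20
    prefix_projection: bool = False
    encoder_hidden_size: int = 128
    encoder_num_layers: int = 2
    learning_rate: float = 5e-5
    epochs: int = 1
    train_batch_size: int = 1
    eval_batch_size: int = 2
    max_seq_length: int = 512
    fp16: bool = True


def lora_scaling(cfg: PEFTSettings) -> float:
    """Assumed convention: LoRA update is scaled by alpha / rank."""
    return cfg.lora_alpha / cfg.lora_rank


if __name__ == "__main__":
    # Print every reported setting so a run log records what was used.
    for name, value in {**asdict(SVDecodeSettings()), **asdict(PEFTSettings())}.items():
        print(f"{name} = {value}")
    print("lora_scaling =", lora_scaling(PEFTSettings()))
```

Freezing the dataclasses keeps the reported values from being changed accidentally during a run; any deviation from the paper's settings then has to be made explicitly by constructing a new instance.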