Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment

Authors: Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee, Yeonhong Park

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches. Section 6: Experiments, includes Performance Evaluation, Inference Latency Evaluation and Per-Query QoS Validation. The paper presents numerous tables (Tables 1-14) with perplexity and downstream task evaluation results, latency overheads, and ablation studies.
Researcher Affiliation	Academia	Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee, Yeonhong Park; Seoul National University; EMAIL
Pseudocode	Yes	Algorithm 1 Layer-wise Candidate Precision Set and Threshold Assignment of DP-LLM
Open Source Code	Yes	https://github.com/SNU-ARC/DP-LLM
Open Datasets	Yes	We evaluate perplexity on Wikitext2 [32] and C4 [33] datasets. Additionally, we evaluate decoding performances using datasets with sufficient token generation lengths. Specifically, we utilize GSM8K [34], MBPP [35], BBH [36], and MATH [37].
Dataset Splits	Yes	For both the WikiText2 and C4 datasets, samples are concatenated and divided into chunks of size 2048. A subset of the C4 train dataset (512 tokens × 1000 samples) is used for threshold fine-tuning, with the number of epochs and learning rate each set to 5 and 0.01, respectively. GSM8K is evaluated in a 5-shot setting. MBPP is evaluated in a 3-shot setting. BBH is evaluated in a 3-shot setting. MATH is evaluated using the MATH-500 [45] variant, also with 3-shot prompting.
Hardware Specification	Yes	The evaluation is conducted on two hardware platforms: NVIDIA Jetson Orin AGX 64GB [40] and NVIDIA RTX 4060 Ti 16GB [41]. Both Llama-3-8B and Phi-3-Medium are fine-tuned on a single RTX 3090 GPU with 24GB of VRAM. On an A100 80GB GPU, Llama-3-8B requires about 30 minutes, and Phi-3-Medium takes approximately one hour to complete fine-tuning.
Software Dependencies	No	The paper mentions 'gpt-fast' and 'torch.compile feature' (implying PyTorch) and 'AdamW optimizer', but does not provide specific version numbers for these software components or any other libraries/frameworks.
Experiment Setup	Yes	A subset of the C4 train dataset (512 tokens × 1000 samples) is used for threshold fine-tuning, with the number of epochs and learning rate each set to 5 and 0.01, respectively. Fine-tuning is performed using the AdamW optimizer. The hyperparameter α is set to 1 for all target precisions, except when the target precision is 3.25, where α is set to 10 to better align with the target precision. In our further experiment setups, we use k = 64 for every linear projection, which limits the relative error estimation difference within 15% with 91% confidence when measured empirically using the C4 dataset. To determine the strength of the linear relationship, the coefficient of determination (R²) between the input vector norm and the relative error for each layer is computed offline using the calibration set, and compared against a hyperparameter R²_th (which is set to 0.9 in our further experiment setups).
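
The two mechanical steps quoted in the table (chunking a concatenated token stream into windows of 2048, and comparing a per-layer R² against the 0.9 threshold) can be illustrated with a minimal sketch. This is not the authors' code: the function names `chunk_tokens`, `r_squared`, and `is_linear` are hypothetical, dropping the trailing partial chunk is an assumption, and so is the use of `>=` for the threshold comparison, since the quoted text only says the value is "compared against" R²_th.

```python
from typing import List, Sequence


def chunk_tokens(token_ids: Sequence[int], chunk_size: int = 2048) -> List[List[int]]:
    """Split one concatenated token stream into full chunks.

    Assumption: a trailing partial chunk is dropped; the source does not say.
    """
    n_chunks = len(token_ids) // chunk_size
    return [list(token_ids[i * chunk_size:(i + 1) * chunk_size]) for i in range(n_chunks)]


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Coefficient of determination of a least-squares line fit of y on x.

    For a one-variable linear fit this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        # Degenerate input: no variance, so no linear relationship to measure.
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def is_linear(norms: Sequence[float], rel_errors: Sequence[float],
              r2_threshold: float = 0.9) -> bool:
    """Hypothetical per-layer check: is R² at or above the threshold (0.9 in the source)?"""
    return r_squared(norms, rel_errors) >= r2_threshold
```

Here `norms` stands for the input vector norms and `rel_errors` for the relative errors collected for one layer on the calibration set; how those values are actually measured is described in the paper and is not reproduced in this sketch.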