Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment

Authors: Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee, Yeonhong Park

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches. Section 6 (Experiments) includes Performance Evaluation, Inference Latency Evaluation, and Per-Query QoS Validation. The paper presents numerous tables (Tables 1-14) with perplexity and downstream task evaluation results, latency overheads, and ablation studies.
Researcher Affiliation | Academia | Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee, Yeonhong Park; Seoul National University
Pseudocode | Yes | Algorithm 1: Layer-wise Candidate Precision Set and Threshold Assignment of DP-LLM
Open Source Code | Yes | https://github.com/SNU-ARC/DP-LLM
Open Datasets | Yes | We evaluate perplexity on the WikiText2 [32] and C4 [33] datasets. Additionally, we evaluate decoding performance using datasets with sufficient token generation lengths. Specifically, we utilize GSM8K [34], MBPP [35], BBH [36], and MATH [37].
Dataset Splits | Yes | For both the WikiText2 and C4 datasets, samples are concatenated and divided into chunks of size 2048. A subset of the C4 train dataset (512 tokens × 1000 samples) is used for threshold fine-tuning, with the number of epochs and learning rate set to 5 and 0.01, respectively. GSM8K is evaluated in a 5-shot setting. MBPP is evaluated in a 3-shot setting. BBH is evaluated in a 3-shot setting. MATH is evaluated using the MATH-500 [45] variant, also with 3-shot prompting.
Hardware Specification | Yes | The evaluation is conducted on two hardware platforms: NVIDIA Jetson Orin AGX 64GB [40] and NVIDIA RTX 4060 Ti 16GB [41]. Both Llama-3-8B and Phi-3-Medium are fine-tuned on a single RTX 3090 GPU with 24GB of VRAM. On an A100 80GB GPU, Llama-3-8B requires about 30 minutes, and Phi-3-Medium takes approximately one hour to complete fine-tuning.
Software Dependencies | No | The paper mentions 'gpt-fast', the 'torch.compile feature' (implying PyTorch), and the 'AdamW optimizer', but does not provide specific version numbers for these software components or any other libraries or frameworks.
Experiment Setup | Yes | A subset of the C4 train dataset (512 tokens × 1000 samples) is used for threshold fine-tuning, with the number of epochs and learning rate set to 5 and 0.01, respectively. Fine-tuning is performed using the AdamW optimizer. The hyperparameter α is set to 1 for all target precisions, except when the target precision is 3.25, where α is set to 10 to better align with the target precision. In our further experiment setups, we use k = 64 for every linear projection, which limits the relative error estimation difference within 15% with 91% confidence when measured empirically using the C4 dataset. To determine the strength of the linear relationship, the coefficient of determination (R²) between the input vector norm and the relative error for each layer is computed offline using the calibration set, and compared against a hyperparameter R²_th (which is set to 0.9 in our further experiment setups).
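
The R² check quoted in the Experiment Setup entry can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `r_squared` and `is_strongly_linear`, the use of an ordinary least-squares line, the `>=` comparison, and the handling of constant inputs are all assumptions made for illustration. Only the threshold value 0.9 comes from the text above.

```python
import numpy as np

R2_THRESHOLD = 0.9  # R²_th value quoted in the Experiment Setup entry


def r_squared(x, y):
    """Coefficient of determination of a least-squares line y ~ slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residual = y - (slope * x + intercept)
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    if ss_tot == 0.0:
        # Assumption: a constant target has no variance to explain,
        # so it is reported as "no linear relationship".
        return 0.0
    return 1.0 - ss_res / ss_tot


def is_strongly_linear(input_norms, relative_errors, threshold=R2_THRESHOLD):
    """Assumed decision rule: the layer passes when R² reaches the threshold."""
    return r_squared(input_norms, relative_errors) >= threshold


if __name__ == "__main__":
    # Synthetic per-layer calibration data (hypothetical values, not from the paper).
    norms = np.linspace(1.0, 10.0, 50)
    rng = np.random.default_rng(0)
    print(is_strongly_linear(norms, 0.02 * norms + 0.1))      # perfectly linear errors
    print(is_strongly_linear(norms, rng.standard_normal(50)))  # errors unrelated to the norm
```

In practice, `input_norms` and `relative_errors` would be collected per layer from the calibration set; how the paper uses the outcome of the comparison is described in the paper itself and is not reproduced here.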