Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization

Authors: Bowei He, Lihao Yin, Hui-Ling Zhen, Shuqi LIU, Han Wu, Xiaokun Zhang, Mingxuan Yuan, Chen Ma

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In a summary, our contributions are as three folds: 1) we conduct extensive empirical exploration on the impact of calibration data variations from different compositional properties and domain correspondence perspectives; ... 4.3 Empirical Performance Evaluation We evaluate our calibration data curation framework across two settings: general deployment and targeted deployment.
Researcher Affiliation	Collaboration	Bowei He1, Lihao Yin2, Huiling Zhen2, Shuqi Liu2, Han Wu2, Xiaokun Zhang1, Mingxuan Yuan2, Chen Ma1; 1 Department of Computer Science, City University of Hong Kong, 2 Huawei, Hong Kong
Pseudocode	No	The paper describes methods using mathematical formulations (e.g., equations 1, 2, 3) but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured steps formatted like code.
Open Source Code	Yes	Our code is provided in Link. ... We provide the code repository for our proposed COLA framework in https://github.com/BokwaiHo/COLA.git.
Open Datasets	Yes	As for the first category, similar to previous works, we take the following datasets as the calibration data sources: C4 [Raffel et al., 2020], WikiText, and SlimPajama. ... To comprehensively evaluate the capability preservation of compressed LLMs, we take the benchmarks focusing on different capabilities, especially some high-level complex reasoning ones: 1) Language modeling: WikiText2 [Merity et al., 2022], PTB [Marcus et al., 1993]; 2) Commonsense reasoning: BoolQ [Clark et al., 2019], PIQA [Bisk et al., 2020], HellaSwag [Zellers et al., 2019], WinoGrande [Sakaguchi et al., 2021], ARC-Easy [Clark et al., 2018], ARC-Challenge [Clark et al., 2018], and OpenbookQA [Mihaylov et al., 2018]; 3) Mathematical problem solving: GSM8K [Cobbe et al., 2021], MATH (including subtasks like algebra, counting-and-prob, geometry, intermediate-algebra, num-theory, prealgebra, mathprecalc) [Hendrycks et al.], Minerva-Math [Lewkowycz et al., 2022] with custom prompts; 4) Code generation: HumanEval [Chen et al., 2021], MBPP [Austin et al., 2021]; 5) Multilingual comprehension: ARC-Multilingual [Lai et al., 2023], HellaSwag-Multilingual [Lai et al., 2023].
Dataset Splits	Yes	The evaluation part is based on the open-source repository lm-evaluation-harness, v0.4.7 version. ... For the three math benchmarks and code benchmark MBPP are evaluated with 4-shot and 3-shot manner, respectively, while others are evaluated with 0-shot manner.
Hardware Specification	Yes	We run all experiments on a server with 128 Intel Xeon Platinum 8538 CPU @ 2.60GHz and 8 Nvidia RTX 6000 Ada GPU having 48 GB GDDR6 VRAM.
Software Dependencies	Yes	For all experiments in this work, we use the Ubuntu 22.04 LTS system, Python 3.11.11 environment, and vLLM 0.7.2 library for LLM local inference of both LLaMA3-8B-Instruct and Qwen2.5-7B-Instruct. ... The evaluation part is based on the open-source repository lm-evaluation-harness, v0.4.7 version.
Experiment Setup	Yes	LLM Compression Schemes: To ensure comprehensiveness, we select two representative post-training pruning methods: SparseGPT [Frantar and Alistarh, 2023] for unstructured pruning and Wanda [Sun et al., 2024a] for semi-structured pruning. The pruning ratio is set as 50% for SparseGPT, and the block pattern is set as 4:8 for Wanda. ... As for the post-training quantization, we choose the widely adopted GPTQ [Frantar et al., 2023] and more recent AWQ [Lin et al., 2024] with the bit number set as 4. ... Except for the experiments investigating the impact of sample amounts and sequence lengths, we randomly sample 128 sequences with the token length as 2048 from corresponding sources as calibration data. ... For vLLM inference hypermeters, we set the max tokens to 1024, temperature to 0.7, top k to 50, top p to 0.7, and repetition penalty to 1. ... Each experiment is performed for five times with different seeds and then reports the averaged performance to mitigate the randomness.
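
The settings quoted in the Experiment Setup row can be gathered into a small configuration sketch. This is illustrative only and is not taken from the COLA repository. The dictionary keys mirror the keyword arguments of vLLM's `SamplingParams`, but vLLM is not imported. The seed values, the helper names `sample_calibration_windows` and `average_over_seeds`, and the uniform random-window sampling strategy are assumptions made for this example; the paper's exact procedure may differ.

```python
import random
from statistics import mean

# Decoding settings quoted in the Experiment Setup row. Key names follow
# vLLM's SamplingParams keyword arguments; vLLM itself is not imported here.
SAMPLING_SETTINGS = {
    "max_tokens": 1024,
    "temperature": 0.7,
    "top_k": 50,
    "top_p": 0.7,
    "repetition_penalty": 1.0,
}

N_CALIBRATION_SEQUENCES = 128  # from the quoted setup
CALIBRATION_SEQ_LEN = 2048     # tokens per sequence, from the quoted setup
SEEDS = [0, 1, 2, 3, 4]        # hypothetical values; the row only states five seeds


def sample_calibration_windows(token_ids, n_samples, seq_len, seed):
    """Draw n_samples random contiguous windows of seq_len tokens.

    Assumption: windows are drawn uniformly at random (overlap allowed)
    from one long token stream.
    """
    if len(token_ids) < seq_len:
        raise ValueError("token stream is shorter than one calibration sequence")
    rng = random.Random(seed)
    last_start = len(token_ids) - seq_len
    starts = [rng.randint(0, last_start) for _ in range(n_samples)]
    return [token_ids[s:s + seq_len] for s in starts]


def average_over_seeds(run_experiment, seeds):
    """Call run_experiment(seed) once per seed and return the mean score."""
    return mean(run_experiment(seed) for seed in seeds)


if __name__ == "__main__":
    # Stand-in corpus: integers in place of real tokenizer output.
    fake_tokens = list(range(300_000))
    windows = sample_calibration_windows(
        fake_tokens, N_CALIBRATION_SEQUENCES, CALIBRATION_SEQ_LEN, seed=SEEDS[0]
    )
    print(len(windows), len(windows[0]))
```

Seeding a local `random.Random` instance instead of the global generator keeps each of the five runs reproducible on its own, which matches the row's note that results are averaged over different seeds.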