Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Task-Specific Data Selection for Instruction Tuning via Monosemantic Neuronal Activations

Authors: Da Ma, Gonghu Shang, Zhi Chen, Libo Qin, Yijie LUO, Hongshen Xu, Lei Pan, Shuai Fan, Kai Yu, Lu Chen

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this section, we design experiments to systematically evaluate our method (MONA) for task-specific instruction tuning. We center our evaluation around the following key questions: Effectiveness and Robustness (Q1): Does MONA consistently select data that yields better downstream performance across (i) various source general instruction datasets and target evaluation tasks, (ii) different instruction-tuned LLMs, and (iii) a range of data selection ratios?
Researcher Affiliation	Collaboration	Da Maα, Gonghu Shangα, Zhi Chenσ, Libo Qinδ, Yijie Luoα, Hongshen Xuα, Lei Panγ, Shuai Fanγµ, Kai Yuαµλ, Lu Chenαβµλ; αX-LANCE Lab, MoE Key Lab of Artificial Intelligence, AI Institute School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; βShanghai Innovation Institution, Shanghai, China; γAISpeech Co., Ltd., Suzhou, China; σByteDance; δSchool of Computer Science and Engineering, Central South University; µJiangsu Key Lab of Language Computing, Suzhou, China; λSuzhou Laboratory, Suzhou, China
Pseudocode	Yes	A complete description of our algorithm is provided (Algorithm 1) in the form of pseudocode, to facilitate reproducibility and implementation in future work.
Open Source Code	No	We will release our code to facilitate further research in the community. (from Introduction) / Answer: [No] Justification: Core components of our code are provided in the supplemental material. The full code with instructions will be released upon final organization. (from NeurIPS Paper Checklist, Q5)
Open Datasets	Yes	General Instruction Data and Evaluation Tasks: To comprehensively evaluate robustness and generalization on target tasks, we select training data from two large-scale, diverse instruction datasets: OPENHERMES-2.5 [9] (1M synthetic and curated instruction/chat samples) and LESS [11] (270K samples covering both classical sources such as FLAN V2 [8], COT [34], and open-ended human-annotated datasets like DOLLY [35] and OPEN ASSISTANT 1 [36]). We evaluate performance on six target tasks: MMLU [37] (general knowledge), BBH [38] (complex reasoning), GSM8K [39] (math problems), MBPP [40] (programming), GPQA [41] (expert QA), and TyDi QA [42] (multilingual QA).
Dataset Splits	Yes	For MMLU and MBPP, we directly use their respective validation sets as representative examples. For the remaining datasets without validation sets, we follow the strategies outlined below: GSM8K: We randomly select 100 samples from the training set to serve as representative examples. BBH: We extract representative examples by selecting the provided few-shot samples in the task setup. GPQA: We use the extended 98 data points, which are the "extended split" minus the "main split." TyDi QA: Following [11], we select one sample per language as the representative example. (from Section B.1 Evaluation Tasks Details)
Hardware Specification	Yes	Additionally, all experiments are conducted on NVIDIA A100, A800, and H800 GPUs.
Software Dependencies	No	Fine-tuning is performed with llama-factory [47], using a cosine scheduler (peak learning rate 7e-6, warmup ratio 0.01), batch size 128, weight decay 0.1, and maximum sequence length 8192. All models are trained for two epochs. (from Section 3.1) / Evaluations use lm-evaluation-harness [43] and vLLM [44] except for TyDi QA, which uses the LESS codebase [11]. (from Section 3.1)
Experiment Setup	Yes	Fine-tuning is performed with llama-factory [47], using a cosine scheduler (peak learning rate 7e-6, warmup ratio 0.01), batch size 128, weight decay 0.1, and maximum sequence length 8192. All models are trained for two epochs.
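
The Experiment Setup row quotes a cosine scheduler with a peak learning rate and a warmup ratio. As a rough illustration of what those two numbers mean, the sketch below computes a learning rate per step. It is not the llama-factory implementation: the linear warmup shape, the decay to zero, the function name `lr_at_step`, and the step count in the usage lines are all assumptions made for this example.

```python
import math

PEAK_LR = 7e-6       # peak learning rate quoted in the Experiment Setup row
WARMUP_RATIO = 0.01  # warmup ratio quoted in the Experiment Setup row


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = PEAK_LR,
               warmup_ratio: float = WARMUP_RATIO) -> float:
    """Learning rate at a given optimizer step.

    Assumed shape (not confirmed by the source): linear warmup from 0 to
    peak_lr over the first warmup_ratio fraction of steps, then cosine
    decay from peak_lr down to 0 over the remaining steps.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp: 0 at step 0, peak_lr at step == warmup_steps.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical step count, chosen only for illustration
    for s in (0, 5, 10, 505, 1000):
        print(s, lr_at_step(s, total))
```

With a hypothetical 1000 total steps, the warmup covers the first 10 steps, the rate peaks at step 10, and it falls to half the peak midway through the decay phase. The real number of steps depends on dataset size, batch size 128, and the two training epochs, none of which are fixed here.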