Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data

Authors: Zhenqing Ling, Daoyuan Chen, Liuyi Yao, Qianli Shen, Yaliang Li, Ying Shen

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments show that the proposed method notably boosts performance across domain-undetermined data and a series of foundational downstream tasks when applied to various advanced LLMs. Extensive evaluations across 7 benchmarks and 3 model families reveal that DAAR achieves new state-of-the-art (SOTA) average performance, consistently outperforming 9 baseline methods on high-difficulty tasks while maintaining computational efficiency.
Researcher Affiliation Collaboration 1 Sun Yat-sen University, 2 Alibaba Group, 3 FSIETP
Pseudocode No The paper includes descriptions of processes and prompts but does not present them in clearly labeled pseudocode or algorithm blocks. Figure 1 provides a high-level illustration but is not pseudocode.
Open Source Code Yes Our code is released at https://github.com/modelscope/datajuicer/tree/DaaR to foster more data-centric research for LLMs.
Open Datasets Yes The seed data pool is sourced from the following datasets: Dolly-15k [15] for common sense, Cot-en [13] for reasoning, Math-Instruct [56] for mathematics, and Code-Alpaca [6] for coding. ... we select the following widely used evaluation sets: NQ [28] and TriviaQA [26] for common sense, Hellaswag [57] for reasoning, GSM8K [14] and MATH [24] for mathematics, MBPP [4] and HumanEval [11] for coding.
Dataset Splits Yes To accelerate the evaluation process while maintaining fairness and accuracy, we randomly tailor the original evaluation sets into evaluation subsets, as detailed in Table 6. All experiments were conducted using this consistent setup to ensure the fairness of the experiments. ... Following some effective instruction-tuning work [62, 34], we set the size of our subset to 8,000 entries, which constitutes 20% of the data pool.
Hardware Specification Yes Experiments are conducted on a computing platform equipped with four NVIDIA A100 GPUs (40GB), with pre-trained LLMs loaded as 16-bit floating-point numbers.
Software Dependencies Yes We implement our approaches using PyTorch [36] v2.4.1, coupled with PEFT v0.12.0 and the Transformers library [47] v4.45.2.
Experiment Setup Yes In our experimental setup, we employ Low-Rank Adaptation (LoRA) [25] adapters for the fine-tuning process, utilizing a LoRA-rank of 8 and a LoRA-alpha of 16. The learning rate was consistently maintained at 5 × 10⁻⁵ across all experiments to ensure uniformity in training dynamics. We utilize a batch size of 4 and set the maximum sequence length to 2048 tokens to accommodate the model's capacity. To optimize the training process, a warmup ratio of 0.05 was applied, and a validation ratio of 0.03 was used. The training was conducted over a single epoch, balancing computational efficiency with the need for effective model adaptation.
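
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which makes the implied training schedule easier to check. The sketch below is illustrative only and is not taken from the authors' code. The names `FineTuneSetup` and `derive_schedule` are hypothetical. It assumes that the validation ratio is applied to the 8,000-entry subset, that the batch size of 4 is the effective global batch size with no gradient accumulation, and that warmup steps are rounded to the nearest integer. None of these three points is stated in the quoted text.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class FineTuneSetup:
    """Hyperparameters as quoted in the Experiment Setup row (hypothetical container)."""
    lora_rank: int = 8
    lora_alpha: int = 16
    learning_rate: float = 5e-5
    batch_size: int = 4            # assumed to be the effective global batch size
    max_seq_length: int = 2048
    warmup_ratio: float = 0.05
    validation_ratio: float = 0.03
    num_epochs: int = 1

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank


def derive_schedule(setup: FineTuneSetup, num_examples: int) -> dict:
    """Derive split sizes and step counts under the assumptions stated above."""
    # Assumption: the validation split is carved out of the selected subset.
    val_examples = round(num_examples * setup.validation_ratio)
    train_examples = num_examples - val_examples
    # Assumption: no gradient accumulation, so one batch is one optimizer step.
    steps_per_epoch = math.ceil(train_examples / setup.batch_size)
    total_steps = steps_per_epoch * setup.num_epochs
    # Assumption: warmup steps are rounded to the nearest integer.
    warmup_steps = round(total_steps * setup.warmup_ratio)
    return {
        "train_examples": train_examples,
        "val_examples": val_examples,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


if __name__ == "__main__":
    setup = FineTuneSetup()
    # 8,000 is the subset size quoted in the Dataset Splits row.
    print(setup.lora_scaling, derive_schedule(setup, 8000))
```

Under these assumptions, the 8,000-entry subset yields 7,760 training examples, 1,940 optimizer steps, and 97 warmup steps. A different batch-size interpretation (for example, 4 per GPU across four GPUs) would change the step counts proportionally.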