Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Bohdi: Heterogeneous LLM Fusion with Automatic Data Exploration

Authors: Junqi Gao, Zhichang Guo, Dazhi Zhang, Dong Li, Runze Liu, Pengfei Li, Kai Tian, Biqing Qi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comparative experimental results on a comprehensive suite of benchmarks demonstrate that Bohdi significantly outperforms existing baselines on multiple target LLMs, exhibits higher data efficiency, and virtually eliminates the imbalance in the target LLM's capabilities. Our code is available at Bohdi.
Researcher Affiliation | Collaboration | 1 School of Mathematics, Harbin Institute of Technology; 2 Shanghai Artificial Intelligence Laboratory; 3 Tsinghua Shenzhen International Graduate School, Tsinghua University; 4 Department of Electronic Engineering, Tsinghua University; 5 Shanghai Innovation Institute
Pseudocode | Yes | B.4 Pseudocode of Bohdi. We present the pseudocode for Bohdi's actual execution process in Algorithm 1.
Open Source Code | No | Answer: [No] Justification: We will include the code link in the official publication, and our method does not require open-sourcing data.
Open Datasets | Yes | For multidisciplinary knowledge, we select the multidisciplinary question-answering benchmark MMLU [18] and the natural science subject question-answering benchmark GPQA [19]. For mathematics, we choose the mathematical problem-solving benchmarks GSM8K [20] and MATH [21]. For programming, we select the programming benchmarks HumanEval [22] and MBPP [23]. For reasoning ability, we choose the benchmark BBH [24] for measuring logical reasoning ability and the theorem-driven reasoning benchmark TheoremQA [25].
Dataset Splits | Yes | To ensure a unified and fair evaluation, we use OpenCompass [26] as the evaluation suite.
Hardware Specification | Yes | All of our training is conducted on 8 Nvidia A100 GPUs.
Software Dependencies | No | we uniformly use trl as the training framework and train with bfloat16 precision. All of our training is conducted on 8 Nvidia A100 GPUs. ... To ensure unified and fair evaluation, we use opencompass as the evaluation suite.
Experiment Setup | Yes | For SFT, we use a training schedule of 3 epochs and a consistent learning rate of 5e-6 across all target models. ... For Bohdi, in the comparative experiments, we perform R = 50 iterations, sampling B = 90 paths in the Meditation phase and M = 180 data in the Enlightenment phase for training the target model in each round, with the quantile parameter u = 0.2 and window width w = 20 for SWBLRT.
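
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch, which makes the implied training budget easy to check. This is not the authors' code: the class and field names are hypothetical, and the totals assume that every one of the R rounds samples exactly B paths and generates exactly M training samples, as the quoted text suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTSettings:
    """SFT baseline settings as quoted in the Experiment Setup row."""
    epochs: int = 3
    learning_rate: float = 5e-6
    precision: str = "bfloat16"


@dataclass(frozen=True)
class BohdiSettings:
    """Bohdi settings as quoted in the row; field names are hypothetical."""
    rounds: int = 50               # R: number of iterations
    paths_per_round: int = 90      # B: paths sampled in the Meditation phase
    samples_per_round: int = 180   # M: data used in the Enlightenment phase
    quantile: float = 0.2          # u: quantile parameter for SWBLRT
    window_width: int = 20         # w: window width for SWBLRT

    def total_paths(self) -> int:
        # Assumes every round samples exactly B paths.
        return self.rounds * self.paths_per_round

    def total_samples(self) -> int:
        # Assumes every round contributes exactly M training samples.
        return self.rounds * self.samples_per_round


sft = SFTSettings()
bohdi = BohdiSettings()
print(bohdi.total_paths(), bohdi.total_samples())
```

Under those assumptions, the comparative experiments would involve 4500 sampled paths and 9000 training samples in total for Bohdi.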