Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Curriculum Model Merging: Harmonizing Chemical LLMs for Enhanced Cross-Task Generalization

Authors: Baoyi He, Luotian Yuan, Ying Wei, Fei Wu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive experiments on two benchmark datasets show that CMM effectively consolidates task-specific expertise and outperforms the state-of-the-art methods by 29.03% in terms of overall average performance. Moreover, CMM facilitates chemical knowledge generalization across prediction and generative tasks without sacrificing robustness, exhibiting promising merging performance under both expert-abundant and expert-sparse scenarios.
Researcher Affiliation	Academia	Baoyi He Zhejiang University EMAIL Luotian Yuan Zhejiang University EMAIL Ying Wei Zhejiang University EMAIL Fei Wu Zhejiang University EMAIL
Pseudocode	No	The paper describes the methodology using mathematical equations and textual explanations in Section 3, but it does not include explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We will open code.
Open Datasets	Yes	Benchmarks and evaluation metrics: The performance of various models is evaluated on two representative chemical benchmarks. The first benchmark, Chembench [62], consists of 4,100 high-quality multiple-choice questions and answers spanning 9 core chemistry tasks: Name Conversion (NC), Property Prediction (Property_P), Mol2Caption (M2C), Caption2Mol (C2M), Product Prediction (Product_P), Retrosynthesis (RS), Yield Prediction (YP), Temperature Prediction (TP), and Solvent Prediction (SP). The evaluation metric used for this benchmark is accuracy. The second benchmark consists of two molecular generation tasks from Mol-Instructions [15]: retrosynthesis and forward reaction prediction.
Dataset Splits	Yes	To reduce computational overhead, 200 samples from each task are randomly selected as the test dataset. An ablation study in Appendix D examines the relationship between the number of evaluation samples and the resulting metrics. ... The validation set used for ranking is composed of the Chembench dev set and 100 samples from Mol-Instructions, both of which are entirely disjoint from the test set.
Hardware Specification	No	The paper discusses experimental setup and results in Section 4, but it does not specify the types of GPUs, CPUs, or other hardware used for running the experiments.
Software Dependencies	No	The paper describes the experimental setup in Section 4 but does not explicitly list specific software dependencies with version numbers, such as programming languages, libraries, or frameworks, used for implementing the proposed methodology.
Experiment Setup	Yes	The merging weight coefficients for each model, β, are assigned using a linear strategy that increases from 0.3 to 0.6. We draw inspiration for this coefficient range from a conclusion in task arithmetic [25]: Scaling coefficients in the range 0.3 to 0.5 produce close to optimal results in many cases. Ablation studies and corresponding experimental results are provided in Section 4.4 to demonstrate the effectiveness of the chosen merging order and the β weighting strategy.
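
The linear β strategy quoted under Experiment Setup can be illustrated with a minimal sketch. It assumes the coefficients are evenly spaced between 0.3 and 0.6, one per merged model in merging order. The function name `linear_merge_coefficients` and the choice of four models are hypothetical; the quoted text does not say how many experts are merged or how each β enters the merging rule, so this shows only the schedule, not the authors' implementation.

```python
def linear_merge_coefficients(num_models, start=0.3, end=0.6):
    """Return num_models coefficients evenly spaced from start to end.

    Assumption: "linear strategy" means equal steps between consecutive
    models in the merging order; the endpoints 0.3 and 0.6 come from the
    quoted experiment setup.
    """
    if num_models < 1:
        raise ValueError("num_models must be at least 1")
    if num_models == 1:
        # With a single model there is no range to span; use the start value.
        return [start]
    step = (end - start) / (num_models - 1)
    return [start + i * step for i in range(num_models)]


# Hypothetical example: four expert models merged in curriculum order.
betas = linear_merge_coefficients(4)
print([round(b, 2) for b in betas])
```

Under this assumption, later models in the merging order receive larger coefficients, with the first at 0.3 and the last at 0.6.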