Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
ChemOrch: Empowering LLMs with Chemical Intelligence via Groundbreaking Synthetic Instructions
Authors: Yue Huang, Zhengzhe Jiang, Xiaonan Luo, Kehan Guo, Haomin Zhuang, Yujun Zhou, Zhengqing Yuan, Xiaoqi Sun, Jules Schleinitz, Yanbo Wang, Shuhao Zhang, Mihir Surve, Nitesh Chawla, Olaf Wiest, Xiangliang Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of ChemOrch is evaluated based on: 1) the high quality of generated instruction data, demonstrating superior diversity and strong alignment with chemical constraints; 2) the reliable generation of evaluation tasks that more effectively reveal LLM weaknesses in chemistry; and 3) the significant improvement of LLM chemistry capabilities when the generated instruction data are used for fine-tuning. Our work thus represents a critical step toward scalable and verifiable chemical intelligence in LLMs. |
| Researcher Affiliation | Academia | 1 Department of Computer Science and Engineering, University of Notre Dame; 2 MIT; 3 Caltech; 4 MBZUAI; 5 CMU; 6 Department of Chemistry & Biochemistry, University of Notre Dame |
| Pseudocode | Yes | The details of instruction generation are presented in Algorithm 1 in Appendix M. For all tasks in $T$, by varying the constraint in each running iteration alongside the metadata (check implementation details in Appendix B), the IG model generates diverse instructions $X = \{x_i\}_{i=1}^{N}$. ... The details of response generation are shown in Algorithm 2 in Appendix M. |
| Open Source Code | Yes | The code is available at https://github.com/HowieHwong/ChemOrch. |
| Open Datasets | Yes | We leverage data samples from ChemLLMBench [11] as few-shot exemplars to guide the generation process of ChemOrch. ... We use 115 selected examples from the chemistry reasoning questions in MMLU-Pro [32]... The BBB penetration prediction comes from B3DB [56], the DDI prediction comes from TDC [57-59], and the lipophilicity prediction comes from MoleculeNet [60]. |
| Dataset Splits | Yes | The chemistry evaluation in subsection 4.4 contains 400 examples per task. For the fine-grained evaluation, each task includes 150 examples. For the fine-tuning experiments described in subsection 4.5, each of the three tasks (property prediction, tool usage, and molecule captioning) includes 400 samples for training and 400 for testing, with the test data sampled from the original benchmark [14]. For the general Q&A task, 1000 samples are used for training and 200 for testing; the larger size reflects the broader scope of chemical knowledge required. The ablation study of the ChemOrch module is conducted using a separate set of 200 data points. |
| Hardware Specification | Yes | Training was performed for 3 epochs with a cosine learning rate scheduler and a warmup ratio of 0.1. The learning rate was fixed at 1e-5, and the per-device training batch size was set to 4, with no gradient accumulation (i.e., gradient_accumulation_steps = 1). We used bfloat16 (bf16) precision and trained on 4 NVIDIA A100 GPUs to accelerate computation and reduce memory usage. |
| Software Dependencies | Yes | We employ GPT-4o [23] as the IG model across all experiments. For response generation, we adopt a hybrid setting: GPT-4o is used for general-purpose reasoning tasks... For components requiring fine-grained decision-making or complex reasoning (specifically, tool distillation, code script generation, and self-repairing), we utilize the o1-mini model [24], which demonstrates stronger reasoning capabilities. For text embedding, we adopt the text-embedding-3-small model [25]. ... ChemOrch leverages two categories of tools: chemistry-related tools such as RDKit [19] and PubChem [20]... We used LLaMA-Factory [62] for the training process. |
| Experiment Setup | Yes | Training was performed for 3 epochs with a cosine learning rate scheduler and a warmup ratio of 0.1. The learning rate was fixed at 1e-5, and the per-device training batch size was set to 4, with no gradient accumulation (i.e., gradient_accumulation_steps = 1). We used bfloat16 (bf16) precision and trained on 4 NVIDIA A100 GPUs... In our framework, we set a few hyperparameters to optimize the ability of our model. We set top_k = 5 and tool_distilling_num_threshold = 5 in the tool selection module to guarantee the diversity of the selected tools and avoid tool redundancy. In the tool invocation module, we set script_fixing_num_threshold = 3, error_fixing_num_threshold = 3, and effectiveness_checking_num_threshold = 5. |
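
The reported experiment setup can be collected into a small configuration sketch. This is only an illustration of the quoted values, not the authors' code: the class names `FineTuneConfig` and `ChemOrchThresholds` and the helper `warmup_steps` are hypothetical, the field names merely mirror the identifiers in the excerpt, and the effective batch size assumes plain data-parallel training across the 4 GPUs, which the excerpt does not state. The rounding used for warmup steps is likewise an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Fine-tuning values quoted in the Experiment Setup row (names are hypothetical)."""
    num_train_epochs: int = 3
    lr_scheduler_type: str = "cosine"
    warmup_ratio: float = 0.1
    learning_rate: float = 1e-5
    per_device_train_batch_size: int = 4
    gradient_accumulation_steps: int = 1
    bf16: bool = True
    num_gpus: int = 4  # 4 NVIDIA A100 GPUs

    def effective_batch_size(self) -> int:
        # Assumption: data parallelism, so each GPU processes its own batch.
        return (self.per_device_train_batch_size
                * self.gradient_accumulation_steps
                * self.num_gpus)


@dataclass(frozen=True)
class ChemOrchThresholds:
    """Framework hyperparameters quoted in the Experiment Setup row."""
    top_k: int = 5
    tool_distilling_num_threshold: int = 5
    script_fixing_num_threshold: int = 3
    error_fixing_num_threshold: int = 3
    effectiveness_checking_num_threshold: int = 5


def warmup_steps(cfg: FineTuneConfig, num_train_samples: int) -> int:
    """Estimate warmup steps; rounding up at each stage is an assumption."""
    steps_per_epoch = math.ceil(num_train_samples / cfg.effective_batch_size())
    total_steps = steps_per_epoch * cfg.num_train_epochs
    return math.ceil(total_steps * cfg.warmup_ratio)


if __name__ == "__main__":
    cfg = FineTuneConfig()
    print("effective batch size:", cfg.effective_batch_size())
    # 400 training samples per task, as in the Dataset Splits row.
    print("warmup steps for 400 samples:", warmup_steps(cfg, 400))
```

Under these assumptions, the 400-sample tasks run 25 optimizer steps per epoch, so the whole fine-tuning run is short; the actual step counts depend on how the training framework shards and rounds batches.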