Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Training a Scientific Reasoning Model for Chemistry
Authors: Siddharth Narayanan, James Braza, Ryan-Rhys Griffiths, Albert Bou, Geemi Wellawatte, Mayk Caldas Ramos, Ludovico Mitchener, Michael Pieler, Sam Rodriques, Andrew White
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We report ether0, a 24B parameter LLM (based on Mistral-Small-24B) that can reason in natural language and respond with chemical structures. This reasoning model was trained with reinforcement learning on 640,730 experimentally-grounded chemistry problems across 375 tasks ranging from synthesizability, to blood-brain barrier permeability, to human receptor activity, to scent. Our model exceeds general-purpose chemistry models, frontier models, and human experts on molecular design tasks. |
| Researcher Affiliation | Industry | FutureHouse Inc., San Francisco, CA. These authors jointly supervise technical work at FutureHouse. Correspondence: EMAIL |
| Pseudocode | Yes | Algorithm 1 GRPO. Input: minibatch sampling distribution $P_B(D)$, hyperparameters $\mu$, $M$. 1: for $k = 1, \dots, K$ do 2: $\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$ 3: if $k \bmod M = 0$ then 4: Update reference policy: $\pi_{\text{ref}} \leftarrow \pi_\theta$ 5: end if 6: Sample minibatch $D_B \sim P_B(D)$ 7: for $x \in D_B$ do 8: Sample $y^x_1, \dots, y^x_G \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$ 9: Compute rewards $r^x_1, \dots, r^x_G$, then advantages $A^x_1, \dots, A^x_G$ 10: end for 11: for $j = 1, \dots, \mu$ do 12: Update $\pi_\theta$ with a gradient ascent step on $J_{\text{GRPO}}$ over $\{x, \{y^x_1, \dots, y^x_G\} \mid x \in D_B\}$ 13: end for 14: end for |
| Open Source Code | Yes | We have publicly released a GitHub repository (https://github.com/Future-House/ether0) containing the reward functions and test data. All training data sources are publicly available, and we have also open-sourced the templates used to generate the question prompts. |
| Open Datasets | Yes | We construct a dataset of 640,730 chemical reasoning problems, comprising 18 different tasks. Molecules are represented in the question and expected answer as SMILES, which encodes the molecular graph or chemical reaction as ASCII text [55]. The answers are all either a molecule or a reaction. Many tasks are broken down into subtasks. For example, in the solubility editing task, one subtask is to increase solubility without changing the molecular scaffold, and another is to change it without affecting specific functional groups. Table 1 summarizes all problems in our dataset, and Section C.2 provides full details on the dataset provenance as well as the construction of each task. ... The dataset was constructed by aggregating data from 13 distinct sources, detailed in Table 1. All selected references exclusively involved experimental measurements of synthesized molecules, excluding any hypothetical or computationally generated structures. |
| Dataset Splits | Yes | To prevent leakage, all compounds used in a question type together were excluded between train and test. Namely, we made a graph where each edge represents when two molecules appeared in the same MCQ. We then ensured that the train and test subgraphs had no connections, but that we could group similar molecules densely enough to make questions with distractors. The smell, Eve Bio, and GHS tasks had enough compounds that this wasn't necessary, and we just randomly split. |
| Hardware Specification | Yes | The seven specialist models were trained using 24-72 Nvidia H100 GPUs each, with a varying set of hyperparameters (detailed in Section E.1). A total of 186,010 sequences were collected from the specialist training runs for distillation. A single SFT epoch was sufficient for distillation, with a batch size of 64 and learning rate of $1.9 \times 10^{-5}$. The all-task RL training phase was performed using 384 H100 GPUs, over 4 days; all hyperparameters are described in Section E.2. The final safety alignment phase required 104 H100 GPUs (see Section E.3). |
| Software Dependencies | No | The paper mentions several LLMs, including Mistral-Small-24B-Instruct-2501 [97], DeepSeek-R1, Gemini 2.5 Flash, OpenAI GPT-4.1, Google Gemini 2.5 Pro, and Llama-3.3-70B-Instruct [110]. It also refers to tools like RDKit [83], KDESol [56], and exmol [84]. However, it does not provide specific version numbers for the general software libraries or frameworks (e.g., Python, PyTorch, RDKit version). |
| Experiment Setup | Yes | All task-specific RL runs shared the following hyperparameters: Maximum completion length: 2048; GRPO epochs $\mu$: 1; Sampling temperature: 1.0; KL penalty weight $\beta$: 0.005; Learning rate: $10^{-6}$; Linear LR warm-up steps: 20; Reference policy reset period $M$: never ... The following hyperparameters were used for the all-task RL phase: Maximum completion length: 4096; Number of training steps: 434; Group size: 4; Group batch size: 768; GRPO epochs $\mu$: 1; Sampling temperature: 1.0; KL penalty weight $\beta$: 0.005; Learning rate: $1.25 \times 10^{-6}$; Linear LR warm-up steps: 20; Reference policy reset period $M$: 256 steps; Curriculum buffer sampling rate $\epsilon_{\text{cur}}$: 0.25; Curriculum buffer seed: ; Molecule quality bonus reward: enabled for the last 50 steps; Fraction of LLM-rewritten problems: 75% ... The following hyperparameters were used for the safety RL phase: Maximum completion length: 4096; Number of training steps: 120; Group size: 4 (non-safety problems) and 5 (safety problems); Group batch size: 104; GRPO epochs $\mu$: 1; Sampling temperature: 1.0; KL penalty weight $\beta$: 0.005; Learning rate: $1 \times 10^{-6}$; Linear LR warm-up steps: 20; Reference policy reset period $M$: 256 steps; Curriculum buffer: ; Fraction of LLM-rewritten problems: 75% |
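
The Dataset Splits row describes a graph in which two molecules are linked whenever they appear in the same MCQ, with no links allowed between the train and test subgraphs. One way to satisfy that constraint is to assign whole connected components to a single split. The sketch below illustrates this with a union-find structure; the function name `split_by_components`, the `test_fraction` parameter, and the example questions are hypothetical, and the paper's actual procedure (including how similar molecules are grouped for distractors) may differ.

```python
import random
from collections import defaultdict


def split_by_components(questions, test_fraction=0.2, seed=0):
    """Split molecules so no question mixes train and test molecules.

    `questions` is a list of lists of SMILES strings; molecules that share
    a question are treated as connected. This is an illustrative sketch,
    not the paper's implementation.
    """
    parent = {}

    def find(x):
        # Return the representative of x's component (with path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Add an edge between every molecule in a question and the first one.
    for molecules in questions:
        if not molecules:
            continue
        first = molecules[0]
        find(first)
        for other in molecules[1:]:
            union(first, other)

    # Collect connected components.
    components = defaultdict(set)
    for mol in list(parent):
        components[find(mol)].add(mol)

    # Shuffle components deterministically, then fill test up to the target.
    groups = sorted(components.values(), key=lambda g: sorted(g))
    random.Random(seed).shuffle(groups)

    total = len(parent)
    train, test = set(), set()
    for group in groups:
        target = test if len(test) < test_fraction * total else train
        target.update(group)
    return train, test


# Hypothetical example: the first two questions share "CCN",
# so their molecules must land in the same split.
questions = [
    ["CCO", "CCN"],
    ["CCN", "CCC"],
    ["c1ccccc1", "Cc1ccccc1"],
    ["CC(=O)O", "CCC(=O)O"],
]
train, test = split_by_components(questions)
```

Because components are assigned whole, the test share only approximates `test_fraction`; a single very large component can skew the ratio, which may be one reason a plain random split is preferable when a task has enough compounds.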
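
Step 9 of Algorithm 1 in the Pseudocode row computes rewards for the G completions sampled for one prompt and then converts them to advantages. The excerpt does not state the formula, so the sketch below assumes the commonly used group-relative normalization (subtract the group mean, divide by the group standard deviation). The function name and the `eps` stabilizer are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for a single prompt.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    The paper excerpt does not confirm this exact form.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of G = 4 completions with binary rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Under this assumption, a group in which every completion receives the same reward yields all-zero advantages and therefore contributes no policy-gradient signal for that prompt.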