Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Retro-R1: LLM-based Agentic Retrosynthesis

Authors: Wei Liu, Jiangtao Feng, Hongli Yu, Yuxuan Song, Yuqiang Li, Shufei Zhang, Lei Bai, Wei-Ying Ma, Hao Zhou

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that RETRO-R1 achieves a 55.79% pass@1 success rate, surpassing the previous state of the art by 8.95%. Notably, RETRO-R1 demonstrates strong generalization to out-of-domain test cases, where existing methods tend to fail despite their high in-domain performance. Our work marks a significant step toward equipping LLMs with advanced, chemist-like reasoning abilities, highlighting the promise of reinforcement learning for enabling data-efficient, generalizable, and sophisticated scientific problem-solving in LLM-based agents.
Researcher Affiliation | Collaboration | (1) School of Computer Science, Shanghai Jiao Tong University; (2) Institute for AI Industry Research (AIR), Tsinghua University; (3) Shanghai Artificial Intelligence Laboratory; (4) Shanghai Innovation Institute; (5) Department of Computer Science and Technology, Tsinghua University
Pseudocode | No | The paper describes the methodology and workflows using text and diagrams in Sections 3.1, 3.2, 3.3, and 3.4 and in Figures 1 and 2. However, it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Our framework is built upon verl [34] and Agent-R1 [35], with the implementation publicly accessible at https://github.com/GenSI-THUAIR/Retro-R1.
Open Datasets | Yes | Following the common practice [12], we use 23,081,633 commercially available molecules from eMolecules as the building blocks. For the training dataset, the single-step reactions come from USPTO reaction dataset [39]. [...] For the testing datasets, we use Retro*-190, which are 190 challenging target molecules from [12], each with a reference route, and ChEMBL-1000 [10], which is an out-of-distribution dataset used to test the generalizability of methods, consisting of 1000 molecules without reference routes.
Dataset Splits | Yes | For the training dataset, the single-step reactions come from USPTO reaction dataset [39]. Based on these reactions and building blocks, 299,202 multi-step synthesis routes are constructed in the prior work [12]. The length of the routes varies from 1 to 14. For baselines, all routes are used in the training. For our method, only 11,366 routes with lengths longer than 4 are used, which shows the data efficiency of our method. For the testing datasets, we use Retro*-190, which are 190 challenging target molecules from [12], each with a reference route, and ChEMBL-1000 [10], which is an out-of-distribution dataset used to test the generalizability of methods, consisting of 1000 molecules without reference routes.
Hardware Specification | Yes | We use 32 A800 80G GPU to train RETRO-R1. The model converges after 190 training steps, requiring approximately 70 hours. When planning with 500 iteration limit, it takes around 20 minutes per molecule on a single A800 80G GPU.
Software Dependencies | No | The paper mentions using Qwen2.5-7B-Instruct-1M as the policy network and vLLM for rollouts. While Qwen2.5-7B-Instruct-1M specifies a model version, the paper does not provide specific version numbers for general programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other key software libraries like vLLM, which are necessary to fully replicate the experimental environment.
Experiment Setup | Yes | Efficient rollouts are conducted using vLLM [44] with a sampling temperature of 1.0 and a maximum sequence length of 32,768. For training, we apply a clipping ratio of 0.2 in the L^CLIP objective and incorporate a KL-divergence penalty with a coefficient of 0.001 into the token-level reward. Additionally, an entropy loss with a coefficient of 0.001 is utilized to prevent the model from generating overly chaotic outputs. Both the policy and value networks use Adam [45] as the optimizer. The learning rates of the policy network and value function are 10^-6 and 10^-5, respectively. The rollout batch size is set to 128, while the mini-batch size for gradient updates is 64. During the initial 3 training steps, updates to the policy network are omitted to facilitate warm-up of the randomly initialized critic head.
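
For readers who want the Experiment Setup values in one place, the following is a minimal sketch and not the authors' implementation. It assumes the clipped objective is the standard PPO clipped surrogate, reads the two learning rates as 10^-6 and 10^-5, and uses a simple log-ratio estimate for the KL penalty. All class, field, and function names are hypothetical and do not come from the paper, verl, or Agent-R1.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutTrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    temperature: float = 1.0         # sampling temperature for rollouts
    max_seq_len: int = 32_768        # maximum sequence length
    clip_ratio: float = 0.2          # clipping ratio in the clipped objective
    kl_coef: float = 0.001           # KL penalty folded into the token-level reward
    entropy_coef: float = 0.001      # entropy loss coefficient
    policy_lr: float = 1e-6          # assumed reading of the policy learning rate
    value_lr: float = 1e-5           # assumed reading of the value learning rate
    rollout_batch_size: int = 128
    mini_batch_size: int = 64
    critic_warmup_steps: int = 3     # policy updates are skipped during these steps


def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, clip_ratio: float) -> float:
    """Standard per-token PPO clipped surrogate (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_ratio, min(1.0 + clip_ratio, ratio))
    return min(ratio * advantage, clipped * advantage)


def kl_penalized_reward(reward: float, logp_new: float,
                        logp_ref: float, kl_coef: float) -> float:
    """Token-level reward minus a log-ratio KL estimate (assumed estimator)."""
    return reward - kl_coef * (logp_new - logp_ref)


def updates_policy(step: int, cfg: RolloutTrainConfig) -> bool:
    """True once critic warm-up is over; steps are counted from 0 here."""
    return step >= cfg.critic_warmup_steps


cfg = RolloutTrainConfig()
# Number of gradient mini-batches per rollout batch under these settings.
minibatches_per_rollout = cfg.rollout_batch_size // cfg.mini_batch_size
```

With these values each rollout batch of 128 is split into two mini-batches of 64, and under the zero-based step convention assumed here the policy network would first be updated at step index 3.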