Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Multimodal Large Language Models for Inverse Molecular Design with Retrosynthetic Planning

Authors: Gang Liu, Michael Sun, Wojciech Matusik, Meng Jiang, Jie Chen

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility variables, results, and supporting LLM responses:
Research Type: Experimental
    We create benchmarking datasets and conduct extensive experiments to evaluate Llamole against in-context learning and supervised fine-tuning. Llamole significantly outperforms 14 adapted LLMs across 12 metrics for controllable molecular design and retrosynthetic planning.
Researcher Affiliation: Collaboration
    (1) University of Notre Dame; (2) MIT CSAIL; (3) MIT-IBM Watson AI Lab, IBM Research. Author email addresses redacted.
Pseudocode: No
    The paper describes the methodology in prose and figures (e.g., Figure 3: Overview of Llamole) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code: Yes
    Code and model at https://github.com/liugangcode/Llamole.
Open Datasets: Yes
    We collect small drug molecules from PubChem (Kim et al., 2021), MoleculeNet (Wu et al., 2018), ChEMBL (Zdrazil et al., 2024), and ZINC (Sterling & Irwin, 2015). Polymers are macromolecules built from a repeating unit called a monomer. We collect polymers from PI1M (Ma & Luo, 2020), the Membrane Society of Australia (MSA) (Thornton et al., 2012), and others (Liu et al., 2024b). Additionally, we collect 3.8 million patent chemical reactions with descriptions from USPTO (Lowe, 2017), spanning 1976 to 2016.
Dataset Splits: Yes
    For each route length, half of the molecules are placed in the testing set, capped at 3000, while the remainder is retained in the training set. This yields around 11K routes (750 for materials and 9,986 for drugs) for testing and 126K target molecules for instruction tuning.
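The length-stratified split described in that row can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the `routes` input format and the `test_cap=3000` default are assumptions based on the quoted description.

```python
import random
from collections import defaultdict

def split_by_route_length(routes, test_cap=3000, seed=0):
    """Per route length, send half the target molecules (capped at
    `test_cap`) to the test set and keep the rest for training.
    `routes` is a list of (target_molecule, route_length) pairs."""
    rng = random.Random(seed)
    by_length = defaultdict(list)
    for mol, length in routes:
        by_length[length].append(mol)
    train, test = [], []
    for length, mols in sorted(by_length.items()):
        rng.shuffle(mols)
        n_test = min(len(mols) // 2, test_cap)  # half, capped at 3000
        test.extend(mols[:n_test])
        train.extend(mols[n_test:])
    return train, test
```

With 10 length-1 routes and 7 length-2 routes, this yields 5 + 3 = 8 test routes and 5 + 4 = 9 training routes.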
Hardware Specification: Yes
    We pre-train the denoising model with the loss function in Eq. (2) using 600K graph-text pairwise data and the eight properties defined in Appendix C.3. The model employs the following hyperparameters: depth of 28, hidden size of 1024, 16 heads, and MLP hidden size of 4096. The total model size is around 574 million parameters. We pre-train the model for 45 epochs, which takes approximately one week on a single A100 card.
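The quoted hyperparameters can be sanity-checked with standard transformer parameter arithmetic. This sketch assumes a vanilla block (four h×h attention projections plus a two-layer MLP); it counts only the core blocks, so embeddings, norms, and any graph/text conditioning modules would account for the gap up to the reported ~574M total.

```python
def transformer_block_params(depth=28, hidden=1024, mlp=4096):
    """Rough lower-bound parameter count for a stack of vanilla
    transformer blocks with the paper's reported hyperparameters."""
    attn = 4 * hidden * hidden   # Q, K, V, O projections
    ffn = 2 * hidden * mlp       # two-layer MLP (up- and down-projection)
    return depth * (attn + ffn)

print(f"{transformer_block_params() / 1e6:.0f}M")  # prints "352M"
```

The core blocks alone come to roughly 352M parameters, consistent with (but below) the reported ~574M once auxiliary modules are included.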
Software Dependencies: No
    The paper mentions software such as RDKit, rdchiral, and SciBERT, but does not provide specific version numbers for any of these dependencies.
Experiment Setup: Yes
    The model employs the following hyperparameters: depth of 28, hidden size of 1024, 16 heads, and MLP hidden size of 4096. ... All LLMs are fine-tuned using LoRA (Hu et al., 2021) for four epochs.
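For readers unfamiliar with the LoRA fine-tuning referenced above, the core update can be sketched in a few lines. LoRA freezes a weight matrix W and learns a low-rank correction W' = W + (alpha/r) * B A. The rank `r=8` and scale `alpha=16` below are illustrative defaults; the paper's excerpt does not report the values used.

```python
import numpy as np

def lora_adapted_weight(W, r=8, alpha=16, seed=0):
    """LoRA (Hu et al., 2021): freeze W, learn low-rank factors
    A (r x in_dim) and B (out_dim x r); effective weight is
    W + (alpha / r) * B @ A. Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    out_dim, in_dim = W.shape
    A = rng.normal(0.0, 0.01, size=(r, in_dim))  # trainable, small random init
    B = np.zeros((out_dim, r))                   # trainable, zero init
    return W + (alpha / r) * (B @ A)
```

Because B starts at zero, the adapted weight initially equals W, so fine-tuning begins exactly from the pre-trained model; only the small A and B factors are trained, which is what makes per-epoch fine-tuning of large LLMs affordable.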