Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Procedural Synthesis of Synthesizable Molecules

Authors: Michael Sun, Alston Lo, Minghao Guo, Jie Chen, Connor Coley, Wojciech Matusik

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate performance advantages of our bi-level framework for synthesizable analog generation and synthesizable molecule design. Notably, our approach offers the user explicit control over the resources required to perform synthesis and biases the design space towards simpler solutions, which is particularly promising for autonomous synthesis platforms.
Researcher Affiliation Collaboration Michael Sun (1), Alston Lo (1), Minghao Guo (1), Jie Chen (2), Connor Coley (3), Wojciech Matusik (1). Affiliations: (1) MIT CSAIL; (2) MIT-IBM Watson AI Lab, IBM Research; (3) MIT Chemical Engineering.
Pseudocode Yes Algorithm 1 Construction of training dataset.
  Require: A synthetic dataset D0 ⊆ P × B of programs (Section 4.1.1).
  D ← ∅
  for each (P, B) ∈ D0 do
    Turn (P, B) into a fully-filled program T ∈ P whose root is attributed with FP(P, B).
    for each Λ ∈ 2^T containing the root and closed under parent(·) do
      Frontier(Λ) ← {i ∈ T | i ∉ Λ and parent(i) ∈ Λ}
      Populate node features H and labels Y based on P and B (App. F)
      for i ∈ T \ Λ do
        Mask the feature in H corresponding to node i
      for i ∈ T \ Frontier(Λ) do
        Mask the label in Y corresponding to node i
      D ← D ∪ {(T, H, Y)}
  return D
Open Source Code Yes Supporting code is at https://github.com/shiningsunnyday/SynthesisNet.
Open Datasets Yes We use 91 reaction templates from Hartenfeller et al. (2011); Button et al. (2019) representative of common synthetic reactions.
Dataset Splits Yes Filtering by QED > 0.5 of the product molecules leaves 227,808 synthetic trees (136,684 for training, 45,563 for validation, and 45,561 for testing), which are then preprocessed into programs to construct our final datasets.
Hardware Specification No No specific hardware details (like GPU/CPU models, memory, or specific computing infrastructure) are provided in the paper.
Software Dependencies No The paper mentions methods and formats like 'SMIRKS formal language' and 'graph neural network', but it does not specify any particular software libraries, frameworks, or their version numbers used for implementation or experimentation.
Experiment Setup Yes In practice, we featurize molecules using Morgan fingerprints with radius 2 and d = 2048 bits, which is a common molecular representation in both predictive and design tasks. This means that F is now technically a map over fingerprint space X = {0, 1}^d. It is then natural to use the Tanimoto distance between fingerprints as our notion of molecular distance.
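
To make the distance in the Experiment Setup row concrete, here is a minimal sketch of the Tanimoto distance between two d = 2048 bit vectors. It does not generate Morgan fingerprints (that step would normally come from a cheminformatics toolkit); the helper names `bits_from_indices` and `tanimoto_distance`, the on-bit indices, and the convention that two all-zero fingerprints have distance 0 are assumptions made for illustration, not details taken from the paper.

```python
D_BITS = 2048  # fingerprint length d quoted in the Experiment Setup row


def bits_from_indices(on_bits, d=D_BITS):
    """Hypothetical helper: build a length-d 0/1 vector from on-bit indices."""
    fp = [0] * d
    for i in on_bits:
        fp[i] = 1
    return fp


def tanimoto_distance(a, b):
    """Return 1 - |a AND b| / |a OR b| for two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    if either == 0:
        # Assumed convention: two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - both / either


# Toy fingerprints with made-up on-bits; they share 2 of 4 distinct on-bits.
fp_a = bits_from_indices([1, 5, 9])
fp_b = bits_from_indices([5, 9, 12])
distance = tanimoto_distance(fp_a, fp_b)
```

Identical fingerprints give a distance of 0 and fingerprints with no shared on-bits give 1, so the value can be read as a normalized dissimilarity over the fingerprint space.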
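
The masking loop of Algorithm 1 (quoted in the Pseudocode row) can be sketched in Python for a single program tree. This is an illustrative reading, not the authors' implementation: the tree is assumed to be a dict mapping each node to its parent, `MASK` is a hypothetical sentinel, the function names are invented for this example, and the brute-force subset enumeration is only practical for very small trees.

```python
from itertools import combinations

MASK = None  # hypothetical sentinel for a masked feature or label


def rooted_closed_subsets(parent):
    """Yield every node set that contains the root and is closed under parent()."""
    root = next(i for i, p in parent.items() if p is None)
    others = [i for i in parent if i != root]
    for size in range(len(others) + 1):
        for combo in combinations(others, size):
            lam = frozenset((root, *combo))
            # Closed under parent(): every chosen node's parent is also chosen.
            if all(parent[i] in lam for i in combo):
                yield lam


def frontier(parent, lam):
    """Nodes outside lam whose parent lies inside lam."""
    return {i for i in parent if i not in lam and parent[i] in lam}


def build_examples(parent, features, labels):
    """Create one (tree, H, Y) example per subset, masked as in Algorithm 1."""
    dataset = []
    for lam in rooted_closed_subsets(parent):
        front = frontier(parent, lam)
        # Features stay visible only inside lam; labels only on the frontier.
        h = {i: features[i] if i in lam else MASK for i in parent}
        y = {i: labels[i] if i in front else MASK for i in parent}
        dataset.append((parent, h, y))
    return dataset


# Toy tree: node 0 is the root, 1 and 2 are its children, 3 is a child of 1.
parent = {0: None, 1: 0, 2: 0, 3: 1}
features = {i: f"feat{i}" for i in parent}
labels = {i: f"label{i}" for i in parent}
examples = build_examples(parent, features, labels)
```

For this four-node tree, six subsets qualify, so `examples` holds six training items. In each one the model sees features only for the filled-in part of the tree and is supervised only on the nodes immediately below it.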