Projecting Molecules into Synthesizable Chemical Spaces

Authors: Shitong Luo, Wenhao Gao, Zuofan Wu, Jian Peng, Connor W. Coley, Jianzhu Ma

ICML 2024

Reproducibility assessment: each variable below is listed with its result and the supporting LLM response.
Research Type: Experimental
LLM response: "In this section, we first examine the model's capability of finding synthesis pathways for molecules that are known to be synthesizable, and explore the generalization ability of the models (Section 4.1). Then, we study the application of the model in three drug design settings: structure-based drug design (Section 4.2), goal-directed generation (Section 4.3), and hit expansion (Section 4.4). Our model produces valid synthetic pathways for most test cases and thus achieves a high success rate (Table 1)."
Researcher Affiliation: Collaboration
LLM response: "1 Helixon Research, 2 Massachusetts Institute of Technology, 3 Tsinghua University."
Pseudocode: Yes
LLM response: "Algorithm 1: Construct a (postfix notation, molecular graph) pair for training."
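Algorithm 1 itself is not reproduced here, but the core idea of serializing a synthesis tree into postfix notation can be sketched. The following is a minimal, hypothetical illustration, not the paper's implementation: the node structure, token names, and example tree are all invented for clarity. A post-order traversal emits each reactant subtree before the reaction that consumes it.

```python
# Hypothetical sketch of serializing a synthesis tree into postfix notation
# via post-order traversal. All names and the example tree are illustrative,
# not taken from the paper.
from dataclasses import dataclass, field


@dataclass
class SynthesisNode:
    token: str  # building-block SMILES or reaction-template ID
    children: list["SynthesisNode"] = field(default_factory=list)


def to_postfix(node: SynthesisNode) -> list[str]:
    """Emit every reactant subtree before the reaction that consumes it."""
    tokens: list[str] = []
    for child in node.children:
        tokens.extend(to_postfix(child))
    tokens.append(node.token)
    return tokens


# Amide coupling of two building blocks:
tree = SynthesisNode("rxn:amide_coupling", [
    SynthesisNode("CC(=O)O"),      # carboxylic acid building block
    SynthesisNode("NCc1ccccc1"),   # amine building block
])
print(to_postfix(tree))  # ['CC(=O)O', 'NCc1ccccc1', 'rxn:amide_coupling']
```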
Open Source Code: Yes
LLM response: "The code and data of this project are available at https://github.com/luost26/ChemProjector."
Open Datasets: Yes
LLM response: "We use the building blocks in the Enamine US Stock catalog retrieved in October 2023 (Enamine, 2023). For building blocks that include more than one molecule (e.g., salts, hydrates), we keep the largest molecule and drop the remaining ones... In addition, we include another challenging benchmark: 1,000 molecules from the ChEMBL database (Release 33) (Gaulton et al., 2012), reported to be predominantly unreachable in previous work (Gao et al., 2021)."
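The salt/hydrate filtering step quoted above can be approximated with RDKit's standardization utilities. The sketch below assumes "largest molecule" means the largest covalent fragment; the example SMILES is illustrative, not from the Enamine catalog.

```python
# Sketch of the salt/hydrate filtering step, assuming "largest molecule"
# means the largest covalent fragment. The example SMILES is illustrative.
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

chooser = rdMolStandardize.LargestFragmentChooser()


def keep_largest_fragment(smiles: str) -> str:
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(chooser.choose(mol))


print(keep_largest_fragment("Cl.NCc1ccccc1"))  # drops the HCl counterion
```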
Dataset Splits: Yes
LLM response: "First, we clustered building blocks into 128 groups using the K-means algorithm based on their fingerprints, reserving one cluster exclusively for testing while using the remaining 127 for training."
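A sketch of how such a split could be implemented with RDKit and scikit-learn. The quote does not specify the fingerprint used for clustering, so the Morgan settings below (radius 2, 256 bits, borrowed from the experiment-setup quote further down) are an assumption, and the building-block list is a toy stand-in.

```python
# Sketch of the described split: cluster building blocks by fingerprint with
# K-means and hold out one cluster for testing. Fingerprint settings are an
# assumption; the building-block list is a toy stand-in.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.cluster import KMeans


def fingerprint(smiles: str, n_bits: int = 256) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.float32)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr


building_blocks = ["CC(=O)O", "NCc1ccccc1", "c1ccccc1O", "CCN"]  # toy stand-ins
X = np.stack([fingerprint(s) for s in building_blocks])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # paper uses k=128
held_out = kmeans.labels_ == 0  # reserve one cluster exclusively for testing
train_blocks = [s for s, h in zip(building_blocks, held_out) if not h]
test_blocks = [s for s, h in zip(building_blocks, held_out) if h]
```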
Hardware Specification: No
LLM response: "The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments."
Software Dependencies: No
LLM response: "The paper mentions RDKit and the SMARTS string format (Daylight Chemical Information Systems, Inc., 2023) but does not provide specific version numbers for these or other software dependencies necessary for replication."
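To illustrate the dependency in question: reaction templates in this line of work are typically encoded as reaction SMARTS and executed with RDKit. The template below is a generic amide-coupling pattern chosen for illustration; it is not taken from the paper's template set.

```python
# Illustrative use of the RDKit + SMARTS dependency: applying a generic
# amide-coupling reaction template (not from the paper's template set).
from rdkit import Chem
from rdkit.Chem import AllChem

rxn = AllChem.ReactionFromSmarts("[C:1](=[O:2])[OH].[N;H2:3]>>[C:1](=[O:2])[N:3]")
acid = Chem.MolFromSmiles("CC(=O)O")
amine = Chem.MolFromSmiles("NCc1ccccc1")
products = rxn.RunReactants((acid, amine))
print(Chem.MolToSmiles(products[0][0]))  # CC(=O)NCc1ccccc1
```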
Experiment Setup: Yes
LLM response: "The molecular graph encoder consists of 8 graph transformer layers, each of which has 8 attention heads, and the dimension of the input and output features is 512. The postfix notation decoder is a stack of 8 transformer decoder layers. Each has 8 attention heads, and the dimension of the input and output features is also 512. In the nearest-neighbor search, each building block molecule is indexed by the Morgan fingerprint (Morgan, 1965) of length 256 and radius 2. We use the AdamW optimizer (Loshchilov & Hutter, 2017) to train the network with a learning rate of 3e-4 and a batch size of 256 for 500,000 iterations. During the construction of syntheses for training, we limit the maximum number of reactions to 5 and the maximum number of atoms to 80. At inference time, we set the maximum sequence length to 16."
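The quoted hyperparameters translate into roughly the following PyTorch configuration. This is a stub, not the authors' model: the paper's encoder is a graph transformer rather than a plain nn.TransformerDecoder, so only the stated sizes and optimizer settings are mirrored here.

```python
# Stub mirroring the quoted hyperparameters in PyTorch. Not the authors'
# model: the paper's encoder is a graph transformer; only the decoder stack
# and optimizer settings are sketched here.
import torch
from torch import nn

d_model, n_heads, n_layers = 512, 8, 8  # feature dim, attention heads, layers

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, batch_first=True),
    num_layers=n_layers,
)

optimizer = torch.optim.AdamW(decoder.parameters(), lr=3e-4)
batch_size, n_iterations = 256, 500_000
max_reactions, max_atoms = 5, 80  # limits when constructing training syntheses
max_seq_len = 16                  # maximum postfix length at inference time
```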