Differentiable Scaffolding Tree for Molecule Optimization

Authors: Tianfan Fu, Wenhao Gao, Cao Xiao, Jacob Yasonik, Connor W. Coley, Jimeng Sun

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our empirical studies show that gradient-based molecular optimization is both effective and sample-efficient (in terms of the number of oracle calls). Furthermore, the learned graph parameters can also provide an explanation that helps domain experts understand the model output. The code repository (including processed data, trained models, a demonstration, and the molecules with the highest property scores) is available at https://github.com/futianfan/DST."
Researcher Affiliation | Collaboration | "Tianfan Fu (Georgia Institute of Technology), Wenhao Gao (Massachusetts Institute of Technology), Cao Xiao (Amplitude), Jacob Yasonik (Massachusetts Institute of Technology), Connor W. Coley (Massachusetts Institute of Technology), and Jimeng Sun (University of Illinois at Urbana-Champaign)"
Pseudocode | Yes | "Algorithm 1: Differentiable Scaffolding Tree (DST)"
Open Source Code | Yes | "The code repository (including processed data, trained models, a demonstration, and the molecules with the highest property scores) is available at https://github.com/futianfan/DST."
Open Datasets | Yes | "Dataset: ZINC 250K contains around 250K druglike molecules (Sterling & Irwin, 2015)."
Dataset Splits | No | "When training the GNN, the number of training epochs is 5, and we evaluate the loss function on the validation set every 20K data passes. When the validation loss no longer decreases, we terminate the training process."
Hardware Specification | Yes | "We implemented DST using PyTorch 1.7.0, Python 3.7, and RDKit v2020.09.1.0 on an Intel Xeon E5-2690 machine with 256 GB RAM and 8 NVIDIA Pascal Titan X GPUs."
Software Dependencies | Yes | "We implemented DST using PyTorch 1.7.0, Python 3.7, and RDKit v2020.09.1.0 on an Intel Xeon E5-2690 machine with 256 GB RAM and 8 NVIDIA Pascal Titan X GPUs."
Experiment Setup | Yes | "Both the size of the substructure embedding and the hidden size of the GCN (GNN) in Eq. (6) are d = 100. The depth of the GNN is L = 3. When training the GNN, the number of training epochs is 5, and we evaluate the loss function on the validation set every 20K data passes; when the validation loss no longer decreases, we terminate training. When optimizing JNK3, GSK3β, QED, JNK3+GSK3β, and QED+SA+JNK3+GSK3β, we use binary cross-entropy as the loss criterion. When optimizing LogP, since LogP ranges from −∞ to +∞, we use the GNN for a regression task with mean squared error (MSE) as the loss criterion. In de novo generation, in each generation we keep C = 10 molecules for the next iteration. In most cases in the experiments, the size of the neighborhood set (Definition 10) is less than 100. We use the Adam optimizer with a 1e-3 learning rate in both the training and inference procedures, optimizing the GNN and the differentiable scaffolding tree, respectively. When optimizing the DST (Equation 10), our method processes one DST at a time and performs 1,000 Adam steps in every iteration, which is sufficient for convergence in almost all cases. As a complete generation algorithm, we optimize a batch of DSTs in parallel and select candidates based on DPP. We stop when the oracle budget is used up. All DST results in the tables take at most T = 50 iterations."
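
The Open Datasets entry points to ZINC 250K as the molecule pool. Purely as an illustration, the sketch below loads and canonicalizes a SMILES list with RDKit; the file name zinc_250k.smi and the one-SMILES-per-line layout are assumptions, not details taken from the paper or its repository.

```python
# Minimal sketch: load and canonicalize a SMILES list such as ZINC 250K with RDKit.
# The file name "zinc_250k.smi" and one-SMILES-per-line layout are assumptions;
# the DST repository ships its own processed data.
from rdkit import Chem

def load_smiles(path):
    """Read SMILES strings, keep only those RDKit can parse, and return canonical forms."""
    molecules = []
    with open(path) as f:
        for line in f:
            parts = line.strip().split()
            if not parts:
                continue
            mol = Chem.MolFromSmiles(parts[0])       # returns None for invalid SMILES
            if mol is not None:
                molecules.append(Chem.MolToSmiles(mol))
    return molecules

if __name__ == "__main__":
    smiles_list = load_smiles("zinc_250k.smi")       # hypothetical path
    print(f"Loaded {len(smiles_list)} valid molecules")
```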
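
The Experiment Setup entry fixes the GNN training schedule: Adam with a 1e-3 learning rate, binary cross-entropy (MSE for LogP), at most 5 epochs, a validation check every 20K data passes, and early stopping once the validation loss stops decreasing. The minimal sketch below reproduces only that schedule; the model and tensors are dummy stand-ins, not the paper's GNN over scaffolding trees.

```python
# Illustrative sketch of the stated training schedule only: Adam with lr = 1e-3,
# binary cross-entropy loss, at most 5 epochs, a validation-loss check every 20K
# data passes, and early stopping once the validation loss stops decreasing.
# The model and tensors are dummy stand-ins, NOT the paper's GNN over scaffolding trees.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

d = 100                                   # substructure-embedding / hidden size from the paper
model = nn.Sequential(                    # placeholder for the GCN/GNN surrogate (depth L = 3 in the paper)
    nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()        # BCE for binary property objectives; MSE would replace it for LogP

# Dummy features/labels standing in for processed molecules.
train_set = TensorDataset(torch.randn(60_000, d), torch.randint(0, 2, (60_000, 1)).float())
valid_set = TensorDataset(torch.randn(5_000, d), torch.randint(0, 2, (5_000, 1)).float())
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
valid_loader = DataLoader(valid_set, batch_size=256)

def validation_loss():
    model.eval()
    with torch.no_grad():
        losses = [criterion(model(x), y).item() for x, y in valid_loader]
    model.train()
    return sum(losses) / len(losses)

def train():
    best_val, seen = float("inf"), 0
    for epoch in range(5):                          # 5 training epochs
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            seen += x.size(0)
            if seen >= 20_000:                      # evaluate on the validation set every 20K data passes
                seen = 0
                val = validation_loss()
                if val >= best_val:                 # terminate once validation loss no longer decreases
                    return
                best_val = val

train()
```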
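
The same entry also specifies the outer generation loop: at most T = 50 iterations, 1,000 Adam steps per differentiable scaffolding tree per iteration, C = 10 candidates carried forward, DPP-based selection, and termination once the oracle budget is spent. The sketch below mirrors only that control flow with toy stand-ins: a smooth toy objective replaces the GNN surrogate, a call-counting toy oracle replaces the property oracle, random perturbations replace neighborhood sampling from the optimized DST, greedy top-C selection replaces DPP, and the oracle budget of 500 is an arbitrary placeholder.

```python
# Control-flow sketch of the outer generation loop, with toy stand-ins:
# a smooth toy objective replaces the differentiable GNN surrogate over the scaffolding
# tree, a call-counting toy "oracle" replaces the property oracle, random perturbations
# replace neighborhood sampling from the optimized DST, and greedy top-C selection
# replaces the paper's DPP-based selection. The oracle budget is an arbitrary placeholder.
import torch

T, C, ADAM_STEPS, LR = 50, 10, 1000, 1e-3     # values stated in the paper's setup
ORACLE_BUDGET = 500                           # placeholder budget
D = 100                                       # dimensionality stand-in for DST parameters

oracle_calls = 0
def toy_oracle(x):                            # stand-in for a property oracle (e.g., QED, JNK3)
    global oracle_calls
    oracle_calls += 1
    return -float((x - 1.0).pow(2).sum())     # higher is better

def surrogate(x):                             # differentiable stand-in for the GNN objective
    return -(x - 1.0).pow(2).sum()

population = [torch.randn(D) for _ in range(C)]     # C candidates carried between iterations

for iteration in range(T):                    # at most T = 50 iterations
    candidates = []
    for x0 in population:                     # each DST is optimized independently
        x = x0.clone().requires_grad_(True)
        opt = torch.optim.Adam([x], lr=LR)
        for _ in range(ADAM_STEPS):           # 1,000 Adam steps per iteration
            opt.zero_grad()
            (-surrogate(x)).backward()        # maximize the differentiable objective
            opt.step()
        # Stand-in for sampling a small neighborhood of discrete candidates from the optimized DST.
        candidates.extend((x + 0.1 * torch.randn(D)).detach() for _ in range(5))
    # Score candidates with the (budgeted) oracle and keep the top C for the next iteration.
    population = sorted(candidates, key=toy_oracle, reverse=True)[:C]
    if oracle_calls >= ORACLE_BUDGET:         # stop once the oracle budget is used up
        break
```

In the paper, the candidates at each iteration are discrete molecules drawn from the neighborhood of the optimized scaffolding tree, and the DPP-based selection trades candidate quality off against diversity, which the greedy top-C stand-in above does not capture.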