Molecule Generation For Target Protein Binding with Structural Motifs

Authors: Zaixi Zhang, Yaosen Min, Shuxin Zheng, Qi Liu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to evaluate our approach. Experimental results show that: (1) our method is able to generate diverse drug-like molecules with high binding affinity to target proteins; (2) FLAG is much faster than most of the baseline methods at sampling new molecules; (3) thanks to the fragment-based generation design, our method outperforms baselines by a large margin in generating valid molecules with realistic substructures.
Researcher Affiliation | Collaboration | Zaixi Zhang (1,2), Yaosen Min (3), Shuxin Zheng (4), Qi Liu (1,2). 1: Anhui Province Key Lab of Big Data Analysis and Application, School of Computer Science and Technology, University of Science and Technology of China; 2: State Key Laboratory of Cognitive Intelligence, Hefei, Anhui, China; 3: Institute of Interdisciplinary Information Sciences, Tsinghua University; 4: Microsoft Research
Pseudocode | No | The paper describes its methods and procedures in text and with diagrams, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Our code is publicly available at https://github.com/zaixizhang/FLAG.
Open Datasets | Yes | Following Luo et al. (2021a) and Peng et al. (2022), we use the CrossDocked dataset (Francoeur et al., 2020), which contains 22.5 million protein-molecule structures.
Dataset Splits | Yes | We filter out data points whose binding-pose RMSD is greater than 1 Å and molecules that cannot be sanitized with RDKit (Bento et al., 2020), leading to a refined subset with around 160k data points. We use MMseqs2 (Steinegger & Söding, 2017) to cluster the data at 30% sequence identity, and randomly draw 100,000 protein-ligand pairs for training and 100 proteins from the remaining clusters for testing. (A Python sketch of this filter follows the table.)
Hardware Specification | Yes | All experiments are conducted on Ubuntu Linux with V100 GPUs.
Software Dependencies | Yes | The code is implemented in Python 3.8 and PyTorch 1.10.0.
Experiment Setup | Yes | The number of layers L in the context encoder is 6, and the hidden dimension is 256. The model is trained with the Adam optimizer at a learning rate of 0.0001. The batch size is 4, and the total number of training iterations is 1,000,000. The standard deviation of the Gaussian noise added to the ligand coordinates is 0.2. The cutoff distance in the context encoder is set to 10 Å. The threshold τ in motif extraction is set to 100 in the default setting. (A configuration sketch follows the table.)
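
The 'Dataset Splits' row above describes a two-stage filter (binding-pose RMSD at most 1 Å, then RDKit sanitization) applied before clustering. Below is a minimal Python sketch of that filter; the function name, the SDF input, and the precomputed RMSD argument are illustrative assumptions, since the paper does not publish this step as code.

    from rdkit import Chem

    def keep_pair(ligand_sdf_path, pose_rmsd):
        # Keep a protein-ligand pair only if its binding-pose RMSD is <= 1 Å
        # and the ligand can be sanitized by RDKit. (keep_pair and its
        # arguments are assumptions, not names from the FLAG codebase.)
        if pose_rmsd > 1.0:  # RMSD cutoff reported in the paper
            return False
        mol = Chem.MolFromMolFile(ligand_sdf_path, sanitize=False)
        if mol is None:  # unparsable structure
            return False
        try:
            Chem.SanitizeMol(mol)  # raises if valences/aromaticity are invalid
        except Exception:
            return False
        return True

The 30% sequence-identity clustering is performed with the separate MMseqs2 command-line tool, so only the RDKit stage is sketched here.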
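
The 'Experiment Setup' row maps directly onto a small PyTorch training configuration. The sketch below wires the reported hyperparameters into an Adam optimizer and the coordinate-noise step; the config key names and the stand-in encoder module are assumptions, as FLAG's actual architecture is not reproduced here.

    import torch

    # Hyperparameters reported in the 'Experiment Setup' row; the key names are ours.
    config = {
        "num_layers": 6,          # context-encoder layers L
        "hidden_dim": 256,
        "lr": 1e-4,               # Adam learning rate
        "batch_size": 4,
        "total_iters": 1_000_000,
        "coord_noise_std": 0.2,   # std of Gaussian noise added to ligand coordinates
        "cutoff": 10.0,           # context-encoder cutoff distance, in Å
        "tau": 100,               # motif-extraction threshold (default)
    }

    # Stand-in for the FLAG context encoder; the real 6-layer model is not sketched here.
    model = torch.nn.Sequential(
        *[torch.nn.Linear(config["hidden_dim"], config["hidden_dim"])
          for _ in range(config["num_layers"])]
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])

    # Training-time perturbation implied by the setup: Gaussian noise on ligand
    # coordinates (a random placeholder tensor stands in for real data).
    ligand_coords = torch.randn(config["batch_size"], 20, 3)  # (batch, atoms, xyz)
    noisy_coords = ligand_coords + config["coord_noise_std"] * torch.randn_like(ligand_coords)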