Text-Guided Molecule Generation with Diffusion Language Model

Authors: Haisong Gong, Qiang Liu, Shu Wu, Liang Wang

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we undertake experiments to assess the performance of our proposed TGM-DLM in text-guided molecule generation.
Researcher Affiliation | Academia | 1 Center for Research on Intelligent Perception and Computing, State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences
Pseudocode | Yes | Algorithm 1: Training algorithm (a PyTorch sketch of one training step follows the table)
1: repeat
2:   Sample M, C and t.
3:   Obtain x0 by embedding process
4:   if t = 0 then
5:     g = ∇θ(−log pθ(M|x0))
6:   else if phase two and t < τ then
7:     Obtain x̃t by Equation 7 and 8
8:     g = ∇θ‖f2,θ(x̃t, t) − x0‖²
9:   else
10:     g = ∇θ‖f1,θ(xt, t, C) − x0‖²
11:   end if
12:   Take one step of optimization through gradient g
13: until converged
Open Source Code | Yes | Code will be released at: https://github.com/Deno-V/tgm-dlm.
Open Datasets | Yes | Given the nascent nature of our research focus, our evaluation centers on the ChEBI-20 dataset (Edwards et al. 2022), which is currently the sole publicly available dataset.
Dataset Splits | Yes | This dataset encompasses a collection of 33,010 molecule-description pairs, which are separated into 80/10/10% train/validation/test splits. (An illustrative split sketch follows the table.)
Hardware Specification | Yes | it only takes about 1.2 seconds on average to generate one molecule from its description on our hardware (AMD EPYC 7742 (256) @ 2.250GHz CPU and one NVIDIA A100 GPU).
Software Dependencies | No | The paper mentions using 'SciBERT', the 'RDKit toolkit', and 'Adam', but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | We set the maximum sequence length for tokenized SMILES strings to n = 256. SMILES vocabulary contained 257 tokens, with trainable token embeddings set at d = 32. We employed SciBERT (Beltagy, Lo, and Cohan 2019) as our frozen encoder for text descriptions, with an embedding dimension of d1 = 768. The Transformer network for f1,θ comprises L = 12 layers, and the hidden dimension is configured as d2 = 1024. TGM-DLM is composed of approximately 180M trainable parameters. During molecule generation, we adopt a uniform skipping strategy for reverse steps to enhance sampling efficiency. As a result, the practical number of sample steps is 200 for phase one and 20 for phase two. During training, we set the total diffusion steps to T = 2,000 for both phase one and phase two. For phase two, τ is set to 400, and the corruption probability p is set to 0.4. We used Adam (Kingma and Ba 2015) as the optimizer, employing linear warm-up and a learning rate of 1e-4. (These values are collected into a configuration sketch after the table.)
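
To make the quoted training algorithm concrete, here is a minimal PyTorch sketch of a single training step. It is a sketch under assumptions: the callables token_embedding, rounding_head, denoiser_f1, corrector_f2, q_sample, and corrupt are hypothetical placeholders (not the API of the released tgm-dlm code), and Equations 7 and 8 of the paper are stood in for by a generic corrupt() call.

import torch
import torch.nn.functional as F

def training_step(M, C, T, tau, phase_two, token_embedding, rounding_head,
                  denoiser_f1, corrector_f2, q_sample, corrupt, optimizer):
    # M: tokenized SMILES, shape (batch, n); C: frozen SciBERT encoding of the description.
    t = torch.randint(0, T, (1,)).item()          # line 2: sample diffusion step t
    x0 = token_embedding(M)                       # line 3: embedding process

    if t == 0:
        # line 5: rounding loss, -log p_theta(M | x0)
        logits = rounding_head(x0)                # (batch, n, vocab)
        loss = F.cross_entropy(logits.transpose(1, 2), M)
    elif phase_two and t < tau:
        # lines 7-8: corrupt x0 (stand-in for Eqs. 7 and 8), regress x0 without text guidance
        x_tilde_t = corrupt(x0, t)
        loss = ((corrector_f2(x_tilde_t, t) - x0) ** 2).mean()
    else:
        # line 10: text-guided denoising objective
        x_t = q_sample(x0, t)
        loss = ((denoiser_f1(x_t, t, C) - x0) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()                               # gradient g
    optimizer.step()                              # line 12: one optimization step
    return loss.item()

As in the quoted pseudocode (lines 1 and 13), this step is repeated until convergence.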
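
For orientation only: an 80/10/10 split of the 33,010 ChEBI-20 pairs works out to roughly 26,408/3,301/3,301 examples. The helper below is a hypothetical sketch; the ChEBI-20 release used in prior work ships pre-defined train/validation/test files, which should be preferred for comparability.

import random

def split_80_10_10(pairs, seed=0):
    # pairs: list of (SMILES, description) tuples; illustrative random split only.
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test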
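
The experiment-setup quote can be summarized as the configuration sketch below. The dataclass and its field names are hypothetical (they do not mirror the released config files); all values are taken from the quoted text.

from dataclasses import dataclass

@dataclass
class TGMDLMConfig:
    # SMILES side
    max_seq_len: int = 256          # n, maximum tokenized SMILES length
    vocab_size: int = 257           # SMILES vocabulary size
    token_dim: int = 32             # d, trainable token embedding dimension
    # frozen text encoder
    text_encoder: str = "SciBERT"   # Beltagy, Lo, and Cohan 2019
    text_dim: int = 768             # d1
    # denoising Transformer (f1,theta; ~180M trainable parameters overall)
    num_layers: int = 12            # L
    hidden_dim: int = 1024          # d2
    # diffusion schedule
    train_steps: int = 2000         # T, both phase one and phase two during training
    tau: int = 400                  # phase-two threshold on t
    corruption_p: float = 0.4       # phase-two corruption probability p
    sample_steps_phase1: int = 200  # uniform step skipping at generation time
    sample_steps_phase2: int = 20
    # optimization
    optimizer: str = "Adam"
    learning_rate: float = 1e-4
    lr_schedule: str = "linear_warmup"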