Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Retrieval-based Controllable Molecule Generation

Authors: Zichao Wang, Weili Nie, Zhuoran Qiao, Chaowei Xiao, Richard Baraniuk, Anima Anandkumar

ICLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On various tasks ranging from simple design criteria to a challenging real-world scenario for designing lead compounds that bind to the SARS-CoV-2 main protease, we demonstrate our approach extrapolates well beyond the retrieval database, and achieves better performance and wider applicability than previous methods.
Researcher Affiliation | Collaboration | Zichao Wang (Rice University); Weili Nie (NVIDIA); Zhuoran Qiao (Caltech); Chaowei Xiao (NVIDIA, ASU); Richard G. Baraniuk (Rice University); Anima Anandkumar (NVIDIA, Caltech)
Pseudocode | Yes | Algorithm 1: Exemplar molecule retriever
Open Source Code | Yes | The source code is available at https://github.com/NVlabs/RetMol.
Open Datasets | Yes | The training dataset uses either ZINC250k (Irwin and Shoichet, 2004) (for the experiments in Section 3.1) or ChEMBL (Gaulton et al., 2016).
Dataset Splits | Yes | For the ZINC250k dataset, we follow the train/validation/test splits in (Jin et al., 2019) and train on the train split.
Hardware Specification | Yes | Training is distributed over four V100 NVIDIA GPUs, each with 16GB memory... Inference uses a single V100 NVIDIA GPU with 16 GB memory... NVIDIA Quadro RTX 8000.
Software Dependencies | No | The paper mentions software components like Megatron, DeepSpeed, Apex, RDKit, Autodock-GPU, and Autodock software suite, but it does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Training is distributed over four V100 NVIDIA GPUs, each with 16GB memory, with a batch size of 256 samples on each GPU, for 50k iterations. The total training time is approximately 2 hours... for each input molecule, we set the maximum number of iterations to 1000 and sample 50 molecules at each iteration... we run the optimization for 80 iterations and sample 100 molecules at each iteration...
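
The figures quoted in the Experiment Setup entry imply a few derived quantities that are easy to check by hand. The following is a minimal sketch of that arithmetic, not code from the paper or its repository. The class and field names are hypothetical, and the global batch size assumes plain data-parallel training in which each GPU processes a distinct batch, which the quoted text does not state explicitly. The two sampling budgets are labeled generically because the excerpt elides which experiment each one belongs to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Training figures quoted in the Experiment Setup entry (names are hypothetical)."""
    num_gpus: int = 4
    per_gpu_batch_size: int = 256
    iterations: int = 50_000

    @property
    def global_batch_size(self) -> int:
        # Assumption: data parallelism, each GPU sees a distinct batch per iteration.
        return self.num_gpus * self.per_gpu_batch_size

    @property
    def samples_processed(self) -> int:
        return self.global_batch_size * self.iterations


@dataclass(frozen=True)
class SamplingBudget:
    """Upper bound on molecules sampled in one optimization run (names are hypothetical)."""
    max_iterations: int
    samples_per_iteration: int

    @property
    def max_molecules(self) -> int:
        return self.max_iterations * self.samples_per_iteration


training = TrainingSetup()
first_budget = SamplingBudget(max_iterations=1000, samples_per_iteration=50)
second_budget = SamplingBudget(max_iterations=80, samples_per_iteration=100)

print("global batch size:", training.global_batch_size)
print("training samples processed:", training.samples_processed)
print("first budget, max molecules per input:", first_budget.max_molecules)
print("second budget, max molecules per run:", second_budget.max_molecules)
```

Under these assumptions the global batch is 1,024 samples, training processes 51.2 million samples in total, and the two sampling budgets cap out at 50,000 and 8,000 molecules respectively. The first figure is an upper bound, since 1000 is stated as a maximum number of iterations.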