Generating Novel Leads for Drug Discovery Using LLMs with Logical Feedback
Authors: Shreyas Bhat Brahmavar, Ashwin Srinivasan, Tirtharaj Dash, Sowmya Ramaswamy Krishnan, Lovekesh Vig, Arijit Roy, Raviprasad Aduri
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate LMLF using two well-known targets (inhibition of Janus Kinase 2 and Dopamine Receptor D2) and two different LLMs (GPT-3 and PaLM). We show that LMLF, starting with the same logical constraints and query text, can guide both LLMs to generate potential leads. We find: (a) Binding affinities of LMLF-generated molecules are skewed towards higher binding affinities than those from existing baselines; (b) LMLF results in generating molecules that are skewed towards higher binding affinities than without logical feedback; (c) Assessment by a computational chemist suggests that LMLF-generated compounds may be novel inhibitors. |
| Researcher Affiliation | Collaboration | 1 Department of Electrical and Electronics Engineering, BITS Pilani, Goa Campus, India 2 Department of Computer Science, BITS Pilani, Goa Campus, India 3 Department of Pediatrics, University of California, San Diego, USA 4 TCS Innovation Labs (Life Sciences Division), India 5 TCS Research, India 6 Department of Biological Sciences, BITS Pilani, Goa Campus, India |
| Pseudocode | Yes | Procedure 1: Incremental sampling from an LLM's conditional distribution using iterative constraint-based labelling and constraint generalisation. Input: L: an LLM; B_0: background knowledge, which contains a sample D_0 of labelled instances; C_0: a logical formula representing constraints; Q: a query; k: an upper-bound on the number of iterations; and n: an upper-bound on the number of samples. Output: a set of instances. 1: j := 1; 2: while (j ≤ k and D_{j-1} ≠ ∅) do; 3: P_j := Assemble_Prompt(B_{j-1}, C_{j-1}, Q); 4: E_j := Sample(n, L, P_j); 5: D_j := {(e, l) : e ∈ E_j and l = Satisfies(e, B_{j-1}, C_{j-1})}; 6: B_j := Update_Back(B_{j-1}, D_j); 7: C_j := Generalise_Constraint(B_j, C_{j-1}); 8: j := j + 1; 9: end while; 10: return D_j |
| Open Source Code | Yes | The code for PyLMLF can be found at: https://github.com/Shreyas-Bhat/LMLF. |
| Open Datasets | Yes | We conduct our evaluations on JAK2, with 4100 molecules provided with labels (3700 active), and DRD2 (4070 labelled molecules, of which 3670 are active). These datasets were collected from ChEMBL (Gaulton et al. 2012) and selected based on their IC50 values and docking scores with active JAK2 and DRD2 proteins less than 7.0. |
| Dataset Splits | No | The paper describes the datasets used (JAK2, DRD2) and their collection, but it does not explicitly provide details about how these datasets were split into training, validation, or test sets. |
| Hardware Specification | Yes | All the experiments are conducted using a Linux (Ubuntu) based workstation with 64GB of main memory and 16-core Intel Xeon 3.10GHz processors. |
| Software Dependencies | Yes | All the implementations are in Python3, with API calls to the respective model engines for GPT-3.0 and PaLM. We use RDKit (version: 2022.9.5) for computing molecular properties and GNINA 1.0 for computing docking scores (binding affinities) of molecules. |
| Experiment Setup | Yes | We make API calls to text-davinci-003 for GPT-3.0 and text-bison-001 for PaLM. For both LLMs, temperature is set to 0.7. The upper-bound on the number of iterations (k in Procedure 1) is 10. In our constraint C, we use a threshold of 7 on binding affinity for the first 5 iterations and 8 for the next 5 iterations. |
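The iterative loop of Procedure 1 can be sketched in Python. This is a minimal, hypothetical sketch, not the authors' PyLMLF implementation: the helper functions (`assemble_prompt`, `sample`, `satisfies`), the toy "LLM" that returns random scores, and the default constraint generaliser are all assumptions introduced for illustration. In the paper, `Satisfies` checks a logical constraint such as a binding-affinity threshold, and sampling is an API call to GPT-3.0 or PaLM.

```python
import random

def assemble_prompt(background, constraint_text, query):
    # Combine the query, current constraint text and labelled examples
    # into a prompt string (stand-in for the paper's Assemble_Prompt).
    return f"{query}\nConstraint: {constraint_text}\nExamples: {background['examples']}"

def sample(n, llm, prompt):
    # Stand-in for n API calls to an LLM engine (the paper's Sample).
    return [llm(prompt) for _ in range(n)]

def satisfies(instance, constraint):
    # Label an instance True/False against the logical constraint,
    # e.g. binding affinity above a threshold (the paper's Satisfies).
    return constraint(instance)

def lmlf(llm, background, constraint, constraint_text, query,
         k=10, n=5, generalise=lambda bg, c: c):
    """Incremental sampling with iterative constraint-based labelling
    (a sketch of Procedure 1). Returns the final labelled sample D_j."""
    labelled = background["examples"]              # D_0
    for _ in range(k):                             # j = 1 .. k
        if not labelled:                           # stop if D_{j-1} is empty
            break
        prompt = assemble_prompt(background, constraint_text, query)
        candidates = sample(n, llm, prompt)        # E_j
        labelled = [(e, satisfies(e, constraint))
                    for e in candidates]           # D_j
        background["examples"] = labelled          # Update_Back
        constraint = generalise(background, constraint)
    return labelled

# Toy usage: "molecules" are random affinity scores; the constraint is the
# paper's first-phase threshold of 7 on binding affinity.
random.seed(0)
toy_llm = lambda prompt: random.uniform(0, 10)
result = lmlf(toy_llm,
              background={"examples": [(6.5, False)]},
              constraint=lambda x: x > 7.0,
              constraint_text="binding affinity > 7.0",
              query="Generate molecules with high binding affinity",
              k=3, n=4)
print(len(result))  # prints 4: the labelled candidates from the final iteration
```

The paper's two-phase schedule (threshold 7 for iterations 1-5, then 8 for 6-10) would correspond to a `generalise` function that tightens the constraint after iteration 5, rather than the identity default used here.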