Knowledge Enhanced Representation Learning for Drug Discovery
Authors: Thanh Lam Hoang, Marco Luca Sbodio, Marcos Martinez Galindo, Mykhaylo Zayats, Raul Fernandez-Diaz, Victor Valls, Gabriele Picco, Cesar Berrospi, Vanessa Lopez
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our study reveals that enhanced representations, derived from multimodal knowledge graphs describing relations among molecules and proteins, lead to state-of-the-art results in well-established benchmarks (first place in the leaderboard for Therapeutics Data Commons benchmark Drug-Target Interaction Domain Generalization Benchmark, with an improvement of 8 points with respect to previous best result). Moreover, our results significantly surpass those achieved in standard benchmarks by using conventional pre-trained representations that rely only on sequence or SMILES data. We release our multimodal knowledge graphs, integrating data from seven public data sources, and which contain over 30 million triples. Pretrained models from our proposed graphs and benchmark task source code are also released. |
| Researcher Affiliation | Collaboration | (1) IBM Research, Dublin research lab, Dublin, Ireland; (2) IBM Research, Zurich research lab, Zurich, Switzerland; (3) University College Dublin, Ireland |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Methods are described in prose. |
| Open Source Code | Yes | We release our multimodal knowledge graphs, integrating data from seven public data sources, and which contain over 30 million triples. Pretrained models from our proposed graphs and benchmark task source code are also released. https://github.com/IBM/otter-knowledge |
| Open Datasets | Yes | Uniprot (Consortium 2022) comprises 573,227 SwissProt proteins (from curated UniProt subset). The UBC KG combines all UniProt proteins with diverse attributes, including sequence (567,483 entries), full name, organism, protein family, function, catalytic activity, pathways and length. The KG also features 38,665 target of edges linking UniProt IDs to ChEMBL and Drugbank IDs, along with 196,133 interactants connecting UniProt protein IDs. |
| Dataset Splits | Yes | Downstream benchmarks: To evaluate the performance of the proposed approach on the drug-target binding affinity prediction task we use three datasets: DTI DG, DAVIS and KIBA, which are available from the TDC (Huang et al. 2022) benchmark. The DTI DG dataset features a leaderboard with the state-of-the-art metrics reported for different methods. The dataset's temporal split, based on patent application dates, makes this dataset suitable for evaluating method generalization. In contrast, the DAVIS and KIBA datasets employ random splits, including two additional splits based on target or drug. These latter splits assess learning methods with new drugs/proteins. |
| Hardware Specification | No | The paper mentions "scalable parallel and GPU-based computation" and "GPU utilization" but does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for the experiments. |
| Software Dependencies | No | The paper mentions software components like "RDKit", "MolFormer", "esm1b_t33_650M_UR50S model", "sentence-transformers/paraphrase-albert-small-v2", and "Metis", but it does not specify any version numbers for these software dependencies. |
| Experiment Setup | Yes | GNN hyperparameter values are detailed in the Appendix. |
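
The Dataset Splits row distinguishes three kinds of split: random, entity-based (new drugs or new targets), and temporal. The following is a minimal sketch of how those three differ, written in plain Python on toy data. It is not the paper's code and does not use the TDC API; the record fields (`drug`, `target`, `year`, `affinity`), the function names, and the cutoff year are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy records: drug ID, target ID, patent year, affinity label.
records = [
    {"drug": "D1", "target": "T1", "year": 2013, "affinity": 5.1},
    {"drug": "D1", "target": "T2", "year": 2015, "affinity": 6.3},
    {"drug": "D2", "target": "T1", "year": 2017, "affinity": 7.0},
    {"drug": "D2", "target": "T3", "year": 2019, "affinity": 4.8},
    {"drug": "D3", "target": "T2", "year": 2020, "affinity": 6.9},
    {"drug": "D3", "target": "T3", "year": 2021, "affinity": 5.5},
]


def random_split(rows, frac=0.8, seed=0):
    """Shuffle individual rows; drugs and targets may appear on both sides."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * frac)
    return rows[:cut], rows[cut:]


def cold_split(rows, key, frac=0.8, seed=0):
    """Hold out whole entities (key is 'drug' or 'target') so that every
    test entity is unseen during training."""
    entities = sorted({r[key] for r in rows})
    random.Random(seed).shuffle(entities)
    cut = int(len(entities) * frac)
    train_entities = set(entities[:cut])
    train = [r for r in rows if r[key] in train_entities]
    test = [r for r in rows if r[key] not in train_entities]
    return train, test


def temporal_split(rows, cutoff_year):
    """Train on records before the cutoff year, test on the rest."""
    train = [r for r in rows if r["year"] < cutoff_year]
    test = [r for r in rows if r["year"] >= cutoff_year]
    return train, test
```

A random split measures in-distribution performance, while the entity-based and temporal variants probe generalization to unseen drugs, unseen proteins, or later time periods, which is the distinction the row above draws between DAVIS/KIBA and DTI DG.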