Prot2Text: Multimodal Protein’s Function Generation with GNNs and Transformers

Authors: Hadi Abdine, Michail Chatzianastasis, Costas Bouyioukos, Michalis Vazirgiannis

AAAI 2024

Reproducibility assessment (each entry gives the variable, the result, and the supporting LLM response quoted from the paper):
Research Type: Experimental
"To evaluate our model, we extracted a multimodal protein dataset from SwissProt, and demonstrate empirically the effectiveness of Prot2Text. These results highlight the transformative impact of multimodal models, specifically the fusion of GNNs and LLMs, empowering researchers with powerful tools for more accurate function prediction of existing as well as first-to-see proteins."
Researcher Affiliation: Academia
"Hadi Abdine1, Michail Chatzianastasis1, Costas Bouyioukos2,3, Michalis Vazirgiannis1. 1Laboratoire d'Informatique (LIX), École Polytechnique, Institut Polytechnique de Paris, Palaiseau, France; 2Epigenetics and Cell Fate, CNRS UMR7216, Université Paris Cité, F-75013 Paris, France; 3Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus."
Pseudocode: No
The paper does not contain a clearly labeled pseudocode block or algorithm.
Open Source Code: Yes
"Code, data and models are publicly available" (https://github.com/hadi-abdine/Prot2Text).
Open Datasets: Yes
"To train the Prot2Text framework using protein structures, sequences and textual descriptions, we build a multimodal dataset with 256,690 proteins. For each protein, we have three crucial pieces of information: the corresponding sequence, the AlphaFold accession ID and the textual description. To build this dataset, we used the SwissProt database (Bairoch and Apweiler 1996), the only curated protein knowledge base with full protein textual descriptions, included in UniProtKB (Consortium 2016) Release 2022_04. ... We further release this curated dataset to the public, allowing other researchers to use it for benchmarking and further advancements in the field."
Dataset Splits: Yes
"(5) Apply the CD-HIT clustering algorithm (Li and Godzik 2006) to create a train/validation/test scheme with 248,315, 4,172 and 4,203 proteins respectively. The maximum similarity threshold between the (train, validation, test) sets used in the CD-HIT algorithm is 40%."
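The cluster-based split quoted above can be sketched as follows. This is an illustrative sketch, not the authors' code: the `cluster_of` mapping (protein ID to CD-HIT cluster ID) and the split fractions are assumptions; the point is that whole clusters are assigned to a single split, so no two proteins above the 40% similarity threshold cross split boundaries.

```python
import random

def split_by_cluster(cluster_of, train_frac=0.96, val_frac=0.02, seed=0):
    """Assign whole CD-HIT clusters to train/validation/test splits.

    cluster_of: dict mapping protein ID -> cluster ID (hypothetical input;
    in practice it would be parsed from CD-HIT's .clstr output file).
    """
    # Group proteins by cluster so similar sequences stay together.
    clusters = {}
    for protein, cluster_id in cluster_of.items():
        clusters.setdefault(cluster_id, []).append(protein)

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    n_train = int(train_frac * len(cluster_ids))
    n_val = int(val_frac * len(cluster_ids))

    splits = {"train": [], "validation": [], "test": []}
    for i, cid in enumerate(cluster_ids):
        if i < n_train:
            key = "train"
        elif i < n_train + n_val:
            key = "validation"
        else:
            key = "test"
        # Whole cluster goes into one split, never partially.
        splits[key].extend(clusters[cid])
    return splits
```

Because clusters are assigned atomically, the resulting split sizes are only approximately proportional to the requested fractions, which is consistent with the uneven 248,315 / 4,172 / 4,203 counts reported in the paper.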
Hardware Specification: Yes
"We implemented all the models using PyTorch and utilized 64 NVIDIA V100 GPUs for training. ... The inference time is computed during text generation using two NVIDIA RTX 6000 GPUs with 48GB memory in parallel and a batch size of four per device."
Software Dependencies: No
"We implemented all the models using PyTorch and utilized 64 NVIDIA V100 GPUs for training. ... All experiments were carried out using the Hugging Face transformers library (Wolf et al. 2020)." The paper mentions software such as PyTorch and the Hugging Face transformers library but does not specify version numbers.
Experiment Setup: Yes
"We used the AdamW optimizer (Loshchilov and Hutter 2019) with ε = 10^-6, β1 = 0.9, β2 = 0.999, with a learning rate starting from 2·10^-4 and decreasing to zero using a cosine scheduler. We used a warm-up of 6% of the total training steps. We fixed the batch size to four per GPU and we trained the models for 25 epochs. For the GNN encoder, we used 6 layers with a hidden size equal to GPT-2's hidden size (768 for the base model of GPT-2) in each layer."
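The learning-rate schedule described above (linear warm-up over the first 6% of steps, then cosine decay from 2·10^-4 to zero) can be sketched in plain Python. This is a minimal sketch of the schedule's shape under those stated hyperparameters; the paper does not show the authors' actual implementation, which presumably uses a library scheduler, and the total step count here is illustrative.

```python
import math

def cosine_lr_with_warmup(step, total_steps, base_lr=2e-4, warmup_frac=0.06):
    """Learning rate at a given step: linear warm-up, then cosine decay to zero."""
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warm-up phase.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The schedule peaks at exactly 2·10^-4 at the end of warm-up and reaches zero at the final step, matching the "decreasing to zero using a cosine scheduler" description.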