Evaluating Representation Learning on the Protein Structure Universe

Authors: Arian Rokkum Jamasb, Alex Morehead, Chaitanya K. Joshi, Zuobai Zhang, Kieran Didi, Simon V Mathis, Charles Harris, Jian Tang, Jianlin Cheng, Pietro Lio, Tom Leon Blundell

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce ProteinWorkshop, a comprehensive benchmark suite for representation learning on protein structures with Geometric Graph Neural Networks. We consider large-scale pre-training and downstream tasks on both experimental and predicted structures to enable the systematic evaluation of the quality of the learned structural representations and their usefulness in capturing functional relationships for downstream tasks. We find that: (1) large-scale pretraining on AlphaFold structures and auxiliary tasks consistently improve the performance of both rotation-invariant and equivariant GNNs, and (2) more expressive equivariant GNNs benefit from pretraining to a greater extent compared to invariant models. We aim to establish a common ground for the machine learning and computational biology communities to rigorously compare and advance protein structure representation learning.
Researcher Affiliation | Academia | University of Cambridge, University of Missouri, Mila - Québec AI Institute
Pseudocode | No | The paper describes methods and processes in narrative text and tables but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | ProteinWorkshop is available at: github.com/a-r-j/ProteinWorkshop. Our open-source codebase reduces the barrier to entry for working with large protein structure datasets by providing: (1) storage-efficient dataloaders for large-scale structural databases including AlphaFold DB and ESM Atlas, as well as (2) utilities for constructing new tasks from the entire PDB.
Open Datasets | Yes | We provide the dataset derived from CATH 4.2 40% (Knudsen & Wiuf, 2010) non-redundant chains developed by Ingraham et al. (2019) as an additional, smaller, pretraining dataset. A preprocessed version of the dataset can be downloaded from the benchmark's Zenodo data record.
Dataset Splits | Yes | Table 1: Overview of supervised tasks and datasets. Task: Inverse Folding, Dataset Origin: Ingraham et al. (2019), Experimental. # Train: 105 K, # Validation: 180 K, # Test: 180 K.
Hardware Specification | Yes | All models are trained on 80GB NVIDIA A100 GPUs. All baseline and fine-tuning experiments are run on a single GPU, while pre-training uses four GPUs.
Software Dependencies | No | The benchmark is developed using PyTorch (Paszke et al., 2019), PyTorch Geometric (Fey & Lenssen, 2019), PyTorch Lightning (Falcon, 2019), and Graphein (Jamasb et al., 2022). Experiment configuration is performed using Hydra (Yadan, 2019). Certain architectures introduce additional dependencies, such as TorchDrug (Zhu et al., 2022) and e3nn (Geiger & Smidt, 2022). The paper lists software dependencies but does not specify their version numbers.
Experiment Setup | Yes | Training. As we are interested in benchmarking large-scale datasets and models, we try to consistently use six layers for all models, each with 512 hidden channels. For equivariant GNNs, we reduced the number of layers and hidden channels to fit within the 80GB of GPU memory of a single NVIDIA A100 GPU. For downstream tasks, we set the maximum number of epochs to 150 and use the Adam optimizer with a batch size of 32 and a ReduceLROnPlateau learning rate scheduler, monitoring the validation metric with a patience of 5 epochs and a reduction factor of 0.6. See Appendix D.10 for details on hyperparameter tuning for optimal learning rates and dropout for each architecture. We train models to convergence, monitoring the validation metric and performing early stopping with a patience of 10 epochs. Pretraining is performed for 10 epochs using a linear warm-up with cosine schedule. We report standard deviations over three runs across three random seeds.
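
The "Open Source Code" row highlights storage-efficient dataloaders for AlphaFold DB and ESM Atlas scale data. The snippet below is a minimal illustrative sketch of that idea, not the ProteinWorkshop API: it streams pre-processed per-chain tensors from disk into PyTorch Geometric graphs, and the directory name, file layout, and tensor keys (`residues`, `coords`) are assumptions.

```python
# Illustrative sketch only (not the ProteinWorkshop API): lazily load
# pre-processed protein chains from disk so the full database never
# has to sit in memory at once.
from pathlib import Path

import torch
from torch_geometric.data import Data
from torch_geometric.loader import DataLoader


class ProteinGraphDataset(torch.utils.data.Dataset):
    """One pre-processed structure per .pt file; loaded only when indexed."""

    def __init__(self, root: str):
        # Assumed layout: <root>/<chain_id>.pt, each a dict of tensors.
        self.paths = sorted(Path(root).glob("*.pt"))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> Data:
        record = torch.load(self.paths[idx])
        # Hypothetical keys: residue-type features and C-alpha coordinates.
        return Data(x=record["residues"], pos=record["coords"])


# PyG's DataLoader collates individual Data objects into a single graph batch.
loader = DataLoader(ProteinGraphDataset("data/afdb_subset"), batch_size=32, shuffle=True)
```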
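
Because the "Software Dependencies" row notes that package versions are not specified, a small script such as the following can record the versions actually installed in a reproduction environment. It is a generic sketch; the distribution names are assumed to be the usual PyPI names and may differ from how a particular environment installed them.

```python
# Sketch: print the installed versions of the dependencies named in the paper.
# Distribution names are assumed to be the standard PyPI names.
from importlib.metadata import PackageNotFoundError, version

DEPENDENCIES = [
    "torch",              # PyTorch
    "torch-geometric",    # PyTorch Geometric
    "pytorch-lightning",  # PyTorch Lightning
    "graphein",           # Graphein
    "hydra-core",         # Hydra
    "torchdrug",          # TorchDrug (architecture-specific)
    "e3nn",               # e3nn (architecture-specific)
]

for name in DEPENDENCIES:
    try:
        print(f"{name}=={version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")
```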
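
The downstream-task training recipe in the "Experiment Setup" row maps directly onto standard PyTorch components. The sketch below mirrors those hyperparameters (Adam, batch size 32, ReduceLROnPlateau with factor 0.6 and patience 5, at most 150 epochs, early stopping with patience 10); the model, training pass, validation metric, and learning rate are placeholders, since the paper tunes learning rates and dropout per architecture (Appendix D.10).

```python
# Sketch of the downstream-task optimisation settings described above.
# The model, the training pass, and the validation metric are placeholders.
import torch

model = torch.nn.Linear(512, 512)  # stand-in for a 6-layer, 512-channel GNN encoder + head
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is architecture-specific (Appendix D.10)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.6, patience=5  # reduce LR when the validation metric plateaus
)

best_metric, epochs_since_best = float("-inf"), 0
for epoch in range(150):  # maximum of 150 epochs
    # ... one training epoch over mini-batches of 32 graphs goes here ...
    val_metric = 0.0  # placeholder for the task's validation metric
    scheduler.step(val_metric)
    if val_metric > best_metric:
        best_metric, epochs_since_best = val_metric, 0
    else:
        epochs_since_best += 1
        if epochs_since_best >= 10:  # early stopping with a patience of 10 epochs
            break
```

For pre-training, the paper instead runs 10 epochs with a linear warm-up followed by a cosine schedule, which would replace the plateau scheduler above.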