Multi-Scale Representation Learning for Protein Fitness Prediction

Authors: Zuobai Zhang, Pascal Notin, Yining Huang, Aurelie C. Lozano, Vijil Chenthamarakshan, Debora Marks, Payel Das, Jian Tang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our methods are rigorously evaluated on the comprehensive ProteinGym benchmark [31], which includes 217 substitution deep mutational scanning assays and over 2.4 million mutated sequences across more than 200 diverse protein families. Our experimental results show that S2F achieves results competitive with prior methods, while S3F reaches state-of-the-art performance after incorporating surface features. (A scoring sketch appears below the table.)
Researcher Affiliation | Collaboration | Zuobai Zhang (1,2,*), Pascal Notin (3,*), Yining Huang (3), Aurélie Lozano (5), Vijil Chenthamarakshan (5), Debora Marks (3,4), Payel Das (5), Jian Tang (1,6,7). *equal contribution; corresponding author. Affiliations: 1 Mila Québec AI Institute, 2 Université de Montréal, 3 Harvard Medical School, 4 Broad Institute, 5 IBM Research, 6 HEC Montréal, 7 CIFAR AI Chair.
Pseudocode | No | The paper describes computational steps (e.g., Geometric Message Passing in Section 3.3 and Surface Message Passing in Section 3.4) but does not include a formally labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | Our code is at https://github.com/DeepGraphLearning/S3F.
Open Datasets | Yes | Our models are pre-trained on a non-redundant subset of CATH v4.3.0 dataset (CC BY 4.0 license) [30], which contains 30,948 experimental structures with less than 40% sequence identity.
Dataset Splits | No | The paper states that models are pre-trained on the CATH dataset and evaluated on the ProteinGym benchmark, but it does not explicitly provide train/validation/test split details for the CATH dataset used during pre-training.
Hardware Specification | Yes | S2F and S3F are trained with batch sizes of 128 and 8, respectively, for 100 epochs on four A100 GPUs.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, such as the programming languages, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | Following the pre-training methodology in [58], we select 15% of residues at random for prediction. If the i-th residue is selected, we manipulate the i-th token by replacing it with: (1) the [MASK] token 80% of the time, (2) a random residue type 10% of the time, and (3) leaving the i-th token unchanged 10% of the time. Our models are pre-trained on a non-redundant subset of CATH v4.3.0 dataset (CC BY 4.0 license) [30], which contains 30,948 experimental structures with less than 40% sequence identity. S2F and S3F are trained with batch sizes of 128 and 8, respectively, for 100 epochs on four A100 GPUs. During pre-training, the weights of the ESM2-650M model are frozen, and only the GVP layers for structure and surface graphs are trainable.
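
To make the masking scheme quoted in the Experiment Setup row concrete, the following is a minimal sketch of the described 80/10/10 corruption, assuming PyTorch tensors of integer residue tokens. The token indices, function name, and shapes are illustrative assumptions, not the authors' implementation, and the ESM2/GVP encoders themselves are omitted.

import torch

AA_TYPES = 20   # number of standard residue types (illustrative)
MASK_IDX = 20   # index of the [MASK] token (illustrative)

def corrupt_residues(tokens, mask_ratio=0.15):
    # BERT-style corruption: select 15% of residues for prediction, then
    # 80% -> [MASK], 10% -> a random residue type, 10% -> left unchanged.
    # Returns the corrupted tokens and the boolean mask of selected positions.
    tokens = tokens.clone()
    selected = torch.rand(tokens.shape) < mask_ratio
    roll = torch.rand(tokens.shape)
    mask_pos = selected & (roll < 0.8)                  # replace with [MASK]
    rand_pos = selected & (roll >= 0.8) & (roll < 0.9)  # replace with a random type
    tokens[mask_pos] = MASK_IDX
    tokens[rand_pos] = torch.randint(0, AA_TYPES, (int(rand_pos.sum()),))
    return tokens, selected

# The quote also states that ESM2-650M stays frozen and only the GVP layers are
# trained; in PyTorch this corresponds to disabling gradients on the backbone:
#     for p in esm2.parameters():
#         p.requires_grad = False

Note that the remaining 10% of selected residues are left unchanged but stay in the prediction mask, so the loss is still computed at those positions, as in the original BERT-style objective.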
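
The Research Type row reports evaluation on ProteinGym's 217 substitution deep mutational scanning assays. As a minimal sketch of how a fitness predictor is typically scored on one such assay, the snippet below computes the Spearman rank correlation between model scores and measured fitness, the standard ProteinGym metric; the function names and data layout are assumptions rather than the paper's evaluation code.

from scipy.stats import spearmanr

def evaluate_assay(sequences, measured_fitness, model_score):
    # Rank correlation between predicted and experimentally measured fitness
    # for a single deep mutational scanning assay.
    predicted = [model_score(seq) for seq in sequences]
    rho, _ = spearmanr(predicted, measured_fitness)
    return rho

# Benchmark-level numbers are then typically averaged over all assays:
#     mean_rho = sum(evaluate_assay(*a) for a in assays) / len(assays)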