Multi-Scale Representation Learning for Protein Fitness Prediction
Authors: Zuobai Zhang, Pascal Notin, Yining Huang, Aurelie C. Lozano, Vijil Chenthamarakshan, Debora Marks, Payel Das, Jian Tang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our methods are rigorously evaluated on the comprehensive ProteinGym benchmark [31], which includes 217 substitution deep mutational scanning assays and over 2.4 million mutated sequences across more than 200 diverse protein families. Our experimental results show that S2F achieves results competitive with prior methods, while S3F reaches state-of-the-art performance after incorporating surface features. (A minimal per-assay evaluation sketch is given below the table.) |
| Researcher Affiliation | Collaboration | Zuobai Zhang (1,2,*), Pascal Notin (3,*), Yining Huang (3), Aurélie Lozano (5), Vijil Chenthamarakshan (5), Debora Marks (3,4), Payel Das (5,†), Jian Tang (1,6,7,†). *Equal contribution, †corresponding author. 1 Mila - Québec AI Institute, 2 Université de Montréal, 3 Harvard Medical School, 4 Broad Institute, 5 IBM Research, 6 HEC Montréal, 7 CIFAR AI Chair |
| Pseudocode | No | The paper describes computational steps (e.g., Geometric Message Passing in Section 3.3 and Surface Message Passing in Section 3.4) but does not include a formally labeled 'Pseudocode' or 'Algorithm' block. (A rough message-passing sketch is given below the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/DeepGraphLearning/S3F. |
| Open Datasets | Yes | Our models are pre-trained on a non-redundant subset of CATH v4.3.0 dataset (CC BY 4.0 license) [30], which contains 30,948 experimental structures with less than 40% sequence identity. |
| Dataset Splits | No | The paper states that models are pre-trained on the CATH dataset and evaluated on the Protein Gym benchmark, but does not explicitly provide details about train/validation/test splits for the CATH dataset during its own model training. |
| Hardware Specification | Yes | S2F and S3F are trained with batch sizes of 128 and 8, respectively, for 100 epochs on four A100 GPUs. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, such as programming languages, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | Following the pre-training methodology in [58], we select 15% of residues at random for prediction. If the i-th residue is selected, we replace the i-th token with (1) the [MASK] token 80% of the time, (2) a random residue type 10% of the time, or (3) leave it unchanged 10% of the time. Our models are pre-trained on a non-redundant subset of the CATH v4.3.0 dataset (CC BY 4.0 license) [30], which contains 30,948 experimental structures with less than 40% sequence identity. S2F and S3F are trained with batch sizes of 128 and 8, respectively, for 100 epochs on four A100 GPUs. During pre-training, the weights of the ESM2-650M model are frozen, and only the GVP layers for structure and surface graphs are trainable. (A hedged sketch of the masking step is given below the table.) |
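
The 15% / 80-10-10 residue masking described in the Experiment Setup row follows standard BERT-style denoising. Below is a minimal sketch of that corruption step in PyTorch, assuming integer residue tokens; the `MASK_IDX` and `NUM_RESIDUE_TYPES` constants and the `-100` ignore index are illustrative assumptions, not the authors' implementation.

```python
import torch

# Illustrative constants; the real values depend on the ESM2 alphabet (assumption).
NUM_RESIDUE_TYPES = 20   # standard amino acids
MASK_IDX = 20            # index of the [MASK] token

def corrupt_residues(tokens: torch.Tensor,
                     select_prob: float = 0.15,
                     mask_prob: float = 0.8,
                     random_prob: float = 0.1):
    """BERT-style corruption: select 15% of residues for prediction, then
    mask 80% of them, randomize 10%, and leave 10% unchanged. Returns the
    corrupted sequence and the targets (-100 where no loss is taken)."""
    tokens = tokens.clone()
    targets = torch.full_like(tokens, -100)

    # 1) choose which residues are predicted
    selected = torch.rand_like(tokens, dtype=torch.float) < select_prob
    targets[selected] = tokens[selected]

    # 2) of the selected residues, 80% -> [MASK]
    masked = selected & (torch.rand_like(tokens, dtype=torch.float) < mask_prob)
    tokens[masked] = MASK_IDX

    # 3) another 10% -> random residue type (half of the remaining 20%)
    rand_frac = random_prob / (1.0 - mask_prob)
    randomized = selected & ~masked & (
        torch.rand_like(tokens, dtype=torch.float) < rand_frac)
    tokens[randomized] = torch.randint_like(tokens[randomized], NUM_RESIDUE_TYPES)

    # the remaining 10% of selected residues are left unchanged
    return tokens, targets
```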
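
The Pseudocode row notes that geometric and surface message passing (Sections 3.3 and 3.4) are described only in prose. The sketch below illustrates one round of geometric message passing on a k-nearest-neighbor residue graph using scalar features and an RBF distance encoding; it is a simplified stand-in, since the paper's layers are GVPs that also propagate vector channels, and all layer sizes, the neighbor count `k`, and the RBF range are assumptions.

```python
import torch
import torch.nn as nn

class SimpleGeometricMessagePassing(nn.Module):
    """Simplified sketch of one geometric message-passing round on a k-NN
    residue graph. Node features are scalar embeddings; each edge carries an
    RBF encoding of the pairwise C-alpha distance. The actual S2F/S3F layers
    are GVPs, which additionally carry vector features (omitted here)."""

    def __init__(self, node_dim: int = 128, num_rbf: int = 16, k: int = 30):
        super().__init__()
        self.k = k
        self.num_rbf = num_rbf
        self.message_mlp = nn.Sequential(
            nn.Linear(2 * node_dim + num_rbf, node_dim), nn.ReLU(),
            nn.Linear(node_dim, node_dim))
        self.update_mlp = nn.Sequential(
            nn.Linear(2 * node_dim, node_dim), nn.ReLU())

    def rbf(self, d: torch.Tensor) -> torch.Tensor:
        # radial basis encoding of distances in [0, 20] angstroms (assumed range)
        centers = torch.linspace(0.0, 20.0, self.num_rbf, device=d.device)
        return torch.exp(-((d.unsqueeze(-1) - centers) ** 2) / 4.0)

    def forward(self, h: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # h: [N, node_dim] residue embeddings; coords: [N, 3] C-alpha coordinates
        dist = torch.cdist(coords, coords)                          # [N, N]
        knn = dist.topk(self.k + 1, largest=False).indices[:, 1:]   # drop self edge
        neighbor_h = h[knn]                                         # [N, k, node_dim]
        neighbor_d = torch.gather(dist, 1, knn)                     # [N, k]
        msg_in = torch.cat(
            [h.unsqueeze(1).expand(-1, self.k, -1), neighbor_h,
             self.rbf(neighbor_d)], dim=-1)
        messages = self.message_mlp(msg_in).mean(dim=1)             # aggregate over neighbors
        return self.update_mlp(torch.cat([h, messages], dim=-1))
```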
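
ProteinGym performance is conventionally reported as the Spearman rank correlation between model scores and experimental DMS measurements, computed per assay and then averaged. The snippet below is a usage sketch of that aggregation with pandas and SciPy; the column names `assay_id`, `model_score`, and `dms_score`, and the file `predictions.csv`, are hypothetical and do not reflect the benchmark's actual file layout.

```python
import pandas as pd
from scipy.stats import spearmanr

def per_assay_spearman(scores: pd.DataFrame) -> pd.Series:
    """Spearman correlation between predicted and measured fitness,
    computed separately for each DMS assay.

    Expects columns (assumed names): 'assay_id', 'model_score', 'dms_score'."""
    def one_assay(group: pd.DataFrame) -> float:
        rho, _ = spearmanr(group["model_score"], group["dms_score"])
        return rho

    return scores.groupby("assay_id").apply(one_assay)

# Example usage with a hypothetical predictions file:
# df = pd.read_csv("predictions.csv")
# per_assay = per_assay_spearman(df)
# print(per_assay.mean())   # benchmark-level average Spearman
```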