Predicting a Protein's Stability under a Million Mutations
Authors: Jeffrey Ouyang-Zhang, Daniel Diaz, Adam Klivans, Philipp Kraehenbuehl
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We trained on the Mega-Scale cDNA proteolysis dataset and achieved state-of-the-art performance on single and higher-order mutations on S669, ProTherm, and ProteinGym datasets. |
| Researcher Affiliation | Academia | Jeffrey Ouyang-Zhang UT Austin jozhang@utexas.edu Daniel J. Diaz UT Austin danny.diaz@utexas.edu Adam R. Klivans UT Austin klivans@cs.utexas.edu Philipp Krähenbühl UT Austin philkr@cs.utexas.edu |
| Pseudocode | No | The paper describes the method and illustrates it with figures but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/jozhang97/MutateEverything. |
| Open Datasets | Yes | We trained on the Mega-Scale cDNA proteolysis dataset... cDNA proteolysis [72] is a large-scale dataset containing mutant proteins with ΔΔG measurements. |
| Dataset Splits | Yes | We follow the train-val split introduced in [18] and additionally filter out proteins in our training set that are similar to those in our evaluation benchmarks. Specifically, we train on 116 proteins with 213,000 total mutations, of which 97,000 are double mutants and 117,000 are single mutants. We hold out cDNA2, a validation set of 18 mini-proteins with 22,000 total double mutations. |
| Hardware Specification | Yes | Training takes 6 hours on 3 A100 GPUs. |
| Software Dependencies | No | Our primary feature extractor is AlphaFold, implemented in OpenFold [3, 31]. The multiple sequence alignment (MSA) is computed using ColabFold [43]. Specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | We fine-tune a pre-trained backbone on single mutations for 20 epochs. Then, we fine-tune the model on both single and double mutations for 100 epochs using a cosine learning rate schedule with 10 warmup epochs. We use a batch size of 3 proteins due to the high memory requirements of AlphaFold. We use a learning rate of 3e-4 and weight decay of 0.5. |
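
The schedule in the Experiment Setup row can be sketched as a per-epoch learning rate function. This is a minimal illustration, not the authors' code. The peak rate (3e-4), the 100 epochs, and the 10 warmup epochs come from the row above. The linear shape of the warmup, the decay to zero, and the names `lr_at_epoch` and `schedule` are assumptions made for this example.

```python
import math

BASE_LR = 3e-4       # learning rate quoted in the Experiment Setup row
TOTAL_EPOCHS = 100   # second fine-tuning stage (single + double mutations)
WARMUP_EPOCHS = 10   # warmup epochs quoted in the Experiment Setup row


def lr_at_epoch(epoch, base_lr=BASE_LR, total=TOTAL_EPOCHS, warmup=WARMUP_EPOCHS):
    """Return the learning rate for a 0-indexed epoch (hypothetical helper)."""
    if epoch < warmup:
        # Assumption: linear ramp that reaches base_lr on the last warmup epoch.
        return base_lr * (epoch + 1) / warmup
    # Cosine decay from base_lr toward zero over the remaining epochs.
    # Assumption: the minimum learning rate is 0.
    progress = (epoch - warmup) / max(1, total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# One learning rate per epoch for the 100-epoch stage.
schedule = [lr_at_epoch(e) for e in range(TOTAL_EPOCHS)]
```

In a real training loop these values would typically be handed to the optimizer once per epoch. The source does not say whether the schedule is stepped per epoch or per iteration, so the per-epoch granularity here is also an assumption.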