ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers
Authors: Pascal Notin, Ruben Weitzman, Debora Marks, Yarin Gal
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first focus on the supervised fitness prediction setting and develop several cross-validation schemes which support robust performance assessment. We subsequently reimplement prior top-performing baselines, introduce several extensions of these baselines by integrating diverse branches of the protein engineering literature, and demonstrate that ProteinNPT consistently outperforms all of them across a diverse set of protein property prediction tasks. Finally, we demonstrate the value of our approach for iterative protein design across extensive in silico Bayesian optimization and conditional sampling experiments. |
| Researcher Affiliation | Academia | Pascal Notin (Computer Science, University of Oxford); Ruben Weitzman (Computer Science, University of Oxford); Debora S. Marks (Harvard Medical School; Broad Institute); Yarin Gal (Computer Science, University of Oxford) |
| Pseudocode | Yes | Algorithm 1 Iterative protein redesign. 1: Input: Initial labeled data D_L; initial unlabeled data D_U; batch size B; batch set S = ∅; acquisition function α(x; λ). 2: for t = 1, 2, …, 10 do 3: Train model on D_L 4: for b = 1, 2, …, B do 5: x_new = arg max_{x ∈ D_U} α(x; λ) 6: Obtain label y_new for x_new 7: S ← S ∪ {(x_new, y_new)} 8: D_U ← D_U \ {x_new} 9: end for 10: D_L ← D_L ∪ S, S ← ∅ 11: end for 12: Output: D_L, model trained on D_L. (A runnable sketch of this loop is given below the table.) |
| Open Source Code | Yes | The entire codebase to train and evaluate ProteinNPT and the various baselines considered in this work is available on our GitHub repository (https://github.com/OATML-Markslab/ProteinNPT) under the MIT license. |
| Open Datasets | Yes | We conducted our experiments on ProteinGym [Notin et al., 2022a], which contains an extensive set of Deep Mutational Scanning (DMS) assays covering a wide range of functional properties (e.g., thermostability, ligand binding, viral replication, drug resistance). |
| Dataset Splits | Yes | We develop 3 distinct cross-validation schemes to assess the ability of each model to extrapolate to positions not encountered during training. In the Random scheme, commonly used in other supervised fitness prediction benchmarks [Rao et al., 2019, Dallago et al., 2022], each mutation is randomly allocated to one of five distinct folds. In the Contiguous scheme, the sequence is split into five contiguous segments along its length, with mutations assigned to each segment based on the position at which they occur in the sequence. Lastly, the Modulo scheme uses the modulo operator to assign mutated positions to each fold. For example, position 1 is assigned to fold 1, position 2 to fold 2, and so on, looping back to fold 1 at position 6. For each assay in ProteinGym and cross-validation scheme introduced in Section 4.1, we perform a 5-fold cross-validation, selecting the first 4 folds for training and using the remaining one as the test set. (See the fold-assignment sketch below the table.) |
| Hardware Specification | Yes | All experiments carried out in this work were conducted in PyTorch, on A100 GPUs with either 40GB or 80GB of GPU memory. |
| Software Dependencies | No | The paper names PyTorch as the framework used, but does not provide specific version numbers for PyTorch or any other key software dependencies. |
| Experiment Setup | Yes | Table 4: ProteinNPT architecture details: Nb. ProteinNPT layers 5, Embedding dimension (d) 200, Feedforward embedding dimension 400, Nb. attention heads 4, CNN kernel size (post embedding) 7, Weight decay 5·10⁻³, Dropout 0.0. Table 5: ProteinNPT training details: Training steps 10k, Learning rate warmup steps 100, Peak learning rate 3·10⁻⁴, Optimizer AdamW, Gradient clipping norm 1.0, Learning rate schedule Cosine, Training batch (masked) 64, Training batch (unmasked) 361. (See the optimizer/scheduler sketch below the table.) |
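
The acquisition loop in Algorithm 1 maps onto a few lines of Python. The sketch below is a minimal rendering under our own naming: `train_model`, `acquisition`, and `oracle` are hypothetical placeholders for the model fit, the acquisition function α(x; λ), and the labeling step, not functions from the ProteinNPT repository.

```python
# Minimal sketch of Algorithm 1 (iterative protein redesign).
# `train_model`, `acquisition`, and `oracle` are assumed user-supplied
# callables; names here are illustrative, not the paper's implementation.
def iterative_redesign(labeled, unlabeled, train_model, acquisition, oracle,
                       batch_size=100, n_rounds=10):
    """labeled: list of (x, y) pairs; unlabeled: list of candidate sequences."""
    model = None
    for _ in range(n_rounds):
        model = train_model(labeled)          # retrain on all labels acquired so far
        batch = []
        for _ in range(batch_size):
            # Pick the unlabeled candidate maximizing the acquisition score alpha(x; lambda)
            x_new = max(unlabeled, key=lambda x: acquisition(model, x))
            y_new = oracle(x_new)             # e.g., a wet-lab assay or in-silico simulator
            batch.append((x_new, y_new))
            unlabeled.remove(x_new)           # D_U <- D_U \ {x_new}
        labeled.extend(batch)                 # D_L <- D_L U S; S <- empty set
    return labeled, model
```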
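The three cross-validation schemes reduce to a single fold-assignment rule per mutation. The sketch below assumes each mutation is keyed by the 1-indexed sequence position it mutates; `assign_fold` and its arguments are our own names (folds are 0-indexed here), not taken from the paper's codebase.

```python
# Illustrative fold assignment for the Random / Contiguous / Modulo schemes.
import random

def assign_fold(position, seq_len, scheme, n_folds=5, rng=random):
    """position: 1-indexed mutated position; returns a 0-indexed fold id."""
    if scheme == "random":
        # Each mutation is allocated uniformly at random to one of the folds
        return rng.randrange(n_folds)
    if scheme == "contiguous":
        # Split the sequence into n_folds contiguous, near-equal-length segments
        segment = -(-seq_len // n_folds)      # ceiling division
        return min((position - 1) // segment, n_folds - 1)
    if scheme == "modulo":
        # Position 1 -> fold 0, ..., position 5 -> fold 4, position 6 -> fold 0, ...
        return (position - 1) % n_folds
    raise ValueError(f"unknown scheme: {scheme!r}")
```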
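The optimization settings in Tables 4-5 (AdamW, 100 linear warmup steps, peak learning rate 3·10⁻⁴, cosine decay over 10k steps, gradient clipping at norm 1.0) correspond to a standard PyTorch setup along the lines below. This is a sketch, not the repository's exact code; the `Linear` module merely stands in for the full ProteinNPT model.

```python
# Sketch of the training hyperparameters from Tables 4-5 wired into PyTorch.
import math
import torch

model = torch.nn.Linear(200, 1)   # placeholder for the ProteinNPT model (d = 200)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=5e-3)

total_steps, warmup_steps = 10_000, 100

def lr_lambda(step):
    if step < warmup_steps:                   # linear warmup over the first 100 steps
        return step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, gradients are clipped to norm 1.0 before each step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```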