ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers

Authors: Pascal Notin, Ruben Weitzman, Debora Marks, Yarin Gal

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We first focus on the supervised fitness prediction setting and develop several cross-validation schemes which support robust performance assessment. We subsequently reimplement prior top-performing baselines, introduce several extensions of these baselines by integrating diverse branches of the protein engineering literature, and demonstrate that ProteinNPT consistently outperforms all of them across a diverse set of protein property prediction tasks. Finally, we demonstrate the value of our approach for iterative protein design across extensive in silico Bayesian optimization and conditional sampling experiments."
Researcher Affiliation | Academia | Pascal Notin (Computer Science, University of Oxford); Ruben Weitzman (Computer Science, University of Oxford); Debora S. Marks (Harvard Medical School; Broad Institute); Yarin Gal (Computer Science, University of Oxford)
Pseudocode | Yes | Algorithm 1 (Iterative protein redesign):
1: Input: initial labeled data D_L; initial unlabeled data D_U; batch size B; batch set S = ∅; acquisition function α(x; λ)
2: for t = 1, 2, ..., 10 do
3:   Train model on D_L
4:   for b = 1, 2, ..., B do
5:     x_new = argmax_{x ∈ D_U} α(x; λ)
6:     Obtain label y_new for x_new
7:     S ← S ∪ {(x_new, y_new)}
8:     D_U ← D_U \ {x_new}
9:   end for
10:  D_L ← D_L ∪ S; S ← ∅
11: end for
12: Output: D_L; model trained on D_L
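For readers who prefer an executable form, below is a minimal Python sketch of Algorithm 1. The functions train_model, acquisition, and measure_fitness are hypothetical stand-ins for ProteinNPT training, the acquisition function α(x; λ), and the labeling oracle; they are not part of the paper's codebase.

import random

def train_model(labeled):
    # Placeholder "model": the mean of the observed labels.
    return sum(labeled.values()) / max(len(labeled), 1)

def acquisition(model, x):
    # Placeholder for alpha(x; lambda); in practice e.g. an uncertainty-aware
    # score such as UCB over the model's predictions.
    return random.random()

def measure_fitness(x):
    # Placeholder oracle; in reality a wet-lab assay or simulator.
    return random.random()

def iterative_redesign(labeled, unlabeled, batch_size=4, n_rounds=10):
    model = None
    for _ in range(n_rounds):                      # 10 redesign rounds
        model = train_model(labeled)               # train on current D_L
        batch = {}                                 # S <- empty set
        for _ in range(batch_size):                # select B candidates
            x_new = max(unlabeled, key=lambda x: acquisition(model, x))
            y_new = measure_fitness(x_new)         # obtain label for x_new
            batch[x_new] = y_new                   # S <- S ∪ {(x_new, y_new)}
            unlabeled.remove(x_new)                # D_U <- D_U \ {x_new}
        labeled.update(batch)                      # D_L <- D_L ∪ S; S reset
    return labeled, model

# Example: ten rounds of batch acquisition over a toy candidate pool.
pool = {f"seq_{i}" for i in range(100)}
labeled_data, model = iterative_redesign({"seq_wt": 1.0}, pool)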
Open Source Code | Yes | "The entire codebase to train and evaluate ProteinNPT and the various baselines considered in this work is available on our GitHub repository (https://github.com/OATML-Markslab/ProteinNPT) under the MIT license."
Open Datasets | Yes | "We conducted our experiments on ProteinGym [Notin et al., 2022a], which contains an extensive set of Deep Mutational Scanning (DMS) assays covering a wide range of functional properties (e.g., thermostability, ligand binding, viral replication, drug resistance)."
Dataset Splits | Yes | "We develop 3 distinct cross-validation schemes to assess the ability of each model to extrapolate to positions not encountered during training. In the Random scheme, commonly used in other supervised fitness prediction benchmarks [Rao et al., 2019, Dallago et al., 2022], each mutation is randomly allocated to one of five distinct folds. In the Contiguous scheme, the sequence is split into five contiguous segments along its length, with mutations assigned to each segment based on the position they occur in the sequence. Lastly, the Modulo scheme uses the modulo operator to assign mutated positions to each fold. For example, position 1 is assigned to fold 1, position 2 to fold 2, and so on, looping back to fold 1 at position 6. For each assay in ProteinGym and cross-validation scheme introduced in Section 4.1, we perform a 5-fold cross-validation, selecting the first 4 folds for training, and using the remaining one as test set."
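As a concrete illustration, here is a short Python sketch of the three fold-assignment rules described above; the indexing conventions (folds numbered from 0, positions from 1) are our assumption and may differ from the actual ProteinNPT implementation.

import random

def assign_fold(position, seq_len, scheme, n_folds=5):
    # Map a 1-indexed mutated position to a fold index in [0, n_folds).
    if scheme == "Random":
        # Each mutation is independently assigned a random fold.
        return random.randrange(n_folds)
    if scheme == "Contiguous":
        # Split the sequence into n_folds contiguous segments along its length.
        return min((position - 1) * n_folds // seq_len, n_folds - 1)
    if scheme == "Modulo":
        # Position 1 -> fold 0, ..., position 5 -> fold 4, position 6 -> fold 0.
        return (position - 1) % n_folds
    raise ValueError(f"Unknown scheme: {scheme}")

# Example: fold assignments for the first ten positions of a length-100 protein.
for scheme in ("Contiguous", "Modulo"):
    print(scheme, [assign_fold(p, 100, scheme) for p in range(1, 11)])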
Hardware Specification | Yes | "All experiments carried out in this work were conducted in PyTorch, on A100 GPUs with either 40GB or 80GB of GPU memory."
Software Dependencies | No | The paper mentions PyTorch as the framework used, but does not provide specific version numbers for PyTorch or any other key software dependencies.
Experiment Setup | Yes |
Table 4, ProteinNPT architecture details: ProteinNPT layers: 5; embedding dimension (d): 200; feedforward embedding dimension: 400; attention heads: 4; CNN kernel size (post embedding): 7; weight decay: 5e-3; dropout: 0.0.
Table 5, ProteinNPT training details: training steps: 10k; learning rate warmup steps: 100; peak learning rate: 3e-4; optimizer: AdamW; gradient clipping norm: 1.0; learning rate schedule: cosine; training batch size (masked): 64; training batch size (unmasked): 361.
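Collected in one place, the Table 4 and Table 5 settings translate into a configuration along the following lines; the key names below are illustrative and do not necessarily match the argument names used in the ProteinNPT codebase.

# Hyperparameters transcribed from Tables 4 and 5 (key names are assumptions).
proteinnpt_config = {
    # Architecture (Table 4)
    "num_layers": 5,                 # ProteinNPT layers
    "embed_dim": 200,                # embedding dimension d
    "ffn_embed_dim": 400,            # feedforward embedding dimension
    "num_attention_heads": 4,
    "conv_kernel_size": 7,           # CNN applied after the embedding
    "weight_decay": 5e-3,
    "dropout": 0.0,
    # Training (Table 5)
    "training_steps": 10_000,
    "lr_warmup_steps": 100,
    "peak_lr": 3e-4,
    "optimizer": "AdamW",
    "grad_clip_norm": 1.0,
    "lr_schedule": "cosine",
    "batch_size_masked": 64,
    "batch_size_unmasked": 361,
}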