Kermut: Composite kernel regression for protein variant effects
Authors: Peter Mørch Groth, Mads Herbert Kerrn, Lars Olsen, Jesper Salomon, Wouter Boomsma
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate this model on the comprehensive ProteinGym substitution benchmark and show that it is able to reach state-of-the-art performance in supervised protein variant effect prediction, outperforming recently proposed deep learning methods in this domain; We provide a thorough calibration analysis and show that while Kermut provides well-calibrated uncertainties overall, the calibratedness of instance-specific uncertainties remains challenging; We demonstrate that our model can be trained and evaluated orders of magnitude faster and with better out-of-the-box calibration than competing methods. |
| Researcher Affiliation | Collaboration | Peter Mørch Groth (University of Copenhagen, Novonesis); Mads Herbert Kerrn (University of Copenhagen); Lars Olsen (Novonesis); Jesper Salomon (Novonesis); Wouter Boomsma (University of Copenhagen) |
| Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The codebase is publicly available at https://github.com/petergroth/kermut under the open source MIT License. |
| Open Datasets | Yes | We evaluate Kermut on the 217 substitution DMS assays from the ProteinGym benchmark [7].; All data and evaluation software are accessed via the ProteinGym [7] repository at https://github.com/OATML-Markslab/ProteinGym, which is under the MIT License. |
| Dataset Splits | Yes | The overall benchmark results are an aggregate of three different cross-validation schemes: In the random scheme, variants are assigned to one of five folds randomly. In the modulo scheme, every fifth position along the protein backbone is assigned to the same fold, and in the contiguous scheme, the protein is split into five equal-sized segments along its length, each constituting a fold. For all three schemes, models are trained on four combined partitions and tested on the fifth for a total of five runs per assay, per scheme.; The results for each dataset are obtained via five-fold cross validation, corresponding to five separately trained models for each split scheme. (A minimal sketch of the three split schemes appears below the table.) |
| Hardware Specification | Yes | All experiments are performed on a Linux-based cluster running Ubuntu 20.04.4 LTS, with an AMD EPYC 7642 48-Core Processor with 192 threads and 1TB RAM. NVIDIA A40s were used for GPU acceleration both for fitting the Gaussian processes and for generating the protein embeddings. |
| Software Dependencies | No | The paper mentions that the kernel is built using the GPyTorch framework but does not specify its version number. Other software, such as the operating system, is mentioned without specific versions for the relevant dependencies. |
| Experiment Setup | Yes | We assume a homoscedastic Gaussian noise model, on which we place a Half-Cauchy prior [80] with scale 0.1. We fit the hyperparameters by maximizing the exact marginal likelihood with gradient descent using the AdamW optimizer [81] with learning rate 0.1 for 150 steps, which proved to be sufficient for convergence for a number of sampled datasets. (A GPyTorch sketch of this fitting procedure appears below the table.) |
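
The three cross-validation schemes from the Dataset Splits row are straightforward to reproduce. Below is a minimal sketch, assuming zero-indexed residue positions and five folds; the function name `assign_folds` and the segment arithmetic are our own illustration, not code from the Kermut repository.

```python
# Hypothetical reimplementation of the three split schemes described in
# the Dataset Splits row (random / modulo / contiguous); names are ours.
import numpy as np

def assign_folds(positions: np.ndarray, scheme: str, n_folds: int = 5,
                 seed: int = 0) -> np.ndarray:
    """Map each (zero-indexed) mutated position to a fold in [0, n_folds)."""
    if scheme == "random":
        # Variants are assigned to one of five folds uniformly at random.
        rng = np.random.default_rng(seed)
        return rng.integers(0, n_folds, size=len(positions))
    if scheme == "modulo":
        # Every fifth position along the backbone lands in the same fold.
        return positions % n_folds
    if scheme == "contiguous":
        # Split the protein into five equal-sized segments along its length.
        seq_len = positions.max() + 1
        return np.minimum(positions * n_folds // seq_len, n_folds - 1)
    raise ValueError(f"unknown scheme: {scheme!r}")

# Example: fold sizes for a 100-residue protein under each scheme.
positions = np.arange(100)
for scheme in ("random", "modulo", "contiguous"):
    print(scheme, np.bincount(assign_folds(positions, scheme), minlength=5))
```

Training on four combined folds and testing on the fifth, rotated over all five folds, then yields the five runs per assay per scheme described above.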
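The Experiment Setup row maps onto a standard GPyTorch exact-GP training loop. Here is a minimal sketch under that assumption: the RBF kernel and the random inputs and targets are placeholders for Kermut's composite kernel and protein embeddings, while the Half-Cauchy noise prior (scale 0.1), AdamW at learning rate 0.1, and 150 optimization steps follow the quoted setup.

```python
# Sketch of the described fitting procedure with GPyTorch: exact GP
# regression, homoscedastic Gaussian noise with a Half-Cauchy(0.1) prior,
# hyperparameters fit by maximizing the exact marginal likelihood.
import torch
import gpytorch

class SketchGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # Placeholder kernel; Kermut uses a composite kernel instead.
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

# Placeholder data standing in for embeddings and assay scores.
train_x = torch.randn(64, 16)
train_y = torch.randn(64)

# Homoscedastic Gaussian noise with a Half-Cauchy prior of scale 0.1.
likelihood = gpytorch.likelihoods.GaussianLikelihood(
    noise_prior=gpytorch.priors.HalfCauchyPrior(scale=0.1))
model = SketchGP(train_x, train_y, likelihood)

model.train()
likelihood.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

# 150 steps of gradient descent on the negative exact marginal likelihood.
for _ in range(150):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()
```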