Kernel-Based Evaluation of Conditional Biological Sequence Models

Authors: Pierre Glaser, Steffanie Paul, Alissa M Hummer, Charlotte Deane, Debora Susan Marks, Alan Nawzad Amin

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We now investigate the behavior and utility of the ACMMD and ACMMD Rel metrics and tests in practice. We start with a synthetic example showing that ACMMD is a natural measure of model distance. We then perform an extended analysis of a state-of-the-art inverse folding model, Protein MPNN. We show that ACMMD can detect small perturbations in the model, and that it can be used to tune its temperature parameter. Finally, we analyze the absolute performance of Protein MPNN.
Researcher Affiliation | Academia | 1 Gatsby Computational Neuroscience Unit, London, UK; 2 Systems Biology, Harvard Medical School, Boston, USA; 3 Department of Statistics, University of Oxford, Oxford, UK; 4 Harvard Medical School, Broad Institute, Boston, USA; 5 Courant Institute, New York University, New York, USA.
Pseudocode | Yes | Algorithm 1: ACMMD Conditional Goodness-of-Fit Test; Algorithm 2: Estimating ACMMD Rel. (A minimal sketch of the kind of kernel statistic involved appears after the table.)
Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository.
Open Datasets | Yes | We leveraged the CATH taxonomy to select a set of diverse (in sequence and structural topologies) protein structures to perform our ACMMD test on. CATH is a taxonomy of protein structures that categorizes proteins according to a hierarchy of structural organization (Sillitoe et al., 2021). We used the S60 redundancy-filtered set, which includes proteins that are at least 60% different in sequence identity from each other.
Dataset Splits | No | After training, we are interested in quantifying how accurately Q(· | x) approximates P(· | x) on average across all values of x, using a held-out set of samples {(X_i, Y_i)}_{i=1}^N ~ P(X, Y). For the experiments, the paper states that it "performed bootstrap sampling to produce dataset sizes ranging from 100 to 1000" and used "samples of 5000 proteins across all families in the dataset." No specific percentages or absolute counts for training, validation, or test splits are provided in a reproducible manner. (A minimal sketch of this bootstrap resampling appears after the table.)
Hardware Specification | No | The paper does not specify any hardware details such as GPU/CPU models, memory, or cloud instance types used for running the experiments.
Software Dependencies | No | The paper mentions using "Protein MPNN" and pre-trained neural networks such as "GearNet (Zhang et al., 2023) for structures, and ESM-2 (Lin et al., 2023) for sequences." However, it does not provide specific version numbers for these or any other software components used in the experiments.
Experiment Setup | Yes | The sampling temperature T of Protein MPNN can also be varied, letting the user control the trade-off between accuracy and diversity of the generated sequences. We performed an estimation of ACMMD² for a ground-truth temperature T = 0.1 (the default in the Protein MPNN documentation) and δT ∈ {0, 0.01, 0.05, 0.1}. (A minimal sketch of this temperature sweep appears after the table.)
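
On the Pseudocode row: Algorithm 1 is a kernel-based conditional goodness-of-fit test built on ACMMD. The sketch below is a minimal Python illustration of a U-statistic of this general kind, comparing observed pairs (X_i, Y_i) against model-sampled pairs (X_i, Z_i) under a product of Gaussian kernels on precomputed embeddings. The function names, kernel choices, and bandwidths are assumptions for illustration and do not reproduce the paper's exact estimator or its test calibration.

```python
# Minimal sketch: a U-statistic for a kernel conditional goodness-of-fit score.
# Embeddings (e.g. structure and sequence features) are assumed to be
# precomputed; names and kernel choices are illustrative, not the paper's
# exact ACMMD estimator.
import numpy as np

def gaussian_gram(a: np.ndarray, b: np.ndarray, bandwidth: float) -> np.ndarray:
    """Gaussian kernel Gram matrix between rows of a and b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def conditional_mmd_sq(x_emb, y_obs_emb, y_model_emb, bw_x=1.0, bw_y=1.0):
    """U-statistic estimate of a squared conditional-MMD-style discrepancy.

    x_emb:       (n, d_x) embeddings of the conditions (e.g. structures)
    y_obs_emb:   (n, d_y) embeddings of observed sequences Y_i ~ P(.|X_i)
    y_model_emb: (n, d_y) embeddings of model samples   Z_i ~ Q(.|X_i)
    """
    n = x_emb.shape[0]
    kx = gaussian_gram(x_emb, x_emb, bw_x)
    kyy = gaussian_gram(y_obs_emb, y_obs_emb, bw_y)
    kyz = gaussian_gram(y_obs_emb, y_model_emb, bw_y)
    kzz = gaussian_gram(y_model_emb, y_model_emb, bw_y)
    # h[i, j]: contribution of the pair (i, j) to the squared discrepancy
    h = kx * (kyy - kyz - kyz.T + kzz)
    np.fill_diagonal(h, 0.0)  # drop i == j terms (U-statistic)
    return h.sum() / (n * (n - 1))
```

Turning such a statistic into a hypothesis test additionally requires calibrating its null distribution (e.g. by a bootstrap over the summands), which is what the paper's Algorithm 1 specifies and this sketch omits.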
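On the Dataset Splits row: the quoted description mentions bootstrap sampling over roughly 5,000 held-out proteins to produce evaluation sets of 100 to 1,000 samples. A minimal sketch of such a resampling loop is below; the number of resamples per size and the seed are assumptions, since they are not stated in the quoted text.

```python
# Minimal sketch of the bootstrap resampling described in the Dataset Splits row.
import numpy as np

rng = np.random.default_rng(0)        # seed is an assumption
pool_size = 5000                      # held-out proteins across CATH families
sizes = range(100, 1001, 100)         # evaluation-set sizes 100, 200, ..., 1000
n_bootstrap = 50                      # resamples per size (assumption)

# For each evaluation-set size, draw several bootstrap index sets
# (sampling with replacement) from the held-out pool.
bootstrap_indices = {
    n: [rng.choice(pool_size, size=n, replace=True) for _ in range(n_bootstrap)]
    for n in sizes
}
```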
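On the Experiment Setup row: the temperature experiment compares a reference model at T = 0.1 against perturbed copies at T + δT for δT ∈ {0, 0.01, 0.05, 0.1}. A minimal sketch of that sweep follows, where `sample_sequences` and `estimate_acmmd_sq` are hypothetical placeholders standing in for the actual Protein MPNN sampling and ACMMD² estimation code.

```python
# Minimal sketch of the temperature-perturbation sweep described above.
# `sample_sequences` and `estimate_acmmd_sq` are hypothetical placeholders,
# not the paper's implementation or the Protein MPNN API.
T_REF = 0.1                        # reference ("ground truth") sampling temperature
DELTA_TS = [0.0, 0.01, 0.05, 0.1]  # perturbations applied to the temperature

def sweep_temperature_perturbations(structures, sample_sequences, estimate_acmmd_sq):
    """Estimate a discrepancy between the reference model and perturbed copies."""
    reference = sample_sequences(structures, temperature=T_REF)
    results = {}
    for d_t in DELTA_TS:
        perturbed = sample_sequences(structures, temperature=T_REF + d_t)
        results[d_t] = estimate_acmmd_sq(structures, reference, perturbed)
    return results
```

With δT = 0 the perturbed model coincides with the reference, so the estimated discrepancy should fluctuate around zero; values growing with δT would indicate that the metric detects the perturbation, consistent with the claim quoted in the Research Type row.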