Generating Highly Designable Proteins with Geometric Algebra Flow Matching

Authors: Simon Wagner, Leif Seute, Vsevolod Viliuga, Nicolas Wolf, Frauke Gräter, Jan Stühmer

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our architecture by incorporating it into the framework of FrameFlow, a state-of-the-art flow matching model for protein backbone generation. The proposed model achieves high designability, diversity and novelty, while also sampling protein backbones that follow the statistical distribution of secondary structure elements found in naturally occurring proteins, a property so far only insufficiently achieved by many state-of-the-art generative models.
Researcher Affiliation | Academia | 1 Heidelberg Institute for Theoretical Studies, Heidelberg, Germany; 2 IWR, Heidelberg University, Heidelberg, Germany; 3 Max Planck Institute for Polymer Research, Mainz, Germany; 4 SciLifeLab and DBB at Stockholm University, Stockholm, Sweden; 5 IAR, Karlsruhe Institute of Technology, Karlsruhe, Germany
Pseudocode | Yes | Algorithm 1: Invariant point attention (IPA) [32]; Algorithm 2: Original backbone update [32]; Algorithm 3: Clifford frame attention; Algorithm 4: Geometric Bilinear [13]; Algorithm 5: Geometric Many Body Product; Algorithm 6: Backbone update
Open Source Code | Yes | Source code and trained model weights are available at https://github.com/hits-mli/gafl
Open Datasets | Yes | We train GAFL on a subset of the Protein Data Bank (PDB) dataset [9] comprising monomeric protein structures with up to 512 residues and perform extensive ablations on the smaller, curated SCOPe dataset [27, 18] filtered by proteins with length of up to 128 residues (SCOPe-128) as in [67, 37].
Dataset Splits | No | The paper mentions training on the PDB and SCOPe datasets but does not explicitly give percentages or counts for training, validation, and test splits. The NeurIPS checklist mentions a 'standard data set split', but this is not stated explicitly in the paper's main content.
Hardware Specification | Yes | We train GAFL for 15 days on two NVIDIA A100-80GB GPUs on the dataset used in FrameDiff... All models are trained with three different random seeds for 6500 epochs on one NVIDIA A100-80GB GPU, which takes around 6 days...
Software Dependencies | No | Our implementation is based on the implementation of FrameFlow [67, 69]. We will publish our code together with the camera-ready version of this manuscript. The implementation is in the Python [59] programming language and uses the PyTorch framework [45] and further dependencies of FrameFlow: NumPy [30], Hydra [66], and SciPy [61].
Experiment Setup | Yes | The learning rate is increased in 50 warmup steps to 0.0002 and then kept constant for 3500 epochs. From there we use a cosine-annealing schedule to decrease the learning rate to 0.0001 at epoch 5000. From epoch Ntrain = 5000 to Ntrain + Nselect = 5200 we employ our checkpointing criterion described in Section 4.1, evaluating secondary structures and storing checkpoints every second epoch. We keep the k = 30 best checkpoints and filter them for checkpoints with a secondary structure content deviation of less than dmax = 0.2.
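The learning-rate schedule quoted above (linear warmup to 0.0002 in 50 steps, a constant phase until epoch 3500, then cosine annealing to 0.0001 at epoch 5000) can be sketched as a small standalone function. This is a minimal illustration of the schedule as described, not the authors' code: the function and constant names are invented here, and the choice of per-step warmup followed by per-epoch annealing is an assumption.

```python
import math

# Constants taken from the schedule described in the paper's setup;
# all identifiers below are illustrative, not from the GAFL codebase.
WARMUP_STEPS = 50      # linear warmup length in optimizer steps
LR_PEAK = 2e-4         # learning rate after warmup
LR_FINAL = 1e-4        # learning rate reached at ANNEAL_END
ANNEAL_START = 3500    # epoch at which cosine annealing begins
ANNEAL_END = 5000      # epoch at which LR_FINAL is reached

def learning_rate(step: int, epoch: int) -> float:
    """Return the learning rate for a given optimizer step and epoch."""
    if step < WARMUP_STEPS:
        # Linear warmup from 0 toward the peak learning rate.
        return LR_PEAK * (step + 1) / WARMUP_STEPS
    if epoch < ANNEAL_START:
        # Constant phase.
        return LR_PEAK
    if epoch >= ANNEAL_END:
        # Held at the final value after annealing finishes.
        return LR_FINAL
    # Cosine annealing from LR_PEAK to LR_FINAL over [3500, 5000].
    progress = (epoch - ANNEAL_START) / (ANNEAL_END - ANNEAL_START)
    return LR_FINAL + 0.5 * (LR_PEAK - LR_FINAL) * (1.0 + math.cos(math.pi * progress))

print(learning_rate(1000, 2000))   # constant phase: 2e-4
print(learning_rate(10**6, 4250))  # halfway through annealing: 1.5e-4
print(learning_rate(10**6, 6000))  # after annealing: 1e-4
```

In a PyTorch training loop, the same piecewise schedule could be wired in via `torch.optim.lr_scheduler.LambdaLR` with a multiplicative factor relative to the peak rate.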