Generating Highly Designable Proteins with Geometric Algebra Flow Matching
Authors: Simon Wagner, Leif Seute, Vsevolod Viliuga, Nicolas Wolf, Frauke Gräter, Jan Stühmer
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our architecture by incorporating it into the framework of FrameFlow, a state-of-the-art flow matching model for protein backbone generation. The proposed model achieves high designability, diversity and novelty, while also sampling protein backbones that follow the statistical distribution of secondary structure elements found in naturally occurring proteins, a property so far only insufficiently achieved by many state-of-the-art generative models. |
| Researcher Affiliation | Academia | 1Heidelberg Institute for Theoretical Studies, Heidelberg, Germany 2IWR, Heidelberg University, Heidelberg, Germany 3Max Planck Institute for Polymer Research, Mainz, Germany 4SciLifeLab and DBB at Stockholm University, Stockholm, Sweden 5IAR, Karlsruhe Institute of Technology, Karlsruhe, Germany |
| Pseudocode | Yes | Algorithm 1 Invariant point attention (IPA) [32] ... Algorithm 2 Original backbone update [32] ... Algorithm 3 Clifford frame attention ... Algorithm 4 Geometric Bilinear [13] ... Algorithm 5 Geometric Many Body Product ... Algorithm 6 Backbone update |
| Open Source Code | Yes | Source code and trained model weights are available at https://github.com/hits-mli/gafl |
| Open Datasets | Yes | We train GAFL on a subset of the Protein Data Bank (PDB) dataset [9] comprised of monomeric protein structures with up to 512 residues and perform extensive ablations on the smaller, curated SCOPe dataset [27, 18] filtered by proteins with length of up to 128 residues (SCOPe-128) as in [67, 37]. |
| Dataset Splits | No | The paper mentions training on PDB and SCOPe datasets but does not explicitly provide percentages or counts for training, validation, and test splits within the main text. The NeurIPS checklist mentions 'standard data set split' but this is not sufficiently explicit within the paper's main content. |
| Hardware Specification | Yes | We train GAFL for 15 days on two NVIDIA A100-80GB GPUs on the dataset used in FrameDiff... All models are trained with three different random seeds for 6500 epochs on one NVIDIA A100-80GB GPU, which takes around 6 days... |
| Software Dependencies | No | Our implementation is based on the implementation of FrameFlow [67, 69]. We will publish our code together with the camera ready version of this manuscript. The implementation is in the Python [59] programming language and uses the PyTorch framework [45] and further dependencies of FrameFlow: Numpy [30], Hydra [66], and SciPy [61]. |
| Experiment Setup | Yes | The learning rate is increased in 50 warmup steps to 0.0002 and then kept constant for 3500 epochs. From there we use a cosine-annealing schedule to decrease the learning rate to 0.0001 at epoch 5000. From epoch Ntrain = 5000 to Ntrain + Nselect = 5200 we employ our checkpointing criterion described in Section 4.1, evaluating secondary structures and storing checkpoints every second epoch. We keep the k = 30 best checkpoints and filter them for checkpoints with a secondary structure content deviation of less than dmax = 0.2. |
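The learning-rate schedule quoted above (linear warmup, constant plateau, cosine annealing) can be sketched as a standalone function. The values come from the table; the function name, the epoch-based parameterization, and the exact warmup handling are our own illustrative assumptions, not the authors' implementation (which builds on PyTorch schedulers in the FrameFlow codebase).

```python
import math

def learning_rate(epoch: int, step: int = 0) -> float:
    """Hypothetical sketch of the schedule described in the paper excerpt:
    50-step linear warmup to 2e-4, constant until epoch 3500, then cosine
    annealing down to 1e-4 at epoch 5000, constant thereafter."""
    lr_max, lr_min = 2e-4, 1e-4
    warmup_steps = 50
    if epoch == 0 and step < warmup_steps:
        # linear warmup over the first 50 optimizer steps (assumed linear)
        return lr_max * (step + 1) / warmup_steps
    if epoch <= 3500:
        # held constant at the peak value
        return lr_max
    if epoch <= 5000:
        # cosine annealing from lr_max at epoch 3500 to lr_min at epoch 5000
        t = (epoch - 3500) / (5000 - 3500)
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
    # after annealing, stay at the final learning rate
    return lr_min
```

From epoch 5000 onward, the paper's checkpoint-selection criterion (keep the k = 30 best checkpoints, filtered by secondary structure deviation) operates on top of this constant final rate.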