MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training
Authors: Bo Chen, Zhilei Bei, Xingyi Cheng, Pan Li, Jie Tang, Le Song
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments confirm the efficacy of MSAGPT in generating faithful virtual MSA to enhance the structure prediction accuracy (up to +8.5% TM-Score on few-shot scenarios). |
| Researcher Affiliation | Collaboration | 1Tsinghua University, 2BioMap Research, 3MBZUAI |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The model is available at https://github.com/THUDM/MSAGPT. |
| Open Datasets | Yes | We utilize the Uniclust30 MSA dataset from Open Protein Set [44], which is processed through an all-against-all search on Uniclust30 [45] using HHblits [46]. This results in approximately 16 million MSAs (See Appendix A.1 for Details). |
| Dataset Splits | Yes | For each task, we sample 1000 protein sequences with the corresponding labels. Then we use MSAGPT-DPO to generate 32 virtual MSAs with the generation strategy T=0.8 and P=0.8. Both setups are trained briefly (for one epoch) for 5-fold cross-validation as shown in Table 9, and we report the average performance. |
| Hardware Specification | Yes | All models are trained on 24 A800 GPUs for 254k updates, consuming about 150 billion tokens. |
| Software Dependencies | No | The paper mentions software components like 'Flash Attention-v1 [42]' and 'AdamW [50]', and implies Python/PyTorch/CUDA usage through GPU training, but does not provide specific version numbers for the core software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Regarding the backbone of MSAGPT, we employ the standard transformer decoder framework [47, 49] and train the model with 2.8 billion parameters, comprising 36 layers, a 2560 embedding size, and 40 attention heads. We employ batches of 48 MSAs with each MSA containing 12,288 residues. We follow a BF16 mixed-precision pre-training strategy. We use AdamW [50] as our optimizer with β1 = 0.9, β2 = 0.95, eps = 10^-8 and a learning rate of 1.2 × 10^-4. We use a cosine learning rate schedule, with a warmup over the first 2.5% of steps, and decay the final learning rate down to 10% of the peak learning rate. We use a weight decay of 0.1 and gradient clipping of 1.0 without dropout. |
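
As a concrete reading of the "Experiment Setup" row, below is a minimal PyTorch sketch of the reported optimization recipe (AdamW with β1 = 0.9, β2 = 0.95, eps = 10^-8, peak learning rate 1.2 × 10^-4, cosine decay to 10% of the peak after a 2.5% warmup, weight decay 0.1, gradient clipping at 1.0). The stand-in model, variable names, and step bookkeeping are assumptions for illustration, not the authors' released training code.

```python
# Minimal sketch of the optimization setup described in the paper's experiment
# setup; only the hyperparameter values come from the quoted text, everything
# else (model stand-in, names) is a placeholder.
import math
import torch

TOTAL_STEPS = 254_000                       # "254k updates" reported in the paper
WARMUP_STEPS = int(0.025 * TOTAL_STEPS)     # warmup over the first 2.5% of steps
PEAK_LR = 1.2e-4
MIN_LR_RATIO = 0.1                          # decay down to 10% of the peak LR

model = torch.nn.Linear(2560, 2560)         # stand-in for the 2.8B-parameter decoder

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=PEAK_LR,
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)

def lr_lambda(step: int) -> float:
    """Linear warmup, then cosine decay to 10% of the peak learning rate."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR_RATIO + (1.0 - MIN_LR_RATIO) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, gradients are clipped at norm 1.0 as reported:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```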
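
Similarly, the "Dataset Splits" row describes sampling 1000 labelled proteins, generating 32 virtual MSAs per protein with MSAGPT-DPO (T = 0.8, P = 0.8), and reporting the average over one-epoch 5-fold cross-validation. The following is a minimal sketch of that split protocol, assuming scikit-learn; `generate_virtual_msas`, the protein pool, and the stubbed train/evaluate steps are hypothetical placeholders, not the released MSAGPT API.

```python
# Sketch of the 5-fold cross-validation protocol from the "Dataset Splits" row.
# Generation and scoring are stubbed; only the sample size, fold count, and
# sampling parameters (32 MSAs, T=0.8, P=0.8) come from the reported setup.
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
protein_pool = np.arange(5_000)                               # hypothetical labelled pool
sampled = rng.choice(protein_pool, size=1_000, replace=False) # "sample 1000 protein sequences"

def generate_virtual_msas(protein_id, n=32, temperature=0.8, top_p=0.8):
    """Placeholder for MSAGPT-DPO generation with T=0.8 and P=0.8."""
    return [f"virtual_msa_{protein_id}_{i}" for i in range(n)]

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kfold.split(sampled):
    train_ids, test_ids = sampled[train_idx], sampled[test_idx]
    # Briefly train (one epoch) on train_ids with their virtual MSAs, then
    # evaluate on test_ids; both steps are stubbed out in this sketch.
    fold_scores.append(0.0)

print("average performance over 5 folds:", np.mean(fold_scores))
```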