MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training

Authors: Bo Chen, Zhilei Bei, Xingyi Cheng, Pan Li, Jie Tang, Le Song

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments confirm the efficacy of MSAGPT in generating faithful virtual MSA to enhance the structure prediction accuracy (up to +8.5% TM-Score on few-shot scenarios).
Researcher Affiliation | Collaboration | Tsinghua University, BioMap Research, MBZUAI
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The model is available at https://github.com/THUDM/MSAGPT.
Open Datasets | Yes | We utilize the Uniclust30 MSA dataset from Open Protein Set [44], which is processed through an all-against-all search on Uniclust30 [45] using HHblits [46]. This results in approximately 16 million MSAs (see Appendix A.1 for details). A minimal A3M-parsing sketch is given after the table.
Dataset Splits | Yes | For each task, we sample 1000 protein sequences with the corresponding labels. We then use MSAGPT-DPO to generate 32 virtual MSAs with the generation strategy T=0.8 and P=0.8. Both setups are trained briefly (for one epoch) under 5-fold cross-validation, as shown in Table 9, and we report the average performance. A cross-validation sketch is given after the table.
Hardware Specification | Yes | All models are trained on 24 A800 GPUs for 254k updates, consuming about 150 billion tokens.
Software Dependencies | No | The paper mentions software components such as FlashAttention-v1 [42] and AdamW [50], and implies Python/PyTorch/CUDA usage through its GPU training setup, but it does not give version numbers for core dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For the backbone of MSAGPT, we employ the standard transformer decoder framework [47, 49] and train a 2.8-billion-parameter model with 36 layers, an embedding size of 2560, and 40 attention heads. We use batches of 48 MSAs, each containing 12,288 residues, and follow a BF16 mixed-precision pre-training strategy. We use AdamW [50] as our optimizer with β1 = 0.9, β2 = 0.95, eps = 1e-8, and a learning rate of 1.2e-4. We use a cosine learning rate schedule with a warmup over the first 2.5% of steps, decaying the final learning rate to 10% of the peak learning rate. We use a weight decay of 0.1 and gradient clipping of 1.0, without dropout. An optimizer and schedule sketch is given after the table.