MSA Generation with Seqs2Seqs Pretraining: Advancing Protein Structure Predictions

Authors: Le Zhang, Jiayang Chen, Tao Shen, Yu Li, Siqi Sun

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on CASP14 and CASP15 benchmarks reveal significant improvements in lDDT scores, particularly for complex and challenging sequences, enhancing the performance of both AlphaFold2 and RoseTTAFold. (A sketch of the lDDT metric appears after the table.)
Researcher Affiliation | Collaboration | Le Zhang (1,3), Jiayang Chen (4), Tao Shen (5), Yu Li (4), Siqi Sun (1,2); 1: Fudan University, 2: Shanghai Artificial Intelligence Laboratory, 3: Mila, Université de Montréal, 4: The Chinese University of Hong Kong, 5: Zelixir Biotech
Pseudocode | No | The paper describes the architecture and process but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is released at https://github.com/lezhang7/MSAGen.
Open Datasets | Yes | We employ CASP14/15 as our test set, a prestigious dataset that encompasses proteins from a broad spectrum of biological families. ... This process was iterated until no additional sequences emerged; searching parameters are detailed in appendix C. For every batch of sequences retrieved, a random selection was made, designating the query together with some sequences as the source X and the remainder as the target Y, as illustrated in fig. 2. Notably, the assurance of co-evolutionary relationships is intrinsically facilitated by the search algorithm's mechanism. (A sketch of this source/target split appears after the table.)
Dataset Splits | No | The paper mentions CASP14/15 as a test set and a pretraining dataset, but does not explicitly detail train/validation/test splits for the pretraining dataset or validation splits for the evaluation datasets.
Hardware Specification | Yes | It is pretrained with AdamW at a 5e-5 rate, 0.01 linear warm-up, and square-root decay for 200k steps on 8 A100 GPUs, with a batch size of 64, using a dataset containing 2M MSAs constructed as described in section 3.1. (A sketch of this warm-up and decay schedule appears after the table.)
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as deep learning frameworks or libraries, either in the main text or in the relevant appendices.
Experiment Setup | Yes | The pretrained MSA-Generator adopts 12 transformer encoders/decoders with 260M parameters, a 768 embedding size, and 12 heads. It is pretrained with AdamW at a 5e-5 rate, 0.01 linear warm-up, and square-root decay for 200k steps on 8 A100 GPUs, with a batch size of 64, using a dataset containing 2M MSAs constructed as described in section 3.1. (A model configuration sketch appears after the table.)
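
The lDDT scores cited under Research Type measure how well local inter-residue distances are preserved between a predicted and a reference structure. Below is a minimal Calpha-only sketch of the metric; the 15 Å inclusion cutoff and the 0.5/1/2/4 Å tolerances follow the standard lDDT definition, but this is an illustrative approximation, not the official CASP scoring code.

import numpy as np

def lddt_ca(pred: np.ndarray, ref: np.ndarray,
            cutoff: float = 15.0,
            tolerances=(0.5, 1.0, 2.0, 4.0)) -> float:
    """pred, ref: (L, 3) Calpha coordinates of the same sequence."""
    # Pairwise distance matrices for the prediction and the reference.
    d_pred = np.linalg.norm(pred[:, None, :] - pred[None, :, :], axis=-1)
    d_ref = np.linalg.norm(ref[:, None, :] - ref[None, :, :], axis=-1)

    # Score only pairs of distinct residues that are local in the reference.
    n = ref.shape[0]
    mask = (d_ref < cutoff) & ~np.eye(n, dtype=bool)

    diff = np.abs(d_pred - d_ref)[mask]
    # Fraction of preserved distances, averaged over the four tolerances.
    return float(np.mean([(diff < t).mean() for t in tolerances]))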
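
The Open Datasets row describes how each retrieved MSA is partitioned into a source X (the query plus some hits) and a target Y (the remaining hits). A minimal sketch of that split follows; the function name, the 50/50 ratio, and the seeding are illustrative assumptions, since the paper only states that a random selection is made.

import random
from typing import List, Optional, Tuple

def split_msa(query: str, hits: List[str],
              source_ratio: float = 0.5,
              seed: Optional[int] = None) -> Tuple[List[str], List[str]]:
    """Randomly split a searched MSA into source X and target Y sequences."""
    rng = random.Random(seed)
    shuffled = list(hits)
    rng.shuffle(shuffled)

    k = int(len(shuffled) * source_ratio)
    source_x = [query] + shuffled[:k]  # the query always stays on the source side
    target_y = shuffled[k:]            # aligned sequences the model learns to generate
    return source_x, target_y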
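
The Hardware Specification and Experiment Setup rows both quote the pretraining recipe: AdamW at a peak rate of 5e-5 with linear warm-up and square-root decay over 200k steps. The sketch below reads "0.01 linear warm-up" as warming up over the first 1% of the total steps; that interpretation, and the PyTorch pairing, are assumptions.

import torch

def make_optimizer_and_scheduler(model, peak_lr=5e-5, total_steps=200_000,
                                 warmup_frac=0.01):
    """AdamW with linear warm-up followed by inverse square-root decay."""
    warmup_steps = int(total_steps * warmup_frac)
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)       # linear warm-up to peak_lr
        return (warmup_steps / max(1, step)) ** 0.5  # inverse square-root decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler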
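
The Experiment Setup row gives the reported model dimensions: 12 encoder and 12 decoder layers, a 768 embedding size, 12 attention heads, and roughly 260M parameters. A minimal sketch using PyTorch's generic nn.Transformer is below; the feed-forward width and the pieces needed to reach 260M parameters (vocabulary embeddings, output head) are assumptions, and the released MSAGen repository is the authoritative definition.

import torch.nn as nn

# Encoder/decoder stack with the reported dimensions (sketch only).
encoder_decoder = nn.Transformer(
    d_model=768,             # embedding size reported in the paper
    nhead=12,                # attention heads
    num_encoder_layers=12,
    num_decoder_layers=12,
    dim_feedforward=3072,    # assumed 4 x d_model, not stated in the quote
    batch_first=True,
)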