Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
MSA Generation with Seqs2Seqs Pretraining: Advancing Protein Structure Predictions
Authors: Le Zhang, Jiayang Chen, Tao Shen, Yu Li, Siqi Sun
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on CASP14 and CASP15 benchmarks reveal significant improvements in LDDT scores, particularly for complex and challenging sequences, enhancing the performance of both AlphaFold2 and RoseTTAFold. |
| Researcher Affiliation | Collaboration | Le Zhang1,3, Jiayang Chen4, Tao Shen5, Yu Li4, Siqi Sun1,2; 1 Fudan University; 2 Shanghai Artificial Intelligence Laboratory; 3 Mila, Université de Montréal; 4 The Chinese University of Hong Kong; 5 Zelixir Biotech |
| Pseudocode | No | The paper describes the architecture and process but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is released at https://github.com/lezhang7/MSAGen. |
| Open Datasets | Yes | We employ CASP14/15 as our test set, a prestigious dataset that encompasses proteins from a broad spectrum of biological families. ... This process was iterated until no additional sequences emerged, searching parameters are detailed in appendix C. For every batch of sequences retrieved, a random selection was made, designating query with some as the source X and the remainder as the target Y, as illustrated in fig. 2. Notably, the assurance of co-evolutionary relationships is intrinsically facilitated by the search algorithm's mechanism. |
| Dataset Splits | No | The paper mentions CASP14/15 as a test set and a pretraining dataset, but does not explicitly detail train/validation/test splits for the pretraining dataset or validation splits for the evaluation datasets. |
| Hardware Specification | Yes | It's pretrained with ADAM-W at a 5e-5 rate, 0.01 linear warm-up, and square root decay for 200k steps on 8 A100 GPUs, batch size of 64, using a dataset containing 2M MSAs constructed as described in section 3.1. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as deep learning frameworks or libraries in the main text or appendices relevant to the research content. |
| Experiment Setup | Yes | Pretrained MSA-Generator adopts 12 transformer encoders/decoders with 260M parameters, 768 embedding size, and 12 heads. It's pretrained with ADAM-W at a 5e-5 rate, 0.01 linear warm-up, and square root decay for 200k steps on 8 A100 GPUs, batch size of 64, using a dataset containing 2M MSAs constructed as described in section 3.1. |
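
The learning-rate schedule quoted under Experiment Setup can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that "0.01 linear warm-up" means a warm-up over 1% of the 200k steps, and that "square root decay" means inverse square-root decay from the peak rate. The names `PEAK_LR`, `WARMUP_STEPS`, and `learning_rate` are hypothetical and do not come from the paper or its repository.

```python
import math

# Values quoted in the Experiment Setup row.
PEAK_LR = 5e-5
TOTAL_STEPS = 200_000

# Assumption: "0.01 linear warm-up" is a warm-up ratio, i.e. 1% of all steps.
WARMUP_RATIO = 0.01
WARMUP_STEPS = round(TOTAL_STEPS * WARMUP_RATIO)


def learning_rate(step: int) -> float:
    """Return the learning rate at a zero-based training step.

    Assumption: linear warm-up to PEAK_LR, then inverse square-root decay.
    The paper excerpt does not spell out the exact formula.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < WARMUP_STEPS:
        # Linear ramp that reaches PEAK_LR on the last warm-up step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Decay proportional to 1 / sqrt(step), continuous with the warm-up peak.
    return PEAK_LR * math.sqrt(WARMUP_STEPS / (step + 1))


if __name__ == "__main__":
    for s in (0, 999, 1_999, 20_000, 199_999):
        print(f"step {s:>7}: lr = {learning_rate(s):.3e}")
```

Under these assumptions the rate peaks at the end of warm-up and falls to one tenth of the peak by the final step, since the decay factor is the square root of the ratio between warm-up steps and total steps.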