Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Equi-mRNA: Protein Translation Equivariant Encoding for mRNA Language Models
Authors: Mehdi Yazdani-Jahromi, Ali Khodabandeh Yalabadi, Ozlem Ozmen Garibay
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On downstream property-prediction tasks including expression, stability, and riboswitch switching, Equi-mRNA delivers up to 10% improvements in accuracy. In sequence generation, it produces mRNA constructs that are up to 4× more realistic under Fréchet BioDistance metrics and 28% better preserve functional properties compared to vanilla baseline. |
| Researcher Affiliation | Academia | Mehdi Yazdani-Jahromi Department of Computer Science University of Central Florida Orlando, FL 32816 EMAIL Ali Khodabandeh Yalabadi Department of Industrial Engineering University of Central Florida Orlando, FL 32816 EMAIL Ozlem Ozmen Garibay Department of Computer Science and Industrial Engineering University of Central Florida Orlando, FL 32816 EMAIL |
| Pseudocode | No | The paper provides detailed mathematical formulations and descriptions of the methodology, but does not include any explicitly labeled pseudocode or algorithm blocks with structured, code-like steps. |
| Open Source Code | Yes | The data are publicly available. We can provide the link to code at any time. For the review version, we did not include the link to keep it as an anonymous review. The supplementary Zip file including code has been attached. |
| Open Datasets | Yes | We curate and release a unified coding-region corpus of 25M protein-coding sequences plus a stratified 1M sequence subset to standardize benchmarking across studies. Pretraining Corpus: We constructed a large-scale pretraining corpus by drawing 25 million annotated protein-coding sequences from 56 million RefSeq entries... |
| Dataset Splits | Yes | All datasets are split consistently into training, validation, and test subsets at ratios of 70%, 15%, and 15%, respectively, for model training and evaluation. |
| Hardware Specification | Yes | Pretraining was conducted on thirty-two NVIDIA H100 GPUs (ablation used eight NVIDIA H200 GPUs); runtimes and resource utilization are detailed in Appendix A.11. |
| Software Dependencies | No | The paper mentions using a 'GPT2 Transformer backbone' and 'hybrid Mamba Transformer backbone', and refers to the 'geoopt library' for implementing Stiefel manifold optimization, but does not provide specific version numbers for any of these software components or other libraries. |
| Experiment Setup | Yes | All pretraining arguments and hyperparameters, as well as downstream generation and property-prediction hyperparameters, are provided in Appendix A.10. Table 4: Pretraining hyperparameters (identical for all 12 variants). Table 5: Pretraining hyperparameters for the 25M-sequence corpus (identical for both GPT-2 and Mamba hybrid architecture). |
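
The Dataset Splits row reports a 70%/15%/15% training/validation/test partition. The following is a minimal sketch of such a split, not the authors' code: the function name `split_70_15_15`, the use of a seeded shuffle, and the choice to assign any rounding remainder to the test subset are all assumptions made for illustration. The paper may split differently (for example, with stratification).

```python
import random


def split_70_15_15(items, seed=0):
    """Shuffle items and split them into train/validation/test at 70/15/15.

    Hypothetical helper for illustration only; the seed and the plain
    random shuffle are assumptions, not details taken from the paper.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # deterministic for a fixed seed

    n = len(shuffled)
    n_train = n * 70 // 100  # integer arithmetic avoids float rounding
    n_val = n * 15 // 100

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, about 15%
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_70_15_15(range(100))
    print(len(train), len(val), len(test))
```

Fixing the seed makes the partition repeatable across runs, which is the property a reader would need in order to reuse the same subsets.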