Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Bidirectional Representations Augmented Autoregressive Biological Sequence Generation
Authors: Xiang Zhang, Jiaqi Wei, Zijie Qiu, Sheng Xu, Zhi Jin, ZhiQiang Gao, Nanqing Dong, Siqi Sun
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on a demanding 9-species benchmark of de novo peptide sequencing task show our model substantially surpasses AR and NAR baselines. It uniquely harmonizes AR stability with NAR contextual awareness, delivering robust, superior performance on diverse downstream data. This research advances biological sequence modeling techniques and contributes a novel architectural paradigm for augmenting AR models with enhanced bidirectional understanding for complex sequence generation. |
| Researcher Affiliation | Academia | 1 Fudan University; 2 University of British Columbia; 3 Shanghai Artificial Intelligence Laboratory; 4 Zhejiang University |
| Pseudocode | Yes | Algorithm 1 CROSSNOVO: AT Fine-tuning with Cross-Decoder NAT Knowledge Transfer |
| Open Source Code | Yes | Our code is available in GitHub. |
| Open Datasets | Yes | Dataset. Following prior work [41, 46] for fair comparison, we trained CROSSNOVO on the MassIVE-KB dataset [33], which contains 30 million high-resolution peptide-spectrum matches (PSMs) from diverse instruments. For validation and benchmarking against leading methods [40, 41, 46, 50], we used the 9-species-v1 (approx. 1.5M spectra from nine experiments) and the 9-species-v2 revised datasets. The latter offers more and higher-quality spectra with broader data distribution and stricter annotation than its predecessor. ... We utilize a publicly available human antibody dataset [30] |
| Dataset Splits | Yes | Dataset. Following prior work [41, 46] for fair comparison, we trained CROSSNOVO on the MassIVE-KB dataset [33], which contains 30 million high-resolution peptide-spectrum matches (PSMs) from diverse instruments. For validation and benchmarking against leading methods [40, 41, 46, 50], we used the 9-species-v1 (approx. 1.5M spectra from nine experiments) and the 9-species-v2 revised datasets. |
| Hardware Specification | Yes | We trained CROSSNOVO on eight NVIDIA A100 80GB GPUs using the AdamW optimizer [18] with an initial learning rate of 5 × 10⁻⁴, a linear warm-up phase, and a subsequent cosine decay schedule for training stability. |
| Software Dependencies | No | The paper mentions the "AdamW optimizer [18]" but does not specify any software libraries or frameworks (e.g., PyTorch, TensorFlow) along with their version numbers. |
| Experiment Setup | Yes | All inputs (peaks and amino acids) were embedded into 400 dimensions. The shared spectrum encoder, NAT decoder, and AT decoder of CROSSNOVO each comprise 9 Transformer layers with 8 attention heads and 1024 hidden dimensions. We trained CROSSNOVO on eight NVIDIA A100 80GB GPUs using the AdamW optimizer [18] with an initial learning rate of 5 × 10⁻⁴, a linear warm-up phase, and a subsequent cosine decay schedule for training stability. |
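
The Experiment Setup row describes a learning rate that warms up linearly and then follows a cosine decay. The sketch below illustrates one common form of such a schedule in plain Python. Only the peak value of 5 × 10⁻⁴ comes from the quoted text; the function name `lr_at`, the warm-up length, the total step count, and the floor `min_lr` are hypothetical choices for illustration, since the quoted passage does not report them, and this is not necessarily the authors' implementation.

```python
import math

PEAK_LR = 5e-4  # learning rate quoted in the table; everything else is assumed


def lr_at(step, warmup_steps, total_steps, peak_lr=PEAK_LR, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warm-up, then cosine decay.

    warmup_steps, total_steps and min_lr are illustrative assumptions;
    the quoted text does not state them.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half-cosine from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical run: 100 warm-up steps out of 1000 total.
    for s in (0, 99, 100, 550, 1000):
        print(s, lr_at(s, warmup_steps=100, total_steps=1000))
```

With these assumed settings the rate climbs to the peak by the end of warm-up, passes through roughly half the peak midway through the decay phase, and reaches the floor at the final step.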