Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Bidirectional Representations Augmented Autoregressive Biological Sequence Generation

Authors: Xiang Zhang, Jiaqi Wei, Zijie Qiu, Sheng Xu, Zhi Jin, ZhiQiang Gao, Nanqing Dong, Siqi Sun

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluations on a demanding 9-species benchmark of de novo peptide sequencing task show our model substantially surpasses AR and NAR baselines. It uniquely harmonizes AR stability with NAR contextual awareness, delivering robust, superior performance on diverse downstream data. This research advances biological sequence modeling techniques and contributes a novel architectural paradigm for augmenting AR models with enhanced bidirectional understanding for complex sequence generation.
Researcher Affiliation | Academia | 1 Fudan University 2 University of British Columbia 3 Shanghai Artificial Intelligence Laboratory 4 Zhejiang University EMAIL, EMAIL
Pseudocode | Yes | Algorithm 1 CROSSNOVO: AT Fine-tuning with Cross-Decoder NAT Knowledge Transfer
Open Source Code | Yes | Our code is available in GitHub.
Open Datasets | Yes | Dataset. Following prior work [41, 46] for fair comparison, we trained CROSSNOVO on the MassIVE-KB dataset [33], which contains 30 million high-resolution peptide-spectrum matches (PSMs) from diverse instruments. For validation and benchmarking against leading methods [40, 41, 46, 50], we used the 9-species-v1 (approx. 1.5M spectra from nine experiments) and the 9-species-v2 revised datasets. The latter offers more and higher-quality spectra with broader data distribution and stricter annotation than its predecessor. ... We utilize a publicly available human antibody dataset [30]
Dataset Splits | Yes | Dataset. Following prior work [41, 46] for fair comparison, we trained CROSSNOVO on the MassIVE-KB dataset [33], which contains 30 million high-resolution peptide-spectrum matches (PSMs) from diverse instruments. For validation and benchmarking against leading methods [40, 41, 46, 50], we used the 9-species-v1 (approx. 1.5M spectra from nine experiments) and the 9-species-v2 revised datasets.
Hardware Specification | Yes | We trained CROSSNOVO on eight NVIDIA A100 80GB GPUs using the AdamW optimizer [18] with an initial learning rate of 5 × 10⁻⁴, a linear warm-up phase, and a subsequent cosine decay schedule for training stability.
Software Dependencies | No | The paper mentions the "AdamW optimizer [18]" but does not specify any software libraries or frameworks (e.g., PyTorch, TensorFlow) along with their version numbers.
Experiment Setup | Yes | All inputs (peaks and amino acids) were embedded into 400 dimensions. The shared spectrum encoder, NAT decoder, and AT decoder of CROSSNOVO each comprise 9 Transformer layers with 8 attention heads and 1024 hidden dimensions. We trained CROSSNOVO on eight NVIDIA A100 80GB GPUs using the AdamW optimizer [18] with an initial learning rate of 5 × 10⁻⁴, a linear warm-up phase, and a subsequent cosine decay schedule for training stability.
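
For readers who want to see what the reported experiment setup implies in practice, the following is a minimal sketch of the hyperparameters and a linear warm-up plus cosine decay schedule. It is not the authors' code. The names `TrainConfig` and `learning_rate` are hypothetical, and the warm-up length, total step count, decay-to-zero floor, and the reading of the reported 5 × 10⁻⁴ as the peak rate reached after warm-up are all assumptions, since the quoted text does not state them.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values quoted from the paper's experiment setup.
    embed_dim: int = 400      # peak and amino-acid embedding size
    num_layers: int = 9       # per encoder / NAT decoder / AT decoder
    num_heads: int = 8
    hidden_dim: int = 1024
    peak_lr: float = 5e-4     # assumption: reported rate is the post-warm-up peak
    # Hypothetical values: not reported in the quoted text.
    warmup_steps: int = 1_000
    total_steps: int = 100_000


def learning_rate(step: int, cfg: TrainConfig) -> float:
    """Linear warm-up to peak_lr, then cosine decay to zero (assumed floor)."""
    if step < cfg.warmup_steps:
        # Ramp linearly so the last warm-up step reaches peak_lr.
        return cfg.peak_lr * (step + 1) / cfg.warmup_steps
    decay_steps = max(1, cfg.total_steps - cfg.warmup_steps)
    progress = min((step - cfg.warmup_steps) / decay_steps, 1.0)
    return 0.5 * cfg.peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    cfg = TrainConfig()
    for step in (0, cfg.warmup_steps - 1, 50_500, cfg.total_steps):
        print(step, learning_rate(step, cfg))
```

In a real training loop, a function like this would typically be handed to the framework's scheduler; which framework the authors used is not stated, as the Software Dependencies row notes.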