Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Improving Bilinear RNN with Closed-loop Control
Authors: Jiaxi Hu, Yongqi Pan, Jusen Du, Disen Lan, Tang, Qingsong Wen, Yuxuan Liang, Weigao Sun
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pretrain models with 340M/1.3B parameters on large-scale corpus. Comba demonstrates superior performance and computation efficiency in both language and vision modeling. 4 Experiments. Setting: In this paper, all models are pretrained based on flash-linear-attention [109] repository and utilize NVIDIA A800-80G GPUs. The 340M Comba pretraining requires 8 10 GPU hours, while the 1.3B Comba requires 32 48 GPU hours. We employ the AdamW optimizer [63] with a 3e-4 learning rate, cosine schedule, 0.01 weight decay, and 1.0 gradient clipping. Random seed is 42. Figure 3: Operator speed evaluated on the Triton-Testing-Benchmark [93] (fwd and bwd) in single A800-80G GPU. Table 6: Zero-shot performance of 340M and 1.3B models trained on SlimPajama [85] datasets. The commonsense reasoning task is evaluated by lm-evaluation-harness [33] and the recall-intensive task follows prefix-linear-attention [3] with 2K input tokens. |
| Researcher Affiliation | Collaboration | (1) The Hong Kong University of Science and Technology (Guangzhou); (2) Shanghai AI Laboratory; (3) Squirrel Ai Learning, USA |
| Pseudocode | Yes | We provide the recurrent Comba in Algorithm 1. `def Recurrent_comba(q, k, v, alpha, beta, b, d):` `B, T, H, D = q.shape` `q_new = q - d * k  # Output correction` `o, S = torch.zeros_like(v), torch.zeros(b, h, d, d)` `for i in range(T):` `_q, _k, _alpha, _beta = q_new[:, i], k[:, i], alpha[:, i], beta[:, i]` `_v_new = _beta[..., None] * (v[:, i] - b * (S * _k[..., None]).sum(-2))` `S = _At[..., None] * S + _k.unsqueeze(-1) * _v_new.unsqueeze(-2)` `o[:, i] = torch.einsum('bhd,bhdm->bhm', _q, S)` `return o` Listing 1: Recurrent Comba-pk in PyTorch-like pseudo-code for inference |
| Open Source Code | Yes | https://github.com/fla-org/flash-linear-attention. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide the code in an anonymous link. |
| Open Datasets | Yes | Table 6: Zero-shot performance of 340M and 1.3B models trained on SlimPajama [85] datasets. [85] Daria Soboleva, Faisal Al-Khateeb, Robert Myers, Jacob R Steeves, Joel Hestness, and Nolan Dey. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023. Table 8: Performance on LongBench [4] tasks with 10K length based on lm-evaluation-harness [33]. [4] Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, Bangkok, Thailand, August 2024. Association for Computational Linguistics. Table 9: Performance on the ImageNet-1K [27] classification, compared to Vision Mamba [115] (linear), DeiT [94] (quadratic), and Agent Attention [44] (sparse). [27] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009. Table 10: Performance on the object tracking datasets such as GOT10k [54] and LaSOT [30] |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (percentages or counts) in the main text. It names the datasets used for pretraining and evaluation (SlimPajama, LongBench, ImageNet-1K, GOT10k, LaSOT) and reports zero-shot performance, but it does not detail how these datasets were partitioned for the experiments described. |
| Hardware Specification | Yes | Setting In this paper, all models are pretrained based on flash-linear-attention [109] repository and utilize NVIDIA A800-80G GPUs. Figure 3: Operator speed evaluated on the Triton-Testing-Benchmark [93] (fwd and bwd) in single A800-80G GPU. |
| Software Dependencies | No | The paper mentions using Triton [93], PyTorch [71], and the AdamW optimizer [63], but it does not specify version numbers for any of these software components or libraries, which is required for reproducibility. |
| Experiment Setup | Yes | The 340M Comba pretraining requires 8 10 GPU hours, while the 1.3B Comba requires 32 48 GPU hours. We employ the AdamW optimizer [63] with a 3e-4 learning rate, cosine schedule, 0.01 weight decay, and 1.0 gradient clipping. Random seed is 42. 340M params with 15B training tokens and 0.5M batch-size tokens; 1.3B params with 100B training tokens and 1M batch-size tokens. |
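
The recurrent listing quoted in the Pseudocode row is flattened into a single table cell, so a runnable version may be easier to follow. The block below is a minimal sketch, not the authors' implementation. It uses NumPy in place of PyTorch to keep the example to one dependency. Several points are assumptions rather than facts from the source: the two correction terms are read as subtractions (`q - d * k` and `v - b * (...)`), following the usual delta-rule form; the name `_At`, which the listing never defines, is taken to be the per-step decay `alpha`; the state is allocated from the tensor shapes rather than from `torch.zeros(b, h, d, d)`, since `b` and `d` are also the names of the two correction factors; and `b` and `d` are treated as scalars.

```python
import numpy as np


def recurrent_comba(q, k, v, alpha, beta, b, d):
    """Token-by-token recurrence sketched from the quoted listing.

    Assumed shapes (not stated in the source):
      q, k:        (B, T, H, Dk)
      v:           (B, T, H, Dv)
      alpha, beta: (B, T, H)   per-step decay and input gate
      b, d:        scalars     state-feedback and output-correction factors
    """
    B, T, H, Dk = q.shape
    Dv = v.shape[-1]

    q_new = q - d * k                            # output correction on the query
    S = np.zeros((B, H, Dk, Dv), dtype=v.dtype)  # recurrent key-value state
    o = np.zeros_like(v)

    for i in range(T):
        _q, _k = q_new[:, i], k[:, i]
        _alpha, _beta = alpha[:, i], beta[:, i]

        # Read the current state with the key: (S^T k), shape (B, H, Dv).
        readout = (S * _k[..., None]).sum(-2)
        # Gated value after subtracting the state feedback scaled by b.
        _v_new = _beta[..., None] * (v[:, i] - b * readout)
        # Decay the old state, then add the rank-1 update k v_new^T.
        S = _alpha[..., None, None] * S + _k[..., None] * _v_new[..., None, :]
        # Query the updated state.
        o[:, i] = np.einsum("bhd,bhdm->bhm", _q, S)
    return o


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    B, T, H, D = 1, 4, 2, 8
    q, k, v = (rng.normal(size=(B, T, H, D)) for _ in range(3))
    alpha = rng.uniform(0.9, 1.0, size=(B, T, H))
    beta = rng.uniform(0.0, 1.0, size=(B, T, H))
    print(recurrent_comba(q, k, v, alpha, beta, b=0.5, d=0.1).shape)
```

With `b = 0`, `d = 0`, and `alpha = beta = 1`, the loop reduces to plain causal linear attention, which gives a simple sanity check for the recurrence.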