Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

MoBA: Mixture of Block Attention for Long-Context LLMs

Authors: Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Yutao Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To assess the effectiveness of MoBA, we perform scaling law experiments by comparing the validation loss of language models trained using either full attention or MoBA. Following the Chinchilla scaling law [34], we train five language models of varying sizes with a sufficient number of training tokens to ensure that each model achieves its training optimum. We assess MoBA across a variety of real-world downstream tasks, evaluating its performance in comparison to full attention models. As shown in Figure 3a, the validation loss curves for MoBA and full attention display very similar scaling trends.
Researcher Affiliation | Collaboration | (1) Moonshot AI; (2) Tsinghua University; (3) Hangzhou Institute of Medicine, CAS
Pseudocode | Yes | Algorithm 1: MoBA (Mixture of Block Attention) Implementation
Open Source Code | Yes | Our code is available at https://github.com/MoonshotAI/MoBA. We include our code in Supplementary Material. Our new assets include the code of this work, which will be released under The MIT License (MIT).
Open Datasets | No | The paper references benchmarks like AGIEval, BBH, CEval, GSM8K, MMLU, LongBench, and RULER for evaluation. However, it does not provide concrete access information (e.g., specific links, DOIs, repository names, or formal citations for the datasets themselves) for these publicly available datasets, nor does it specify the training datasets used.
Dataset Splits | No | The paper mentions training on '30B tokens with a context length of 32K tokens' and assessing 'LM loss on validation set'. However, it does not explicitly provide details about the train/validation/test dataset splits, such as percentages, sample counts, or specific methodologies for partitioning the data.
Hardware Specification | Yes | For our validation experiments in the Section 3.1 and Section 3.2, we utilized a distributed computing infrastructure consisting of 8 server nodes, each equipped with 8 NVIDIA H800 GPUs (64 GPUs in total). For our validation experiments in the Section 3.3, we utilized a distributed computing infrastructure consisting of 128 GPU server nodes, each equipped with 8 NVIDIA H800 GPUs (1024 GPUs in total).
Software Dependencies | No | The paper mentions using FlashAttention [30] and DeepSpeed-MoE [31] as components for its implementation, but it does not specify version numbers for any general software dependencies like programming languages (e.g., Python) or deep learning frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | For MoBA models, we set the block size to 512 and select the top-3 blocks for attention, resulting in a sparse attention pattern with sparsity up to $1 - \frac{512 \times 3}{8192} = 81.25\%$. For the hyperparameters of MoBA, the block size is set to 2048, and the top-k parameter is set to 3. We set the block size to 4096 and the top-K parameter to 12. Table 2: Configuration of Scaling Law Experiments lists No-Emb Model Param, Head, Layer, Hidden, Training Token, Block size, Top K.
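
The sparsity figure quoted in the Experiment Setup row can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the MoBA repository: the function name moba_sparsity and its parameters are hypothetical, and it assumes that each query attends to at most block_size * top_k of the context_length key/value positions. Only the 512 / top-3 / 8192 configuration is evaluated, because the summary above does not state the context length for the other two settings.

```python
def moba_sparsity(block_size: int, top_k: int, context_length: int) -> float:
    """Upper bound on the fraction of key/value positions a query skips.

    Hypothetical helper for illustration: assumes a query attends to at most
    block_size * top_k positions out of context_length.
    """
    if block_size <= 0 or top_k <= 0 or context_length <= 0:
        raise ValueError("block_size, top_k and context_length must be positive")
    # A query cannot attend to more positions than the context contains.
    attended = min(block_size * top_k, context_length)
    return 1.0 - attended / context_length


if __name__ == "__main__":
    # Configuration quoted above: block size 512, top-3 blocks, 8192 tokens.
    print(f"{moba_sparsity(512, 3, 8192):.2%}")  # prints 81.25%
```

The clamp with min() keeps the result at zero when the selected blocks would cover the whole context, which is the case where block attention degenerates to full attention.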