Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AREAL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Authors: Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77× training speedup compared to synchronous systems with the same number of GPUs and matched or even improved final performance. The code of AREAL is available at https://github.com/inclusionAI/AReaL/. 6 Experiments. Our evaluation comprises three components: (1) comprehensive comparisons against state-of-the-art open-source frameworks across model sizes, (2) strong-scaling analysis with varying compute resources, and (3) ablation studies validating our design choices.
Researcher Affiliation	Collaboration	Wei Fu (1,2), Jiaxuan Gao (1), Xujie Shen (2), Chen Zhu (2), Zhiyu Mei (1,2), Chuyi He (2), Shusheng Xu (1,2), Guo Wei (2), Jun Mei (2), Jiashu Wang (3), Tongkai Yang (2), Binhang Yuan (3), Yi Wu (1). Affiliations: (1) IIIS, Tsinghua University; (2) Ant Group; (3) HKUST. EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 Dynamic Batching
Open Source Code	Yes	The code of AREAL is available at https://github.com/inclusionAI/AReaL/.
Open Datasets	Yes	For the math task, we used the open-source data from DeepScaleR [25]. For code training, we used the dataset released by DeepCoder [24]. All compared methods use the same dataset.
Dataset Splits	No	Our evaluation of mathematical tasks follows the Qwen evaluation protocol [56, 13], while coding models are assessed on LiveCodeBench (8/1/24-2/1/25) [14] using the official protocol. Unless otherwise specified, we set the maximum staleness η = 4 for coding and η = 8 for math, and adopt the training configurations used in Section 6.2, with additional hyperparameters detailed in Appendix B.
Hardware Specification	Yes	We conduct experiments on an H800 GPU cluster comprising 64 nodes, each equipped with 8 GPUs. The cluster features NVLink for intra-node connectivity and RoCE with 3.2 Tbps bandwidth for inter-node communication.
Software Dependencies	Yes	We implement AREAL using Python and PyTorch [35] built upon the ReaLHF [27] framework. Our system combines SGLang [63] v0.4.6 for generation serving with Megatron-Core [46] v0.11.0 as the training backend, managed by SLURM [59] for resource scheduling. For most of the results, we use SGLang [63] v0.4.6 as generation backend and PyTorch FSDP [62] as training backend. In a few cases where SGLang raises errors (experiments with 32B models or 64 nodes), we use vLLM [18] v0.8.4 as a substitution.
Experiment Setup	Yes	Table 3: Training configurations and hyperparameters. Training Configuration: batch size (number of prompts) 512; random seed 1. PPO Parameters: PPO minibatches 4; clipping ϵ 0.2; advantage normalization True; discount factor γ 1.0; GAE λ 1.0. Optimizer Parameters: optimizer Adam; learning rate 2.0 × 10^-5; weight decay 0.05; β1 0.9; β2 0.95; Adam ϵ 1 × 10^-5; gradient norm clipping 1.0; learning rate scheduler constant; warmup steps proportion 0.001. Precision Parameters: parameter dtype fp16; KV cache dtype fp16; gradient dtype fp32; optimizer state dtype fp32. Generation Parameters: answers per prompt 16; temperature 1.0; top-p 1.0; top-k -1; max prompt length 1024; min generation length 0; max generation length 27648.
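
For readers who want to reuse the reported setup, the Table 3 values quoted in the Experiment Setup row can be collected into a small configuration object. This is a minimal sketch, not the AReaL configuration schema: the class name, field names, and helper methods are hypothetical, and the learning rate and Adam ϵ are read as 2.0 × 10^-5 and 1 × 10^-5 on the assumption that the extracted text lost the multiplication sign and the negative exponent. The derived numbers are plain arithmetic on the quoted values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Values transcribed from Table 3; all field names are hypothetical."""

    # Training configuration
    batch_size_prompts: int = 512
    random_seed: int = 1
    # PPO parameters
    ppo_minibatches: int = 4
    clip_eps: float = 0.2
    advantage_normalization: bool = True
    discount_gamma: float = 1.0
    gae_lambda: float = 1.0
    # Optimizer parameters (exponents assumed negative, see lead-in)
    optimizer: str = "Adam"
    learning_rate: float = 2.0e-5
    weight_decay: float = 0.05
    beta1: float = 0.9
    beta2: float = 0.95
    adam_eps: float = 1e-5
    grad_norm_clip: float = 1.0
    lr_scheduler: str = "constant"
    warmup_proportion: float = 0.001
    # Generation parameters
    answers_per_prompt: int = 16
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = -1
    max_prompt_len: int = 1024
    min_gen_len: int = 0
    max_gen_len: int = 27648

    def sequences_per_batch(self) -> int:
        # Each prompt in a batch is answered answers_per_prompt times.
        return self.batch_size_prompts * self.answers_per_prompt

    def max_sequence_len(self) -> int:
        # Upper bound on prompt tokens plus generated tokens.
        return self.max_prompt_len + self.max_gen_len


# Cluster size from the Hardware Specification row: 64 nodes with 8 GPUs each.
NODES = 64
GPUS_PER_NODE = 8

config = TrainingConfig()
print(config.sequences_per_batch())  # 8192
print(config.max_sequence_len())     # 28672
print(NODES * GPUS_PER_NODE)         # 512
```

Under these values, one training batch holds 8192 generated sequences of at most 28672 tokens each, and the full cluster has 512 GPUs.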