Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

Authors: Yifei Liu, Li Lyna Zhang, Yi Zhu, Bingcheng Dong, Xudong Zhou, Ning Shang, Fan Yang, Cheng Li, Mao Yang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of r Star-Coder dataset, achieving leading performance comparable to frontier reasoning LLMs with significantly smaller model sizes. On Live Code Bench, r Star-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by 3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B.
Researcher Affiliation Collaboration 1University of Science and Technology of China 2Microsoft Research Asia 3Dalian University of Technology 4Shanghai Jiao Tong University
Pseudocode Yes Algorithm 1 Three-Step Test Input Generation Algorithm Step 1: GPT-4o generates test input and validation functions with applying the CYa Ron library.
Open Source Code Yes We have included the code in the supplementary materials for transparency and reproducibility.
Open Datasets Yes r Star-Coder dataset is publicly available at https://huggingface.co/datasets/microsoft/r Star-Coder. ... We have open-sourced our dataset at https://huggingface.co/datasets/microsoft/r Star-Coder.
Dataset Splits No The paper states: "Training setup. Using our 580K dataset, we fine-tune Qwen2.5-Coder instruct models [16] at 1.5B, 7B, and 14B sizes for 6 epochs using the Adam W optimizer, a batch size of 96, and a max sequence length of 16k." While this describes the training data used, it does not specify explicit training/validation/test splits *of the 580K dataset itself* for model development, only that it is used for fine-tuning and evaluation is done on external benchmarks.
Hardware Specification Yes Specifically, the 1.5B and 7B models are trained on 8 MI300X AMD GPUs, while the 14B model uses 32 MI300X GPUs.
Software Dependencies No The paper mentions "Adam W optimizer", "Flash Attention-2 [7]", and "Deep Speed Ze RO-0" but does not provide specific version numbers for these software components or the underlying programming environment (e.g., Python, PyTorch).
Experiment Setup Yes Training setup. Using our 580K dataset, we fine-tune Qwen2.5-Coder instruct models [16] at 1.5B, 7B, and 14B sizes for 6 epochs using the Adam W optimizer, a batch size of 96, and a max sequence length of 16k. The learning rate is 4e-5 with cosine decay.