Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Compiler-R1: Towards Agentic Compiler Auto-tuning with Reinforcement Learning

Authors: Haolin Pan, Hongyu Lin, Haoran Luo, Yang Liu, Kaichun Yao, Libo Zhang, Mingjie Xing, Yanjun Wu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section presents the experimental evaluation of Compiler-R1. We begin by describing the common experimental setup, followed by three main experiments: (1) a performance comparison between Compiler-R1 and various baselines; (2) an analysis of factors affecting task success rate, which reflects effective environment interaction and highlights differences in sampling needs between interactive and non-interactive models; and (3) a case study investigating the impact of input feature representations.
Researcher Affiliation | Academia | 1 Hangzhou Institute for Advanced Study, UCAS, China; 2 Institute of Software, Chinese Academy of Sciences, China; 3 University of Chinese Academy of Sciences, China; 4 Nanyang Technological University, Singapore
Pseudocode | Yes | Algorithm 1 implements the synergy pass pair identification methodology described in Section 3.1.
Open Source Code | Yes | Our code and datasets are publicly available at https://github.com/Panhaolin2001/Compiler-R1.
Open Datasets | Yes | Our code and datasets are publicly available at https://github.com/Panhaolin2001/Compiler-R1.
Dataset Splits | Yes | Training is performed on six Compiler Gym datasets filtered to contain programs with fewer than 10k IR instructions. Evaluation is conducted on seven test suites: blas, cbench, chstone, mibench, npb, opencv, and tensorflow.
Hardware Specification | Yes | All experiments were conducted on Intel Xeon Gold 6430 servers (128 cores, 1TB RAM) with NVIDIA H100 GPUs (4 × 80GB HBM3).
Software Dependencies | Yes | All evaluated models operate within a fixed optimization space comprising 124 LLVM 10.0.0 opt passes and the -Oz preset (125 total actions).
Experiment Setup | Yes | Training involves 800 supervised fine-tuning (SFT) samples for protocol initialization, followed by reinforcement learning (GRPO, PPO, or RPP) on 19k interactive episodes, updating over 40 steps.
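
The setup figures quoted in the rows above can be collected into a small configuration sketch. This Python is illustrative only. The constant, function, and class names, the placeholder pass names, and the example program names are assumptions and are not taken from the Compiler-R1 repository. Only the numeric values and the test suite names come from the table.

```python
from dataclasses import dataclass

# Values copied from the table; all identifiers are hypothetical labels.
NUM_OPT_PASSES = 124          # LLVM 10.0.0 opt passes
MAX_IR_INSTRUCTIONS = 10_000  # training programs must have fewer than this
TEST_SUITES = ("blas", "cbench", "chstone", "mibench", "npb", "opencv", "tensorflow")


def build_action_space(pass_names):
    """Return the pass list plus the -Oz preset as one extra action."""
    if len(pass_names) != NUM_OPT_PASSES:
        raise ValueError(f"expected {NUM_OPT_PASSES} passes, got {len(pass_names)}")
    return list(pass_names) + ["-Oz"]


def filter_programs(ir_instruction_counts):
    """Keep programs with fewer than MAX_IR_INSTRUCTIONS IR instructions.

    `ir_instruction_counts` maps a program name to its IR instruction count.
    """
    return {
        name: count
        for name, count in ir_instruction_counts.items()
        if count < MAX_IR_INSTRUCTIONS
    }


@dataclass(frozen=True)
class TrainingSetup:
    sft_samples: int = 800       # SFT samples for protocol initialization
    rl_episodes: int = 19_000    # "19k" in the table, taken here as 19,000
    rl_update_steps: int = 40
    rl_algorithm: str = "GRPO"   # the table also lists PPO and RPP


# Placeholder names stand in for the real pass list, which the table does not enumerate.
placeholder_passes = [f"pass_{i:03d}" for i in range(NUM_OPT_PASSES)]
actions = build_action_space(placeholder_passes)

# Hypothetical programs: only the one strictly under the cutoff is kept.
kept = filter_programs({"small.ll": 512, "edge.ll": 10_000, "large.ll": 25_000})
setup = TrainingSetup()
```

The action count of 125 follows directly from appending the -Oz preset to the 124 individual passes, and the filter uses a strict comparison because the table says "fewer than 10k".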