Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization

Authors: Mingzhe Du, Anh Tuan Luu, Yue Liu, Yuhao Qing, Dong HUANG, Xinyi He, Qian Liu, Zejun MA, See-Kiong Ng

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments on our Venus dataset and the APPS benchmark show that SFT and DPO rapidly saturate in efficiency gains. In contrast, GRPO, using reinforcement learning (RL) with execution feedback, continuously optimizes code performance, significantly boosting both PASS@1 (from 47% to 62%) and the likelihood of outperforming human submissions in efficiency (from 31% to 45%).
Researcher Affiliation Collaboration (1) Nanyang Technological University; (2) National University of Singapore; (3) The University of Hong Kong; (4) Xi'an Jiaotong University; (5) TikTok. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1 Iterative Efficiency Optimization Procedure. Input: Problem description $P$, Efficiency instruction $I \in \{\text{time}, \text{memory}, \text{integral}\}$, Set of test cases $T_{\text{cases}}$, Original code $C^{\text{in}}_0$ (optional), Number of iterations $N_{\text{iter}}$. Output: Improved code $C^{\text{out}}_0$, Improved code performance $M^{\text{out}}_0$.
Open Source Code Yes We released our code and data at https://github.com/Elfsong/Afterburner.
Open Datasets Yes Experiments on our Venus dataset and the APPS benchmark show that SFT and DPO rapidly saturate in efficiency gains. In contrast, GRPO, using reinforcement learning (RL) with execution feedback, continuously optimizes code performance, significantly boosting both PASS@1 (from 47% to 62%) and the likelihood of outperforming human submissions in efficiency (from 31% to 45%). Our work demonstrates effective test-time code efficiency improvement and critically reveals the power of RL in teaching LLMs to truly self-improve code efficiency. We released our code and data at https://github.com/Elfsong/Afterburner.
Dataset Splits Yes As shown in Table 7, Venus Python set includes 2,181 training and 300 test tasks. From this data, we derived training subsets for various optimization methods: SFT Dataset... DPO Dataset... Cold Start Dataset... GRPO Dataset... Finally, we split the dataset into a training set of 984 problems and a held-out test set of 300 problems, forming the complete Venus dataset. Beginning with the official APPS training split (5,000 problems), we discard problems that lack a sufficient number of accepted reference solutions, yielding 2,803 problems in the final dataset.
Hardware Specification Yes Afterburner models are trained on a single node with eight H100 GPUs. We deploy a code execution environment on a GCP n2-highcpu-96 instance (96 vCPUs, 96 GB Memory) with 81 Monolith workers.
Software Dependencies Yes We utilized Llama-Factory [75] for SFT and DPO training phases, and Verl [58] for GRPO training. For inference acceleration, we use vLLM [39]. Table 12: Programming Language Docker Images (Language: Image). Python: python:3.9.19-bullseye; Java: openjdk:11.0.12-jdk-bullseye; Javascript: node:22-bullseye; Cpp: gcc:11.2.0-bullseye; Go: golang:1.17.0-bullseye; Ruby: ruby:3.0.2-bullseye; Rust: rust:1.85.0-bullseye.
Experiment Setup Yes D.2 Details of Afterburner SFT Training. We fine-tune Qwen/Qwen2.5-3B-Instruct using Low-Rank Adaptation (LoRA). The model is trained for one epoch on $DS_{\text{SFT}}$. Key hyperparameters include a learning rate of 3e-5, managed by a cosine scheduler with 200 warm-up steps, an effective batch size of 64 (per-device batch size of 4 with 16 gradient accumulation steps), and the adamw_torch optimizer. For LoRA, the rank is 8 and alpha is 16. The training uses BF16 precision, and gradients are clipped at a norm of 1.0.
D.3 Details of Afterburner DPO Training. $\text{Afterburner}_{\text{DPO}}$ is trained from the checkpoint of $\text{Afterburner}_{\text{SFT}}$ utilizing LoRA for one epoch of the $DS_{\text{DPO}}$ dataset. Key hyperparameters include: learning_rate=4e-5 with a cosine scheduler and 300 warm-up steps, an effective batch size of 16 (per-device batch size of 2 with 8 gradient accumulation steps), and the adamw optimizer. LoRA parameters are set to rank 16, alpha 16, and a dropout of 0.05. DPO-specific settings include a beta of 0.1 and a sigmoid loss function, with pref_ftx (SFT loss component) set to 0. The training uses BF16 precision, and gradients are clipped at a norm of 1.0.
D.5 Details of Afterburner GRPO Training. $\text{Afterburner}_{\text{GRPO}}$ is trained on Verl [58] and initialized from $\text{Afterburner}_{\text{CS}}$. The GRPO training runs for 20 epochs on $DS_{\text{GRPO}}$. Since executing generated code and computing its efficiency metrics are time-consuming, we use a batch reward function to accelerate the reward calculation in a parallel manner. Key hyperparameters include: actor_learning_rate=1e-6, ppo_mini_batch_size=32 (4 per-GPU micro-batch). During the roll-outs, 16 responses are generated per prompt using vLLM [39] with inference_temperature=1.0. KL loss for actor updates is disabled, and the entropy coefficient is 0. For the reward weights, we set $\beta_f = 0.2$, $\beta_e = 0.3$, $\beta_c = 0.5$. Note that $R_{\text{efficiency}}$ is set to 0 if $C_{\text{pass}} = 0$. $e_{\text{upper}}$ is set to 90, 1048576, 94371840, respectively, which aligns with our timeout (90s) and memory (1GB) limitation.
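The reward weights and upper bounds quoted in the Experiment Setup row can be sanity-checked with a short sketch. This is illustrative only and is not the authors' reward implementation. It makes three assumptions that the excerpt does not state: the three reward components combine as a weighted sum, the component names `r_format` and `r_correctness` are the ones matched to the weights for f and c, and the memory bound is expressed in KB so that the third bound is the product of the first two.

```python
# Values quoted in the Experiment Setup row.
BETA_F, BETA_E, BETA_C = 0.2, 0.3, 0.5
E_UPPER = {"time": 90, "memory": 1048576, "integral": 94371840}

TIMEOUT_S = 90                  # timeout quoted as 90 s
MEMORY_LIMIT_KB = 1024 * 1024   # 1 GB written in KB (unit is an assumption)


def combined_reward(r_format, r_efficiency, r_correctness, passed):
    """Hypothetical weighted sum of three reward components.

    Only the weights and the rule "efficiency reward is 0 when the
    code does not pass" come from the quoted text; the linear form
    and the argument names are assumptions made for illustration.
    """
    if not passed:
        r_efficiency = 0.0
    return BETA_F * r_format + BETA_E * r_efficiency + BETA_C * r_correctness


# Under the KB assumption, the integral bound equals timeout x memory limit.
integral_bound = TIMEOUT_S * MEMORY_LIMIT_KB
```

Under these assumptions the weights sum to 1, and the quoted integral bound matches 90 × 1048576, which is consistent with the statement that the bounds align with the timeout and memory limits.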