Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Training Language Models to Generate Quality Code with Program Analysis Feedback

Authors: Feng Yao, Zilong Wang, Liyuan Liu, Junxia Cui, Li Zhong, Xiaohan Fu, Haohui Mai, Viswanathan Krishnan, Jianfeng Gao, Jingbo Shang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate REAL's efficacy, we evaluate it across multiple benchmarks spanning diverse production scenarios, assessing code quality along two key dimensions. (1) For security evaluation, we augment SecCodePLT dataset [Yang et al., 2024] with a program analysis-based detector built by us that effectively identifies 17 Common Weakness Enumerations (CWEs; A.1), resulting in an enhanced benchmark we term SecCodePLT+. To enable fine-grained evaluation of high-impact vulnerabilities, we additionally introduce SafeSQL, a targeted dataset featuring realistic database query tasks susceptible to SQL injection attacks. (2) For maintainability assessment, we augment APPS dataset [Hendrycks et al., 2021] to APPS+ with comprehensive static analysis, including type checking, unreachable code detection, and function signature verification for Python code.
Researcher Affiliation | Collaboration | 1 UC San Diego, 2 Microsoft Research, 3 Causal Flow Inc.
Pseudocode | No | The paper does not contain an explicit pseudocode block or algorithm section. It describes the framework and methods in text and diagrams (Figure 1).
Open Source Code | Yes | Our code and datasets are released at https://github.com/yaof20/ReaL.git
Open Datasets | Yes | We contribute three datasets for quality code generation: (1) SecCodePLT+, enhancing [Yang et al., 2024] with detectors for 17 CWEs, (2) APPS+, augmenting [Hendrycks et al., 2021] with static analysis for maintainability, and (3) SafeSQL, a targeted dataset for SQL injection vulnerabilities. ... Our code and datasets are released at https://github.com/yaof20/ReaL.git
Dataset Splits | Yes | Table 1: Overview of the benchmarks. ... Dataset, Train Size, Test Size ... SecCodePLT+: 655, 164 ... SafeSQL: 339, 85 ... APPS+: 2,038, 519
Hardware Specification | Yes | We implement REAL based on the VeRL framework and conduct all experiments on a server node equipped with 8 NVIDIA H100 GPUs.
Software Dependencies | No | The paper mentions MyPy [Lehtosalo, 2025] (version accessed: May 2025), a static analysis tool for Python, and Qwen2.5-Coder-Instruct as the backbone model. However, it does not specify version numbers for general programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or other dependencies crucial for replication.
Experiment Setup | Yes | For reinforcement learning, we adopt the Proximal Policy Optimization (PPO) algorithm with a hybrid reward design that balances functional correctness and code quality, focusing on both security and maintainability. The policy model is initialized from the Qwen2.5-Coder-Instruct checkpoint and fine-tuned using PPO with a learning rate of 1e-6, a batch size of 256, and a KL divergence penalty coefficient of 1e-3 to ensure stable policy updates. Advantage estimates are computed using Generalized Advantage Estimation (GAE) with a discount factor of 1.0 and a GAE lambda of 1.0. To promote exploration, entropy regularization is applied, and hybrid rewards are normalized to further stabilize training.
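
The Experiment Setup entry above reports GAE with a discount factor of 1.0 and a lambda of 1.0. A minimal sketch of what that setting implies is given below. It is not taken from the paper's code or from the VeRL framework: the names `PPO_CONFIG` and `compute_gae` are hypothetical, and the episode is assumed to have terminated, so the value after the last step is treated as 0. With both factors at 1.0, each advantage reduces to the undiscounted sum of future rewards minus the critic's value estimate at that step.

```python
from typing import List

# Hyperparameters as quoted in the Experiment Setup entry.
# The dictionary keys are illustrative names, not VeRL configuration keys.
PPO_CONFIG = {
    "learning_rate": 1e-6,
    "batch_size": 256,
    "kl_coef": 1e-3,
    "gamma": 1.0,
    "gae_lambda": 1.0,
}


def compute_gae(rewards: List[float], values: List[float],
                gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    """Generalized Advantage Estimation for one finished episode.

    values[t] is the critic's estimate V(s_t). The value after the final
    step is assumed to be 0 because the episode has terminated.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # V(s_T) for a terminated episode (assumption)
    running = 0.0     # discounted sum of later TD residuals
    for t in reversed(range(len(rewards))):
        # One-step TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


if __name__ == "__main__":
    # A single reward at the end of the sequence, e.g. a score on the
    # finished program; the numbers are made up for illustration.
    rewards = [0.0, 0.0, 1.0]
    values = [0.5, 0.25, 0.75]
    print(compute_gae(rewards, values,
                      PPO_CONFIG["gamma"], PPO_CONFIG["gae_lambda"]))
```

In this setting the estimator has no bias from bootstrapping on the critic, at the cost of higher variance, which is a common choice when the only reward arrives at the end of a generated sequence.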