Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

PurpCode: Reasoning for Safer Code Generation

Authors: Jiawei Liu, Nirav Diwan, Zhe Wang, Haoyu Zhai, Xiaona Zhou, Kiet Nguyen, Tianjiao Yu, Muntasir Wahed, Yinlin Deng, Hadjer Benkraouda, Yuxiang Wei, Lingming Zhang, Ismini Lourentzou, Gang Wang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that PurpCode-32B generates safer code than many frontier models on various cybersafety benchmarks and red-teaming. The main evaluation (Section 4) contains Table 4: Cyber safety evaluation results among frontier LLMs and our PurpCode-32B.
Researcher Affiliation | Academia | University of Illinois Urbana-Champaign
Pseudocode | No | The paper describes methods and processes in text, such as in sections 2 (Reasoning-based alignment for safe code generation) and 3 (Internal red-teaming). While it references algorithms like GRPO, it does not present structured pseudocode or algorithm blocks for its own methodology.
Open Source Code | Yes | Furthermore, we fully open-source our training recipe, including training infrastructure, training and evaluation datasets, data synthesizers, and evaluators. The NeurIPS Paper Checklist, section 5, also states: 'We have open-sourced the complete code, data, and instructions to reproduce the results.'
Open Datasets | Yes | Furthermore, we fully open-source our training recipe, including training infrastructure, training and evaluation datasets, data synthesizers, and evaluators. The NeurIPS Paper Checklist, section 5, also states: 'We have open-sourced the complete code, data, and instructions to reproduce the results.'
Dataset Splits | Yes | Table 3 lists the alignment data overview for training our default PurpCode-32B model, covering safety prompts curated by this work and additional utility prompts for code generation and security knowledge. We first use a small percentage of prompts for rule learning, which samples 8 responses per prompt and retains one passing sample (if any) for supervised finetuning. For RL, we use all single-turn prompts and exclude easy rule-learning prompts with over 70% passing rate.
Hardware Specification | Yes | All model training was performed on NVIDIA H100 and H200 GPUs, equipped with 8 × 80 GB and 8 × 144 GB of VRAM, respectively.
Software Dependencies | Yes | We employ CodeGuru v0.2.4 [2] as our default code analyzer.
Experiment Setup | Yes | We train PurpCode-32B starting from Qwen2.5-32B(Instruct3). We first use a small percentage of prompts for rule learning, which samples 8 responses per prompt and retains one passing sample (if any) for supervised finetuning. As o4-mini locks its temperature to 1, we repeat the o4-mini evaluation three times for each benchmark and report the average score.
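
The data selection described in the Dataset Splits and Experiment Setup rows can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `generate` and `passes` callables, the choice to keep the first passing response, and the decision to keep prompts that have no recorded pass rate are all assumptions made for illustration. Only the sample count (8) and the 70% threshold come from the text above.

```python
from typing import Callable, Dict, List, Tuple

NUM_SAMPLES = 8        # responses sampled per prompt (stated in the text)
EASY_PASS_RATE = 0.70  # prompts above this pass rate count as "easy" (stated in the text)


def select_rule_learning_data(
    prompts: List[str],
    generate: Callable[[str], str],      # hypothetical sampler: prompt -> response
    passes: Callable[[str, str], bool],  # hypothetical checker: (prompt, response) -> ok?
    num_samples: int = NUM_SAMPLES,
) -> Tuple[List[Tuple[str, str]], Dict[str, float]]:
    """Sample responses per prompt and keep one passing response (if any) for SFT."""
    sft_pairs: List[Tuple[str, str]] = []
    pass_rates: Dict[str, float] = {}
    for prompt in prompts:
        responses = [generate(prompt) for _ in range(num_samples)]
        passing = [r for r in responses if passes(prompt, r)]
        pass_rates[prompt] = len(passing) / num_samples
        if passing:
            # Assumption: keep the first passing response; the source does not
            # say how the retained sample is chosen.
            sft_pairs.append((prompt, passing[0]))
    return sft_pairs, pass_rates


def select_rl_prompts(
    single_turn_prompts: List[str],
    pass_rates: Dict[str, float],
    threshold: float = EASY_PASS_RATE,
) -> List[str]:
    """Keep single-turn prompts, dropping rule-learning prompts that were too easy."""
    # Assumption: prompts never used for rule learning have no recorded pass
    # rate and are kept.
    return [p for p in single_turn_prompts if pass_rates.get(p, 0.0) <= threshold]
```

Under these assumptions, a prompt that passes on all 8 samples contributes one supervised finetuning pair but is left out of RL, while a prompt that never passes contributes no finetuning pair and stays in the RL set.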