Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Reinforcement Learning with Backtracking Feedback
Authors: Bilgehan Sel, Vaishakh Keshava, Phillip Wallis, Lukas Rutishauser, Ming Jin, Dingcheng Li
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive empirical evaluations demonstrate that RLBF significantly reduces attack success rates across diverse benchmarks and model scales, achieving superior safety outcomes while critically preserving foundational model utility. Sections 5 and 5.1-5.4 detail experimental results, including comparisons on various adversarial attacks and benchmarks with reported Attack Success Rates (ASR) and Solution Rates. |
| Researcher Affiliation | Collaboration | Authors are affiliated with 'Google', 'Google DeepMind' (industry) and 'Virginia Tech' (academia), indicating a collaborative affiliation. |
| Pseudocode | No | The paper describes its methodology in prose in Sections 3 and 4, without presenting any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Question 5 of the 'NeurIPS Paper Checklist' states: 'Does the paper provide open access to the data and code...?' Answer: '[No]' Justification: 'At this time, we provided all the needed information to reproduce the results given in the paper. We will consider releasing code upon acceptance.' |
| Open Datasets | Yes | The paper uses standard academic benchmarks for evaluation, such as 'LMSYS benchmark', 'MMLU', 'BBH', 'GSM8K', and 'MATH', which are well-known publicly available datasets. For example, 'Table 1 summarizes the Attack Success Rates (ASR) on the LMSYS benchmark... We assessed this by evaluating model performance on standard academic benchmarks: MMLU (general knowledge), BBH (complex reasoning), GSM8K (mathematical word problems), and MATH (advanced mathematics).' |
| Dataset Splits | No | The paper references standard academic benchmarks such as LMSYS, MMLU, BBH, GSM8K, and MATH for evaluation but does not explicitly detail the training/test/validation splits used for these benchmarks or for its own Supervised Fine-Tuning (SFT) data generation within the main body of the paper. The NeurIPS checklist indicates experimental details are in the supplemental, but not explicitly in the main paper. |
| Hardware Specification | No | The main paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types) used for its experiments. Question 8 of the 'NeurIPS Paper Checklist' states that 'All sufficient information on the computer resources needed to reproduce the experiments are supplied in the supplemental,' implying these details are not in the main text. |
| Software Dependencies | No | The main paper does not explicitly provide specific software dependency details with version numbers. Question 6 of the 'NeurIPS Paper Checklist' states that 'The full details can be provided either with the code, in appendix, or as supplemental material,' indicating these details are in the supplemental. |
| Experiment Setup | No | The main paper describes the theoretical framework and learning objectives, including the SFT loss function and GRPO optimization, but does not explicitly provide specific hyperparameter values, optimizer settings, or other detailed experimental setup configurations in the main text. Question 6 of the 'NeurIPS Paper Checklist' states that 'All sufficient information on the computer resources needed to reproduce the experiments are supplied in the supplemental,' indicating these details are not in the main text. |