Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Teaching Models to Improve on Tape
Authors: Liat Bezalel, Eyal Orgad, Amir Globerson
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce an RL framework for teaching models to use such rewards, by simulating interaction sessions, and rewarding the model according to its ability to satisfy the constraints. We refer to our method as CORGI (Controlled Generation with RL for Guided Interaction), and evaluate it on a variety of controlled generation tasks. We find that CORGI consistently outperforms the baseline reinforcement learning method that does not incorporate conversational feedback. |
| Researcher Affiliation | Collaboration | 1Tel Aviv University 2Google Research |
| Pseudocode | No | The paper describes the CORGI setup, reward definition, and PPO objective function mathematically and textually, but it does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | Code https://github.com/Liat-Bezalel/corgi |
| Open Datasets | Yes | We further evaluated our approach on a broader set of tasks, including Style Transfer (Pryzant et al. 2020), Common Gen-lite (Lin et al. 2020), Program Synthesis (Numeric category) (BIG-bench authors 2023), MBPP (sanitized subset) (Austin et al. 2021), and Common Gen-Hard (Madaan et al. 2023). |
| Dataset Splits | Yes | For each task, we created a prompt outlining the constraints and including two few-shot examples. Additionally, we generated 7,500 training prompts and 500 validation prompts per task. For the Rationale Generation, Controlled Paraphrase Generation, Common-Gen lite, Program Synthesis (numeric category), MBPP (sanitized subset), and Common-Gen Hard tasks, we used the predefined train/dev/test splits, sampling the training prompts and validation prompts from their corresponding splits. For tasks without predefined train/dev/test splits, we additionally produced 1,000 test prompts. |
| Hardware Specification | Yes | Training was conducted on a single NVIDIA A100-SXM4-80GB GPU, taking 4 days for the multi-task setting and 12-24 hours for the single-task setting. |
| Software Dependencies | No | The CORGI framework was implemented using the TRL (von Werra et al. 2020) library, which we also utilized for RL-No FB by setting the number of attempts to one. The Adam optimizer was employed with a learning rate of 10^-5 and Adaptive KL control (Ziegler et al. 2019). |
| Experiment Setup | Yes | The Adam optimizer was employed with a learning rate of 10^-5 and Adaptive KL control (Ziegler et al. 2019), with initial coefficients set to 0.05 for Llama-2 and 0.075 for Llama-3. Due to training instability, the KL coefficient was adjusted to 0.3 for the rationale generation task in Llama-3. For CORGI training, the number of attempts was limited to four per prompt due to computational constraints. |
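The Adaptive KL control cited in the experiment setup (Ziegler et al. 2019) adjusts the KL-penalty coefficient during PPO training so the policy's divergence from the reference model stays near a target. Below is a minimal sketch of that controller's proportional-update rule, assuming the formulation from Ziegler et al. as popularized by the TRL library; `target` and `horizon` are illustrative defaults, not values reported in the paper.

```python
class AdaptiveKLController:
    """Adaptive KL coefficient (Ziegler et al. 2019).

    init_kl_coef=0.05 mirrors the Llama-2 setting quoted above
    (0.075 for Llama-3; 0.3 for the Llama-3 rationale-generation
    task). `target` and `horizon` are illustrative, not from the paper.
    """

    def __init__(self, init_kl_coef=0.05, target=6.0, horizon=10000):
        self.value = init_kl_coef
        self.target = target
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Proportional error between observed and target KL,
        # clipped to [-0.2, 0.2] to keep coefficient updates stable.
        error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        # Multiplicative update: raise the penalty when KL overshoots
        # the target, lower it when KL undershoots.
        self.value *= 1.0 + 0.1 * error * n_steps / self.horizon
```

With this rule, a batch whose measured KL exceeds the target nudges the coefficient up (penalizing divergence more), and a batch below target nudges it down.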