Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Teaching Models to Improve on Tape
Authors: Liat Bezalel, Eyal Orgad, Amir Globerson
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce an RL framework for teaching models to use such rewards, by simulating interaction sessions, and rewarding the model according to its ability to satisfy the constraints. We refer to our method as CORGI (Controlled Generation with RL for Guided Interaction), and evaluate it on a variety of controlled generation tasks. We find that CORGI consistently outperforms the baseline reinforcement learning method that does not incorporate conversational feedback. |
| Researcher Affiliation | Collaboration | 1Tel Aviv University 2Google Research |
| Pseudocode | No | The paper describes the CORGI setup, reward definition, and PPO objective function mathematically and textually, but it does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | Code https://github.com/Liat-Bezalel/corgi |
| Open Datasets | Yes | We further evaluated our approach on a broader set of tasks, including Style Transfer (Pryzant et al. 2020), Common Gen-lite (Lin et al. 2020), Program Synthesis (Numeric category) (BIG-bench authors 2023), MBPP (sanitized subset) (Austin et al. 2021), and Common Gen-Hard (Madaan et al. 2023). |
| Dataset Splits | Yes | For each task, we created a prompt outlining the constraints and including two few-shot examples. Additionally, we generated 7,500 training prompts and 500 validation prompts per task. For the Rationale Generation, Controlled Paraphrase Generation, Common-Gen lite, Program Synthesis (numeric category), MBPP (sanitized subset), and Common-Gen Hard tasks, we used the predefined train/dev/test splits, sampling the training prompts and validation prompts from their corresponding splits. For tasks without predefined train/dev/test splits, we additionally produced 1,000 test prompts. |
| Hardware Specification | Yes | Training was conducted on a single NVIDIA A100-SXM4-80GB GPU, taking 4 days for the multi-task setting and 12-24 hours for the single-task setting. |
| Software Dependencies | No | The CORGI framework was implemented using the TRL (von Werra et al. 2020) library, which we also utilized for RL-No FB by setting the number of attempts to one. The Adam optimizer was employed with a learning rate of 10^-5 and Adaptive KL control (Ziegler et al. 2019). |
| Experiment Setup | Yes | The Adam optimizer was employed with a learning rate of 10^-5 and Adaptive KL control (Ziegler et al. 2019), with initial coefficients set to 0.05 for Llama-2 and 0.075 for Llama-3. Due to training instability, the KL coefficient was adjusted to 0.3 for the rationale generation task in Llama-3. For CORGI training, the number of attempts was limited to four per prompt due to computational constraints. |
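The Adaptive KL control cited in the experiment setup (Ziegler et al. 2019) adjusts the KL-penalty coefficient during PPO training so the policy's divergence from the reference model stays near a target. Below is a minimal sketch of that controller's proportional-update rule, assuming the formulation from Ziegler et al. as popularized by the TRL library; `target` and `horizon` are illustrative defaults, not values reported in the paper.

```python
class AdaptiveKLController:
    """Adaptive KL coefficient (Ziegler et al. 2019).

    init_kl_coef=0.05 mirrors the Llama-2 setting quoted above
    (0.075 for Llama-3; 0.3 for the Llama-3 rationale-generation
    task). `target` and `horizon` are illustrative, not from the paper.
    """

    def __init__(self, init_kl_coef=0.05, target=6.0, horizon=10000):
        self.value = init_kl_coef
        self.target = target
        self.horizon = horizon

    def update(self, current_kl, n_steps):
        # Proportional error between observed and target KL,
        # clipped to [-0.2, 0.2] to keep coefficient updates stable.
        error = max(-0.2, min(0.2, current_kl / self.target - 1.0))
        # Multiplicative update: raise the penalty when KL overshoots
        # the target, lower it when KL undershoots.
        self.value *= 1.0 + 0.1 * error * n_steps / self.horizon
```

With this rule, a batch whose measured KL exceeds the target nudges the coefficient up (penalizing divergence more), and a batch below target nudges it down.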