RLVF: Learning from Verbal Feedback without Overgeneralization
Authors: Moritz Pascal Stephan, Alexander Khazatsky, Eric Mitchell, Annie S Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results indicate that our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors for other contexts more than current methods. For both human- and GPT-4-generated high-level feedback, C3PO effectively adheres to the given feedback comparably to in-context baselines while reducing overgeneralization by 30%. Our experiments are intended to answer several research questions about learning from verbal feedback. |
| Researcher Affiliation | Academia | Moritz Stephan, Alexander Khazatsky, Eric Mitchell, Annie S Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn (Department of Computer Science, Stanford University, CA, USA). |
| Pseudocode | No | The paper describes the data generation scheme and fine-tuning objective in text and figures (Figure 3 and Figure 4) but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to a project website (https://austrian-code-wizard.github.io/c3po-website/) but does not include an explicit statement in the paper text about releasing the source code or a direct link to a code repository. |
| Open Datasets | Yes | We sample the prompts for Dout-of-scope from the Open Instruction Generalist (OIG) Dataset (LAION, 2023) which contains a mix of diverse prompts ranging from math to QA and chat. |
| Dataset Splits | Yes | Within each prompt sub-dataset, we randomly select 80% of the prompts to be used for training and validation and the remainder are used for testing. (See the split sketch below the table.) |
| Hardware Specification | No | The paper mentions compute credits from “Open AI Researcher Access Program and Modal.com” but does not specify any particular hardware components like GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using Mistral-7B-Instruct-v0.2 as the base model and Low-Rank Adaptation (LoRA) for training, along with AdamW as the optimizer. However, it does not provide specific version numbers for general software dependencies such as Python, PyTorch/TensorFlow, or other libraries used in the implementation. |
| Experiment Setup | Yes | For all experiments, we use Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) and train with Low-Rank Adaptation (Hu et al., 2021) with a rank of 64 and alpha of 128. We use a learning rate of 5e-5 with a cosine decay schedule, a warm-up ratio of 0.05, AdamW as the optimizer, and train for 1 epoch. (See the configuration sketch below the table.) |
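
The 80/20 split reported in the Dataset Splits row can be reproduced with a few lines of standard Python. This is a minimal sketch, not the authors' code; the per-sub-dataset prompt lists, the fixed seed, and the function name are illustrative assumptions.

```python
import random

def split_prompts(prompts, train_frac=0.8, seed=0):
    """Randomly assign 80% of a sub-dataset's prompts to train/val and the rest to test."""
    rng = random.Random(seed)  # fixed seed is an assumption for reproducibility
    shuffled = list(prompts)
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * train_frac)
    return shuffled[:cutoff], shuffled[cutoff:]

# Example with placeholder prompts:
train_val, test = split_prompts([f"prompt {i}" for i in range(100)])
print(len(train_val), len(test))  # 80 20
```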
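
The hyperparameters quoted in the Experiment Setup row map directly onto a LoRA fine-tuning configuration. The sketch below assumes the Hugging Face `transformers` and `peft` libraries; the target modules, batch size, and output directory are illustrative assumptions, and the sketch covers only the reported hyperparameters, not the paper's C3PO fine-tuning objective.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Base model named in the paper.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# LoRA with rank 64 and alpha 128, as stated in the paper.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Learning rate 5e-5, cosine decay, warm-up ratio 0.05, AdamW, 1 epoch.
training_args = TrainingArguments(
    output_dir="c3po-lora",          # assumed output path
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    optim="adamw_torch",
    num_train_epochs=1,
    per_device_train_batch_size=4,   # assumed; not reported in the quoted text
)
```

With rank 64 and alpha 128, the LoRA scaling factor (alpha / rank) is 2, a common choice that amplifies the adapter update while leaving the base weights untouched.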