RLVF: Learning from Verbal Feedback without Overgeneralization
Authors: Moritz Pascal Stephan, Alexander Khazatsky, Eric Mitchell, Annie S Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results indicate that our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors for other contexts more than current methods. For both human- and GPT-4-generated high-level feedback, C3PO effectively adheres to the given feedback comparably to in-context baselines while reducing overgeneralization by 30%. Our experiments are intended to answer several research questions about learning from verbal feedback. |
| Researcher Affiliation | Academia | Moritz Stephan, Alexander Khazatsky, Eric Mitchell, Annie S Chen, Sheryl Hsu, Archit Sharma, Chelsea Finn (Department of Computer Science, Stanford University, CA, USA). |
| Pseudocode | No | The paper describes the data generation scheme and fine-tuning objective in text and figures (Figure 3 and Figure 4) but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link to a project website (https://austrian-code-wizard.github.io/c3po-website/) but does not include an explicit statement in the paper text about releasing the source code or a direct link to a code repository. |
| Open Datasets | Yes | We sample the prompts for Dout-of-scope from the Open Instruction Generalist (OIG) Dataset (LAION, 2023) which contains a mix of diverse prompts ranging from math to QA and chat. |
| Dataset Splits | Yes | Within each prompt sub-dataset, we randomly select 80% of the prompts to be used for training and validation and the remainder are used for testing. (See the split sketch below the table.) |
| Hardware Specification | No | The paper mentions compute credits from “Open AI Researcher Access Program and Modal.com” but does not specify any particular hardware components like GPU models, CPU types, or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using Mistral-7B-Instruct-v0.2 as the base model and Low-Rank Adaptation (LoRA) for training, along with AdamW as the optimizer. However, it does not provide specific version numbers for general software dependencies such as Python, PyTorch/TensorFlow, or other libraries used in the implementation. |
| Experiment Setup | Yes | For all experiments, we use Mistral-7B-Instruct-v0.2 (Jiang et al., 2023) and train with Low-Rank Adaptation (Hu et al., 2021) with a rank of 64 and alpha of 128. We use a learning rate of 5e-5 with a cosine decay schedule, a warm-up ratio of 0.05, AdamW as the optimizer, and train for 1 epoch. (See the configuration sketch below the table.) |
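
The 80/20 split reported in the Dataset Splits row can be reproduced with a few lines of standard Python. This is a minimal sketch, not the authors' code; the per-sub-dataset prompt lists, the fixed seed, and the function name are illustrative assumptions.

```python
import random

def split_prompts(prompts, train_frac=0.8, seed=0):
    """Randomly assign 80% of a sub-dataset's prompts to train/val and the rest to test."""
    rng = random.Random(seed)  # fixed seed is an assumption for reproducibility
    shuffled = list(prompts)
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * train_frac)
    return shuffled[:cutoff], shuffled[cutoff:]

# Example with placeholder prompts:
train_val, test = split_prompts([f"prompt {i}" for i in range(100)])
print(len(train_val), len(test))  # 80 20
```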
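
The hyperparameters quoted in the Experiment Setup row map directly onto a LoRA fine-tuning configuration. The sketch below assumes the Hugging Face `transformers` and `peft` libraries; the target modules, batch size, and output directory are illustrative assumptions, and the sketch covers only the reported hyperparameters, not the paper's C3PO fine-tuning objective.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Base model named in the paper.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# LoRA with rank 64 and alpha 128, as stated in the paper.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Learning rate 5e-5, cosine decay, warm-up ratio 0.05, AdamW, 1 epoch.
training_args = TrainingArguments(
    output_dir="c3po-lora",          # assumed output path
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    optim="adamw_torch",
    num_train_epochs=1,
    per_device_train_batch_size=4,   # assumed; not reported in the quoted text
)
```

With rank 64 and alpha 128, the LoRA scaling factor (alpha / rank) is 2, a common choice that amplifies the adapter update while leaving the base weights untouched.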