Coactive Learning for Large Language Models using Implicit User Feedback

Authors: Aaron David Tucker, Kianté Brantley, Adam Cahall, Thorsten Joachims

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results indicate that CoRLL is effective even for weak and noisy coactive preference feedback, making it a promising algorithm for training and personalization of LLMs from feedback that is naturally collected in many use cases. We conducted experiments on various RLHF benchmarks to compare CoRLL against conventional RLHF techniques.
Researcher Affiliation | Academia | Aaron D. Tucker, Kianté Brantley, Adam Cahall, Thorsten Joachims (Department of Computer Science, Cornell University, Ithaca, NY). Correspondence to: Aaron Tucker <aarondtucker@cs.cornell.edu>.
Pseudocode | Yes | Algorithm 1: Generic Coactive Learning Algorithm. Algorithm 2: CoRLL Algorithm for Coactive RLHF. (A hedged sketch of the generic coactive update appears after this table.)
Open Source Code | Yes | We thus simulate coactive feedback, and our simulator is available at https://github.com/atucker/coactive_learning.
Open Datasets | Yes | We evaluate minimally informative and edit-based coactive feedback using a 7B parameter model for the Reddit TL;DR Summarization task (Völske et al. (2017), full details in A.1.1) and using a 13B parameter model for the Helpfulness split of the Anthropic Helpful and Harmless task (Bai et al. (2022a), full details in A.1.2). The first task is the Reddit TL;DR summarization task (Völske et al., 2017)... retrieved from Huggingface as openai/summarize_from_feedback's comparisons dataset. The second task is the Helpful and Harmless Assistant (Bai et al., 2022a)... We retrieved the dataset from Huggingface as anthropic/hh-rlhf... (A dataset-loading sketch appears after this table.)
Dataset Splits | No | The paper refers to using 'training data' and a 'held-out test set' but does not provide specific proportions or counts for training, validation, and test splits (e.g., an '80/10/10 split' or per-split sample counts). While standard datasets are used, the paper does not explicitly detail the splits used for reproduction.
Hardware Specification | No | The paper mentions fitting policies on 'a single GPU' but does not specify a particular GPU model (e.g., NVIDIA A100, RTX 3090) or other hardware details such as CPU, memory, or cloud instance types.
Software Dependencies | No | The paper mentions using Adam for optimization, LoRA adapters, DPO, and retrieving models from Huggingface, but it does not provide version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or tools used in the experiments.
Experiment Setup | Yes | In the 7B+ experiments, wherever DPO is used we follow Rafailov et al. (2023) and set the learning rate to 5e-7, use Adam for optimization (Kingma & Ba, 2017), and warm up the learning rate from 0 to its full value over the first 10% of the data. All learned policies are LoRA adapters (Hu et al., 2022) with r = 8, α = 64, and dropout 0.1... Expert and policy training used a learning rate of 1e-5 and a batch size of 32. If not mentioned otherwise, we approximate the argmax with k = 9 samples, draw l = 100 samples to generate coactive feedback with α = 0.6... (A configuration sketch appears after this table.)
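
To give a flavor of the Algorithm 1 referenced in the Pseudocode row, the sketch below shows a generic coactive learning loop with a linear utility model and a perceptron-style update: present the current best response, observe the user's (possibly only slightly) improved response, and move the weights toward the improvement. This is an illustrative reconstruction, not the paper's pseudocode; the function names, candidate generator, and feature map are all assumptions.

```python
import numpy as np

def coactive_learning(contexts, feature_map, user_improve, candidates, d, lr=1.0):
    """Perceptron-style generic coactive learning with a linear utility model.

    feature_map(x, y) -> np.ndarray of shape (d,)   joint feature vector (assumed)
    user_improve(x, y) -> y_bar                     user's improved response (simulated feedback)
    candidates(x)      -> iterable of candidate responses for context x
    """
    w = np.zeros(d)  # linear utility weights
    for x in contexts:
        # Present the utility-maximizing response under the current model.
        y = max(candidates(x), key=lambda c: w @ feature_map(x, c))
        # Observe coactive feedback: a response the user prefers to y.
        y_bar = user_improve(x, y)
        # Move the weights toward the improvement (perceptron-style step).
        w += lr * (feature_map(x, y_bar) - feature_map(x, y))
    return w
```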
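
The datasets quoted in the Open Datasets row are public on Huggingface. A minimal loading sketch, assuming the `datasets` library; the choice of the helpful-base data directory for the Helpfulness split is an assumption, and the paper's exact preprocessing is not specified in the quote.

```python
from datasets import load_dataset

# Reddit TL;DR summarization comparisons (Völske et al., 2017; OpenAI comparisons data).
tldr = load_dataset("openai/summarize_from_feedback", "comparisons")

# Anthropic Helpful and Harmless (Bai et al., 2022a); the paper uses the Helpfulness
# split. On the Hub the dataset is hosted under the Anthropic organization; the
# helpful-base data directory is one plausible choice (assumption).
hh = load_dataset("Anthropic/hh-rlhf", data_dir="helpful-base")

print(tldr)  # inspect the available splits
print(hh)
```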
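
The Experiment Setup row fixes the LoRA and optimizer hyperparameters but not the software stack. The sketch below expresses them one plausible way, assuming Hugging Face peft and transformers; the base-model identifier and number of training steps are placeholders, and the constant-after-warmup schedule is an assumption consistent with "warm up the learning rate from 0 to its full value over the first 10% of the data".

```python
import torch
from transformers import AutoModelForCausalLM, get_constant_schedule_with_warmup
from peft import LoraConfig, get_peft_model

# LoRA adapter hyperparameters quoted in the Experiment Setup row.
lora_config = LoraConfig(r=8, lora_alpha=64, lora_dropout=0.1, task_type="CAUSAL_LM")

# Placeholder base model id; the paper reports 7B and 13B parameter models.
base_model = AutoModelForCausalLM.from_pretrained("your-7b-base-model")
policy = get_peft_model(base_model, lora_config)

# DPO optimization settings quoted in the row: Adam, lr = 5e-7, warmup over the
# first 10% of the data. num_training_steps is a placeholder that depends on the
# dataset size and the batch size of 32.
num_training_steps = 10_000
optimizer = torch.optim.Adam(policy.parameters(), lr=5e-7)
scheduler = get_constant_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * num_training_steps)
)
```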