Learning Rewards From Linguistic Feedback

Authors: Theodore R. Sumers, Mark K. Ho, Robert D. Hawkins, Karthik Narasimhan, Thomas L. Griffiths

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate our approach, we first collect a corpus of teaching behavior in a cooperative task where both teacher and learner are human. We implement three artificial learners: sentiment-based literal and pragmatic models, and an inference network trained end-to-end to predict rewards. We then re-run our initial experiment, pairing human teachers with these artificial learners. All three models successfully learn from interactive human feedback.
Researcher Affiliation | Academia | 1 Department of Computer Science, Princeton University, Princeton, NJ; 2 Department of Psychology, Princeton University, Princeton, NJ. {sumers, mho, rdhawkins, karthikn, tomg}@princeton.edu
Pseudocode | No | The paper describes methods and processes but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data: github.com/tsumers/rewards.
Open Datasets | Yes | Code and data: github.com/tsumers/rewards.
Dataset Splits | Yes | We used ten-fold CV with 8-1-1 train-validate-test splits, splitting both teachers and reward functions.
Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU or CPU models, or cloud instances) used for running the experiments.
Software Dependencies | No | The paper mentions tools such as VADER and logistic regression, but does not specify version numbers for these or for other software components such as programming languages or libraries.
Experiment Setup | Yes | VADER provides an output ζ ∈ [−1, 1], which we scaled by 30 (set via grid search). We initialized our belief state as μ₀ = 0, Σ₀ = diag(25). We use σ²_ζ = 1/2 for all updates, which we set via grid search. We used stochastic gradient descent with a learning rate of 0.005 and weight decay of 0.0001, stopping when validation set error increased.
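
The hyperparameters quoted in the Experiment Setup row belong to the sentiment-based learners, which maintain a Gaussian belief over linear reward weights. The following is a minimal sketch of that style of update, assuming a generic Bayesian linear-regression model: the dimensionality D, the feature vector phi, the update helper, and the example utterance are illustrative placeholders rather than the authors' released code; only the prior (μ₀ = 0, Σ₀ = diag(25)), the noise variance σ²_ζ = 1/2, the ×30 scaling, and the use of VADER's compound score are taken from the paper.

```python
# Sketch: Bayesian linear-regression update of a Gaussian belief over reward
# weights, driven by VADER sentiment. Hyperparameters follow the paper; the
# feature encoding and helper names are hypothetical.
import numpy as np
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

D = 8                                  # number of reward features (illustrative)
mu = np.zeros(D)                       # prior mean, mu_0 = 0
Sigma = np.eye(D) * 25.0               # prior covariance, Sigma_0 = diag(25)
SIGMA_ZETA_SQ = 0.5                    # observation noise variance, sigma_zeta^2
SCALE = 30.0                           # sentiment-to-reward scaling

analyzer = SentimentIntensityAnalyzer()

def update(mu, Sigma, phi, utterance):
    """Condition the belief on one utterance about a context with features phi."""
    zeta = analyzer.polarity_scores(utterance)["compound"]   # zeta in [-1, 1]
    y = SCALE * zeta                                         # pseudo-observation of reward
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_new = np.linalg.inv(Sigma_inv + np.outer(phi, phi) / SIGMA_ZETA_SQ)
    mu_new = Sigma_new @ (Sigma_inv @ mu + phi * y / SIGMA_ZETA_SQ)
    return mu_new, Sigma_new

phi = np.zeros(D); phi[2] = 1.0        # e.g., indicator for the object just collected
mu, Sigma = update(mu, Sigma, phi, "Nice job, that was exactly right!")
```

The stochastic-gradient settings in the same row (learning rate 0.005, weight decay 0.0001, early stopping when validation error increases) apply to the end-to-end inference network, which is a separate model from this closed-form update.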
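
The Dataset Splits row reports ten-fold cross-validation with 8-1-1 train-validate-test splits over both teachers and reward functions. The sketch below shows one plausible way to build such grouped folds; the record field names (teacher_id, reward_fn_id) and the rule for records whose teacher and reward function land in different folds are assumptions, not details taken from the paper or its repository.

```python
# Sketch: ten-fold 8-1-1 splits that hold out whole teachers and whole reward
# functions. Field names and the fold-assignment rule are hypothetical.
import random

def ten_fold_splits(records, seed=0):
    rng = random.Random(seed)
    teachers = sorted({r["teacher_id"] for r in records})
    rewards = sorted({r["reward_fn_id"] for r in records})
    rng.shuffle(teachers)
    rng.shuffle(rewards)

    def chunk(items, k=10):
        # Deal items round-robin into k folds.
        return [items[i::k] for i in range(k)]

    teacher_fold = {t: i for i, fold in enumerate(chunk(teachers)) for t in fold}
    reward_fold = {w: i for i, fold in enumerate(chunk(rewards)) for w in fold}

    for i in range(10):
        val_id, test_id = i, (i + 1) % 10
        train, val, test = [], [], []
        for r in records:
            folds = {teacher_fold[r["teacher_id"]], reward_fold[r["reward_fn_id"]]}
            if test_id in folds:        # held-out teacher OR reward function -> test
                test.append(r)
            elif val_id in folds:
                val.append(r)
            else:
                train.append(r)
        yield train, val, test
```

Each iteration yields eight folds' worth of training data plus one validation fold and one test fold, matching the 8-1-1 description.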