Learning Rewards From Linguistic Feedback
Authors: Theodore R. Sumers, Mark K. Ho, Robert D. Hawkins, Karthik Narasimhan, Thomas L. Griffiths
AAAI 2021, pp. 6002-6010 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate our approach, we first collect a corpus of teaching behavior in a cooperative task where both teacher and learner are human. We implement three artificial learners: sentiment-based literal and pragmatic models, and an inference network trained end-to-end to predict rewards. We then re-run our initial experiment, pairing human teachers with these artificial learners. All three models successfully learn from interactive human feedback. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Princeton University, Princeton, NJ 2Department of Psychology, Princeton University, Princeton, NJ {sumers, mho, rdhawkins, karthikn, tomg}@princeton.edu |
| Pseudocode | No | The paper describes methods and processes but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data: github.com/tsumers/rewards. |
| Open Datasets | Yes | Code and data: github.com/tsumers/rewards. |
| Dataset Splits | Yes | We used ten-fold CV with 8-1-1 train-validate-test splits, splitting both teachers and reward functions. (A split sketch follows this table.) |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU, CPU models, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions tools like VADER and logistic regression, but does not specify version numbers for these or other software components like programming languages or libraries. |
| Experiment Setup | Yes | VADER provides an output ζ ∈ [−1, 1], which we scaled by 30 (set via grid search). We initialized our belief state as µ_0 = 0, Σ_0 = diag(25). We use σ_ζ² = 1/2 for all updates, which we set via grid search. We used stochastic gradient descent with a learning rate of 0.005 and weight decay of 0.0001, stopping when validation set error increased. |
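The Dataset Splits row reports ten-fold cross-validation with 8-1-1 train-validate-test splits over both teachers and reward functions. Below is a minimal sketch of one way to build such a rotation over group ids; the function name, seeding, and fold construction are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a ten-fold 8-1-1 train/validate/test rotation over group ids
# (e.g. teacher ids or reward-function ids). Names and fold construction are
# illustrative assumptions, not taken from the paper's repository.
import random

def ten_fold_811_splits(group_ids, seed=0):
    """Yield (train, validate, test) id sets for ten rotations of an 8-1-1 split."""
    ids = sorted(set(group_ids))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::10] for i in range(10)]      # ten roughly equal folds
    for k in range(10):
        test = set(folds[k])                     # 1 fold held out for testing
        validate = set(folds[(k + 1) % 10])      # 1 fold for validation
        train = set(ids) - test - validate       # remaining 8 folds for training
        yield train, validate, test
```

Since the paper splits both teachers and reward functions, one would presumably apply such a routine to each grouping independently and keep only the examples whose teacher and reward function fall in the matching partition.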
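The Experiment Setup row quotes the hyperparameters of the sentiment-based learner: VADER's compound score ζ ∈ [−1, 1] scaled by 30, a Gaussian prior µ_0 = 0, Σ_0 = diag(25), and observation noise σ_ζ² = 1/2. The sketch below shows a conjugate linear-Gaussian belief update using those values; the feature handling, `vaderSentiment` usage, and function names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a literal sentiment-based reward update with the quoted
# hyperparameters (sentiment scale 30, mu_0 = 0, Sigma_0 = diag(25),
# sigma_zeta^2 = 1/2). Feature construction and names are illustrative.
import numpy as np
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

SENTIMENT_SCALE = 30.0   # grid-searched scaling of VADER's compound score
OBS_VARIANCE = 0.5       # sigma_zeta^2, also set via grid search

analyzer = SentimentIntensityAnalyzer()

def init_belief(n_features):
    """Gaussian prior over reward weights: mu_0 = 0, Sigma_0 = diag(25)."""
    return np.zeros(n_features), 25.0 * np.eye(n_features)

def update_belief(mu, sigma, features, utterance):
    """Treat scaled sentiment as a noisy linear observation of reward and
    apply the standard conjugate Gaussian posterior update."""
    zeta = analyzer.polarity_scores(utterance)["compound"]   # in [-1, 1]
    y = SENTIMENT_SCALE * zeta
    x = np.asarray(features, dtype=float)
    prior_precision = np.linalg.inv(sigma)
    posterior_precision = prior_precision + np.outer(x, x) / OBS_VARIANCE
    sigma_new = np.linalg.inv(posterior_precision)
    mu_new = sigma_new @ (prior_precision @ mu + x * y / OBS_VARIANCE)
    return mu_new, sigma_new
```

The SGD settings quoted in the same row (learning rate 0.005, weight decay 0.0001, early stopping on validation error) apply to the end-to-end inference network rather than to this sentiment-based update.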