Learning Rewards From Linguistic Feedback
Authors: Theodore R. Sumers, Mark K. Ho, Robert D. Hawkins, Karthik Narasimhan, Thomas L. Griffiths
AAAI 2021, pp. 6002-6010 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate our approach, we first collect a corpus of teaching behavior in a cooperative task where both teacher and learner are human. We implement three artificial learners: sentiment-based literal and pragmatic models, and an inference network trained end-to-end to predict rewards. We then re-run our initial experiment, pairing human teachers with these artificial learners. All three models successfully learn from interactive human feedback. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Princeton University, Princeton, NJ 2Department of Psychology, Princeton University, Princeton, NJ {sumers, mho, rdhawkins, karthikn, tomg}@princeton.edu |
| Pseudocode | No | The paper describes methods and processes but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data: github.com/tsumers/rewards. |
| Open Datasets | Yes | Code and data: github.com/tsumers/rewards. |
| Dataset Splits | Yes | We used ten-fold CV with 8-1-1 train-validate-test splits, splitting both teachers and reward functions. (A split sketch follows this table.) |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., GPU, CPU models, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions tools like VADER and logistic regression, but does not specify version numbers for these or other software components like programming languages or libraries. |
| Experiment Setup | Yes | VADER provides an output ζ ∈ [−1, 1], which we scaled by 30 (set via grid search). We initialized our belief state as µ_0 = 0, Σ_0 = diag(25). We use σ_ζ² = 1/2 for all updates, which we set via grid search. We used stochastic gradient descent with a learning rate of 0.005 and weight decay of 0.0001, stopping when validation set error increased. |
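The Dataset Splits row reports ten-fold cross-validation with 8-1-1 train-validate-test splits over both teachers and reward functions. Below is a minimal sketch of one way to build such a rotation over group ids; the function name, seeding, and fold construction are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a ten-fold 8-1-1 train/validate/test rotation over group ids
# (e.g. teacher ids or reward-function ids). Names and fold construction are
# illustrative assumptions, not taken from the paper's repository.
import random

def ten_fold_811_splits(group_ids, seed=0):
    """Yield (train, validate, test) id sets for ten rotations of an 8-1-1 split."""
    ids = sorted(set(group_ids))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::10] for i in range(10)]      # ten roughly equal folds
    for k in range(10):
        test = set(folds[k])                     # 1 fold held out for testing
        validate = set(folds[(k + 1) % 10])      # 1 fold for validation
        train = set(ids) - test - validate       # remaining 8 folds for training
        yield train, validate, test
```

Since the paper splits both teachers and reward functions, one would presumably apply such a routine to each grouping independently and keep only the examples whose teacher and reward function fall in the matching partition.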
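The Experiment Setup row quotes the hyperparameters of the sentiment-based learner: VADER's compound score ζ ∈ [−1, 1] scaled by 30, a Gaussian prior µ_0 = 0, Σ_0 = diag(25), and observation noise σ_ζ² = 1/2. The sketch below shows a conjugate linear-Gaussian belief update using those values; the feature handling, `vaderSentiment` usage, and function names are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a literal sentiment-based reward update with the quoted
# hyperparameters (sentiment scale 30, mu_0 = 0, Sigma_0 = diag(25),
# sigma_zeta^2 = 1/2). Feature construction and names are illustrative.
import numpy as np
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

SENTIMENT_SCALE = 30.0   # grid-searched scaling of VADER's compound score
OBS_VARIANCE = 0.5       # sigma_zeta^2, also set via grid search

analyzer = SentimentIntensityAnalyzer()

def init_belief(n_features):
    """Gaussian prior over reward weights: mu_0 = 0, Sigma_0 = diag(25)."""
    return np.zeros(n_features), 25.0 * np.eye(n_features)

def update_belief(mu, sigma, features, utterance):
    """Treat scaled sentiment as a noisy linear observation of reward and
    apply the standard conjugate Gaussian posterior update."""
    zeta = analyzer.polarity_scores(utterance)["compound"]   # in [-1, 1]
    y = SENTIMENT_SCALE * zeta
    x = np.asarray(features, dtype=float)
    prior_precision = np.linalg.inv(sigma)
    posterior_precision = prior_precision + np.outer(x, x) / OBS_VARIANCE
    sigma_new = np.linalg.inv(posterior_precision)
    mu_new = sigma_new @ (prior_precision @ mu + x * y / OBS_VARIANCE)
    return mu_new, sigma_new
```

The SGD settings quoted in the same row (learning rate 0.005, weight decay 0.0001, early stopping on validation error) apply to the end-to-end inference network rather than to this sentiment-based update.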