Learning Human Objectives by Evaluating Hypothetical Behavior

Authors: Siddharth Reddy, Anca Dragan, Sergey Levine, Shane Legg, Jan Leike

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based Car Racing video game. The results show that ReQueST significantly outperforms prior methods in learning reward models that transfer to new environments with different initial state distributions.
Researcher Affiliation | Collaboration | 1 University of California, Berkeley; 2 DeepMind. Correspondence to: Siddharth Reddy <sgr@berkeley.edu>, Jan Leike <leike@google.com>.
Pseudocode | Yes | Algorithm 1: Reward Query Synthesis via Trajectory Optimization (ReQueST). (A hedged sketch of such a query-synthesis loop follows the table.)
Open Source Code | No | The paper does not contain an explicit statement about the release of its own source code, nor does it provide a direct link to a code repository for the described methodology.
Open Datasets | Yes | MNIST classification... MNIST (LeCun, 1998)... image-based Car Racing from the OpenAI Gym (Brockman et al., 2016). (An environment-setup snippet also follows the table.)
Dataset Splits | No | The paper describes training and test environments with different initial state distributions for MNIST, but it does not specify explicit numerical splits for a validation set (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not mention any specific hardware used for running the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions "Adam (Kingma & Ba, 2014)" as an optimizer and "OpenAI Gym (Brockman et al., 2016)" as a platform, but it does not provide specific version numbers for any software libraries or dependencies used for the implementation.
Experiment Setup | No | While the paper describes the experimental domains and evaluation metrics, it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed system-level training configurations in the main text.
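
The Pseudocode row above names Algorithm 1, Reward Query Synthesis via Trajectory Optimization (ReQueST), but this page does not reproduce the algorithm itself. The following is therefore only a minimal sketch of what such a query-synthesis loop could look like: a toy point-mass model stands in for the paper's learned dynamics model, random-shooting search stands in for its gradient-based trajectory optimizer, and rollout, ensemble_reward, synthesize_query, query_user, and fit_ensemble are hypothetical helpers written for this illustration, not the authors' code.

import numpy as np

rng = np.random.default_rng(0)
HORIZON, N_CANDIDATES, STATE_DIM, ACT_DIM, ENSEMBLE_SIZE = 20, 256, 2, 2, 5

def rollout(state, actions):
    """Roll out a toy point-mass model (stand-in for a learned dynamics model)."""
    traj = [state]
    for a in actions:
        state = state + 0.1 * a  # simple additive dynamics
        traj.append(state)
    return np.stack(traj)  # shape (HORIZON + 1, STATE_DIM)

def ensemble_reward(traj, weights):
    """Per-member trajectory return under a linear reward ensemble r_k(s) = w_k . s."""
    return np.array([traj @ w for w in weights]).mean(axis=1)  # shape (ENSEMBLE_SIZE,)

def synthesize_query(weights, start):
    """Random-shooting search for the trajectory that maximizes ensemble disagreement."""
    best_traj, best_score = None, -np.inf
    for _ in range(N_CANDIDATES):
        actions = rng.normal(size=(HORIZON, ACT_DIM))
        traj = rollout(start, actions)
        score = ensemble_reward(traj, weights).std()  # disagreement among members
        if score > best_score:
            best_traj, best_score = traj, score
    return best_traj

def query_user(traj):
    """Simulated user whose true objective is reaching the goal at (1, 1)."""
    return -np.linalg.norm(traj[-1] - np.array([1.0, 1.0]))

def fit_ensemble(dataset):
    """Refit each ensemble member by least squares on bootstrap resamples of the labels."""
    feats = np.stack([traj.mean(axis=0) for traj, _ in dataset])
    labels = np.array([label for _, label in dataset])
    new_weights = []
    for _ in range(ENSEMBLE_SIZE):
        idx = rng.integers(len(dataset), size=len(dataset))  # bootstrap resample
        w, *_ = np.linalg.lstsq(feats[idx], labels[idx], rcond=None)
        new_weights.append(w)
    return new_weights

weights = [rng.normal(size=STATE_DIM) for _ in range(ENSEMBLE_SIZE)]
dataset = []
for _ in range(10):  # alternate query synthesis, user evaluation, and reward fitting
    traj = synthesize_query(weights, start=np.zeros(STATE_DIM))
    dataset.append((traj, query_user(traj)))
    weights = fit_ensemble(dataset)
print("mean learned reward weights:", np.mean(weights, axis=0))

In the paper itself the dynamics and reward models are neural networks and the trajectory optimizer is gradient-based, so this sketch should be read only as an outline of the query loop's structure, not as the published method.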
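
The Open Datasets row points to the image-based Car Racing environment from the OpenAI Gym. As a small illustration of that dependency, the snippet below instantiates the environment; it assumes the Box2D "CarRacing-v0" id and the classic Gym reset/step API that was current around the paper's publication, and is not taken from the paper.

import gym

env = gym.make("CarRacing-v0")  # image-based Car Racing environment
obs = env.reset()  # RGB frame under the classic Gym API
obs, reward, done, info = env.step(env.action_space.sample())
env.close()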