Learning Human Objectives by Evaluating Hypothetical Behavior
Authors: Siddharth Reddy, Anca Dragan, Sergey Levine, Shane Legg, Jan Leike
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based CarRacing video game. The results show that ReQueST significantly outperforms prior methods in learning reward models that transfer to new environments with different initial state distributions. |
| Researcher Affiliation | Collaboration | 1University of California, Berkeley; 2DeepMind. Correspondence to: Siddharth Reddy <sgr@berkeley.edu>, Jan Leike <leike@google.com>. |
| Pseudocode | Yes | Algorithm 1: Reward Query Synthesis via Trajectory Optimization (ReQueST). (An illustrative sketch of this query loop appears after this table.) |
| Open Source Code | No | The paper does not contain an explicit statement about the release of its own source code, nor does it provide a direct link to a code repository for the described methodology. |
| Open Datasets | Yes | MNIST classification... MNIST (LeCun, 1998)... image-based CarRacing from the OpenAI Gym (Brockman et al., 2016). (See the environment-setup sketch after this table.) |
| Dataset Splits | No | The paper describes training and test environments with different initial state distributions for MNIST, but it does not specify explicit numerical splits for a validation set (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not mention any specific hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions "Adam (Kingma & Ba, 2014)" as an optimizer and "OpenAI Gym (Brockman et al., 2016)" as a platform, but it does not provide specific version numbers for any software libraries or dependencies used for the implementation. |
| Experiment Setup | No | While the paper describes the experimental domains and evaluation metrics, it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed system-level training configurations in the main text. |
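
The Pseudocode row above names Algorithm 1 (ReQueST), which synthesizes hypothetical trajectories and asks a user to label them as reward queries. The following is a minimal, hypothetical sketch of that active-learning structure, not the authors' implementation: the paper optimizes trajectories under a learned generative model with several acquisition objectives, whereas this toy version samples random candidate states, scores them by reward-ensemble disagreement, and queries a simulated user. All names and dimensions are illustrative.

```python
# Toy sketch of a ReQueST-style reward-query loop (assumptions labeled above).
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ENSEMBLE_SIZE, N_CANDIDATES, N_QUERIES = 4, 5, 64, 20

def true_reward(s):
    # Stand-in for the human's reward judgment (the "simulated user").
    return float(s @ np.array([1.0, -0.5, 0.25, 0.0]))

def fit_ensemble(X, y):
    # Fit a small ensemble of linear reward models on bootstrap resamples.
    models = []
    for _ in range(ENSEMBLE_SIZE):
        idx = rng.integers(0, len(X), size=len(X))
        w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        models.append(w)
    return np.stack(models)

# Seed the reward dataset with a few labeled states.
X = rng.normal(size=(8, STATE_DIM))
y = np.array([true_reward(s) for s in X])

for _ in range(N_QUERIES):
    ensemble = fit_ensemble(X, y)
    # "Synthesize" candidate queries; ReQueST would instead optimize
    # trajectories under a learned dynamics/generative model.
    candidates = rng.normal(size=(N_CANDIDATES, STATE_DIM))
    preds = candidates @ ensemble.T              # (N_CANDIDATES, ENSEMBLE_SIZE)
    disagreement = preds.var(axis=1)             # epistemic-uncertainty proxy
    query = candidates[disagreement.argmax()]    # most informative candidate
    # Ask the (simulated) user to label the hypothetical behavior.
    X = np.vstack([X, query])
    y = np.append(y, true_reward(query))

final_weights = fit_ensemble(X, y).mean(axis=0)
print("learned reward weights:", np.round(final_weights, 2))
```

Maximizing ensemble disagreement is one of several query objectives; picking only uncertain candidates is the simplest choice that still shows why synthesized queries can be more informative than randomly encountered states.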
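
The Open Datasets row cites the image-based CarRacing task from the OpenAI Gym. A hedged setup sketch follows; the environment ID and the 4-tuple `step` signature match classic Gym releases (and require the Box2D extra), while newer gymnasium versions rename the ID and change the API.

```python
# Assumes classic OpenAI Gym with Box2D installed (e.g., pip install "gym[box2d]").
import gym

env = gym.make("CarRacing-v0")   # 96x96 RGB observations, continuous control
obs = env.reset()
for _ in range(10):
    action = env.action_space.sample()          # [steer, gas, brake]
    obs, reward, done, info = env.step(action)  # classic 4-tuple step API
    if done:
        obs = env.reset()
env.close()
```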