Learning the Reward Function for a Misspecified Model

Authors: Erik Talvitie

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, this approach to reward learning can yield dramatic improvements in control performance when the dynamics model is flawed. ... Section 5 demonstrates empirically that the approach suggested by the theoretical results can produce good planning performance with a flawed model.
Researcher Affiliation | Academia | Erik Talvitie, Department of Computer Science, Franklin & Marshall College, Lancaster, Pennsylvania, USA.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such.
Open Source Code | Yes | Source code for these experiments is available at http://github.com/etalvitie/hdaggermc.
Open Datasets | No | The paper describes generating training rollouts within the simulated environment ("500 training rollouts were generated") rather than using a pre-existing, publicly available dataset with concrete access information.
Dataset Splits | No | The paper describes an experimental setup in a simulated environment where data is generated for training and evaluation; it does not provide dataset split information (percentages, sample counts, or citations to predefined splits) for training, validation, and test sets.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used to run its experiments.
Software Dependencies | No | The paper mentions learning models using "Context Tree Switching (Veness et al., 2012)" and the "FAC-CTW algorithm (Veness et al., 2011)" but does not provide specific software names with version numbers for reproducibility.
Experiment Setup | Yes | The one-ply MC planner used 50 uniformly random rollouts of depth 20 per action at every step. The exploration distribution was generated by following the optimal policy with (1 − γ) probability of termination at each step. The discount factor was γ = 0.9. In each iteration 500 training rollouts were generated and the resulting policy was evaluated in an episode of length 30. The discounted return obtained by the policy in each iteration is reported, averaged over 50 trials. ... Here, in each experiment the best-performing step size for each approach is selected from 0.005, 0.01, 0.05, 0.1, and 0.5.
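
To make the Experiment Setup row concrete, the sketch below restates those reported hyperparameters as a one-ply Monte Carlo planning step. It is a minimal illustration only, assuming a hypothetical model interface `model.step(state, action) -> (next_state, reward)`; the function and parameter names are not taken from the paper or the released hdaggermc code.

```python
import random

# Hyperparameters quoted from the paper's reported experiment setup.
GAMMA = 0.9                 # discount factor
ROLLOUTS_PER_ACTION = 50    # uniformly random rollouts per candidate action
ROLLOUT_DEPTH = 20          # steps per rollout
STEP_SIZES = [0.005, 0.01, 0.05, 0.1, 0.5]  # step-size grid searched per approach

def one_ply_mc_action(model, state, num_actions):
    """Pick an action by one-ply Monte Carlo planning: fix each candidate
    first action, continue the rollout with uniformly random actions in the
    (possibly flawed) learned model, and return the action with the highest
    average discounted return."""
    best_action, best_value = None, float("-inf")
    for first_action in range(num_actions):
        total = 0.0
        for _ in range(ROLLOUTS_PER_ACTION):
            s, ret, discount = state, 0.0, 1.0
            action = first_action
            for _ in range(ROLLOUT_DEPTH):
                s, reward = model.step(s, action)   # assumed model interface
                ret += discount * reward
                discount *= GAMMA
                action = random.randrange(num_actions)  # uniform rollout policy
            total += ret
        value = total / ROLLOUTS_PER_ACTION
        if value > best_value:
            best_action, best_value = first_action, value
    return best_action
```

The exploration distribution described in the same row (following the optimal policy with (1 − γ) termination probability per step) concerns how training rollouts are generated and is independent of the planning loop above, so it is omitted from the sketch.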