Learning the Reward Function for a Misspecified Model

Authors: Erik Talvitie

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, this approach to reward learning can yield dramatic improvements in control performance when the dynamics model is flawed. ... Section 5 demonstrates empirically that the approach suggested by the theoretical results can produce good planning performance with a flawed model.
Researcher Affiliation | Academia | Erik Talvitie, Department of Computer Science, Franklin & Marshall College, Lancaster, Pennsylvania, USA.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such.
Open Source Code | Yes | Source code for these experiments is available at http://github.com/etalvitie/hdaggermc.
Open Datasets | No | The paper describes generating training rollouts within the simulated environment ("500 training rollouts were generated") rather than using a pre-existing, publicly available dataset with concrete access information.
Dataset Splits | No | The paper describes an experimental setup in a simulated environment where data is generated for training and evaluation; it does not provide dataset split information (percentages, sample counts, or citations to predefined splits) for training, validation, and test sets.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used to run its experiments.
Software Dependencies | No | The paper mentions learning models using "Context Tree Switching (Veness et al., 2012)" and the "FAC-CTW algorithm (Veness et al., 2011)" but does not provide specific software names with version numbers for reproducibility.
Experiment Setup | Yes | The one-ply MC planner used 50 uniformly random rollouts of depth 20 per action at every step. The exploration distribution was generated by following the optimal policy with (1 − γ) probability of termination at each step. The discount factor was γ = 0.9. In each iteration 500 training rollouts were generated and the resulting policy was evaluated in an episode of length 30. The discounted return obtained by the policy in each iteration is reported, averaged over 50 trials. ... Here, in each experiment the best-performing step size for each approach is selected from 0.005, 0.01, 0.05, 0.1, and 0.5.
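
To make the Experiment Setup row concrete, the sketch below restates those reported hyperparameters as a one-ply Monte Carlo planning step. It is a minimal illustration only, assuming a hypothetical model interface `model.step(state, action) -> (next_state, reward)`; the function and parameter names are not taken from the paper or the released hdaggermc code.

```python
import random

# Hyperparameters quoted from the paper's reported experiment setup.
GAMMA = 0.9                 # discount factor
ROLLOUTS_PER_ACTION = 50    # uniformly random rollouts per candidate action
ROLLOUT_DEPTH = 20          # steps per rollout
STEP_SIZES = [0.005, 0.01, 0.05, 0.1, 0.5]  # step-size grid searched per approach

def one_ply_mc_action(model, state, num_actions):
    """Pick an action by one-ply Monte Carlo planning: fix each candidate
    first action, continue the rollout with uniformly random actions in the
    (possibly flawed) learned model, and return the action with the highest
    average discounted return."""
    best_action, best_value = None, float("-inf")
    for first_action in range(num_actions):
        total = 0.0
        for _ in range(ROLLOUTS_PER_ACTION):
            s, ret, discount = state, 0.0, 1.0
            action = first_action
            for _ in range(ROLLOUT_DEPTH):
                s, reward = model.step(s, action)   # assumed model interface
                ret += discount * reward
                discount *= GAMMA
                action = random.randrange(num_actions)  # uniform rollout policy
            total += ret
        value = total / ROLLOUTS_PER_ACTION
        if value > best_value:
            best_action, best_value = first_action, value
    return best_action
```

The exploration distribution described in the same row (following the optimal policy with (1 − γ) termination probability per step) concerns how training rollouts are generated and is independent of the planning loop above, so it is omitted from the sketch.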