Self-Correcting Models for Model-Based Reinforcement Learning

Authors: Erik Talvitie

AAAI 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we illustrate the practical impact of optimizing hallucinated error by comparing DAgger, DAgger-MC, and H-DAgger-MC in the Shooter example described in Section 2.12. The experimental setup matches that of Talvitie (2015) for comparison's sake, though the qualitative comparison presented here is robust to the parameter settings. The results can be seen in Figures 3a and 3b.
Researcher Affiliation | Academia | Erik Talvitie, Department of Mathematics and Computer Science, Franklin & Marshall College, Lancaster, PA 17604-3003, erik.talvitie@fandm.edu
Pseudocode | Yes | Algorithm 1: Hallucinated DAgger-MC (the underlying idea is sketched after this table).
Open Source Code | Yes | Source code for these experiments may be found at github.com/etalvitie/hdaggermc.
Open Datasets | No | The paper describes the "Shooter domain", an experimental environment used in previous work by the author, but it does not provide concrete access information (link, DOI, or dataset citation with authors and year) for the specific data generated or used in the experiments.
Dataset Splits | No | The paper mentions generating "500 training rollouts" and evaluating policy performance by averaging over "50 trials" but does not specify explicit train/validation/test dataset splits or their sizes/percentages for reproducibility.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., CPU, GPU models, memory, cloud instances).
Software Dependencies | No | The paper mentions "Context Tree Switching (Veness et al. 2012)" and the "FAC-CTW algorithm (Veness et al. 2011)" as the methods used. While these cite specific algorithms and their associated papers, no software libraries or version numbers are listed, which limits reproducibility.
Experiment Setup | Yes | In all cases one-ply MC was used with 50 uniformly random rollouts of depth 15 at every step. The model for each pixel was learned using Context Tree Switching (Veness et al. 2012), similar to the FAC-CTW algorithm (Veness et al. 2011), and used a 7 × 7 neighborhood around the pixel in the previous timestep as input. Data was shared across all positions. The discount factor was γ = 0.9. In each iteration 500 training rollouts were generated and the resulting policy was evaluated in an episode of length 30. The discounted return obtained by the policy in each iteration is reported, averaged over 50 trials.
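The Pseudocode row cites Algorithm 1 (Hallucinated DAgger-MC), which is not reproduced in this report. As a rough illustration of the quantity it optimizes, the sketch below pairs the model's own sampled context with the real next observation, so that training drives down "hallucinated error", i.e., the error the model makes on contexts it generated itself. This is a minimal sketch assuming hypothetical `env`, `policy`, and `model.sample(obs, action) -> (next_obs, reward)` interfaces; it is not the paper's full algorithm, which interleaves this kind of data collection with model learning and planning in a DAgger-style loop.

```python
def collect_rollout(env, model, policy, rollout_len):
    """Collect one training rollout containing both 'real' and 'hallucinated'
    examples. A real example pairs the true previous observation with the true
    next observation; a hallucinated example pairs the model's own sampled
    context with the true next observation, so minimizing its loss reduces the
    error the model makes on contexts it generates itself."""
    data = []
    obs = env.reset()
    hallucinated = obs  # model-generated context, seeded with the real start state
    for _ in range(rollout_len):
        action = policy(obs)  # the agent acts on the real observation
        next_obs, reward, done = env.step(action)
        data.append((obs, action, next_obs, reward))           # real example
        data.append((hallucinated, action, next_obs, reward))  # hallucinated example
        # Advance the hallucinated context by sampling the model's own prediction.
        hallucinated, _ = model.sample(hallucinated, action)
        obs = next_obs
        if done:
            break
    return data
```

The Experiment Setup row reports a one-ply Monte Carlo planner with 50 uniformly random rollouts of depth 15 and γ = 0.9. A minimal sketch of such a planner is given below, using the same hypothetical `model.sample` interface; whether the 50-rollout budget is per candidate action or shared across actions is not stated, so the sketch assumes 50 rollouts per action.

```python
import random

GAMMA = 0.9          # discount factor reported in the paper
NUM_ROLLOUTS = 50    # rollouts per candidate action (an assumption; see above)
DEPTH = 15           # rollout depth reported in the paper


def one_ply_mc_action(model, obs, actions):
    """One-ply Monte Carlo control: estimate each action's value by rolling the
    learned model forward under a uniformly random policy, averaging the
    discounted returns, and acting greedily on the estimates."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(NUM_ROLLOUTS):
            # First step uses the candidate action; remaining steps are random.
            o, r = model.sample(obs, a)
            ret, discount = r, GAMMA
            for _ in range(DEPTH - 1):
                o, r = model.sample(o, random.choice(actions))
                ret += discount * r
                discount *= GAMMA
            total += ret
        value = total / NUM_ROLLOUTS
        if value > best_value:
            best_action, best_value = a, value
    return best_action
```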