Learning the Reward Function for a Misspecified Model
Authors: Erik Talvitie
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, this approach to reward learning can yield dramatic improvements in control performance when the dynamics model is flawed. ... Section 5 demonstrates empirically that the approach suggested by the theoretical results can produce good planning performance with a flawed model |
| Researcher Affiliation | Academia | Erik Talvitie, Department of Computer Science, Franklin & Marshall College, Lancaster, Pennsylvania, USA. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such. |
| Open Source Code | Yes | Source code for these experiments is available at http://github.com/etalvitie/hdaggermc. |
| Open Datasets | No | The paper describes generating training rollouts within the simulated environment ("500 training rollouts were generated") rather than using a pre-existing, publicly available dataset with concrete access information. |
| Dataset Splits | No | The paper describes an experimental setup within a simulated environment where data is generated for training and evaluation. It does not provide specific dataset split information (percentages, sample counts, or citations to predefined splits) for traditional training, validation, and test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions learning models using "Context Tree Switching (Veness et al., 2012)" and the "FAC-CTW algorithm (Veness et al., 2011)" but does not provide specific software names with version numbers for reproducibility. |
| Experiment Setup | Yes | The one-ply MC planner used 50 uniformly random rollouts of depth 20 per action at every step. The exploration distribution was generated by following the optimal policy with (1 − γ) probability of termination at each step. The discount factor was γ = 0.9. In each iteration 500 training rollouts were generated and the resulting policy was evaluated in an episode of length 30. The discounted return obtained by the policy in each iteration is reported, averaged over 50 trials. ... Here, in each experiment the best performing step size for each approach is selected from 0.005, 0.01, 0.05, 0.1, and 0.5. (See the planner sketch below the table.) |
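
The "Experiment Setup" row describes a one-ply Monte Carlo planner and an exploration distribution based on the optimal policy. The sketch below is a minimal, hedged illustration of those two components, not the paper's actual implementation (the released code at github.com/etalvitie/hdaggermc is authoritative). The interfaces `model.sample_next(state, action) -> (next_state, reward)`, `env.step(state, action) -> next_state`, and `policy(state) -> action` are assumed names introduced here for illustration only.

```python
import random

GAMMA = 0.9          # discount factor reported in the paper
NUM_ROLLOUTS = 50    # uniformly random rollouts per action at every step
ROLLOUT_DEPTH = 20   # depth of each rollout


def rollout_return(model, state, first_action, actions):
    """Sample one depth-limited rollout from the learned model and
    return its discounted return (uniformly random continuation policy)."""
    g, discount, action = 0.0, 1.0, first_action
    for _ in range(ROLLOUT_DEPTH):
        state, reward = model.sample_next(state, action)  # assumed interface
        g += discount * reward
        discount *= GAMMA
        action = random.choice(actions)
    return g


def one_ply_mc_action(model, state, actions):
    """One-ply MC planning: average rollout returns per action, take argmax."""
    def avg_return(a):
        return sum(rollout_return(model, state, a, actions)
                   for _ in range(NUM_ROLLOUTS)) / NUM_ROLLOUTS
    return max(actions, key=avg_return)


def exploration_rollout(env, policy, start_state):
    """Follow the optimal policy, terminating with probability (1 - GAMMA)
    at each step, as in the paper's description of the exploration
    distribution. `env` and `policy` are assumed interfaces."""
    state, trajectory = start_state, [start_state]
    while random.random() < GAMMA:  # continue with probability GAMMA
        state = env.step(state, policy(state))
        trajectory.append(state)
    return trajectory
```

This sketch only mirrors the reported hyperparameters (50 rollouts, depth 20, γ = 0.9); details such as tie-breaking, the learned-model interface, and how training rollouts are aggregated per iteration are left out and would need to be taken from the released source.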