Teaching Inverse Reinforcement Learners via Features and Demonstrations
Authors: Luis Haug, Sebastian Tschiatschek, Adish Singla
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6 (Experiments): "Our experimental setup is similar to the one in [Abbeel and Ng, 2004], i.e., we use N × N gridworlds in which non-overlapping square regions of neighbouring cells are grouped together to form n × n macrocells for some n dividing N." ... "The plots in Figure 3 illustrate the significance of the teaching risk for the problem of teaching a learner under worldview mismatch." ... "We compared the performance of TRGREEDY (Algorithm 1) to two variants of the algorithm..." |
| Researcher Affiliation | Collaboration | Luis Haug Department of Computer Science ETH Zurich lhaug@inf.ethz.ch Sebastian Tschiatschek Microsoft Research Cambridge, UK setschia@microsoft.com Adish Singla Max Planck Institute for Software Systems Saarbrücken, Germany adishs@mpi-sws.org |
| Pseudocode | Yes | Algorithm 1 (TRGREEDY: feature- and demo-based teaching with TR-greedy feature selection). Require: reward vector w\*, set of teachable features F, feature budget B, initial worldview A_L, teacher policy π_T, initial learner policy π_L, performance threshold ε. For i = 1, ..., B: if \|⟨w\*, μ(π_L)⟩ − ⟨w\*, μ(π_T)⟩\| > ε, then f\* ← argmin_{f ∈ F} ρ(A_{L∪{f}}; w\*) (T selects feature to teach); A_L ← A_{L∪{f\*}} (L's worldview gets updated); π_L ← LEARNING(π_L, A_L μ(π_T)) (L trains a new policy); else return π_L. Return π_L. |
| Open Source Code | No | The paper does not provide any statement or link regarding the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper describes using 'N × N gridworlds' and sampling 'reward weights w\* ∈ R^k randomly', indicating a synthetic environment setup rather than a specific publicly available dataset with concrete access information. While it references [Abbeel and Ng, 2004] for the setup type, it does not provide access information for a dataset. |
| Dataset Splits | No | The paper discusses training and learning rounds, but it does not provide specific details on dataset splits such as explicit training, validation, and test sets with percentages or counts. |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using 'the projection version of the apprenticeship learning algorithm from [Abbeel and Ng, 2004]' but does not list specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks). |
| Experiment Setup | Yes | To obtain these plots, we used a gridworld with N = 20, n = 2; for each value ℓ ∈ [1, 100], we sampled five random worldview matrices A_L ∈ R^{ℓ×100}, and let L train a policy π_L using the projection algorithm in [Abbeel and Ng, 2004]... The discount factor used was γ = 0.9 in all cases. |
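The TRGREEDY pseudocode reported in the table can be sketched in Python. This is a minimal sketch, not the authors' implementation: the learning routine, the teaching-risk functional ρ, and the feature-expectation map μ are passed in as callables, and all names (`tr_greedy`, `learn`, `teaching_risk`) are hypothetical.

```python
import numpy as np

def tr_greedy(w_star, features, budget, A_L, mu_T, pi_L,
              mu, learn, teaching_risk, eps):
    """Sketch of Algorithm 1 (TRGREEDY): greedily teach the feature that
    minimizes the teaching risk rho, then let the learner retrain."""
    for _ in range(budget):
        # stop once the learner's performance gap is within the threshold
        if abs(w_star @ mu(pi_L) - w_star @ mu_T) <= eps:
            return pi_L
        # teacher selects the feature whose addition minimizes teaching risk
        f_star = min(features,
                     key=lambda f: teaching_risk(np.vstack([A_L, f]), w_star))
        features = [f for f in features if f is not f_star]
        A_L = np.vstack([A_L, f_star])       # learner's worldview gets updated
        pi_L = learn(pi_L, A_L, A_L @ mu_T)  # learner trains a new policy
    return pi_L
```

In a toy instantiation one might take ρ(A; w\*) as the norm of the component of w\* outside the row space of A, and let the learner return feature expectations matching A_L μ(π_T); the real paper instead trains π_L with the projection algorithm of [Abbeel and Ng, 2004].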
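The gridworld setup described in the table (N × N cells grouped into n × n macrocells, as in [Abbeel and Ng, 2004]) can be sketched as an indicator feature map. This is an illustrative reconstruction under assumed conventions, not code from the paper; the function name `macrocell_features` is hypothetical.

```python
import numpy as np

def macrocell_features(N, n):
    """Indicator features for an N x N gridworld whose cells are grouped
    into non-overlapping n x n macrocells (n must divide N). Each cell's
    feature vector is the one-hot indicator of the macrocell containing it."""
    assert N % n == 0, "n must divide N"
    m = N // n          # macrocells per side
    k = m * m           # number of features
    phi = np.zeros((N * N, k))
    for r in range(N):
        for c in range(N):
            macro = (r // n) * m + (c // n)
            phi[r * N + c, macro] = 1.0
    return phi
```

With N = 20 and n = 2, as in the reported experiment setup, this yields k = 100 features, matching the sampled worldview matrices A_L ∈ R^{ℓ×100}.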