Cooperative Inverse Reinforcement Learning

Authors: Dylan Hadfield-Menell, Stuart J. Russell, Pieter Abbeel, Anca Dragan

NeurIPS 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental (4 Experiments) | Our experimental domain is a 2D navigation problem on a discrete grid. In the learning phase of the game, H teleoperates a trajectory while R observes. In the deployment phase, R is placed in a random state and given control of the robot. We use a finite horizon H, and let the first H/2 timesteps be the learning phase. There are N_φ state features defined as radial basis functions where the centers are common knowledge. Rewards are linear in these features and θ. The initial world state is in the middle of the map. We use a uniform distribution on [-1, 1]^{N_φ} for the prior on θ. Actions move in one of the four cardinal directions {N, S, E, W} and there is an additional no-op that each actor executes deterministically on the other agent's turn. Figure 1 shows an example comparison between demonstration-by-expert and the approximate best response policy in Section 3.4. The leftmost image is the ground truth reward function. Next to it are demonstration trajectories produced by these two policies. Each path is superimposed on the maximum a posteriori reward function the robot infers from the demonstration. We can see that the demonstration-by-expert policy immediately goes to the highest reward and stays there. In contrast, the best response policy moves to both areas of high reward. The reward function the robot infers from the best response demonstration is much more representative of the true reward function than the one it infers from demonstration-by-expert. and Results. There was relatively little correlation between the measures (Cronbach's α of .47), so we ran a factorial repeated-measures ANOVA for each measure. Across all measures, we found a significant effect for H-policy, with br outperforming π_E on all measures, as we hypothesized (all with F > 962, p < .0001). (The navigation domain described here is sketched in the first code example after the table.)
Researcher Affiliation | Academia | Electrical Engineering and Computer Science, University of California at Berkeley, Berkeley, CA 94709
Pseudocode | No | The paper describes algorithms and formulations but does not contain a dedicated block or figure labeled as “Pseudocode” or “Algorithm”.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository for the methodology described.
Open Datasets | No | The paper describes a simulated 2D navigation problem and uses a uniform distribution for the prior on θ, but it does not refer to any external, publicly available dataset with access information.
Dataset Splits | No | The paper describes a learning and a deployment phase and uses N=500 reward parameters for testing, but it does not specify explicit train/validation/test splits, percentages, or cross-validation details for a formal dataset partitioning.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory, or cloud instances) used for running the experiments.
Software Dependencies | No | The paper mentions using “Maximum-Entropy IRL (Ziebart et al., 2008)” but does not provide specific version numbers for any software libraries, packages, or programming languages.
Experiment Setup | Yes | Our experimental domain is a 2D navigation problem on a discrete grid. In the learning phase of the game, H teleoperates a trajectory while R observes. In the deployment phase, R is placed in a random state and given control of the robot. We use a finite horizon H, and let the first H/2 timesteps be the learning phase. There are N_φ state features defined as radial basis functions where the centers are common knowledge. Rewards are linear in these features and θ. The initial world state is in the middle of the map. We use a uniform distribution on [-1, 1]^{N_φ} for the prior on θ. Actions move in one of the four cardinal directions {N, S, E, W} and there is an additional no-op that each actor executes deterministically on the other agent's turn. and The trade-off between reward and communication η is set by cross-validation before the game begins. The num-features factor varies the dimensionality of φ across two levels: 3 features and 10 features. and Maximum-Entropy IRL includes a free parameter λ that controls how optimal R expects H to behave. (The role of λ is illustrated in the second code example after the table.)
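
The experiment-setup excerpts above are concrete enough to sketch in code. The following is a minimal, hypothetical sketch of the 2D navigation domain: RBF state features with common-knowledge centers, rewards linear in features and θ, a uniform prior on [-1, 1]^{N_φ}, and cardinal moves plus a no-op. The grid size, RBF bandwidth, and all names (GRID_SIZE, rbf_features, reward, sample_theta, step) are assumptions for illustration; the paper does not release an implementation.

```python
import numpy as np

# Hypothetical sketch of the 2D navigation domain described in the excerpts.
# GRID_SIZE and RBF_WIDTH are assumed values, not taken from the paper.
GRID_SIZE = 10      # assumed grid side length
N_PHI = 3           # number of radial basis features (the paper varies 3 vs. 10)
RBF_WIDTH = 2.0     # assumed RBF bandwidth

rng = np.random.default_rng(0)
# Centers of the radial basis functions are common knowledge to H and R.
centers = rng.uniform(0, GRID_SIZE, size=(N_PHI, 2))

def rbf_features(state):
    """phi(s): radial basis features of a grid cell (x, y)."""
    s = np.asarray(state, dtype=float)
    d2 = np.sum((centers - s) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * RBF_WIDTH ** 2))

def reward(state, theta):
    """Reward is linear in the features and theta: R(s; theta) = theta . phi(s)."""
    return float(theta @ rbf_features(state))

def sample_theta():
    """Prior on theta: uniform on [-1, 1]^{N_phi}."""
    return rng.uniform(-1.0, 1.0, size=N_PHI)

# Four cardinal moves plus a no-op (each actor no-ops on the other agent's turn).
ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0), "noop": (0, 0)}

def step(state, action):
    """Deterministic transition, clipped to the grid boundary."""
    dx, dy = ACTIONS[action]
    x = int(np.clip(state[0] + dx, 0, GRID_SIZE - 1))
    y = int(np.clip(state[1] + dy, 0, GRID_SIZE - 1))
    return (x, y)

if __name__ == "__main__":
    theta = sample_theta()
    s = (GRID_SIZE // 2, GRID_SIZE // 2)   # initial state in the middle of the map
    print(reward(s, theta), step(s, "N"))
```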
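
The Experiment Setup row also notes that Maximum-Entropy IRL has a free parameter λ controlling how optimal R expects H to behave. The sketch below illustrates that role with a Boltzmann-style trajectory likelihood normalized over a small, enumerable candidate set; this normalization scheme and the names (trajectory_return, demo_log_likelihood, lam) are assumptions for illustration, not the paper's actual inference procedure.

```python
import numpy as np

def trajectory_return(feature_counts, theta):
    """Return of a trajectory with summed feature counts under a linear reward."""
    return float(theta @ feature_counts)

def demo_log_likelihood(demo_counts, candidate_counts, theta, lam):
    """log P(demo | theta) with weight proportional to exp(lam * return),
    normalized over a small set of candidate trajectories (illustrative only)."""
    returns = np.array([trajectory_return(c, theta) for c in candidate_counts])
    log_z = np.log(np.sum(np.exp(lam * returns)))   # partition over candidates
    return lam * trajectory_return(demo_counts, theta) - log_z

# Toy usage with 3 features: as lam grows, the model expects H to pick the
# highest-return trajectory almost deterministically (log-likelihood -> 0 here).
theta = np.array([0.8, -0.2, 0.1])
candidates = [np.array([3.0, 0.0, 1.0]),
              np.array([1.0, 2.0, 0.0]),
              np.array([0.0, 0.0, 4.0])]
demo = candidates[0]
for lam in (0.1, 1.0, 10.0):
    print(lam, demo_log_likelihood(demo, candidates, theta, lam))
```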