Robust Asymmetric Learning in POMDPs
Authors: Andrew Warrington, Jonathan W Lavington, Adam Scibior, Mark Schmidt, Frank Wood
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply A2D to two pedagogical gridworld environments, and an autonomous vehicle scenario, where AIL fails. We show A2D recovers the optimal partially observed policy with fewer samples, lower computational cost, and less variance compared to similar methods. These experiments demonstrate the efficacy of A2D, which makes learning via imitation and reinforcement safer and more efficient, even in difficult high dimensional control problems such as autonomous driving. |
| Researcher Affiliation | Collaboration | 1) Department of Engineering Science, University of Oxford; 2) Department of Computer Science, University of British Columbia; 3) Inverted AI; 4) Alberta Machine Intelligence Institute (AMII); 5) Montreal Institute for Learning Algorithms (MILA). |
| Pseudocode | Yes | Algorithm 1 Adaptive Asymmetric DAgger (A2D). 1: Input: MDP MΘ, POMDP MΦ, annealing schedule AnnealBeta(n, β). 2: Return: variational trainee parameters ψ. 3: θ, ψ, νm, νp ← InitNets(MΘ, MΦ). 4: β ← 1, D ← ∅. 5: for n = 0, ..., N do: 6: β ← AnnealBeta(n, β); 7: πβ ← βπθ + (1 − β)πψ; 8: T = {τi}, i = 1, ..., I, sampled from qπβ(τ); 9: D ← UpdateBuffer(D, T); 10: Vπβ ← βVνm + (1 − β)Vνp; 11: θ, νm, νp ← RLStep(T, Vπβ, πβ); 12: ψ ← AILStep(D, πθ, πψ); 13: end for. Caption: Adaptive asymmetric DAgger (A2D) algorithm. Additional steps introduced beyond DAgger (Ross et al., 2011) are highlighted in blue, and implement the feedback loop in Figure 1. RLStep is a policy gradient step updating the expert, using the gradient estimator in (27). AILStep is an AIL variational policy update, as in (18). A schematic Python sketch of this loop is given after the table. |
| Open Source Code | Yes | Code and additional materials are available at https://github.com/plai-group/a2d. |
| Open Datasets | No | The paper uses pedagogical gridworld environments (Frozen Lake, Tiger Door) and the CARLA simulator. While these environments are open-source or described, the specific datasets generated for the experiments are not explicitly stated as publicly available, nor are links or citations provided for them as datasets. |
| Dataset Splits | No | The paper does not explicitly provide details about training, validation, or test dataset splits (e.g., percentages, sample counts, or specific predefined splits). |
| Hardware Specification | No | The paper mentions using computational resources from West Grid and Compute Canada, but does not provide specific hardware details such as GPU/CPU models, memory, or specific cloud instance types used for experiments. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We note that many of the hyperparameters are largely consistent between A2D and RL in the MDP, which is easy to tune. However, A2D did often benefit from increased entropy regularization and reduced λ (see Appendix B). The IL hyperparameters are largely independent of the RL hyperparameters, further simplifying tuning overall. |
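
The Algorithm 1 pseudocode quoted above maps onto a short outer training loop. The sketch below illustrates that control flow only: the helper names (`anneal_beta`, `rl_step`, `ail_step`, `sample_trajectories`) and the toy scalar "networks" are placeholders we introduce for illustration, not the authors' implementation; the actual code is in the linked repository (https://github.com/plai-group/a2d).

```python
# Minimal sketch of the A2D outer loop (Algorithm 1), under assumed stand-ins
# for the expert policy (theta), trainee policy (psi), and the two critics
# (nu_m on MDP state, nu_p on POMDP observation). Illustrative only.
import random


def anneal_beta(n, beta, decay=0.95):
    # Anneal the mixture coefficient beta toward 0 (schedule is an assumption).
    return beta * decay


def init_nets():
    # Toy scalar parameters standing in for the four networks.
    return {"theta": 0.0, "psi": 0.0, "nu_m": 0.0, "nu_p": 0.0}


def sample_trajectories(beta, params, num_traj=8):
    # Roll out the beta-mixture policy pi_beta = beta*pi_theta + (1-beta)*pi_psi.
    # Dummy scalar "returns" here; a real rollout would query a simulator.
    return [random.random() for _ in range(num_traj)]


def rl_step(params, trajectories, beta):
    # Policy-gradient update of the expert and both critics against the
    # mixture value V^pi_beta = beta*V_nu_m + (1-beta)*V_nu_p.
    params["theta"] += 0.01 * sum(trajectories)
    params["nu_m"] += 0.01
    params["nu_p"] += 0.01


def ail_step(params, buffer):
    # Asymmetric imitation step: pull the trainee pi_psi toward the expert
    # pi_theta over states stored in the DAgger-style buffer.
    params["psi"] += 0.1 * (params["theta"] - params["psi"])


def a2d(num_iters=100):
    params = init_nets()
    beta, buffer = 1.0, []
    for n in range(num_iters):
        beta = anneal_beta(n, beta)
        trajectories = sample_trajectories(beta, params)
        buffer.extend(trajectories)          # D <- UpdateBuffer(D, T)
        rl_step(params, trajectories, beta)  # update expert and value mixture
        ail_step(params, buffer)             # update variational trainee
    return params["psi"]


if __name__ == "__main__":
    print(a2d())
```

The key structural point the sketch preserves is that each iteration does both an RL update of the expert (on full MDP state) and an imitation update of the partially observed trainee, with beta annealing from expert-driven to trainee-driven data collection.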