Robust Asymmetric Learning in POMDPs
Authors: Andrew Warrington, Jonathan W Lavington, Adam Scibior, Mark Schmidt, Frank Wood
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply A2D to two pedagogical gridworld environments, and an autonomous vehicle scenario, where AIL fails. We show A2D recovers the optimal partially observed policy with fewer samples, lower computational cost, and less variance compared to similar methods. These experiments demonstrate the efficacy of A2D, which makes learning via imitation and reinforcement safer and more efficient, even in difficult high dimensional control problems such as autonomous driving. |
| Researcher Affiliation | Collaboration | 1) Department of Engineering Science, University of Oxford; 2) Department of Computer Science, University of British Columbia; 3) Inverted AI; 4) Alberta Machine Intelligence Institute (AMII); 5) Montreal Institute for Learning Algorithms (MILA). |
| Pseudocode | Yes | Algorithm 1 Adaptive Asymmetric DAgger (A2D). 1: Input: MDP MΘ, POMDP MΦ, annealing schedule AnnealBeta(n, β). 2: Return: variational trainee parameters ψ. 3: θ, ψ, νm, νp ← InitNets(MΘ, MΦ). 4: β ← 1, D ← ∅. 5: for n = 0, ..., N do: 6: β ← AnnealBeta(n, β); 7: πβ ← βπθ + (1 − β)πψ; 8: T = {τi}, i = 1, ..., I, sampled from qπβ(τ); 9: D ← UpdateBuffer(D, T); 10: Vπβ ← βVνm + (1 − β)Vνp; 11: θ, νm, νp ← RLStep(T, Vπβ, πβ); 12: ψ ← AILStep(D, πθ, πψ); 13: end for. Caption: Adaptive asymmetric DAgger (A2D) algorithm. Additional steps introduced beyond DAgger (Ross et al., 2011) are highlighted in blue, and implement the feedback loop in Figure 1. RLStep is a policy gradient step updating the expert, using the gradient estimator in (27). AILStep is an AIL variational policy update, as in (18). A schematic Python sketch of this loop is given after the table. |
| Open Source Code | Yes | Code and additional materials are available at https://github.com/plai-group/a2d. |
| Open Datasets | No | The paper uses pedagogical gridworld environments (Frozen Lake, Tiger Door) and the CARLA simulator. While these environments are open-source or described, the specific datasets generated for the experiments are not explicitly stated as publicly available, nor are links or citations provided for them as datasets. |
| Dataset Splits | No | The paper does not explicitly provide details about training, validation, or test dataset splits (e.g., percentages, sample counts, or specific predefined splits). |
| Hardware Specification | No | The paper mentions using computational resources from West Grid and Compute Canada, but does not provide specific hardware details such as GPU/CPU models, memory, or specific cloud instance types used for experiments. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We note that many of the hyperparameters are largely consistent between A2D and RL in the MDP, which is easy to tune. However, A2D did often benefit from increased entropy regularization and reduced λ (see Appendix B). The IL hyperparameters are largely independent of the RL hyperparameters, further simplifying tuning overall. |
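
The Algorithm 1 pseudocode quoted above maps onto a short outer training loop. The sketch below illustrates that control flow only: the helper names (`anneal_beta`, `rl_step`, `ail_step`, `sample_trajectories`) and the toy scalar "networks" are placeholders we introduce for illustration, not the authors' implementation; the actual code is in the linked repository (https://github.com/plai-group/a2d).

```python
# Minimal sketch of the A2D outer loop (Algorithm 1), under assumed stand-ins
# for the expert policy (theta), trainee policy (psi), and the two critics
# (nu_m on MDP state, nu_p on POMDP observation). Illustrative only.
import random


def anneal_beta(n, beta, decay=0.95):
    # Anneal the mixture coefficient beta toward 0 (schedule is an assumption).
    return beta * decay


def init_nets():
    # Toy scalar parameters standing in for the four networks.
    return {"theta": 0.0, "psi": 0.0, "nu_m": 0.0, "nu_p": 0.0}


def sample_trajectories(beta, params, num_traj=8):
    # Roll out the beta-mixture policy pi_beta = beta*pi_theta + (1-beta)*pi_psi.
    # Dummy scalar "returns" here; a real rollout would query a simulator.
    return [random.random() for _ in range(num_traj)]


def rl_step(params, trajectories, beta):
    # Policy-gradient update of the expert and both critics against the
    # mixture value V^pi_beta = beta*V_nu_m + (1-beta)*V_nu_p.
    params["theta"] += 0.01 * sum(trajectories)
    params["nu_m"] += 0.01
    params["nu_p"] += 0.01


def ail_step(params, buffer):
    # Asymmetric imitation step: pull the trainee pi_psi toward the expert
    # pi_theta over states stored in the DAgger-style buffer.
    params["psi"] += 0.1 * (params["theta"] - params["psi"])


def a2d(num_iters=100):
    params = init_nets()
    beta, buffer = 1.0, []
    for n in range(num_iters):
        beta = anneal_beta(n, beta)
        trajectories = sample_trajectories(beta, params)
        buffer.extend(trajectories)          # D <- UpdateBuffer(D, T)
        rl_step(params, trajectories, beta)  # update expert and value mixture
        ail_step(params, buffer)             # update variational trainee
    return params["psi"]


if __name__ == "__main__":
    print(a2d())
```

The key structural point the sketch preserves is that each iteration does both an RL update of the expert (on full MDP state) and an imitation update of the partially observed trainee, with beta annealing from expert-driven to trainee-driven data collection.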