A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning

Authors: Bo Liu, Xidong Feng, Jie Ren, Luo Mai, Rui Zhu, Haifeng Zhang, Jun Wang, Yaodong Yang

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we develop a unified framework that describes variations of GMRL algorithms and points out that existing stochastic meta-gradient estimators adopted by GMRL are actually biased. Such meta-gradient bias comes from two sources: 1) the compositional bias incurred by the two-level problem structure, which has an upper bound of $\mathcal{O}\big(K\alpha^{K}\hat{\sigma}_{\text{In}}|\tau|^{-0.5}\big)$ w.r.t. inner-loop update step $K$, learning rate $\alpha$, estimate variance $\hat{\sigma}^{2}_{\text{In}}$ and sample size $|\tau|$, and 2) the multi-step Hessian estimation bias $\hat{\Delta}_{H}$ due to the use of autodiff, which has a polynomial impact $\mathcal{O}\big((K-1)(\hat{\Delta}_{H})^{K-1}\big)$ on the meta-gradient bias. We study tabular MDPs empirically and offer quantitative evidence that testifies our theoretical findings on existing stochastic meta-gradient estimators. Furthermore, we conduct experiments on Iterated Prisoner's Dilemma and Atari games to show how other methods such as off-policy learning and low-bias estimators can help fix the gradient bias for GMRL algorithms in general. (A toy sketch of the compositional bias is given after the table.)
Researcher Affiliation | Collaboration | Bo Liu, Institute of Automation, Chinese Academy of Sciences, benjaminliu.eecs@gmail.com; Xidong Feng, University College London, xidong.feng.20@ucl.ac.uk; Jie Ren, University of Edinburgh, jieren9806@gmail.com; Luo Mai, University of Edinburgh, luo.mai@ed.ac.uk; Rui Zhu, DeepMind, ruizhu@google.com; Haifeng Zhang, Institute of Automation, CAS & Nanjing Artificial Intelligence Research of IA, haifeng.zhang@ia.ac.cn; Jun Wang, University College London, jun.wang@cs.ucl.ac.uk; Yaodong Yang, Institute for AI, Peking University & Beijing Institute for General AI, yaodong.yang@pku.edu.cn
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We open source our code at https://github.com/Benjamin-eecs/Theoretical-GMRL.
Open Datasets | Yes | To align with existing works in the literature, we adopt the settings of random MDPs in [33] with the focus on meta-gradient estimation. ... The IPD environments are taken from [13]. ... We evaluate our methods on 8 Atari games with image observations following [25].
Dataset Splits | No | The paper does not provide dataset split information (exact percentages, sample counts, citations to predefined splits, or a detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | Yes | Our experiments are run on a server with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 256GB RAM and NVIDIA Tesla V100 SXM2 32GB GPUs.
Software Dependencies | No | The paper mentions software such as 'PyTorch implementations of reinforcement learning algorithms' [21] and 'TorchOpt' [28], but does not provide version numbers for these dependencies; only the publication year of one cited reference is given.
Experiment Setup | Yes | We conduct our ablation studies by comparing the correlation of the estimated meta-gradient with the exact one (a minimal sketch of such a correlation metric is given after the table). The correlation metric, which is determined by bias and variance, shows how the final estimation quality is influenced by these two bias terms. ... For example, in Fig. 3(a), the inner-loop estimation refers to $\nabla_{\theta}J^{\text{In}}(\phi, \theta)$, while the outer-loop estimation refers to $\nabla_{\theta_{1}}J^{\text{Out}}(\phi, \theta_{1})$ and $\nabla_{\phi}J^{\text{Out}}(\phi, \theta_{1})$. The return shown in Fig. 3(a) reveals two findings: 1) The inner-loop gradient estimation plays an important role in making LOLA work: the default batch size of 128 fails while a batch size of 1024 succeeds. ... "3-step" means we take 3 inner-loop RL virtual updates for calculating the meta-gradient.
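
As referenced in the Research Type row, the following is a toy sketch of the compositional bias, written by us for illustration only (the 1-D objectives, constants, and names are hypothetical, not the paper's code). It shows how plugging a noisy one-step inner-loop update into a nonlinear outer-loop gradient produces a meta-gradient estimate whose mean drifts from the exact value, and how the gap shrinks as the inner-loop sample size grows.

```python
# Toy sketch (hypothetical, not the paper's code) of compositional bias in a
# one-step meta-gradient estimator.
#   Inner objective:  J_in(phi, theta) = -(theta - phi)^2 / 2
#   Inner update:     theta' = theta + alpha * dJ_in/dtheta = theta + alpha * (phi - theta)
#   Outer objective:  J_out(theta')    = -(theta')^4 / 4
#   Meta-gradient:    dJ_out/dphi = dJ_out/dtheta' * dtheta'/dphi = -(theta')^3 * alpha
import numpy as np

rng = np.random.default_rng(0)
phi, theta, alpha, sigma = 0.7, 0.0, 0.5, 1.0  # sigma: std of the per-sample inner-gradient noise

def meta_grad(inner_noise=0.0):
    # One (possibly noisy) inner-loop gradient step, then chain rule through theta'.
    theta_new = theta + alpha * ((phi - theta) + inner_noise)
    return -(theta_new ** 3) * alpha

exact = meta_grad()  # noise-free inner update gives the exact meta-gradient
for batch in (8, 128, 1024):
    # Sample-mean inner gradient: noise std shrinks like sigma / sqrt(batch).
    noises = rng.normal(0.0, sigma / np.sqrt(batch), size=100_000)
    estimate = np.mean([meta_grad(n) for n in noises])
    print(f"|batch|={batch:5d}  |compositional bias| ~= {abs(estimate - exact):.5f}")
```

The quartic outer objective is chosen only so that the outer gradient is nonlinear in theta'; with a quadratic outer objective this toy estimator would be unbiased, which is why the nonlinearity is the essential ingredient here.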
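The Experiment Setup row compares estimated meta-gradients against the exact one via a correlation metric. Below is a minimal sketch of such a metric, written by us under the assumption of a Pearson correlation over flattened gradients; the authors' released code may compute it differently.

```python
# Hypothetical sketch (ours, not the authors' released code) of a correlation
# metric between a stochastic meta-gradient estimate and the exact meta-gradient.
import numpy as np

def meta_grad_correlation(estimate: np.ndarray, exact: np.ndarray) -> float:
    # Flatten both gradients and take the Pearson correlation coefficient.
    return float(np.corrcoef(estimate.ravel(), exact.ravel())[0, 1])

# Toy usage with made-up numbers: a lower-variance estimator (e.g. from a larger
# inner-loop batch) correlates more strongly with the exact meta-gradient.
rng = np.random.default_rng(0)
exact = rng.normal(size=64)                                  # stand-in for the exact meta-gradient
small_batch = exact + rng.normal(scale=2.0, size=(32, 64))   # 32 noisy estimates
large_batch = exact + rng.normal(scale=0.2, size=(32, 64))   # 32 lower-noise estimates
for name, runs in (("small batch", small_batch), ("large batch", large_batch)):
    mean_corr = np.mean([meta_grad_correlation(r, exact) for r in runs])
    print(f"{name}: mean correlation = {mean_corr:.3f}")
```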