Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Authors: Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, Sergey Levine

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks. Through systematic experiments, we show the effectiveness of our method on continuous-control MuJoCo tasks with a variety of off-policy datasets generated by random, suboptimal, or optimal policies. In our experiments, we study how BEAR performs when learning from static off-policy data on a variety of continuous control benchmark tasks.
Researcher Affiliation | Collaboration | Aviral Kumar (UC Berkeley, aviralk@berkeley.edu); Justin Fu (UC Berkeley, justinjfu@eecs.berkeley.edu); George Tucker (Google Brain, gjt@google.com); Sergey Levine (UC Berkeley and Google Brain, svlevine@eecs.berkeley.edu)
Pseudocode | Yes | Algorithm 1: BEAR Q-Learning (BEAR-QL). (A hedged sketch of the sampled MMD estimate used by this algorithm appears below the table.)
Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the described methodology or a direct link to a code repository.
Open Datasets | No | The paper states using data from 'continuous-control MuJoCo tasks' and mentions collecting 'one million transitions from a partially trained policy'. While MuJoCo is a known environment, the specific datasets generated or used for training are not provided with concrete access information (e.g., a link, DOI, or formal citation to a public dataset repository). (A generic data-collection sketch appears below the table.)
Dataset Splits | No | The paper does not specify exact percentages, sample counts, or predefined citations for training, validation, or test dataset splits. It mentions 'evaluation' but not a formal split of the static datasets used for training.
Hardware Specification | No | The acknowledgements section states 'We thank Google, NVIDIA, and Amazon for providing computational resources.' However, this is a general statement and does not include specific hardware details such as GPU/CPU models, memory specifications, or cloud instance types used for the experiments.
Software Dependencies | No | The paper mentions that the algorithm is built on 'TD3 [13] or SAC [18]' and refers to MuJoCo tasks, but it does not specify software library names with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8') that would be necessary for reproducibility.
Experiment Setup | Yes | The paper explicitly lists several experimental parameters in Algorithm 1 and surrounding text: 'input: Dataset D, target network update rate τ, mini-batch size N, sampled actions for MMD n, minimum λ'. It also states 'We choose a threshold of ε = 0.05 in our experiments.' and mentions 'At test time, we sample p actions from πφ(s)'. (These inputs are gathered into a hedged configuration sketch below the table.)
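
The Pseudocode row references Algorithm 1 (BEAR-QL), whose actor update constrains a sampled maximum mean discrepancy (MMD) between actions drawn from the learned policy and actions sampled near the data distribution. The following is a minimal sketch of such a sampled MMD estimate; the Gaussian kernel choice, bandwidth `sigma`, and batch shapes are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=10.0):
    """Gaussian (RBF) kernel between two batches of action vectors.

    x: (n, d) array, y: (m, d) array. Returns an (n, m) kernel matrix.
    """
    diff = x[:, None, :] - y[None, :, :]      # (n, m, d) pairwise differences
    sq_dist = np.sum(diff ** 2, axis=-1)      # (n, m) squared Euclidean distances
    return np.exp(-sq_dist / (2.0 * sigma))

def sampled_mmd(behavior_actions, policy_actions, sigma=10.0):
    """Biased sample estimate of squared MMD between two sets of actions,
    used to keep the learned policy close to the support of the dataset."""
    k_bb = gaussian_kernel(behavior_actions, behavior_actions, sigma).mean()
    k_bp = gaussian_kernel(behavior_actions, policy_actions, sigma).mean()
    k_pp = gaussian_kernel(policy_actions, policy_actions, sigma).mean()
    return k_bb - 2.0 * k_bp + k_pp

# Example: 4 actions from an (approximate) behavior policy and 4 from the
# current actor, in a 6-dimensional action space (shapes are placeholders).
rng = np.random.default_rng(0)
beta_actions = rng.normal(size=(4, 6))
pi_actions = rng.normal(loc=0.5, size=(4, 6))
print(sampled_mmd(beta_actions, pi_actions))
```

In BEAR-QL this estimate is driven below the threshold ε via a Lagrange multiplier λ during the policy update; the kernel and bandwidth above are stand-ins for whatever kernel the implementation actually uses.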
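
The Open Datasets row notes that training data consists of transitions logged by a behavior policy (e.g., one million transitions from a partially trained policy) rather than a published dataset. The sketch below shows one generic way such a static buffer could be collected; the Gym-style environment, the `behavior_policy` callable, and the buffer size are assumptions for illustration, not the authors' data-collection code.

```python
import numpy as np

def collect_offline_dataset(env, behavior_policy, num_transitions=1_000_000):
    """Roll out a fixed behavior policy and store (s, a, r, s', done) tuples.

    `env` is assumed to follow the classic Gym API (reset/step returning a
    4-tuple); `behavior_policy` is any callable mapping a state to an action.
    """
    buffer = {"obs": [], "actions": [], "rewards": [], "next_obs": [], "dones": []}
    obs = env.reset()
    for _ in range(num_transitions):
        action = behavior_policy(obs)
        next_obs, reward, done, _ = env.step(action)
        buffer["obs"].append(obs)
        buffer["actions"].append(action)
        buffer["rewards"].append(reward)
        buffer["next_obs"].append(next_obs)
        buffer["dones"].append(done)
        obs = env.reset() if done else next_obs
    return {k: np.asarray(v) for k, v in buffer.items()}
```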
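
The Experiment Setup row lists the quantities Algorithm 1 takes as input (dataset D, target network update rate τ, mini-batch size N, number of MMD action samples n, minimum λ), the MMD threshold ε = 0.05, and the number of actions p sampled at test time. A hedged configuration sketch gathering these in one place is below; only ε = 0.05 is stated in the paper, and every other value is a hypothetical placeholder.

```python
# Hypothetical hyperparameter block for a BEAR-QL run. Only `mmd_threshold`
# (epsilon = 0.05) is reported in the paper; the remaining values are
# illustrative placeholders, not the authors' settings.
bear_config = {
    "target_update_rate": 5e-3,   # tau: soft update rate for target networks (placeholder)
    "batch_size": 256,            # N: mini-batch size sampled from the static dataset (placeholder)
    "num_mmd_actions": 4,         # n: actions sampled per state for the MMD estimate (placeholder)
    "min_lagrange": 0.0,          # minimum value of the Lagrange multiplier lambda (placeholder)
    "mmd_threshold": 0.05,        # epsilon: MMD constraint threshold stated in the paper
    "num_test_actions": 10,       # p: actions sampled from pi_phi(s) at evaluation time (placeholder)
}
```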