Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction
Authors: Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, Sergey Levine
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that BEAR is able to learn robustly from different off-policy distributions, including random and suboptimal demonstrations, on a range of continuous control tasks. Through systematic experiments, we show the effectiveness of our method on continuous-control MuJoCo tasks with a variety of off-policy datasets generated by random, suboptimal, or optimal policies. In our experiments, we study how BEAR performs when learning from static off-policy data on a variety of continuous control benchmark tasks. |
| Researcher Affiliation | Collaboration | Aviral Kumar (UC Berkeley, aviralk@berkeley.edu); Justin Fu (UC Berkeley, justinjfu@eecs.berkeley.edu); George Tucker (Google Brain, gjt@google.com); Sergey Levine (UC Berkeley and Google Brain, svlevine@eecs.berkeley.edu) |
| Pseudocode | Yes | Algorithm 1 BEAR Q-Learning (BEAR-QL); see the MMD sketch after this table. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the described methodology or a direct link to a code repository. |
| Open Datasets | No | The paper states using data from 'continuous-control MuJoCo tasks' and mentions collecting 'one million transitions from a partially trained policy'. While MuJoCo is a known environment, the specific datasets generated or used for training are not provided with concrete access information (e.g., a link, DOI, or formal citation to a public dataset repository). |
| Dataset Splits | No | The paper does not specify exact percentages, sample counts, or refer to predefined citations for training, validation, or test dataset splits. It mentions 'evaluation' but not a formal split of the static datasets used for training. |
| Hardware Specification | No | The acknowledgements section states 'We thank Google, NVIDIA, and Amazon for providing computational resources.' However, this is a general statement and does not include specific hardware details such as GPU/CPU models, memory specifications, or cloud instance types used for the experiments. |
| Software Dependencies | No | The paper mentions that the algorithm is built on 'TD3 [13] or SAC [18]' and refers to 'MuJoCo' tasks, but it does not specify software library names with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8') that would be necessary for reproducibility. |
| Experiment Setup | Yes | The paper explicitly lists several experimental parameters in Algorithm 1 and the surrounding text: 'input: Dataset D, target network update rate τ, mini-batch size N, sampled actions for MMD n, minimum λ'. It also states 'We choose a threshold of ε = 0.05 in our experiments.' and notes that 'At test time, we sample p actions from πφ(s)'; see the action-selection sketch after this table. |
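
The 'Pseudocode' row refers to Algorithm 1 (BEAR-QL), whose policy update constrains the sampled maximum mean discrepancy (MMD) between actions drawn from the dataset and actions drawn from the learned policy πφ. The following is a minimal NumPy sketch of that MMD estimate, not the authors' released code; the Gaussian kernel, the bandwidth `sigma`, and all function names here are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=20.0):
    """Pairwise Gaussian kernel between two batches of actions."""
    # x: (n, d), y: (m, d); sigma is an assumed bandwidth, not taken from the paper.
    diff = x[:, None, :] - y[None, :, :]                 # shape (n, m, d)
    return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma))

def mmd_squared(behavior_actions, policy_actions, sigma=20.0):
    """Biased empirical estimate of squared MMD between two action samples."""
    k_bb = gaussian_kernel(behavior_actions, behavior_actions, sigma).mean()
    k_bp = gaussian_kernel(behavior_actions, policy_actions, sigma).mean()
    k_pp = gaussian_kernel(policy_actions, policy_actions, sigma).mean()
    return k_bb - 2.0 * k_bp + k_pp

# Example: n actions attributed to the dataset's behavior policy vs. n actions
# from the learned policy pi_phi at the same state (n is the "sampled actions
# for MMD" input of Algorithm 1).
rng = np.random.default_rng(0)
behavior_actions = rng.normal(size=(4, 6))
policy_actions = rng.normal(size=(4, 6))
penalty = mmd_squared(behavior_actions, policy_actions)
# In BEAR-QL this quantity would be kept below the reported threshold
# eps = 0.05, e.g. via the Lagrange multiplier lambda listed in Algorithm 1.
```

This is the standard biased MMD estimator, (1/n²)Σ k(x_i,x_j) − (2/nm)Σ k(x_i,y_j) + (1/m²)Σ k(y_i,y_j); the kernel choice and bandwidth would need to be checked against the paper before treating it as the authors' exact constraint.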
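
Similarly, the 'Experiment Setup' row quotes the test-time procedure of sampling p actions from πφ(s). A minimal sketch of that selection step, assuming the learned actor and critic are available as callables (`sample_policy_actions` and `q_value` are hypothetical names, and p = 10 is an assumed value), might look like:

```python
import numpy as np

def select_action(state, sample_policy_actions, q_value, p=10):
    """Pick the highest-Q action among p candidates sampled from pi_phi(.|s)."""
    # `sample_policy_actions` and `q_value` are hypothetical stand-ins for the
    # learned actor and critic; p = 10 is an assumed sample count.
    candidates = sample_policy_actions(state, num_samples=p)   # (p, action_dim)
    scores = np.array([q_value(state, a) for a in candidates])
    return candidates[int(np.argmax(scores))]
```

The point of the sketch is only to illustrate the quoted sentence: the greedy evaluation step ranks a small set of actions proposed by the learned policy itself rather than maximizing Q over the full action space.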