Learning Pessimism for Reinforcement Learning
Authors: Edoardo Cetin, Oya Celiktutan
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the effectiveness of GPL, we integrate it with two popular off-policy RL algorithms. ... We show that GPL significantly improves the performance and robustness of off-policy RL, concretely surpassing prior algorithms and setting new state-of-the-art results. In our evaluation, we repeat each experiment with five random seeds and record both mean and standard deviation over the episodic returns. Moreover, we validate statistical significance using tools from Rliable (Agarwal et al. 2021). In the extended version (Cetin and Celiktutan 2021), we report all details of our experimental settings and utilized hyper-parameters. We also provide comprehensive extended results analyzing the impact of all relevant design choices, testing several alternative implementations, and reporting all training times. (See the Rliable aggregation sketch after this table.) |
| Researcher Affiliation | Academia | Edoardo Cetin1, Oya Celiktutan1 1 King's College London edoardo.cetin@kcl.ac.uk, oya.celiktutan@kcl.ac.uk |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We share our code to facilitate future extensions. |
| Open Datasets | Yes | On challenging MuJoCo tasks from OpenAI Gym (Todorov, Erez, and Tassa 2012; Brockman et al. 2016), GPL-SAC outperforms both model-based (Janner et al. 2019) and model-free (Chen et al. 2021) state-of-the-art algorithms, while being more computationally efficient. Additionally, on pixel-based environments from the DeepMind Control Suite (Tassa et al. 2018), GPL-DrQ provides significant performance improvements from the recent state-of-the-art DrQv2 algorithm. |
| Dataset Splits | No | The paper describes evaluation procedures ('We collect the returns over five evaluation episodes every 1000 environment steps', 'For each run, we average the returns from 100 evaluation episodes') and repetitions with random seeds, but it does not specify explicit training, validation, or test dataset splits in terms of data samples or percentages. |
| Hardware Specification | No | The paper mentions evaluating the algorithm 'under the same hardware' but does not provide specific details about the hardware used (e.g., GPU model, CPU model, memory). |
| Software Dependencies | No | The paper mentions the use of popular RL algorithms (SAC, DrQ) and environments (OpenAI Gym, DeepMind Control Suite), along with a tool for statistical significance (Rliable), but it does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | Specifically, we only substitute SAC's clipped double Q-learning with our uncertainty regularizer, initialized with β = 0.5. In line with the other considered state-of-the-art baselines (Chen et al. 2021; Janner et al. 2019), we use an increased ensemble size and update-to-data (UTD) ratio for the critic. (See the pessimistic-target sketch after this table.) |
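
The evaluation row quotes a protocol of five random seeds with mean and standard deviation recorded over episodic returns, plus statistical validation with Rliable (Agarwal et al. 2021). Below is a minimal sketch of that kind of aggregation with the `rliable` library, assuming placeholder return arrays shaped `(seeds, tasks)`; the algorithm names and values are illustrative, not data from the paper.

```python
# Hedged sketch: per-seed aggregation (mean and standard deviation over five
# seeds) plus bootstrap interval estimates with Rliable, as the evaluation
# protocol quoted in the table describes. All values are placeholders.
import numpy as np
from rliable import library as rly
from rliable import metrics

# Assumed layout: scores[algorithm] is a (num_seeds, num_tasks) array of
# final episodic returns; five seeds, as in the quoted protocol.
scores = {
    "GPL-SAC": np.random.rand(5, 6),  # placeholder returns
    "SAC": np.random.rand(5, 6),      # placeholder returns
}

# Mean and standard deviation of returns over seeds, per task.
for name, runs in scores.items():
    print(name, "mean:", runs.mean(axis=0), "std:", runs.std(axis=0))

# Interquartile mean (IQM) with bootstrap confidence intervals, the aggregate
# Rliable recommends when only a handful of runs per task are available.
iqm = lambda s: np.array([metrics.aggregate_iqm(s)])
point_estimates, interval_estimates = rly.get_interval_estimates(
    scores, iqm, reps=2000)
print(point_estimates)
print(interval_estimates)
```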
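
The setup row describes replacing SAC's clipped double Q-learning with an uncertainty regularizer initialized at β = 0.5, alongside a larger critic ensemble and UTD ratio. The sketch below shows one common reading of such a pessimistic ensemble target (ensemble mean minus β times the ensemble's disagreement); it is an illustrative assumption, not the paper's exact GPL objective, and the names `pessimistic_target`, `q_ensemble`, and the ensemble and batch sizes are hypothetical.

```python
# Hedged sketch of an uncertainty-penalized critic target: ensemble mean minus
# a pessimism coefficient beta times the ensemble's standard deviation.
# This is an assumed illustration of the quoted setup, not the paper's exact
# GPL formulation; beta = 0.5 matches only the initialization mentioned above.
import torch


def pessimistic_target(q_ensemble: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Penalize the ensemble mean by beta times the ensemble disagreement.

    q_ensemble: (ensemble_size, batch) next-state Q-value predictions.
    beta: scalar pessimism coefficient (learnable in GPL's framing).
    """
    mean_q = q_ensemble.mean(dim=0)
    std_q = q_ensemble.std(dim=0)
    return mean_q - beta * std_q


# Usage with the initialization quoted in the setup row (beta = 0.5).
beta = torch.tensor(0.5, requires_grad=True)  # learnable coefficient (assumed)
q_ensemble = torch.randn(10, 256)             # assumed ensemble size 10, batch 256
target_q = pessimistic_target(q_ensemble, beta)
print(target_q.shape)  # torch.Size([256])
```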