Tactical Optimism and Pessimism for Deep Reinforcement Learning
Authors: Ted Moskovitz, Jack Parker-Holder, Aldo Pacchiano, Michael Arbel, Michael I. Jordan
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show in a series of continuous control tasks that TOP outperforms existing methods which rely on a fixed degree of optimism, setting a new state of the art in challenging pixel-based environments. Our experiments demonstrate that these insights, which require only simple changes to popular algorithms, lead to state-of-the-art results on both state- and pixel-based control. |
| Researcher Affiliation | Collaboration | Ted Moskovitz (Gatsby Unit, UCL) ted@gatsby.ucl.ac.uk; Jack Parker-Holder (University of Oxford) jackph@robots.ox.ac.uk; Aldo Pacchiano (Microsoft Research) apacchiano@microsoft.com; Michael Arbel (Université Grenoble Alpes, Inria, CNRS) michael.n.arbel@gmail.com; Michael I. Jordan (University of California, Berkeley) jordan@cs.berkeley.edu |
| Pseudocode | Yes | Algorithm 1: TOP-TD3 |
| Open Source Code | Yes | Our code is available at https://github.com/tedmoskovitz/TOP. |
| Open Datasets | Yes | We augmented TD3 [20] with TOP (TOP-TD3) and evaluated its performance on seven state-based continuous-control tasks from the MuJoCo framework [54] via OpenAI Gym [12]. We introduce TOP-RAD, a new algorithm that dynamically switches between optimism and pessimism while using SAC with data augmentation (as in [31]). We evaluate TOP-RAD on both the 100k and 500k benchmarks on six tasks from the DeepMind (DM) Control Suite [52]. |
| Dataset Splits | No | The paper states the training duration (e.g., "trained all algorithms for one million time steps") and the number of random seeds, and mentions "Hyperparameters were kept constant across all environments. Further details can be found in Appendix B." However, it does not specify explicit training/validation/test dataset splits (e.g., percentages or sample counts); such splits are uncommon in reinforcement learning, where data is generated through environment interaction rather than drawn from a static dataset. |
| Hardware Specification | No | The provided text mentions in the ethics checklist that details on compute resources are in Appendix B ("Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix B."). However, Appendix B is not included in the provided paper excerpt, so specific hardware details are not available in the main text. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or programming languages used in the experiments. It refers to frameworks like "TD3", "SAC", "OAC", but not their underlying software versions. |
| Experiment Setup | Yes | TD3, SAC, and OAC use their default hyperparameter settings, with TOP and its ablations using the same settings as TD3. For tactical optimism, we set the possible β values to be {−1, 0}, such that β = −1 corresponds to a pessimistic lower bound, and β = 0 corresponds to simply using the average of the critic. It's important to note that β = 0 is an optimistic setting, as the mean is biased towards optimism. We also tested the effects of different settings for β (Appendix, Figure 6). Hyperparameters were kept constant across all environments. Further details can be found in Appendix B. A minimal illustrative sketch of this β mechanism is given below the table. |
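
The sketch below is a minimal illustration, not the authors' released implementation, of how a tactical choice between β = −1 (pessimistic lower bound) and β = 0 (mean of the critics) could be wired up: an EXP3-style two-armed bandit samples the degree of optimism, and the critic target is shifted by β times an uncertainty estimate. All names (`BETAS`, `select_beta`, `update_bandit`, `compute_target`) and the learning rate `ETA` are assumptions for illustration; TOP-TD3 as described in the paper uses distributional critics and the change in episode return as bandit feedback.

```python
# Illustrative sketch of tactical optimism/pessimism (assumed names and values).
import numpy as np

BETAS = [-1.0, 0.0]                     # -1: pessimistic lower bound, 0: mean of the critics
bandit_weights = np.zeros(len(BETAS))   # EXP3-style log-weights, one per arm
ETA = 0.1                               # bandit learning rate (assumed value)


def _arm_probs():
    """Softmax over the bandit log-weights."""
    w = np.exp(bandit_weights - bandit_weights.max())
    return w / w.sum()


def select_beta(rng=np.random):
    """Sample a degree of optimism for the next episode."""
    probs = _arm_probs()
    idx = rng.choice(len(BETAS), p=probs)
    return idx, BETAS[idx]


def update_bandit(idx, feedback):
    """Reinforce the chosen arm with feedback, e.g. the change in episode return."""
    probs = _arm_probs()
    # Importance-weighted reward estimate, as in EXP3.
    bandit_weights[idx] += ETA * feedback / probs[idx]


def compute_target(q1, q2, beta):
    """Belief-shifted value target from twin critic estimates for the same (s', a')."""
    mean = 0.5 * (q1 + q2)
    std = np.abs(q1 - q2) / 2.0   # crude uncertainty proxy from the two critics
    return mean + beta * std      # beta = -1 recovers a pessimistic lower bound
```

A typical usage pattern under these assumptions: call `select_beta()` at the start of an episode, build critic targets with `compute_target(q1, q2, beta)` during updates, and after the episode call `update_bandit(idx, new_return - previous_return)` so the bandit learns whether optimism or pessimism is currently paying off.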