Tactical Optimism and Pessimism for Deep Reinforcement Learning
Authors: Ted Moskovitz, Jack Parker-Holder, Aldo Pacchiano, Michael Arbel, Michael I. Jordan
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show in a series of continuous control tasks that TOP outperforms existing methods which rely on a fixed degree of optimism, setting a new state of the art in challenging pixel-based environments. Our experiments demonstrate that these insights, which require only simple changes to popular algorithms, lead to state-of-the-art results on both state- and pixel-based control. |
| Researcher Affiliation | Collaboration | Ted Moskovitz (Gatsby Unit, UCL) ted@gatsby.ucl.ac.uk; Jack Parker-Holder (University of Oxford) jackph@robots.ox.ac.uk; Aldo Pacchiano (Microsoft Research) apacchiano@microsoft.com; Michael Arbel (Université Grenoble Alpes, Inria, CNRS) michael.n.arbel@gmail.com; Michael I. Jordan (University of California, Berkeley) jordan@cs.berkeley.edu |
| Pseudocode | Yes | Algorithm 1: TOP-TD3 |
| Open Source Code | Yes | Our code is available at https://github.com/tedmoskovitz/TOP. |
| Open Datasets | Yes | We augmented TD3 [20] with TOP (TOP-TD3) and evaluated its performance on seven state-based continuous-control tasks from the MuJoCo framework [54] via OpenAI Gym [12]. We introduce TOP-RAD, a new algorithm that dynamically switches between optimism and pessimism while using SAC with data augmentation (as in [31]). We evaluate TOP-RAD on both the 100k and 500k benchmarks on six tasks from the DeepMind (DM) Control Suite [52]. |
| Dataset Splits | No | The paper states the training duration (e.g., "trained all algorithms for one million time steps") and the number of random seeds, and mentions "Hyperparameters were kept constant across all environments. Further details can be found in Appendix B." However, it does not specify explicit training/validation/test dataset splits (e.g., percentages or sample counts); such splits are uncommon in reinforcement learning, where data is generated through environment interaction rather than drawn from a static dataset. |
| Hardware Specification | No | The provided text mentions in the ethics checklist that details on compute resources are in Appendix B ("Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix B."). However, Appendix B is not included in the provided paper excerpt, so specific hardware details are not available in the main text. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies, libraries, or programming languages used in the experiments. It refers to frameworks like "TD3", "SAC", "OAC", but not their underlying software versions. |
| Experiment Setup | Yes | TD3, SAC, and OAC use their default hyperparameter settings, with TOP and its ablations using the same settings as TD3. For tactical optimism, we set the possible β values to be {−1, 0}, such that β = −1 corresponds to a pessimistic lower bound, and β = 0 corresponds to simply using the average of the critic. It's important to note that β = 0 is an optimistic setting, as the mean is biased towards optimism. We also tested the effects of different settings for β (Appendix, Figure 6). Hyperparameters were kept constant across all environments. Further details can be found in Appendix B. A minimal illustrative sketch of this β mechanism is given below the table. |
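
The sketch below is a minimal illustration, not the authors' released implementation, of how a tactical choice between β = −1 (pessimistic lower bound) and β = 0 (mean of the critics) could be wired up: an EXP3-style two-armed bandit samples the degree of optimism, and the critic target is shifted by β times an uncertainty estimate. All names (`BETAS`, `select_beta`, `update_bandit`, `compute_target`) and the learning rate `ETA` are assumptions for illustration; TOP-TD3 as described in the paper uses distributional critics and the change in episode return as bandit feedback.

```python
# Illustrative sketch of tactical optimism/pessimism (assumed names and values).
import numpy as np

BETAS = [-1.0, 0.0]                     # -1: pessimistic lower bound, 0: mean of the critics
bandit_weights = np.zeros(len(BETAS))   # EXP3-style log-weights, one per arm
ETA = 0.1                               # bandit learning rate (assumed value)


def _arm_probs():
    """Softmax over the bandit log-weights."""
    w = np.exp(bandit_weights - bandit_weights.max())
    return w / w.sum()


def select_beta(rng=np.random):
    """Sample a degree of optimism for the next episode."""
    probs = _arm_probs()
    idx = rng.choice(len(BETAS), p=probs)
    return idx, BETAS[idx]


def update_bandit(idx, feedback):
    """Reinforce the chosen arm with feedback, e.g. the change in episode return."""
    probs = _arm_probs()
    # Importance-weighted reward estimate, as in EXP3.
    bandit_weights[idx] += ETA * feedback / probs[idx]


def compute_target(q1, q2, beta):
    """Belief-shifted value target from twin critic estimates for the same (s', a')."""
    mean = 0.5 * (q1 + q2)
    std = np.abs(q1 - q2) / 2.0   # crude uncertainty proxy from the two critics
    return mean + beta * std      # beta = -1 recovers a pessimistic lower bound
```

A typical usage pattern under these assumptions: call `select_beta()` at the start of an episode, build critic targets with `compute_target(q1, q2, beta)` during updates, and after the episode call `update_bandit(idx, new_return - previous_return)` so the bandit learns whether optimism or pessimism is currently paying off.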