Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation

Authors: Vincent Mai, Kaustubh Mani, Liam Paull

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show significant improvement in terms of sample efficiency on discrete and continuous control tasks. We propose a method whereby two complementary uncertainty estimation methods account for both the Q-value and the environment stochasticity to better mitigate the negative impacts of noisy supervision. Our experiments show that IV-RL can lead to significant improvements in sample efficiency when applied to Deep Q-Networks (DQN) (Mnih et al., 2013) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018). (A minimal sketch of the inverse-variance weighting described here appears after the table.)
Researcher Affiliation | Academia | Vincent Mai, Kaustubh Mani and Liam Paull, Robotics and Embodied AI Lab, Mila Quebec Institute of Artificial Intelligence, Université de Montréal, Quebec, Canada, {vincent.mai,kaustubh.mani,liam.paull}@umontreal.ca
Pseudocode | Yes | Algorithm 1: Bootstrap DQN Training; Algorithm 2: IV-DQN Training; Algorithm 3: IV-SAC Training; Algorithm 4: Variance Estimation for IV-SAC
Open Source Code | Yes | The code for IV-RL is available at https://github.com/montrealrobotics/iv_rl.
Open Datasets | Yes | We tested IV-DQN on discrete control environments selected to present different characteristics. From OpenAI Gym (Brockman et al., 2016), Lunar Lander is a sparse-reward control environment and Mountain Car is a sparse-reward exploration environment. From BSuite (Osband et al., 2020), Cartpole-Noise is a dense-reward control environment. The environments and implementations we used (OpenAI Gym, BSuite, MBBL) are all publicly accessible, although a MuJoCo license is needed to run some of them. (A sketch showing how these environments can be instantiated follows the table.)
Dataset Splits | No | The paper mentions hyperparameters were tuned and selected based on runs, but it does not provide specific dataset split percentages, sample counts, or methods (e.g., k-fold cross-validation) for validation.
Hardware Specification | Yes | A cumulative of 12367 days, or 296808 hours, of computation was mainly performed on the hardware of type RTX 8000 (TDP of 260W). (A short arithmetic check of these figures follows the table.)
Software Dependencies | No | The paper mentions software components like "rlkit" and the Adam (Kingma & Ba, 2015) optimizer but does not provide specific version numbers for these or other relevant libraries (e.g., Python, PyTorch, CUDA).
Experiment Setup | Yes | For every result presented in this paper, the hyperparameters for each algorithm were tuned using grid search. Table 3 describes the type and sets of parameters that were optimized for the relevant algorithms. The hyperparameters used for each curve can be found in the configuration file of the code submitted as additional material. (A minimal grid-search sketch follows the table.)
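
The Research Type row above summarizes the core idea of IV-RL: weight each training target by the inverse of its estimated variance so that noisy targets contribute less to the loss. Below is a minimal sketch of that weighting, assuming the per-sample target variances have already been estimated (e.g., from ensemble disagreement plus a learned variance head); the function name and the stabilising offset `eps` are illustrative choices, not taken from the authors' implementation.

```python
import torch

def inverse_variance_weighted_td_loss(q_pred, td_target, target_var, eps=1e-2):
    """Squared TD error weighted by the inverse of the estimated target variance.

    q_pred     : (B,) Q-values from the online network
    td_target  : (B,) bootstrapped targets, e.g. r + gamma * max_a' Q_target(s', a')
    target_var : (B,) estimated variance of each target, combining Q-value
                 uncertainty (ensemble disagreement) and environment
                 stochasticity (a learned variance head)
    eps        : stabilising offset so near-zero variances do not dominate
                 (hypothetical value, not from the paper)
    """
    weights = 1.0 / (target_var + eps)            # inverse-variance weights
    weights = weights / weights.sum()             # normalise over the mini-batch
    per_sample_loss = (q_pred - td_target) ** 2   # squared TD error per transition
    return (weights * per_sample_loss).sum()      # weighted mini-batch loss
```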
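
The Open Datasets row names the discrete-control environments but not how they are loaded. The sketch below shows one way to instantiate them, assuming the standard Gym IDs LunarLander-v2 and MountainCar-v0 and BSuite's cartpole_noise experiment; the exact IDs, seeds, and wrappers used by the authors are specified in their configuration files, not here.

```python
import gym
import bsuite
from bsuite.utils import gym_wrapper

# OpenAI Gym environments named in the row (IDs assumed, not quoted from the paper).
lunar_lander = gym.make("LunarLander-v2")   # sparse-reward control
mountain_car = gym.make("MountainCar-v0")   # sparse-reward exploration

# BSuite's noisy cartpole experiment, wrapped to expose a Gym-style interface.
cartpole_noise = gym_wrapper.GymFromDMEnv(bsuite.load_from_id("cartpole_noise/0"))
```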
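
The Hardware Specification row quotes a cumulative compute figure. The quick check below confirms that the day and hour figures are mutually consistent and derives an upper-bound energy estimate from the quoted 260 W TDP; the energy number is computed here for illustration and is not quoted from the paper.

```python
days = 12367
hours = days * 24                  # 296,808 hours, matching the quoted figure
energy_kwh = hours * 260 / 1000    # ~77,170 kWh if every hour ran at full TDP
print(hours, round(energy_kwh))
```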
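
The Experiment Setup row reports grid-search tuning over the sets listed in the paper's Table 3. The sketch below illustrates that procedure with a hypothetical grid; the actual hyperparameter names, values, and selection metric are those given in Table 3 and the repository's configuration files.

```python
from itertools import product

# Hypothetical grid; the real per-algorithm sets are in the paper's Table 3.
grid = {
    "lr": [1e-4, 3e-4, 1e-3],
    "ensemble_size": [5, 10],
    "batch_size": [128, 256],
}

def grid_search(train_and_evaluate):
    """Run one training job per hyperparameter combination and keep the best."""
    best_score, best_cfg = float("-inf"), None
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        score = train_and_evaluate(**cfg)   # e.g. mean evaluation return
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```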