Sample Efficient Deep Reinforcement Learning via Uncertainty Estimation
Authors: Vincent Mai, Kaustubh Mani, Liam Paull
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a method whereby two complementary uncertainty estimation methods account for both the Q-value and the environment stochasticity to better mitigate the negative impacts of noisy supervision. Our results show significant improvement in terms of sample efficiency on discrete and continuous control tasks. Our experiments show that IV-RL can lead to significant improvements in sample efficiency when applied to Deep Q-Networks (DQN) (Mnih et al., 2013) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018). A hedged sketch of this inverse-variance weighting appears after the table. |
| Researcher Affiliation | Academia | Vincent Mai, Kaustubh Mani and Liam Paull Robotics and Embodied AI Lab Mila Quebec Institute of Artificial Intelligence Université de Montréal, Quebec, Canada {vincent.mai,kaustubh.mani,liam.paull}@umontreal.ca |
| Pseudocode | Yes | Algorithm 1 (Bootstrap DQN Training), Algorithm 2 (IV-DQN Training), Algorithm 3 (IV-SAC Training), and Algorithm 4 (Variance Estimation for IV-SAC); a rough ensemble-variance sketch follows the table. |
| Open Source Code | Yes | The code for IV-RL is available at https://github.com/montrealrobotics/iv_rl. |
| Open Datasets | Yes | We tested IV-DQN on discrete control environments selected to present different characteristics. From OpenAI Gym (Brockman et al., 2016), Lunar Lander is a sparse reward control environment and Mountain Car is a sparse reward exploration environment. From BSuite (Osband et al., 2020), Cartpole-Noise is a dense reward control environment. The environments and implementations we used (OpenAI Gym, BSuite, MBBL) are all publicly accessible, although a MuJoCo license is needed to run some of them. |
| Dataset Splits | No | The paper notes that hyperparameters were tuned and selected based on training runs, but it does not provide specific dataset split percentages, sample counts, or validation methods (e.g., k-fold cross-validation). |
| Hardware Specification | Yes | A cumulative total of 12367 days, or 296808 hours, of computation was performed, mainly on RTX 8000 GPUs (TDP of 260 W). A quick arithmetic check of these figures appears below the table. |
| Software Dependencies | No | The paper mentions software components such as "rlkit" and the Adam optimizer (Kingma & Ba, 2015) but does not provide version numbers for these or other relevant libraries (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | For every result presented in this paper, the hyperparameters for each algorithm were tuned using grid search. Table 3 describes the type and sets of parameters that were optimized for the relevant algorithms. The hyperparameters used for each curve can be found in the configuration file of the code submitted as additional material. A minimal grid-search sketch is given at the end of this section. |
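
The Pseudocode row lists Algorithm 4, Variance Estimation for IV-SAC. As a rough illustration of how an ensemble can supply a per-sample variance for the Bellman target, here is a minimal sketch; the network interface, tensor shapes, and greedy-target form are assumptions for illustration, not the authors' exact procedure.

```python
import torch

def ensemble_target_stats(target_nets, next_obs, rewards, dones, gamma=0.99):
    """Mean and variance of the Bellman target across an ensemble of target
    Q-networks (illustrative sketch; interfaces and shapes are assumptions).

    target_nets: list of nn.Module, each mapping obs -> Q-values per action
    next_obs:    (batch, obs_dim) tensor
    rewards:     (batch,) tensor
    dones:       (batch,) float tensor, 1.0 where the episode ended
    """
    with torch.no_grad():
        # One greedy Bellman target per ensemble member: (n_nets, batch)
        targets = torch.stack([
            rewards + gamma * (1.0 - dones) * net(next_obs).max(dim=1).values
            for net in target_nets
        ])
    # Disagreement between members serves as the epistemic variance estimate.
    return targets.mean(dim=0), targets.var(dim=0)
```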
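Building on a variance estimate like the one above, the inverse-variance weighting described in the Research Type row can be sketched as a weighted regression loss. The `eps` offset and the normalization below are assumptions in the spirit of the paper's batch inverse-variance (BIV) weighting; the paper additionally combines ensemble disagreement with a learned noise variance for the environment stochasticity, which this sketch omits. The released repository contains the actual loss.

```python
import torch

def inverse_variance_loss(q_pred, target_mean, target_var, eps=0.5):
    """BIV-style inverse-variance weighted regression loss (sketch).

    q_pred:      (batch,) predicted Q-values
    target_mean: (batch,) ensemble-mean Bellman targets
    target_var:  (batch,) estimated variance of each target
    eps:         variance offset controlling how aggressive the weighting is
                 (an assumed hyperparameter; tuned in practice)
    """
    weights = 1.0 / (target_var + eps)
    weights = weights / weights.sum()  # normalize weights to sum to one
    # Noisier targets contribute less to the squared-error loss.
    return (weights * (q_pred - target_mean.detach()) ** 2).sum()
```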
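As a sanity check on the Hardware Specification row, the quoted figures are internally consistent; the energy line below additionally assumes every hour ran at the RTX 8000's full 260 W TDP, so it is only an upper-bound estimate, not a figure from the paper.

```python
days = 12367
hours = days * 24                  # 296808 hours, matching the quoted figure
energy_kwh = hours * 260 / 1000    # ~77170 kWh if every hour runs at full TDP
print(hours, round(energy_kwh))
```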
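Finally, the Experiment Setup row describes grid-search tuning. A minimal sketch of that procedure follows; the parameter names, value grids, and `train_and_evaluate` routine are hypothetical placeholders, not the search space from the paper's Table 3.

```python
from itertools import product

def train_and_evaluate(cfg):
    """Placeholder for a full training run returning mean evaluation return."""
    return 0.0  # stub; the real routine would train IV-DQN/IV-SAC with cfg

# Hypothetical grid; the actual search space is in the paper's Table 3 and
# in the configuration files of the released code.
grid = {
    "lr": [1e-4, 3e-4, 1e-3],
    "ensemble_size": [5, 10],
    "variance_eps": [0.1, 0.5, 1.0],
}

best_score, best_cfg = float("-inf"), None
for values in product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    score = train_and_evaluate(cfg)
    if score > best_score:
        best_score, best_cfg = score, cfg
print(best_cfg)
```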