Randomized Exploration in Reinforcement Learning with General Value Function Approximation

Authors: Haque Ishfaq, Qiwen Cui, Viet Nguyen, Alex Ayoub, Zhuoran Yang, Zhaoran Wang, Doina Precup, Lin Yang

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We complement the theory with an empirical evaluation across known difficult exploration tasks. In the experiments, we find that a small sampling time M is sufficient to achieve good performance, which suggests that the theoretical choice of M = Õ(d) is too conservative in practice. We run our experiments on River Swim (Strehl & Littman, 2008), Deep Sea (Osband et al., 2016b) and sparse Mountain Car (Brockman et al., 2016) environments as these are considered to be hard exploration problems where ε-greedy is known to have poor performance.
Researcher Affiliation | Collaboration | Haque Ishfaq* (1,2), Qiwen Cui* (3), Viet Nguyen (1,2), Alex Ayoub (4), Zhuoran Yang (5), Zhaoran Wang (6), Doina Precup (1,2,7), Lin F. Yang (8). Affiliations: 1 Mila; 2 School of Computer Science, McGill University; 3 School of Mathematical Science, Peking University; 4 Amii and Department of Computing Science, University of Alberta; 5 Department of Operations Research and Financial Engineering, Princeton University; 6 Industrial Engineering and Management Sciences, Northwestern University; 7 DeepMind, Montreal; 8 Department of Electrical and Computer Engineering, University of California, Los Angeles.
Pseudocode | Yes | Algorithm 1: F-LSVI-PHE; Algorithm 2: LSVI-PHE with linear function class (see the illustrative sketch after this table).
Open Source Code | No | The paper states: "Our experiments are based on the baseline implementations of (Lan, 2019)." and cites "Lan, Q. A pytorch reinforcement learning framework for exploring new ideas. https://github.com/qlan3/Explorer, 2019." This indicates they used existing code, but does not provide a statement or link for the open-sourcing of their own method's code.
Open Datasets | Yes | We run our experiments on River Swim (Strehl & Littman, 2008), Deep Sea (Osband et al., 2016b) and sparse Mountain Car (Brockman et al., 2016) environments as these are considered to be hard exploration problems where ε-greedy is known to have poor performance.
Dataset Splits | No | No specific details on train/validation/test splits (percentages, counts, or explicit procedures for validation) were found. The paper mentions training details but not dataset splitting for validation purposes.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or cluster specifications) used for running experiments were mentioned in the paper.
Software Dependencies | No | The paper mentions optimizing with Adam (Kingma & Ba, 2014) and using a PyTorch framework (Lan, 2019), but does not provide specific version numbers for these or other software components.
Experiment Setup | Yes | For this experiment, we swept over the exploration parameters in both LSVI-UCB (Jin et al., 2020) and LSVI-PHE and report the best performing run on a 12-state River Swim... We sweep over β for LSVI-UCB and σ² for LSVI-PHE, where M is chosen according to our theory (Theorem 4.7)... The size of the replay buffer was 10,000. The weights of neural networks were optimized by Adam (Kingma & Ba, 2014) with gradient clip 5. We used a batch size of 32. The target network was updated every 100 steps. The best learning rate was chosen from [10^-3, 5 × 10^-4, 10^-4]. For LSVI-PHE, we set M = 8 and we chose the best value of σ from [10^-4, 10^-3, 10^-2]. (The reported configuration is collected into a config sketch after this table.)
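For context on the pseudocode row, LSVI-PHE with a linear function class solves, at each step, M least-squares regressions whose targets and regularizer are perturbed with Gaussian noise, and then acts greedily on the maximum of the M resulting Q-estimates. Below is a minimal NumPy sketch of that idea, assuming a finite-horizon MDP with a feature map phi(s, a) in R^d and a finite action set; the function name lsvi_phe_linear, the data layout, and the clipping at H are illustrative choices, not the authors' released implementation.

```python
import numpy as np

def lsvi_phe_linear(phi, actions, data, H, M, sigma, lam=1.0, rng=None):
    """Illustrative sketch of LSVI-PHE with a linear function class.

    phi(s, a) -> feature vector in R^d; actions: finite action set;
    data[h]: list of (s, a, r, s_next) transitions observed at step h.
    Returns weight matrices W[h] of shape (M, d); the agent acts greedily on
    Q_h(s, a) = min(max_m W[h][m] @ phi(s, a), H).
    """
    rng = np.random.default_rng() if rng is None else rng
    s0, a0 = data[0][0][0], data[0][0][1]
    d = np.asarray(phi(s0, a0)).shape[0]
    W = [np.zeros((M, d)) for _ in range(H + 1)]      # W[H] stays zero (terminal)

    for h in range(H - 1, -1, -1):                    # backward value iteration
        Phi = np.array([phi(s, a) for (s, a, r, s2) in data[h]])        # (n, d)
        # Optimistic regression targets: reward plus the clipped maximum of the
        # next-step estimates over the M samples and over actions.
        y = np.array([
            r + min(H, max(np.max(W[h + 1] @ np.asarray(phi(s2, a2)))
                           for a2 in actions))
            for (s, a, r, s2) in data[h]
        ])
        Lam = Phi.T @ Phi + lam * np.eye(d)           # regularized Gram matrix
        Lam_inv = np.linalg.inv(Lam)
        for m in range(M):
            eps = rng.normal(0.0, sigma, size=len(y))  # perturb regression targets
            xi = rng.normal(0.0, sigma, size=d)        # perturb the regularizer
            W[h][m] = Lam_inv @ (Phi.T @ (y + eps) + xi)
    return W
```

The maximum over the M perturbed solutions plays the role of the optimism bonus that LSVI-UCB obtains from its explicit confidence width, which is why only σ (and M) need to be swept in the experiments.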
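The reported deep-RL training setup can be collected into a small configuration dictionary. The key names below are illustrative (they are not taken from the Explorer codebase); the values follow the quoted description, with the learning rate and σ chosen per environment from the listed sweeps.

```python
# Hyperparameters reported for the deep-RL experiments (key names are
# illustrative; values follow the paper's description quoted above).
lsvi_phe_config = {
    "optimizer": "Adam",
    "gradient_clip": 5,
    "batch_size": 32,
    "replay_buffer_size": 10_000,
    "target_update_interval": 100,               # steps between target-network syncs
    "learning_rate_sweep": [1e-3, 5e-4, 1e-4],   # best value chosen per task
    "num_perturbed_samples_M": 8,
    "sigma_sweep": [1e-4, 1e-3, 1e-2],           # best noise scale chosen per task
}
```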