Exploration by Distributional Reinforcement Learning
Authors: Yunhao Tang, Shipra Agrawal
IJCAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 6 Experiments In all experiments, we implement Algorithm 2 and refer to it as GE (Gauss exploration) in the following. We aim to answer the following questions: In environments that require consistent exploration, does GE achieve more efficient exploration than conventional naive exploration strategies like ϵ-greedy in DQN and direct parameter randomization in Noisy Net? When a deterministic critic in an off-policy algorithm like DDPG [Lillicrap et al., 2016] is replaced by a randomized critic, does the algorithm achieve better exploration? 6.1 Testing Environment Chain MDP. The chain MDP [Osband et al., 2016] (Figure 1) serves as a benchmark to test whether an algorithm entails consistent exploration. ... 6.2 Experiment Results In Figure 2 (a)-(c) we compare DQN vs Noisy Net vs GE in Chain MDP environments with different numbers of states N. ... In Figure 3 (a)-(c) we present the comparison of the three algorithms in sparse reward environments. ... Figure 4: Comparison of original Q function (DQN) vs Noisy Net vs GE as baselines for DDPG on sparse reward environments (a) Inverted Pendulum (b) Inverted Double Pendulum (c) Sparse Inverted Pendulum (d) Sparse Inverted Double Pendulum. |
| Researcher Affiliation | Academia | Yunhao Tang, Shipra Agrawal (Columbia University IEOR); yt2541@columbia.edu, sa3305@columbia.edu |
| Pseudocode | Yes | Algorithm 1 Exploration by Distributional RL: Generic ... Algorithm 2 Exploration by Distributional RL: Gaussian |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | 6.1 Testing Environment Chain MDP. The chain MDP [Osband et al., 2016] (Figure 1) serves as a benchmark to test whether an algorithm entails consistent exploration. ... Sparse Reward Environments. All RL agents require reward signals to learn good policies. In sparse reward environments, agents with naive exploration strategies randomly stumble around for most of the time and require many more samples to learn good policies than agents that explore consistently. We modify the reward signals in OpenAI Gym [Brockman et al., 2016] and MuJoCo benchmark tasks [Todorov et al., 2012] to be sparse as follows. |
| Dataset Splits | No | The paper describes the environments used for testing and plots performance over iterations/episodes. However, it does not explicitly specify dataset splits for training, validation, or testing, or how data was partitioned for those purposes. It refers to 'training' and 'evaluation' but not explicit data splits. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions names of algorithms and frameworks like 'DQN', 'Noisy Net', 'DDPG', 'OpenAI Gym', and 'MuJoCo', but it does not specify version numbers for any of these software dependencies or libraries. |
| Experiment Setup | Yes | Hyper-parameter. In all experiments, we set qφ(θ) to be factorized Gaussian. In GE, as in Noisy Net [Fortunato et al., 2017], each parameter θ in a fully connected layer (weight and bias) has two distributional parameters: the mean µθ and standard error σθ. Set σθ = log(1 + exp(ρθ)) and let ρθ be the actual hyper-parameter to tune. If ρθ is large, the distribution over θ is widespread and the agent can execute a larger range of policies before committing to a solution. For both Noisy Net and GE, we require all ρθ to be the same, denoted as ρ, and set the range ρ ∈ [1, 10] for grid search. A second hyper-parameter for GE is the Gauss parameter σ² to determine the balance between expected Bellman error and entropy in (7). In our experiments, we tune σ on the log scale, log10 σ ∈ [1, 8]. ... In DQN, we set the exploration constant to be ϵ = 0.1. In all experiments, we tune the learning rate α ∈ {10⁻³, 10⁻⁴, 10⁻⁵}. |
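
The Experiment Setup row describes the factorized-Gaussian parameterization shared by GE and Noisy Net: each weight and bias θ carries a mean µθ and a standard deviation σθ = log(1 + exp(ρθ)), and sampling θ randomizes the Q-function. The sketch below is a minimal PyTorch illustration of that parameterization only; it is not the authors' code (none is released, per the Open Source Code row), and the initialization values are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianLinear(nn.Module):
    """Linear layer whose weights and biases are drawn from a factorized Gaussian."""

    def __init__(self, in_features, out_features, rho_init=-3.0):
        super().__init__()
        # Distributional parameters: a mean and a pre-softplus rho per weight/bias.
        self.weight_mu = nn.Parameter(0.1 * torch.randn(out_features, in_features))
        self.weight_rho = nn.Parameter(torch.full((out_features, in_features), rho_init))
        self.bias_mu = nn.Parameter(torch.zeros(out_features))
        self.bias_rho = nn.Parameter(torch.full((out_features,), rho_init))

    def forward(self, x):
        # sigma = log(1 + exp(rho)) keeps the standard deviation positive;
        # rho is the quantity that is grid-searched.
        weight_sigma = F.softplus(self.weight_rho)
        bias_sigma = F.softplus(self.bias_rho)
        # Reparameterized sample theta ~ N(mu, sigma^2); resampling on each
        # forward pass is what injects the exploration noise.
        weight = self.weight_mu + weight_sigma * torch.randn_like(weight_sigma)
        bias = self.bias_mu + bias_sigma * torch.randn_like(bias_sigma)
        return F.linear(x, weight, bias)
```

Sampling a new θ on each forward pass (or once per episode) is what replaces ϵ-greedy dithering with parameter-space randomization; GE and Noisy Net differ mainly in the objective used to train µθ and ρθ (expected Bellman error balanced against entropy for GE), not in this layer structure.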
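
The chain MDP cited in the Research Type and Open Datasets rows [Osband et al., 2016] is small enough to sketch. The version below is an assumed reconstruction (the start state, reward values, and episode length are not given in the excerpts above): a tiny reward sits at the left end of the chain and the only meaningful reward at the right end, so a dithering strategy such as ϵ-greedy tends to need many more episodes to reach it as N grows, which is what makes the chain a test of consistent exploration.

```python
import random


class ChainMDP:
    """N-state chain: small reward at the left end, large reward only at the right end."""

    def __init__(self, n_states=10):
        self.n = n_states
        self.horizon = n_states + 9   # assumed episode length
        self.reset()

    def reset(self):
        self.state = 1                # assumed start near the left end
        self.t = 0
        return self.state

    def step(self, action):
        # action 0 moves left, action 1 moves right (assumed convention)
        self.state = max(0, self.state - 1) if action == 0 else min(self.n - 1, self.state + 1)
        self.t += 1
        if self.state == 0:
            reward = 0.001            # small distractor reward
        elif self.state == self.n - 1:
            reward = 1.0              # the reward that requires consistent exploration
        else:
            reward = 0.0
        done = self.t >= self.horizon
        return self.state, reward, done


# A purely random walker rarely reaches the rightmost state when N is large:
env = ChainMDP(n_states=20)
state, total, done = env.reset(), 0.0, False
while not done:
    state, reward, done = env.step(random.choice([0, 1]))
    total += reward
```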